<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<id>https://doris.apache.org/zh-CN/blog</id>
<title>Apache Doris Blog</title>
<updated>2024-06-24T00:00:00.000Z</updated>
<generator>https://github.com/jpmonette/feed</generator>
<link rel="alternate" href="https://doris.apache.org/zh-CN/blog"/>
<subtitle>Apache Doris Blog</subtitle>
<icon>https://doris.apache.org/zh-CN/images/favicon.ico</icon>
<entry>
<title type="html"><![CDATA[Why Apache Doris is the best open source alternative to Rockset]]></title>
<id>https://doris.apache.org/zh-CN/blog/apache-doris-vs-rockset</id>
<link href="https://doris.apache.org/zh-CN/blog/apache-doris-vs-rockset"/>
<updated>2024-06-24T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Among of all the claim-to-be alternatives to Rockset, Apache Doris is one of the few that cover all the key features of Rockset.]]></summary>
<content type="html"><![CDATA[<p>OpenAI dropped a bomb on the data world by announcing <a href="https://openai.com/index/openai-acquires-rockset/" target="_blank" rel="noopener noreferrer">the acquisition of Rockset</a>, a cloud-based, fully managed analytical database. Among all the congratulating voices, one question is raised: <strong>why <a href="https://rockset.com" target="_blank" rel="noopener noreferrer">Rockset</a></strong>?</p><p><img loading="lazy" alt="OpenAI acquisition Rockset" src="https://cdnd.selectdb.com/zh-CN/assets/images/openai-twitter-rockset-629193903b8235732c88086caa39cc0c.png" width="1212" height="482" class="img_ev3q"></p><p>Founded in 2016 by Venkat Venkataramani, former Engineering Director at Meta, Rockset focuses on real-time search and data analytics. Compared to other DBMS, Rockset stands out by its:</p><ul><li><p><strong>Real-time data updates</strong>: Rockset ensures data freshness for users by its capabilities in fetching and delivering the latest data. It supports real-time updates at the granularity of data fields, which can be performed within milliseconds.</p></li><li><p><strong>Converged index</strong>: It reaps the benefits of inverted index, columnar storage, and row-oriented storage, and provides efficient and flexible data querying services.</p></li><li><p><strong>Native support for semi-structured data</strong>: Rockset is well-suited to the growing demand for semi-structured data processing, hash joins, and nested loop joins.</p></li><li><p><strong>SQL and JOIN compatibility</strong>: The Search Index of Rockset is optimized for various join queries.</p></li></ul><p>The news also gaves all Rockset users a ticking time bomb: they have to find an appropriate alternative to Rockset for their own use case within three months. This, of course, arises as an opportunity for other analytical databases on the market. However, of all the claim-to-be alternatives, only a few of them cover all the above-mentioned key features of Rockset. Among them, Apache Doris is worth looking into.</p><p>As an open-source real-time data warehouse, Apache Doris is trusted by over 4000 enterprise users worldwide with powerful functionalities including:</p><ul><li><p><strong>Real-time data updates</strong>: Apache Doris supports not only <a href="https://doris.apache.org/docs/table-design/data-model/unique" target="_blank" rel="noopener noreferrer">real-time updates</a> and deletion, but also real-time partial column updates, making it particularly useful in cases involving frequent data updates.</p></li><li><p><strong>Row/column hybrid storage</strong>: Apache Doris is a column-oriented data warehouse that achieves world-leading OLAP performance on <a href="https://benchmark.clickhouse.com/" target="_blank" rel="noopener noreferrer">ClickBench</a>. Additionally, it supports row-oriented storage to serve <a href="https://doris.apache.org/docs/query/high-concurrent-point-query/" target="_blank" rel="noopener noreferrer">high-concurrency point query scenarios</a>, which allows it to respond to almost a million query requests within milliseconds. </p></li><li><p><strong><a href="https://doris.apache.org/docs/table-design/index/inverted-index" target="_blank" rel="noopener noreferrer">Inverted index</a> and full-text searches</strong>: Apache Doris provides high efficiency and flexibility in keyword searching. 
<p>As a Top-Level Project of the Apache Software Foundation, Apache Doris is supported by a robust and fast-growing community. It has accumulated over 11.8K GitHub stars and 636 contributors so far.</p><p>If you are seeking a fully managed solution instead of an open source product, you might want to look into <a href="https://www.velodb.io" target="_blank" rel="noopener noreferrer">VeloDB</a>. As the commercial service provider of Apache Doris, VeloDB offers a wider range of products tailored to the needs of enterprises. <a href="https://www.velodb.io/cloud" target="_blank" rel="noopener noreferrer">VeloDB Cloud</a> decouples compute and storage on the basis of Apache Doris, thus achieving higher elastic scalability and cost efficiency. Like cloud-based Rockset, it frees users from tedious database operations and maintenance and redirects their focus to what drives their business growth.</p>]]></content>
<author>
<name>Zaki Lu</name>
</author>
<category label="Top News" term="Top News"/>
</entry>
<entry>
<title type="html"><![CDATA[Steps to industry-leading query speed: evolution of the Apache Doris execution engine]]></title>
<id>https://doris.apache.org/zh-CN/blog/evolution-of-the-apache-doris-execution-engine</id>
<link href="https://doris.apache.org/zh-CN/blog/evolution-of-the-apache-doris-execution-engine"/>
<updated>2024-06-18T00:00:00.000Z</updated>
<summary type="html"><![CDATA[From the Volcano Model to the Pipeline Execution Engine, and now PipelineX, Apache Doris brings its computation efficiency to a higher level with each iteration.]]></summary>
<content type="html"><![CDATA[<p>What makes a modern database system? The three key modules are query optimizer, execution engine, and storage engine. Among them, the role of execution engine to the DBMS is like the chef to a restaurant. This article focuses on the execution engine of the <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a> data warehouse, explaining the secret to its high performance.</p><p>To illustrate the role of the execution engine, let's follow the execution process of an SQL statement: </p><ul><li>Upon receiving an SQL query, the query optimizer performs syntax/lexical analysis and generates the optimal execution plan based on the cost model and optimization rules.</li></ul><ul><li>The execution engine then schedules the plan to the nodes, which operate on data in the underlying storage engine and then return the query results.</li></ul><p>The execution engine performs operations like data reading, filtering, sorting, and aggregation. The efficiency of these steps determines query performance and resource utilization. That's why different execution models bring distinction in query efficiency.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="volcano-model">Volcano Model<a href="#volcano-model" class="hash-link" aria-label="Volcano Model的直接链接" title="Volcano Model的直接链接"></a></h2><p>The Volcano Model (originally known as the Iterator Model) predominates in analytical databases, followed by the Materialization Model and Vectorized Model. In a Volcano Model, each operation is abstracted as an operator, so the entire SQL query is an operator tree. During query execution, the tree is traversed top-down by calling the <code>next()</code> interface, and data is pulled and processed from the bottom up. This is called a <strong>pull-based</strong> execution model. </p><p>The Volcano Model is flexible, scalable, and easy to implement and optimize. It underpins Apache Doris before version 2.1.0. When a user initiates an SQL query, Doris parses the query, generates a distributed execution plan, and dispatches tasks to the nodes for execution. Each individual task is an <strong>instance</strong>. Take a simple query as an example: </p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select age, sex from employees where age &gt; 30</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><img loading="lazy" alt="Volcano Model" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-volcano-model-bf7cd6bf0fe5c369d9ae4ac0efa617d4.png" width="1280" height="874" class="img_ev3q"></p><p>In an instance, data flows between operators are propelled by the <code>next()</code> method. 
When the <code>next()</code> method of an operator is called, it first calls the <code>next()</code> of its child operator, obtains data from it, and then processes that data to produce output.</p><p><code>next()</code> is a synchronous method. In other words, the current operator is blocked if its child operator does not provide data for it. In this case, the <code>next()</code> method of the root operator needs to be called in a loop until all data is processed, which is when the instance finishes its computation.</p><p>Such an execution mechanism faces a few bottlenecks in single-node, multi-core use cases (see the sketch after this list for the bucket-count issue):</p><ul><li><p><strong>Thread blocking</strong>: In a fixed-size thread pool, if an instance occupies a thread and is blocked, a deadlock can easily arise when a large number of instances request execution simultaneously. This is especially the case when the current instance depends on other instances. Additionally, if a node runs more instances than it has CPU cores, the system scheduling mechanism is heavily relied upon, which can produce a huge context switching overhead. In a colocation scenario, this leads to an even larger thread switching overhead.</p></li><li><p><strong>CPU contention</strong>: The threads might compete for CPU resources, so queries of different sizes and from different tenants might interfere with each other.</p></li><li><p><strong>Underutilization of multi-core computing capabilities</strong>: Execution concurrency relies heavily on data distribution. Specifically, the number of instances running on a node is limited by the number of data buckets on that node, so it's important to set an appropriate number of buckets. If you shard the data into too many buckets, that becomes a burden for the system and brings unnecessary overheads; if the buckets are too few, you cannot utilize your CPU cores to the fullest. However, in a production environment, it is not always easy to estimate the proper number of buckets, hence the performance loss.</p></li></ul>
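<p>For context, the bucket count that caps this concurrency is fixed at table creation time. A minimal, hypothetical example (table and column names made up):</p><pre><code class="language-sql">-- Hypothetical DDL: the number of instances that can scan this table
-- on a node is capped by the tablets (buckets) stored on that node.
CREATE TABLE site_visits (
    user_id BIGINT,
    visit_time DATETIME
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES ("replication_num" = "1");
</code></pre>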
<h2 id="pipeline-execution-engine">Pipeline Execution Engine</h2><p>Based on the known issues of the Volcano Model, we replaced it with the Pipeline Execution Engine in Apache Doris 2.0.0.</p><p>As the name suggests, the Pipeline Execution Engine breaks down the execution plan into pipeline tasks and schedules these pipeline tasks into a thread pool in a time-sharing manner. If a pipeline task is blocked, it is put on hold to release the thread it occupies. Meanwhile, it supports various scheduling strategies, meaning that you can allocate CPU resources to different queries and tenants more flexibly.</p><p>Additionally, the Pipeline Execution Engine pools together data within data buckets, so the number of running instances is no longer capped by the number of buckets. This not only enhances Apache Doris' utilization of multi-core systems, but also improves system performance and stability by avoiding frequent thread creation and deletion.</p><h3 id="example">Example</h3><p>This is the execution plan of a join query. It includes two instances:</p><p><img alt="Pipeline Execution Engine" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-pipeline-execution-engine-0f462e7876f79b16a3dd1cb21daa8444.png"></p><p>As illustrated, the Probe operation can only be executed after the hash table is built, while the Build operation relies on the computation results of the Exchange operator. Each of the two instances is thus divided into two pipeline tasks. These tasks are then scheduled in the "ready" queue of the thread pool. Following the specified strategies, the threads obtain the tasks to process. In a pipeline task, after one data block is finished, if the relevant data is ready and the task's runtime stays within the maximum allowed duration, the thread will continue to compute the next data block.</p><h3 id="design--implementation">Design &amp; implementation</h3><p><strong>Avoid thread blocking</strong></p><p>As mentioned earlier, the Volcano Model faces a few bottlenecks:</p><ol><li>If too many threads are blocked, the thread pool becomes saturated and unable to respond to subsequent queries.</li><li>Thread scheduling is entirely managed by the operating system, without any user-level control or customization.</li></ol><p>How does the Pipeline Execution Engine avoid such issues?</p><ol><li>We fix the size of the thread pool to match the CPU core count. Then we split all operators that are prone to blocking into pipeline tasks. For example, we use individual threads for disk I/O operations and RPC operations.</li><li>We design a user-space polling scheduler. It continuously checks the state of all executable pipeline tasks and assigns executable tasks to threads. With this in place, the operating system doesn't have to frequently switch threads, which means less overhead. It also allows customized scheduling strategies, such as assigning priorities to tasks.</li></ol><p><img alt="Design &amp; implementation" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipeline-design-implementation-21c2bccd9d7d386133decbf17c2040fb.png"></p><p><strong>Parallelization</strong></p><p>Before version 2.0, Apache Doris required users to set a concurrency parameter for the execution engine (<code>parallel_fragment_exec_instance_num</code>), which does not dynamically change based on the workloads. It is therefore a burden for users to figure out an appropriate concurrency level that leads to optimal performance.</p>
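<p>For reference, this is roughly what that manual tuning looked like (a sketch; the query is made up, and newer pipeline-based versions use a different knob):</p><pre><code class="language-sql">-- Pre-2.0 style: the user guesses a per-fragment instance count per session.
SET parallel_fragment_exec_instance_num = 8;
SELECT COUNT(*) FROM orders GROUP BY region;  -- hypothetical query
</code></pre>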
<p>What's the industry's solution to this?</p><p>Presto's idea is to shuffle the data into a reasonable number of partitions during execution, which requires minimal concurrency control from users. DuckDB, on the other hand, introduces an extra synchronization mechanism instead of shuffling. We decided to follow Presto's track, because the DuckDB solution inevitably involves the use of locks, which works against our purpose of avoiding blocking.</p><p>Unlike Presto, Apache Doris doesn't need an extra Local Exchange mechanism to shard the data into an appropriate number of partitions. With its massively parallel processing (MPP) architecture, Doris already does so during shuffling. (In Presto's case, it re-partitions the data via Local Exchange for higher execution concurrency.) For example, in hash aggregation, Doris further shards the data based on the aggregation key in order to fully utilize the CPU cores. This also downsizes the hash table that each execution thread has to build.</p><p><img alt="Design &amp; implementation" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipeline-design-implementation-2-a687d57ca081c1170f31ab0145d5caff.png"></p><p>Based on the MPP architecture, we only need two improvements to achieve what we want in Doris:</p><ul><li><strong>Increase the concurrency level during shuffling.</strong> For this, we only need to have the frontend (FE) perceive the backend (BE) environment and then set a reasonable number of partitions.</li><li><strong>Implement concurrent execution after data reading by the scan layer.</strong> To do this, we need a logical restructuring of the scan layer to decouple the threads from the number of data tablets. This is a pooling process: we pool the data read by scanner threads, so it can be fetched by multiple pipeline tasks directly.</li></ul><p><img alt="Design &amp; implementation" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipleine-design-implementation-3-fbbbb04f5de750bad1fd28694ca3b1d7.jpeg"></p><h2 id="pipelinex">PipelineX</h2><p>Introduced in Apache Doris 2.0.0, the pipeline execution engine has been improving query performance and stability under hybrid workload scenarios (queries of different sizes and from different tenants). In <a href="https://doris.apache.org/blog/release-note-2.1.0" target="_blank" rel="noopener noreferrer">version 2.1.0</a>, we tackled the known issues and upgraded it from an experimental feature to a robust and reliable solution, which is what we call <a href="https://doris.apache.org/docs/query/pipeline/pipeline-x-execution-engine" target="_blank" rel="noopener noreferrer">PipelineX</a>.</p><p>PipelineX answers the following issues that used to challenge the Pipeline Execution Engine:</p><ul><li><strong>Limited execution concurrency</strong></li><li><strong>High execution overhead</strong></li><li><strong>High scheduling overhead</strong></li><li><strong>Poor readability of operator profiles</strong></li></ul>
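<p>On 2.1, PipelineX can typically be toggled per session while you evaluate it (a sketch; the variable name and its default may differ across versions, so check the release notes):</p><pre><code class="language-sql">-- Hypothetical session-level switch for trying out PipelineX.
SET enable_pipeline_x_engine = true;
</code></pre>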
<h3 id="execution-concurrency">Execution concurrency</h3><p>The Pipeline Execution Engine remains under the restriction of the static concurrency parameter at the FE and the tablet count at the storage layer, so it cannot capitalize on the full computing resources. Plus, it is easily affected by data skew.</p><p>For example, suppose that Table A contains 100 million rows but has only 1 tablet, meaning it is under-sharded. Let's see what happens when you perform an aggregation query on it:</p><pre><code class="language-sql">SELECT COUNT(*) FROM A GROUP BY A.COL_1;
</code></pre>
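<p>The under-sharded table in this example might be created like the following (hypothetical DDL; the single bucket is what produces the single tablet):</p><pre><code class="language-sql">-- Hypothetical DDL for Table A: one bucket means one tablet,
-- so a scan-bearing fragment can only run in one instance here.
CREATE TABLE A (
    COL_1 VARCHAR(32),
    COL_2 BIGINT
)
DUPLICATE KEY(COL_1)
DISTRIBUTED BY HASH(COL_1) BUCKETS 1
PROPERTIES ("replication_num" = "1");
</code></pre>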
<p>During query execution, the query plan is divided into two <strong>fragments</strong>. Each fragment, consisting of multiple operators, is dispatched by the frontend (FE) to the backends (BEs). The BEs start threads to execute the fragments concurrently.</p><p><img alt="Pipeline Execution concurrency" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-pipelinex-4e2f29eb887b9b2dc157ab32aea356b4.png"></p><p>Now, let's focus on Fragment 0 for further elaboration. Because there is only one tablet, Fragment 0 can only be executed by one thread. That means aggregating 100 million rows with a single thread. If you have 16 CPU cores, ideally the system could allocate 8 threads to execute Fragment 0, so there is a concurrency disparity of 8 to 1. This is how <strong>the number of tablets restricts execution concurrency</strong>, and also why we introduced the <strong>Local Shuffle mechanism to remove that restriction</strong> in Apache Doris 2.1.0. This is how it works in PipelineX:</p><ul><li>The threads execute their own pipeline tasks, but each pipeline task only maintains its runtime state (known as <strong>Local State</strong>), while the information shared across all pipeline tasks (known as <strong>Global State</strong>) is managed by one pipeline object.</li><li>On a single BE, the Local Shuffle mechanism is responsible for data distribution and data balancing across pipeline tasks.</li></ul><p><img alt="Pipeline Execution concurrency" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-pipelinex-1-6f7365d5417f254f369c691827d034eb.png"></p><p>Apart from decoupling execution concurrency from the tablet count, Local Shuffle can avoid the performance loss caused by data skew. Again, we will explain with the foregoing example.</p><p>This time, we shard Table A into two tablets instead of one, but the data is not evenly distributed: the two tablets hold 10 million and 90 million rows, respectively. The Pipeline Execution Engine and PipelineX Execution Engine respond differently to such data skew:</p><ul><li><strong>Pipeline Execution Engine</strong>: Thread 1 and Thread 2 execute Fragment 1 concurrently. The latter takes 9 times as long as the former because of the different data sizes they deal with.</li><li><strong>PipelineX Execution Engine</strong>: With Local Shuffle, data is distributed evenly to the two threads, so they take almost equal time to finish.</li></ul><p><img alt="Pipeline vs PipelineX execution engine" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-pipelinex-3-4be6cbf01667eb21a88ce0914e3a404a.png"></p><h3 id="execution-overhead">Execution overhead</h3><p>Under the Pipeline Execution Engine, each instance has its own expressions, so each instance is initialized individually. However, since the initialization parameters of instances share a lot in common, we can reuse the shared states to reduce execution overhead. This is what PipelineX does: it initializes the Global State once, and the Local States sequentially.</p><p><img alt="Execution overhead" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipeline-execution-overhead-f3d56f790b260c95154e1039faf202d9.png"></p><h3 id="scheduling-overhead">Scheduling overhead</h3><p>In the Pipeline Execution Engine, blocked tasks are put into a blocked queue, where a dedicated thread polls them and moves the executable tasks over to the runnable queue. This dedicated scheduling thread consumes a CPU core and incurs overhead that can be particularly noticeable on systems with limited computing resources.</p><p><strong>As a better solution, PipelineX encapsulates the blocking conditions as dependencies, and the task status (blocked or runnable) is switched by event notifications.</strong> Specifically, when RPC data arrives, the relevant task is marked as ready by the ExchangeSourceOperator and moved to the runnable queue.</p><p><img alt="Scheduling overhead" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipeline-scheduling-overhead-8fc879065b8ec7b550448f12d8cbad7d.png"></p><p>That means <strong>PipelineX implements event-driven scheduling</strong>. A query execution plan can be depicted as a DAG, where the pipeline tasks are abstracted as nodes and the dependencies as edges. Whether a pipeline task gets executed depends on whether all its associated dependencies have satisfied the requisite conditions.</p><p><img alt="Scheduling overhead" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipeline-scheduling-overhead-2-2141c79fe8435bc10d6f4e933e009ef4.jpeg"></p><p>For simplicity of illustration, the above DAG only shows the dependencies between the upstream and downstream pipeline tasks. In fact, all blocking conditions are abstracted as dependencies.
The complete execution workflow of a pipeline task is as follows:</p><p><img alt="Scheduling overhead" src="https://cdnd.selectdb.com/zh-CN/assets/images/pipeline-scheduling-overhead-3-7a77f777afc96de866410fd9624ce605.png"></p><p>In event-driven execution, a pipeline task is only executed after all its dependencies satisfy the conditions; otherwise, it is added to the blocked queue. When an external event arrives, all blocked tasks are re-evaluated to see if they're runnable.</p><p>The event-driven design of PipelineX eliminates the need for a polling thread and thus the consequent performance loss under high cluster loads. Moreover, the encapsulation of dependencies enables a more flexible scheduling framework, making it easier to spill data to disks.</p><h3 id="operator-profile">Operator profile</h3><p>PipelineX has reorganized the metrics in the operator profiles, adding new ones and retiring irrelevant ones. Besides, with the dependencies encapsulated, the <code>WaitForDependency</code> metric records how long each dependency takes to get ready, so the profile gives a clear picture of the time spent in each step. Here are two examples:</p><ul><li><p><strong>Scan Operator</strong>: The total execution time of <code>OLAP_SCAN_OPERATOR</code> is 457.750ms, including the time spent by the scanner on data reading (436.883ms) and the time spent in actual execution.</p><pre><code>OLAP_SCAN_OPERATOR (id=4. table name = Z03_DI_MID):
     - ExecTime: 457.750ms
     - WaitForDependency[OLAP_SCAN_OPERATOR_DEPENDENCY]Time: 436.883ms
</code></pre></li><li><p><strong>Exchange Source Operator</strong>: The execution time of <code>EXCHANGE_OPERATOR</code> is 86.691us.
The time spent waiting for data from upstream is 409.256us.</p><pre><code>EXCHANGE_OPERATOR (id=3):
     - ExecTime: 86.691us
     - WaitForDependencyTime: 0ns
     - WaitForData0: 409.256us
</code></pre></li></ul>
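<p>To inspect such profiles yourself, query profiling generally needs to be switched on first (a sketch; the exact variable and the retrieval path may vary by version and deployment):</p><pre><code class="language-sql">-- Hypothetical session: enable profile collection, run the query,
-- then read the profile via the FE web UI.
SET enable_profile = true;
SELECT COUNT(*) FROM A GROUP BY A.COL_1;
</code></pre>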
<h2 id="whats-next">What's next</h2><p>From the Volcano Model to the Pipeline Execution Engine, Apache Doris 2.0.0 overcame the deadlocks under high cluster loads and greatly increased CPU utilization. Now, from the Pipeline Execution Engine to PipelineX, Apache Doris 2.1.0 is more production-friendly, as it has ironed out the kinks in concurrency, overheads, and operator profiles.</p><p>Next on our roadmap is to support spilling data to disk in PipelineX to further improve query speed and system reliability. We also plan to advance further in terms of automation, such as self-adaptive concurrency and auto execution plan optimization, accompanied by NUMA technologies to harvest better performance from hardware resources.</p><p>If you want to talk to the amazing Doris developers who lead these changes, you are more than welcome to join the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2gmq5o30h-455W226d79zP3L96ZhXIoQ" target="_blank" rel="noopener noreferrer">Apache Doris</a> community.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Another lifesaver for data engineers: Apache Doris Job Scheduler for task automation]]></title>
<id>https://doris.apache.org/zh-CN/blog/job-scheduler-for-task-automation</id>
<link href="https://doris.apache.org/zh-CN/blog/job-scheduler-for-task-automation"/>
<updated>2024-06-06T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The built-in Doris Job Scheduler triggers pre-defined operations efficiently and reliably. It is useful in many cases including ETL and data lake analytics.]]></summary>
<content type="html"><![CDATA[<p>Job scheduling is an important part of data management as it enables regular data updates and cleanups. In a data platform, it is often undertaken by workflow orchestration tools like <a href="https://airflow.apache.org" target="_blank" rel="noopener noreferrer">Apache Airflow</a> and <a href="https://dolphinscheduler.apache.org/en-us" target="_blank" rel="noopener noreferrer">Apache Dolphinscheduler</a>. However, adding another component to the data architecture also means investing extra resources for management and maintenance. That's why <a href="https://doris.apache.org/blog/release-note-2.1.0" target="_blank" rel="noopener noreferrer">Apache Doris 2.1.0</a> introduces a built-in Job Scheduler. It is strategically more tailored to Apache Doris, and brings higher scheduling flexibility and architectural simplicity. </p><p>The Doris Job Scheduler triggers the pre-defined operations at specific time points or intervals, thus allowing for efficient and reliable task automation. Its key capabilities include: </p><ul><li><p><strong>Efficiency</strong>: It adopts the TimeWheel algorithm to ensure that the triggering of tasks is precise to the second.</p></li><li><p><strong>Flexibility</strong>: It supports both one-time jobs and regular jobs. For the latter, users can define the start/end time, and intervals of minutes, hours, days, or weeks.</p></li><li><p><strong>Execution thread pool and processing queue</strong>: It is supported by a Disruptor-based single-producer, multi-consumer model to avoid task execution overload.</p></li><li><p><strong>Traceability</strong>: It keeps track of the latest task execution records (configurable), which are queryable by a simple command. </p></li><li><p><strong>Availability</strong>: Like Apache Doris itself, the Doris Job Scheduler is easily recoverable and highly available.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="syntax--examples">Syntax &amp; examples<a href="#syntax--examples" class="hash-link" aria-label="Syntax &amp; examples的直接链接" title="Syntax &amp; examples的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="syntax-description">Syntax description<a href="#syntax-description" class="hash-link" aria-label="Syntax description的直接链接" title="Syntax description的直接链接"></a></h3><p>A valid job statement consists of the following elements:</p><ul><li><p><code>CREATE JOB</code>: Specifies the job name as a unique identifier.</p></li><li><p>The <code>ON SCHEDULE</code> clause: Specifies the type, trigger time, and frequency of the job.</p><ul><li><p><code>AT timestamp</code>: This is used to specify a one-time job. <code>AT CURRENT_TIMESTAMP</code> means that the job will run immediately upon creation. </p></li><li><p><code>EVERY</code>: This is used to specify a regular job. You can define the execution frequency of the job. The interval can be measured in weeks, days, hours, and minutes.</p><ul><li>The <code>EVERY</code> clause supports an optional <code>STARTS</code> clause with a timestamp to define the start time of the recurring schedule. <code>CURRENT_TIMESTAMP</code> can be used. It also supports an optional <code>ENDS</code> clause to specify the end time for the job.</li></ul></li></ul></li><li><p>The <code>DO</code> clause defines the action to be performed when the job is executed. 
<h2 id="syntax--examples">Syntax &amp; examples</h2><h3 id="syntax-description">Syntax description</h3><p>A valid job statement consists of the following elements:</p><ul><li><p><code>CREATE JOB</code>: Specifies the job name as a unique identifier.</p></li><li><p>The <code>ON SCHEDULE</code> clause: Specifies the type, trigger time, and frequency of the job.</p><ul><li><p><code>AT timestamp</code>: This is used to specify a one-time job. <code>AT CURRENT_TIMESTAMP</code> means that the job will run immediately upon creation.</p></li><li><p><code>EVERY</code>: This is used to specify a regular job. You can define the execution frequency of the job; the interval can be measured in weeks, days, hours, and minutes.</p><ul><li>The <code>EVERY</code> clause supports an optional <code>STARTS</code> clause with a timestamp to define the start time of the recurring schedule (<code>CURRENT_TIMESTAMP</code> can be used), and an optional <code>ENDS</code> clause to specify the end time for the job.</li></ul></li></ul></li><li><p>The <code>DO</code> clause defines the action to be performed when the job is executed. At this time, the only supported operation is INSERT.</p><pre><code class="language-sql">CREATE
    JOB
    job_name
    ON SCHEDULE schedule
    [COMMENT 'string']
    DO execute_sql;

schedule: {
    AT timestamp
    | EVERY interval
    [STARTS timestamp]
    [ENDS timestamp]
}

interval:
    quantity { WEEK | DAY | HOUR | MINUTE }
</code></pre><p>Example:</p><pre><code class="language-sql">CREATE JOB my_job ON SCHEDULE EVERY 1 MINUTE DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2;
</code></pre><p>The above statement creates a job named <code>my_job</code>, which loads data from <code>db2.tbl2</code> to <code>db1.tbl1</code> every minute.</p></li></ul><h3 id="more-examples">More examples</h3><p><strong>Create a one-time job</strong>: Load data from <code>db2.tbl2</code> to <code>db1.tbl1</code> at 2025-01-01 00:00:00.</p><pre><code class="language-sql">CREATE JOB my_job ON SCHEDULE AT '2025-01-01 00:00:00' DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2;
</code></pre><p><strong>Create a regular job without specifying the end time</strong>: Load data from <code>db2.tbl2</code> to <code>db1.tbl1</code> once a day starting from 2025-01-01 00:00:00.</p><pre><code class="language-sql">CREATE JOB my_job ON SCHEDULE EVERY 1 DAY STARTS '2025-01-01 00:00:00' DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2 WHERE create_time &gt;= days_add(now(), -1);
</code></pre><p><strong>Create a regular job within a specified period</strong>: Load data from <code>db2.tbl2</code> to <code>db1.tbl1</code> once a day, beginning at 2025-01-01 00:00:00 and finishing at 2026-01-01 00:10:00.</p><pre><code class="language-sql">CREATE JOB my_job ON SCHEDULE EVERY 1 DAY STARTS '2025-01-01 00:00:00' ENDS '2026-01-01 00:10:00' DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2 WHERE create_time &gt;= days_add(now(), -1);
</code></pre><p><strong>Asynchronous execution</strong>: Because jobs are executed asynchronously in Doris, tasks that require asynchronous execution, such as <code>INSERT INTO ... SELECT</code>, can be implemented by a job.</p>
<p>For example, to asynchronously execute data loading from <code>db2.tbl2</code> to <code>db1.tbl1</code>, simply create a one-time job for it and schedule it at <code>current_timestamp</code>.</p><pre><code class="language-sql">CREATE JOB my_job ON SCHEDULE AT current_timestamp DO INSERT INTO db1.tbl1 SELECT * FROM db2.tbl2;
</code></pre><h2 id="auto-data-synchronization">Auto data synchronization</h2><p>The combination of the Job Scheduler and the <a href="https://doris.apache.org/docs/lakehouse/lakehouse-overview#multi-catalog" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> feature of Apache Doris is an efficient way to implement regular data synchronization across data sources.</p><p>This is useful in many cases, such as for an e-commerce user who regularly needs to load business data from MySQL to Doris for analysis.</p><p><strong>Example</strong>: Filter consumers by total consumption amount, last visit time, sex, and city in the table below, and regularly import the query results into Doris.</p><p><img alt="Auto data synchronization" src="https://cdnd.selectdb.com/zh-CN/assets/images/auto-data-synchronization-e697db722413038ee542074082391276.png"></p><p><strong>Step 1</strong>: Create a table in Doris</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">IF</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">EXISTS</span><span class="token plain"> user_activity</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line"
style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">user_id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> LARGEINT </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"User ID"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">date</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DATE</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"Time of data import"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">city</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">VARCHAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">20</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"User city"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">age</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SMALLINT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"User age"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier 
punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">sex</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TINYINT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"User sex"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">last_visit_date</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DATETIME</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">REPLACE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DEFAULT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"1970-01-01 00:00:00"</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"Time of user's last visit"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">cost</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> SUM </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DEFAULT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"0"</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"User's total consumption amount"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">max_dwell_time</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INT</span><span class="token plain"> MAX </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DEFAULT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"0"</span><span class="token plain"> 
</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"Maximum dwell time of user"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">min_dwell_time</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INT</span><span class="token plain"> MIN </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DEFAULT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"99999"</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"Minimum dwell time of user"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">AGGREGATE </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">KEY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">user_id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">date</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">city</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">age</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">sex</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DISTRIBUTED</span><span class="token 
plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HASH</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">user_id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> BUCKETS </span><span class="token number">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token string" style="color:rgb(255, 121, 198)">"replication_allocation"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"tag.location.default: 1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Step 2</strong>: Create a catalog in Doris to map to the data in MySQL</p><div class="language-Bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Bash codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE CATALOG activity PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "type"="jdbc",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "user"="root",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "jdbc_url" = "jdbc:mysql://127.0.0.1:9734/user?useSSL=false",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "driver_url" = "mysql-connector-java-5.1.49.jar",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "driver_class" = "com.mysql.jdbc.Driver"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg 
viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Step 3</strong>: Ingest data from MySQL to Doris. Leverage the catalog mechanism and the Insert Into method for full data ingestion. (We recommend that such operations be executed during low-traffic hours to minimize potential service disruptions.)</p><ul><li><p><strong>One-time job</strong>: Schedule a one-time full-scale data loading that starts at 2024-8-10 03:00:00.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> JOB one_time_load_job</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ON</span><span class="token plain"> SCHEDULE </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">AT </span><span class="token string" style="color:rgb(255, 121, 198)">'2024-8-10 03:00:00'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DO</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INSERT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INTO</span><span class="token plain"> user_activity </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> activity</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">user</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">activity </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" 
class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div></li><li><p><strong>Regular job</strong>: Create a regular job to update data periodically.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> JOB schedule_load</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ON</span><span class="token plain"> SCHEDULE EVERY </span><span class="token number">1</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DAY</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DO</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INSERT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INTO</span><span class="token plain"> user_activity </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> activity</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">user</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">activity </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">where</span><span class="token plain"> create_time </span><span class="token operator">&gt;=</span><span class="token plain"> days_add</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token function" style="color:rgb(80, 250, 123)">now</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token operator">-</span><span class="token number">1</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="technical-design--implementation">Technical design &amp; implementation<a href="#technical-design--implementation" class="hash-link" aria-label="Direct link to Technical design &amp; implementation" title="Direct link to Technical design &amp; implementation"></a></h2><p>Efficient scheduling often comes at the cost of significant resource consumption, and high-precision scheduling is even more resource-intensive. To implement job scheduling, some people rely on the built-in scheduling capabilities of Java, while others employ job scheduling libraries. But what if we want higher precision and lower memory usage than these solutions can reach? For that, the Doris developers combine the TimingWheel algorithm with the Disruptor framework to achieve second-level job scheduling.</p><p><img loading="lazy" alt="Technical design &amp;amp; implementation" src="https://cdnd.selectdb.com/zh-CN/assets/images/technical-design-and-implementation-86d5f8b242e698232e33c6c1f6e25c1e.png" width="1280" height="1000" class="img_ev3q"></p><p>To implement the TimingWheel algorithm, we leverage the HashedWheelTimer in Netty. The Job Manager places tasks into the TimeWheel every 10 minutes (by default) for scheduling. To ensure efficient task triggering and avoid high resource usage, we adopt a Disruptor-based single-producer, multi-consumer model. The TimeWheel only triggers tasks but does not execute jobs directly. Tasks that need to be triggered upon expiration are handed to a dispatch thread, which distributes them to the appropriate execution thread pool. Tasks that need to be executed immediately are submitted directly to the corresponding execution thread pool.</p><p>Processing efficiency is improved by reducing unnecessary traversal: the definition of a one-time task is removed after execution, while for recurring tasks, system events in the TimeWheel periodically fetch the next round of tasks to be executed. This avoids the accumulation of tasks in a single bucket.</p><p>In addition, for transactional tasks, the Job Scheduler can ensure data consistency and integrity through its transaction association and transaction callback mechanisms. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="applicable-scenarios">Applicable scenarios<a href="#applicable-scenarios" class="hash-link" aria-label="Direct link to Applicable scenarios" title="Direct link to Applicable scenarios"></a></h2><p>The Doris Job Scheduler is a Swiss Army Knife. It is not only useful in ETL and data lake analytics as we mentioned, but also critical for the implementation of <a href="https://doris.apache.org/docs/query/view-materialized-view/async-materialized-view" target="_blank" rel="noopener noreferrer">asynchronous materialized views</a>. An asynchronous materialized view is a pre-computed result set. Unlike normal materialized views, it can be built on multiple tables. Thus, as you can imagine, a change in any of the source tables calls for an update to the asynchronous materialized view. That's why we apply the job scheduling mechanism for periodic data refreshing in asynchronous materialized views, which is low-maintenance and also ensures data consistency.</p>
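<p>To make this concrete, here is a sketch of an asynchronous materialized view declared with a built-in refresh schedule. The table and column names are hypothetical, and the clauses follow the asynchronous materialized view syntax of recent Doris versions; internally, the periodic refresh rides on the same job scheduling mechanism.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">-- Hypothetical multi-table materialized view, fully refreshed once a day
CREATE MATERIALIZED VIEW daily_sales_mv
BUILD IMMEDIATE
REFRESH COMPLETE
ON SCHEDULE EVERY 1 DAY STARTS '2025-01-01 00:00:00'
DISTRIBUTED BY HASH(order_date) BUCKETS 2
AS
SELECT o.order_date, c.city, SUM(o.amount) AS total_amount
FROM orders o JOIN customers c ON o.customer_id = c.id
GROUP BY o.order_date, c.city;
</code></pre></div></div>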
The <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2gmq5o30h-455W226d79zP3L96ZhXIoQ" target="_blank" rel="noopener noreferrer">Apache Doris developer community</a> is looking at:</p><ul><li><p>Displaying the distribution of tasks executed in different time slots on the WebUI.</p></li><li><p>DAG jobs. This will allow data warehouse task orchestration within Apache Doris, which will unlock many possibilities when it is combined with the Multi-Catalog feature. </p></li><li><p>Support for more operations such as UPDATE and DELETE.</p></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris version 2.0.11 has been released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.11</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.11"/>
<updated>2024-06-05T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, about 123 improvements and bug fixes have been made in Doris 2.0.11 version.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 123 improvements and bug fixes have been made in Doris 2.0.11 version.</p><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="1-behavior-change">1 Behavior change<a href="#1-behavior-change" class="hash-link" aria-label="1 Behavior change的直接链接" title="1 Behavior change的直接链接"></a></h2><p>Since the inverted index is now mature and stable, it can replace the old BITMAP INDEX. Therefore, any newly created <code>BITMAP INDEX</code> will automatically switch to an <code>INVERTED INDEX</code>, while existing <code>BITMAP INDEX</code> will remain unchanged. This entire switching process is transparent to the user, with no changes to writing or querying. Additionally, users can disable this automatic switch by setting the FE configuration <code>enable_create_bitmap_index_as_inverted_index</code> to false. <a href="https://github.com/apache/doris/pull/35528" target="_blank" rel="noopener noreferrer">#35528</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="2-improvement-and-optimizations">2 Improvement and optimizations<a href="#2-improvement-and-optimizations" class="hash-link" aria-label="2 Improvement and optimizations的直接链接" title="2 Improvement and optimizations的直接链接"></a></h2><ul><li><p>Add Trino JDBC Catalog type mapping for JSON and TIME</p></li><li><p>FE exit when failed to transfer to (non) master to prevent unknown state and too many logs</p></li><li><p>Write audit log while doing drop stats table.</p></li><li><p>Ignore min/max column stats if table is partially analyzed to avoid inefficient query plan</p></li><li><p>Support minus operation for set like <code>set1 - set2</code></p></li><li><p>Improve perfmance of LIKE and REGEXP clause with concat (col, pattern_str), eg. <code>col1 LIKE concat('%', col2, '%')</code></p></li><li><p>Add query options for short circuit queries for upgrade compatibility</p></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/compare/2.0.10...2.0.11" target="_blank" rel="noopener noreferrer">github</a> .</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="credits">Credits<a href="#credits" class="hash-link" aria-label="Credits的直接链接" title="Credits的直接链接"></a></h2><p>Thanks all who contribute to this release:</p><p>@AshinGau, @BePPPower, @BiteTheDDDDt, @ByteYue, @CalvinKirs, @cambyzju, @csun5285, @dataroaring, @eldenmoon, @englefly, @feiniaofeiafei, @Gabriel39, @GoGoWen, @HHoflittlefish777, @hubgeter, @jacktengg, @jackwener, @jeffreys-cat, @Jibing-Li, @kaka11chen, @kobe6th, @LiBinfeng-01, @mongo360, @morningman, @morrySnow, @mrhhsg, @Mryange, @nextdreamblue, @qidaye, @sjyango, @starocean999, @SWJTU-ZhangLei, @w41ter, @wangbo, @wsjz, @wuwenchi, @xiaokang, @XieJiann, @xy720, @yujun777, @Yukang-Lian, @Yulei-Yang, @zclllyybb, @zddr, @zhangstar333, @zhiqiang-hhhh, @zy-kkk, @zzzxl1993</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris for log and time series data analysis in NetEase, why not Elasticsearch and InfluxDB?]]></title>
<id>https://doris.apache.org/zh-CN/blog/apache-doris-for-log-and-time-series-data-analysis-in-netease</id>
<link href="https://doris.apache.org/zh-CN/blog/apache-doris-for-log-and-time-series-data-analysis-in-netease"/>
<updated>2024-05-23T00:00:00.000Z</updated>
<summary type="html"><![CDATA[NetEase (NASDAQ: NTES) has replaced Elasticsearch and InfluxDB with Apache Doris in its monitoring and time series data analysis platforms, respectively, achieving 11X query performance and saving 70% of resources.]]></summary>
<content type="html"><![CDATA[<p>For most people looking for a log management and analytics solution, Elasticsearch is the go-to choice. The same applies to InfluxDB for time series data analysis. These were exactly the choices of <a href="https://finance.yahoo.com/quote/NTES/" target="_blank" rel="noopener noreferrer">NetEase,Inc. <em>(NASDAQ: NTES)</em></a>, one of the world's highest-yielding game companies but more than that. As NetEase expands its business horizons, the logs and time series data it receives explode, and problems like surging storage costs and declining stability come. As NetEase's pick among all big data components for platform upgrades, <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a> fits into both scenarios and brings much faster query performance. </p><p>We list the gains of NetEase after adopting Apache Doris in their monitoring platform and time series data platform, and share their best practice with users who have similar needs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="monitoring-platform-elasticsearch---apache-doris">Monitoring platform: Elasticsearch -&gt; Apache Doris<a href="#monitoring-platform-elasticsearch---apache-doris" class="hash-link" aria-label="Monitoring platform: Elasticsearch -> Apache Doris的直接链接" title="Monitoring platform: Elasticsearch -> Apache Doris的直接链接">​</a></h2><p>NetEase provides a collaborative workspace platform that combines email, calendar, cloud-based documents, instant messaging, and customer management, etc. To oversee its performance and availability, NetEase builds the Eagle monitoring platform, which collects logs for analysis. Eagle was supported by Elasticsearch and Logstash. The data pipeline was simple: Logstash gathers log data, cleans and transforms it, and then outputs it to Elasticsearch, which handles real-time log retrieval and analysis requests from users.</p><p><img loading="lazy" alt="Monitoring platform: Elasticsearch -&amp;gt; Apache Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/monitoring-platform-elasticsearch-5926a8f4794acda07e50b877ffc85c92.PNG" width="1280" height="158" class="img_ev3q"></p><p>Due to NetEase's increasingly sizable log dataset, Elastisearch's index design, and limited hardware resources, the monitoring platform exhibits <strong>high latency</strong> in daily queries. Additionally, Elasticsearch maintains high data redundancy for forward indexes, inverted indexes, and columnar storage. This adds to cost pressure.</p><p>After migration to Apache Doris, NetEase achieves a 70% reduction in storage costs and an 11-fold increase in query speed. </p><p><img loading="lazy" alt="Monitoring platform: Elasticsearch -&amp;gt; Apache Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/monitoring-platform-apache-doris-23c3a1008f0d3e6e59d53047ace7e185.PNG" width="1280" height="160" class="img_ev3q"></p><ul><li><p><strong>70% reduction in storage costs</strong>: This means a dataset that takes up 100TB in Elasticsearch only requires 30TB in Apache Doris. Moreover, thanks to the much-reduced storage space usage, they can replace their HDDs with more expensive SSDs for hot data storage to achieve higher query performance while staying within the same budget.</p></li><li><p><strong>11-fold increase in query speed</strong>: Apache Doris can deliver faster queries while consuming less CPU resources than Elasticsearch. 
As shown below, Doris has reliably low latency in queries of various sizes, while Elasticsearch shows longer latency and greater fluctuations; the smallest speed difference is 11-fold. </p></li></ul><p><img loading="lazy" alt="Apache Doris vs Elasticsearch" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-vs-elasticsearch-query-latency-542660f4457f559a4e594993e28aef4c.PNG" width="1280" height="720" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="time-series-data-platform-influxdb---apache-doris">Time series data platform: InfluxDB -&gt; Apache Doris<a href="#time-series-data-platform-influxdb---apache-doris" class="hash-link" aria-label="Direct link to Time series data platform: InfluxDB -> Apache Doris" title="Direct link to Time series data platform: InfluxDB -> Apache Doris">​</a></h2><p>NetEase is also an instant messaging (IM) PaaS provider. To support this, it built a data platform to analyze time series data from its IM services. The platform was built on InfluxDB, a time series database. Data flowed into a Kafka message queue. After the fields were parsed and cleaned, they arrived in InfluxDB, ready to be queried. InfluxDB responded to both online and offline queries. The former was to generate metric monitoring reports and bills in real time, and the latter was to batch-analyze data from the previous day. </p><p><img loading="lazy" alt="Time series data platform: InfluxDB -&amp;gt; Apache Doris " src="https://cdnd.selectdb.com/zh-CN/assets/images/time-series-data-platform-from-influxdb-to-apache-doris-480aab1f5537e6bd0fba6f1c6801f9c3.PNG" width="1280" height="588" class="img_ev3q"></p><p>This platform was also challenged by the increasing data size and diversifying data sources.</p><ul><li><p><strong>OOM</strong>: Offline data analysis across multiple data sources was putting InfluxDB under huge pressure and causing OOM errors.</p></li><li><p><strong>High storage costs</strong>: Cold data took up a large portion of the dataset but was stored the same way as hot data, which added up to huge expenditures.</p></li></ul><p><img loading="lazy" alt="Time series data platform: InfluxDB -&amp;gt; Apache Doris " src="https://cdnd.selectdb.com/zh-CN/assets/images/time-series-data-platform-influxdb-to-apache-doris-2-def95b716954bcd09bdffa13fef7ed1f.PNG" width="1280" height="588" class="img_ev3q"></p><p>Replacing InfluxDB with Apache Doris has brought higher cost efficiency to the data platform:</p><ul><li><p><strong>Higher throughput</strong>: Apache Doris maintains a writing throughput of 500MB/s and achieves a peak writing throughput of 1GB/s. With InfluxDB, they used to require 22 servers for a CPU utilization rate of 50%. Now, with Doris, it only takes them 11 servers at the same CPU utilization rate. That means Doris helps cut resource consumption by half.</p></li><li><p><strong>67% less storage usage</strong>: The same dataset used 150TB of storage space with InfluxDB but only takes up 50TB with Doris, so Doris helps reduce storage costs by 67%.</p></li><li><p><strong>Faster and more stable query performance</strong>: The performance test selected a random online query SQL statement and ran it 99 consecutive times. 
As shown below, Doris delivers generally faster response times and maintains stability throughout the 99 queries.</p></li></ul><p><img loading="lazy" alt="Doris vs InfluxDB" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-vs-influxdb-cost-effectivity-1026ec10820805c8bffc1f024a8ab2cb.png" width="1280" height="692" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="best-practice">Best practice<a href="#best-practice" class="hash-link" aria-label="Direct link to Best practice" title="Direct link to Best practice"></a></h2><p>Adopting a new product and putting it into a production environment is, after all, a big project. The NetEase engineers came across a few hiccups during the journey, and they are kind enough to share how they solved these problems to save other users some detours.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="table-creation">Table creation<a href="#table-creation" class="hash-link" aria-label="Direct link to Table creation" title="Direct link to Table creation"></a></h3><p>Table schema design has a significant impact on database performance, and this holds for log and time series data processing as well. Apache Doris provides optimization options for these scenarios. These are some recommendations provided by NetEase, applied in the example table creation statement after this list.</p><ol><li><p><strong>Retrieval of the latest N logs</strong>: Using a <code>DATETIME</code> type time field as the primary key can significantly speed up queries.</p></li><li><p><strong>Partitioning strategy</strong>: Use <code>PARTITION BY RANGE</code> based on a time field and enable <a href="https://doris.apache.org/docs/2.0/table-design/data-partition#dynamic-partition" target="_blank" rel="noopener noreferrer">dynamic partition</a>. This allows for auto-management of data partitions.</p></li><li><p><strong>Bucketing strategy</strong>: Adopt random bucketing and set the number of buckets to roughly three times the total number of disks in the cluster. (Apache Doris also provides an <a href="https://doris.apache.org/docs/2.0/table-design/data-partition/#auto-bucket" target="_blank" rel="noopener noreferrer">auto bucket</a> feature to avoid performance loss caused by improper data sharding.)</p></li><li><p><strong>Indexing</strong>: Create indexes for frequently searched fields to improve query efficiency. 
Pay attention to the parser for the fields that require full-text searching, because it determines query accuracy.</p></li><li><p><strong>Compaction</strong>: Optimize the compaction strategies based on your own business needs.</p></li><li><p><strong>Data compression</strong>: Enable <code>ZSTD</code> for a higher compression ratio.</p></li></ol><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">CREATE TABLE log
(
  ts DATETIME,
  host VARCHAR(20),
  msg TEXT,
  status INT,
  size INT,
  INDEX idx_size (size) USING INVERTED,
  INDEX idx_status (status) USING INVERTED,
  INDEX idx_host (host) USING INVERTED,
  INDEX idx_msg (msg) USING INVERTED PROPERTIES("parser" = "unicode")
)
ENGINE = OLAP
DUPLICATE KEY(ts)
PARTITION BY RANGE(ts) ()
DISTRIBUTED BY RANDOM BUCKETS 250
PROPERTIES (
"compression"="zstd",
"compaction_policy" = "time_series",
"dynamic_partition.enable" = "true",
"dynamic_partition.create_history_partition" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-7",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "250"
);
</code></pre></div></div>
</span><span class="token operator">=</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># For more balanced tablet distribution:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">enable_round_robin_create_tablet </span><span class="token operator">=</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">tablet_rebalancer_type </span><span class="token operator">=</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">partition</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Memory optimization for frequent imports:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">max_running_txn_num_per_db </span><span class="token operator">=</span><span class="token plain"> </span><span class="token number">10000</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">streaming_label_keep_max_second </span><span class="token operator">=</span><span class="token plain"> </span><span class="token number">300</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">label_clean_interval_second </span><span class="token operator">=</span><span class="token plain"> </span><span class="token number">300</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Backend (BE) configuration</strong></p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">write_buffer_size=1073741824</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">max_tablet_version_num = 20000</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">max_cumu_compaction_threads = 10 (Half of the total number of CPUs)</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">enable_write_index_searcher_cache = false</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">disable_storage_page_cache = true</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">enable_single_replica_load = true</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">streaming_load_json_max_mb=250</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="stream-load-optimization">Stream Load optimization<a href="#stream-load-optimization" class="hash-link" aria-label="Stream Load optimization的直接链接" title="Stream Load optimization的直接链接"></a></h3><p>During peak times, the data platform is undertaking up to 1 million TPS and a writing throughput of 1GB/s. This is demanding for the system. Meanwhile, at peak time, a large number of concurrent write operations are loading data into lots of tables, but each individual write operation only involves a small amount of data. Thus, it takes a long time to accumulate a batch, which is contradictory to the data freshness requirement from the query side.</p><p>As a result, the data platform was bottlenecked by data backlogs in Apache Kafka. NetEase adopts the <a href="https://doris.apache.org/docs/2.0/data-operate/import/stream-load-manual" target="_blank" rel="noopener noreferrer">Stream Load</a> method to ingest data from Kafka to Doris. So the key was to accelerate Stream Load. After talking to the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris developers</a>, NetEase adopted two optimizations for their log and time series data analysis:</p><ul><li><p><strong>Single replica data loading</strong>: Load one data replica and pull data from it to generate more replicas. This avoids the overhead of ranking and creating indexes for multiple replicas.</p></li><li><p><strong>Single tablet data loading</strong> (<code>load_to_single_tablet=true</code>): Compared to writing data to multiple tablets, this reduces the I/O overhead and the number the small files generated during data loading. 
<p>The above measures are effective in improving data loading performance:</p><ul><li><strong>2X data consumption speed from Kafka</strong></li></ul><p><img loading="lazy" alt="2X data consumption speed from Kafka" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-data-loading-performance-1-ee9e3f0841cd78fa0171bc08c18d6fbb.png" width="1280" height="456" class="img_ev3q"></p><ul><li><strong>75% lower data latency</strong></li></ul><p><img loading="lazy" alt="75% lower data latency" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-data-loading-performance-2-ad5092021a47b02cb0a874cd5511ea0f.png" width="1280" height="574" class="img_ev3q"></p><ul><li><strong>70% faster response of Stream Load</strong></li></ul><p><img loading="lazy" alt="70% faster response of Stream Load" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-data-loading-performance-3-32c579174b74d58a922ad4b29e03acd7.png" width="1280" height="459" class="img_ev3q"></p><p>Before putting the upgraded data platform into their production environment, NetEase conducted extensive stress testing and grayscale testing. This is their experience in tackling errors along the way.</p><p><strong>1. Stream Load timeout:</strong></p><p> In the early stages of stress testing, data imports frequently failed with timeout errors. Additionally, although the processes and cluster status were normal, the monitoring system couldn't collect the correct BE metrics. The engineers obtained the Doris BE stack using Pstack and analyzed it with PT-PMT. They discovered that the root cause was the lack of HTTP chunked encoding or content-length settings when initiating requests. This led Doris to mistakenly consider the data transfer incomplete and remain in a waiting state. The solution was simply to add a chunked encoding setting on the client side.</p><p><strong>2. Data size in a single Stream Load exceeding the threshold:</strong> </p><p> The default limit is 100 MB. The solution was to increase <code>streaming_load_json_max_mb</code> to 250 MB.</p><p><strong>3. Error:</strong> <code>alive replica num 0 &lt; quorum replica num 1</code></p><p> The <code>show backends</code> command revealed that one BE node was in the OFFLINE state. A lookup in the <code>be_custom</code> configuration file revealed a <code>broken_storage_path</code>. Further inspection of the BE logs located the error message "too many open files," meaning the number of file handles opened by the BE process had exceeded the system limit, which caused I/O operations to fail. When Doris detected this abnormality, it marked the disk as unavailable. Because the table was configured with a single replica, when the disk holding that only replica became unavailable, data writing failed.</p><p> The solution was to increase the maximum open file descriptor limit for the process to 1 million, delete the <code>be_custom.conf</code> file, and restart the BE node.</p><p><strong>4. FE memory jitter</strong></p><p> During grayscale testing, the FE could not be connected. The monitoring data showed that the JVM's 32 GB of memory was exhausted, and the <code>bdb</code> directory under the FE's meta directory had ballooned to 50 GB. Memory jitter occurred every hour, with peak memory usage reaching 80%.</p><p> The root cause was improper parameter configuration. During high-concurrency Stream Load operations, the FE records the related load information; each import adds about 200 KB of information to the memory. The cleanup time for such information is controlled by the <code>streaming_label_keep_max_second</code> parameter, which defaults to 12 hours. Reducing it to 5 minutes can prevent FE memory from being exhausted. However, the engineers didn't modify the <code>label_clean_interval_second</code> parameter, which controls the interval of the label cleanup thread. Its default value is 1 hour, which explains the hourly memory jitter.</p><p> The solution was to dial down <code>label_clean_interval_second</code> to 5 minutes as well.</p>
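<p>For reference, both parameters can also be adjusted at runtime through the FE admin command, assuming they are runtime-mutable in your Doris version (otherwise, set them in <code>fe.conf</code> and restart). The 300-second values below simply mirror the fe.conf snippet earlier in this post:</p><pre><code class="language-sql">-- Hypothetical runtime tuning; values match the fe.conf example above
ADMIN SET FRONTEND CONFIG ("streaming_label_keep_max_second" = "300");
ADMIN SET FRONTEND CONFIG ("label_clean_interval_second" = "300");
</code></pre>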
<h3 id="query">Query</h3><p>The engineers found results that did not match the filtering conditions in a query on the Eagle monitoring platform.</p><p><img loading="lazy" alt="Doris query optimization" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-query-optimization-9a78bd121d00c488676981931cf1e981.png" width="1280" height="936" class="img_ev3q"></p><p>This was due to the engineers' misconception of <code>match_all</code> in Apache Doris. <code>match_all</code> identifies data records that include all the specified tokens, and tokenization is based on spaces and punctuation marks. In the unqualified result, although the timestamp did not match, the message included "29", which compensated for the unmatched part in the timestamp. That's why this data record was included in the query results.</p><p><img loading="lazy" alt="Doris query optimization" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-query-optimization-2-778c89a665a9de4e41aacc256b099954.png" width="1144" height="825" class="img_ev3q"></p><p>For Doris to produce what the engineers wanted in this query, <code>MATCH_PHRASE</code> should be used instead, because it also checks the order of the tokens.</p><pre><code class="language-sql">SELECT * FROM table_name WHERE logmsg MATCH_PHRASE 'keyword1 keyword2';
</code></pre><p>Note that when using <code>MATCH_PHRASE</code>, you should enable <code>support_phrase</code> during index creation. Otherwise, the system will perform a full table scan and hard matching, resulting in poor query efficiency.</p>
<pre><code class="language-sql">INDEX idx_name4(column_name4) USING INVERTED PROPERTIES("parser" = "english", "support_phrase" = "true")
</code></pre><p>If you want to enable <code>support_phrase</code> for existing tables that have already been populated with data, you can execute <code>DROP INDEX</code> and then <code>ADD INDEX</code> to replace the old index with a new one. This process is incremental and does not require rewriting the entire table.</p><p><strong>This is another advantage of Doris over Elasticsearch: it supports more flexible index management and allows easy addition and removal of indexes.</strong></p><h2 id="conclusion">Conclusion</h2><p>Apache Doris supports the log and time series analytic workloads of NetEase with higher query performance and lower storage consumption. Beyond these, Apache Doris provides other capabilities such as data lake analysis, since it is designed as an all-in-one big data analytics platform. If you want a quick evaluation of whether Doris is right for your use case, come talk to the Doris makers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.1.3 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.1.3</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.1.3"/>
<updated>2024-05-21T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This version has updated several improvements, including writing data back to Hive, materialized view, permission management and bug fixes. It further enhances the performance and stability of the system.]]></summary>
<content type="html"><![CDATA[<p>Apache Doris 2.1.3 was officially released on May 21, 2024. This version has updated several improvements, including writing data back to Hive, materialized view, permission management and bug fixes. It further enhances the performance and stability of the system.</p><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub Release:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="feature-enhancements">Feature Enhancements<a href="#feature-enhancements" class="hash-link" aria-label="Feature Enhancements的直接链接" title="Feature Enhancements的直接链接"></a></h2><p><strong>1. Support writing back data to hive tables via Hive Catalog</strong></p><p>Starting from version 2.1.3, Apache Doris supports DDL and DML operations on Hive. Users can directly create libraries and tables in Hive through Apache Doris and write data to Hive tables by executing <code>INSERT INTO</code> statements. This feature allows users to perform complete data query and write operations on Hive through Apache Doris, further simplifying the integrated lakehouse architecture.</p><p>Please refer: <a href="https://doris.apache.org/docs/lakehouse/datalake-building/hive-build/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/lakehouse/datalake-building/hive-build/</a></p><p><strong>2. Support building new asynchronous materialized views on top of existing ones</strong></p><p>Users can create new asynchronous materialized views on top of existing ones, directly reusing pre-computed intermediate results for data processing. This simplifies complex aggregation and computation operations, reducing resource consumption and maintenance costs while further accelerating query performance and improving data availability.</p><p><strong>3. Support rewriting through nested materialized views</strong></p><p>Materialized View (MV) is a database object used to store query results. Now, Apache Doris supports rewriting through nested materialized views, which helps optimize query performance.</p><p><strong>4. New <code>SHOW VIEWS</code> statement</strong></p><p>The <code>SHOW VIEWS</code> statement can be used to query views in the database, facilitating better management and understanding of view objects in the database.</p><p><strong>5. Workload Group supports binding to specific BE nodes</strong></p><p>Workload Group can be bound to specific BE nodes, enabling more refined control over query execution to optimize resource usage and improve performance.</p><p><strong>6. Broker Load supports compressed JSON format</strong></p><p>Broker Load now supports importing compressed JSON format data, significantly reducing bandwidth requirements for data transmission and accelerating data import performance.</p><p><strong>7. TRUNCATE Function can use columns as scale parameters</strong></p><p>The TRUNCATE function can now accept columns as scale parameters, providing more flexibility when processing numerical data.</p><p><strong>8. Add new functions <code>uuid_to_int</code> and <code>int_to_uuid</code></strong></p><p>These two functions allow users to convert between UUID and integer, significantly helping in scenarios that require handling UUID data.</p><p><strong>9. 
<p><strong>9. Add <code>bypass_workload_group</code> session variable to bypass the query queue</strong></p><p>The <code>bypass_workload_group</code> session variable allows certain queries to bypass the Workload Group queue and execute directly, which is useful for critical queries that require quick responses.</p><p><strong>10. Add <code>strcmp</code> function</strong></p><p>The <code>strcmp</code> function compares two strings and returns their comparison result, simplifying text data processing.</p><p><strong>11. Support HLL functions <code>hll_from_base64</code> and <code>hll_to_base64</code></strong></p><p>HyperLogLog (HLL) is an algorithm for cardinality estimation. These two functions allow users to decode HLL data from a Base64-encoded string or encode HLL data as a Base64 string, which is very useful for storing and transmitting HLL data.</p><h2 id="optimization-and-improvements">Optimization and Improvements</h2><p><strong>1. Replace SipHash with XXHash to improve shuffle performance</strong></p><p>Both SipHash and XXHash are hash functions, but XXHash can provide faster hashing and better performance in certain scenarios. This optimization improves performance during data shuffling by adopting XXHash.</p><p><strong>2. Asynchronous materialized views support NULL partition columns in OLAP tables</strong></p><p>This enhancement allows asynchronous materialized views to support NULL partition columns in OLAP tables, enhancing data processing flexibility.</p><p><strong>3. Limit maximum string length to 1024 when collecting column statistics to control BE memory usage</strong></p><p>Limiting the string length when collecting column statistics prevents excessive data from consuming too much BE memory, helping maintain system stability and performance.</p><p><strong>4. Support dynamic deletion of the Bitmap cache to improve performance</strong></p><p>Dynamically deleting Bitmap Cache entries that are no longer needed frees up memory and improves system performance.</p><p><strong>5. Reduce memory usage during ALTER operations</strong></p><p>Reducing memory usage during ALTER operations improves the efficiency of system resource utilization.</p><p><strong>6. Support constant folding for complex types</strong></p><p>Supports constant folding for the Array/Map/Struct complex types.</p><p><strong>7. Add support for the Variant type in the Aggregate Key Model</strong></p><p>The Variant data type can store multiple data types. This optimization allows aggregation operations on Variant data, enhancing the flexibility of semi-structured data analysis.</p><p><strong>8. Support the new inverted index format in CCR</strong></p><p><strong>9. Optimize rewriting performance for nested materialized views</strong></p><p><strong>10. Support the decimal256 type in the row-based storage format</strong></p><p>Supporting the decimal256 type in row-based storage extends the system's ability to handle high-precision numerical data.</p><h2 id="behavioral-changes">Behavioral Changes</h2><p><strong>1. Authorization</strong></p><ul><li><p><strong>Grant_priv permission changes</strong>: <code>Grant_priv</code> can no longer be granted arbitrarily.
When performing a <code>GRANT</code> operation, the user needs not only <code>Grant_priv</code> but also the permissions being granted. For example, to grant <code>SELECT</code> permission on <code>table1</code>, the user needs both <code>GRANT</code> permission and <code>SELECT</code> permission on <code>table1</code>. This enhances security and consistency in permission management (see the sketch after this list).</p></li><li><p><strong>Workload Group and Resource usage_priv</strong>: <code>Usage_priv</code> for Workload Group and Resource is no longer global but limited to the specific Resource or Workload Group, making permission granting and usage more specific.</p></li><li><p><strong>Authorization for operations</strong>: Operations that were previously unauthorized now carry corresponding authorization checks, enabling more detailed and comprehensive control of operational permissions.</p></li></ul>
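<p>A minimal sketch of the new rule, with hypothetical users and objects: for <code>u1</code> to run the statement below, <code>u1</code> must now hold both <code>Grant_priv</code> and <code>Select_priv</code> on <code>db1.table1</code>, instead of <code>Grant_priv</code> alone:</p><pre><code class="language-sql">-- Executed as u1; requires Grant_priv plus Select_priv on db1.table1
GRANT SELECT_PRIV ON db1.table1 TO 'u2'@'%';
</code></pre>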
<p><strong>2. LOG directory configuration</strong></p><p>The log directory configuration for FE and BE now uniformly uses the <code>LOG_DIR</code> environment variable. All other types of logs are stored under <code>LOG_DIR</code> as the root directory. To maintain compatibility between versions, the previous configuration item <code>sys_log_dir</code> can still be used.</p><p><strong>3. S3 Table Function (TVF)</strong></p><p>Due to issues with correctly recognizing or processing S3 URLs in certain cases, the parsing logic for object storage paths has been refactored. For file paths in S3 table functions, the <code>force_parsing_by_standard_uri</code> parameter needs to be passed to ensure correct parsing.</p><h2 id="upgrade-issues">Upgrade Issues</h2><p>Since many users use certain keywords as column names or attribute values, a number of keywords have been changed to non-reserved, allowing users to use them as identifiers.</p><h2 id="bug-fixes">Bug Fixes</h2><p><strong>1. Fix no-data error when reading Hive tables on Tencent Cloud COSN</strong></p><p>Resolved the no-data error that could occur when reading Hive tables on Tencent Cloud COSN, enhancing compatibility with Tencent Cloud storage services.</p><p><strong>2. Fix incorrect results returned by the <code>milliseconds_diff</code> function</strong></p><p>Fixed an issue where the <code>milliseconds_diff</code> function returned incorrect results in some cases, ensuring the accuracy of time difference calculations.</p><p><strong>3. User-defined variables should be forwarded to the Master node</strong></p><p>Ensured that user-defined variables are correctly forwarded to the Master node for consistency and correct execution logic across the entire system.</p><p><strong>4. Fix Schema Change issues when adding complex type columns</strong></p><p>Resolved Schema Change issues that could arise when adding complex type columns, ensuring the correctness of Schema Changes.</p><p><strong>5. Fix data loss issue in Routine Load when the FE Master node changes</strong></p><p><code>Routine Load</code> is often used to subscribe to Kafka message queues. This fix addresses potential data loss during FE Master node changes.</p><p><strong>6. Fix Routine Load failure when the Workload Group cannot be found</strong></p><p>Resolved an issue where <code>Routine Load</code> would fail if the specified Workload Group could not be found.</p><p><strong>7. Support the string64 column type to avoid join failures when string sizes overflow uint32</strong></p><p>In some cases, string sizes may exceed the uint32 limit. Supporting the <code>string64</code> type ensures correct execution of string JOIN operations.</p><p><strong>8. Allow Hadoop users to create Paimon Catalogs</strong></p><p>Permitted authorized Hadoop users to create Paimon Catalogs.</p><p><strong>9. Fix <code>function_ipxx_cidr</code> function issues with constant parameters</strong></p><p>Resolved problems with the <code>function_ipxx_cidr</code> functions when handling constant parameters, ensuring the correctness of function execution.</p><p><strong>10. Fix file download errors when restoring from HDFS</strong></p><p>Resolved "failed to download" errors encountered during data restoration using HDFS, ensuring the accuracy and reliability of data recovery.</p><p><strong>11. Fix column permission issues related to hidden columns</strong></p><p>In some cases, permission settings for hidden columns could be incorrect. This fix ensures the correctness and security of column permission settings.</p><p><strong>12. Fix issue where Arrow Flight cannot obtain the correct IP in K8s deployments</strong></p><p>This fix resolves an issue where Arrow Flight could not correctly obtain the IP address in Kubernetes deployment environments.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris version 2.0.10 has been released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.10</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.10"/>
<updated>2024-05-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, about 83 improvements and bug fixes have been made in Doris 2.0.10 version.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 83 improvements and bug fixes have been made in Doris 2.0.10 version.</p><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvement-and-optimizations">Improvement and Optimizations<a href="#improvement-and-optimizations" class="hash-link" aria-label="Improvement and Optimizations的直接链接" title="Improvement and Optimizations的直接链接"></a></h2><ul><li><p>This enhancement introduces the <code>read_only</code> and <code>super_read_only</code> variables to the database system, ensuring compatibility with MySQL's read-only modes.</p></li><li><p>When the check status is not IO_ERROR, the disk path should not be added to the broken list. This ensures that only disks with actual I/O errors are marked as broken.</p></li><li><p>When performing a Create Table As Select (CTAS) operation from an external table, convert the <code>VARCHAR</code> column to <code>STRING</code> type.</p></li><li><p>Support mapping Paimon column type "ROW" to Doris type "STRUCT"</p></li><li><p>Choose disk tolerate with little skew when creating tablet</p></li><li><p>Write editlog to <code>set replica drop</code> to avoid confusing status on follower FE</p></li><li><p>Make the schema change memory space adaptive to avoid memory over limit</p></li><li><p>Inverted index 'unicode' tokenizer supports configuration to exclude stop words</p></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/compare/2.0.9...2.0.10" target="_blank" rel="noopener noreferrer">GitHub</a> .</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="credits">Credits<a href="#credits" class="hash-link" aria-label="Credits的直接链接" title="Credits的直接链接"></a></h2><p>Thanks to all who contributed to this release</p><p>@airborne12, @BePPPower, @ByteYue, @CalvinKirs, @cambyzju, @csun5285, @dataroaring, @deardeng, @DongLiang-0, @eldenmoon, @felixwluo, @HappenLee, @hubgeter, @jackwener, @kaijchen, @kaka11chen, @Lchangliang, @liaoxin01, @LiBinfeng-01, @luennng, @morningman, @morrySnow, @Mryange, @nextdreamblue, @qidaye, @starocean999, @suxiaogang223, @SWJTU-ZhangLei, @w41ter, @xiaokang, @xy720, @yujun777, @Yukang-Lian, @zhangstar333, @zxealous, @zy-kkk, @zzzxl1993</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Multi-tenant workload isolation: a better balance between isolation and utilization]]></title>
<id>https://doris.apache.org/zh-CN/blog/multi-tenant-workload-isolation-in-apache-doris</id>
<link href="https://doris.apache.org/zh-CN/blog/multi-tenant-workload-isolation-in-apache-doris"/>
<updated>2024-05-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Apache Doris supports workload isolation based on Resource Tag and Workload Group. It provides solutions for different tradeoffs among the level of isolation, resource utilization, and stable performance.]]></summary>
<content type="html"><![CDATA[<p>This is an in-depth introduction to the workload isolation capabilities of <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a>. But first of all, why and when do you need workload isolation? If you relate to any of the following situations, read on and you will end up with a solution:</p><ul><li><p>You have different business departments or tenants sharing the same cluster and you want to prevent the interference of workloads among them.</p></li><li><p>You have query tasks of varying priority levels and you want to give priority to your critical tasks (such as real-time data analytics and online transactions) in terms of resources and execution. </p></li><li><p>You need workload isolation but also want high cost-effectiveness and resource utilization rates.</p></li></ul><p>Apache Doris supports workload isolation based on Resource Tag and Workload Group. Resource Tag isolates the CPU and memory resources for different workloads at the level of backend nodes, while the Workload Group mechanism can further divide the resources within a backend node for higher resource utilization. </p><div class="theme-admonition theme-admonition-tip alert alert--success admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>提示</div><div class="admonitionContent_S0QG"><p><a href="https://www.youtube.com/watch?v=Wd3l5C4k8Ok&amp;t=1s" target="_blank" rel="noopener noreferrer">Demo</a> of using the Workload Manager in Apache Doris to set a CPU soft/hard limit for Workload Groups.</p></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="resource-isolation-based-on-resource-tag">Resource isolation based on Resource Tag<a href="#resource-isolation-based-on-resource-tag" class="hash-link" aria-label="Resource isolation based on Resource Tag的直接链接" title="Resource isolation based on Resource Tag的直接链接"></a></h2><p>Let's begin with the architecture of Apache Doris. Doris has two <a href="https://doris.apache.org/docs/get-starting/what-is-apache-doris#technical-overview" target="_blank" rel="noopener noreferrer">types of nodes</a>: frontends (FEs) and backends (BEs). FE nodes store metadata, manage clusters, process user requests, and parse query plans, while BE nodes are responsible for computation and data storage. Thus, BE nodes are the major resource consumers. </p><p>The main idea of a Resource Tag-based isolation solution is to divide computing resources into groups by assigning tags to BE nodes in a cluster, where BE nodes of the same tag constitute a Resource Group. A Resource Group can be deemed as a unit for data storage and computation. For data ingested into Doris, the system will write data replicas into different Resource Groups according to the configurations. 
Queries are likewise assigned to their corresponding <a href="https://doris.apache.org/docs/admin-manual/resource-admin/multi-tenant#tag-division-and-cpu-limitation-are-new-features-in-version-015-in-order-to-ensure-a-smooth-upgrade-from-the-old-version-doris-has-made-the-following-forward-compatibility" target="_blank" rel="noopener noreferrer">Resource Groups</a> for execution.</p><p>For example, if you want to separate read and write workloads in a 3-BE cluster, you can follow these steps (a SQL sketch follows the list):</p><ol><li><p><strong>Assign Resource Tags to BE nodes</strong>: Bind 2 BEs to the "Read" tag and 1 BE to the "Write" tag.</p></li><li><p><strong>Assign Resource Tags to data replicas</strong>: Assuming that Table 1 has 3 replicas, bind 2 of them to the "Read" tag and 1 to the "Write" tag. Data written into Replica 3 is synchronized to Replica 1 and Replica 2, and the synchronization process consumes only a small amount of resources on BE 1 and BE 2.</p></li><li><p><strong>Assign workloads to Resource Tags</strong>: Queries that carry the "Read" tag in their SQL are automatically routed to the nodes tagged with "Read" (in this case, BE 1 and BE 2). For data writing tasks, you also need to assign the "Write" tag so that they are routed to the corresponding node (BE 3). In this way, there is no resource contention between read and write workloads, except for the data synchronization overhead from Replica 3 to Replicas 1 and 2.</p></li></ol>
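<p>A minimal SQL sketch of the read/write separation above. The host names and the user name are placeholders, and the statements assume a plain 3-BE cluster with no other tags configured:</p><pre><code class="language-sql">-- Step 1: tag the BE nodes (replace host:heartbeat_port with real values)
ALTER SYSTEM MODIFY BACKEND "be1_host:9050" SET ("tag.location" = "read");
ALTER SYSTEM MODIFY BACKEND "be2_host:9050" SET ("tag.location" = "read");
ALTER SYSTEM MODIFY BACKEND "be3_host:9050" SET ("tag.location" = "write");

-- Step 2: keep 2 replicas on "read" nodes and 1 on the "write" node
ALTER TABLE table1 SET ("replication_allocation" = "tag.location.read: 2, tag.location.write: 1");

-- Step 3: route a user's queries to the "read" nodes only
SET PROPERTY FOR 'user_read' 'resource_tags.location' = 'read';
</code></pre>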
<p><img loading="lazy" alt="Resource isolation based on Resource Tag" src="https://cdnd.selectdb.com/zh-CN/assets/images/resource-isolation-based-on-resource-tag-af8fea20b39cee40b735034a21e723d6.PNG" width="1280" height="719" class="img_ev3q"></p><p>Resource Tag also enables <strong>multi-tenancy</strong> in Apache Doris. For example, computing and storage resources tagged with "User A" are for User A only, while those tagged with "User B" are exclusive to User B. This is how Doris implements multi-tenant resource isolation with Resource Tags at the BE side.</p><p><img loading="lazy" alt="Resource isolation based on Resource Tag" src="https://cdnd.selectdb.com/zh-CN/assets/images/resource-isolation-based-on-resource-tag-2-30b41e82f10cf3f14ee9363bdcea06b1.PNG" width="1280" height="609" class="img_ev3q"></p><p>Dividing the BE nodes into groups ensures <strong>a high level of isolation</strong>:</p><ul><li><p>CPU, memory, and I/O of different tenants are physically isolated.</p></li><li><p>One tenant will never be affected by the failures (such as process crashes) of another tenant.</p></li></ul><p>But it has a few downsides:</p><ul><li><p>In read-write separation, when data writing stops, the BE nodes tagged with "Write" become idle. This reduces overall cluster utilization.</p></li><li><p>Under multi-tenancy, if you want to further isolate different workloads of the same tenant by assigning separate BE nodes to each of them, you need to endure significant costs and low resource utilization.</p></li><li><p>The number of tenants is tied to the number of data replicas. So if you have 5 tenants, you will need 5 data replicas. That's huge storage redundancy.</p></li></ul><p><strong>To improve on this, we provided a workload isolation solution based on Workload Group in Apache Doris 2.0.0 and enhanced it in <a href="https://doris.apache.org/blog/release-note-2.1.0" target="_blank" rel="noopener noreferrer">Apache Doris 2.1.0</a>.</strong></p><h2 id="workload-isolation-based-on-workload-group">Workload isolation based on Workload Group</h2><p>The <a href="https://doris.apache.org/docs/admin-manual/resource-admin/workload-group" target="_blank" rel="noopener noreferrer">Workload Group</a>-based solution realizes a more granular division of resources. It further divides CPU and memory resources within the processes on BE nodes, meaning that the queries in one BE node can be isolated from each other to some extent. This avoids resource competition within BE processes and optimizes resource utilization.</p><p>Users can associate queries with Workload Groups, and thus limit the percentage of CPU and memory resources that a query can use. Under high cluster loads, Doris can automatically kill the most resource-consuming queries in a Workload Group. Under low cluster loads, Doris can allow multiple Workload Groups to share idle resources.</p><p>Doris supports both a CPU soft limit and a CPU hard limit. The soft limit allows Workload Groups to break the limit and utilize idle resources, enabling more efficient utilization. The hard limit is a hard guarantee of stable performance, because it prevents Workload Groups from impacting each other.</p><p><em>(The CPU soft limit and the CPU hard limit are mutually exclusive. You can choose between them based on your own use case.)</em></p><p><img loading="lazy" alt="Workload isolation based on Workload Group" src="https://cdnd.selectdb.com/zh-CN/assets/images/workload-isolation-based-on-workload-group-432d800f87b879c1ed7412a4f31e81ca.png" width="1280" height="710" class="img_ev3q"></p><p>Its differences from the Resource Tag-based solution include:</p><ul><li><p>Workload Groups are formed within processes. Multiple Workload Groups compete for resources within the same BE node.</p></li><li><p>The distribution of data replicas is out of the picture, because Workload Group is only a way of managing resources.</p></li></ul><h3 id="cpu-soft-limit">CPU soft limit</h3><p>The CPU soft limit is implemented by the <code>cpu_share</code> parameter, which is conceptually similar to a weight. Workload Groups with a higher <code>cpu_share</code> are allocated more CPU time during a time slot.</p><p>For example, suppose Group A is configured with a <code>cpu_share</code> of 1 and Group B with 9. In a time slot of 10 seconds where both Group A and Group B are fully loaded, they will be able to consume 1s and 9s of CPU time, respectively (a DDL sketch follows).</p>
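<p>A minimal sketch, assuming a 1:9 split like the example above; the group names are made up, and the exact property set varies by Doris version:</p><pre><code class="language-sql">-- Weighted CPU time via cpu_share (soft limit)
CREATE WORKLOAD GROUP IF NOT EXISTS "group_a"
PROPERTIES ("cpu_share" = "1", "memory_limit" = "10%");
CREATE WORKLOAD GROUP IF NOT EXISTS "group_b"
PROPERTIES ("cpu_share" = "9", "memory_limit" = "60%");

-- Bind the current session's queries to a group
SET workload_group = 'group_a';
</code></pre>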
<p>What happens in real-world cases is that not all workloads in the cluster run at full capacity. Under the soft limit, if Group B has a low or zero workload, Group A will be able to use all 10s of CPU time, thus increasing the overall CPU utilization of the cluster.</p><p><img loading="lazy" alt="CPU soft limit" src="https://cdnd.selectdb.com/zh-CN/assets/images/CPU-soft-limit-c549d1cefbc8648c861695b19e405b6a.png" width="1280" height="732" class="img_ev3q"></p><p>A soft limit brings flexibility and a higher resource utilization rate. On the flip side, it might cause performance fluctuations.</p><h3 id="cpu-hard-limit">CPU hard limit</h3><p>The CPU hard limit in Apache Doris 2.1.0 is designed for users who require stable performance. In simple terms, the hard limit means that a Workload Group cannot use more CPU resources than its limit, whether there are idle CPU resources or not.</p><p>This is how it works:</p><p>Suppose that Group A is set with <code>cpu_hard_limit=10%</code> and Group B with <code>cpu_hard_limit=90%</code>. If both Group A and Group B run at full load, they will respectively use 10% and 90% of the overall CPU time. The difference lies in when the workload of Group B decreases: in that case, regardless of how high the query load of Group A is, it cannot use more than the 10% of CPU resources allocated to it.</p><p><img loading="lazy" alt="CPU hard limit" src="https://cdnd.selectdb.com/zh-CN/assets/images/CPU-hard-limit-cff5be57e984d26b7df9039ce69e5018.png" width="1280" height="746" class="img_ev3q"></p><p>As opposed to the soft limit, a hard limit guarantees stable system performance at the cost of flexibility and the possibility of a higher resource utilization rate.</p><h3 id="memory-resource-limit">Memory resource limit</h3><blockquote><p>The memory of a BE node comprises the following parts:</p><ul><li><p>Reserved memory for the operating system.</p></li><li><p>Memory consumed by non-queries, which is not considered in the Workload Group's memory statistics.</p></li><li><p>Memory consumed by queries, including data writing. This can be tracked and controlled by Workload Group.</p></li></ul></blockquote><p>The <code>memory_limit</code> parameter defines the maximum percentage of memory available to a Workload Group within the BE process. It also affects the priority of Resource Groups.</p><p>Initially, a high-priority Resource Group is allocated more memory. By setting <code>enable_memory_overcommit</code>, you can allow Resource Groups to occupy more memory than their limits when there is idle space. When memory is tight, Doris cancels tasks to reclaim the memory resources they committed. In this case, the system retains memory resources for high-priority resource groups as much as possible.</p><div style="text-align:center"><img loading="lazy" src="/images/memory-resource-limit.png" alt="Memory resource limit" style="display:inline-block;width:300px" class="img_ev3q"></div><h3 id="query-queue">Query queue</h3><p>It can happen that the cluster takes on more load than it can handle.
In this case, submitting new query requests is not only fruitless but also disruptive to the queries in progress.</p><p>To improve on this, Apache Doris provides the <a href="https://doris.apache.org/docs/admin-manual/resource-admin/workload-group#query-queue" target="_blank" rel="noopener noreferrer">query queue</a> mechanism. Users can put a limit on the number of queries that can run concurrently in the cluster. A query is rejected when the query queue is full or after a waiting timeout, thus ensuring system stability under high loads.</p><p><img loading="lazy" alt="Query queue" src="https://cdnd.selectdb.com/zh-CN/assets/images/query-queue-863a0ad9b10c79e223d84abe16783f3c.png" width="1280" height="682" class="img_ev3q"></p><p>The query queue mechanism involves three parameters: <code>max_concurrency</code>, <code>max_queue_size</code>, and <code>queue_timeout</code>.</p>
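<p>For illustration, the three queue parameters are set as Workload Group properties; the group name and the values below are arbitrary examples, not recommendations:</p><pre><code class="language-sql">-- At most 10 concurrent queries; up to 20 queued;
-- queued queries time out after 3000 ms
ALTER WORKLOAD GROUP "group_a" PROPERTIES (
    "max_concurrency" = "10",
    "max_queue_size" = "20",
    "queue_timeout" = "3000"
);
</code></pre>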
<h2 id="tests">Tests</h2><p>To demonstrate the effectiveness of the CPU soft limit and hard limit, we ran a few tests.</p><ul><li><p>Environment: a single machine with 16 cores and 64 GB of memory</p></li><li><p>Deployment: 1 FE + 1 BE</p></li><li><p>Dataset: ClickBench, TPC-H</p></li><li><p>Load testing tool: Apache JMeter</p></li></ul><h3 id="cpu-soft-limit-test">CPU soft limit test</h3><p>Start two clients and continuously submit queries (ClickBench Q23), with and without Workload Groups, respectively. Note that the page cache should be disabled to prevent it from affecting the test results.</p><p><img loading="lazy" alt="CPU soft limit test" src="https://cdnd.selectdb.com/zh-CN/assets/images/CPU-soft-limit-test-6398c8f7b4a0be54817a4443c05343a6.png" width="1280" height="374" class="img_ev3q"></p><p>Comparing the throughputs of the two clients in both tests, it can be concluded that:</p><ul><li><p><strong>Without Workload Groups</strong>, the two clients consume CPU resources on an equal basis.</p></li><li><p><strong>With Workload Groups configured</strong> and the <code>cpu_share</code> ratio set to 2:1, the throughput ratio of the two clients is 2:1. With a higher <code>cpu_share</code>, Client 1 is provided a larger portion of CPU resources and delivers higher throughput.</p></li></ul><h3 id="cpu-hard-limit-test">CPU hard limit test</h3><p>Start a client, set <code>cpu_hard_limit=50%</code> for the Workload Group, and execute ClickBench Q23 for 5 minutes under a concurrency level of 1, 2, and 4, respectively.</p><p><img loading="lazy" alt="CPU hard limit test" src="https://cdnd.selectdb.com/zh-CN/assets/images/CPU-hard-limit-test-55edf1e86571dd5256567852fecefa08.png" width="1280" height="838" class="img_ev3q"></p><p>As the query concurrency increases, the CPU utilization rate remains at around 800%, meaning that 8 cores are used. On a 16-core machine, that's <strong>50% utilization</strong>, as expected. In addition, since a CPU hard limit is imposed, the increase in TP99 latency as concurrency rises is also an expected outcome.</p><h2 id="test-in-simulated-production-environment">Test in simulated production environment</h2><p>In real-world usage, users are particularly concerned about query latency rather than just query throughput, since latency is more easily perceptible in user experience. That's why we decided to validate the effectiveness of Workload Group in a simulated production environment.</p><p>We picked out a SQL set consisting of queries that should finish within 1s (ClickBench Q15, Q17, Q23 and TPC-H Q3, Q7, Q19), including single-table aggregations and join queries. The size of the TPC-H dataset is 100 GB.</p><p>Similarly, we conducted tests with and without configuring Workload Groups.</p><p><img loading="lazy" alt="Test in simulated production environment" src="https://cdnd.selectdb.com/zh-CN/assets/images/test-in-simulated-production-environment-a97b3764ec737366e55558d2ffb5f89e.png" width="1280" height="619" class="img_ev3q"></p><p>As the results show:</p><ul><li><p><strong>Without Workload Group</strong> (comparing Tests 1 &amp; 2): When dialing up the concurrency of Client 2, both clients experience a 2-3x increase in query latency.</p></li><li><p><strong>With Workload Group configured</strong> (comparing Tests 3 &amp; 4): As the concurrency of Client 2 goes up, the performance fluctuation in Client 1 is much smaller, which shows how effectively it is protected by workload isolation.</p></li></ul><h2 id="recommendations--plans">Recommendations &amp; plans</h2><p>The Resource Tag-based solution is a thorough workload isolation plan. The Workload Group-based solution realizes a better balance between resource isolation and utilization, and it is complemented by the query queue mechanism as a stability guarantee.</p><p>So which one should you choose for your use case? Here is our recommendation:</p><ul><li><p><strong>Resource Tag</strong>: for use cases where different business lines or departments share the same cluster, so the resources and data are physically isolated for different tenants.</p></li><li><p><strong>Workload Group</strong>: for use cases where one cluster undertakes various query workloads and needs flexible resource allocation.</p></li></ul><p>In future releases, we will keep improving the user experience of the Workload Group and query queue features:</p><ul><li><p>Freeing up memory space by canceling queries is a brutal method. We plan to implement this with disk spilling instead, which will bring higher stability in query performance.</p></li><li><p>Since memory consumed by non-queries in the BE is not included in the Workload Group's memory statistics, users might observe a disparity between BE process memory usage and Workload Group memory usage. We will address this issue to avoid confusion.</p></li><li><p>In the query queue mechanism, cluster load is controlled by setting the maximum query concurrency. We plan to enable dynamic maximum query concurrency based on resource availability at the BE.
This is to create backpressure on the client side and thus improve the availability of Doris when clients keep submitting high loads.</p></li><li><p>The main idea of Resource Tag is to group the BE nodes, while that of Workload Group is to further divide the resources of a single BE node. To grasp these ideas, users first need to learn about the concept of BE nodes in Doris. However, from an operational perspective, users only need to understand the resource consumption percentage of each of their workloads and what priority they should have when the cluster load is saturated. Thus, we will try to figure out a way to flatten the learning curve for users, such as keeping the concept of BE nodes inside a black box.</p></li></ul><p>For further assistance on workload isolation in Apache Doris, join the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[From Presto, Trino, ClickHouse, and Hive to Apache Doris: SQL convertor for easy migration]]></title>
<id>https://doris.apache.org/zh-CN/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration</id>
<link href="https://doris.apache.org/zh-CN/blog/from-presto-trino-clickhouse-and-hive-to-apache-doris-sql-convertor-for-easy-migration"/>
<updated>2024-05-06T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Users can execute queries with their old SQL syntaxes directly in Doris or batch convert their existing SQL statements on the visual SQL conversion interface.]]></summary>
<content type="html"><![CDATA[<p><a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a> is an all-in-one data platform that is capable of real-time reporting, ad-hoc queries, data lakehousing, log management and analysis, and batch data processing. As more and more companies have been replacing their component-heavy data architecture with Apache Doris, there is an increasing need for a more convenient data migration solution. <strong>That's why the Doris SQL Convertor is made.</strong></p><p>Most database systems run their own SQL dialects. Thus, migration between systems often entails modifications of SQL syntaxes. Since SQLs work closely with a company's business logic, in many cases, users have to modify their business logic, too. To reduce the transition pain for users, Apache Doris 2.1 provides the Doris SQL Convertor. It supports the SQL syntaxes of Presto, Trino, Hive, ClickHouse, and PostgreSQL. With it, users can execute queries with their old SQL syntaxes directly in Doris or batch convert their existing SQL statements on the visual interface.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="doris-sql-convertor">Doris SQL Convertor<a href="#doris-sql-convertor" class="hash-link" aria-label="Doris SQL Convertor的直接链接" title="Doris SQL Convertor的直接链接"></a></h2><p>The Doris SQL Convertor requires <strong>zero rewriting</strong> of SQL. Simply <code>set sql_dialect = "trino"</code> in the session variable, then you can execute queries in Doris using Trino SQLs. </p><p>The SQL compatibility of it has been proven by extensive tests. For example, a user tested the Doris SQL Convertor with over 30,000 SQL queries from their production environment. Turned out that the Convertor successfully converted 99.6% of the Trino SQLs and 98% of the ClickHouse SQLs.</p><p>Currently, Presto, Trino, Hive, ClickHouse, and PostgreSQL dialects are supported. We are working to add Teradata, SQL Server, and Snowflake to the list, and consistently increase the compatibility level of each SQL dialect.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="installation--usage">Installation &amp; usage<a href="#installation--usage" class="hash-link" aria-label="Installation &amp; usage的直接链接" title="Installation &amp; usage的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="sql-conversion-service">SQL conversion service<a href="#sql-conversion-service" class="hash-link" aria-label="SQL conversion service的直接链接" title="SQL conversion service的直接链接"></a></h3><ol><li><p><strong>Download</strong> <strong><a href="https://selectdb-doris-1308700295.cos.ap-beijing.myqcloud.com/doris-sql-convertor/doris-sql-convertor-1.0.3-bin-x86.tar.gz" target="_blank" rel="noopener noreferrer">Doris SQL Convertor</a></strong></p></li><li><p>On any frontend (FE) node, start the service using the following command.</p></li></ol><ul><li><p>The SQL conversion service is stateless and can be started or stopped at any time.</p></li><li><p><code>port=5001</code> in the command specifies the service port. 
(You can use any available port.)</p></li><li><p>It is advisable to start a service individually for each FE node.</p></li></ul><pre><code>nohup ./doris-sql-convertor-1.0.1-bin-x86 run --host=0.0.0.0 --port=5001 &amp;
</code></pre><ol start="3"><li><p>Start a Doris cluster <strong>(use Doris 2.1.0 or newer)</strong>.</p></li><li><p>Set the URL for the SQL conversion service in Doris. <code>127.0.0.1:5001</code> in the command represents the IP and port number of the node where the service is deployed.</p></li></ol><pre><code>MySQL&gt; set global sql_converter_service_url = "http://127.0.0.1:5001/api/v1/convert"
</code></pre><p>After deployment, you can execute SQL directly in the command line. You can switch dialects by setting <code>sql_dialect = XXX</code>.
<p>The following examples use the Presto and ClickHouse SQL dialects.</p><ul><li>Presto</li></ul>
<pre><code class="language-sql">mysql&gt; set sql_dialect=presto;
Query OK, 0 rows affected (0.00 sec)

mysql&gt; SELECT cast(start_time as varchar(20)) as col1,
           array_distinct(arr_int) as col2,
           FILTER(arr_str, x -&gt; x LIKE '%World%') as col3,
           to_date(value, '%Y-%m-%d') as col4,
           YEAR(start_time) as col5,
           date_add('month', 1, start_time) as col6,
           REGEXP_EXTRACT_ALL(value, '-.') as col7,
           JSON_EXTRACT('{"id": "33"}', '$.id') as col8,
           element_at(arr_int, 1) as col9,
           date_trunc('day', start_time) as col10
       FROM test_sqlconvert
       WHERE date_trunc('day', start_time) = DATE '2024-05-20'
       ORDER BY id;
+---------------------+-----------+-----------+------------+------+---------------------+-------------+------+------+---------------------+
| col1                | col2      | col3      | col4       | col5 | col6                | col7        | col8 | col9 | col10               |
+---------------------+-----------+-----------+------------+------+---------------------+-------------+------+------+---------------------+
| 2024-05-20 13:14:52 | [1, 2, 3] | ["World"] | 2024-01-14 | 2024 | 2024-06-20 13:14:52 | ['-0','-1'] | "33" |    1 | 2024-05-20 00:00:00 |
+---------------------+-----------+-----------+------------+------+---------------------+-------------+------+------+---------------------+
1 row in set (0.03 sec)</code></pre>
<ul><li>ClickHouse</li></ul>
<pre><code class="language-sql">mysql&gt; set sql_dialect=clickhouse;
Query OK, 0 rows affected (0.00 sec)

mysql&gt; select toString(start_time) as col1,
           arrayCompact(arr_int) as col2,
           arrayFilter(x -&gt; x like '%World%', arr_str) as col3,
           toDate(value) as col4,
           toYear(start_time) as col5,
           addMonths(start_time, 1) as col6,
           extractAll(value, '-.') as col7,
           JSONExtractString('{"id": "33"}', 'id') as col8,
           arrayElement(arr_int, 1) as col9,
           date_trunc('day', start_time) as col10
       FROM test_sqlconvert
       WHERE date_trunc('day', start_time) = '2024-05-20 00:00:00'
       ORDER BY id;
+---------------------+-----------+-----------+------------+------+---------------------+-------------+------+------+---------------------+
| col1                | col2      | col3      | col4       | col5 | col6                | col7        | col8 | col9 | col10               |
+---------------------+-----------+-----------+------------+------+---------------------+-------------+------+------+---------------------+
| 2024-05-20 13:14:52 | [1, 2, 3] | ["World"] | 2024-01-14 | 2024 | 2024-06-20 13:14:52 | ['-0','-1'] | "33" |    1 | 2024-05-20 00:00:00 |
+---------------------+-----------+-----------+------------+------+---------------------+-------------+------+------+---------------------+
1 row in set (0.02 sec)</code></pre>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="visual-interface">Visual interface<a href="#visual-interface" class="hash-link" aria-label="Direct link to Visual interface" title="Direct link to Visual interface"></a></h3><p>For large-scale conversion, it is recommended to use the visual interface, on which you can batch upload files for dialect conversion.</p><p>Follow these steps to deploy the visual conversion interface:</p><ol><li><p>Prepare the environment: Docker and Docker Compose.</p></li><li><p>Get the Doris-SQL-Convertor Docker image.</p></li><li><p>Create a network for the image.</p></li></ol>
<pre><code class="language-bash">docker network create app_network</code></pre>
<ol start="4"><li>Decompress the package.</li></ol>
<pre><code class="language-bash">tar xzvf doris-sql-convertor-1.0.1.tar.gz

cd doris-sql-convertor</code></pre>
<ol start="5"><li>Edit the environment variables.</li></ol>
<pre><code class="language-bash">FLASK_APP=server/app.py
FLASK_DEBUG=1
API_HOST=http://doris-sql-convertor-api:5000

# DOCKER TAG
API_TAG=latest
WEB_TAG=latest</code></pre>
<ol start="6"><li>Start it up.</li></ol>
<pre><code class="language-bash">sh start.sh</code></pre>
<p>After deployment, you can access the service at <code>ip:8080</code> via your local browser. <code>8080</code> is the default port; you can change the mapping port. On the visual interface, you can select the source dialect type and target dialect type, and then click "Convert".</p><div class="theme-admonition theme-admonition-info alert alert--info admonition_LlT9"><div class="admonitionHeading_tbUL">Note</div><div class="admonitionContent_S0QG"><ol><li><p>For batch conversion, each SQL statement should end with <code>;</code>.</p></li><li><p>The Doris SQL Convertor supports 239 <code>UNION ALL</code> conversions at most.</p></li></ol></div></div><p>Join the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a> to seek guidance from the Doris makers or provide your feedback!</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Cross-cluster replication for read-write separation: story of a grocery store brand]]></title>
<id>https://doris.apache.org/zh-CN/blog/cross-cluster-replication-for-read-write</id>
<link href="https://doris.apache.org/zh-CN/blog/cross-cluster-replication-for-read-write"/>
<updated>2024-04-25T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Cross-cluster replication (CCR) in Apache Doris is proven to be fast, stable, and easy to use. It secures a real-time data synchronization latency of 1 second.]]></summary>
<content type="html"><![CDATA[<p>This is about how a grocery store brand leverages the <a href="https://doris.apache.org/docs/2.0/admin-manual/data-admin/ccr" target="_blank" rel="noopener noreferrer">Cross-Cluster Replication (CCR)</a> capability of Apache Doris to separate their data reading and writing workloads. In this case, where the freshness of groceries is guaranteed by the freshness of data, they use Apache Doris as their data warehouse to monitor and analyze their procurement, sale, and stock in real time for all their stores and supply chains. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-they-need-ccr">Why they need CCR<a href="#why-they-need-ccr" class="hash-link" aria-label="Why they need CCR的直接链接" title="Why they need CCR的直接链接"></a></h2><p>A major part of the user's data warehouse (including the ODS, DWD, DWS, and ADS layers) is built within Apache Doris, which employs a micro-batch scheduling mechanism to coordinate data across the data warehouse layers. However, this is pressured by the burgeoning business of the grocery store brand. The data size they have to receive, store, and analyze is getting larger and larger. That means their data warehouse has to handle bigger data writing batches and more frequent data queries. However, task scheduling during query execution might lead to resource preemption, so any resource shortage can easily compromise performance or even cause task failure or system disruption.</p><p> Naturally, the user thought of <strong>separating the reading and writing workloads.</strong> Specifically, they want to replicate data from the ADS layer (which is cleaned, transformed, aggregated, and ready to be queried) to a backup cluster dedicated to query services. <strong>This is implemented by the CCR in Apache Doris.</strong> It prevents abnormal queries from interrupting data writing and ensures cluster stability. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="before-ccr">Before CCR<a href="#before-ccr" class="hash-link" aria-label="Before CCR的直接链接" title="Before CCR的直接链接"></a></h2><p>Before CCR was available, they innovatively adopted the <a href="https://doris.apache.org/docs/2.0/lakehouse/lakehouse-overview#multi-catalog" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> feature of Doris for the same purpose. Multi-Catalog allows users to connect Doris to various data sources conveniently. It is actually designed for federated querying, but the user drew inspiration from it. They wrote a script and tried to pull incremental data via Catalog. Their data synchronization pipeline is as follows:</p><p><img loading="lazy" alt="Before CCR" src="https://cdnd.selectdb.com/zh-CN/assets/images/before-ccr-079a13cf3fe218976cce0015a6c6c752.jpeg" width="1280" height="1003" class="img_ev3q"></p><p>They loaded data from the source cluster to the target cluster by regular scheduling tasks. To identify incremental data, they added a <code>last_update_time</code> field to the tables. There were two downsides to this. Firstly, the data freshness of the target cluster was reliant on and hindered by the scheduling tasks. Secondly, for incremental data ingestion, in order to identify incremental data, the import SQL statement for every table has to include the logic to check the <code>last_update_time</code> field, otherwise the system just deletes and re-imports the entire table. Such requirement increases development complexity and data error rate. 
</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ccr-in-apache-doris">CCR in Apache Doris<a href="#ccr-in-apache-doris" class="hash-link" aria-label="CCR in Apache Doris的直接链接" title="CCR in Apache Doris的直接链接"></a></h2><p>Just when they were looking for a better solution, Apache Doris released CCR in version 2.0. Compared to the alternatives they've tried, CCR in Apache Doris is:</p><ul><li><p><strong>Lightweight in design</strong>: The data synchronization tasks consume very few machine resources. They run smoothly without reducing the overall performance of Apache Doris.</p></li><li><p><strong>Easy to use</strong>: It can be configured by one simple <code>POST</code> request.</p></li><li><p><strong>Unlimited in migration</strong>: Users can raise the upper limit of the data migration capabilities in CCR by optimizing their cluster configuration. </p></li><li><p><strong>Consistent in data</strong>: The DDL statements executed in the source cluster can be automatically synchronized into the target cluster, ensuring data consistency.</p></li><li><p><strong>Flexible in synchronization</strong>: It is able to perform both full data synchronization and incremental data synchronization.</p></li></ul><p>To start CCR in Doris simply requires two steps. Step one is to enable binlogs in both the source cluster and the target cluster. Step two is to send the name of the database or table to be replicated. Then the system will start synchronizing full or incremental data. The detailed workflow is as follows: </p><p><img loading="lazy" alt="CCR in Apache Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/ccr-in-apache-doris-31b9554f59ba15f637a5c54778915973.jpeg" width="1280" height="335" class="img_ev3q"></p><p>In the grocery store brand's case, they need to synchronize a few tables from the source cluster to the target cluster, each table having an incremental data size of about 50 million rows. After a month's trial run, the Doris CCR mechanism is proven to be stable and performant:</p><ul><li><p><strong>Higher stability and data accuracy</strong>: No replication failure has ever occurred during the trial period. Every data row is transferred and landed in the target cluster accurately. 
<p>The detailed workflow is as follows:</p><p><img loading="lazy" alt="CCR in Apache Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/ccr-in-apache-doris-31b9554f59ba15f637a5c54778915973.jpeg" width="1280" height="335" class="img_ev3q"></p><p>In the grocery store brand's case, they needed to synchronize a few tables from the source cluster to the target cluster, each table having an incremental data size of about 50 million rows. After a month's trial run, the Doris CCR mechanism proved to be stable and performant:</p><ul><li><p><strong>Higher stability and data accuracy</strong>: No replication failure occurred during the trial period, and every data row was transferred and landed in the target cluster accurately.</p></li><li><p><strong>Streamlined workflows:</strong></p><ul><li><strong>Before CCR</strong>: The user had to write SQL for each table and load data via Catalog. For tables without a <code>last_update_time</code> field, incremental synchronization could only be implemented by full-table deletion and re-import.</li></ul>
<pre><code class="language-sql">Insert into catalog1.db.destination_table_1 select * from catalog1.db.source_table1 where time &gt; xxx
Insert into catalog1.db.destination_table_2 select * from catalog1.db.source_table2 where time &gt; xxx

Insert into catalog1.db.destination_table_x select * from catalog1.db.source_table_x</code></pre>
<ul><li><strong>After CCR</strong>: It only requires one <code>POST</code> request to synchronize an entire database.</li></ul>
<pre><code class="language-bash">curl -X POST -H "Content-Type: application/json" -d '{
    "name": "ccr_test",
    "src": {
      "host": "localhost",
      "port": "9030",
      "thrift_port": "9020",
      "user": "root",
      "password": "",
      "database": "demo",
      "table": ""
    },
    "dest": {
      "host": "localhost",
      "port": "9030",
      "thrift_port": "9020",
      "user": "root",
      "password": "",
      "database": "ccrt",
      "table": ""
    }
}' http://127.0.0.1:9190/create_ccr</code></pre>
</li><li><p><strong>Faster data loading</strong>: With CCR, it takes only <strong>3~4 seconds</strong> to ingest a day's incremental data, compared to more than 30 seconds with the Catalog method. For real-time synchronization, CCR can finish data ingestion in 1 second, without reliance on manual updates or regular scheduling.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion"></a></h2><p>Using CCR in Apache Doris, the grocery store brand separates reading and writing workloads into different clusters and thus improves overall system stability. The solution delivers a real-time data synchronization latency of about 1 second. To further ensure normal functioning, it includes a real-time monitoring and alerting mechanism, so that any issue triggers an alert and is attended to instantly, as well as a contingency plan to guarantee uninterrupted query services. It also supports partition-based data synchronization (e.g. <code>ALTER TABLE tbl1 REPLACE PARTITION</code>). With the demonstrated effectiveness of CCR, they are planning to replicate more of their data assets for efficient and secure data usage.</p><p>CCR is also applicable when you need to build multiple data centers or derive a test dataset from your production environment. For further guidance on CCR, join the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris version 2.0.9 has been released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.9</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.9"/>
<updated>2024-04-23T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, about 68 improvements and bug fixes have been made in Doris 2.0.9 version.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 68 improvements and bug fixes have been made in Doris 2.0.9 version.</p><ul><li><p><strong>Quick Download</strong> : <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p></li><li><p><strong>GitHub</strong> : <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="1-behavior-change">1 Behavior change<a href="#1-behavior-change" class="hash-link" aria-label="1 Behavior change的直接链接" title="1 Behavior change的直接链接"></a></h2><p>NA</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="2-new-features">2 New features<a href="#2-new-features" class="hash-link" aria-label="2 New features的直接链接" title="2 New features的直接链接"></a></h2><ul><li><p>Support predicate apprear both on key and value mv column</p></li><li><p>Support mv with <code>bitmap_union(bitmap_from_array())</code></p></li><li><p>Add a FE config to force replicate allocation for OLAP tables in the cluster</p></li><li><p>Support date literal support timezone in new optimizer Nereids</p></li><li><p>Support slop in fulltext search <code>match_phrase</code> to specify word distence</p></li><li><p>Show index id in <code>SHOW PROC INDEXES</code></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="3-improvement-and-optimizations">3 Improvement and optimizations<a href="#3-improvement-and-optimizations" class="hash-link" aria-label="3 Improvement and optimizations的直接链接" title="3 Improvement and optimizations的直接链接"></a></h2><ul><li><p>Sdd a secondary argument in <code>first_value</code> / <code>last_value</code> to ignore NULL values</p></li><li><p>the offset params in <code>LEAD</code>/ <code>LAG</code> function could use 0</p></li><li><p>Adjust priority of materialized view match rule</p></li><li><p>TopN opt reads only limit number of records for better performance</p></li><li><p>Add profile for delete_bitmap get_agg function</p></li><li><p>Refine the Meta cache to get better performance</p></li><li><p>Add FE config <code>autobucket_max_buckets</code></p></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/compare/2.0.8...2.0.9" target="_blank" rel="noopener noreferrer">GitHub</a> .</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Arrow Flight SQL for 10X faster data transfer]]></title>
<id>https://doris.apache.org/zh-CN/blog/arrow-flight-sql-in-apache-doris-for-10x-faster-data-transfer</id>
<link href="https://doris.apache.org/zh-CN/blog/arrow-flight-sql-in-apache-doris-for-10x-faster-data-transfer"/>
<updated>2024-04-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Apache Doris 2.1 supports the Arrow Flight SQL protocol for reading data from Doris. It delivers tens-fold speedups compared to PyMySQL and Pandas.]]></summary>
<content type="html"><![CDATA[<p>For years, JDBC and ODBC have been commonly adopted norms for database interaction. Now, as we gaze upon the vast expanse of the data realm, the rise of data science and data lake analytics brings bigger and bigger datasets. Correspondingly, we need faster and faster data reading and transmission, so we started looking for better answers than JDBC and ODBC. Thus, we have included the <strong>Arrow Flight SQL protocol</strong> in <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris 2.1</a>, which provides <strong>tens-fold speedups for data transfer</strong>. </p><div class="theme-admonition theme-admonition-tip alert alert--success admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Tip</div><div class="admonitionContent_S0QG"><p>A <a href="https://www.youtube.com/watch?v=zIqy24gI8DE" target="_blank" rel="noopener noreferrer">demo</a> of loading data from Apache Doris to Python using Arrow Flight SQL.</p></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="high-speed-data-transfer-based-on-arrow-flight-sql">High-speed data transfer based on Arrow Flight SQL<a href="#high-speed-data-transfer-based-on-arrow-flight-sql" class="hash-link" aria-label="High-speed data transfer based on Arrow Flight SQL的直接链接" title="High-speed data transfer based on Arrow Flight SQL的直接链接"></a></h2><p>As a column-oriented data warehouse, Apache Doris arranges its query results in the form of data Blocks in a columnar format. Before version 2.1, the Blocks had to be serialized into bytes in row-oriented formats before they could be transferred to a target client via a MySQL client or JDBC/ODBC driver. Moreover, if the target client is a columnar database or a column-oriented data science component like Pandas, the data must then be de-serialized. The serialization-deserialization process is a speed bump for data transmission.</p><p>Apache Doris 2.1 has a data transmission channel built on <a href="https://arrow.apache.org/docs/format/FlightSql.html" target="_blank" rel="noopener noreferrer">Arrow Flight SQL</a>. (<a href="https://arrow.apache.org/" target="_blank" rel="noopener noreferrer">Apache Arrow</a> is a software development platform designed for high data movement efficiency across systems and languages, and the Arrow format aims for high-performance, lossless data exchange.) It allows <strong>high-speed, large-scale data reading from Doris via SQL in various mainstream programming languages</strong>. For target clients that also support the Arrow format, the whole process is free of serialization/deserialization, so there is no performance loss. Another upside is that Arrow Flight can make full use of multi-node and multi-core architectures to implement parallel data transfer, which further enables high data throughput.</p><p>For example, if a Python client reads data from Apache Doris, Doris will first convert the column-oriented Blocks to Arrow RecordBatch. 
Then in the Python client, the Arrow RecordBatch will be converted to a Pandas DataFrame. Both conversions are fast because the Doris Blocks, Arrow RecordBatch, and Pandas DataFrame are all column-oriented. </p><p><img loading="lazy" alt="img" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-speed-data-transfer-based-on-doris-arrow-flight-sql-c51538bca23f1062d141adab8fe055cb.png" width="1280" height="647" class="img_ev3q"></p><p>In addition, Arrow Flight SQL provides a general JDBC driver to facilitate seamless communication between databases that support the Arrow Flight SQL protocol. This unlocks the potential of Doris to be connected to a wider ecosystem and to be used in more cases. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="performance-test">Performance test<a href="#performance-test" class="hash-link" aria-label="Performance test的直接链接" title="Performance test的直接链接"></a></h2><p>The "tens-fold speedups" conclusion is based on our benchmark tests. We tried reading data from Doris using PyMySQL, Pandas, and Arrow Flight SQL, and recorded the durations of each. The test data is the ClickBench dataset.</p><p><img loading="lazy" alt="Performance test" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-performance-test-a0ccb1f783b2f85210c63d8aa961f649.png" width="1980" height="1062" class="img_ev3q"></p><p>Results on various data types are as follows: </p><p><img loading="lazy" alt="Performance test results" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-performance-test-2-8c60cb82710df9d37e6707593830da6c.png" width="1280" height="481" class="img_ev3q"></p><p><strong>As shown, Arrow Flight SQL outperforms PyMySQL and Pandas on all data types by a factor ranging from 20 to several hundred</strong>. </p><p><img loading="lazy" alt="Arrow Flight SQL outperforms PyMySQL and Pandas" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-performance-test-3-b48f9e4bdec4a27877fcddbb33e6375a.png" width="1280" height="502" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="usage">Usage<a href="#usage" class="hash-link" aria-label="Usage的直接链接" title="Usage的直接链接"></a></h2><p>With support for Arrow Flight SQL, Apache Doris can leverage the Python ADBC Driver for fast data reading. I will showcase a few frequently executed database operations using the Python ADBC Driver (requires Python 3.9 or later), including DDL, DML, session variable setting, and <code>show</code> statements.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01--install-library">01 Install library<a href="#01--install-library" class="hash-link" aria-label="01 Install library的直接链接" title="01 Install library的直接链接"></a></h3><p>The relevant library is already published on PyPI. 
It can be installed simply as follows: </p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">pip install adbc_driver_manager</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">pip install adbc_driver_flightsql</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Import the following module/library to interact with the installed library: </p><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">import adbc_driver_manager</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import adbc_driver_flightsql.dbapi as flight_sql</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02--connect-to-doris">02 Connect to Doris<a href="#02--connect-to-doris" class="hash-link" aria-label="02 Connect to Doris的直接链接" title="02 Connect to Doris的直接链接"></a></h3><p>Create a client for interacting with the Doris Arrow Flight SQL service. 
Prerequisites include: Doris frontend (FE) host, Arrow Flight port, and login username/password.</p><p>Configure parameters for Doris frontend (FE) and backend (BE):</p><ul><li><p>In <code>fe/conf/fe.conf</code>, set <code>arrow_flight_sql_port </code> to an available port, such as 9090.</p></li><li><p>In <code>be/conf/be.conf</code>, set <code>arrow_flight_port </code> to an available port, such as 9091.</p></li></ul><p>Suppose that the Arrow Flight SQL services for the Doris instance will run on ports 9090 and 9091 for FE and BE respectively, and the Doris username/password is "user" and "pass", the connection process would be:</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">conn = flight_sql.connect(uri="grpc://127.0.0.1:9090", db_kwargs={</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> adbc_driver_manager.DatabaseOptions.USERNAME.value: "user",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> adbc_driver_manager.DatabaseOptions.PASSWORD.value: "pass",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> })</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor = conn.cursor()</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Once the connection is established, you can interact with Doris using SQL statements through the returned cursor object. 
This allows you to perform various operations such as table creation, metadata retrieval, data import, and query execution.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="03--create-table-and-retrieve-metadata">03 Create table and retrieve metadata<a href="#03--create-table-and-retrieve-metadata" class="hash-link" aria-label="03 Create table and retrieve metadata的直接链接" title="03 Create table and retrieve metadata的直接链接"></a></h3><p>Pass the query to the <code>cursor.execute()</code> function, which creates tables and retrieves metadata.</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("DROP DATABASE IF EXISTS arrow_flight_sql FORCE;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("create database arrow_flight_sql;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("show databases;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("use arrow_flight_sql;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("""CREATE TABLE arrow_flight_sql_test</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k0 INT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k1 DOUBLE,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> K2 varchar(32) NULL DEFAULT "" COMMENT "",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k3 DECIMAL(27,9) DEFAULT "0",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k4 BIGINT NULL DEFAULT '10',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k5 DATE,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> DISTRIBUTED BY HASH(k5) BUCKETS 5</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> PROPERTIES("replication_num" = "1");""")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("show create table arrow_flight_sql_test;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If the returned <code>StatusResult</code> is 0, that means the query is executed successfully. (Such design is to ensure compatibility with JDBC.)</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain"> StatusResult</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> StatusResult</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> Database</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 __internal_schema</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 arrow_flight_sql</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">.. 
...</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">507 udf_auth_db</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">[508 rows x 1 columns]</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> StatusResult</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> StatusResult</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> Table Create Table</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 arrow_flight_sql_test CREATE TABLE `arrow_flight_sql_test` (\n `k0`...</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="04--ingest-data">04 Ingest data<a href="#04--ingest-data" class="hash-link" aria-label="04 Ingest data的直接链接" title="04 Ingest data的直接链接"></a></h3><p>Execute an INSERT INTO statement to load test data into the table created:</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("""INSERT INTO arrow_flight_sql_test VALUES</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('0', 0.1, "ID", 0.0001, 9999999999, '2023-10-21'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('1', 0.20, "ID_1", 1.00000001, 0, '2023-10-21'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('2', 3.4, "ID_1", 3.1, 123456, '2023-10-22'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('3', 4, "ID", 4, 4, '2023-10-22'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('4', 122345.54321, "ID", 122345.54321, 5, '2023-10-22');""")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" 
aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you see the following returned result, the data ingestion is successful.</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain"> StatusResult</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If the data size to ingest is huge, you can apply the Stream Load method using <a href="https://pypi.org/project/pydoris/" target="_blank" rel="noopener noreferrer">pydoris</a>.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="05--execute-queries">05 Execute queries<a href="#05--execute-queries" class="hash-link" aria-label="05 Execute queries的直接链接" title="05 Execute queries的直接链接"></a></h3><p>Perform queries on the above table, such as aggregation, sorting, and session variable setting.</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("select * from arrow_flight_sql_test order by k0;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("set exec_mem_limit=2000;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("show variables like \"%exec_mem_limit%\";")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.execute("select k5, sum(k1), count(1), avg(k3) from arrow_flight_sql_test group by k5;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">print(cursor.fetchallarrow().to_pandas())</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The results are as follows:</p><div class="language-C++ codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-C++ codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k0 k1 K2 k3 k4 k5</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0 0.10000 ID 0.000100000 9999999999 2023-10-21</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 1 0.20000 ID_1 1.000000010 0 2023-10-21</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">2 2 3.40000 ID_1 3.100000000 123456 2023-10-22</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">3 3 4.00000 ID 4.000000000 4 2023-10-22</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">4 4 122345.54321 ID 122345.543210000 5 2023-10-22</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">[5 rows x 6 columns]</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> StatusResult</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> Variable_name Value Default_Value Changed</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 exec_mem_limit 2000 2147483648 1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k5 Nullable(Float64)_1 Int64_2 Nullable(Decimal(38, 9))_3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 2023-10-22 122352.94321 3 40784.214403333</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 2023-10-21 0.30000 2 0.500050005</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">[2 rows x 5 columns]</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="06--complete-code">06 Complete code<a href="#06--complete-code" class="hash-link" aria-label="06 Complete code的直接链接" title="06 Complete code的直接链接"></a></h3><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain"># Doris Arrow Flight SQL Test</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># step 1, library is released on PyPI and can be easily installed.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># pip install adbc_driver_manager</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># pip install adbc_driver_flightsql</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import adbc_driver_manager</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import adbc_driver_flightsql.dbapi as flight_sql</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># step 2, create a client that interacts with the Doris Arrow Flight SQL service.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># Modify arrow_flight_sql_port in fe/conf/fe.conf to an available port, such as 9090.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># Modify arrow_flight_port in be/conf/be.conf to an available port, such as 9091.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">conn = flight_sql.connect(uri="grpc://127.0.0.1:9090", db_kwargs={</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> adbc_driver_manager.DatabaseOptions.USERNAME.value: "root",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> adbc_driver_manager.DatabaseOptions.PASSWORD.value: "",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> })</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor = conn.cursor()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># interacting with Doris via SQL using Cursor</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">def execute(sql):</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print("\n### execute query: ###\n " + sql)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor.execute(sql)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print("### result: ###")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print(cursor.fetchallarrow().to_pandas())</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># step3, execute DDL statements, create database/table, show stmt.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("DROP DATABASE IF EXISTS arrow_flight_sql FORCE;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("show databases;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("create database arrow_flight_sql;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("show databases;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("use arrow_flight_sql;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("""CREATE TABLE arrow_flight_sql_test</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k0 INT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k1 DOUBLE,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> K2 varchar(32) NULL DEFAULT "" COMMENT "",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k3 DECIMAL(27,9) DEFAULT "0",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k4 BIGINT NULL DEFAULT '10',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k5 DATE,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> DISTRIBUTED BY HASH(k5) BUCKETS 5</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> PROPERTIES("replication_num" = "1");""")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("show create table arrow_flight_sql_test;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># step4, insert into</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("""INSERT INTO 
arrow_flight_sql_test VALUES</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('0', 0.1, "ID", 0.0001, 9999999999, '2023-10-21'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('1', 0.20, "ID_1", 1.00000001, 0, '2023-10-21'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('2', 3.4, "ID_1", 3.1, 123456, '2023-10-22'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('3', 4, "ID", 4, 4, '2023-10-22'),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('4', 122345.54321, "ID", 122345.54321, 5, '2023-10-22');""")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># step5, execute queries, aggregation, sort, set session variable</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("select * from arrow_flight_sql_test order by k0;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("set exec_mem_limit=2000;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("show variables like \"%exec_mem_limit%\";")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">execute("select k5, sum(k1), count(1), avg(k3) from arrow_flight_sql_test group by k5;")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># step6, close cursor </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">cursor.close()</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="examples-of-data-transmission-at-scale">Examples of data transmission at scale<a href="#examples-of-data-transmission-at-scale" class="hash-link" aria-label="Examples of data transmission at scale的直接链接" title="Examples of data transmission at scale的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01--python">01 Python<a href="#01--python" class="hash-link" aria-label="01 Python的直接链接" title="01 Python的直接链接"></a></h3><p>In Python, after connecting to Doris using the ADBC Driver, you can use various ADBC APIs to load the Clickbench dataset from Doris into Python. 
Here's how:</p><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">#!/usr/bin/env python</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># -*- coding: utf-8 -*-</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import adbc_driver_manager</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import adbc_driver_flightsql.dbapi as flight_sql</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import pandas</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from datetime import datetime</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">my_uri = "grpc://0.0.0.0:`fe.conf_arrow_flight_port`"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">my_db_kwargs = {</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> adbc_driver_manager.DatabaseOptions.USERNAME.value: "root",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> adbc_driver_manager.DatabaseOptions.PASSWORD.value: "",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">sql = "select * from clickbench.hits limit 1000000;"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># PEP 249 (DB-API 2.0) API wrapper for the ADBC Driver Manager.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">def dbapi_adbc_execute_fetchallarrow():</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> conn = flight_sql.connect(uri=my_uri, db_kwargs=my_db_kwargs)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor = conn.cursor()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> start_time = datetime.now()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor.execute(sql)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> arrow_data = cursor.fetchallarrow()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> dataframe = arrow_data.to_pandas()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print("\n##################\n dbapi_adbc_execute_fetchallarrow" + ", cost:" + str(datetime.now() - start_time) + ", bytes:" + str(arrow_data.nbytes) + ", len(arrow_data):" + str(len(arrow_data)))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 
print(dataframe.info(memory_usage='deep'))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print(dataframe)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># ADBC reads data into pandas dataframe, which is faster than fetchallarrow first and then to_pandas.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">def dbapi_adbc_execute_fetch_df():</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> conn = flight_sql.connect(uri=my_uri, db_kwargs=my_db_kwargs)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor = conn.cursor()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> start_time = datetime.now()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor.execute(sql)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> dataframe = cursor.fetch_df() </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print("\n##################\n dbapi_adbc_execute_fetch_df" + ", cost:" + str(datetime.now() - start_time))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print(dataframe.info(memory_usage='deep'))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print(dataframe)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"># Can read multiple partitions in parallel.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">def dbapi_adbc_execute_partitions():</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> conn = flight_sql.connect(uri=my_uri, db_kwargs=my_db_kwargs)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor = conn.cursor()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> start_time = datetime.now()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partitions, schema = cursor.adbc_execute_partitions(sql)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cursor.adbc_read_partition(partitions[0])</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> arrow_data = cursor.fetchallarrow()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> dataframe = arrow_data.to_pandas()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print("\n##################\n dbapi_adbc_execute_partitions" + ", cost:" + str(datetime.now() - start_time) + ", len(partitions):" + str(len(partitions)))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print(dataframe.info(memory_usage='deep'))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> print(dataframe)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">dbapi_adbc_execute_fetchallarrow()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">dbapi_adbc_execute_fetch_df()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">dbapi_adbc_execute_partitions()</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Results are as follows (omitting the repeated outputs). <strong>It only takes 3s</strong> to load a Clickbench dataset containing 1 million rows and 105 columns. </p><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">##################</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> dbapi_adbc_execute_fetchallarrow, cost:0:00:03.548080, bytes:784372793, len(arrow_data):1000000</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">&lt;class 'pandas.core.frame.DataFrame'&gt;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">RangeIndex: 1000000 entries, 0 to 999999</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Columns: 105 entries, CounterID to CLID</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">dtypes: int16(48), int32(19), int64(6), object(32)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">memory usage: 2.4 GB</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">None</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> CounterID EventDate UserID EventTime WatchID JavaEnable Title GoodEvent ... UTMCampaign UTMContent UTMTerm FromTag HasGCLID RefererHash URLHash CLID</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">0 245620 2013-07-09 2178958239546411410 2013-07-09 19:30:27 8302242799508478680 1 OWAProfessionov — Мой Круг (СВАО Интернет-магазин 1 ... 0 -7861356476484644683 -2933046165847566158 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">999999 1095 2013-07-03 4224919145474070397 2013-07-03 14:36:17 6301487284302774604 0 @дневники Sinatra (ЛАДА, цена для деталли кто ... 1 ... 
0 -296158784638538920 1335027772388499430 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">[1000000 rows x 105 columns]</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">##################</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> dbapi_adbc_execute_fetch_df, cost:0:00:03.611664</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">##################</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> dbapi_adbc_execute_partitions, cost:0:00:03.483436, len(partitions):1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">##################</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> low_level_api_execute_query, cost:0:00:03.523598, stream.address:139992182177600, rows:-1, bytes:784322926, len(arrow_data):1000000</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">##################</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> low_level_api_execute_partitions, cost:0:00:03.738128streams.size:3, 1, -1</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02--jdbc">02 JDBC<a href="#02--jdbc" class="hash-link" aria-label="02 JDBC的直接链接" title="02 JDBC的直接链接"></a></h3><p>The open-source JDBC driver for the Arrow Flight SQL protocol provides compatibility with the standard JDBC API. It allows most BI tools to access Doris via JDBC and supports high-speed transfer of Apache Arrow data. </p><p>Usage of this driver is similar to using that for the MySQL protocol. You just need to replace <code>jdbc:mysql</code> in the connection URL with <code>jdbc:arrow-flight-sql</code>. The returned result will be in the JDBC ResultSet data structure. 
</p><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">import java.sql.Connection;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import java.sql.DriverManager;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import java.sql.ResultSet;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">import java.sql.Statement;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Class.forName("org.apache.arrow.driver.jdbc.ArrowFlightJdbcDriver");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">String DB_URL = "jdbc:arrow-flight-sql://0.0.0.0:9090?useServerPrepStmts=false"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> + "&amp;cachePrepStmts=true&amp;useSSL=false&amp;useEncryption=false";</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">String USER = "root";</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">String PASS = "";</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Connection conn = DriverManager.getConnection(DB_URL, USER, PASS);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Statement stmt = conn.createStatement();</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ResultSet resultSet = stmt.executeQuery("show tables;");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">while (resultSet.next()) {</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> String col1 = resultSet.getString(1);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> System.out.println(col1);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">resultSet.close();</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">stmt.close();</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">conn.close();</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" 
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="03--java">03 JAVA<a href="#03--java" class="hash-link" aria-label="03 JAVA的直接链接" title="03 JAVA的直接链接"></a></h3><p>Similar to that with Python, you can directly create an ADBC client with JAVA to read data from Doris. Firstly, you need to obtain the FlightInfo. Then, you connect to each endpoint to pull the data.</p><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">// method one</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">AdbcStatement stmt = connection.createStatement()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">stmt.setSqlQuery("SELECT * FROM " + tableName)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// executeQuery, two steps:</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// 1. Execute Query and get returned FlightInfo;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// 2. Create FlightInfoReader to sequentially traverse each Endpoint;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">QueryResult queryResult = stmt.executeQuery()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// method two</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">AdbcStatement stmt = connection.createStatement()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">stmt.setSqlQuery("SELECT * FROM " + tableName)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// Execute Query and parse each Endpoint in FlightInfo, and use the Location and Ticket to construct a PartitionDescriptor</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">partitionResult = stmt.executePartitioned();</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">partitionResult.getPartitionDescriptors()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">//Create ArrowReader for each PartitionDescriptor to read data</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ArrowReader reader = connection2.readPartition(partitionResult.getPartitionDescriptors().get(0).getDescriptor()))</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 
24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="04--spark">04 Spark<a href="#04--spark" class="hash-link" aria-label="04 Spark的直接链接" title="04 Spark的直接链接"></a></h3><p>For Spark users, apart from connecting to Flight SQL Server using JDBC and JAVA, you can apply the <a href="https://github.com/qwshen/spark-flight-connector" target="_blank" rel="noopener noreferrer">Spark-Flight-Connector</a>, which enables Spark to act as a client for reading and writing data from/to a Flight SQL Server. This is made possible by the fast data conversion between the Arrow format and the Block in Apache Doris, which is <strong>10 times faster than the conversion between CSV and Block</strong>. Moreover, the Arrow data format provides more comprehensive and robust support for complex data types such as Map and Array.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="hop-on-the-trend-train">Hop on the trend train<a href="#hop-on-the-trend-train" class="hash-link" aria-label="Hop on the trend train的直接链接" title="Hop on the trend train的直接链接"></a></h2><p>A number of enterprise users of Doris has tried loading data from Doris to Python, Spark, and Flink using Arrow Flight SQL and enjoyed much faster data reading speed. In the future, we plan to include the support for Arrow Flight SQL in data writing, too. By then, most systems built with mainstream programming languages will be able to read and write data from/to Apache Doris by an ADBC client. That's high-speed data interaction which opens up numerous possibilities. On our to-do list, we also envision leveraging Arrow Flight to implement parallel data reading by multiple backends and facilitate federated queries across Doris and Spark. </p><p>Download <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">Apache Doris 2.1</a> and get a taste of 100 times faster data transfer powered by Arrow Flight SQL. If you need assistance, come find us in the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris developer and user community</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.1.2 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.1.2</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.1.2"/>
<updated>2024-04-12T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 2.1.2 has been officially released on April 12, 2024. This version submits several enhancements and bug fixes to further improve the performance and stability.]]></summary>
<content type="html"><![CDATA[<p>Dear community, Apache Doris 2.1.2 has been officially released on April 12, 2024. This version submits several enhancements and bug fixes to further improve the performance and stability.</p><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub Release:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior Changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior Changed的直接链接" title="Behavior Changed的直接链接"></a></h2><ol><li>Set the default value of the <code>data_consistence</code> property of EXPORT to partition to make export more stable during load. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32830" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32830</a></li></ul><ol start="2"><li><p>Some of MySQL Connector (eg, dotnet MySQL.Data) rely on variable's column type to make connection.</p><p>eg, select @<a href="/zh-CN/blog/[@autocommit](https://github.com/autocommit)">@autocommit</a> should with column type BIGINT, not BIT, otherwise it will throw error. So we change column type of @<a href="https://github.com/autocommit" target="_blank" rel="noopener noreferrer">@autocommit</a> to BIGINT. </p></li></ol><ul><li><a href="https://github.com/apache/doris/pull/33282" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33282</a></li></ul><ol start="3"><li>Auto Partition syntax changes, see <a href="https://doris.apache.org/zh-CN/docs/table-design/data-partition#%E8%87%AA%E5%8A%A8%E5%88%86%E5%8C%BA" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/table-design/data-partition#%E8%87%AA%E5%8A%A8%E5%88%86%E5%8C%BA</a></li></ol><ul><li><a href="https://github.com/apache/doris/pull/32737" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32737</a></li></ul><ol start="4"><li>Auto Partition prohibits the simultaneous use of Dynamic Partition on a single table.</li></ol><ul><li><a href="https://github.com/apache/doris/pull/33736" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33736</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-problem">Upgrade Problem<a href="#upgrade-problem" class="hash-link" aria-label="Upgrade Problem的直接链接" title="Upgrade Problem的直接链接"></a></h2><ol><li>Normal workload group is not created when upgrade from 2.0 or other old versions. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/33197" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33197</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-feature">New Feature<a href="#new-feature" class="hash-link" aria-label="New Feature的直接链接" title="New Feature的直接链接"></a></h2><ol><li>Add processlist table in information_schema database, users could use this table to query active connections. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32511" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32511</a></li></ul><ol start="2"><li>Add a new table valued function <code>LOCAL</code> to allow access file system like shared storage. 
</li></ol><ul><li><a href="https://github.com/apache/doris-website/pull/494" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris-website/pull/494</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="optimization">Optimization<a href="#optimization" class="hash-link" aria-label="Optimization的直接链接" title="Optimization的直接链接"></a></h2><ol><li>Skip some useless process to make graceful stop more quickly in K8s env. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/33212" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33212</a></li></ul><ol start="2"><li>Add rollup table name in profile to help find the mv selection problem. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/33137" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33137</a></li></ul><ol start="3"><li>Add test connection function to DB2 database to allow user check the connection when create DB2 Catalog. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/33335" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33335</a></li></ul><ol start="4"><li>Add DNS Cache for FQDN to accelerate the connect process among BEs in K8s env. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32869" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32869</a></li></ul><ol start="5"><li>Refresh external table's rowcount async to make the query plan more stable. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32997" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32997</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bugfix">Bugfix<a href="#bugfix" class="hash-link" aria-label="Bugfix的直接链接" title="Bugfix的直接链接"></a></h2><ol><li>Fix Iceberg Catalog of HMS and Hadoop do not support Iceberg properties like "io.manifest.cache-enabled" to enable manifest cache in Iceberg. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/33113" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33113</a></li></ul><ol start="2"><li>The offset params in <code>LEAD</code>/<code>LAG</code> function could use 0 as offset. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/33174" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33174</a></li></ul><ol start="3"><li>Fix some timeout issues with load. </li></ol><ul><li><p><a href="https://github.com/apache/doris/pull/33077" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33077</a></p></li><li><p><a href="https://github.com/apache/doris/pull/33260" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33260</a></p></li></ul><ol start="4"><li>Fix core problem related with <code>ARRAY</code>/<code>MAP</code>/<code>STRUCT</code> compaction process. </li></ol><ul><li><p><a href="https://github.com/apache/doris/pull/33130" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33130</a></p></li><li><p><a href="https://github.com/apache/doris/pull/33295" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33295</a></p></li></ul><ol start="5"><li>Fix runtime filter wait timeout. 
</li></ol><ul><li><a href="https://github.com/apache/doris/pull/33369" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/33369</a></li></ul><ol start="6"><li>Fix <code>unix_timestamp</code> core for string input in auto partition. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32871" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32871</a></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.0.8 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.8</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.8"/>
<updated>2024-04-09T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, about 65 improvements and bug fixes have been made in Doris 2.0.8 version.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 65 improvements and bug fixes have been made in Doris 2.0.8 version.</p><ul><li><p><strong>Quick Download</strong> : <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p></li><li><p><strong>GitHub</strong> : <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="1-behavior-change">1 Behavior change<a href="#1-behavior-change" class="hash-link" aria-label="1 Behavior change的直接链接" title="1 Behavior change的直接链接"></a></h2><p>The <code>ADMIN SHOW</code> statement can not be executed with high version of MySQL 8.x jdbc driver. So rename these statement, remove the <code>ADMIN</code> keywords. </p><ul><li><a href="https://github.com/apache/doris/pull/29492" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/29492</a></li></ul><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">ADMIN </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> CONFIG </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> CONFIG</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ADMIN </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> REPLICA </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> REPLICA</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ADMIN DIAGNOSE TABLET </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> TABLET DIAGNOSIS</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ADMIN </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> TABLET </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SHOW</span><span class="token plain"> TABLET</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" 
class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="2-new-features">2 New features<a href="#2-new-features" class="hash-link" aria-label="2 New features的直接链接" title="2 New features的直接链接"></a></h2><p>N/A</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="3-improvement-and-optimizations">3 Improvement and optimizations<a href="#3-improvement-and-optimizations" class="hash-link" aria-label="3 Improvement and optimizations的直接链接" title="3 Improvement and optimizations的直接链接"></a></h2><ul><li><p>Make Inverted Index work with TopN opt in Nereids</p></li><li><p>Limit the max string length to 1024 while collecting column stats to control BE memory usage</p></li><li><p>JDBC Catalog close when JDBC client is not empty</p></li><li><p>Accept all Iceberg database and do not check the name format of database</p></li><li><p>Refresh external table's rowcount async to avoid cache miss and unstable query plan</p></li><li><p>Simplify the isSplitable method of hive external table to avoid too many hadoop metrics</p></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/compare/2.0.7...2.0.8" target="_blank" rel="noopener noreferrer">GitHub</a> .</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="4-credits">4 Credits<a href="#4-credits" class="hash-link" aria-label="4 Credits的直接链接" title="4 Credits的直接链接"></a></h2><p>Thanks all who contribute to this release:</p><p>924060929, AcKing-Sam, amorynan, AshinGau, BePPPower, BiteTheDDDDt, ByteYue, cambyzju, dongsilun, eldenmoon, feiniaofeiafei, gnehil, Jibing-Li, liaoxin01, luwei16, morningman, morrySnow, mrhhsg, Mryange, nextdreamblue, platoneko, starocean999, SWJTU-ZhangLei, wuwenchi, xiaokang, xinyiZzz, Yukang-Lian, Yulei-Yang, zclllyybb, zddr, zhangstar333, zhiqiang-hhhh, ziyanTOP, zy-kkk, zzzxl1993</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Auto-increment columns in databases: a simple magic that makes a big difference]]></title>
<id>https://doris.apache.org/zh-CN/blog/auto-increment-columns-in-databases</id>
<link href="https://doris.apache.org/zh-CN/blog/auto-increment-columns-in-databases"/>
<updated>2024-04-08T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Auto-increment columns in Apache Doris accelerates dictionary encoding and pagination without damaging data writing performance. This is an introduction to its usage, applicable scenarios, and implementation details.]]></summary>
<content type="html"><![CDATA[<p>Auto-increment column is a bread-and-butter feature of single-node transactional databases. It assigns a unique identifier for each row in a way that requires the least manual effort from users. With an auto-increment column in the table, whenever a new row is inserted into the table, the new row will be assigned with the next available value from the auto-increment sequence. This is an automated mechanism that makes database maintenance easy and reliable.</p><p>Auto-increment column is the bedrock of many features in databases:</p><ul><li><p><strong>Dictionary encoding</strong>: User IDs and Order IDs are often stored as strings. However, strings are not friendly to precise deduplication query execution. So for optimal performance, a common practice is to perform dictionary encoding on the strings and then construct a bitmap for aggregation operations. The role of an auto-increment column in this process is that <strong>it speeds up dictionary encoding and thus accelerates string deduplication</strong>.</p></li><li><p><strong>Primary key generation</strong>: An auto-increment column is the perfect candidate for the primary key of a table. Primary keys must be unique and not empty, while auto-increment columns guarantee a unique identifier for each row. </p></li><li><p><strong>Detailed data updates</strong>: Updating detail tables is tricky, but it can be easy if you add a auto-increment table to it. It gives each data record in the database a unique ID, which can work as the primary key, and then data updates can be done based on the primary key.</p></li><li><p><strong>Efficient pagination</strong>: Pagination is often required in data display. It is typically implemented by the <code>limit</code> or <code>offset</code> + <code>order by</code> statement in SQL queries. However, such implementation involves full data reading and sorting, which doesn't make so much sense in deep pagination queries (those with large offsets). This is when auto-increment columns come to the rescue. Like I said, it gives a unique identifier to each row, so the maximum identifier of the last page can be used as the filtering condition for the next page. Thus, it can avoid a lot of unnecessary data scanning and increase pagination efficiency.</p></li></ul><p>The idea of auto-increment columns is intuitive, but when it comes to distributed databases, it becomes a different game, because it has to consider global transactions. 
<p>The idea of auto-increment columns is intuitive, but when it comes to distributed databases, it becomes a different game, because global transactions have to be considered. As a distributed DBMS, Apache Doris provides an innovative and efficient auto-increment solution that does no harm to data writing performance.</p><div class="theme-admonition theme-admonition-tip alert alert--success admonition_LlT9"><div class="admonitionHeading_tbUL">Tip</div><div class="admonitionContent_S0QG"><p>To give the AUTO_INCREMENT column a spin, follow this quick <a href="https://www.youtube.com/watch?v=FGVp2RQvGBo" target="_blank" rel="noopener noreferrer">demo</a>.</p></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="syntax--usage">Syntax &amp; usage</h2><p>To enable an auto-increment column in Doris, add the <code>AUTO_INCREMENT</code> property to the column in the table creation statement (<a href="https://doris.apache.org/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE/" target="_blank" rel="noopener noreferrer">CREATE-TABLE</a>). You can specify a starting value for the auto-increment column via <code>AUTO_INCREMENT(start_value)</code>; if not specified, the default starting value is 1.</p>
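<p>As a minimal sketch of the <code>AUTO_INCREMENT(start_value)</code> syntax just described (the table name here is hypothetical), a sequence can be made to start at 100:</p><pre><code class="language-sql">-- ids generated for this table start at 100 instead of the default 1
CREATE TABLE `demo`.`tbl_start_100` (
    `id` BIGINT NOT NULL AUTO_INCREMENT(100),
    `value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES ("replication_allocation" = "tag.location.default: 3");</code></pre>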
</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">demo</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">tbl</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line theme-code-block-highlighted-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AUTO_INCREMENT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">value</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ENGINE</span><span class="token operator">=</span><span class="token plain">OLAP</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DUPLICATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">KEY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span 
class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DISTRIBUTED</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HASH</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> BUCKETS </span><span class="token number">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token string" style="color:rgb(255, 121, 198)">"replication_allocation"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"tag.location.default: 3"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Apart from a key column, you can also specify a value column as an auto-increment column (example below):</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">demo</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token identifier punctuation" 
style="color:rgb(248, 248, 242)">`</span><span class="token identifier">tbl</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">uid</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">name</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line theme-code-block-highlighted-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AUTO_INCREMENT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">value</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ENGINE</span><span class="token operator">=</span><span class="token plain">OLAP</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DUPLICATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">KEY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">uid</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">name</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DISTRIBUTED</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HASH</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">uid</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> BUCKETS </span><span class="token number">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token string" style="color:rgb(255, 121, 198)">"replication_allocation"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"tag.location.default: 3"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>AUTO_INCREMENT is supported in both the Duplicate Key model and the <a href="https://doris.apache.org/docs/data-table/data-model/#unique-model" target="_blank" rel="noopener noreferrer">Unique Key model</a>. 
Usage in the latter is similar.</p><p>I will walk you down the rest of the road with the table below as an example: </p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">demo</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">tbl</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AUTO_INCREMENT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">name</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">65533</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">value</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span 
class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ENGINE</span><span class="token operator">=</span><span class="token plain">OLAP</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">UNIQUE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">KEY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DISTRIBUTED</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HASH</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> BUCKETS </span><span class="token number">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token string" style="color:rgb(255, 121, 198)">"replication_allocation"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"tag.location.default: 3"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" 
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>When you ingest data into this table using an <code>insert into</code> statement, if the <code>id</code> column has no specified value in the original data file, it will be auto-filled with auto-increment values.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">insert</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">into</span><span class="token plain"> tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">value</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">values</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"Bob"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token number">10</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"Alice"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token number">20</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"Jack"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token number">30</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token number">3</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> affected </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.09</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 
242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{</span><span class="token string" style="color:rgb(255, 121, 198)">'label'</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">'label_183babcb84ad4023_a2d6266ab73fb5aa'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'status'</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">'VISIBLE'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'txnId'</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">'7'</span><span class="token plain">}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">order</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------+-------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> id </span><span class="token operator">|</span><span class="token plain"> name </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">value</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------+-------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">1</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Bob </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">10</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">2</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Alice </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">20</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">3</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Jack </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">30</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------+-------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">3</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.05</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Similarly, when you ingest a data file <code>test.csv</code> by Stream Load, the <code>id</code> column will be auto-filled with auto-increment values, too.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">test</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">csv:</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Tom</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token number">40</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">John</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span 
class="token number">50</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl </span><span class="token comment" style="color:rgb(98, 114, 164)">--location-trusted -u user:passwd -H "columns:name,value" -H "column_separator:," -T ./test.csv http://{host}:{port}/api/{db}/tbl/_stream_load</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">order</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------+-------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> id </span><span class="token operator">|</span><span class="token plain"> name </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">value</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------+-------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">1</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Bob </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">10</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">2</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Alice </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">20</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">3</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Jack </span><span class="token 
operator">|</span><span class="token plain"> </span><span class="token number">30</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Tom </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">40</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">5</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> John </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">50</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------+-------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">5</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.04</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="applicable-scenarios">Applicable scenarios<a href="#applicable-scenarios" class="hash-link" aria-label="Applicable scenarios的直接链接" title="Applicable scenarios的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01-dictionary-encoding">01 Dictionary encoding<a href="#01-dictionary-encoding" class="hash-link" aria-label="01 Dictionary encoding的直接链接" title="01 Dictionary encoding的直接链接"></a></h3><p>In Apache Doris, the bitmap data type and the bitmap-related aggregations are implemented with RoaringBitmap, which can deliver high performance especially when dictionary encoding produces dense values. </p><p>As is mentioned, auto-increment columns enable fast dictionary encoding. 
Let me walk you through a user profiling scenario to show how this works.</p>
<p>For analysis of offline page views (PV) and unique visitors (UV), store the detailed data in a user behavior table:</p>
<pre><code class="language-sql">CREATE TABLE `demo`.`dwd_dup_tbl` (
    `user_id` varchar(50) NOT NULL,
    `dim1` varchar(50) NOT NULL,
    `dim2` varchar(50) NOT NULL,
    `dim3` varchar(50) NOT NULL,
    `dim4` varchar(50) NOT NULL,
    `dim5` varchar(50) NOT NULL,
    `visit_time` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
</code></pre>
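NOT</span>">
<p>To make the walkthrough concrete, you can seed the table with a small, hypothetical batch of behavior rows (all values below are invented for illustration):</p>
<pre><code class="language-sql">-- Hypothetical sample data; dimension values and dates are illustrative only.
INSERT INTO demo.dwd_dup_tbl VALUES
('u001', 'a1', 'b1', 'c1', 'd1', 'e1', '2023-12-09'),
('u002', 'a1', 'b2', 'c1', 'd2', 'e1', '2023-12-09'),
('u001', 'a2', 'b1', 'c2', 'd1', 'e2', '2023-12-11');
</code></pre>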
<p>Create a dictionary table as follows, leveraging AUTO_INCREMENT:</p>
<pre><code class="language-sql">CREATE TABLE `demo`.`dictionary_tbl` (
    `user_id` varchar(50) NOT NULL,
    `aid` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
</code></pre>
<p>Load the existing <code>user_id</code> into the dictionary table, and create mappings from <code>user_id</code> to integer values:</p>
style="color:rgb(248, 248, 242)">(</span><span class="token plain">user_id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> user_id </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> dwd_dup_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">group</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> user_id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you only need to load the incremental <code>user_id</code> into the dictionary table, you can use the following command. In practice, you can also use the <a href="https://doris.apache.org/docs/ecosystem/flink-doris-connector/" target="_blank" rel="noopener noreferrer">Flink Doris Connector</a> for data writing. 
</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">insert</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">into</span><span class="token plain"> dictionary_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">user_id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">user_id </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> dwd_dup_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">left</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">join</span><span class="token plain"> dictionary_tbl</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">on</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">user_id </span><span class="token operator">=</span><span class="token plain"> dictionary_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">user_id </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">where</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">visit_time </span><span class="token string" style="color:rgb(255, 121, 198)">'2023-12-10'</span><span class="token plain"> </span><span class="token operator">and</span><span class="token plain"> dictionary_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">user_id </span><span class="token operator">is</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Suppose you have your analytic dimensions as <code>dim1</code>, <code>dim3</code>, <code>dim5</code>, create a table 
<p>Suppose the analytic dimensions are <code>dim1</code>, <code>dim3</code>, and <code>dim5</code>. Create a table in the <a href="https://doris.apache.org/docs/data-table/data-model#aggregate-model" target="_blank" rel="noopener noreferrer">Aggregate Key model</a> to accommodate the aggregated results:</p>
<pre><code class="language-sql">CREATE TABLE `demo`.`dws_agg_tbl` (
    `dim1` varchar(50) NOT NULL,
    `dim3` varchar(50) NOT NULL,
    `dim5` varchar(50) NOT NULL,
    `user_id_bitmap` BITMAP BITMAP_UNION NOT NULL,
    `pv` BIGINT SUM NOT NULL
) ENGINE=OLAP
AGGREGATE KEY(`dim1`,`dim3`,`dim5`)
DISTRIBUTED BY HASH(`dim1`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
</code></pre>
<p>Load the aggregated results into the table:</p>
punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">TO_BITMAP</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">dictionary_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">aid</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">COUNT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">1</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> dwd_dup_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INNER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">JOIN</span><span class="token plain"> dictionary_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">on</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">user_id </span><span class="token operator">=</span><span class="token plain"> dictionary_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">user_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">group</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">dim1</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">dim3</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> dwd_dup_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">dim5</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Then you query PV/UV using the following statement:</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" 
style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> dim1</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> dim3</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> dim5</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> bitmap_count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">user_id_bitmap</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> uv</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> pv </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> dws_agg_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02-detailed-data-updates">02 Detailed data updates<a href="#02-detailed-data-updates" class="hash-link" aria-label="02 Detailed data updates的直接链接" title="02 Detailed data updates的直接链接"></a></h3><p>In Doris, the Unique Key model is applicable to use cases with frequent data updates, while the Duplicate Key model is designed for detailed data storage with no data updating requirements.</p><p>However, in real life, users might need to update their detailed data sometimes, which can be hard to implement because the data tables don't come with unique key columns.</p><p>In this case, you can <strong>use an auto-increment column as the primary key for the detailed data</strong>.</p><p>For example, a financial institution keeps record of customer loans and writes it into a Duplicate Key table, in which one single user might have multiple borrowing records. 
</p><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE loan_records (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `user_id` VARCHAR(20) DEFAULT NULL COMMENT 'Customer ID',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `loan_amount` DECIMAL(10, 2) DEFAULT NULL COMMENT 'Amount of loan',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `interest_rate` DECIMAL(10, 2) DEFAULT NULL COMMENT 'Interest rate',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `loan_start_date` DATE DEFAULT NULL COMMENT 'Start date of the loan',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `loan_end_date` DATE DEFAULT NULL COMMENT 'End date of the loan',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `total_debt` DECIMAL(10, 2) DEFAULT NULL COMMENT 'Amount of debt'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) DUPLICATE KEY(`user_id`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH(`user_id`) BUCKETS 10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "replication_allocation" = "tag.location.default: 3"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Suppose that in a promotional campaign, the institution offers a 10% discount on interest rates to its existing customers. Correspondingly, there is a need to update the <code>interest_rate</code> and <code>total_debt</code> in the table.</p><p>For that sake, you can create a Unique Key table for the same data, but add an <code>auto_id</code> field and set it as the primary key. 
</p><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE loan_records (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `auto_id` BIGINT NOT NULL AUTO_INCREMENT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `user_id` VARCHAR(20) DEFAULT NULL COMMENT 'Customer ID',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `loan_amount` DECIMAL(10, 2) DEFAULT NULL COMMENT 'Amount of loan',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `interest_rate` DECIMAL(10, 2) DEFAULT NULL COMMENT 'Interest rate',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `loan_start_date` DATE DEFAULT NULL COMMENT 'Start date of the loan',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `loan_end_date` DATE DEFAULT NULL COMMENT 'End date of the loan',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `total_debt` DECIMAL(10, 2) DEFAULT NULL COMMENT 'Amount of debt'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) UNIQUE KEY(`auto_id`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH(`auto_id`) BUCKETS 10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "replication_allocation" = "tag.location.default: 3"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Now, write a few new records into the table and see what happens. 
<p>Now, write a few new records into the table and see what happens. (Note that you don't have to fill in the <code>auto_id</code> field.)</p>
<pre><code class="language-sql">INSERT INTO loan_records (user_id, loan_amount, interest_rate, loan_start_date, loan_end_date, total_debt) VALUES
('10001', 5000.00, 5.00, '2024-03-01', '2024-03-31', 5020.55),
('10002', 10000.00, 5.00, '2024-03-01', '2024-05-01', 10082.56),
('10003', 2000.00, 5.00, '2024-03-01', '2024-03-15', 2003.84),
('10004', 7500.00, 5.00, '2024-03-01', '2024-04-15', 7546.23),
('10005', 3000.00, 5.00, '2024-03-01', '2024-03-21', 3008.22),
('10002', 8000.00, 5.00, '2024-03-01', '2024-06-01', 8100.82),
('10007', 6000.00, 5.00, '2024-03-01', '2024-04-10', 6032.88),
('10008', 4000.00, 5.00, '2024-03-01', '2024-03-26', 4013.70),
('10001', 5500.00, 5.00, '2024-03-01', '2024-04-05', 5526.37),
('10010', 9000.00, 5.00, '2024-03-01', '2024-05-10', 9086.30);
</code></pre>
<p>Check with the <code>select * from loan_records</code> statement, and you can see that a unique ID is already in place for each newly ingested record:</p>
<pre><code class="language-sql">mysql&gt; select * from loan_records;
+---------+---------+-------------+---------------+-----------------+---------------+------------+
| auto_id | user_id | loan_amount | interest_rate | loan_start_date | loan_end_date | total_debt |
+---------+---------+-------------+---------------+-----------------+---------------+------------+
|       1 | 10001   |     5000.00 |          5.00 | 2024-03-01      | 2024-03-31    |    5020.55 |
|       4 | 10004   |     7500.00 |          5.00 | 2024-03-01      | 2024-04-15    |    7546.23 |
|       2 | 10002   |    10000.00 |          5.00 | 2024-03-01      | 2024-05-01    |   10082.56 |
|       3 | 10003   |     2000.00 |          5.00 | 2024-03-01      | 2024-03-15    |    2003.84 |
|       6 | 10002   |     8000.00 |          5.00 | 2024-03-01      | 2024-06-01    |    8100.82 |
|       8 | 10008   |     4000.00 |          5.00 | 2024-03-01      | 2024-03-26    |    4013.70 |
|       7 | 10007   |     6000.00 |          5.00 | 2024-03-01      | 2024-04-10    |    6032.88 |
|       9 | 10001   |     5500.00 |          5.00 | 2024-03-01      | 2024-04-05    |    5526.37 |
|       5 | 10005   |     3000.00 |          5.00 | 2024-03-01      | 2024-03-21    |    3008.22 |
|      10 | 10010   |     9000.00 |          5.00 | 2024-03-01      | 2024-05-10    |    9086.30 |
+---------+---------+-------------+---------------+-----------------+---------------+------------+
10 rows in set (0.01 sec)
</code></pre>
<p>Execute these two SQL statements to update <code>interest_rate</code> and <code>total_debt</code>, respectively:</p>
<pre><code class="language-sql">update loan_records set interest_rate = interest_rate * 0.9 where user_id &lt;= 10005;
update loan_records set total_debt = loan_amount + (loan_amount * (interest_rate / 100) * DATEDIFF(loan_end_date, loan_start_date) / 365);
</code></pre>
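<p>As a quick sanity check on the second statement: record 1 runs for <code>DATEDIFF('2024-03-31', '2024-03-01')</code> = 30 days at the discounted 4.50% rate, so its <code>total_debt</code> becomes 5000.00 + 5000.00 * (4.50 / 100) * 30 / 365 ≈ 5018.49, which matches the first row below.</p>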
class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Check again to see if the old records have been replaced by the new ones:</p><div class="language-Python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select * from loan_records order by auto_id;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+---------+-------------+---------------+-----------------+---------------+------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| auto_id | user_id | loan_amount | interest_rate | loan_start_date | loan_end_date | total_debt |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+---------+-------------+---------------+-----------------+---------------+------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 10001 | 5000.00 | 4.50 | 2024-03-01 | 2024-03-31 | 5018.49 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 2 | 10002 | 10000.00 | 4.50 | 2024-03-01 | 2024-05-01 | 10075.21 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 3 | 10003 | 2000.00 | 4.50 | 2024-03-01 | 2024-03-15 | 2003.45 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 4 | 10004 | 7500.00 | 4.50 | 2024-03-01 | 2024-04-15 | 7541.61 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 5 | 10005 | 3000.00 | 4.50 | 2024-03-01 | 2024-03-21 | 3007.40 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 6 | 10002 | 8000.00 | 4.50 | 2024-03-01 | 2024-06-01 | 8090.74 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 7 | 10007 | 6000.00 | 5.00 | 2024-03-01 | 2024-04-10 | 6032.88 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 8 | 10008 | 4000.00 | 5.00 | 2024-03-01 | 2024-03-26 | 4013.70 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 9 | 10001 | 5500.00 | 4.50 | 2024-03-01 | 2024-04-05 | 5523.73 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 10 | 10010 | 9000.00 | 5.00 | 2024-03-01 | 2024-05-10 | 9086.30 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+---------+-------------+---------------+-----------------+---------------+------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">10 rows in set (0.01 sec)</span><br></span></code></pre><div 
class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="03-efficient-pagination">03 Efficient pagination<a href="#03-efficient-pagination" class="hash-link" aria-label="03 Efficient pagination的直接链接" title="03 Efficient pagination的直接链接"></a></h3><p>Imagine that you need to sort the data in a specific order and then retrieve record No. 90,001 to record No. 90,010. This means you have a large offset of 90,000. This is what we call a deep pagination query. Even though you only require a result set of 10 rows, the database system still has to read the entire dataset into memory and perform a full sorting.</p><p><strong>For higher execution efficiency in deep pagination queries, you can harness the power of auto-increment columns</strong>. The main idea is to record the <code>max_value</code> from the <code>unique_value</code> column of the previous page, and push down predicates by <code>where unique_value &gt; max_value limit rows_per_page</code>.</p><p>For example, during table creation, you enable an auto-increment column: <code>unique_value</code>, which gives each row an identifier.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">demo</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">records_tbl</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">user_id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token 
operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">name</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">26</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">address</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">41</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">city</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token 
string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">nation</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">16</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">region</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">13</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">phone</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">16</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token 
identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">mktsegment</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">unique_value</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AUTO_INCREMENT</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DUPLICATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">KEY</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">user_id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">name</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DISTRIBUTED</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HASH</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">user_id</span><span class="token 
identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> BUCKETS </span><span class="token number">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"replication_allocation"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"tag.location.default: 3"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>In pagination queries, suppose that each page displays 100 results, this is how you retrieve the first page of the result set. 
</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> records_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">order</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> unique_value </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">limit</span><span class="token plain"> </span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Use programs to record the maximum <code>unique_value</code> in the returned result. 
Suppose that the maximum is 99, you can query data from the second page using the following statement:</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> records_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">where</span><span class="token plain"> unique_value </span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token number">99</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">order</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> unique_value </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">limit</span><span class="token plain"> </span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you need to query data from a deeper page, for example, page 101, which means it's hard to get the maximum <code>unique_value</code> from the previous page directly, then you can use the statement as follows:</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> user_id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> address</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> city</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> nation</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> region</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> phone</span><span class="token punctuation" style="color:rgb(248, 248, 
242)">,</span><span class="token plain"> mktsegment</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> records_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> unique_value </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> max_value </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> records_tbl </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">order</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> unique_value </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">limit</span><span class="token plain"> </span><span class="token number">1</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">offset</span><span class="token plain"> </span><span class="token number">9999</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> previous_data</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">where</span><span class="token plain"> records_tbl</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">unique_value </span><span class="token operator">&gt;</span><span class="token plain"> previous_data</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">max_value</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">order</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">by</span><span class="token plain"> unique_value </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">limit</span><span class="token plain"> </span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="implementation">Implementation<a 
href="#implementation" class="hash-link" aria-label="Implementation的直接链接" title="Implementation的直接链接"></a></h2><p>Typical OLTP databases perform incremental ID matching by their transaction mechanisms. However, in an MPP-based distributed database system like Apache Doris, such an approach can easily suffocate data writing performance. </p><p>That's why Apache Doris 2.1 innovates the implementation of auto-increment IDs. In a data ingestion task, one of the backend (BE) nodes will work as the coordinator, which is responsible for the allocation of auto-increment IDs. The coordinator BE requests a range of IDs in bulk from the frontend (FE). The FE makes sure that the ID ranges allocated to each BE do not overlap, thus guaranteeing the uniqueness of IDs.</p><p>I illustrate the process with the figure below. StreamLoad1 has BE1 as the coordinator. BE1 requests a batch of IDs (range: 1-1000) from the FE and caches the IDs locally. Once all 1000 IDs are allocated, BE1 will request a new batch from the FE. At the same time, StreamLoad 2 selects BE3 as the coordinator, and BE3 also requests IDs from the FE. Since IDs 1-1000 have already been allocated to BE1, the FE assigns IDs 1001-2000 to BE3.</p><p><img loading="lazy" alt="the implementation of auto-increment IDs" src="https://cdnd.selectdb.com/zh-CN/assets/images/the-implementation-of-auto-increment-IDs-d48d7814da087bde1ef5fe3fbf0db7b5.png" width="1280" height="1100" class="img_ev3q"></p><p>Suppose that StreamLoad1 and StreamLoad2 each write in 50 new data records, the auto-increment IDs assigned to them will be 1-50 and 1001-1050. </p><p>Suppose that StreamLoad3 arises later and selects BE1 as the coordinator, BE1 will assign IDs starting from 51 to the data written by StreamLoad3. From the user's side, they will see that rows written by StreamLoad3 get smaller ID numbers than those by StreamLoad2, even though StreamLoad2 precedes StreamLoad3 in time.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="note">Note<a href="#note" class="hash-link" aria-label="Note的直接链接" title="Note的直接链接"></a></h2><p>Attention is required regarding: </p><ul><li><p><strong>Scope of uniqueness guarantee</strong>: Doris ensures that the values generated on an auto-increment column are unique within the table, but this only applies to values auto-filled by Doris. If a user explicitly inserts values into the auto-increment column, Doris cannot guarantee the uniqueness of those values.</p></li><li><p><strong>Density and continuity of values</strong>: Doris ensures that the values generated by the auto-increment column are dense. However, for performance reasons, it cannot guarantee that the auto-filled values are continuous. This means there may be occurrences of value jumps in the auto-increment column. Additionally, since the auto-increment values are pre-allocated and cached in BE, the magnitude of the auto-increment values cannot reflect the order of data import.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>AUTO_INCREMENT brings higher stability and reliability for Doris in large-scale data processing. If it sounds like something you need, download <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">Apache Doris</a> and try it out. 
For issues you come across along the way, join us in the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris developer and user community</a> and we are happy to help.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.1.1 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.1.1</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.1.1"/>
<updated>2024-04-03T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 2.1.1 is now available, with several enhancements and bug fixes based on 2.1.0, enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<p>Dear community members, Apache Doris 2.1.1 has been officially released on April 3, 2024, with several enhancements and bug fixes based on 2.1.0, enabling smoother user experience.</p><ul><li><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p></li><li><p><strong>GitHub:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior Changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior Changed的直接链接" title="Behavior Changed的直接链接"></a></h2><ol><li>Change float type output format to improve float type serialization performance.</li></ol><ul><li><a href="https://github.com/apache/doris/pull/32049" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32049</a></li></ul><ol start="2"><li>Change system table value functions active_queries(), workload_groups() to system tables. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32314" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32314</a></li></ul><ol start="3"><li>Disable show query/load profile stmt because there are not so many developers use it and the pipeline and pipelinex engine not support it. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32467" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32467</a></li></ul><ol start="4"><li>Upgrade arrow flight version to 15.0.2 to fix some bugs, so that please use ADBC 15.0.2 version to access Doris. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32827" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32827</a>.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-problem">Upgrade Problem<a href="#upgrade-problem" class="hash-link" aria-label="Upgrade Problem的直接链接" title="Upgrade Problem的直接链接"></a></h2><ol><li>BE will core when rolling pgrade problem from 2.0.x to 2.1.x.</li></ol><ul><li><p><a href="https://github.com/apache/doris/pull/32672" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32672</a></p></li><li><p><a href="https://github.com/apache/doris/pull/32444" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32444</a></p></li><li><p><a href="https://github.com/apache/doris/pull/32162" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32162</a></p></li></ul><ol start="2"><li>JDBC Catalog will have query errors when rolling grade rom 2.0.x to 2.1.x. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32618" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32618</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-feature">New Feature<a href="#new-feature" class="hash-link" aria-label="New Feature的直接链接" title="New Feature的直接链接"></a></h2><ol><li>Enable column auth by default.</li></ol><ul><li><a href="https://github.com/apache/doris/pull/32659" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32659</a></li></ul><ol start="2"><li>Get correct cores for pipeline and pipelinex engine when running within docker or k8s. 
</li></ol><ul><li><a href="https://github.com/apache/doris/pull/32370" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32370</a></li></ul><ol start="3"><li>Support read parquet int96 type. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32394" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32394</a></li></ul><ol start="4"><li>Enable proxy protocol to support IP transparency. Using this protocol, IP transparency for load balancing can be achieved, so that after load balancing, Doris can still obtain the client's real IP and implement permission control such as whitelisting. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32338/files" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32338/files</a></li></ul><ol start="5"><li>Add workload group queue related columns for active_queries system table. Uses could use this system to monitor the workload queue usage. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32259" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32259</a></li></ul><ol start="6"><li>Add new system table backend_active_tasks to monitor the realtime query statics on every BE. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31945" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31945</a></li></ul><ol start="7"><li>Add ipv4 and ipv6 support for spark-doris connector. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32240" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32240</a></li></ul><ol start="8"><li>Add inverted index support for CCR. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32101" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32101</a></li></ul><ol start="9"><li>Support select experimental session variable. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31837" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31837</a></li></ul><ol start="10"><li>Support materialized view with bitmap_union(bitmap_from_array()) case. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31962" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31962</a></li></ul><ol start="11"><li>Support partition prune for <code>HIVE_DEFAULT_PARTITION</code>. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31736" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31736</a></li></ul><ol start="12"><li>Support function in set variable statement. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32492" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32492</a></li></ul><ol start="13"><li>Support arrow serialization for varint type. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32809" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32809</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="optimization">Optimization<a href="#optimization" class="hash-link" aria-label="Optimization的直接链接" title="Optimization的直接链接"></a></h2><ol><li>Auto resume routine load when be restart or during upgrade. And keep the routine load stable. 
</li></ol><ul><li><a href="https://github.com/apache/doris/pull/32239" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32239</a></li></ul><ol start="2"><li>Routine Load: optimize allocate task to be algorithm for load balance. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32021" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32021</a></li></ul><ol start="3"><li>Spark Load: update spark version for spark load to resolve cve problem. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/30368" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/30368</a></li></ul><ol start="4"><li>Skip cooldown if the tablet is dropped. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32079" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32079</a></li></ul><ol start="5"><li>Support using workload group to manage routine load. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31671" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31671</a></li></ul><ol start="6"><li>[MTMV ]<!-- -->Improve the performance for query rewritting by materialized view. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31886" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31886</a></li></ul><ol start="7"><li>Reduce jvm heap memory consumed by profiles of BrokerLoadJob. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31985" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31985</a></li></ul><ol start="8"><li>Imporve the high QPS query by speed up PartitionPrunner. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31970" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31970</a></li></ul><ol start="9"><li>Reduce duplicated memory consumption for column name and column path for schema cache. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31141" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31141</a></li></ul><ol start="10"><li>Support more join types for query rewriting by materialized view such as INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, LEFT SEMI JOIN, RIGHT SEMI JOIN, LEFT ANTI JOIN, RIGHT ANTI JOIN.</li></ol><ul><li><a href="https://github.com/apache/doris/pull/32909" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32909</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bugfix">Bugfix<a href="#bugfix" class="hash-link" aria-label="Bugfix的直接链接" title="Bugfix的直接链接"></a></h2><ol><li>Do not push down topn-filter through right/full outer join if the first orderkey is nulls first. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32633" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32633</a></li></ul><ol start="2"><li>Fix memory leak in Java UDF.</li></ol><ul><li><a href="https://github.com/apache/doris/pull/32630" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32630</a></li></ul><ol start="3"><li>If some odbc tables use the same resource, and restore not all odbc tables, it will not retain the resource.
also check some configurations for backup/restore. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31989" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31989</a></li></ul><ol start="4"><li>Constant folding will core dump for the variant type. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32265" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32265</a></li></ul><ol start="5"><li>Routine load will pause when transactions fail in some cases. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32638" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32638</a></li></ul><ol start="6"><li>The result of a left semi join with an empty right side should be false instead of null. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32477" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32477</a></li></ul><ol start="7"><li>Fix core dump when building an inverted index for a new column with no data. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32669" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32669</a></li></ul><ol start="8"><li>Fix BE core dump caused by null-safe-equal join. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32623" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32623</a></li></ul><ol start="9"><li>Partial update: fix a data correctness risk when loading delete sign data into a table with a sequence column. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32574" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32574</a></li></ul><ol start="10"><li>Select outfile: fix the column type mapping in the ORC/Parquet file format. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32281" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32281</a></li></ul><ol start="11"><li>Fix BE core dump during the restore stage. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32489" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32489</a></li></ul><ol start="12"><li>Using the array_agg function after other aggregate functions like count or sum may cause BE to core dump. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32387" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32387</a></li></ul><ol start="13"><li>The variant type should always be nullable; otherwise there will be bugs. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32248" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32248</a></li></ul><ol start="14"><li>Fix the bug of handling empty blocks in schema change. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32396" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32396</a></li></ul><ol start="15"><li>Fix BE core dump when using json_length() in some cases. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32145" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32145</a></li></ul><ol start="16"><li>Fix errors when querying Iceberg tables using date cast predicates. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/32194" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32194</a></li></ul><ol start="17"><li>Fix some bugs when building inverted indexes for the variant type. 
</li></ol><ul><li><a href="https://github.com/apache/doris/pull/31992" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31992</a></li></ul><ol start="18"><li>Wrong result of two or more map_agg functions in query. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31928" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31928</a></li></ul><ol start="19"><li>Fix wrong result of money_format function. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31883" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31883</a></li></ul><ol start="20"><li>Fix connection hang after too many connections. </li></ol><ul><li><a href="https://github.com/apache/doris/pull/31594" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31594</a></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.0.7 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.7</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.7"/>
<updated>2024-03-26T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, about 80 improvements and bug fixes have been made in Doris 2.0.7 version.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 80 improvements and bug fixes have been made in Doris 2.0.7 version.</p><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="1-behavior-change">1 Behavior change<a href="#1-behavior-change" class="hash-link" aria-label="1 Behavior change的直接链接" title="1 Behavior change的直接链接"></a></h2><ul><li><p><code>round</code> function defaults to rounding normally as MySQL, eg. round(5/2) return 3 instead of 2.</p><ul><li><a href="https://github.com/apache/doris/pull/31583" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31583</a></li></ul></li><li><p><code>round</code> datetime with scale from string literal as MySQL, eg. round '2023-10-12 14:31:49.666' to '2023-10-12 14:31:50' .</p><ul><li><a href="https://github.com/apache/doris/pull/27965" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27965</a> </li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="2-new-features">2 New features<a href="#2-new-features" class="hash-link" aria-label="2 New features的直接链接" title="2 New features的直接链接"></a></h2><ul><li><p>Support make miss slot as null alias when converting outer join to anti join to speed up query</p><ul><li><a href="https://github.com/apache/doris/pull/31854" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/31854</a></li></ul></li><li><p>Enable proxy protocol to support IP transparency for Nginx and HAProxy.</p><ul><li><a href="https://github.com/apache/doris/pull/32338" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/32338</a></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="3-improvement-and-optimizations">3 Improvement and optimizations<a href="#3-improvement-and-optimizations" class="hash-link" aria-label="3 Improvement and optimizations的直接链接" title="3 Improvement and optimizations的直接链接"></a></h2><ul><li><p>Add DEFAULT_ENCRYPTION column in <code>information_schema</code> table and add <code>processlist</code> table for better compatibility for BI tools</p></li><li><p>Automatically test connectivity by default when creating a JDBC Catalog.</p></li><li><p>Enhance auto resume to keep routine load stable</p></li><li><p>Use lowercase by default for Chinese tokenizer in inverted index</p></li><li><p>Add error msg if exceeded maximum default value in repeat function</p></li><li><p>Skip hidden file and dir in Hive table</p></li><li><p>Reduce file meta cache size and disable cache for some cases to avoid OOM</p></li><li><p>Reduce jvm heap memory consumed by profiles of BrokerLoadJob</p></li><li><p>Remove sort which is under table sink to speed up query like <code>INSERT INTO t1 SELECT * FROM t2 ORDER BY k</code>.</p></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/compare/2.0.6...2.0.7" target="_blank" rel="noopener noreferrer">github</a> .</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="4-credits">4 Credits<a href="#4-credits" class="hash-link" aria-label="4 Credits的直接链接" title="4 Credits的直接链接"></a></h2><p>Thanks all who contribute to this 
release:</p><p>924060929,airborne12,amorynan,ByteYue,dataroaring,deardeng,feiniaofeiafei,felixwluo,freemandealer,gavinchou,hello-stephen,HHoflittlefish777,jacktengg,jackwener,jeffreys-cat,Jibing-Li,KassieZ,LiBinfeng-01,luwei16,morningman,mrhhsg,Mryange,nextdreamblue,platoneko,qidaye,rohitrs1983,seawinde,shuke987,starocean999,SWJTU-ZhangLei,w41ter,wsjz,wuwenchi,xiaokang,XieJiann,XuJianxu,yujun777,Yulei-Yang,zhangstar333,zhiqiang-hhhh,zy-kkk,zzzxl1993</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis]]></title>
<id>https://doris.apache.org/zh-CN/blog/variant-in-apache-doris-2.1</id>
<link href="https://doris.apache.org/zh-CN/blog/variant-in-apache-doris-2.1"/>
<updated>2024-03-26T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Doris 2.1.0 provides a new data type: Variant, for semi-structured data analysis, which enables 8 times faster query performance than JSON with one-third storage space.]]></summary>
<content type="html"><![CDATA[<p>Semi-structured data is data arranged in flexible formats. Unlike structured data, it does not require data users to pre-define the table schema for it, so it provides convenience for data storage and analysis. Common forms of semi-structured data include XML, JSON, and log files. They are widely seen in the following industry scenarios:</p><ul><li><p><strong>E-commerce</strong> platforms store user reviews of products as semi-structured data for sentiment analysis and user behavior pattern mining.</p></li><li><p><strong>Telecommunication</strong> use cases often require schemaless support for their network data and complicated nested JSON data.</p></li><li><p><strong>Mobile applications</strong> keep records of user behavior in the form of semi-structured data, because after new features are introduced, the user behavior attributes can change. A non-fixed schema can adapt to these changes easily and save the trouble of frequent manual modification. </p></li><li><p><strong>Internet of Vehicles</strong> (IoV) and <strong>Internet of Things</strong> (IoT) platforms receive real-time data from vehicle sensors, such as speed, location, and fuel consumption, based on which they perform real-time monitoring, fault alerting, and route planning. Such data is also stored as semi-structured data.</p></li></ul><p>As an open-source real-time data warehouse, <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a> provides semi-structured data processing capabilities, and the newly-released <a href="https://doris.apache.org/blog/release-note-2.1.0" target="_blank" rel="noopener noreferrer">version 2.1.0</a> makes a stride in this direction. Before V2.1, Apache Doris stores semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-newly-added-data-type-variant">A newly-added data type: Variant<a href="#a-newly-added-data-type-variant" class="hash-link" aria-label="A newly-added data type: Variant的直接链接" title="A newly-added data type: Variant的直接链接"></a></h2><div class="theme-admonition theme-admonition-tip alert alert--success admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>提示</div><div class="admonitionContent_S0QG"><p>To help you quickly learn and use Variant data type, we provide <strong><a href="https://www.youtube.com/watch?v=FVfsnkZUBsU" target="_blank" rel="noopener noreferrer">a hands-on demo</a> </strong></p></div></div><p>In Apache Doris 2.1.0, we have introduced a new data type: <a href="https://doris.apache.org/docs/sql-manual/sql-reference/Data-Types/VARIANT" target="_blank" rel="noopener noreferrer">Variant</a>. 
<p>Meanwhile, you can include both Variant columns and static columns of pre-defined data types in the same table. This Schema-on-Write method provides greater flexibility in storage and queries. Powered by the columnar storage, vectorized execution engine, and query optimizer of Doris, the Variant type delivers high efficiency in queries and storage. </p><p>Compared to the JSON type, storing data in the Variant type can save up to 65% of disk space and increase query speed by 8 times (see details later in this post).</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="usage-guide">Usage guide<a href="#usage-guide" class="hash-link" aria-label="Usage guide的直接链接" title="Usage guide的直接链接"></a></h2><p>Create table: syntax keyword <code>variant</code></p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)">-- No index</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">IF</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">EXISTS</span><span class="token plain"> ${table_name} </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> v VARIANT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">table_properties</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">-- Create index for the v column, specify the parser</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">IF</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">EXISTS</span><span class="token plain"> ${table_name} </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> v VARIANT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INDEX</span><span class="token plain"> idx_var</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">v</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">USING</span><span class="token plain"> INVERTED </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token plain">PROPERTIES</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"parser"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"english|unicode|chinese"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'your comment'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">table_properties</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" 
style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">-- Create Bloom Filter for the v column</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">IF</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">EXISTS</span><span class="token plain"> ${table_name} </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> v VARIANT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">properties</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"replication_num"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"1"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"bloom_filter_columns"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"v"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Query: access sub-column 
<p>Query: access sub-columns via <code>[]</code>. The sub-columns returned are also of the Variant type.</p>
<pre><code class="language-sql">SELECT v["properties"]["title"] FROM ${table_name};
</code></pre>
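<p>Since sub-columns are returned as Variant, cast them to a concrete type when one is needed, for example for grouping or comparison. A short sketch under the same assumptions as above:</p>
<pre><code class="language-sql">-- Hypothetical: cast the sub-column to TEXT before grouping on it
SELECT CAST(v["properties"]["title"] AS TEXT) AS title, count(*) AS cnt
FROM ${table_name}
GROUP BY CAST(v["properties"]["title"] AS TEXT)
ORDER BY cnt DESC;
</code></pre>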
<p>Now, let's show you how to create a table containing the Variant data type, then ingest data into it and query it. The dataset is GitHub Events records. This is one of the formatted records:</p>
<pre><code class="language-json">{
    "id": "14186154924",
    "type": "PushEvent",
    "actor": {
        "id": 282080,
        "login": "brianchandotcom",
        "display_login": "brianchandotcom",
        "gravatar_id": "",
        "url": "https://api.github.com/users/brianchandotcom",
        "avatar_url": "https://avatars.githubusercontent.com/u/282080?"
    },
    "repo": {
        "id": 1920851,
        "name": "brianchandotcom/liferay-portal",
        "url": "https://api.github.com/repos/brianchandotcom/liferay-portal"
    },
    "payload": {
        "push_id": 6027092734,
        "size": 4,
        "distinct_size": 4,
        "ref": "refs/heads/master",
        "head": "91edd3c8c98c214155191feb852831ec535580ba",
        "before": "abb58cc0db673a0bd5190000d2ff9c53bb51d04d",
        "commits": [""]
    },
    "public": true,
    "created_at": "2020-11-13T18:00:00Z"
}
</code></pre>
plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01--create-table">01 Create table<a href="#01--create-table" class="hash-link" aria-label="01 Create table的直接链接" title="01 Create table的直接链接"></a></h3><ul><li><p>Create 3 columns of the Variant type: <code>actor</code>, <code>repo</code> and <code>payload</code></p></li><li><p>Meanwhile, create inverted index for the <code>payload</code> column: <code>idx_payload</code></p></li><li><p><code>USING INVERTED</code> specifies the index as inverted index, which accelerates conditional filtering on sub-columns</p></li></ul><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DATABASE</span><span class="token plain"> test_variant</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">USE</span><span class="token plain"> test_variant</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">IF</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">EXISTS</span><span class="token plain"> github_events </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> id </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token 
boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">type</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">VARCHAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">30</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> actor VARIANT </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> repo VARIANT </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> payload VARIANT </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">public</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BOOLEAN</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> created_at </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DATETIME</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INDEX</span><span class="token plain"> idx_payload </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">payload</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">USING</span><span class="token plain"> INVERTED PROPERTIES</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"parser"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"english"</span><span class="token punctuation" 
style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'inverted index for payload'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DUPLICATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">KEY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">id</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DISTRIBUTED</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HASH</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">id</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> BUCKETS </span><span class="token number">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">properties</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">"replication_num"</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">"1"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><blockquote><p>Note: If the <code>Payload</code> column has too many sub-columns, creating indexes on it may lead to an excessive number of index columns and decrease data writing performance. If the data analysis only involves equivalence queries, it is advisable to build Bloom Filter index on the Variant columns. This can bring better performance than inverted index. 
<h3 id="02--ingest-data-by-stream-load">02 Ingest data by Stream Load</h3>
<p>Load the <code>gh_2022-11-07-3.json</code> file, which contains one hour of GitHub Events records. Download it and ingest it via Stream Load; a successful load returns a JSON response like the following:</p>
<pre><code class="language-shell">wget http://doris-build-hk-1308700295.cos.ap-hongkong.myqcloud.com/regression/variant/gh_2022-11-07-3.json

curl --location-trusted -u root: -T gh_2022-11-07-3.json -H "read_json_by_line:true" -H "format:json" http://127.0.0.1:18148/api/test_variant/github_events/_stream_load

{
    "TxnId": 2,
    "Label": "086fd46a-20e6-4487-becc-9b6ca80281bf",
    "Comment": "",
    "TwoPhaseCommit": "false",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 139325,
    "NumberLoadedRows": 139325,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 633782875,
    "LoadTimeMs": 7870,
    "BeginTxnTimeMs": 19,
    "StreamLoadPutTimeMs": 162,
    "ReadDataTimeMs": 2416,
style="color:#F8F8F2"><span class="token plain"> "WriteDataTimeMs": 7634,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "CommitAndPublishTimeMs": 55</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Check if the data loading succeeds:</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)">-- Check the number of rows</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> github_events</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">----------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">----------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">139325</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">----------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">1</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">row</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.25</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">-- View a random row</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token operator">*</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> github_events </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">limit</span><span class="token plain"> </span><span class="token number">1</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+---------------------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> id </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 
249);font-style:italic">type</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> actor </span><span class="token operator">|</span><span class="token plain"> repo </span><span class="token operator">|</span><span class="token plain"> payload </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">public</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> created_at </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+---------------------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">25061821748</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> PushEvent </span><span class="token operator">|</span><span class="token plain"> {</span><span class="token string" style="color:rgb(255, 121, 198)">"gravatar_id"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"display_login"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"jfrog-pipelie-intg"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"url"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"https://api.github.com/users/jfrog-pipelie-intg"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"id"</span><span class="token plain">:</span><span class="token number">98024358</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"login"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"jfrog-pipelie-intg"</span><span class="token 
punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"avatar_url"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"https://avatars.githubusercontent.com/u/98024358?"</span><span class="token plain">} </span><span class="token operator">|</span><span class="token plain"> {</span><span class="token string" style="color:rgb(255, 121, 198)">"url"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"id"</span><span class="token plain">:</span><span class="token number">562683829</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"name"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"jfrog-pipelie-intg/jfinte2e_1667789956723_16"</span><span class="token plain">} </span><span class="token operator">|</span><span class="token plain"> {</span><span class="token string" style="color:rgb(255, 121, 198)">"commits"</span><span class="token plain">:</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token plain">{</span><span class="token string" style="color:rgb(255, 121, 198)">"sha"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"334433de436baa198024ef9f55f0647721bcd750"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"author"</span><span class="token plain">:{</span><span class="token string" style="color:rgb(255, 121, 198)">"email"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"98024358+jfrog-pipelie-intg@users.noreply.github.com"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"name"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"jfrog-pipelie-intg"</span><span class="token plain">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"message"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"commit message 10238493157623136117"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"distinct"</span><span class="token plain">:</span><span class="token boolean">true</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"url"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16/commits/334433de436baa198024ef9f55f0647721bcd750"</span><span class="token plain">}</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"before"</span><span 
class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"f84a26792f44d54305ddd41b7e3a79d25b1a9568"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"head"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"334433de436baa198024ef9f55f0647721bcd750"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"size"</span><span class="token plain">:</span><span class="token number">1</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"push_id"</span><span class="token plain">:</span><span class="token number">11572649828</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"ref"</span><span class="token plain">:</span><span class="token string" style="color:rgb(255, 121, 198)">"refs/heads/test-notification-sent-branch-10238493157623136113"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token string" style="color:rgb(255, 121, 198)">"distinct_size"</span><span class="token plain">:</span><span class="token number">1</span><span class="token plain">} </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">1</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">2022</span><span class="token operator">-</span><span class="token number">11</span><span class="token operator">-</span><span class="token number">07</span><span class="token plain"> </span><span class="token number">11</span><span class="token plain">:</span><span class="token number">00</span><span class="token plain">:</span><span class="token number">00</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+---------------------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">1</span><span class="token plain"> 
</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">row</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.23</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>View schema information via <code>desc</code>. The sub-columns will be automatically extended in the storage layer, and the data types of the sub-columns are automatically inferred.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)">-- No display of extended columns</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">desc</span><span class="token plain"> github_events</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------------+-------------+------+-------+---------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> Field </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">Type</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">Null</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">Key</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">Default</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Extra </span><span class="token operator">|</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------------+-------------+------+-------+---------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> id </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">No</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">type</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">VARCHAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">30</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor </span><span class="token operator">|</span><span class="token plain"> VARIANT </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> repo </span><span class="token operator">|</span><span class="token plain"> VARIANT </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span 
class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> payload </span><span class="token operator">|</span><span class="token plain"> VARIANT </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">public</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BOOLEAN</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> created_at </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DATETIME</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------------+-------------+------+-------+---------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">7</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 
249);font-style:italic">rows</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.01</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">-- Displaying extended columns of Variant columns</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> describe_extend_variant_column </span><span class="token operator">=</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token number">0</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> affected </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.01</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">desc</span><span class="token plain"> github_events</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------------------------------------------------------------+------------+------+-------+---------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> Field </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">Type</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">Null</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" 
style="color:rgb(189, 147, 249);font-style:italic">Key</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">Default</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Extra </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------------------------------------------------------------+------------+------+-------+---------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> id </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BIGINT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">No</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">true</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">type</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">VARCHAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor </span><span class="token operator">|</span><span class="token plain"> VARIANT </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span 
class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">avatar_url </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">display_login </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">id </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">INT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">login </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span 
class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> actor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">url </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> created_at </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DATETIME</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> payload </span><span class="token operator">|</span><span class="token plain"> VARIANT </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">action</span><span class="token plain"> </span><span class="token 
operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">before </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">comment</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">author_association </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">comment</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain">body </span><span class="token operator">|</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TEXT</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> Yes 
</span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">false</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> NONE </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token punctuation" style="color:rgb(248, 248, 242)">.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">------------------------------------------------------------+------------+------+-------+---------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">406</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.07</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>With the <code>desc</code> statement, you can specify which partition you want to check the schema of: </p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESCRIBE</span><span class="token plain"> ${table_name} </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">PARTITION</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">$partition_name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" 
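<p>For example, assuming <code>github_events</code> is partitioned by date and has a partition named <code>p20221108</code> (a hypothetical name for illustration), the statement would look like this:</p><pre><code class="language-sql">-- p20221108 is a hypothetical partition name; list the real ones with
-- SHOW PARTITIONS FROM github_events;
DESCRIBE github_events PARTITION (p20221108);
</code></pre>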
aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="03--query">03 Query<a href="#03--query" class="hash-link" aria-label="03 Query的直接链接" title="03 Query的直接链接"></a></h3><blockquote><p>Note: When filtering and aggregating sub-columns, an additional CAST operation is required to ensure data type consistency. This is because the storage types may not be fixed, and the <code>CAST</code> expression in SQL can unify the data types. For example, <code>SELECT * FROM tbl WHERE CAST(var['title'] AS TEXT) MATCH 'hello world'</code>.</p></blockquote><p><strong>The following are simple examples of queries on Variant columns</strong></p><ol><li>Retrieve the Top 5 repositories with the most Stars from <code>github_events</code>.</li></ol><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">repo</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"name"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">text</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> repo_name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> stars</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" 
style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> github_events</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">type</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'WatchEvent'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> repo_name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> stars </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESC</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">LIMIT</span><span class="token plain"> </span><span class="token number">5</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">--------------------------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> repo_name </span><span class="token operator">|</span><span class="token plain"> stars </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">--------------------------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> aplus</span><span class="token operator">-</span><span class="token plain">framework</span><span class="token operator">/</span><span class="token plain">app </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">78</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token 
operator">|</span><span class="token plain"> lensterxyz</span><span class="token operator">/</span><span class="token plain">lenster </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">77</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> aplus</span><span class="token operator">-</span><span class="token plain">framework</span><span class="token operator">/</span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">database</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">46</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> stashapp</span><span class="token operator">/</span><span class="token plain">stash </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">42</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> aplus</span><span class="token operator">-</span><span class="token plain">framework</span><span class="token operator">/</span><span class="token plain">image </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">34</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">--------------------------+-------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">5</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.03</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol start="2"><li>Count the number 
of events containing the keyword <code>doris</code>.</li></ol><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> github_events</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">'comment'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">'body'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">text</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">MATCH</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'doris'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">---------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">---------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">3</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">---------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">1</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">row</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.04</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol start="3"><li>Check the ID of the issue that has the most comments and the repository it belongs to.</li></ol><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">repo</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"name"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> string</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 
249);font-style:italic">as</span><span class="token plain"> repo_name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"issue"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"number"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> issue_number</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> comments</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">distinct</span><span class="token plain"> cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">actor</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"login"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> string</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> authors </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> github_events </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">type</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'IssueCommentEvent'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"action"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> string</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'created'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">cast</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">payload</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"issue"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token punctuation" style="color:rgb(248, 248, 242)">[</span><span class="token string" style="color:rgb(255, 121, 198)">"number"</span><span class="token punctuation" style="color:rgb(248, 248, 242)">]</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">as</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token 
plain"> </span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token number">10</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> repo_name</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> issue_number </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">HAVING</span><span class="token plain"> authors </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> comments </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> repo_name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token operator">-</span><span class="token operator">&gt;</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">LIMIT</span><span class="token plain"> </span><span class="token number">50</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">--------------------------------------+--------------+----------+---------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> repo_name </span><span class="token operator">|</span><span class="token plain"> issue_number </span><span class="token operator">|</span><span class="token plain"> comments </span><span class="token operator">|</span><span class="token plain"> authors </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">--------------------------------------+--------------+----------+---------+</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> facebook</span><span class="token operator">/</span><span class="token plain">react</span><span class="token operator">-</span><span class="token plain">native </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">35228</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">5</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> swsnu</span><span class="token operator">/</span><span class="token plain">swppfall2022</span><span class="token operator">-</span><span class="token plain">team4 </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">27</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">5</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">|</span><span class="token plain"> belgattitude</span><span class="token operator">/</span><span class="token plain">nextjs</span><span class="token operator">-</span><span class="token plain">monorepo</span><span class="token operator">-</span><span class="token plain">example </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">2865</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"> </span><span class="token operator">|</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token operator">+</span><span class="token comment" style="color:rgb(98, 114, 164)">--------------------------------------+--------------+----------+---------+</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token number">3</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">rows</span><span class="token plain"> </span><span class="token operator">in</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">set</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">0.03</span><span class="token plain"> sec</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><br></span></code></pre><div 
class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="04--notes">04 Notes<a href="#04--notes" class="hash-link" aria-label="04 Notes的直接链接" title="04 Notes的直接链接"></a></h3><p>Based on our test results, it is safe to say that there is no efficiency disparity between Variant dynamic columns and pre-defined static columns. However, in log data processing, when users need to add fields to the table, such as container labels in Kubernetes, JSON parsing and type inference during data writing incur additional overhead.</p><p>To strike a balance between flexibility and efficiency for the Variant data type, we recommend keeping the number of columns below 1000. A small number of columns will reduce overheads caused by data parsing and type inference and thus increase data writing performance.</p><p>It is also advisable to ensure field type consistency whenever possible. This is because Doris automatically performs compatible type conversions to unify fields of different data types. If it cannot find a compatible type, it will convert the data to the JSONB type, which may result in degraded performance compared to the int or text type.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="variant-vs-json">Variant VS JSON<a href="#variant-vs-json" class="hash-link" aria-label="Variant VS JSON的直接链接" title="Variant VS JSON的直接链接"></a></h2><p>To see how the newly added Variant type impacts data storage and queries, we did comparison tests on pre-defined static columns, Variant columns, and JSON columns with ClickBench.</p><p><strong>Test environment</strong>: 16 core, 64GB, AWS EC2 instance, 1TB ESSD</p><p><strong>Test result</strong>:</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01-storage-space">01 Storage space<a href="#01-storage-space" class="hash-link" aria-label="01 Storage space的直接链接" title="01 Storage space的直接链接"></a></h3><p>As the results show, storing data as Variant columns takes up a similar storage space to storing it as pre-defined static columns. Compared with the JSON type, the Variant type requires 65% less space. <strong>In other words, the Variant type only takes up one-third of the storage space that JSON does. The difference will be even more notable with low-cardinality data because of columnar storage.</strong></p><p><img loading="lazy" alt="Storage space" src="https://cdnd.selectdb.com/zh-CN/assets/images/storage-space-1a7ce7030524f8ce1553d8872503825a.png" width="1280" height="476" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02-query-performance">02 Query performance<a href="#02-query-performance" class="hash-link" aria-label="02 Query performance的直接链接" title="02 Query performance的直接链接"></a></h3><p>We tested with 43<a href="https://github.com/ClickHouse/ClickBench/blob/main/selectdb/queries.sql" target="_blank" rel="noopener noreferrer"> Clickbench</a> SQL queries. 
Queries on the Variant columns are about 10% slower than those on pre-defined static columns, and <strong>8 times faster than those on</strong> <strong>JSON</strong> <strong>columns</strong>. (For I/O reasons, most cold runs on JSONB data failed with OOM.) </p><p><img loading="lazy" alt="Query Performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/query-performance-9442542741fd22019a68e5807dc1c2fb.png" width="1280" height="394" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="design--implementation-of-variant">Design &amp; implementation of Variant<a href="#design--implementation-of-variant" class="hash-link" aria-label="Design &amp; implementation of Variant的直接链接" title="Design &amp; implementation of Variant的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01--data-writing--type-inference">01 Data writing &amp; type inference<a href="#01--data-writing--type-inference" class="hash-link" aria-label="01 Data writing &amp; type inference的直接链接" title="01 Data writing &amp; type inference的直接链接"></a></h3><p>In Apache Doris, this is a normal writing process: data sorting, merging, and Segment file generation in the Memtable. Variant writing works similarly. It involves type inference and data merging of the same JSON keys within the Memtable, resulting in the creation of a prefix tree. The tree keeps the type and column information of every JSON field, and merges all type information of the same column into the least common type, generates columns, encodes them into the Doris storage formats, and appends them to the segment.</p><p>Each Segment file not only contains data after type encoding and compaction, but also includes the metadata of dynamically generated columns. Such design ensures data integrity and queryability while also improving storage efficiency. <strong>By type inference and merging in the memory, the Variant type largely reduces disk space usage compared to traditional raw text storage</strong>. </p><p><img loading="lazy" alt="Data Writing &amp;amp; Type inferece" src="https://cdnd.selectdb.com/zh-CN/assets/images/data-writing-and-type-inference-ef3da2624eff730ddec2ebceaecfccfb.png" width="1280" height="364" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02-column-change-column-adding-or-column-type-changes">02 Column change (column adding or column type changes)<a href="#02-column-change-column-adding-or-column-type-changes" class="hash-link" aria-label="02 Column change (column adding or column type changes)的直接链接" title="02 Column change (column adding or column type changes)的直接链接"></a></h3><p>During the writing process, all metadata and data of the leaf nodes in the prefix tree will be appended to the Segment file, and the metadata of the Rowsets will be merged. Here is an example of the merging process:</p><p><img loading="lazy" alt="Column change (column adding or column type changes)" src="https://cdnd.selectdb.com/zh-CN/assets/images/column-change-c7807e83ad624ebb11654d5ebdcdcc70.png" width="1280" height="245" class="img_ev3q"></p><p>In the end, the Rowset will use the <code>Least Common Column Schema</code> as the metadata after data merging. (Least common column schema is a schema with the most sub-columns and the sub-column type being the least common type.) This allows for dynamic column extension and type changes. </p><p>Based on this mechanism, the stored schema for Variant can be considered data-driven. It offers greater flexibility compared to the Schema Change process in Doris. 
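<p>To make the least-common-type behavior concrete, here is a minimal sketch; the table <code>tbl</code> and its Variant column <code>var</code> are hypothetical names used for illustration only:</p><pre><code class="language-sql">-- The same key arrives with different types across writes:
INSERT INTO tbl VALUES (1, '{"a": 100}');          -- var['a'] is inferred as a small integer type
INSERT INTO tbl VALUES (2, '{"a": 10000000000}');  -- a wider value; the least common type widens to BIGINT
-- If a later row carried {"a": "hello"}, no common scalar type exists with BIGINT,
-- so var['a'] would fall back to the JSONB type, as the diagram below indicates.
</code></pre>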
<p>The diagram below illustrates the directions of type changes (type changes can only be performed in the direction indicated by the arrows, with JSONB being the common type for all types):</p><p><img loading="lazy" alt="Directions of type changes" src="https://cdnd.selectdb.com/zh-CN/assets/images/column-change2-4ae8a9d1a6389b9ca9bc49e5c84b3164.png" width="1248" height="474" class="img_ev3q"></p><h3 id="03-index--query-acceleration">03 Index &amp; query acceleration</h3><p>In Variant, the leaf nodes are stored in a columnar format in the Segment file, which is exactly the same as the storage format for static pre-defined columns. Thus, queries on Variant columns can also be accelerated by dictionary encoding, vectorization, and indexes (ZoneMap, inverted index, BloomFilter, etc.). Since the same column might be of different types in different files, users need to specify a type as a hint during query execution. Here are example queries:</p><pre><code class="language-sql">-- var['title'] accesses the 'title' sub-column of var, which is a Variant column.
-- If there is an inverted index for var, the query will be accelerated by it.
SELECT * FROM tbl WHERE CAST(var['title'] AS text) MATCH "hello world";

-- If there is a Bloom Filter for var, equivalence queries will be accelerated by it.
SELECT * FROM tbl WHERE CAST(var['id'] AS bigint) = 1010101;
</code></pre><p>Predicates are pushed down to the storage layer (Segment), where the storage type is checked against the target type of the CAST operation. If the types match, a more efficient predicate filtering mechanism is utilized. This approach reduces unnecessary data reading and conversion, thus improving query performance.</p>
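<p>For reference, an index on a Variant column is declared at table creation like one on a static column. Below is a minimal sketch with hypothetical table and index names; consult the documentation of your Doris version for the exact index options supported on Variant:</p><pre><code class="language-sql">CREATE TABLE tbl_with_idx (
    k   BIGINT,
    var VARIANT,
    INDEX idx_var (var) USING INVERTED  -- inverted index over var's sub-columns
)
DUPLICATE KEY(k)
DISTRIBUTED BY HASH(k) BUCKETS 8
PROPERTIES ("replication_num" = "1");
</code></pre>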
<h3 id="04--storage-optimization-for-sparse-columns">04 Storage optimization for sparse columns</h3><p>Examples of sparse JSON columns:</p><pre><code class="language-json">{"a": [1], "b": 2, "c": 3, "x_1": 1, "x_2": "3"}
{"a": 1, "b": 2, "c": 3, "x_1": 1, "x_2": "3"}
{"a": 4, "b": 5, "c": 6, "x_3": 1, "x_4": "3"}
{"a": 7, "b": 8, "c": 9, "x_5": 1, "x_6": "3"}
...
</code></pre><p>The <code>a</code>, <code>b</code>, and <code>c</code> columns are dense: they appear in almost every row. The <code>x_?</code> columns are sparse: only a few of them are non-null. If the system stored every column in a columnar way, it would suffer huge storage pressure and exploding metadata.</p>
The highly sparse columns (with a high proportion of null values) are packed into JSONB encoding and stored in a separate column.</p><p><img loading="lazy" alt="Storage optimization for sparse columns" src="https://cdnd.selectdb.com/zh-CN/assets/images/storage-optimization-for-sparse-columns-4ece65d5d623312590b94532dc8c0237.png" width="1280" height="684"></p><p>Such optimization for storing sparse columns relieves pressure on metadata and data compaction and increases flexibility.</p><p>Queries on the sparse columns are implemented in exactly the same way as those on other columns.</p><h2 id="use-case">Use case</h2><p>GuanceDB, an observability platform, used an Elasticsearch-based solution for storing logs and user behavior data. However, Elasticsearch has inadequate schemaless support, so it is inefficient in processing large amounts of user-defined fields. Under the Dynamic Mapping mechanism in Elasticsearch, frequent field type conflicts led to data losses and required lots of human intervention. Meanwhile, the writing process in Elasticsearch was resource-intensive, and its performance in massive data aggregation was less than ideal.</p><p>For a data architectural upgrade, GuanceDB worked with <a href="https://www.velodb.io/">VeloDB</a> to build an Apache Doris-based observability solution. They utilize the Variant data type to realize partition-based schema change, which is more flexible and efficient. In addition, Doris imposes no upper limit on the number of columns, meaning that it can better accommodate schema-free data.</p><p>The Doris-based solution also delivers lower CPU usage in data writing and higher speed in complicated data aggregation (accelerated by inverted index and query optimization techniques). After the upgrade, GuanceDB <strong>decreased their machine costs by 70% and doubled their overall query speed</strong>, with an over 4-fold performance increase in simple queries.</p><h2 id="conclusion">Conclusion</h2><p>The Variant data type had stood the test of many users before the official release of Apache Doris 2.1.0. It is production-available now. In the future, we plan to realize more lightweight schema changes for Variant to facilitate data modeling.</p><p>For more information about Variant and guides on how to build a semi-structured data analytics solution for your case, come talk to the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA">Apache Doris developer team</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Another big leap: Apache Doris 2.1.0 is released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.1.0</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.1.0"/>
<updated>2024-03-12T00:00:00.000Z</updated>
<summary type="html"><![CDATA[We appreciate the 237 contributors who made nearly 6000 commits in total to the Apache Doris project, and the nearly 100 enterprise users who provided valuable feedback.]]></summary>
<content type="html"><![CDATA[<p>Dear Apache Doris community, we are thrilled to announce the advent of Apache Doris 2.1.0. In this version, you can expect:</p><ul><li><p><strong>Higher out-of-the-box query performance</strong>: 100% faster speed proven by TPC-DS 1TB benchmark tests.</p></li><li><p><strong>Improved data lake analytics capabilities</strong>: 4~6 times faster than Trino and Spark, compatibility with various SQL dialects for smooth migration, read/write interface based on Arrow Flight for 100 times faster data transfer.</p></li><li><p><strong>Solid support for semi-structured data analysis</strong>: a newly-added Variant data type, support for more IP types, and a more comprehensive suite of analytic functions.</p></li><li><p><strong>Materialized view with multiple tables</strong>: a new feature to accelerate multi-table joins, allowing transparent rewriting, auto refresh, materialized views of external tables, and direct query.</p></li><li><p><strong>Enhanced real-time writing efficiency</strong>: faster data writing at scale powered by AUTO_INCREMENT column, AUTO PARTITION, forward placement of MemTable, and Group Commit. </p></li><li><p><strong>Better workload management</strong>: optimizations of the Workload Group mechanism for higher performance stability and the display of SQL resource consumption in the runtime.</p></li></ul><p>We appreciate the 237 contributors who made nearly 6000 commits in total to the Apache Doris project, and the nearly 100 enterprise users who provided valuable feedback. We will keep aiming for the stars with our agile release planning, and we appreciate your feedback in the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris developer and user community</a>. </p><p><strong>Download from GitHub</strong>: <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><p><strong>Download from website</strong>: <a href="https://doris.apache.org/download" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="higher-performance">Higher performance<a href="#higher-performance" class="hash-link" aria-label="Higher performance的直接链接" title="Higher performance的直接链接"></a></h2><p>Apache Doris V2.1 makes a big leap in out-of-the-box query performance. It can deliver high performance even for complicated SQL queries without any fine-tuning. TPC-DS 1TB benchmark tests with 1 Frontend and 3 Backends (48C, 192G each) show that:</p><ul><li><p>The total query execution time of V2.1.0 is 245.7s, <strong>up 100%</strong> from the 489.6s of V2.0.5;</p></li><li><p>V2.1 is more than twice as fast as V2.0.5 on one-third of the total 99 SQL queries, and outperforms V2.0.5 on over 80 of the SQL queries; </p></li><li><p>V2.1 delivers better performance in data filtering, sorting, aggregation, multi-table joins, sub-queries, and window function computation.</p></li></ul><p><img loading="lazy" alt="2.1-Doris-TPC-DS-higher-performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-Doris-TPC-DS-best-performance-196cfa22d1783b3e3a367bed4e300dd1.png" width="1280" height="675" class="img_ev3q"></p><p>Meanwhile, we have compared Doris V2.1.0 against many other OLAP systems with the same hardware environment under various data sizes. 
<strong>Recurring results show that Doris is consistently far ahead</strong>.</p><h3 id="smarter-optimizer">Smarter optimizer</h3><p>In our last big release, we introduced a new query optimizer that enables fast performance for most use cases without any manual fine-tuning. The V2.1 query optimizer is an upgrade on that basis. It comes with:</p><ul><li><p><strong>More solid infrastructure</strong>: We have improved the statistics-based inference and the cost model that underpin the query optimizer, so it can collect statistical information from a wider range to undertake more complicated optimization tasks.</p></li><li><p><strong>Extended optimization rules</strong>: Absorbing feedback from our many actual use cases, we have improved many frequently used rules (operator pushdown, etc.) and introduced new rules to fit more scenarios.</p></li><li><p><strong>Enhanced enumeration framework</strong>: Building on Cascades and DPhyper, the V2.1 query optimizer has a clearer enumeration strategy that achieves a better balance between plan quality and optimization time. For example, we have raised the default limit of query plans in the enumeration table from 5 to 8, and sharpened the DPhyper enumeration capabilities to produce better query plans.</p></li></ul><h3 id="better-heuristic-optimization">Better heuristic optimization</h3><p>In large-scale data analytics or data lake scenarios, it is always challenging and time-consuming to collect statistical information to inform query plans. For that, the V2.1 query optimizer, leveraging a combination of heuristic technologies, is able to generate <strong>high-quality query plans without statistical references</strong>. Meanwhile, the RuntimeFilter is part of the trick: it is now more self-adaptive and can adjust the predicates in expressions during execution, enabling higher performance even without statistical information.</p><h3 id="parallel-adaptive-scan">Parallel Adaptive Scan</h3><p>A complex data query involves scanning large amounts of data, during which scan I/O can be the bottleneck for query execution speed. That's why we have Parallel Scan, which means one scan thread can read multiple tablets (buckets). However, that is highly dependent on the bucket number set for data partitioning in the first place. If the user has set an inappropriate number of buckets, the scan threads will not be able to work in parallel.</p><p>That's why we have adopted Parallel Adaptive Scan in Doris V2.1. The tablets are pooled so the scanning process can be divided among a flexible number of threads based on the total number of rows (the upper limit is 48 threads).
In this way, users no longer have to worry that their query speed might be dragged down by unreasonable bucket numbers.</p><p><img loading="lazy" alt="Parallel Adaptive Scan" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-parallel-adaptive-scan-ac95c58d6afe58c11121e37ee355685e.png" width="1280" height="424"></p><p>In 2.1 and future versions, we recommend that you set <strong>the number of buckets equal to the total number of disks in the cluster</strong>, in order to fully utilize the I/O resources of the entire cluster.</p><p><strong>Note</strong>: Parallel Adaptive Scan is currently available for the Duplicate Key model and the Merge-on-Write tables of the Unique Key model. We plan to add it to the Aggregate Key model and the Merge-on-Read tables of the Unique Key model in version 2.1.1.</p><h3 id="local-shuffle">Local Shuffle</h3><p>We have introduced Local Shuffle in V2.1 to prevent uneven data distribution. Benchmark tests show that Local Shuffle in combination with Parallel Adaptive Scan can guarantee fast query performance despite unreasonable bucket number settings upon table creation.</p><p>For queries across multiple instances, uneven data distribution can prolong the query execution time. To address data skew across instances on a single backend (BE), Local Shuffle aims to shuffle and distribute data as evenly as possible, thereby accelerating queries. For example, in a typical aggregation query, a Local Shuffle node redistributes the data evenly across different pipeline tasks before the data is aggregated.</p><p><img loading="lazy" alt="Local Shuffle" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-local-shuffle-6f7365d5417f254f369c691827d034eb.png" width="1220" height="1280"></p><p>For a proof of concept, we simulated unreasonable bucket number settings. First, we used the ClickBench dataset and ran flat-table queries with the bucket number set to 1 and 16, respectively. Then, we used the TPC-H 100G dataset and ran join queries with 1 bucket and 16 buckets in each partition, respectively.
Results from the runs show minimal fluctuations, which means the combination of Parallel Adaptive Scan and Local Shuffle is able to guarantee high query performance even with inappropriately sharded or unevenly distributed data.</p><p><img loading="lazy" alt="Clickbench and Local Shuffle" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-Clickbench-and-Local-shuffle-1b3c08174cf021f9bb10a1b0ea1c3b72.png" width="1280" height="701"></p><p><strong>Note</strong>: See doc: <a href="https://doris.apache.org/docs/query-acceleration/pipeline-x-execution-engine/">https://doris.apache.org/docs/query-acceleration/pipeline-x-execution-engine/</a></p><h2 id="increase-performance-on-arm">Increased performance on ARM</h2><p>V2.1 is specifically adapted to and optimized for the ARM architecture. Compared to Doris 2.0.3, it has achieved over 100% performance improvement on multiple test datasets:</p><ul><li><p><strong>ClickBench large flat-table queries</strong>: The execution time of the 43 SQL queries for V2.1 adds up to 30.73 seconds, compared to 102.36 seconds for V2.0.3, representing a <strong>230%</strong> speedup.</p></li><li><p><strong>TPC-H multi-table joins</strong>: The execution time of the 22 SQL queries for V2.1 adds up to 90.4 seconds, compared to 174.8 seconds for V2.0.3, representing a <strong>93%</strong> speedup.</p></li></ul><h2 id="improved-data-lake-analytics-capabilities">Improved data lake analytics capabilities</h2><h3 id="data-lake-analytic-performance">Data lake analytic performance</h3><p>V2.1 also reaches new heights in data lake analysis. According to TPC-DS benchmark tests (1TB) of Doris V2.1 against Trino V435:</p><ul><li><p>Without caching, Apache Doris <strong>cuts total execution time by 45% compared to Trino</strong> (717s vs. 1296s). Specifically, Doris outperforms Trino on 80% of the total 99 SQL queries.</p></li><li><p>If you enable file cache, you can expect another 2.2-fold speedup from Doris (323s).
<strong>That is 4 times the speed of Trino, with a straight win in all 99 SQL queries.</strong></p></li></ul><p>In addition, TPC-DS 10TB benchmark tests show that Apache Doris 2.1 is 4.2 times as fast as Spark 3.5.0 and 6.1 times as fast as Spark 3.3.1.</p><p><img loading="lazy" alt="Data lake analytic performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-TPC-DS-326352a5581f27b587ff6e7d451c0ee4.png" width="950" height="550"></p><p>This is achieved by a series of optimizations in I/O for HDFS and object storage, Parquet/ORC file reading, floating-point decompression, predicate pushdown, caching, and scan task scheduling. It is also built upon a more precise cost model in the optimizer and more accurate statistics collection for different data sources.</p><h3 id="sql-dialects-compatibility">SQL dialects compatibility</h3><p>SQL incompatibility used to bother our users when they migrated from their existing OLAP systems (built on ClickHouse, Trino, Presto, Hive, etc.) to Doris, because they had to modify and update a significant amount of business query logic. Also, if they tried to use Doris as a unified data analysis gateway, they would need to integrate it with their Hive or Spark systems, and incompatible SQL could make that tough.</p><p>To facilitate a smooth migration or integration, we have enabled SQL dialect conversion in V2.1. Users can continue using the SQL dialect they are used to after simply setting the SQL dialect type for the current session in Doris.</p><p>So far, the ClickHouse, Presto, Trino, Hive, and Spark SQL dialects are supported in this experimental feature. For example, with <code>set sql_dialect = "trino"</code>, you can perform queries using Trino SQL syntax without any modifications. Tests in user production environments show that Doris V2.1 is compatible with 99% of Trino SQL.</p>
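<p>As a minimal sketch of what this looks like in a session (the <code>events</code> table and the query below are illustrative, not from the release notes):</p><pre><code class="language-sql">-- Switch the current session to the Trino dialect.
set sql_dialect = "trino";

-- A query written in Trino syntax can now run as-is against a
-- hypothetical `events` table; Trino-specific functions such as
-- approx_distinct are expected to be mapped to Doris equivalents
-- by the dialect converter.
SELECT event_date, approx_distinct(user_id) AS uv
FROM events
GROUP BY event_date;
</code></pre>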
</p><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>备注</div><div class="admonitionContent_S0QG"><p>See Doc: <a href="https://doris.apache.org/docs/lakehouse/sql-dialect/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/lakehouse/sql-dialect/</a></p></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="high-speed-data-interface-for-100-fold-performance">High-speed data interface for 100-fold performance<a href="#high-speed-data-interface-for-100-fold-performance" class="hash-link" aria-label="High-speed data interface for 100-fold performance的直接链接" title="High-speed data interface for 100-fold performance的直接链接"></a></h3><p>Most big data systems today adopt columnar in-memory data formats and interact with other database systems using MySQL/JDBC/ODBC protocols. That means during data transfer, there is a need to covert the data from columnar format to row-based format to fit in with the MySQL/JDBC/ODBC protocols, and then vice versa. This serialization and deserialization process slows down the data transfer speed, which becomes more noticeable when the data size is huge, like in data science scenarios.</p><p>Apache Arrow is a columnar in-memory format designed for large-scale data processing. It has efficient data structures that facilitate faster data transfer across different systems. If both the source database and target client support Arrow Flight SQL protocol, data transfer between them will entail no data serialization and deserialization. That can cut down a huge chunk of overheads. Moreover, Arrow Flight can give full play to the multi-node and multi-core architecture to parallelize operations and thus increase throughputs.</p><p><img loading="lazy" alt="High-speed data interface for 100-fold performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-arrow-flight-3585c9cac786689f50ad42c8b76d3b8e.png" width="1280" height="436" class="img_ev3q"></p><p>Reading data from Apache Doris using Python used to be a complex process. Firslty, data blocks in Doris had to be converted from its columnar format into row-based bytes. Then, in the Python client, the data had to be deserialized into a Pandas data structure. These steps largely slow down data transfer.</p><p>Now this is revolutionized in Doris V2.1, where we provide a high-throughput data read/write interface based on Arrow Flight: HTTP Data API. Using Arrow Flight SQL, Doris converts the columnar data blocks into Arrow RecordBatch, which is also in columnar format. Then, in the Python client, Arrow RecordBatch is converted into column-oriented Pandas DataFrame. Both conversions are highly efficient and involve no serialization and deserialization. 
<p>This allows fast data access to Apache Doris by data science tools like Pandas and NumPy, which means Apache Doris can be seamlessly integrated with the entire AI and data science ecosystem. This unveils a future of endless possibilities.</p><pre><code class="language-python"># Connect to Doris over Arrow Flight SQL via the ADBC Flight SQL driver
# (imports added here so the snippet is self-contained).
import adbc_driver_manager
import adbc_driver_flightsql.dbapi as flight_sql

conn = flight_sql.connect(uri="grpc://127.0.0.1:9090", db_kwargs={
        adbc_driver_manager.DatabaseOptions.USERNAME.value: "user",
        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "pass",
    })
cursor = conn.cursor()
cursor.execute("select * from arrow_flight_sql_test order by k0;")
# Fetch the result as Arrow record batches, then convert to a Pandas DataFrame.
print(cursor.fetchallarrow().to_pandas())
</code></pre><p>According to our comparative tests using different MySQL clients for the common data types, the Arrow Flight SQL protocol delivers almost 100 times faster performance than the MySQL protocol in data transfer.</p><p><img loading="lazy" alt="MySQL protocol" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-arrow-flight-sql-f8707eb6a387ed03a83fd20c373e810a.png" width="1280" height="502"></p><h3 id="other-improvements">Other improvements</h3><ul><li><p>Paimon Catalog: upgraded to Paimon 0.6.0; optimized reading of Read Optimized tables, which can bring a 10-fold speedup when Paimon data is fully merged</p></li><li><p>Iceberg Catalog: upgraded to Iceberg 1.4.3; fixed compatibility issues in AWS S3 authentication</p></li><li><p>Hudi Catalog: upgraded to Hudi 0.14.1; fixed compatibility issues in the Hudi Flink Catalog</p></li></ul><h2 id="materialized-view-with-multiple-tables">Materialized view with multiple tables</h2><p>As a typical "trade disk space for time" strategy, materialized views pre-compute and store SQL query results so that when the same queries are
requested, the materialized view table can directly provide the results. This increases query performance and reduces resource consumption by avoiding repetitive computation.</p><p>Previous versions of Doris offer strong consistency for single-table materialized views, ensuring atomicity between the base table and the materialized view table. They also support smart routing for query statements on materialized views, allowing for efficient query execution.</p><p><strong>What's more exciting is that, in V2.1, we have introduced materialized views with multiple tables (also known as <a href="https://doris.apache.org/docs/query-acceleration/async-materialized-view/">asynchronous materialized views</a>).</strong> As the name implies, you can build a materialized view across tables. It can be based on full data or incremental data, and it can be refreshed manually or periodically. For multi-table joins or large data scale scenarios, the optimizer transparently rewrites queries based on the cost model and automatically searches for the right materialized view for <strong>optimal query performance</strong>. You can build asynchronous materialized views for external tables, and you can perform queries on these views directly. In other words, <strong>this can be a game changer for data warehouse layering, data modeling, job scheduling, and data processing</strong>.</p><p>Now let's get started:</p><p><strong>1. Create the tables:</strong></p><pre><code class="language-sql">use tpch;

CREATE TABLE IF NOT EXISTS orders (
    o_orderkey integer not null,
    o_custkey integer not null,
    o_orderstatus char(1) not null,
    o_totalprice decimalv3(15,2) not null,
    o_orderdate date not null,
    o_orderpriority char(15) not null,
    o_clerk char(15) not null,
    o_shippriority integer not null,
    o_comment varchar(79) not null
)
DUPLICATE KEY(o_orderkey, o_custkey)
PARTITION BY RANGE(o_orderdate)(
    FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY)
DISTRIBUTED BY HASH(o_orderkey) BUCKETS 3
PROPERTIES ("replication_num" = "1");

insert into orders values
    (1, 1, 'ok', 99.5, '2023-10-17', 'a', 'b', 1, 'yy'),
    (2, 2, 'ok', 109.2, '2023-10-18', 'c', 'd', 2, 'mm'),
    (3, 3, 'ok', 99.5, '2023-10-19', 'a', 'b', 1, 'yy');

CREATE TABLE IF NOT EXISTS lineitem (
    l_orderkey integer not null,
    l_partkey integer not null,
    l_suppkey integer not null,
    l_linenumber integer not null,
    l_quantity decimalv3(15,2) not null,
    l_extendedprice decimalv3(15,2) not null,
    l_discount decimalv3(15,2) not null,
    l_tax decimalv3(15,2) not null,
    l_returnflag char(1) not null,
    l_linestatus char(1) not null,
    l_shipdate date not null,
    l_commitdate date not null,
    l_receiptdate date not null,
    l_shipinstruct char(25) not null,
    l_shipmode char(10) not null,
    l_comment varchar(44) not null
)
DUPLICATE KEY(l_orderkey, l_partkey, l_suppkey,
              l_linenumber)
PARTITION BY RANGE(l_shipdate)
    (FROM ('2023-10-17') TO ('2023-10-20') INTERVAL 1 DAY)
DISTRIBUTED BY HASH(l_orderkey) BUCKETS 3
PROPERTIES ("replication_num" = "1");

insert into lineitem values
    (1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-17', '2023-10-17', '2023-10-17', 'a', 'b', 'yyyyyyyyy'),
    (2, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 'o', 'k', '2023-10-18', '2023-10-18', '2023-10-18', 'a', 'b', 'yyyyyyyyy'),
    (3, 2, 3, 6, 7.5, 8.5, 9.5, 10.5, 'k', 'o', '2023-10-19', '2023-10-19', '2023-10-19', 'c', 'd', 'xxxxxxxxx');

CREATE TABLE IF NOT EXISTS partsupp (
    ps_partkey INTEGER NOT NULL,
    ps_suppkey INTEGER NOT NULL,
    ps_availqty INTEGER NOT NULL,
    ps_supplycost DECIMALV3(15,2) NOT NULL,
    ps_comment VARCHAR(199) NOT NULL
)
DUPLICATE KEY(ps_partkey, ps_suppkey)
DISTRIBUTED BY HASH(ps_partkey) BUCKETS 3
PROPERTIES (
    "replication_num" = "1"
);
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>2. Create materialized view:</strong></p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE MATERIALIZED VIEW mv1 </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> BUILD DEFERRED REFRESH AUTO ON MANUAL</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partition by(l_shipdate)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> DISTRIBUTED BY RANDOM BUCKETS 2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> PROPERTIES ('replication_num' = '1') </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AS </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select l_shipdate, o_orderdate, l_partkey, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_suppkey, sum(o_totalprice) as sum_total</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> left join orders on lineitem.l_orderkey = orders.o_orderkey </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate = o_orderdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_shipdate,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderdate,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_partkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_suppkey;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>To sum up, asynchronous materialized view in V2.1 supports:</p><ul><li><p><strong>Transparent rewriting</strong>: transparently rewrites common operators including Select, Where, Join, Group By, and Aggregation, for faster query speed. For example, in BI reporting, you can create materialized views for some particularly slow queries.</p></li><li><p><strong>Auto refresh</strong>: periodic refresh, manual refresh, full refresh, (partition-based) incremental refresh.</p></li><li><p><strong>Materialized view of external tables</strong>: You can build materialized views based on external data such as Hive, Hudi, and Iceberg tables. 
<h2 id="enhanced-storage">Enhanced storage</h2><h3 id="auto_increment-column">AUTO_INCREMENT column</h3><p>The AUTO_INCREMENT column is a common feature in OLTP databases. It provides an efficient way to automatically assign unique identifiers to newly inserted data rows. However, it is less commonly found in distributed OLAP databases because the value allocation for AUTO_INCREMENT columns involves global transactions.</p><p>As an MPP-based OLAP system, Apache Doris V2.1 implements the AUTO_INCREMENT column with an innovative pre-allocation strategy. Leveraging the uniqueness guarantee provided by AUTO_INCREMENT, users can achieve efficient dictionary encoding and query pagination.</p><p><strong>Dictionary encoding</strong>: The AUTO_INCREMENT column is helpful for queries that require accurate deduplication, such as PV/UV calculation or user segmentation. Utilizing an AUTO_INCREMENT column, you can create a dictionary table for string values like UserID or OrderID. Simply writing user data in batches or in real time to the dictionary table can generate a dictionary.
Then, by applying various dimensional conditions, the corresponding bitmaps can be aggregated.</p><pre><code class="language-sql">CREATE TABLE `demo`.`dictionary_tbl` (
    `user_id` varchar(50) NOT NULL,
    `aid` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
    "replication_allocation" = "tag.location.default: 3",
    "enable_unique_key_merge_on_write" = "true"
);
</code></pre>
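<p>As a minimal sketch of how this dictionary could then be used (the <code>events_tbl</code> fact table and all column names below are hypothetical): look up each string ID's compact <code>aid</code>, aggregate the values into bitmaps per dimension, and count them for accurate deduplication.</p><pre><code class="language-sql">-- Hypothetical aggregate table holding one bitmap of encoded IDs
-- per (date, channel) combination.
CREATE TABLE `demo`.`uv_agg_tbl` (
    `dt` date NOT NULL,
    `channel` varchar(32) NOT NULL,
    `user_bitmap` BITMAP BITMAP_UNION
) ENGINE=OLAP
AGGREGATE KEY(`dt`, `channel`)
DISTRIBUTED BY HASH(`dt`) BUCKETS 8
PROPERTIES ("replication_num" = "1");

-- Load by joining the raw events to the dictionary table,
-- turning each string user_id into its compact aid.
INSERT INTO `demo`.`uv_agg_tbl`
SELECT e.dt, e.channel, to_bitmap(d.aid)
FROM `demo`.`events_tbl` e
JOIN `demo`.`dictionary_tbl` d ON e.user_id = d.user_id;

-- Accurate UV per channel via bitmap aggregation.
SELECT channel, bitmap_union_count(user_bitmap) AS uv
FROM `demo`.`uv_agg_tbl`
GROUP BY channel;
</code></pre>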
<p><strong>Query pagination</strong>: Pagination is often necessary when displaying data on a webpage. Traditional pagination typically involves using <code>limit</code>, <code>offset</code> + <code>order by</code> in SQL queries. However, this can be inefficient for deep pagination queries, because even if only a small portion of the data is requested, the database still needs to read and sort the entire dataset. This is addressed by the AUTO_INCREMENT column: it generates a unique identifier for each row, so the program only needs to remember the maximum identifier from the previous page and use it as the reference for retrieving the next page.</p><p>The following is an example, where <code>unique_value</code> is an AUTO_INCREMENT column.</p><pre><code class="language-sql">CREATE TABLE `demo`.`records_tbl2` (
    `key` int(11) NOT NULL COMMENT "",
    `name` varchar(26) NOT NULL COMMENT "",
    `address` varchar(41) NOT NULL COMMENT "",
    `city` varchar(11) NOT NULL COMMENT "",
    `nation` varchar(16) NOT NULL COMMENT "",
    `region` varchar(13) NOT NULL COMMENT "",
    `phone` varchar(16) NOT NULL COMMENT "",
    `mktsegment` varchar(11) NOT NULL COMMENT "",
    `unique_value` BIGINT NOT NULL AUTO_INCREMENT
) DUPLICATE KEY (`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
    "replication_num" = "3"
);
</code></pre><p>In a pagination display where each page shows 100 records, this is how you can fetch the data of the first page:</p><pre><code class="language-sql">select * from records_tbl2 order by unique_value limit 100;
</code></pre><p>The program marks down the maximum <code>unique_value</code> in the returned result (assuming it is 99). This is how you can fetch the data of the second page:</p><pre><code class="language-sql">select * from records_tbl2 where unique_value &gt; 99 order by unique_value limit 100;
</code></pre><p>If you need data from a much later page, for example page 101, it is difficult to know the maximum <code>unique_value</code> of page 100 in advance, so this is how you can perform the query:</p><pre><code class="language-sql">select key, name, address, city, nation, region, phone, mktsegment
from records_tbl2, (select unique_value as max_value from records_tbl2 order by unique_value limit 1 offset 9999) as previous_data
where records_tbl2.unique_value &gt; previous_data.max_value
order by unique_value limit 100;
</code></pre>
<p><strong>Note</strong>: See doc: <a href="https://doris.apache.org/docs/advanced/auto-increment/">https://doris.apache.org/docs/advanced/auto-increment/</a></p><h3 id="auto-partition">AUTO PARTITION</h3><p>Before V2.1, Doris required users to manually create data partitions before data ingestion; otherwise, data loading would simply fail. To ease the burden on operations and maintenance, V2.1 introduces AUTO PARTITION. Upon data ingestion, Doris checks whether a partition exists for the data based on the partitioning column. If not, it automatically creates the partition and starts the ingestion.</p><p>To apply AUTO PARTITION in Doris:</p><pre><code class="language-sql">CREATE TABLE `DAILY_TRADE_VALUE`
(
    `TRADE_DATE` datev2 NULL COMMENT 'Trade Date',
    `TRADE_ID` varchar(40) NULL COMMENT 'Trade ID',
    ......
)
UNIQUE KEY(`TRADE_DATE`, `TRADE_ID`)
AUTO PARTITION BY RANGE date_trunc(`TRADE_DATE`, 'year')
(
)
DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
PROPERTIES (
    "replication_num" = "1"
);
</code></pre>
class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><div class="theme-admonition theme-admonition-tip alert alert--success admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>提示</div><div class="admonitionContent_S0QG"><ol><li><p>Currently, you can only specify one partitioning column for AUTO PARTITION, and it has to be NOT NULL.</p></li><li><p>It supports AUTO PARTITION by Range or by List. For the former, it supports <code>date_trunc</code> as the partitioning function, and <code>DATE</code> or <code>DATETIME</code> format for the partitioning column. For the latter, it does not support function calling, it supports <code>BOOLEAN</code>, <code>TINYINT</code>, <code>SMALLINT</code>, <code>INT</code>, <code>BIGINT</code>, <code>LARGEINT</code>, <code>DATE</code>, <code>DATETIME</code>, <code>CHAR</code>, and <code>VARCHAR</code> for the partitioning column, and the values are enumeration values.</p></li><li><p>For AUTO PARTITION by List, if there is no partition for a value in the partitioning column, Doris will create one for it.</p></li></ol></div></div><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>备注</div><div class="admonitionContent_S0QG"><p>See doc: <a href="https://doris.apache.org/zh-CN/docs/table-design/data-partition#%E8%87%AA%E5%8A%A8%E5%88%86%E5%8C%BA" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/table-design/data-partition#%E8%87%AA%E5%8A%A8%E5%88%86%E5%8C%BA</a></p></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="100-faster-insert-into-select">100% faster INSERT INTO SELECT<a href="#100-faster-insert-into-select" class="hash-link" aria-label="100% faster INSERT INTO SELECT的直接链接" title="100% faster INSERT INTO SELECT的直接链接"></a></h3><p><code>INSERT INTO…SELECT</code> is one of the most frequently used statements in ETL. 
<p><strong>Note:</strong> See doc: <a href="https://doris.apache.org/zh-CN/docs/table-design/data-partition#%E8%87%AA%E5%8A%A8%E5%88%86%E5%8C%BA" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/table-design/data-partition#%E8%87%AA%E5%8A%A8%E5%88%86%E5%8C%BA</a></p><h3 id="100-faster-insert-into-select">100% faster INSERT INTO SELECT</h3><p><code>INSERT INTO…SELECT</code> is one of the most frequently used statements in ETL. It enables fast data migration, transformation, cleaning, and aggregation. That's why we've been optimizing its performance. In V2.0, we introduced Single Replica Load to reduce repetitive data writing and data compaction.</p><p>For further improvement, in V2.1, we have moved the execution of MemTable forward to reduce data ingestion overheads. Tests show that this can <strong>double the data ingestion speed in most cases compared to V2.0</strong>.</p><p><img loading="lazy" alt="100% faster INSERT INTO SELECT" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-INSERT-INTO-SELECT-EN-381ebcb56119c8c3020120fde276cfc3.png" width="1280" height="751"></p><p>The process before and after moving the MemTable execution forward is compared above. The Sink node no longer sends encoded data blocks; instead, it processes the MemTable locally and sends the generated segments to the downstream nodes. This removes the overhead of repeated data encoding and makes memory backpressure faster and more accurate. In addition, we have replaced Ping-Pong RPC with Streaming RPC, so there is less waiting during data transfer.</p><p>We've run tests to see how moving the MemTable execution forward impacts data ingestion performance.</p><p><strong>Test environment</strong>: 1 Frontend + 3 Backends, 16C 64G per node, 3 high-performance cloud disks (to make sure that disk I/O is not a bottleneck)</p><p><strong>Test results</strong>:</p><p>In single-replica ingestion, the execution time of V2.1 is only 36% of what it takes V2.0 to finish. In three-replica ingestion, that figure is 54%. This means V2.1 has sped up <code>INSERT INTO…SELECT</code> by more than 100% in general.</p><p><img loading="lazy" alt="Insert into table" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-insert-into-select-0a877bdc14d4fabffc7e0f5da6c1c4d3.png" width="3501" height="1031"></p><p><strong>Note:</strong> V2.1 moves the MemTable execution forward by default, so you don't have to modify the data ingestion command. You can return to the old ingestion method by setting <code>enable_memtable_on_sink_node=false</code> in the MySQL connection, as shown below.</p>
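<p>A minimal sketch of reverting to the pre-2.1 ingestion path for the current session; the variable name comes from the note above:</p><pre><code class="language-SQL">-- fall back to the old ingestion path; new sessions default to the MemTable-forward path
SET enable_memtable_on_sink_node = false;
</code></pre>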
<h3 id="high-concurrency-real-time-data-ingestion--group-commit">High-concurrency real-time data ingestion / Group Commit</h3><p>For data writing, V2.1 has a back pressure mechanism to avoid excessive data versions and thus reduce the resource consumption caused by data version merging.</p><p>During data ingestion, data batches are written to an in-memory table and then written to disk as individual RowSet files. Each RowSet file corresponds to a specific data import version. The background compaction process automatically merges the RowSets, combining the small ones into a big one in order to increase query speed and reduce storage consumption. However, each compaction process consumes CPU, memory, and disk I/O resources. The more frequently data is written, the more RowSets are generated, and the more resources compaction consumes. The back pressure mechanism is the solution to this: it throws a -235 error when there are too many data versions.</p><p><img loading="lazy" alt="High-concurrency real-time data ingestion" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-group-commit-d6b8b226770cf1fdb5836854ed595e49.png" width="1280" height="912"></p><p>In addition, V2.1 supports Group Commit, which accumulates multiple writes in the backend and commits them as one batch. In this way, users don't have to keep their write frequency low, because Doris will merge multiple writes into one.</p><p><img loading="lazy" alt="Group Commit" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-doris-group-commit-2-4bc76d5f1b65ab702ea7156c1eab24a4.png" width="1280" height="757"></p><p>Group Commit so far supports two modes: <code>sync_mode</code> and <code>async_mode</code>. The <code>sync_mode</code> commits multiple imports within a single transaction, after which the data becomes immediately visible. In the <code>async_mode</code>, data is first written to the Write-Ahead Log (WAL); then Doris, based on the system load and the value of <code>group_commit_interval</code>, asynchronously commits the data, after which it becomes visible. When a single import is huge, the system automatically switches to <code>sync_mode</code> to prevent the WAL from occupying too much disk space.</p>
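<p>A minimal sketch of how this is typically used from a MySQL connection, assuming a hypothetical table <code>dt</code>; the <code>group_commit</code> session variable and its <code>async_mode</code> value follow the Group Commit docs:</p><pre><code class="language-SQL">-- turn on async-mode Group Commit for this session, then write as usual;
-- Doris batches these small writes into one commit in the background
SET group_commit = async_mode;
INSERT INTO dt VALUES (1, 'a'), (2, 'b');
</code></pre>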
<p>Benchmark tests on Group Commit (<code>async_mode</code>) with JDBC ingestion and the Stream Load method show great results.</p><ul><li><p><strong>JDBC ingestion:</strong></p><ul><li><p>A 1 Frontend + 1 Backend cluster, TPC-H SF10 Lineitem table (22GB, 180 million rows);</p></li><li><p>At a concurrency level of 20, with each INSERT involving less than 100 rows, Doris V2.1 reaches a writing speed of 106,900 rows/s and a throughput of 11.46 MB/s. CPU usage of the Backend remains at 10%~20%.</p></li></ul></li><li><p><strong>Stream Load:</strong></p><ul><li><p>A 1 Frontend + 3 Backends cluster, httplogs (31GB, 247 million rows);</p></li><li><p>At a concurrency level of 10, with each write involving less than 1MB, Doris returns a -235 error when Group Commit is disabled. With Group Commit enabled, it delivers stable performance and reaches a writing speed of 810,000 rows/s and a throughput of 104 MB/s.</p></li><li><p>At a concurrency level of 10, with each write involving less than 10MB, enabling Group Commit increases the writing speed by 45% and the writing throughput by 79%.</p></li></ul></li></ul><p><strong>Note:</strong> See doc and full test results: <a href="https://doris.apache.org/docs/data-operate/import/import-way/group-commit-manual/#performance" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/data-operate/import/import-way/group-commit-manual/#performance</a></p><h2 id="semi-structured-data-analysis">Semi-structured data analysis</h2><h3 id="a-new-data-type-variant">A new data type: Variant</h3><p>Before V2.1, Doris processed semi-structured data in two ways:</p><ol><li><p>It required users to pre-define the table schema, make a flat table, and parse the data before loading it into Doris. This method ensures fast data writing and avoids parsing upon query execution. The downside is the lack of flexibility: any change to the table schema requires a lot of maintenance effort.</p></li><li><p>It accommodated semi-structured data with JSON, or stored it as JSON strings. Raw JSON data is ingested into Doris without any pre-processing and is parsed by functions upon query execution. This option requires no extra effort from the users, but you might have to put up with inefficient data parsing and reading.</p></li></ol><p>V2.1 supports a new data type named Variant. It can accommodate semi-structured data such as JSON, as well as compound data structures that contain various data types such as integers, strings, and booleans. Users don't have to pre-define the exact data types for a Variant column in the table schema.</p><p>The Variant type is handy when processing nested data structures whose structure can change dynamically. During data writing, it automatically infers column types from the input, merges them into the existing table schema, and stores the JSON keys and their corresponding values as dynamic sub-columns.</p><p>You can include both Variant columns and static columns with pre-defined data types in the same table. This provides greater flexibility in storage and queries.
Additionally, the Variant type is empowered by columnar storage, the vectorized execution engine, and the query optimizer for high efficiency in queries and storage.</p><p>Use Variant in Doris:</p><pre><code class="language-SQL">-- No index
CREATE TABLE IF NOT EXISTS ${table_name} (
    k BIGINT,
    v VARIANT
)
table_properties;

-- Create an index for the v column, specify the parser
CREATE TABLE IF NOT EXISTS ${table_name} (
    k BIGINT,
    v VARIANT,
    INDEX idx_var(v) USING INVERTED [PROPERTIES("parser" = "english|unicode|chinese")] [COMMENT 'your comment']
)
table_properties;

-- Perform queries, access sub-columns using `[]`
SELECT v["properties"]["title"] from ${table_name}
</code></pre><p><strong>Variant vs. JSON</strong></p><p>In Apache Doris, JSON data is stored in the binary JSONB format, and the entire JSON row is stored in segments in a row-oriented way. With the Variant type, by contrast, data types are automatically inferred upon data writing, and the JSON data is stored in a columnar way. Thus, no parsing is needed during queries.</p><p>Furthermore, the Variant type is optimized for sparse JSON scenarios.
It only extracts frequently occurring columns; the sparse columns are stored in a separate format.</p><p>Tests show that <strong>data in Variant columns takes up the same storage space as data in static columns, which is only 35% of that in JSON format</strong>. The Variant type should be a more cost-effective choice in low-cardinality scenarios.</p><p><img loading="lazy" alt="Variant vs JSON" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-variant-vs-json-e0a280dc9f81543c22cb53f85170ac38.png" width="2585" height="962"></p><p>In terms of query performance, <strong>the Variant type enables 8 times higher query speed than JSON</strong> in hot runs, and even more in cold runs.</p><p><img loading="lazy" alt="Variant vs JSON" src="https://cdnd.selectdb.com/zh-CN/assets/images/2.1-variant-vs-json-2-c897e07d34381c7ce54e53284d6e04bf.png" width="3556" height="1095"></p><p><strong>Tips:</strong></p><ul><li><p>Currently, the Variant type is not supported in the Aggregate Key model of Doris. It cannot be the primary key or sorting key in a Unique Key model table or Duplicate Key model table;</p></li><li><p>It is recommended to go with the RANDOM mode or Group Commit for higher writing performance;</p></li><li><p>It is recommended to extract non-standard JSON types such as date or decimal as static fields for higher performance;</p></li><li><p>In columnar format, arrays of two or more dimensions, as well as arrays with nested objects, are stored with JSONB encoding, resulting in lower performance than native arrays;</p></li><li><p>Queries involving filtering or aggregation require the use of Cast; based on the storage type and the Cast type, the storage layer provides hints that enable predicate pushdown in the storage engine and thus accelerate queries (see the sketch after this list).</p></li></ul>
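<p>A minimal sketch of the Cast-based filtering mentioned above, reusing the hypothetical <code>${table_name}</code> and the <code>v["properties"]["title"]</code> sub-column from the earlier example:</p><pre><code class="language-SQL">-- cast the Variant sub-column to a concrete type so the predicate can be pushed down
SELECT count(*)
FROM ${table_name}
WHERE CAST(v["properties"]["title"] AS TEXT) = 'Doris';
</code></pre>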
noreferrer">https://doris.apache.org/docs/sql-manual/sql-reference/Data-Types/VARIANT/</a></p></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="ip-types">IP types<a href="#ip-types" class="hash-link" aria-label="IP types的直接链接" title="IP types的直接链接"></a></h3><p>IP address is a widely used field in statistical analysis for network traffic monitoring. Doris V2.1 provides native support for IPv4 and IPv6. It stores IP data in binary format, which cuts down storage space usage by 60% compared to IP string in plain texts. Along with these IP types, we have added over 20 functions for IP data processing, including:</p><ul><li><p>IPV4_NUM_TO_STRING: It converts a big-endian representation of an IPv4 address of Int16, Int32, or Int64 into its corresponding string representation;</p></li><li><p>IPV4_CIDR_TO_RANGE: It receives an IPv4 address and a CIDR-containing Int16 value, and returns a structure containing two IPv4 fields, representing the lower range (min) and upper range (max) of the subnet, respectively;</p></li><li><p>INET_ATON: It retrieves a string containing an IPv4 address in the format of A.B.C.D, where A, B, C, and D are decimal numbers separated by periods.</p></li></ul><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>备注</div><div class="admonitionContent_S0QG"><p>See doc: <a href="https://doris.apache.org/docs/sql-manual/sql-reference/Data-Types/IPV6/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/sql-manual/sql-reference/Data-Types/IPV6/</a></p></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="more-powerful-functions-for-compound-data-types">More powerful functions for compound data types<a href="#more-powerful-functions-for-compound-data-types" class="hash-link" aria-label="More powerful functions for compound data types的直接链接" title="More powerful functions for compound data types的直接链接"></a></h3><p>V2.1 provides more supported data types:</p><ul><li><code>explode_map</code>: supports exploding rows into columns for the Map data type (only with the new optimizer)</li></ul><p>Each key-value pair in the Map field is expanded into a separate row, with the Map field replaced by two separate fields representing the key and value. The <code>explode_map</code> function should be used in conjunction with Lateral View. You can apply multiple Lateral Views. 
<p><strong>Note:</strong> See doc: <a href="https://doris.apache.org/docs/sql-manual/sql-reference/Data-Types/IPV6/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/sql-manual/sql-reference/Data-Types/IPV6/</a></p><h3 id="more-powerful-functions-for-compound-data-types">More powerful functions for compound data types</h3><p>V2.1 provides more powerful functions for compound data types:</p><ul><li><code>explode_map</code>: supports exploding rows into columns for the Map data type (only with the new optimizer)</li></ul><p>Each key-value pair in the Map field is expanded into a separate row, with the Map field replaced by two separate fields representing the key and the value. The <code>explode_map</code> function should be used in conjunction with Lateral View, and you can apply multiple Lateral Views; the result is a Cartesian product.</p><p>This is how it is used:</p><pre><code class="language-SQL">-- Create table
CREATE TABLE `sdu` (
    `id` INT NULL,
    `name` TEXT NULL,
    `score` MAP&lt;TEXT,INT&gt; NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);

-- Insert data
insert into sdu values (0, "zhangsan", {"Chinese":"80","Math":"60","English":"90"});
insert into sdu values (1, "lisi", {"null":null});
insert into sdu values (2, "wangwu", {"Chinese":"88","Math":"90","English":"96"});
insert into sdu values (3, "lisi2", {null:null});
insert into sdu values (4, "amory", NULL);

mysql&gt; select name, course_0, score_0 from sdu lateral view explode_map(score) tmp as course_0,score_0;
+----------+----------+---------+
| name     | course_0 | score_0 |
+----------+----------+---------+
| zhangsan | Chinese  | 80      |
| zhangsan | Math     | 60      |
| zhangsan | English  | 90      |
| lisi     | null     | NULL    |
| wangwu   | Chinese  | 88      |
| wangwu   | Math     | 90      |
| wangwu   | English  | 96      |
| lisi2    | NULL     | NULL    |
+----------+----------+---------+
mysql&gt; select name, course_0, score_0, course_1, score_1 from sdu lateral view explode_map(score) tmp as course_0,score_0 lateral view explode_map(score) tmp1 as course_1,score_1;
+----------+----------+---------+----------+---------+
| name     | course_0 | score_0 | course_1 | score_1 |
+----------+----------+---------+----------+---------+
| zhangsan | Chinese  | 80      | Chinese  | 80      |
| zhangsan | Chinese  | 80      | Math     | 60      |
| zhangsan | Chinese  | 80      | English  | 90      |
| zhangsan | Math     | 60      | Chinese  | 80      |
| zhangsan | Math     | 60      | Math     | 60      |
| zhangsan | Math     | 60      | English  | 90      |
| zhangsan | English  | 90      | Chinese  | 80      |
| zhangsan | English  | 90      | Math     | 60      |
| zhangsan | English  | 90      | English  | 90      |
| lisi     | null     | NULL    | null     | NULL    |
| wangwu   | Chinese  | 88      | Chinese  | 88      |
| wangwu   | Chinese  | 88      | Math     | 90      |
| wangwu   | Chinese  | 88      | English  | 96      |
| wangwu   | Math     | 90      | Chinese  | 88      |
class="token plain">| wangwu | Math | 90 | Math | 90 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | Math | 90 | English | 96 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | English | 96 | Chinese | 88 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | English | 96 | Math | 90 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | English | 96 | English | 96 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| lisi2 | NULL | NULL | NULL | NULL |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+----------+---------+----------+---------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><code>explode_map_outer</code> and <code>explode_outer</code> serve the same purpose. They display rows with NULL values in the Map-type columns.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select name, course_0, score_0 from sdu lateral view explode_map_outer(score) tmp as course_0,score_0;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+----------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| name | course_0 | score_0 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+----------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| zhangsan | Chinese | 80 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| zhangsan | Math | 60 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| zhangsan | English | 90 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| lisi | null | NULL |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | Chinese | 88 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | Math | 90 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| wangwu | English | 96 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| lisi2 | NULL | NULL |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| amory | NULL | NULL |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">+----------+----------+---------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li><code>IN</code>: supports the <code>STRUCT</code> data type (only with the new optimizer)</li></ul><p>The <code>IN</code> predicate supports Struct type data constructed using the <code>struct()</code>function as the left parameter. It also allows selecting a column that contains Struct type data from a table. It supports a Struct-type array constructed using the <code>struct()</code> function as the right parameter.</p><p>It is an efficient alternative to WHERE clauses with many OR conditions. For example, instead of using expressions like <code>(a = 1 and b = '2') or (a = 1 and b = '3') or (...)</code>, you can use <code>struct(a,b) in (struct(1, '2'), struct(1, '3'), ...)</code>.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select struct(1,"2") in (struct(1,3), struct(1,"2"), struct(1,1), null);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| cast(struct(1, '2') as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;) IN (NULL, cast(struct(1, '2') as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;), cast(struct(1, 1) as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;), cast(struct(1, 3) as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;)) |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select struct(1,"2") not in (struct(1,3), struct(1,"2"), struct(1,1), null);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| ( not cast(struct(1, '2') as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;) IN (NULL, cast(struct(1, '2') as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;), cast(struct(1, 1) as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;), cast(struct(1, 3) as STRUCT&lt;col1:TINYINT,col2:TEXT&gt;))) |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 0 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li><code>MAP_AGG</code>: It receives expr1 as the key, expr2 as the corresponding value, and returns a MAP.</li></ul><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>备注</div><div class="admonitionContent_S0QG"><p>See doc: <a href="https://doris.apache.org/docs/sql-manual/sql-functions/aggregate-functions/map-agg/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/sql-manual/sql-functions/aggregate-functions/map-agg/</a></p></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="workload-management">Workload management<a href="#workload-management" class="hash-link" aria-label="Workload management的直接链接" title="Workload management的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="hard-isolation-of-resources">Hard isolation of resources<a href="#hard-isolation-of-resources" class="hash-link" aria-label="Hard isolation of resources的直接链接" 
title="Hard isolation of resources的直接链接"></a></h3><p>On the basis of the Workload Group mechanism, which imposes a soft limit on the resources that a workload group can use, Doris 2.1 introduces a hard limit on CPU resource consumption for workload groups as a way to ensure <strong>higher stability in query performance</strong>. This means that regardless of the overall CPU availability on the physical machine, workload groups configured with hard limits cannot exceed the maximum CPU usage specified in the configuration. In this way, as long as there is no significant change in user query workload, there will be stable query performance. A caveat is that, in addition to CPU usage, memory, I/O, and resource contention at the software level will all impact query execution. Thus, when the cluster switches between idle and heavy load, even with CPU hard limits configured, there might still be fluctuations in query performance. However, you can still expect better performance from the hard limits than the soft limits.</p><div class="theme-admonition theme-admonition-tip alert alert--success admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>提示</div><div class="admonitionContent_S0QG"><p><strong>Note</strong></p><ol><li><p>In Doris V2.0, CPU resource isolation was implemented based on a priority queue. In V2.1, this relies on the CGroup mechanism. Therefore, please note that you should configure the CGroups in advance after upgrading from V2.0 to V2.1. </p></li><li><p>Currently, the Workload Group mechanism supports query-query workload isolation and ingestion-query isolation. Note that if you need to impose hard limits on import workloads, you should enable <code>memtable_on_sink_node</code>.</p></li><li><p>Users need to specify either soft limits or hard limits for the current cluster using a switch. Currently, it is not supported to run both modes on the same cluster. 
<p><strong>Note:</strong> See doc: <a href="https://doris.apache.org/docs/admin-manual/workload-group/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/admin-manual/workload-group/</a></p><h3 id="topsql">TopSQL</h3><p>V2.1 allows users to check the most resource-consuming SQL queries at runtime. This can be a big help when handling cluster load spikes caused by unexpected large queries.</p><pre><code class="language-SQL">mysql [(none)]&gt; desc function active_queries();
+------------------------+--------+------+-------+---------+-------+
| Field                  | Type   | Null | Key   | Default | Extra |
+------------------------+--------+------+-------+---------+-------+
| BeHost                 | TEXT   | No   | false | NULL    | NONE  |
| BePort                 | BIGINT | No   | false | NULL    | NONE  |
| QueryId                | TEXT   | No   | false | NULL    | NONE  |
| StartTime              | TEXT   | No   | false | NULL    | NONE  |
| QueryTimeMs            | BIGINT | No   | false | NULL    | NONE  |
| WorkloadGroupId        | BIGINT | No   | false | NULL    | NONE  |
| QueryCpuTimeMs         | BIGINT | No   | false | NULL    | NONE  |
| ScanRows               | BIGINT | No   | false | NULL    | NONE  |
| ScanBytes              | BIGINT | No   | false | NULL    | NONE  |
| BePeakMemoryBytes      | BIGINT | No   | false | NULL    | NONE  |
| CurrentUsedMemoryBytes | BIGINT | No   | false | NULL    | NONE  |
| ShuffleSendBytes       | BIGINT | No   | false | NULL    | NONE  |
| ShuffleSendRows        | BIGINT | No   | false | NULL    | NONE  |
| Database               | TEXT   | No   | false | NULL    | NONE  |
| FrontendInstance       | TEXT   | No   | false | NULL    | NONE  |
| Sql                    | TEXT   | No   | false | NULL    | NONE  |
+------------------------+--------+------+-------+---------+-------+
</code></pre>
<p>The <code>active_queries()</code> function records the audit information of queries running on the Backends of Doris. You can query <code>active_queries()</code> like a regular table; it supports querying, filtering with predicates, sorting, and joining. Common metrics captured by this function include the SQL execution time, CPU time, peak memory usage on a single Backend, the data volume scanned, and the data volume shuffled during query execution. It also allows rolling up to the Backend level to examine the global resource consumption of SQL queries.</p><p>Note that only currently running SQL statements are displayed; statements that have finished execution are written to the audit log (fe.audit.log, mostly) instead.</p>
<p>A few frequently used SQL statements are as follows:</p><pre><code class="language-SQL">-- Check the top N longest-running SQLs in the cluster
select QueryId, max(QueryTimeMs) as query_time from active_queries() group by QueryId order by query_time desc limit 10;

-- Check the top N most CPU-consuming SQLs in the cluster
select QueryId, sum(QueryCpuTimeMs) as cpu_time from active_queries() group by QueryId order by cpu_time desc limit 10;

-- Check the top N SQLs with the most scan rows and their execution time
select t1.QueryId, t1.scan_rows, t2.query_time from
    (select QueryId, sum(ScanRows) as scan_rows from active_queries() group by QueryId order by scan_rows desc limit 10) t1
    left join (select QueryId, max(QueryTimeMs) as query_time from active_queries() group by QueryId) t2 on t1.QueryId = t2.QueryId;

-- Check the load of all Backends, sorted in descending order by CPU time/scan rows/shuffle bytes
select BeHost, sum(QueryCpuTimeMs) as query_cpu_time, sum(ScanRows) as scan_rows, sum(ShuffleSendBytes) as shuffle_bytes from active_queries() group by BeHost order by query_cpu_time desc, scan_rows desc, shuffle_bytes desc limit 10;

-- Check the top N SQL queries with the highest peak memory usage on a single Backend
select QueryId, max(BePeakMemoryBytes) as be_peak_mem from active_queries() group by QueryId order by be_peak_mem desc limit 10;
</code></pre>
class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Currently, the main displayed workload types include <code>Select</code> and <code>Insert Into...Select</code>. The iterative versions of V2.1 are expected to support displaying the resource usage of Stream Load and Broker Load.</p><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>备注</div><div class="admonitionContent_S0QG"><p>See doc: <a href="https://doris.apache.org/docs/sql-manual/sql-functions/table-functions/active_queries/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/sql-manual/sql-functions/table-functions/active_queries/</a></p></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="others">Others<a href="#others" class="hash-link" aria-label="Others的直接链接" title="Others的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="decimal-256">Decimal 256<a href="#decimal-256" class="hash-link" aria-label="Decimal 256的直接链接" title="Decimal 256的直接链接"></a></h3><p>For users in the financial sector or high-end manufacturing, V2.1 supports a high-precision data type: Decimal, which supports up to 76 significant digits (To enable this experimental feature, please set <code>enable_decimal256=true</code>.)</p><p>Example:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE `test_arithmetic_expressions_256` (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k1 decimal(76, 30),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> k2 decimal(76, 30)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> DISTRIBUTED BY HASH(k1)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "replication_num" = "1"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> );</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">insert into test_arithmetic_expressions_256 values</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 
<p>Example:</p><pre><code class="language-SQL">CREATE TABLE `test_arithmetic_expressions_256` (
    k1 decimal(76, 30),
    k2 decimal(76, 30)
)
DISTRIBUTED BY HASH(k1)
PROPERTIES (
    "replication_num" = "1"
);

insert into test_arithmetic_expressions_256 values
    (1.000000000000000000000000000001, 9999999999999999999999999999999999999999999998.999999999999999999999999999998),
    (2.100000000000000000000000000001, 4999999999999999999999999999999999999999999999.899999999999999999999999999998),
    (3.666666666666666666666666666666, 3333333333333333333333333333333333333333333333.333333333333333333333333333333);
</code></pre><p>Query and result:</p><pre><code class="language-SQL">select k1, k2, k1 + k2 a from test_arithmetic_expressions_256 order by 1, 2;
+----------------------------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
| k1                               | k2                                                                             | a                                                                              |
+----------------------------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
| 1.000000000000000000000000000001 | 9999999999999999999999999999999999999999999998.999999999999999999999999999998 | 9999999999999999999999999999999999999999999999.999999999999999999999999999999 |
| 2.100000000000000000000000000001 | 4999999999999999999999999999999999999999999999.899999999999999999999999999998 | 5000000000000000000000000000000000000000000001.999999999999999999999999999999 |
| 3.666666666666666666666666666666 | 3333333333333333333333333333333333333333333333.333333333333333333333333333333 | 3333333333333333333333333333333333333333333336.999999999999999999999999999999 |
+----------------------------------+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
3 rows in set (0.09 sec)
</code></pre>
<div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>Note</div><div class="admonitionContent_S0QG"><p>The Decimal256 type consumes more CPU resources, so queries on it may be slower than queries on other data types.</p></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="job-scheduler">Job scheduler<a href="#job-scheduler" class="hash-link" aria-label="Job scheduler的直接链接" title="Job scheduler的直接链接"></a></h3><p>According to user feedback, there is a recurring need for scheduled job execution, such as:</p><ul><li><p>Periodic backup;</p></li><li><p>Scheduled data expiration;</p></li><li><p>Periodic import jobs: scheduling incremental or full data synchronization jobs using the Catalog feature;</p></li><li><p>Regular ETL: such as loading data from a flat table into a specified table on a scheduled basis, pulling data from detailed tables and storing it in aggregate tables at specific intervals, and performing scheduled denormalization for tables in the ODS layer and updates to the existing flat table.</p></li></ul><p>Despite the availability of various external scheduling systems such as Airflow and DolphinScheduler, a consistency challenge remains. Suppose an external scheduling system triggers an import job in Doris and the job executes successfully, but the scheduling system crashes before it can retrieve the execution result. It will then assume the job has failed and trigger its fault tolerance mechanism: either a retry or a direct failure. Both will result in:</p><ul><li><p><strong>Waste of resources</strong>: Since the scheduling system can mistakenly consider a job as failed, it might reschedule the execution of a job that has already succeeded, resulting in unnecessary resource consumption.</p></li><li><p><strong>Data duplication or loss</strong>: On the one hand, retrying the import job might lead to duplicate data imports, resulting in data redundancy or inconsistency. 
On the other hand, if the job is marked as failed, it can result in the neglect or loss of data that has actually been successfully imported.</p></li><li><p><strong>Time delay</strong>: After the fault tolerance mechanism is triggered, extra time is needed for job scheduling and retries, prolonging the overall data processing time.</p></li><li><p><strong>Compromised system stability</strong>: Frequent retries or immediate failures can increase the load on both the scheduling system and Doris, thereby undermining the stability and performance of the system.</p></li></ul><p>V2.1 provides a good option for regular job scheduling: Doris Job Scheduler. It can trigger pre-defined operations on schedule or at fixed intervals, and it is accurate to the second. In addition to guaranteeing consistency for data writing, it provides:</p><ol><li><p><strong>Efficiency</strong>: The Doris Job Scheduler can schedule jobs and events at specified time intervals to ensure efficient data processing. By employing the time wheel algorithm, it guarantees precise triggering of events at a granularity of seconds.</p></li><li><p><strong>Flexibility</strong>: It offers multiple scheduling options, such as scheduling at intervals of minutes, hours, days, or weeks. It supports both one-time scheduling and recurring (cyclic) event scheduling. For the latter, you can specify the start and end times for the scheduling period.</p></li><li><p><strong>Event pool and high-performance processing queues</strong>: It utilizes Disruptor for a high-performance producer-consumer model to minimize job overload.</p></li><li><p><strong>Traceable scheduling records</strong>: It stores the latest job execution records (configurable), which users can view via a simple command, as sketched after the example below.</p></li><li><p><strong>High availability</strong>: On the basis of the Doris high availability mechanism, the jobs are easily self-recoverable.</p></li></ol><p>An example of creating a scheduled job:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Execute an insert statement every day from 2023-11-17 to 2038</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">JOB e_daily</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ON SCHEDULE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> EVERY 1 DAY </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> STARTS '2023-11-17 23:59:00'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ENDS '2038-01-19 03:14:07'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> COMMENT 'Saves total number of sessions'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> DO</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> INSERT INTO site_activity.totals (time, total)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SELECT CURRENT_TIMESTAMP, COUNT(*)</span><br></span>
class="token-line" style="color:#F8F8F2"><span class="token plain"> FROM site_activity.sessions where create_time &gt;= days_add(now(),-1) ;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><div class="theme-admonition theme-admonition-note alert alert--secondary admonition_LlT9"><div class="admonitionHeading_tbUL"><span class="admonitionIcon_kALy"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>备注</div><div class="admonitionContent_S0QG"><p>Doris Job Scheduler only supports Insert operations on internal tables currently. See doc: <a href="https://doris.apache.org/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-JOB/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-JOB/</a></p></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior changed的直接链接" title="Behavior changed的直接链接"></a></h2><ul><li><p>The Unique Key model enables Merge-on-Write by default, which means <code>enable_unique_key_merge_on_write=true</code> will be included as a default setting when a table is created in the Unique Key model.</p></li><li><p>Since inverted index has proven to be more performant than bitmap index, V2.1 and future versions stop supporting bitmap index. Existing bitmap indexes will remain effective but new creation is not allowed. We will remove bitmap index-related code in the future.</p></li><li><p><code>cpu_resource_limit</code> is no longer supported. It is to put a limit on the number of scanner threads on Doris Backend. Since the Workload Group mechanism also supports such settings, the already configured <code>cpu_resource_limit</code> will be invalid.</p></li><li><p>Segment compaction is enabled by default. This means Doris supports compaction of multiple segments in the same rowset, which is useful in single-batch ingestion of large datasets.</p></li><li><p>Audit log plug-in</p><ul><li><p>Since V2.1.0, Doris has a built-in audit log plug-in. Users can simply enable or disable it by setting the <code>enable_audit_plugin</code> parameter. </p></li><li><p>If you have already installed your own audit log plug-in, you can either continue using it after upgrading to Doris V2.1, or uninstall it and use the one in Doris. 
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior changed的直接链接" title="Behavior changed的直接链接"></a></h2><ul><li><p>The Unique Key model enables Merge-on-Write by default, which means <code>enable_unique_key_merge_on_write=true</code> will be included as a default setting when a table is created in the Unique Key model.</p></li><li><p>Since inverted index has proven to be more performant than bitmap index, V2.1 and future versions stop supporting bitmap index. Existing bitmap indexes will remain effective but new creation is not allowed. We will remove bitmap index-related code in the future.</p></li><li><p><code>cpu_resource_limit</code> is no longer supported. It was used to limit the number of scanner threads on the Doris Backend. Since the Workload Group mechanism also covers this setting, any previously configured <code>cpu_resource_limit</code> will be invalid.</p></li><li><p>Segment compaction is enabled by default. This means Doris supports compaction of multiple segments in the same rowset, which is useful in single-batch ingestion of large datasets.</p></li><li><p>Audit log plug-in</p><ul><li><p>Since V2.1.0, Doris has a built-in audit log plug-in. Users can simply enable or disable it by setting the <code>enable_audit_plugin</code> parameter. </p></li><li><p>If you have already installed your own audit log plug-in, you can either continue using it after upgrading to Doris V2.1, or uninstall it and use the one in Doris. Please note that the audit log table will be relocated after switching plug-ins.</p></li><li><p>For more details, please see doc: <a href="https://doris.apache.org/docs/ecosystem/audit-plugin/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/ecosystem/audit-plugin/</a></p></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="credits">Credits<a href="#credits" class="hash-link" aria-label="Credits的直接链接" title="Credits的直接链接"></a></h2><p>Special thanks to the following contributors for making this happen: </p><p>morrySnow, Gabriel39, BiteTheDDDDt, kaijchen, starocean999, morningman, jackwener, zy-kkk, englefly, Jibing-Li, XieJiann, yujun777, Mryange, HHoflittlefish777, LiDongyangLi, HappenLee, zhangstar333, lihangyu, zclllyybb, amory, bobhan1, AKIRA, zhangdong, ZouXinyiZou, HuJerryHu, yiguolei, airborne12, wangbo, jacktengg, TangSiyang2001, BePPPower, Yukang-Lian, mymeiyi, liugddx, kaka11chen, AshinGau, DrogonJackDrogon, wsjz, seuhezhiqiang, zhannngchen, shuke987, KassieZ, huanghaibin, zzzxl1993, Nitin-Kashyap, AlexYue, dataroaring, seawinde, walter, xzj7019, xiaokang, SWJTU-ZhangLei, liaoxin01, dutyu, wuwenchihdu, LiBinfeng-01, daidai, qidaye, mch_ucchi, zhangguoqiang, zhengyu, plat1ko, LemonLiTree, ixzc, deardeng, yiguolei, catpineapple, LingAdonisLing, DongLiang-0, whuxingying, Tanya-W, Yulei-Yang, zzzzzzzs, caoliang-web, xueweizhang, yangshijie, Luwei, lsy3993, xy720, HowardQin, DeadlineFen, Petrichor, caiconghui, KirsCalvinKirs, SunChenyangSun, ChouGavinChou, Luzhijing, gnehil, wudi, zhiqqqq, zfr95, zxealous, kkop, yagagagaga, Chester, LuGuangmingLu, Lightman, Xiaocc, taoxutao, yuanyuan8983, KirsCalvinKirs, DuRipeng, GoGoWen, JingDas, camby, Euporia, rohitrs1983, felixwluo, wudongliang, FreeOnePlus, PaiVallishPai, XuJianxu, seuhezhiqiang, luozenglin, 924060929, HB, LiuLijiaLiu, Ma1oneZhang, bingquanzhao, chunping, echo-dundun, feiniaofeiafei, walter, yongjinhou, zgxme, zhangy5, httpshirley, ChenyangSunChenyang, ZenoYang, ZhangYu0123, hechao, herry2038, jayhua, koarz, nanfeng, LiChuangLi, LiuGuangdongLiu, Jeffrey, liuJiwenliu, Stalary, DuanXujianDuan, HuZhiyuHu, jiafeng.zhang, nanfeng, py023, xiongjx, yuxuan-luo, zhaoshuo, XiaoChangmingXiao, ElvinWei, LiuHongLiu, QiHouliangQi, Hyman-zhao, HelgeLarsHelge, Uniqueyou, YangYAN, acnot, amory, feifeifeimoon, flynn, gohalo, htyoung, realize096, shee, wangqt, xyfsjq, zzwwhh, songguangfan, 467887319, BirdAmosBird, ZhuArmandoZhu, CanGuan, ChengDaqi2023, ChinaYiGuan, gitccl, colagy, DeadlineFen, Doris-Extras, HonestManXin, q763562998, guardcrystal, Dragonliu2018, ZhaoLongZhao, LuoMetaLuo, Miaohongkai, YinShaowenYin, Centurybbx, hongkun-Shao, Wanghuan, Xinxing, XueYuhai, Yoko, HeZhangJianHe, ZhongJinHacker, alan_rodriguez, allenhooo, beat4ocean, bigben0204, chen, czzmmc, dalong, deadlinefen, didiaode18, dong-shuai, feelshana, fornaix, hammer, xuke-hat, hqx871, i78086, irenesrl, julic20s, kindred77, lihuigang, wenluowen, lxliyou001, CSTGluigi, ranxiang327, shysnow, sunny, vhwzIs, wangtao, wangtianyi2004, wyx123654, xuefengze, xiangran0327, xy, yimeng, ytwp, yujian, zhangstar333, figurant, sdhzwc, LHG41278, zlw5307</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Breaking down data silos with a unified data warehouse: an Apache Doris-based CDP]]></title>
<id>https://doris.apache.org/zh-CN/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp</id>
<link href="https://doris.apache.org/zh-CN/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp"/>
<updated>2024-03-05T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The insurance company uses Apache Doris, a unified data warehouse, in replacement of Spark + Impala + HBase + NebulaGraph, in their Customer Data Platform for 4 times faster customer grouping.]]></summary>
<content type="html"><![CDATA[<p>The data silos problem is like arthritis for online business, because almost everyone gets it as they grow old. Businesses interact with customers via websites, mobile apps, H5 pages, and end devices. For one reason or another, it is tricky to integrate the data from all these sources. Data stays where it is and cannot be interrelated for further analysis. That's how data silos come to form. The bigger your business grows, the more diversified customer data sources you will have, and the more likely you are trapped by data silos. </p><p>This is exactly what happens to the insurance company I'm going to talk about in this post. By 2023, they have already served over 500 million customers and signed 57 billion insurance contracts. When they started to build a customer data platform (CDP) to accommodate such a data size, they used multiple components. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-silos-in-cdp">Data silos in CDP<a href="#data-silos-in-cdp" class="hash-link" aria-label="Data silos in CDP的直接链接" title="Data silos in CDP的直接链接"></a></h2><p>Like most data platforms, their CDP 1.0 had a batch processing pipeline and a real-time streaming pipeline. Offline data was loaded, via Spark jobs, to Impala, where it was tagged and divided into groups. Meanwhile, Spark also sent it to NebulaGraph for OneID computation (elaborated later in this post). On the other hand, real-time data was tagged by Flink and then stored in HBase, ready to be queried.</p><p>That led to a component-heavy computation layer in the CDP: Impala, Spark, NebulaGraph, and HBase.</p><p><img loading="lazy" alt="apache doris data silos in CDP" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-data-silos-in-CDP-df4e64a7cadc2fa6fca8de1807571aa4.png" width="1280" height="1060" class="img_ev3q"></p><p>As a result, offline tags, real-time tags, and graph data were scattered across multiple components. Integrating them for further data services was costly due to redundant storage and bulky data transfer. What's more, due to discrepancies in storage, they had to expand the size of the CDH cluster and NebulaGraph cluster, adding to the resource and maintenance costs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="apache-doris-based-cdp">Apache Doris-based CDP<a href="#apache-doris-based-cdp" class="hash-link" aria-label="Apache Doris-based CDP的直接链接" title="Apache Doris-based CDP的直接链接"></a></h2><p>For CDP 2.0, they decide to introduce a unified solution to clean up the mess. At the computation layer of CDP 2.0, <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a> undertakes both real-time and offline data storage and computation. </p><p>To ingest <strong>offline data</strong>, they utilize the <a href="https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual" target="_blank" rel="noopener noreferrer">Stream Load</a> method. Their 30-thread ingestion test shows that it can perform over 300,000 upserts per second. To load <strong>real-time data</strong>, they use a combination of <a href="https://doris.apache.org/docs/ecosystem/flink-doris-connector" target="_blank" rel="noopener noreferrer">Flink-Doris-Connector</a> and Stream Load. 
In addition, in real-time reporting where they need to extract data from multiple external data sources, they leverage the <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> feature for <strong>federated queries</strong>. </p><p><img loading="lazy" alt="apache doris based-CDP" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-based-CDP-be99e2c46e0588eb6d6540e0f557ddbb.png" width="1280" height="1068" class="img_ev3q"></p><p>The customer analytic workflows on this CDP go like this. First, they sort out customer information, then they attach tags to each customer. Based on the tags, they divide customers into groups for more targeted analysis and operation. </p><p>Next, I'll delve into these workloads and show you how Apache Doris accelerates them. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="oneid">OneID<a href="#oneid" class="hash-link" aria-label="OneID的直接链接" title="OneID的直接链接"></a></h2><p>Has this ever happened to you: you have different user registration systems for different products and services. You might collect the email of UserID A from one product webpage, and later the social security number of UserID B from another. Then you find out that UserID A and UserID B actually belong to the same person because they go by the same phone number.</p><p>That's where the idea of OneID comes in. It is to pool the user registration information of all business lines into one large table in Apache Doris, sort it out, and make sure that each user has a unique OneID. </p><p>This is how they leverage the functions in Apache Doris to figure out which registration records belong to the same user.</p><p><img loading="lazy" alt="apache doris OneID" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-OneID-56d81b3a97eeeff7e9ce266e71263161.png" width="1280" height="543" class="img_ev3q"></p>
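<p>The chart above illustrates the idea. As a minimal sketch (with hypothetical table and column names, not the company's actual schema), registration records that share the same identifier, here a phone number, can be collapsed into one OneID by taking the smallest user ID in each group:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">-- hypothetical staging table pooling registrations from all business lines:
-- user_registry(user_id, phone, email, source_system)
-- collapse records that share a phone number into one OneID
SELECT
    phone,
    MIN(user_id) AS one_id
FROM user_registry
GROUP BY phone;</code></pre></div></div>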
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="tagging-services">Tagging services<a href="#tagging-services" class="hash-link" aria-label="Tagging services的直接链接" title="Tagging services的直接链接"></a></h2><p>This CDP accommodates the information of <strong>500 million customers</strong>, which comes from over <strong>500 source tables</strong> and is attached to over <strong>2000 tags</strong> in total.</p><p>By timeliness, the tags can be divided into real-time tags and offline tags. The real-time tags are computed by Apache Flink and written into the flat table in Apache Doris, while the offline tags are computed by Apache Doris, as they are derived from the user attribute table, business table, and user behavior table in Doris. Here is the company's best practice in data tagging: </p><p><strong>1. Offline tags:</strong></p><p>During the peaks of data writing, a full update might easily cause an OOM error given their huge data scale. To avoid that, they utilize the <a href="https://doris.apache.org/docs/data-operate/import/import-way/insert-into-manual" target="_blank" rel="noopener noreferrer">INSERT INTO SELECT</a> function of Apache Doris and enable <strong>partial column update</strong>. This significantly cuts down memory consumption and maintains system stability during data loading.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">set enable_unique_key_partial_update=true;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">insert into tb_label_result(one_id, labelxx) </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select one_id, label_value as labelxx</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from .....</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>2. Real-time tags:</strong></p><p>Partial column update is also available for real-time tags, since even real-time tags are updated at different paces. All that is needed is to set <code>partial_columns</code> to <code>true</code>.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://127.0.0.1:48037/api/db1/user_profile/_stream_load</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>3. High-concurrency point queries:</strong></p><p>With its current business size, the company is receiving query requests for tags at a concurrency level of over 5000 QPS. They use a combination of strategies to guarantee high performance. Firstly, they adopt <a href="https://doris.apache.org/docs/query-acceleration/hight-concurrent-point-query#using-preparedstatement" target="_blank" rel="noopener noreferrer">Prepared Statement</a> for pre-compilation and pre-execution of SQL. Secondly, they fine-tune the parameters for Doris Backend and the tables to optimize storage and execution. 
Lastly, they enable <a href="https://doris.apache.org/docs/query-acceleration/hight-concurrent-point-query#enable-row-cache" target="_blank" rel="noopener noreferrer">row cache</a> as a complement to the column-oriented Apache Doris.</p><ul><li>Fine-tune Doris Backend parameters in <code>be.conf</code>:</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">disable_storage_row_cache = false </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">storage_page_cache_limit=40%</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li>Fine-tune table parameters upon table creation:</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">enable_unique_key_merge_on_write = true</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">store_row_column = true</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">light_schema_change = true</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>4. Tag computation (join):</strong></p><p>In practice, many tagging services are implemented by multi-table joins in the database. That often involves more than 10 tables. For optimal computation performance, they adopt the <a href="https://doris.apache.org/docs/query-acceleration/join-optimization/colocation-join" target="_blank" rel="noopener noreferrer">colocation group</a> strategy in Doris, as sketched below. </p>
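<p>A minimal sketch of this setup, with hypothetical table and column names: tables that are frequently joined on the same key are created with the same <code>colocate_with</code> group (they must share the same distribution columns and bucket number), so joins on that key are executed locally on each node without shuffling data.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">-- both tables are bucketed by one_id and placed in the same colocation group,
-- so a join on one_id needs no data shuffle
CREATE TABLE user_tags (
    one_id BIGINT,
    tag_code VARCHAR(64),
    tag_value VARCHAR(256)
)
DUPLICATE KEY(one_id)
DISTRIBUTED BY HASH(one_id) BUCKETS 16
PROPERTIES ("colocate_with" = "cdp_group");

CREATE TABLE user_behavior (
    one_id BIGINT,
    event_time DATETIME,
    event_type VARCHAR(64)
)
DUPLICATE KEY(one_id)
DISTRIBUTED BY HASH(one_id) BUCKETS 16
PROPERTIES ("colocate_with" = "cdp_group");</code></pre></div></div>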
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="customer-grouping">Customer Grouping<a href="#customer-grouping" class="hash-link" aria-label="Customer Grouping的直接链接" title="Customer Grouping的直接链接"></a></h2><p>The customer grouping pipeline in CDP 2.0 goes like this: Apache Doris receives SQL from customer service, executes the computation, and sends the result set to S3 object storage via SELECT INTO OUTFILE. The company has divided their customers into 1 million groups. The customer grouping task that used to take <strong>50 seconds in Impala</strong> to finish now needs only <strong>10 seconds in Doris</strong>. </p><p><img loading="lazy" alt="apache doris customer grouping" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-customer-grouping-7c42996acf6d17eb8be01be7848e6ee6.png" width="1280" height="402" class="img_ev3q"></p><p>Apart from grouping the customers for more fine-grained analysis, sometimes they do analysis in the reverse direction: they target a certain customer and find out to which groups he/she belongs. This helps analysts understand the characteristics of customers as well as how different customer groups overlap.</p><p>In Apache Doris, this is implemented by the BITMAP functions: <code>BITMAP_CONTAINS</code> is a fast way to check if a customer is part of a certain group, and <code>BITMAP_OR</code>, <code>BITMAP_INTERSECT</code>, and <code>BITMAP_XOR</code> are the choices for cross analysis. </p><p><img loading="lazy" alt="apache doris bitmap" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-bitmap-da70b0e27411c1ef101d8f48731ba27e.png" width="1280" height="649" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>From CDP 1.0 to CDP 2.0, the insurance company adopts Apache Doris, a unified data warehouse, to replace Spark+Impala+HBase+NebulaGraph. That increases their data processing efficiency by breaking down the data silos and streamlining data processing pipelines. In CDP 3.0 to come, they want to group their customers by combining real-time tags and offline tags for more diversified and flexible analysis. The <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a> and the <a href="https://www.velodb.io" target="_blank" rel="noopener noreferrer">VeloDB</a> team will continue to be a supporting partner during this upgrade.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.0.5 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.5</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.5"/>
<updated>2024-02-28T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, about 217 improvements and bug fixes have been made in Doris 2.0.5 version.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 217 improvements and bug fixes have been made in Doris 2.0.5 version.</p><p><strong>Quick Download:</strong> <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub:</strong> <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-change">Behavior change<a href="#behavior-change" class="hash-link" aria-label="Behavior change的直接链接" title="Behavior change的直接链接"></a></h2><ul><li>change char function behaviour: select char(0) = '\0' return true as MySQL<ul><li><a href="https://github.com/apache/doris/pull/30034" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/30034</a></li></ul></li><li>Allow exporting empty data<ul><li><a href="https://github.com/apache/doris/pull/30703" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/30703</a></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-features">New features<a href="#new-features" class="hash-link" aria-label="New features的直接链接" title="New features的直接链接"></a></h2><ul><li>Eliminate left outer join with is null condition</li><li>Add show-tablets-belong stmt for analyzing a batch of tablet-ids</li><li>InferPredicates support In, such as a = b &amp; a in <!-- -->[1, 2]<!-- --> -&gt; b in <!-- -->[1, 2]</li><li>Optimize plan when column stats are unavailable</li><li>Optimize plan using rollup column stats</li><li>Support analyze materialized view</li><li>Support ShowProcessStmt Show all Fe connection</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvement-and-optimizations">Improvement and optimizations<a href="#improvement-and-optimizations" class="hash-link" aria-label="Improvement and optimizations的直接链接" title="Improvement and optimizations的直接链接"></a></h2><ul><li>Optimize query plan when column stats are unaviable</li><li>Optimize query plan using rollup column stats</li><li>Stop analyze quickly after user close auto analyze</li><li>Catch load column stats exception, avoid print too much stack info to fe.out</li><li>Select materialized view by specify the view name in sql</li><li>Change auto analyze max table width default value to 100</li><li>Escape characters for columns in recovery predicate pushdown in jdbc catalog</li><li>Fix jdbc mysql catalog to_date fun pushdown</li><li>Optimize the close logic of JDBC client</li><li>Optimize jdbc connection pool parameter settings</li><li>Obtain hudi partition information through HMS's API</li><li>Optimize routine load job error msg and memory</li><li>Skip all backup/restore jobs if max allowd option is set to 0</li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/compare/2.0.4-rc06...2.0.5-rc02" target="_blank" rel="noopener noreferrer">github</a>.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="credits">Credits<a href="#credits" class="hash-link" aria-label="Credits的直接链接" title="Credits的直接链接"></a></h2><p>Thanks all who contribute to this release:</p><p>airborne12, alexxing662, amorynan, AshinGau, BePPPower, bingquanzhao, BiteTheDDDDt, ByteYue, caiconghui, cambyzju, catpineapple, dataroaring, eldenmoon, Emor-nj, englefly, felixwluo, GoGoWen, HappenLee, hello-stephen, HHoflittlefish777, HowardQin, JackDrogon, jacktengg, jackwener, 
Jibing-Li, KassieZ, LemonLiTree, liaoxin01, liugddx, LuGuangming, morningman, morrySnow, mrhhsg, Mryange, mymeiyi, nextdreamblue, qidaye, ryanzryu, seawinde, starocean999, TangSiyang2001, vinlee19, w41ter, wangbo, wsjz, wuwenchi, xiaokang, XieJiann, xingyingone, xy720, xzj7019, yujun777, zclllyybb, zhangstar333, zhannngchen, zhiqiang-hhhh, zxealous, zy-kkk, zzzxl1993</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[A financial anti-fraud solution based on the Apache Doris data warehouse]]></title>
<id>https://doris.apache.org/zh-CN/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse</id>
<link href="https://doris.apache.org/zh-CN/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse"/>
<updated>2024-02-22T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Financial fraud prevention is a race against time. This post will get into details about how a retail bank builds their fraud risk management platform based on Apache Doris and how it performs. ]]></summary>
<content type="html"><![CDATA[<p>Financial fraud prevention is a race against time. Implementation-wise, it relies heavily on the data processing power, especially under large datasets. Today I'm going to share with you the use case of a retail bank with over 650 million individual customers. They have compared analytics components including <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a>, ClickHouse, Greenplum, Cassandra, and Kylin. After 5 rounds of deployment and comparsion based on 89 custom test cases, they settled on Apache Doris, because they witnessed a six-fold writing speed and faster multi-table joins in Apache Doris as compared to the mighty ClickHouse.</p><p>I will get into details about how the bank builds their fraud risk management platform based on Apache Doris and how it performs. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="fraud-risk-management-platform">Fraud Risk Management Platform<a href="#fraud-risk-management-platform" class="hash-link" aria-label="Fraud Risk Management Platform的直接链接" title="Fraud Risk Management Platform的直接链接"></a></h2><p>In this platform, <strong>80% of ad-hoc queries</strong> return results in less than <strong>2 seconds,</strong> and <strong>95%</strong> of them are finished in under <strong>5 seconds.</strong> On average, the solution <strong>intercepts tens of thousands of suspicious transactions</strong> every day and <strong>avoids losses of millions of dollars</strong> for bank customers. </p><p>This is an overview of the entire platform from an architectural perspective. </p><p><img loading="lazy" alt="Fraud Risk Management Platform" src="https://cdnd.selectdb.com/zh-CN/assets/images/fraud-risk-management-platform-262b039604139527d92106f9c6a67847.png" width="1280" height="530" class="img_ev3q"></p><p>The <strong>source data</strong> can be roughly categorized as:</p><ul><li>Dimension data: mostly stored in PostgreSQL</li><li>Real-time transaction data: decoupled from various external systems via Kafka message queues</li><li>Offline data: directly ingested from external systems to Hive, making data reconciliation easy</li></ul><p>For <strong>data ingestion</strong>, this is how they collect the three types of source data. First of all, they leverage the <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/jdbc" target="_blank" rel="noopener noreferrer">JDBC Catalog</a> to to synchronize metadata and user data from PostgreSQL. </p><p>The transaction data needs to be combined with dimension data for further analysis. Thus, they employ a Flink SQL API to read dimension data from PostgreSQL, and real-time transaction data from Kafka. Then, in Flink, they do multi-stream joins and generate wide tables. For real-time refreshing of dimension tables, they use a Lookup Join mechanism, which dynamically looks up and refreshes dimension data when processing data streams. They also utilize Java UDFs to serve their specific needs in ETL. After that, they write the data into Apache Doris via the<a href="https://doris.apache.org/docs/ecosystem/flink-doris-connector/" target="_blank" rel="noopener noreferrer"> Flink-Doris-Connector</a>. </p><p>The offline data is cleaned, transformed, and written into Hive, Kafka, and PostgreSQL, for which Doris creates catalogs as mappings, based on its <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> capability, to facilitate federated analysis. 
<p>The transaction data needs to be combined with dimension data for further analysis. Thus, they employ the Flink SQL API to read dimension data from PostgreSQL and real-time transaction data from Kafka. Then, in Flink, they do multi-stream joins and generate wide tables. For real-time refreshing of dimension tables, they use a Lookup Join mechanism, which dynamically looks up and refreshes dimension data when processing data streams. They also utilize Java UDFs to serve their specific needs in ETL. After that, they write the data into Apache Doris via the <a href="https://doris.apache.org/docs/ecosystem/flink-doris-connector/" target="_blank" rel="noopener noreferrer">Flink-Doris-Connector</a>. </p><p>The offline data is cleaned, transformed, and written into Hive, Kafka, and PostgreSQL, for which Doris creates catalogs as mappings, based on its <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> capability, to facilitate federated analysis. In this process, Hive Metastore is in place to access and refresh data from Hive automatically.</p><p>In terms of <strong>data modeling</strong>, they use Apache Doris as a data warehouse and apply different <a href="https://doris.apache.org/docs/data-table/data-model" target="_blank" rel="noopener noreferrer">data models</a> for different layers. Each layer aggregates or rolls up data from the previous layer at a coarser granularity. Eventually, it produces a highly aggregated Rollup or Materialized View. </p><p>Now let me show you what analytics tasks are running on this platform. Based on the scale of monitoring and human involvement, these tasks can be divided into real-time risk reporting, multi-dimensional analysis, federated queries, and auto alerting. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-risk-report">Real-time risk report<a href="#real-time-risk-report" class="hash-link" aria-label="Real-time risk report的直接链接" title="Real-time risk report的直接链接"></a></h2><p>When it comes to fraud prevention, what is diminishing the effectiveness of your anti-fraud efforts? It is incomplete exposure of potential risks and untimely risk identification. That's why people always want real-time, full-scale monitoring and reporting.</p><p>The bank's solution to that is built on Apache Flink and Apache Doris. First of all, they put together 17 dimensions. After cleaning, aggregation, and other computations, they visualize the data on a real-time dashboard. </p><p>As for <strong>scale</strong>, it analyzes the workflows of <strong>over 10 million customers, 30,000 clerks, 10,000 branches, and 1000 products</strong>. </p><p>As for <strong>speed</strong>, the bank now has evolved from next-day data refreshing to near real-time data processing. Targeted analysis can be done within minutes instead of hours. The solution also supports complicated ad-hoc queries to capture underlying risks by monitoring how the data models and rules run. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="multi-dimensional-analysis-to-identify-risks">Multi-dimensional analysis to identify risks<a href="#multi-dimensional-analysis-to-identify-risks" class="hash-link" aria-label="Multi-dimensional analysis to identify risks的直接链接" title="Multi-dimensional analysis to identify risks的直接链接"></a></h2><p>Case tracing is another common anti-fraud practice. The bank has a fraud model library. Based on the fraud models, they analyze the risks of each transaction and visualize the results in near real time, so their staff can take prompt measures if needed. </p><p>For that purpose, they use Apache Doris for <strong>multi-dimensional analysis</strong> of cases. They check the patterns of transactions, including sources, types, and time, for a comprehensive overview. During this process, they often need to combine <strong>over 10 filtering conditions</strong> of different dimensions. This is empowered by the <strong>ad-hoc query</strong> capabilities of Apache Doris. 
Both rule-based matching and list-based matching of cases can be done <strong>within seconds</strong> without manual effort.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="federated-queries-to-locate-risk-details">Federated queries to locate risk details<a href="#federated-queries-to-locate-risk-details" class="hash-link" aria-label="Federated queries to locate risk details的直接链接" title="Federated queries to locate risk details的直接链接"></a></h2><p>Apart from identifying risks from each transaction, the bank also receives risk reports from customers. In these cases, the corresponding transaction will be labeled as "risky", and it will be categorized and recorded in the ticketing system. The labels make sure that the high-risk transactions are promptly attended to. </p><p>One problem is that the ticketing system is overloaded with such data, so it is not able to directly present all the details of the risky transactions. What needs to be done is to relate the tickets to the transaction details so the bank staff can locate the actual risks. </p><p>How is that implemented? Every day, Apache Doris traverses the incremental tickets and the basic information table to get the ticket IDs, and then it relates the ticket IDs to the dimension data stored in Doris itself. In the end, the ticket details are presented at the frontend of Doris. This entire process takes <strong>only a few minutes</strong>, a big game changer compared to the old days when they had to look up suspicious transactions manually.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="auto-alerting">Auto alerting<a href="#auto-alerting" class="hash-link" aria-label="Auto alerting的直接链接" title="Auto alerting的直接链接"></a></h2><p>Based on Apache Doris, the bank designs their own alerting rules, models, and strategies. The system monitors how everything runs. Once it detects a situation that matches the alert rules, it will trigger an alarm. They have also established a real-time feedback mechanism for the alerting rules, so if a newly added rule causes any negative effects, it will be adjusted or removed rapidly. </p><p>So far, the bank has added nearly 100 alerting rules for various risk types to the system. During the past two months, <strong>over 100 alarms</strong> were issued with <strong>over 95% accuracy</strong>, each in less than <strong>5 seconds</strong> after the risk situation arose. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>For a comprehensive anti-fraud solution, the bank conducts full-scale real-time monitoring and reporting for all their data workflows. Then, for each transaction, they look into its multiple dimensions to identify risks. For the suspicious transactions reported by bank customers, they perform federated queries to retrieve the full details. Also, an auto alerting mechanism is always on patrol to safeguard the whole system. These are the various types of analytic workloads in this solution, and their implementation relies on the capabilities of Apache Doris, a data warehouse designed to be an all-in-one platform for various workloads. If you are trying to build your own anti-fraud solution, the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris open source developers</a> are happy to exchange ideas with you.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[A deep dive into inverted index: how it speeds up text searches by 40 times]]></title>
<id>https://doris.apache.org/zh-CN/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris</id>
<link href="https://doris.apache.org/zh-CN/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris"/>
<updated>2024-02-01T00:00:00.000Z</updated>
<summary type="html"><![CDATA[As an open-source real-time data warehouse, Apache Doris provides a rich choice of indexes to speed up data scanning and filtering.This post is a deep dive into inverted index and NGram BloomFilter index, providing a hands-on guide to applying them for various queries.]]></summary>
<content type="html"><![CDATA[<p>As an open-source real-time data warehouse, <a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">Apache Doris</a> provides a rich choice of indexes to speed up data scanning and filtering. Based on user involvement, they can be divided into built-in smart indexes and user-created indexes. The former is automatically generated by Apache Doris on data ingestion, such as ZoneMap index and prefix index, while the latter is the index users choose for various use cases, including inverted index and NGram BloomFilter index.</p><p>This post is a deep dive into inverted index and NGram BloomFilter index, providing a hands-on guide to applying them for various queries.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="sample-dataset">Sample dataset<a href="#sample-dataset" class="hash-link" aria-label="Sample dataset的直接链接" title="Sample dataset的直接链接"></a></h2><p>The test dataset comprises about 130 million Amazon customer reviews. It is a few Snappy-compressed Parquet files with a total size of 37GB. These are a few samples:</p><p><img loading="lazy" alt="img" src="https://cdnd.selectdb.com/zh-CN/assets/images/sample-dataset-0a343d2a31f6f2afd8617577cf9e8823.png" width="4733" height="557" class="img_ev3q"></p><p>Each row includes 15 columns including <code>customer_id</code>, <code>review_id</code>, <code>product_id</code>, <code>product_category</code>, <code>star_rating</code>, <code>review_headline</code>, and <code>review_body</code>. </p><p>A lot of these columns can be accelerated by indexes based on their structures. For example, <code>customer_id</code> is a high-cardinality numerical field while <code>product_id</code> is a low-cardinality fixed-length text field, and <code>product_title</code> and <code>review_body</code> are short and long text fields, respectively.</p><p>Queries on these columns can be roughly divided into two types:</p><ul><li><strong>Text searches</strong>: searches for certain contents in the <code>review_body</code> field.</li><li><strong>Non-primary key column queries</strong>: query reviews about certain <code>product_id</code> or from certain <code>customer_id</code>.</li></ul><p>These are also the main threads of this article. 
I will present to you how indexes can speed up these queries.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="prerequisites">Prerequisites<a href="#prerequisites" class="hash-link" aria-label="Prerequisites的直接链接" title="Prerequisites的直接链接"></a></h2><p>For a quick run, here we use a single-node cluster (1 frontend, 1 backend).</p><ol><li>Deploy Apache Doris: refer to <a href="https://doris.apache.org/docs/get-starting/quick-start/" target="_blank" rel="noopener noreferrer">Quick Start</a></li><li>Create a table using the following statements: </li></ol><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE `amazon_reviews` ( </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `review_date` int(11) NULL, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `marketplace` varchar(20) NULL, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `customer_id` bigint(20) NULL, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `review_id` varchar(40) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `product_id` varchar(10) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `product_parent` bigint(20) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `product_title` varchar(500) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `product_category` varchar(50) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `star_rating` smallint(6) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `helpful_votes` int(11) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `total_votes` int(11) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `vine` boolean NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `verified_purchase` boolean NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `review_headline` varchar(500) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `review_body` string NULL</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) ENGINE=OLAP</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DUPLICATE KEY(`review_date`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">COMMENT 'OLAP'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH(`review_date`) BUCKETS 16</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"replication_allocation" = "tag.location.default: 1",</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">"compression" = "ZSTD"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol start="3"><li>Download datasets: Snappy-compressed Parquet files with a total size of 37GB</li></ol><ul><li><a href="https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2010.snappy.parquet" target="_blank" rel="noopener noreferrer">amazon_reviews_2010</a></li><li><a href="https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2011.snappy.parquet" target="_blank" rel="noopener noreferrer">amazon_reviews_2011</a></li><li><a href="https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2012.snappy.parquet" target="_blank" rel="noopener noreferrer">amazon_reviews_2012</a></li><li><a href="https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2013.snappy.parquet" target="_blank" rel="noopener noreferrer">amazon_reviews_2013</a></li><li><a href="https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2014.snappy.parquet" target="_blank" rel="noopener noreferrer">amazon_reviews_2014</a> </li><li><a href="https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet" target="_blank" rel="noopener noreferrer">amazon_reviews_2015</a></li></ul><ol start="4"><li>Execute the following commands to load the datasets</li></ol><div class="language-Bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Bash codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -T amazon_reviews_2010.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -T amazon_reviews_2011.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -T amazon_reviews_2012.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -T amazon_reviews_2013.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -T amazon_reviews_2014.snappy.parquet -H "format:parquet" 
http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">curl --location-trusted -u root: -T amazon_reviews_2015.snappy.parquet -H "format:parquet" http://${BE_IP}:${BE_PORT}/api/${DB}/amazon_reviews/_stream_load</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol start="5"><li>Check and verify: After the above steps, execute the following statements in the MySQL client to check and see the size of the dataset. It can be seen from below that 135589433 rows are loaded and they take up 25.873GB in Apache Doris, which is 30% smaller than the original Parquet files. </li></ol><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; SELECT COUNT() FROM amazon_reviews;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-----------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| count(*) |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-----------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 135589433 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-----------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.02 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; SHOW DATA FROM amazon_reviews;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------------+----------------+-----------+--------------+-----------+------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| TableName | IndexName | Size | ReplicaCount | RowCount | RemoteSize |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------------+----------------+-----------+--------------+-----------+------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| amazon_reviews | amazon_reviews | 25.873 GB | 16 | 135589433 | 0.000 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| | Total | 25.873 GB | 16 | | 0.000 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------------+----------------+-----------+--------------+-----------+------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">2 rows in set (0.00 sec)</span><br></span></code></pre><div 
class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="accelerate-text-searches">Accelerate text searches<a href="#accelerate-text-searches" class="hash-link" aria-label="Accelerate text searches的直接链接" title="Accelerate text searches的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="no-index">No index<a href="#no-index" class="hash-link" aria-label="No index的直接链接" title="No index的直接链接"></a></h3><p>Now let's try running text searches on the <code>review_body</code> field. Specifically, we're trying to retrieve the top 5 products whose reviews include the keywords "is super awesome". The results should be sorted in descending order based on the number of reviews. Each result should include the product ID, a randomly selected product title, the average star rating, and the total number of reviews. </p><p>This is the query statement:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> any(product_title),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AVG(star_rating) AS rating,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> COUNT() AS count</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> amazon_reviews</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> review_body LIKE '%is super awesome%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count DESC,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> rating DESC,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">LIMIT 5;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg 
viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Since the <code>review_body</code> field contains lengthy reviews, such text searches can be time-consuming. Without enabling any indexes, it took <strong>7.6 seconds</strong> to return the results: </p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">+------------+------------------------------------------+--------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| product_id | any_value(product_title) | rating | count |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+------------+------------------------------------------+--------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B00992CF6W | Minecraft | 4.8235294117647056 | 17 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B009UX2YAC | Subway Surfers | 4.7777777777777777 | 9 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 4.875 | 8 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B0086700CM | Temple Run | 5 | 6 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B00KWVZ750 | Angry Birds Epic RPG | 5 | 6 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+------------+------------------------------------------+--------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">5 rows in set (7.60 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="ngram-bloomfilter-index">Ngram BloomFilter index<a href="#ngram-bloomfilter-index" class="hash-link" aria-label="Ngram BloomFilter index的直接链接" title="Ngram BloomFilter index的直接链接"></a></h3><p>Now, let's try accelerating such text searches using the Ngram BloomFilter index.</p><ul><li><code>gram_size</code>: the value of "N" in "Ngram", representing the length of consecutive characters. 
<pre><code>ALTER TABLE amazon_reviews ADD INDEX review_body_ngram_idx(review_body) USING NGRAM_BF PROPERTIES("gram_size"="10", "bf_size"="10240");
</code></pre>
<p>This time, the query finishes in <strong>0.93 seconds</strong>, so the Ngram BloomFilter index brings a roughly <strong>8-fold</strong> speedup.</p>
<pre><code>+------------+------------------------------------------+--------------------+-------+
| product_id | any_value(product_title)                 | rating             | count |
+------------+------------------------------------------+--------------------+-------+
| B00992CF6W | Minecraft                                | 4.8235294117647056 |    17 |
| B009UX2YAC | Subway Surfers                           | 4.7777777777777777 |     9 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 4.875              |     8 |
| B0086700CM | Temple Run                               | 5                  |     6 |
| B00KWVZ750 | Angry Birds Epic RPG                     | 5                  |     6 |
+------------+------------------------------------------+--------------------+-------+
5 rows in set (0.93 sec)
</code></pre>
viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>So how does Ngram BloomFilter do the magic?</strong> The way it works can be explained in two parts.</p><ul><li><strong>Ngram tokenization</strong>: When <code>gram_size=5</code>, the phrase "hello world" is split into <!-- -->["hello", "ello ", "llo w", "lo wo", "o wor", " worl", "world"]<!-- -->. These sub-strings are then hashed and added to a BloomFilter of the <code>bf_size</code>. Since data in Apache Doris is stored by page, the BloomFilters are generated also by page. </li><li><strong>Query acceleration</strong>: For example, to query the word "hello" in the texts, "hello" is tokenized and compared with the BloomFilters of each page. If the BloomFilter detects a potential match (there might be false positives) in a page, that page is loaded for further matching. Otherwise, that page is skipped. </li></ul><p>By skipping the irrelevant pages, the BloomFilter index reduces unnecessary data scanning and thus greatly reduces query latency.</p><p><img loading="lazy" alt="img" src="https://cdnd.selectdb.com/zh-CN/assets/images/data-storage-structure-in-apache-doris-da2f97a4dbdbfbe800121484566d7d25.png" width="1280" height="644" class="img_ev3q"></p><div style="text-align:center"> Data storage structure in Apache Doris </div><p><img loading="lazy" alt="img" src="https://cdnd.selectdb.com/zh-CN/assets/images/illustration-of-ngram-bloomfilter-d0db39fe1aee6af22cf7ad2949396c3b.png" width="1280" height="697" class="img_ev3q"></p><div style="text-align:center"> Illustration of Ngram BloomFilter </div><p><strong>How to find the optimal parameter configurations for Ngram BloomFilter?</strong></p><p><code>gram_size</code> determines the matching efficiency, while <code>bf_size</code> impacts the false positive rate. Typically, a large <code>bf_size</code> reduces the false positive rate but also requires more storage space. Thus, we suggest that you configure these two parameters based on these two factors: </p><ol><li><p>Text length:</p><ul><li>For short texts (words or phrases), a small <code>gram_size</code> (2~4) and a small <code>bf_size</code> are recommended.</li><li>For long texts (sentences or paragraphs), a large <code>gram_size</code> (5~10) and a large <code>bf_size</code> work better.</li></ul></li><li><p>Query pattern: </p><ul><li>If the queries often involve phrases or complete words, a large <code>gram_size</code> will be more efficient.</li><li>For fuzzy matching or diverse queries, a small <code>gram_size</code> allows more flexible matching.</li></ul></li></ol><h3 class="anchor anchorWithStickyNavbar_LWe7" id="inverted-index">Inverted index<a href="#inverted-index" class="hash-link" aria-label="Inverted index的直接链接" title="Inverted index的直接链接"></a></h3><p><a href="https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch" target="_blank" rel="noopener noreferrer">Inverted index</a> is another way to accelerate text searches. 
<p><img src="https://cdnd.selectdb.com/zh-CN/assets/images/data-storage-structure-in-apache-doris-da2f97a4dbdbfbe800121484566d7d25.png" alt="Data storage structure in Apache Doris"></p>
<div style="text-align:center"> Data storage structure in Apache Doris </div>
<p><img src="https://cdnd.selectdb.com/zh-CN/assets/images/illustration-of-ngram-bloomfilter-d0db39fe1aee6af22cf7ad2949396c3b.png" alt="Illustration of Ngram BloomFilter"></p>
<div style="text-align:center"> Illustration of Ngram BloomFilter </div>
<p><strong>How do you find the optimal parameter configuration for the Ngram BloomFilter index?</strong></p>
<p><code>gram_size</code> determines matching efficiency, while <code>bf_size</code> affects the false positive rate. A larger <code>bf_size</code> lowers the false positive rate but requires more storage space. We suggest configuring the two parameters based on two factors:</p>
<ol>
<li><p>Text length:</p><ul><li>For short texts (words or phrases), a small <code>gram_size</code> (2~4) and a small <code>bf_size</code> are recommended.</li><li>For long texts (sentences or paragraphs), a large <code>gram_size</code> (5~10) and a large <code>bf_size</code> work better.</li></ul></li>
<li><p>Query pattern:</p><ul><li>If queries often involve complete words or phrases, a large <code>gram_size</code> is more efficient.</li><li>For fuzzy matching or diverse queries, a small <code>gram_size</code> allows more flexible matching.</li></ul></li>
</ol>
<h3 id="inverted-index">Inverted index</h3>
<p><a href="https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch">Inverted index</a> is another way to accelerate text searches, and creating one is simple:</p>
<ol>
<li><p><strong>Add an inverted index</strong>: the first statement below creates an inverted index on the <code>review_body</code> column of the <code>amazon_reviews</code> table. The index supports phrase searching, in which the order of the tokenized words affects the search results.</p></li>
<li><p><strong>Build the index for historical data</strong>: <code>ALTER TABLE ... ADD INDEX</code> only takes effect for newly written data, so the second statement, <code>BUILD INDEX</code>, builds the index for the existing data.</p></li>
</ol>
<pre><code>ALTER TABLE amazon_reviews ADD INDEX review_body_inverted_idx(`review_body`)
    USING INVERTED PROPERTIES("parser" = "english","support_phrase" = "true");
BUILD INDEX review_body_inverted_idx ON amazon_reviews;
</code></pre>
<ol start="3"><li><strong>Check and verify</strong>: you can inspect the index build job with the following statement:</li></ol>
class="token plain">review_body</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) USING INVERTED PROPERTIES("parser" = "english", "support_phrase" = "true")], | 2024-01-23 15:42:28.658 | 2024-01-23 15:48:42.990 | 11 | FINISHED | | NULL |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-------+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.00 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you want to see how tokenization works, you can test with the <code>TOKENIZE</code> function. Just input the text that needs to be tokenized and the parameters: </p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; SELECT TOKENIZE('I can honestly give the shipment and package 100%, it came in time that it was supposed to with no hasels, and the book was in PERFECT condition.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">super awesome buy, and excellent for my college classs', '"parser" = "english","support_phrase" = "true"');</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| tokenize('I can honestly give the shipment and package 100%, it came in time that it was supposed to with no hasels, and the book was in PERFECT condition. 
<pre><code>mysql> SELECT TOKENIZE('I can honestly give the shipment and package 100%, it came in time that it was supposed to with no hasels, and the book was in PERFECT condition.
super awesome buy, and excellent for my college classs', '"parser" = "english","support_phrase" = "true"');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize('I can honestly give the shipment and package 100%, it came in time that it was supposed to with no hasels, and the book was in PERFECT condition. super awesome buy, and excellent for my college classs', '"parser" = "english","support_phrase" = "true"') |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ["i", "can", "honestly", "give", "the", "shipment", "and", "package", "100", "it", "came", "in", "time", "that", "it", "was", "supposed", "to", "with", "no", "hasels", "and", "the", "book", "was", "in", "perfect", "condition", "super", "awesome", "buy", "and", "excellent", "for", "my", "college", "classs"] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.05 sec)
</code></pre>
<p>With the inverted index in place, we can now retrieve the customer reviews containing "is super awesome" using <code>MATCH_PHRASE</code>:</p>
</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> any(product_title),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AVG(star_rating) AS rating,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> COUNT() AS count</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> amazon_reviews</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> review_body MATCH_PHRASE 'is super awesome'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count DESC,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> rating DESC,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">LIMIT 5;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The clause <code>review_body MATCH_PHRASE 'is super awesome'</code> searches for text fragments in the <code>review_body</code> column that contains all three keywords "is", "super", and "awesome" in that exact order, with no other words in between. </p><p>The <code>MATCH</code> query is case-insensitive, which is also what sets it apart from the <code>LIKE</code> query. 
<p>The results show that the inverted index cuts query latency to <strong>0.19 seconds</strong>: about <strong>4 times faster</strong> than the Ngram BloomFilter index, and <strong>nearly 40 times faster</strong> than no index at all.</p>
<pre><code>+------------+------------------------------------------+-------------------+-------+
| product_id | any_value(product_title)                 | rating            | count |
+------------+------------------------------------------+-------------------+-------+
| B00992CF6W | Minecraft                                | 4.833333333333333 |    18 |
| B009UX2YAC | Subway Surfers                           | 4.7               |    10 |
| B00DJFIMW6 | Minion Rush: Despicable Me Official Game | 5                 |     7 |
| B0086700CM | Temple Run                               | 5                 |     6 |
| B00KWVZ750 | Angry Birds Epic RPG                     | 5                 |     6 |
+------------+------------------------------------------+-------------------+-------+
5 rows in set (0.19 sec)
</code></pre>
<p><strong>How does the inverted index make this possible?</strong></p>
<p>The inverted index splits the texts into words and maps each word to a row number. The tokenized words are then sorted alphabetically, and a skip list index is created over them. When a query looks up specific words, the system locates them in this ordered mapping using the skip list index and binary search, and then retrieves the full records by the matched row numbers.</p>
<p>This approach avoids row-by-row matching and reduces the computational complexity from O(n) to O(log n). That is how the inverted index speeds up queries on large datasets.</p>
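<p>The following toy Python sketch models the lookup path just described, including the phrase semantics of <code>MATCH_PHRASE</code>. It is a simplification, not Doris's on-disk format: a sorted term list stands in for the skip list index, <code>bisect</code> provides the O(log n) binary search, and posting lists store (row, position) pairs so that "is super awesome" only matches rows where the three tokens are consecutive.</p>
<pre><code># inverted_index_sketch.py: a toy model of the lookup path described above,
# not Doris's on-disk format.
import bisect
from collections import defaultdict

rows = ["This game is SUPER awesome",
        "Awesome value, super fast shipping",
        "It is super awesome and addictive"]

postings = defaultdict(list)          # token: [(row_id, position), ...]
for row_id, text in enumerate(rows):
    # Lowercase during tokenization: this is why MATCH is case-insensitive.
    for pos, tok in enumerate(text.lower().replace(",", "").split()):
        postings[tok].append((row_id, pos))

terms = sorted(postings)              # the sorted term dictionary

def lookup(token):
    i = bisect.bisect_left(terms, token)   # binary search, O(log n)
    if i != len(terms) and terms[i] == token:
        return postings[token]
    return []

def match_phrase(tokens):
    # A row matches only if the tokens appear at consecutive positions,
    # which is the MATCH_PHRASE semantics shown above.
    hits = set()
    for row_id, pos in lookup(tokens[0]):
        if all((row_id, pos + k) in set(lookup(t))
               for k, t in enumerate(tokens[1:], start=1)):
            hits.add(row_id)
    return sorted(hits)

print(match_phrase(["is", "super", "awesome"]))   # [0, 2]; row 1 is excluded
</code></pre>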
</p><p><img loading="lazy" alt="img" src="https://cdnd.selectdb.com/zh-CN/assets/images/illustration-of-inverted-index-460ac7b75f89211aeba6c32af670cccd.png" width="1280" height="744" class="img_ev3q"></p><div style="text-align:center"> Illustration of Inverted Index </div><p>To provide a deeper understanding of inverted index, I will start from its read/write logic. In Doris, logically, inverted index is applied at the column level of a table. However, from a physical storage and implementation perspective, it is actually built on data files. </p><ul><li><strong>Writing</strong>: When data is written to a data file, it is also synchronously written to the inverted index file, and the row numbers are matched.</li><li><strong>Query</strong>: In a query, if the <code>WHERE</code> condition involves a column for which an inverted index has been built, Doris will go directly to the index file and returns the corresponding row numbers. Then, based on the row numbers, it skips the irrelevant pages and rows and only reads the target rows. </li></ul><p>In short, inverted index enables high-speed text searches by mapping, and its implementation relies on the coordination of data files and index files.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="accelerate-non-primary-key-column-queries">Accelerate non-primary key column queries<a href="#accelerate-non-primary-key-column-queries" class="hash-link" aria-label="Accelerate non-primary key column queries的直接链接" title="Accelerate non-primary key column queries的直接链接"></a></h2><p>To showcase the impact of inverted index on non-primary key column queries, let's try some multi-dimensional queries.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="no-index-1">No index<a href="#no-index-1" class="hash-link" aria-label="No index的直接链接" title="No index的直接链接"></a></h3><p>Retrieve the review from Customer ID 13916588 about Product ID B002DMK1R0. Without indexes, the system has to scan the entire table. 
<p>The query takes <strong>1.81 seconds</strong> to finish:</p>
<pre><code>mysql> SELECT product_title,review_headline,review_body,star_rating
FROM amazon_reviews
WHERE product_id='B002DMK1R0' AND customer_id=13916588;
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| product_title                                                   | review_headline      | review_body                                                                                                                 | star_rating |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| Magellan Maestro 4700 4.7-Inch Bluetooth Portable GPS Navigator | Nice Features But... | This is a great GPS. Gets you where you are going. Don't forget to buy the seperate (grr!) cord for the traffic kit though! |           4 |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
1 row in set (1.81 sec)
</code></pre>
<h3 id="inverted-index-1">Inverted index</h3>
<p>This query is executed differently from the text searches above: <code>product_id</code> and <code>customer_id</code> do not need tokenization, so the system builds a Value→RowID inverted index table for each column.</p>
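<p>Conceptually, the lookup then works like the following Python sketch (a toy model, not Doris internals): each indexed column keeps an untokenized value-to-row-ID mapping, and the <code>AND</code> predicate becomes a set intersection, so only the surviving rows are ever read from storage.</p>
<pre><code># point_query_sketch.py: a toy picture of the Value->RowID idea above,
# not Doris internals.
from collections import defaultdict

table = [  # (customer_id, product_id, review_headline)
    (13916588, "B002DMK1R0", "Nice Features But..."),
    (13916588, "B0086700CM", "Fun game"),
    (99999999, "B002DMK1R0", "Solid GPS"),
]

by_product = defaultdict(set)   # value -> set of row IDs
by_customer = defaultdict(set)
for row_id, (customer, product, _) in enumerate(table):
    by_product[product].add(row_id)
    by_customer[customer].add(row_id)

# WHERE product_id = 'B002DMK1R0' AND customer_id = 13916588
hits = by_product["B002DMK1R0"] & by_customer[13916588]
for row_id in sorted(hits):     # only these rows are fetched from the table
    print(table[row_id])        # prints the single matching row
</code></pre>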
<p>First, create the inverted indexes via the following statements:</p>
<pre><code>ALTER TABLE amazon_reviews ADD INDEX product_id_inverted_idx(product_id) USING INVERTED;
ALTER TABLE amazon_reviews ADD INDEX customer_id_inverted_idx(customer_id) USING INVERTED;
BUILD INDEX product_id_inverted_idx ON amazon_reviews;
BUILD INDEX customer_id_inverted_idx ON amazon_reviews;
</code></pre>
<p>With the inverted indexes, the same query finishes in <strong>0.06 seconds</strong>, a <strong>30-fold</strong> speedup over the previous 1.81 seconds.</p>
<pre><code>mysql> SELECT product_title,review_headline,review_body,star_rating FROM amazon_reviews WHERE product_id='B002DMK1R0' AND customer_id='13916588';
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| product_title                                                   | review_headline      | review_body                                                                                                                 | star_rating |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
| Magellan Maestro 4700 4.7-Inch Bluetooth Portable GPS Navigator | Nice Features But... | This is a great GPS. Gets you where you are going. Don't forget to buy the seperate (grr!) cord for the traffic kit though! |           4 |
+-----------------------------------------------------------------+----------------------+-----------------------------------------------------------------------------------------------------------------------------+-------------+
1 row in set (0.06 sec)
</code></pre>
<h3 id="profile">Profile</h3>
<p>Below is an excerpt of the SegmentIterator profile, which shows why the inverted index accelerates query execution.</p>
<p>(Note: to check the profile of a query, execute <code>SET enable_profile=true;</code> in the MySQL client before running the query, then open <em>http://FE_IP:FE_HTTP_PORT/QueryProfile</em>.)</p>
<pre><code>SegmentIterator:
  - FirstReadSeekCount: 0
  - FirstReadSeekTime: 0ns
  - FirstReadTime: 13.119ms
  - IOTimer: 19.537ms
  - InvertedIndexQueryTime: 11.583ms
  - RawRowsRead: 1
  - RowsConditionsFiltered: 0
  - RowsInvertedIndexFiltered: 16.907403M (16907403)
  - RowsShortCircuitPredInput: 0
  - RowsVectorPredFiltered: 0
  - RowsVectorPredInput: 0
  - ShortPredEvalTime: 0ns
  - TotalPagesNum: 27
  - UncompressedBytesRead: 3.71 MB
  - VectorPredEvalTime: 0ns
</code></pre>
style="color:#F8F8F2"><span class="token plain"> - UncompressedBytesRead: 3.71 MB</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - VectorPredEvalTime: 0ns</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><code>RowsInvertedIndexFiltered: 16.907403M (16907403)</code> and <code>RawRowsRead: 1</code> means that the inverted index has filtered out 16907403 rows and only reads 1 row (the target row). <code>FirstReadTime: 13.119ms</code> means that it takes 13.119 ms to read the page where the target row is located, and <code>InvertedIndexQueryTime: 11.583ms</code> means that the system <strong>filters out 16907403 rows within only 11.58 ms</strong>. </p><p>For comparision, this is the SegmentIterator Profile when no index is used:</p><div class="language-YAML codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-YAML codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">SegmentIterator:</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - FirstReadSeekCount: 9.374K (9374)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - FirstReadSeekTime: 400.522ms</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - FirstReadTime: 3s144ms</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - IOTimer: 2s564ms</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - InvertedIndexQueryTime: 0ns</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RawRowsRead: 16.680706M (16680706)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RowsConditionsFiltered: 226.698K (226698)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RowsInvertedIndexFiltered: 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RowsShortCircuitPredInput: 1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RowsVectorPredFiltered: 16.680705M (16680705)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RowsVectorPredInput: 16.680706M (16680706)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - RowsZonemapFiltered: 226.698K (226698)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - ShortPredEvalTime: 2.723ms</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - TotalPagesNum: 5.421K (5421)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> - 
<p>Without an inverted index, it takes 3.14 seconds just to load 16,680,706 rows (<code>FirstReadTime: 3s144ms</code>). The system then filters them through predicate evaluation, screening out 16,680,705 rows. The conditional filtering itself takes less than 10 ms, so loading the raw data is by far the most time-consuming step.</p>
<p>To sum up, the inverted index improves query efficiency by retrieving the target rows directly and thereby avoiding unnecessary data loading.</p>
<h2 id="accelerate-low-cardinality-text-column-queries">Accelerate low-cardinality text column queries</h2>
<p>So the inverted index is a big accelerator for queries on high-cardinality text columns, but that might raise a concern: for low-cardinality columns, will the extra index bring excessive overhead and undermine query performance?</p>
<p>The answer is: no. Let me show you why and how. The following example uses <code>product_category</code> as the predicate column for filtering.</p>
</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; SELECT COUNT(DISTINCT product_category) FROM amazon_reviews ;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| count(DISTINCT product_category) |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 43 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------------------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.57 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>As is shown, the <code>product_category</code> column has only 43 distinct categories, making it a typical low-cardinality text column. Now, let's add inverted index to it.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">ALTER TABLE amazon_reviews ADD INDEX product_category_inverted_idx(`product_category`) USING INVERTED;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">BUILD INDEX product_category_inverted_idx ON amazon_reviews;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>After adding inverted index, run the following SQL query to retrieve the top 3 products with the most reviews in the "Mobile_Electronics" product category. 
</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_id,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_title,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AVG(star_rating) AS rating,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> any(review_body),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> any(review_headline),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> COUNT(*) AS count </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> amazon_reviews </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_category = 'Mobile_Electronics' </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> product_title, product_id </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count DESC </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">LIMIT 10;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>With inverted index, the query takes 1.54s to finish.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| product_id | product_title | rating | any_value(review_body) | any_value(review_headline) | count |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White) | 4.3766233766233764 | Great cable and works well. Exact fit as Apple cable. I would recommend this to anyone who is looking to save money and for a quality cable. | Apple certified lightning cable | 1078 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B004911E9M | Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod | 2.4281805745554035 | A total waste of money for me because I needed it for a iPhone 4. The plug will only go in upside down and thus won't work at all. | Won't work with a iPhone 4! | 731 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B002D4IHYM | New Trent Easypak 7000mAh Portable Triple USB Port External Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB cable) | 4.5216095380029806 | I bought this product based on the reviews that i read and i am very glad that i did. I did have a problem with the product charging my itouch after i received it but i emailed the company and they corrected the problem immediately. VERY GOOD customer service, very prompt. The product itself is very good. It charges my power hungry itouch very quickly and the imax battery power lasts for a long time. All in all a very good purchase that i would recommend to anyone who owns an itouch. 
<p>Now let's run the same query without the inverted index; it takes 1.8 seconds to finish. (You can disable inverted index lookups simply by executing <code>set enable_inverted_index_query=false;</code> in the MySQL client.)</p>
plain">+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B00J46XO9U | iXCC Lightning Cable 3ft, iPhone charger, for iPhone X, 8, 8 Plus, 7, 7 Plus, 6s, 6s Plus, 6, 6 Plus, SE 5s 5c 5, iPad Air 2 Pro, iPad mini 2 3 4, iPad 4th Gen [Apple MFi Certified](Black and White) | 4.3766233766233764 | These cables are great. They feel quality, and best of all, they work as they should. I have no issues with them whatsoever and will be buying more when needed. | Just like the original from Apple | 1078 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B004911E9M | Wall AC Charger USB Sync Data Cable for iPhone 4, 3GS, and iPod | 2.4281805745554035 | I ordered two of these chargers for an Iphone 4. Then I started experiencing weird behavior from the touch screen. It would select the wrong area of the screen, or it would refuse to scroll beyond a certain point and jump back up to the top of the page. This behavior occurs whenever either of the two that I bought are attached and charging. When I remove them, it works fine once again. Needless to say, these items are being returned. | Beware - these chargers are defective | 731 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| B002D4IHYM | New Trent Easypak 7000mAh Portable Triple USB Port External Battery Charger/Power Pack for Smartphones, Tablets and more (w/built-in USB cable) | 4.5216095380029806 | I received this in the mail 4 days ago, and after charging it for 6 hours, I've been using it as the sole source for recharging my 3Gs to see how long it would work. I use my Iphone A LOT every day and usually by the time I get home it's down to 50% or less. After 4 days of using the IMAX to recharge my Iphone, it finally went from 3 bars to 4 this afternoon when I plugged my iphone in. It charges the iphone very quickly, and I've been topping my phone off (stopping around 95% or so) twice a day. This is a great product and the size is very similar to a deck of cards (not like an iphone that someone else posted) and is very easy to carry in a jacket pocket or back pack. I bought this for a 4 day music festival I'm going to, and I have no worries at all of my iphone running out of juice! | FANTASTIC product! 
| 671 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">3 rows in set (1.80 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>To sum up, inverted index can bring a 15% speedup for queries on low-cardinality columns. So it is not only harmless but also beneficial to low-cardinality data filtering.</p><p>In addition, Apache Doris adopts effective dictionary encoding and compression for low-cardinality columns. It also utilizes built-in indexes like ZoneMap for filtering. Thus, it can deliver ideal query performance even without inverted indexes.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>Inverted index in Apache Doris optimizes data filtering based on the predicate column (the <code>WHERE</code> clause in SQL queries). It reduces unnecessary data scanning and significantly increases query speed on high-cardinality columns and guarantees no negative effects on low-cardinality columns. It supports lightweight index management, including ADD/DROP INDEX and BUILD INDEX. It can be easily enabled or disabled via <code>enable_inverted_index_query=true/false</code>. </p><p>Inverted index and NGram BloomFilter index apply to different scenarios. This is how you decide which one is the optimal choice: </p><ul><li><strong>Non-primary key column queries</strong>: These cases often involve widely scattered values and a low hit rate. <strong>Inverted index</strong> can work in conjunction with the built-in smart indexes in Doris to accelerate these queries. 
It has well-established support for scalar data types including strings, numerics, and datetime.</li><li><strong>Text searches on short texts</strong>: If the dataset includes short texts that are highly diverse, <strong>NGram BloomFilter</strong> will be an effective choice for fuzzy matching (<code>LIKE</code>). If the short texts are very similar (with lots of identical content), <strong>inverted index</strong> will be more efficient because it ensures a smaller dictionary and faster retrieval of the row numbers. </li><li><strong>Text searches on long texts</strong>: Inverted index is a better choice for long texts. Compared to brute-force string matching, it largely reduces CPU resource consumption (see the sketch after this list). </li></ul>
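<p>To make the choice concrete, here is a minimal sketch of how the two index types are declared and managed. The table name, column names, and property values are hypothetical; adapt them to your own schema.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- A hypothetical log table combining both index types</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE app_log</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    `ts` DATETIME,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    `status` VARCHAR(16),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    `url` VARCHAR(256),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    `msg` TEXT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    INDEX idx_msg (`msg`) USING INVERTED PROPERTIES("parser" = "english"),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    INDEX idx_url (`url`) USING NGRAM_BF PROPERTIES("gram_size" = "3", "bf_size" = "256")</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DUPLICATE KEY(`ts`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH(`ts`) BUCKETS 10;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Lightweight index management: add an index later and build it for existing data</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ALTER TABLE app_log ADD INDEX idx_status (`status`) USING INVERTED;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">BUILD INDEX idx_status ON app_log;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Enable or disable inverted index usage per session</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SET enable_inverted_index_query = true;</span><br></span></code></pre></div></div><p>Inverted index has been available in Apache Doris for almost a year and has stood the test of many users' production environments with massive data. For inverted index in future versions of Apache Doris, we plan to add support for:</p><ul><li><strong>Self-defined tokenization</strong>: Users will be able to provide their own tokenizers to fit different use cases.</li><li><strong>More data types</strong>: Users will be able to create inverted indexes for complex data types including Array and Map.</li></ul><p>If you encounter any issues while trying it out in Apache Doris or would like to know more details, join our <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a> community and talk to us!</p>]]></content>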
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.0.4 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.4</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.4"/>
<updated>2024-01-26T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, 333 improvements and bug fixes have been made in Doris 2.0.4.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 333 improvements and bug fixes have been made in Doris 2.0.4 version.</p><p><strong>Quick Download</strong> : <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><p><strong>GitHub</strong> : <a href="https://github.com/apache/doris/releases" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-change">Behavior change<a href="#behavior-change" class="hash-link" aria-label="Behavior change的直接链接" title="Behavior change的直接链接"></a></h2><ul><li><p>More reasonable and accurate precision and scale inference for decimal data type</p><ul><li><a href="https://github.com/apache/doris/pull/28034" target="_blank" rel="noopener noreferrer">[improvement](decimal) use new way for decimal arithmetic precision promotion</a></li></ul></li><li><p>Support drop policy for user or role</p><ul><li><a href="https://github.com/apache/doris/pull/29488" target="_blank" rel="noopener noreferrer">[fix](polixy)support drop policy for user or role</a></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-features">New features<a href="#new-features" class="hash-link" aria-label="New features的直接链接" title="New features的直接链接"></a></h2><ul><li>Support datev1, datetimev1 and decimalv2 datatypes in new optimizer Nereids.</li><li>Support ODBC table for new optimizer Nereids.</li><li>Add <code>lower_case</code> and <code>ignore_above</code> option for inverted index</li><li>Support <code>match_regexp</code> and <code>match_phrase_prefix</code> optimization by inverted index</li><li>Support paimon native reader in datalake</li><li>Support audit-log for <code>insert into</code> SQL</li><li>Support reading parquet file in lzo compressed format</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="three-improvement-and-optimizations">Three Improvement and optimizations<a href="#three-improvement-and-optimizations" class="hash-link" aria-label="Three Improvement and optimizations的直接链接" title="Three Improvement and optimizations的直接链接"></a></h2><ul><li>Improve storage management including balance, migration, publish and others.</li><li>Improve storage cooldown policy to use save disk space.</li><li>Performance optimization for substr with ascii string.</li><li>Improve partition prune when date function is used.</li><li>Improve auto analyze visibility and performance.</li></ul><p>See the complete list of improvements and bug fixes on github <a href="https://github.com/apache/doris/issues?q=label%3Adev%2F2.0.4-merged+is%3Aclosed" target="_blank" rel="noopener noreferrer">dev/2.0.4-merged</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="credits">Credits<a href="#credits" class="hash-link" aria-label="Credits的直接链接" title="Credits的直接链接"></a></h2><p>Last but not least, this release would not have been possible without the following contributors: </p><p>airborne12, amorynan, AshinGau, BePPPower, bingquanzhao, BiteTheDDDDt, bobhan1, ByteYue, caiconghui,CalvinKirs, cambyzju, caoliang-web, catpineapple, csun5285, dataroaring, deardeng, dutyu, eldenmoon, englefly, feifeifeimoon, fornaix, Gabriel39, gnehil, HappenLee, hello-stephen, HHoflittlefish777,hubgeter, hust-hhb, ixzc, jacktengg, jackwener, Jibing-Li, kaka11chen, KassieZ, LemonLiTree,liaoxin01, LiBinfeng-01, lihuigang, liugddx, luwei16, morningman, morrySnow, mrhhsg, Mryange, nextdreamblue, 
Nitin-Kashyap, platoneko, py023, qidaye, shuke987, starocean999, SWJTU-ZhangLei, w41ter, wangbo, wsjz, wuwenchi, Xiaoccer, xiaokang, XieJiann, xingyingone, xinyiZzz, xuwei0912, xy720, xzj7019, yujun777, zclllyybb, zddr, zhangguoqiang666, zhangstar333, zhannngchen, zhiqiang-hhhh, zy-kkk, zzzxl1993</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Financial data warehousing: fast, secure, and highly available with Apache Doris]]></title>
<id>https://doris.apache.org/zh-CN/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris</id>
<link href="https://doris.apache.org/zh-CN/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris"/>
<updated>2024-01-08T00:00:00.000Z</updated>
<summary type="html"><![CDATA[A whole-journey guide for financial users looking for fast data processing performance, data security, and high service availability.]]></summary>
<content type="html"><![CDATA[<p>This is a whole-journey guide for Apache Doris users, especially those from the financial sector which requires a high level of data security and availability. If you don't know how to build a real-time data pipeline and make the most of the <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a> functionalities, start with this post and you will be loaded with inspiration after reading.</p><p>This is the best practice of a non-banking payment service provider that serves over 25 million retailers and processes data from 40 million end devices. Data sources include MySQL, Oracle, and MongoDB. They were using Apache Hive as an offline data warehouse but feeling the need to add a real-time data processing pipeline. <strong>After introducing Apache Doris, they increase their data ingestion speed by 2~5 times, ETL performance by 3~12 times, and query execution speed by 10~15 times.</strong></p><p>In this post, you will learn how to integrate Apache Doris into your data architecture, including how to arrange data inside Doris, how to ingest data into it, and how to enable efficient data updates. Plus, you will learn about the enterprise features that Apache Doris provides to guarantee data security, system stability, and service availability.</p><p><img loading="lazy" src="https://cdn.selectdb.com/static/offline_vs_real_time_data_warehouse_6b3fd0d1bc.png" alt="offline-vs-real-time-data-warehouse" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="building-a-real-time-data-warehouse-with-apache-doris">Building a real-time data warehouse with Apache Doris<a href="#building-a-real-time-data-warehouse-with-apache-doris" class="hash-link" aria-label="Building a real-time data warehouse with Apache Doris的直接链接" title="Building a real-time data warehouse with Apache Doris的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="choice-of-data-models">Choice of data models<a href="#choice-of-data-models" class="hash-link" aria-label="Choice of data models的直接链接" title="Choice of data models的直接链接"></a></h3><p>Apache Doris arranges data with three data models. The main difference between these models lies in whether or how they aggregate data.</p><ul><li><strong><a href="https://doris.apache.org/docs/data-table/data-model#duplicate-model" target="_blank" rel="noopener noreferrer">Duplicate Key model</a></strong>: for detailed data queries. It supports ad-hoc queries of any dimension.</li><li><strong><a href="https://doris.apache.org/docs/data-table/data-model#unique-model" target="_blank" rel="noopener noreferrer">Unique Key model</a></strong>: for use cases with data uniqueness constraints. It supports precise deduplication, multi-stream upserts, and partial column updates.</li><li><strong><a href="https://doris.apache.org/docs/data-table/data-model#aggregate-model" target="_blank" rel="noopener noreferrer">Aggregate Key model</a></strong>: for data reporting. It accelerates data reporting by pre-aggregating data.</li></ul><p>The financial user adopts different data models in different data warehouse layers:</p><ul><li><strong>ODS - Duplicate Key model</strong>: As a payment service provider, the user receives a million settlement data every day. Since the settlement cycle can span a whole year, the relevant data needs to be kept intact for a year. Thus, the proper way is to put it in the Duplicate Key model, which does not perform any data aggregations. 
An exception is that some data is prone to constant changes, like order status from retailers. Such data should be put into the Unique Key model so that the newly updated record of the same retailer ID or order ID will always replace the old one.</li><li><strong>DWD &amp; DWS - Unique Key model</strong>: Data in the DWD and DWS layers are further abstracted, but it is all put in the Unique Key model so that the settlement data can be automatically updated.</li><li><strong>ADS - Aggregate Key model</strong>: Data is highly abstracted in this layer. It is pre-aggregated to mitigate the computation load of downstream analytics.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="partitioning-and-bucketing-strategies">Partitioning and bucketing strategies<a href="#partitioning-and-bucketing-strategies" class="hash-link" aria-label="Partitioning and bucketing strategies的直接链接" title="Partitioning and bucketing strategies的直接链接"></a></h3><p>The idea of partitioning and bucketing is to "cut" data into smaller pieces to increase data processing speed. The key is to set an appropriate number of data partitions and buckets. Based on their use case, the user tailors the bucketing field and bucket number to each table. For example, they often need to query the dimensional data of different retailers from the retailer flat table, so they specify the retailer ID column as the bucketing field, and list the recommended bucket number for various data sizes.</p><p><img loading="lazy" src="https://cdn.selectdb.com/static/partitioning_and_bucketing_strategies_c91ad6a340.png" alt="partitioning-and-bucketing-strategies" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="multi-source-data-migration">Multi-source data migration<a href="#multi-source-data-migration" class="hash-link" aria-label="Multi-source data migration的直接链接" title="Multi-source data migration的直接链接"></a></h3><p>In the adoption of Apache Doris, the user had to migrate all local data from their branches into Doris, which was when they found out their branches were using <strong>different databases</strong> and had <strong>data files of very different formats</strong>, so the migration could be a mess.</p><p><img loading="lazy" src="https://cdn.selectdb.com/static/multi_source_data_migration_2b4f54e005.png" alt="multi-source-data-migration" class="img_ev3q"></p><p>Luckily, Apache Doris supports a rich collection of data integration methods for both real-time data streaming and offline data import.</p><ul><li><strong>Real-time data streaming</strong>: Apache Doris fetches MySQL Binlogs in real time. Part of them is written into Doris directly via Flink CDC, while the high-volume ones are synchronized into Kafka for peak shaving, and then written into Doris via the Flink-Doris-Connector.</li><li><strong>Offline data import</strong>: This includes more diversified data sources and data formats. Historical data and incremental data from S3 and HDFS will be ingested into Doris via the <a href="https://doris.apache.org/docs/data-operate/import/import-way/broker-load-manual" target="_blank" rel="noopener noreferrer">Broker Load</a> method, data from Hive or JDBC will be synchronized to Doris via the <a href="https://doris.apache.org/docs/data-operate/import/import-way/insert-into-manual" target="_blank" rel="noopener noreferrer">Insert Into</a> method, and files will be loaded to Doris via the Flink-Doris-Connector and Flink FTP Connector. 
(FTP is how the user transfers files across systems internally, so they developed the Flink-FTP-Connector to support the complicated data formats and multiple newline characters in data.)</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="full-data-ingestion-and-incremental-data-ingestion">Full data ingestion and incremental data ingestion<a href="#full-data-ingestion-and-incremental-data-ingestion" class="hash-link" aria-label="Full data ingestion and incremental data ingestion的直接链接" title="Full data ingestion and incremental data ingestion的直接链接"></a></h3><p>To ensure business continuity and data accuracy, the user figures out the following ways to ingest full data and incremental data:</p><ul><li><strong>Full data ingestion</strong>: Create a temporary table of the target schema in Doris, ingest full data into the temporary table, and then use the <code>ALTER TABLE t1 REPLACE WITH TABLE t2</code> statement for atomic replacement of the regular table with the temporary table. This method prevents interruptions to queries on the frontend.</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">alter table ${DB_NAME}.${TBL_NAME} drop partition IF EXISTS p${P_DOWN_DATE};</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ALTER TABLE ${DB_NAME}.${TBL_NAME} ADD PARTITION IF NOT EXISTS p${P_DOWN_DATE} VALUES [('${P_DOWN_DATE}'), ('${P_UP_DATE}'));</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">LOAD LABEL ${TBL_NAME}_${load_timestamp} ...</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li><strong>Incremental data ingestion</strong>: Create a new data partition to accommodate incremental data.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="offline-data-processing">Offline data processing<a href="#offline-data-processing" class="hash-link" aria-label="Offline data processing的直接链接" title="Offline data processing的直接链接"></a></h3><p>The user has moved their offline data processing workload to Apache Doris and thus <strong>increased execution speed by 5 times</strong>. </p><p><img loading="lazy" src="https://cdn.selectdb.com/static/offline_data_processing_82e20fc59a.png" alt="offline-data-processing" class="img_ev3q"></p><ul><li><strong>Before</strong>: The old Hive-based offline data warehouse used the TEZ execution engine to process 30 million new data records every day. With 2TB computation resources, the whole pipeline took 2.5 hours. 
</li><li><strong>After</strong>: Apache Doris finishes the same tasks within only 30 minutes and consumes only 1TB of computation resources. Script execution takes only 10 seconds instead of 8 minutes.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="enterprise-features-for-financial-players">Enterprise features for financial players<a href="#enterprise-features-for-financial-players" class="hash-link" aria-label="Enterprise features for financial players的直接链接" title="Enterprise features for financial players的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="multi-tenant-resource-isolation">Multi-tenant resource isolation<a href="#multi-tenant-resource-isolation" class="hash-link" aria-label="Multi-tenant resource isolation的直接链接" title="Multi-tenant resource isolation的直接链接"></a></h3><p>This is required because it often happens that the same piece of data is requested by multiple teams or business systems. These tasks can lead to resource preemption and thus performance degradation and system instability.</p><p><strong>Resource limit for different workloads</strong></p><p>The user classifies their analytics workloads into four types and sets a resource limit for each of them. In particular, they have four different types of Doris accounts and set a limit on the CPU and memory resources for each type of account.</p><p><img loading="lazy" src="https://cdn.selectdb.com/static/multi_tenant_resource_isolation_772a57a4f1.png" alt="multi-tenant-resource-isolation" class="img_ev3q"></p><p>In this way, when one tenant requires excessive resources, it will only compromise its own efficiency but not affect other tenants.</p><p><strong>Resource tag-based isolation</strong></p><p>For data security under the parent-subsidiary company hierarchy, the user has set isolated resource groups for the subsidiaries. Data of each subsidiary is stored in its own resource group with three replicas, while data of the parent company is stored with four replicas: three in the parent company resource group, and the other one in the subsidiary resource group. Thus, when an employee from a subsidiary requests data from the parent company, the query will only be executed in the subsidiary resource group. Specifically, they take these steps:</p><p><img loading="lazy" src="https://cdn.selectdb.com/static/resource_tag_based_isolation_442e20f09c.png" alt="resource-tag-based-isolation" class="img_ev3q"></p><p><strong>Workload group</strong></p><p>The resource tag-based isolation plan ensures isolation on a physical level, but as Apache Doris developers, we want to further optimize resource utilization and pursue more fine-grained resource isolation. For these purposes, we released the <a href="https://doris.apache.org/docs/admin-manual/workload-group" target="_blank" rel="noopener noreferrer">Workload Group</a> feature in <a href="https://doris.apache.org/blog/release-note-2.0.0" target="_blank" rel="noopener noreferrer">Apache Doris 2.0</a>. </p><p>The Workload Group mechanism relates queries to workload groups, which limit the share of CPU and memory resources of the backend nodes that a query can use. When cluster resources are in short supply, the biggest queries will stop execution. On the contrary, when there are plenty of available cluster resources and a workload group requires more resources than the limit, it will be assigned the idle resources proportionately.</p>
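<p>A hedged sketch of the mechanism, with hypothetical group names and limits (check the Workload Group documentation for the exact properties supported by your version):</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Cap one tenant's queries at a share of backend CPU and memory</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE WORKLOAD GROUP IF NOT EXISTS tenant_a</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "cpu_share" = "10",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "memory_limit" = "30%",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "enable_memory_overcommit" = "true"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Route a user's queries to the group by default</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SET PROPERTY FOR 'user_a' 'default_workload_group' = 'tenant_a';</span><br></span></code></pre></div></div>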
<p>The user is actively planning their transition to the Workload Group plan, and will utilize the task prioritizing mechanism and the query queue feature to organize the execution order.</p><p><strong>Fine-grained user privilege management</strong></p><p>For regulation and compliance reasons, this payment service provider implements strict privilege control to make sure that everyone only has access to what they are supposed to access. This is how they do it:</p><ul><li><strong>User privilege setting</strong>: System users of different subsidiaries or with different business needs are granted different data access privileges.</li><li><strong>Privilege control over databases, tables, and rows</strong>: The <code>ROW POLICY</code> mechanism of Apache Doris makes these operations easy.</li><li><strong>Privilege control over columns</strong>: This is done by creating views.</li></ul><p><img loading="lazy" src="https://cdn.selectdb.com/static/fine_grained_user_privilege_management_f0cd060011.png" alt="fine-grained-user-privilege-management.png" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="cluster-stability-guarantee">Cluster stability guarantee<a href="#cluster-stability-guarantee" class="hash-link" aria-label="Cluster stability guarantee的直接链接" title="Cluster stability guarantee的直接链接"></a></h3><ul><li><strong>Circuit Breaking</strong>: From time to time, system users might input faulty SQL, causing excessive resource consumption. A circuit-breaking mechanism is in place for that. It will promptly stop these resource-intensive queries and prevent interruption to the system.</li><li><strong>Data ingestion concurrency control</strong>: The user has a frequent need to integrate historical data into their data platform. That involves a lot of data modification tasks and might stress the cluster. To solve that, they turn on the <a href="https://doris.apache.org/docs/data-table/data-model#merge-on-write-of-unique-model" target="_blank" rel="noopener noreferrer">Merge-on-Write</a> mode in the Unique Key model, enable <a href="https://doris.apache.org/docs/advanced/best-practice/compaction#vertical-compaction" target="_blank" rel="noopener noreferrer">Vertical Compaction</a> and <a href="https://doris.apache.org/docs/advanced/best-practice/compaction#segment-compaction" target="_blank" rel="noopener noreferrer">Segment Compaction</a>, and tune the data compaction parameters to control data ingestion concurrency.</li><li><strong>Network traffic control</strong>: Considering their two clusters in different cities, they employ Quality of Service (QoS) strategies tailored to different scenarios for precise network isolation, ensuring network quality and stability.</li><li><strong>Monitoring and alerting</strong>: The user has integrated Doris with their internal monitoring and alerting platform so that they are notified of any detected issues via their messaging software and email in real time.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="cross-cluster-replication">Cross-cluster replication<a href="#cross-cluster-replication" class="hash-link" aria-label="Cross-cluster replication的直接链接" title="Cross-cluster replication的直接链接"></a></h3><p>Disaster recovery is crucial for the financial industry. The user leverages the Cross-Cluster Replication (CCR) capability and builds a dual-cluster solution.
As the primary cluster undertakes all the queries, the major business data is also synchronized into the backup cluster and updated in real time, so that in the case of service downtime in the primary cluster, the backup cluster will take over swiftly and ensure business continuity.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>We appreciate the user for their active <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">communication</a> with us along the way and are glad to see so many Apache Doris features fit their needs. They are also planning on exploring federated query, compute-storage separation, and auto maintenance with Apache Doris. We look forward to more best practices and feedback from them.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris speeds up data reporting, tagging, and data lake analytics]]></title>
<id>https://doris.apache.org/zh-CN/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics</id>
<link href="https://doris.apache.org/zh-CN/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics"/>
<updated>2023-12-27T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The user leverages the capabilities of Apache Doris in reporting, customer tagging, and data lake analytics and achieves high performance.]]></summary>
<content type="html"><![CDATA[<p>As much as we say <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a> is an all-in-one data platform that is capable of various analytics workloads, it is always compelling to demonstrate that by real use cases. That's why I would like to share this user story with you. It is about how they leverage the capabilities of Apache Doris in reporting, customer tagging, and data lake analytics and achieve high performance.</p><p>This fintech service provider is a long-term user of Apache Doris. They have almost 10 clusters for production, hundreds of Doris backend nodes, and thousands of CPU Cores. The total data size is near 1 PB. Every day, they have hundreds of workflows running simultaneously, receive almost 10 billion new data records, and respond to millions of data queries.</p><p>Before migrating to Apache Doris, they used ClickHouse, MySQL, and Elasticsearch. Then frictions arise from their ever-enlarging data size. They found it hard to scale out the ClickHouse clusters because there were too many dependencies. As for MySQL, they had to switch between various MySQL instances because one MySQL instance had its limits and cross-instance queries were not supported.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="reporting">Reporting<a href="#reporting" class="hash-link" aria-label="Reporting的直接链接" title="Reporting的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="from-clickhouse--mysql-to-apache-doris">From ClickHouse + MySQL to Apache Doris<a href="#from-clickhouse--mysql-to-apache-doris" class="hash-link" aria-label="From ClickHouse + MySQL to Apache Doris的直接链接" title="From ClickHouse + MySQL to Apache Doris的直接链接"></a></h3><p>Data reporting is one of the major services they provide to their customers and they are bound by an SLA. They used to support such service with a combination of ClickHouse and MySQL, but they found significant fluctuations in their data synchronization duration, making it hard for them to meet the service levels outlined in their SLA. Diagnosis showed that it was because the multiple components add to the complexity and instability of data synchronization tasks. To fix that, they have used Apache Doris as a unified analytic engine to support data reporting. 
</p><div style="text-align:center"><img loading="lazy" src="https://cdn.selectdb.com/static/from_clickhouse_mysql_to_apache_doris_6387c0363a.png" alt="from-clickhouse-mysql-to-apache-doris" width="840" style="display:inline-block" class="img_ev3q"></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-improvements">Performance improvements<a href="#performance-improvements" class="hash-link" aria-label="Performance improvements的直接链接" title="Performance improvements的直接链接"></a></h3><p>With Apache Doris, they ingest data via the <a href="https://doris.apache.org/docs/1.2/data-operate/import/import-way/broker-load-manual" target="_blank" rel="noopener noreferrer">Broker Load</a> method and reach an SLA compliance rate of over 99% in terms of data synchronization performance.</p><div style="text-align:center"><img loading="lazy" src="https://cdn.selectdb.com/static/data_synchronization_size_and_duration_327e4dc1fe.png" alt="data-synchronization-size-and-duration" width="640" style="display:inline-block" class="img_ev3q"></div><p>As for data queries, the Doris-based architecture maintains an <strong>average query response time</strong> of less than <strong>10s</strong> and a <strong>P90 response time</strong> of less than <strong>30s</strong>. This is a 50% speedup compared to the old architecture. </p><div style="text-align:center"><img loading="lazy" src="https://cdn.selectdb.com/static/average_query_response_time_372d71ef16.png" alt="average-query-response-time" width="840" style="display:inline-block" class="img_ev3q"></div><div style="text-align:center"><img loading="lazy" src="https://cdn.selectdb.com/static/query_response_time_percentile_756c6f6a71.png" alt="query-response-time-percentile" width="840" style="display:inline-block" class="img_ev3q"></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="tagging">Tagging<a href="#tagging" class="hash-link" aria-label="Tagging的直接链接" title="Tagging的直接链接"></a></h2><p>Tagging is a common operation in customer analytics. You assign labels to customers based on their behaviors and characteristics, so that you can divide them into groups and figure out targeted marketing strategies for each group of them. </p><p>In the old processing architecture where Elasticsearch was the processing engine, raw data was ingested and tagged properly. Then, it will be merged into JSON files and imported into Elasticsearch, which provides data services for analysts and marketers. In this process, the merging step was to reduce updates and relieve load for Elasticsearch, but it turned out to be a troublemaker:</p><ul><li>Any problematic data in any of the tags could spoil the entire merging operation and thus interrupt the data services.</li><li>The merging operation was implemented based on Spark and MapReduce and took up to 4 hours. Such a long time frame could encroach on marketing opportunities and lead to unseen losses.</li></ul><div style="text-align:center"><img loading="lazy" src="https://cdn.selectdb.com/static/tagging_services_3263e21c36.png" alt="tagging-services" width="840" style="display:inline-block" class="img_ev3q"></div><p>Then Apache Doris takes this over. Apache Doris arranges tag data with its data models, which process data fast and smoothly. The aforementioned merging step can be done by the <a href="https://doris.apache.org/docs/data-table/data-model#aggregate-model" target="_blank" rel="noopener noreferrer">Aggregate Key model</a>, which aggregates tag data based on the specified Aggregate Key upon data ingestion. 
The <a href="https://doris.apache.org/docs/data-table/data-model#unique-model" target="_blank" rel="noopener noreferrer">Unique Key model</a> is handy for partial column updates. Again, all you need to do is specify the Unique Key. This enables swift and flexible data updating and saves you from the trouble of replacing the entire flat table. You can also put your detailed data into a <a href="https://doris.apache.org/docs/data-table/data-model#duplicate-model" target="_blank" rel="noopener noreferrer">Duplicate model</a> to speed up certain queries. <strong>In practice, it took the user 1 hour to finish the data ingestion, compared to 4 hours with the old architecture.</strong></p><p>In terms of query performance, Doris is equipped with well-developed bitmap indexes and techniques tailored to high-concurrency queries, so in this case, it can finish <strong>customer segmentation within seconds</strong> and reach over <strong>700 QPS in user-facing queries</strong>.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-lake-analytics">Data lake analytics<a href="#data-lake-analytics" class="hash-link" aria-label="Data lake analytics的直接链接" title="Data lake analytics的直接链接"></a></h2><p>In data lake scenarios, the data size you need to handle tends to be huge, but the data processing volume in each query tends to vary. To ensure fast data ingestion and high query performance on huge data sets, you need more resources. On the other hand, during non-peak time, you want to scale down your cluster for more efficient resource management. How do you handle this dilemma?</p><p>Apache Doris has a few features that are designed for data lake analytics, including Multi-Catalog and Compute Node. The former shields you from the headache of data ingestion in data lake analytics, while the latter enables elastic cluster scaling.</p><p>The <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/?_highlight=multi&amp;_highlight=catalog" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> mechanism allows you to connect Doris to a variety of external data sources so you can use Doris as a unified query gateway without worrying about bulky data ingestion into Doris.</p>
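<p>For example, a hedged sketch with hypothetical connection details and table names: mounting a Hive Metastore as a catalog makes its tables queryable in place.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Mount an external Hive Metastore as a catalog (hypothetical URI)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE CATALOG hive_lake PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "type" = "hms",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "hive.metastore.uris" = "thrift://127.0.0.1:9083"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Query external data in place with fully qualified names</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT count(*) FROM hive_lake.tpch.orders;</span><br></span></code></pre></div></div><p>The <a href="https://doris.apache.org/docs/advanced/compute-node/" target="_blank" rel="noopener noreferrer">Compute Node</a> of Apache Doris is a backend role that is designed for remote federated query workloads, like those in data lake analytics. Normal Doris backend nodes are responsible for both SQL query execution and data management, while the Compute Nodes in Doris, as the name implies, only perform computation. Compute Nodes are stateless, making them elastic enough for cluster scaling.</p><p>The user introduces Compute Nodes into their cluster and deploys them with other components in a hybrid configuration. As a result, the cluster automatically scales down during the night, when there are fewer query requests, and scales out during the daytime to handle the massive query workload. This is more resource-efficient.</p><p>For easier deployment, they have also optimized their Deploy-on-YARN process via Skein. As is shown below, they define the number of Compute Nodes and the required resources in the YAML file, and then pack the installation file, configuration file, and startup script into the distributed file system.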
In this way, they can start or stop the entire cluster of over 100 nodes within minutes using one simple line of code.</p><div style="text-align:center"><img loading="lazy" src="https://cdn.selectdb.com/static/skein_3516ba1a83.png" alt="skein" width="560" style="display:inline-block" class="img_ev3q"></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>For data reporting and customer tagging, Apache Doris streamlines the data ingestion and merging steps, and delivers high query performance based on its own design and functionality. For data lake analytics, the user improves resource efficiency by elastically scaling clusters with the Compute Node. Along their journey with Apache Doris, they have also developed a data ingestion task prioritizing mechanism and contributed it to the Doris project. A gesture to facilitate their use case ends up benefiting the whole <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">open source community</a>. This is a great example of open-source products thriving on user involvement.</p><p>Check the Apache Doris <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">repo</a> on GitHub</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[From Elasticsearch to Apache Doris: upgrading an observability platform]]></title>
<id>https://doris.apache.org/zh-CN/blog/from-elasticsearch-to-apache-doris-upgrading-an-observability-platform</id>
<link href="https://doris.apache.org/zh-CN/blog/from-elasticsearch-to-apache-doris-upgrading-an-observability-platform"/>
<updated>2023-12-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[GuanceDB, an observability platform, replaces Elasticsearch with Apache Doris as its query and storage engine and realizes 70% less storage costs and 200%~400% data query performance.]]></summary>
<content type="html"><![CDATA[<p>Observability platforms are akin to the immune system. Just like immune cells are everywhere in human bodies, an observability platform patrols every corner of your devices, components, and architectures, identifying any potential threats and proactively mitigating them. However, I might have gone too far with that metaphor, because till these days, we have never invented a system as sophisticated as the human body, but we can always make advancements.</p><p>The key to upgrading an observability platform is to increase data processing speed and reduce costs. This is based on two reasons:</p><ol><li>The faster you can identify abnormalities from your data, the more you can contain the potential damage.</li><li>An observability platform needs to store a sea of data, and low storage cost is the only way to make that sustainable.</li></ol><p>This post is about how GuanceDB, an observability platform, makes progress in these two aspects by replacing Elasticsearch with Apache Doris as its query and storage engine. <strong>The result is 70% less storage costs and 200%~400% data query performance.</strong></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="guancedb">GuanceDB<a href="#guancedb" class="hash-link" aria-label="GuanceDB的直接链接" title="GuanceDB的直接链接"></a></h2><p>GuanceDB is an all-around observability solution. It provides services including data analytics, data visualization, monitoring and alerting, and security inspection. From GuanceDB, users can have an understanding of their objects, network performance, applications, user experience, system availability, etc.</p><p>From the standpoint of a data pipeline, GuanceDB can be divided into two parts: data ingestion and data analysis. I will get to them one by one.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="data-integration">Data integration<a href="#data-integration" class="hash-link" aria-label="Data integration的直接链接" title="Data integration的直接链接"></a></h3><p>For data integration, GuanceDB uses its self-made tool called DataKit. It is an all-in-one data collector that extracts from different end devices, business systems, middleware, and data infrastructure. It can also preprocess data and relate it with metadata. It provides extensive support for data, from logs, and time series metrics, to data of distributed tracing, security events, and user behaviors from mobile APPs and web browsers. To cater to diverse needs across multiple scenarios, it ensures compatibility with various open-source probes and collectors as well as data sources of custom formats.</p><p><img loading="lazy" alt="observability-platform-architecture" src="https://cdnd.selectdb.com/zh-CN/assets/images/observability-platform-architecture-e6d61cc145b4fcaa0e8f81f9a3453836.png" width="2000" height="930" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="query--storage-engine">Query &amp; storage engine<a href="#query--storage-engine" class="hash-link" aria-label="Query &amp; storage engine的直接链接" title="Query &amp; storage engine的直接链接"></a></h3><p>Data collected by DataKit, goes through the core computation layer and arrive in GuanceDB, which is a multil-model database that combines various database technologies. It consists of the query engine layer and the storage engine layer. By decoupling the query engine and the storage engine, it enables pluggable and interchangeable architecture. 
</p><p><img loading="lazy" alt="observability-platform-query-engine-storage-engine" src="https://cdnd.selectdb.com/zh-CN/assets/images/observability-platform-query-engine-storage-engine-59ec8b8bcce25f1d2e401c8ef964a742.png" width="2400" height="1060" class="img_ev3q"></p><p>For time series data, they built Metric Store, which is a self-developed storage engine based on VictoriaMetrics. For logs, they integrate Elasticsearch and OpenSearch. GuanceDB is performant in this architecture, while Elasticsearch demonstrates room for improvement:</p><ul><li><strong>Data writing</strong>: Elasticsearch consumes a big share of CPU and memory resources. It is not only costly but also disruptive to query execution.</li><li><strong>Schemaless support</strong>: Elasticsearch provides schemaless support by Dynamic Mapping, but that's not enough to handle large amounts of user-defined fields. In this case, it can lead to field type conflict and thus data loss.</li><li><strong>Data aggregation</strong>: Large aggregation tasks often trigger a timeout error in Elasticsearch. </li></ul><p>So this is where the upgrade happens. GuanceDB tried and replaced Elasticsearch with <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a>. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="dql">DQL<a href="#dql" class="hash-link" aria-label="DQL的直接链接" title="DQL的直接链接"></a></h2><p>In the GuanceDB observability platform, almost all queries involve timestamp filtering. Meanwhile, most data aggregations need to be performed within specified time windows. Additionally, there is a need to perform rollups of time series data on individual sequences within a time window. Expressing these semantics using SQL often requires nested subqueries, resulting in complex and cumbersome statements.</p><p>That's why GuanceDB developed their own Data Query Language (DQL). With simplified syntax elements and computing functions optimized for observability use cases, this DQL can query metrics, logs, object data, and data from distributed tracing.</p><p><img loading="lazy" alt="observability-platform-query-engine-storage-engine-apache-doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/observability-platform-query-engine-storage-engine-apache-doris-b7491e169fe7abf5488259b2d973ed8b.png" width="2400" height="878" class="img_ev3q"></p><p>This is how DQL works together with Apache Doris. GuanceDB has found a way to make full use of the analytic power of Doris, while complementing its SQL functionalities.</p><p>As is shown below, Guance-Insert is the data writing component, while Guance-Select is the DQL query engine.</p><ul><li><strong>Guance-Insert</strong>: It allows data of different tenants to be accumulated in different batches, and strikes a balance between writing throughput and writing latency. When logs are generated in large volumes, it can maintain a low data latency of 2~3 seconds.</li><li><strong>Guance-Select</strong>: For query execution, if the query SQL semantics or function is supported in Doris, Guance-Select will push the query down to the Doris Frontend for computation; if not, it will go for a fallback option: acquire columnar data in Arrow format via the Thrift RPC interface, and then finish computation in Guance-Select. 
The catch is that it cannot push the computation logic down to the Doris Backend, so it can be slightly slower than executing queries in the Doris Frontend.</li></ul><p><img loading="lazy" alt="DQL-GranceDB-apache-doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/DQL-GranceDB-apache-doris-8e46a296f0c966f5742651d64d85cd2a.png" width="2400" height="984" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="observations">Observations<a href="#observations" class="hash-link" aria-label="Observations的直接链接" title="Observations的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="storage-cost-70-down-query-speed-300-up">Storage cost 70% down, query speed 300% up<a href="#storage-cost-70-down-query-speed-300-up" class="hash-link" aria-label="Storage cost 70% down, query speed 300% up的直接链接" title="Storage cost 70% down, query speed 300% up的直接链接"></a></h3><p>Previously, with Elasticsearch clusters, they used 20 cloud virtual machines (16vCPU 64GB) and had independent index writing services (that's another 20 cloud virtual machines). Now with Apache Doris, they only need 13 cloud virtual machines of the same configuration in total, representing <strong>a 67% cost reduction</strong>. This is enabled by three capabilities of Apache Doris:</p><ul><li><strong>High writing throughput</strong>: Under a consistent writing throughput of 1GB/s, Doris maintains a CPU usage of less than 20%. That equals 2.6 cloud virtual machines. With low CPU usage, the system is more stable and better prepared for sudden writing peaks.</li></ul><p><img loading="lazy" alt="writing-throughput-cpu-usage-apache-doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/writing-throughput-cpu-usage-apache-doris-a629606fb8dc90bc682efb76c80f7cc9.png" width="1948" height="886" class="img_ev3q"></p><ul><li><strong>High data compression ratio</strong>: Doris utilizes the ZSTD compression algorithm on top of columnar storage. It can realize a compression ratio of 8:1. Compared to 1.5:1 in Elasticsearch, Doris can reduce storage costs by around 80%.</li><li><strong><a href="https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How" target="_blank" rel="noopener noreferrer">Tiered storage</a></strong>: Doris allows a more cost-effective way to store data: putting hot data on local disks and cold data in object storage. Once the storage policy is set, Doris automatically manages the "cooldown" process of hot data and moves cold data to object storage. This data lifecycle is transparent to the data application layer, which makes it user-friendly. Also, Doris speeds up cold data queries with a local cache (a sketch of the storage policy setup follows this list).</li></ul>
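<p>A hedged sketch of how such a storage policy is declared. The resource property names vary across Doris versions, and the endpoint, bucket, credentials, and table name below are hypothetical:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Object storage resource for cold data (hypothetical endpoint and credentials)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE RESOURCE "remote_s3"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "type" = "s3",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "s3.endpoint" = "s3.us-east-1.amazonaws.com",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "s3.region" = "us-east-1",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "s3.bucket" = "doris-cold-data",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "s3.access_key" = "ak",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "s3.secret_key" = "sk"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Cool data down to the resource after 7 days (604800 seconds)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE STORAGE POLICY cold_after_7d</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "storage_resource" = "remote_s3",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">    "cooldown_ttl" = "604800"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Apply the policy to a table</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ALTER TABLE app_log SET ("storage_policy" = "cold_after_7d");</span><br></span></code></pre></div></div><p>With lower storage costs, Doris does not compromise query performance. It doubles the execution speed of queries that return a single row and those that return a result set. For aggregation queries without sampling, Doris runs at 4 times the speed of Elasticsearch.</p><p><strong>To sum up, Apache Doris achieves 2~4 times the query performance of Elasticsearch with only 1/3 of the storage cost it consumes.</strong></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="inverted-index-for-full-text-search">Inverted index for full-text search<a href="#inverted-index-for-full-text-search" class="hash-link" aria-label="Inverted index for full-text search的直接链接" title="Inverted index for full-text search的直接链接"></a></h3><p>Inverted index is the magic potion for log analytics because it can considerably increase full-text search performance and reduce query overheads.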
</p><p>It is especially useful in these scenarios:</p><ul><li>Full-text search by <code>MATCH_ALL</code>, <code>MATCH_ANY</code>, and <code>MATCH_PHRASE</code>. <code>MATCH_PHRASE</code> in combination with inverted index is the alternative to the Elasticsearch full-text search functionality.</li><li>Equivalence queries (=, !=, IN), range queries (&gt;, &gt;=, &lt;, &lt;=), and support for numerics, datetime, and strings.</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE httplog</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `ts` DATETIME,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `clientip` VARCHAR(20),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `request` TEXT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> INDEX idx_ip (`clientip`) USING INVERTED,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> INDEX idx_req (`request`) USING INVERTED PROPERTIES("parser" = "english") </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DUPLICATE KEY(`ts`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">...</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Retrieve the latest 10 records of Client IP "8.8.8.8"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * FROM httplog WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Retrieve the latest 10 records with "error" or "404" in the "request" field</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * FROM httplog WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Retrieve the latest 10 records with "image" and "faq" in the "request" field</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * FROM httplog WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">-- Retrieve the latest 10 records with "query error" in the "request" field</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * FROM httplog WHERE request MATCH_PHRASE 'query error' ORDER BY ts DESC LIMIT 10;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path 
fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>As a powerful accelerator for full-text searches, inverted index in Doris is flexible because we witness the need for on-demand adjustments. In Elasticsearch, indexes are fixed upon creation, so there needs to be good planning of which fields need to be indexed, otherwise, any changes to the index will require a complete rewrite.</p><p>In contrast, Doris allows for dynamic indexing. You can add inverted index to a field during runtime and it will take effect immediately. You can also decide which data partitions to create indexes on.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="a-new-data-type-for-dynamic-schema-change">A new data type for dynamic schema change<a href="#a-new-data-type-for-dynamic-schema-change" class="hash-link" aria-label="A new data type for dynamic schema change的直接链接" title="A new data type for dynamic schema change的直接链接"></a></h3><p>By nature, an observability platform requires support for dynamic schema, because the data it collects is prone to changes. Every click by a user on the webpage might add a new metric to the database. </p><p>Looking around the database landscape, you will find that static schema is the norm. Some databases take a step further. For example, Elasticsearch realizes dynamic schema by mapping. However, this functionality can be easily interrupted by field type conflicts or unexpired historical fields.</p><p>The Doris solution for dynamic schema is a newly-introduced data type: Variant, and GuanceDB is among the first to try it out. (It will officially be available in Apache Doris V2.1.)</p><p>The Variant data type is the move of Doris to embrace semi-structured data analytics. It can solve a lot of the problems that often harass database users:</p><ul><li><strong>JSON</strong> <strong>data storage</strong>: A Variant column in Doris can accommodate any legal JSON data, and can automatically recognize the subfields and data types.</li><li><strong>Schema explosion due to too many fields</strong>: The frequently occurring subfields will be stored in a column-oriented manner to facilitate analysis, while the less frequently seen subfields will be merged into the same column to streamline the data schema.</li><li><strong>Write failure due to data type conflicts</strong>: A Variant column allows different types of data in the same field, and applies different storage for different data types.</li></ul><p><strong>Difference</strong> <strong>between Variant and Dynamic Mapping</strong></p><p>From a functional perspective, the biggest difference between Variant in Doris and Dynamic Mapping in Elasticsearch is that the scope of Dynamic Mapping extends throughout the entire lifecycle of the current table, while that of Variant can be limited to the current data partition. </p><p>For example, if a user has changed the business logic and renamed some Variant fields today, the old field name will remain on the partitions before today, but will not appear on the new partitions since tomorrow. <strong>So there is a lower risk of data type conflict.</strong></p><p>In the case of field type conflicts in the same partition, the two fields will be changed to JSON type to avoid data error or data loss. 
For example, there are two <code>status</code> fields in the user's business system: one holds strings and the other holds numerics. In queries, the user can decide whether to query the string field, the numeric field, or both (e.g., if you specify <code>status = "ok"</code> in the filters, the query will only be executed on the string field).</p><p>From the users' perspective, they can use the Variant type as simply as any other data type. They can add or remove Variant fields based on their business needs, and no extra syntax or annotation is required.</p><p>Currently, the Variant type requires extra type assertion; we plan to automate this process in future versions of Doris. GuanceDB is one step ahead in this respect: they have implemented automatic type assertion for their DQL queries. In most cases, type assertion is based on the actual data type of a Variant field. In the rare cases where there is a type conflict, the Variant fields are upgraded to JSON fields, and type assertion is then based on the semantics of the operators in the DQL query.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion</h2><p>GuanceDB's transition from Elasticsearch to Apache Doris showcases a big stride in improving data processing speed and reducing costs. For these purposes, Apache Doris has optimized itself in the two major aspects of data processing: data integration and data analysis. It has expanded its schemaless support to flexibly accommodate more data types, and introduced features like inverted index and tiered storage to enable faster and more cost-effective queries. Evolution is an ongoing process. Apache Doris has never stopped improving itself. We have a lot of new features under development, and the Doris <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">community</a> embraces any input and feedback.</p><p>Check out the Apache Doris GitHub <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">repo</a></p><p>Find Apache Doris makers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a></p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 2.0.3 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.3</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.3"/>
<updated>2023-12-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, 1000 improvements and bug fixes have been made in Doris 2.0.3.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, about 1000 improvements and bug fixes have been made in Doris 2.0.3 version, including optimizer statistics, inverted index, complex datatypes, data lake, replica management.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-change">Behavior change<a href="#behavior-change" class="hash-link" aria-label="Behavior change的直接链接" title="Behavior change的直接链接"></a></h2><ul><li>The output format of the complex data type array/map/struct has been changed to be consistent to the input format and JSON specification. The main changes from the previous version are that DATE/DATETIME and STRING/VARCHAR are enclosed in double quotes and null values inside ARRAY/MAP are displayed as <code>null</code> instead of <code>NULL</code>.<ul><li><a href="https://github.com/apache/doris/pull/25946" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25946</a></li></ul></li><li>SHOW_VIEW permission is supported. Users with SELECT or LOAD permission will no longer be able to execute the 'SHOW CREATE VIEW' statement and must be granted the SHOW_VIEW permission separately.<ul><li><a href="https://github.com/apache/doris/pull/25370" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25370</a></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-features">New features<a href="#new-features" class="hash-link" aria-label="New features的直接链接" title="New features的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-support-collecting-statistics-for-optimizer-automatically">1. Support collecting statistics for optimizer automatically<a href="#1-support-collecting-statistics-for-optimizer-automatically" class="hash-link" aria-label="1. Support collecting statistics for optimizer automatically的直接链接" title="1. Support collecting statistics for optimizer automatically的直接链接"></a></h3><p>Collecting statistics helps the optimizer understand the data distribution characteristics and choose a better plan to greatly improve query performance. It is officially supported starting from version 2.0.3 and is enabled all day by default.</p><p>see more:<a href="https://doris.apache.org/docs/query-acceleration/statistics/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/query-acceleration/statistics/</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-support-complex-datatypes-for-more-datalake-source">2. Support complex datatypes for more datalake source<a href="#2-support-complex-datatypes-for-more-datalake-source" class="hash-link" aria-label="2. Support complex datatypes for more datalake source的直接链接" title="2. 
<ul><li>Support complex data types for JAVA UDF, JDBC, and Hudi MOR<ul><li><a href="https://github.com/apache/doris/pull/24810" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24810</a></li><li><a href="https://github.com/apache/doris/pull/26236" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26236</a></li></ul></li><li>Support complex data types for Paimon<ul><li><a href="https://github.com/apache/doris/pull/25364" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25364</a></li></ul></li><li>Support Paimon version 0.5<ul><li><a href="https://github.com/apache/doris/pull/24985" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24985</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-add-more-builtin-functions">3. Add more builtin functions</h3><ul><li>Support the BitmapAgg function in the new optimizer<ul><li><a href="https://github.com/apache/doris/pull/25508" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25508</a></li></ul></li><li>Support SHA series digest functions<ul><li><a href="https://github.com/apache/doris/pull/24342" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24342</a></li></ul></li><li>Support the BITMAP datatype in the aggregate functions min_by and max_by<ul><li><a href="https://github.com/apache/doris/pull/25430" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25430</a></li></ul></li><li>Add milliseconds/microseconds_add/sub/diff functions<ul><li><a href="https://github.com/apache/doris/pull/24114" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24114</a></li></ul></li><li>Add some JSON functions: json_insert, json_replace, json_set<ul><li><a href="https://github.com/apache/doris/pull/24384" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24384</a></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvement-and-optimizations">Improvements and optimizations</h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-performance-optimizations">1. Performance optimizations</h3><ul><li>When an inverted index MATCH condition in the WHERE clause with a high filter rate is combined with a common WHERE condition with a low filter rate, the I/O of the index column is greatly reduced. 
</li><li>Optimize the efficiency of random data access after the WHERE filter.</li><li>Optimize the performance of the old get_json_xx functions on JSON data types by 2~4x.</li><li>Support configuration to lower the priority of data read threads, reserving CPU resources for real-time writing.</li><li>Add a <code>uuid-numeric</code> function that returns LARGEINT, which is 20 times faster than the <code>uuid</code> function that returns a string.</li><li>Optimize the performance of CASE WHEN by 3x.</li><li>Cut out unnecessary predicate calculations in storage engine execution.</li><li>Accelerate count performance by pushing the count operator down to the storage tier.</li><li>Optimize the computation performance of nullable types in <code>AND</code>/<code>OR</code> expressions.</li><li>Support rewriting the limit operator before <code>join</code> in more scenarios to improve query performance.</li><li>Eliminate useless <code>order by</code> operators from inline views to improve query performance.</li><li>Optimize the accuracy of cardinality estimates and cost models in some cases.</li><li>Optimize the JDBC catalog predicate pushdown logic.</li><li>Optimize the read efficiency of the file cache when it is enabled for the first time.</li><li>Optimize the Hive table SQL cache policy by using the partition update time stored in HMS to improve the cache hit ratio.</li><li>Optimize Merge-on-Write compaction efficiency.</li><li>Optimize thread allocation logic for external table queries to reduce memory usage.</li><li>Optimize memory usage of the column reader.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-distributed-replica-management-improvements">2. Distributed replica management improvements</h3><p>Distributed replica management improvements cover skipping deleted partitions, colocate group deletion, balance failures under continuous writes, and balancing of tables with hot-cold data separation.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-security-enhancement">3. Security enhancement</h3><ul><li>The audit log plug-in uses a token instead of a plaintext password to enhance security<ul><li><a href="https://github.com/apache/doris/pull/26278" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26278</a></li></ul></li><li>Security enhancements in the log4j configuration<ul><li><a href="https://github.com/apache/doris/pull/24861" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24861</a></li></ul></li><li>Sensitive user information is no longer displayed in logs<ul><li><a href="https://github.com/apache/doris/pull/26912" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26912</a></li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bugfix-and-stability">Bugfix and stability</h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-complex-datatypes">1. Complex datatypes</h3>
<ul><li>Fix the issue that fixed-length CHAR(n) was not truncated correctly in map/struct.<ul><li><a href="https://github.com/apache/doris/pull/25725" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25725</a></li></ul></li><li>Fix write failures for the struct datatype nested in map/array<ul><li><a href="https://github.com/apache/doris/pull/26973" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26973</a></li></ul></li><li>Fix the issue that count distinct did not support array/map/struct<ul><li><a href="https://github.com/apache/doris/pull/25483" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25483</a></li></ul></li><li>Fix BE crash after upgrading to 2.0.3 when DELETE on complex types had appeared in queries<ul><li><a href="https://github.com/apache/doris/pull/26006" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26006</a></li></ul></li><li>Fix BE crash when the JSON datatype is in the WHERE clause.<ul><li><a href="https://github.com/apache/doris/pull/27325" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27325</a></li></ul></li><li>Fix BE crash when the ARRAY datatype is in an OUTER JOIN clause.<ul><li><a href="https://github.com/apache/doris/pull/25669" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25669</a></li></ul></li><li>Fix incorrect results when reading the DECIMAL datatype in ORC format.<ul><li><a href="https://github.com/apache/doris/pull/26548" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26548</a></li><li><a href="https://github.com/apache/doris/pull/25977" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25977</a></li><li><a href="https://github.com/apache/doris/pull/26633" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26633</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-inverted-index">2. Inverted index</h3><ul><li>Fix incorrect results for OR NOT combinations in the WHERE clause when inverted index query was disabled. 
<ul><li><a href="https://github.com/apache/doris/pull/26327" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26327</a></li></ul></li><li>Fix BE crash when writing empty data with an inverted index<ul><li><a href="https://github.com/apache/doris/pull/25984" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25984</a></li></ul></li><li>Fix BE crash in index compaction when the output of compaction is empty.<ul><li><a href="https://github.com/apache/doris/pull/25486" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25486</a></li></ul></li><li>Fix BE crash when BUILD INDEX runs after ADD COLUMN with no new data written to the newly added column.<ul><li><a href="https://github.com/apache/doris/pull/27276" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27276</a></li></ul></li><li>Fix missing hard links and hard-link leaks for inverted index files.<ul><li><a href="https://github.com/apache/doris/pull/26903" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26903</a></li></ul></li><li>Fix index file corruption when the disk is temporarily full<ul><li><a href="https://github.com/apache/doris/pull/28191" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/28191</a></li></ul></li><li>Fix incorrect results caused by the skip-reading-index-column optimization<ul><li><a href="https://github.com/apache/doris/pull/28104" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/28104</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-materialized-view">3. Materialized View</h3><ul><li>Fix BE crash when there are duplicate expressions in <code>group by</code> statements.<ul><li><a href="https://github.com/apache/doris/pull/27523" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27523</a></li></ul></li><li>Disable the float/double types in the <code>group by</code> clause when a view is created.<ul><li><a href="https://github.com/apache/doris/pull/25823" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25823</a></li></ul></li><li>Improve the matching of SELECT queries to materialized views<ul><li><a href="https://github.com/apache/doris/pull/24691" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24691</a></li></ul></li><li>Fix an issue that materialized views could not be matched when a table alias was used<ul><li><a href="https://github.com/apache/doris/pull/25321" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25321</a></li></ul></li><li>Fix a problem with percentile_approx when creating materialized views<ul><li><a href="https://github.com/apache/doris/pull/26528" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26528</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="4-table-sample">4. Table sample</h3>
<ul><li>Fix the problem that table sample queries do not work on partitioned tables.<ul><li><a href="https://github.com/apache/doris/pull/25912" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25912</a></li></ul></li><li>Fix the problem that table sample queries do not work when a tablet is specified.<ul><li><a href="https://github.com/apache/doris/pull/25378" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25378</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="5-unique-with-merge-on-write">5. Unique with merge on write</h3><ul><li>Fix a null pointer exception in conditional updates based on the primary key<ul><li><a href="https://github.com/apache/doris/pull/26881" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26881</a></li></ul></li><li>Fix field name capitalization issues in partial updates<ul><li><a href="https://github.com/apache/doris/pull/27223" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27223</a></li></ul></li><li>Fix duplicate keys occurring in Merge-on-Write tables during schema change repair.<ul><li><a href="https://github.com/apache/doris/pull/25705" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25705</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="6-load-and-compaction">6. Load and compaction</h3><ul><li>Fix an unknown slot descriptor error in routine load for multiple tables<ul><li><a href="https://github.com/apache/doris/pull/25762" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25762</a></li></ul></li><li>Fix BE crash due to concurrent memory access when calculating memory<ul><li><a href="https://github.com/apache/doris/pull/27101" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27101</a></li></ul></li><li>Fix BE crash on duplicate cancellation of loads.<ul><li><a href="https://github.com/apache/doris/pull/27111" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27111</a></li></ul></li><li>Fix a broker connection error during broker load<ul><li><a href="https://github.com/apache/doris/pull/26050" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26050</a></li></ul></li><li>Fix incorrect results from delete predicates when compaction and scan run concurrently.<ul><li><a href="https://github.com/apache/doris/pull/24638" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24638</a></li></ul></li><li>Fix the problem that compaction tasks would print too many stacktrace logs<ul><li><a href="https://github.com/apache/doris/pull/25597" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25597</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="7-data-lake-compatibility">7. Data Lake compatibility</h3>
<ul><li>Fix query failures when an Iceberg table contains special characters<ul><li><a href="https://github.com/apache/doris/pull/27108" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27108</a></li></ul></li><li>Fix compatibility issues with different Hive Metastore versions<ul><li><a href="https://github.com/apache/doris/pull/27327" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27327</a></li></ul></li><li>Fix an error when reading MaxCompute partitioned tables<ul><li><a href="https://github.com/apache/doris/pull/24911" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24911</a></li></ul></li><li>Fix the issue that backup to object storage failed<ul><li><a href="https://github.com/apache/doris/pull/25496" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25496</a></li><li><a href="https://github.com/apache/doris/pull/25803" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25803</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="8-jdbc-external-table-compatibility">8. JDBC external table compatibility</h3><ul><li>Fix an Oracle date type format error in the JDBC catalog<ul><li><a href="https://github.com/apache/doris/pull/25487" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25487</a></li></ul></li><li>Fix a MySQL 0000-00-00 date exception in the JDBC catalog<ul><li><a href="https://github.com/apache/doris/pull/26569" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26569</a></li></ul></li><li>Fix an exception when reading data from MariaDB where the default value of a time type is current_timestamp<ul><li><a href="https://github.com/apache/doris/pull/25016" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25016</a></li></ul></li><li>Fix BE crash when processing the BITMAP datatype in the JDBC catalog<ul><li><a href="https://github.com/apache/doris/pull/25034" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25034</a></li><li><a href="https://github.com/apache/doris/pull/26933" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26933</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="9-sql-planner-and-optimizer">9. SQL Planner and Optimizer</h3>
<ul><li><p>Fix partition prune errors in some scenarios</p><ul><li><a href="https://github.com/apache/doris/pull/27047" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27047</a></li><li><a href="https://github.com/apache/doris/pull/26873" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26873</a></li><li><a href="https://github.com/apache/doris/pull/25769" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25769</a></li><li><a href="https://github.com/apache/doris/pull/27636" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27636</a></li></ul></li><li><p>Fix incorrect sub-query processing in some scenarios</p><ul><li><a href="https://github.com/apache/doris/pull/26034" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26034</a></li><li><a href="https://github.com/apache/doris/pull/25492" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25492</a></li><li><a href="https://github.com/apache/doris/pull/25955" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25955</a></li><li><a href="https://github.com/apache/doris/pull/27177" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27177</a></li></ul></li><li><p>Fix some semantic parsing errors</p><ul><li><a href="https://github.com/apache/doris/pull/24928" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/24928</a></li><li><a href="https://github.com/apache/doris/pull/25627" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25627</a></li></ul></li><li><p>Fix data loss during right outer/anti joins</p><ul><li><a href="https://github.com/apache/doris/pull/26529" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26529</a></li></ul></li><li><p>Fix incorrect pushdown of predicates past aggregation operators.</p><ul><li><a href="https://github.com/apache/doris/pull/25525" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25525</a></li></ul></li><li><p>Fix incorrect result headers in some cases</p><ul><li><a href="https://github.com/apache/doris/pull/25372" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/25372</a></li></ul></li><li><p>Fix incorrect plans when the null-safe equals expression (&lt;=&gt;) is used as the join condition</p><ul><li><a href="https://github.com/apache/doris/pull/27127" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/27127</a></li></ul></li><li><p>Fix column pruning in set operation operators.</p><ul><li><a href="https://github.com/apache/doris/pull/26884" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/26884</a></li></ul></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="others">Others</h3><ul><li>Fix BE crash when the order of columns in a table is changed and the cluster is then upgraded to 2.0.3.<ul><li><a href="https://github.com/apache/doris/pull/28205" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/28205</a></li></ul></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/issues?q=label%3Adev%2F2.0.3-merged+is%3Aclosed" target="_blank" rel="noopener noreferrer">GitHub 
dev/2.0.3-merged</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Empowering cyber security by enabling 7 times faster log analysis]]></title>
<id>https://doris.apache.org/zh-CN/blog/empowering-cyber-security-by-enabling-seven-times-faster-log-analysis</id>
<link href="https://doris.apache.org/zh-CN/blog/empowering-cyber-security-by-enabling-seven-times-faster-log-analysis"/>
<updated>2023-12-07T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This is about how a cyber security service provider built its log storage and analysis system (LSAS) and realized 3X data writing speed, 7X query execution speed, and visualized management.]]></summary>
<content type="html"><![CDATA[<p>This is about how a cyber security service provider built its log storage and analysis system (LSAS) and realized 3X data writing speed, 7X query execution speed, and visualized management. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="log-storage--analysis-platform">Log storage &amp; analysis platform<a href="#log-storage--analysis-platform" class="hash-link" aria-label="Log storage &amp; analysis platform的直接链接" title="Log storage &amp; analysis platform的直接链接"></a></h2><p>In this use case, the LSAS collects system logs from its enterprise users, scans them, and detects viruses. It also provides data management and file tracking services. </p><p>Within the LSAS, it scans local files and uploads the file information as MD5 values to its cloud engine and identifies suspicious viruses. The cloud engine returns a log entry to tell the risk level of the files. The log entry includes messages like <code>file_name</code>, <code>file_size</code>, <code>file_level</code>, and <code>event_time</code>. Such information goes into a Topic in Apache Kafka, and then the real-time data warehouse normalizes the log messages. After that, all log data will be backed up to the offline data warehouse. Some log data requires further security analysis, so it will be pulled into the analytic engine and the self-developed Extended Detection and Response system (XDR) for more comprehensive detection. </p><p><img loading="lazy" alt="cyber-security-log-storage-and-analysis-platform" src="https://cdnd.selectdb.com/zh-CN/assets/images/cyber-security-log-storage-and-analysis-platform-83b6323a2b975c59ddcf59de91f96847.png" width="1280" height="536" class="img_ev3q"></p><p>The above process comes down to log writing and analysis, and the company faced some issues in both processes with their old system, which used StarRocks as the analytic engine.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="slow-data-writing">Slow data writing<a href="#slow-data-writing" class="hash-link" aria-label="Slow data writing的直接链接" title="Slow data writing的直接链接"></a></h3><p>The cloud engine interacts with tens of millions of terminal software and digests over 100 billion logs every day. The enormous data size poses a big challenge. The LSAS used to rely on StarRocks for log storage. With the ever-increasing daily log influx, data writing gradually slows down. The severe backlogs during peak times undermines system stability. They tried scaling the cluster from 3 nodes to 13 nodes, but the writing speed wasn't substantially improved.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="slow-query-execution">Slow query execution<a href="#slow-query-execution" class="hash-link" aria-label="Slow query execution的直接链接" title="Slow query execution的直接链接"></a></h3><p>From an execution standpoint, extracting security information from logs involves a lot of keyword matching in the text fields (URL, payload, etc.). The StarRocks-based system does that by the SQL LIKE operator, which implements full scanning and brutal-force matching. In that way, queries on a 100-billion-row table often take one or several minutes. 
After screening out irrelevant data based on time range, the query response time still ranged from seconds to dozens of seconds, and it got worse with concurrent queries.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="architectural-upgrade">Architectural upgrade</h2><p>In the search for a new database tool, the cyber security company set their sights on <a href="https://doris.apache.org/zh-CN/" target="_blank" rel="noopener noreferrer">Apache Doris</a>, which happened to have sharpened itself up for log analysis in <a href="https://doris.apache.org/zh-CN/blog/release-note-2.0.0" target="_blank" rel="noopener noreferrer">version 2.0</a>. It supports <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index/" target="_blank" rel="noopener noreferrer">inverted index</a> to empower text search, and <a href="https://doris.apache.org/docs/dev/data-table/index/ngram-bloomfilter-index?_highlight=ngram" target="_blank" rel="noopener noreferrer">NGram BloomFilter</a> to speed up the LIKE operator.</p><p>Although StarRocks started as a fork of Apache Doris, it has rewritten part of the code and is now very different from Apache Doris in terms of features. The foregoing inverted index and NGram BloomFilter are just a fraction of the advancements that Apache Doris has made since.</p><p>They tried Apache Doris out to evaluate its writing speed, query performance, and the associated storage and maintenance costs.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="300-data-writing-speed">300% data writing speed</h3><p>To test the peak performance of Apache Doris, they used only 3 servers, connected them to Apache Kafka to receive their daily data input, and compared the results to the old StarRocks-based LSAS.</p><p><img loading="lazy" alt="apache-doris-vs-starrocks-writing-throughput" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-vs-starrocks-writing-throughput-e462779d45f4ba298ecbdc75b2f90b68.png" width="1280" height="403" class="img_ev3q"></p><p>Based on the peak performance of Apache Doris, it's estimated that a 3-server cluster at 30% CPU usage would be able to handle the writing workload. That can save them over 70% of hardware resources. Notably, in this test, they enabled inverted indexes for half of the fields. If they were disabled, the writing speed could be increased by another 50%.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="60-storage-cost">60% storage cost</h3><p>With inverted indexes enabled, Apache Doris used even less storage space than the old system without inverted indexes. The data compression ratio was 5.7:1, compared to the previous 4.3:1.</p><p>In most databases and similar tools, the index file is often 2~4 times the size of the data file it belongs to, but in Apache Doris, the index-to-data size ratio is basically one to one. That means Apache Doris can save a lot of storage space for users. This is because it has adopted columnar storage and ZStandard compression. With data and indexes stored column by column, it is easier to compress them, and the ZStandard algorithm is faster and achieves a higher compression ratio, so it is perfect for log processing. 
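</p><p>As a small illustration, the compression codec is a per-table choice in Doris. Here is a minimal sketch with a made-up schema (we assume the <code>compression</code> table property; the distribution clause is elided):</p><pre><code class="language-sql">-- Columnar storage compressed with ZSTD suits repetitive log text well
CREATE TABLE example_log
(
    `ts`      DATETIME,
    `source`  VARCHAR(50),
    `message` TEXT
)
DUPLICATE KEY(`ts`)
...
PROPERTIES ("compression" = "zstd");
</code></pre><p>Higher compression also means less disk I/O per query, which contributes to the query speedups described next.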
</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="690-query-speed">690% query speed</h3><p>To compare the query performance before and after the upgrade, they tested the old and the new systems with 79 of their frequently executed SQL statements, on the same 100 billion rows of log data and with the same cluster size of 10 backend nodes.</p><p>They recorded the query response times as follows:</p><p>The new Apache Doris-based system is faster in all 79 queries. On average, it reduces the query execution time by a factor of 7.</p><p><img loading="lazy" alt="apache-doris-vs-starrocks-query-performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-vs-starrocks-query-performance-d4377592d59672165b17a6bc5158d8fe.png" width="1280" height="1017" class="img_ev3q"></p><p>Among these queries, the greatest increases in speed were enabled by a few features and optimizations of Apache Doris for log analysis.</p><p><strong>1. Inverted index accelerating keyword searches: Q23, Q24, Q30, Q31, Q42, Q43, Q50</strong></p><p>Example: Q43 was sped up 88.2 times.</p><pre><code class="language-sql">SELECT count() from table2 
WHERE ( event_time &gt;= 1693065600000 and event_time &lt; 1693152000000) 
 AND (rule_hit_big MATCH 'xxxx');
</code></pre><p>How is the <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index/" target="_blank" rel="noopener noreferrer">inverted index</a> implemented? Upon data writing, Apache Doris tokenizes the texts into words and keeps track of which words appear in which rows. For example, the word "machine" is in Row 127 and Row 201. In keyword searches, the system can quickly locate the relevant data by following the row numbers in the indexes.</p><p>Inverted index is much more efficient than brute-force scanning in text searches. For one thing, it doesn't have to read that much data. For another, it doesn't require text matching. So it is able to increase execution speed by orders of magnitude.</p><p><img loading="lazy" alt="cyber-security-inverted-index" src="https://cdnd.selectdb.com/zh-CN/assets/images/cyber-security-inverted-index-20f3d1267475f3074304b15f8a901db3.png" width="961" height="720" class="img_ev3q"></p><p><strong>2. 
NGram BloomFilter accelerating the LIKE operator: Q75, Q76, Q77, Q78</strong></p><p>Example: Q75 was sped up 44.4 times.</p><pre><code class="language-sql">SELECT * FROM table1
WHERE ent_id = 'xxxxx' 
 AND event_date = '2023-08-27' 
 AND file_level = 70 
 AND rule_group_id LIKE 'adid:%' 
ORDER BY event_time LIMIT 100;
</code></pre><p>For non-verbatim fuzzy searches, the LIKE operator is an important tool, so Apache Doris 2.0 introduced the <a href="https://doris.apache.org/docs/dev/data-table/index/ngram-bloomfilter-index" target="_blank" rel="noopener noreferrer">NGram BloomFilter</a> to accelerate it.</p><p>Unlike a regular BloomFilter, the NGram BloomFilter does not put the entire text into the filter; it splits the text into contiguous sub-strings of length N and puts those sub-strings into the filter. For a query like <code>cola LIKE '%pattern%'</code>, it splits <code>'pattern'</code> into several strings of length N and checks whether each of them exists in the data. If any sub-string is absent, the corresponding data cannot contain the word <code>'pattern'</code> and is skipped during scanning; that is how the NGram BloomFilter accelerates queries.</p>
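<p>Defining such an index is part of the table schema. Below is a minimal sketch: the column names mirror the query above, while the gram size and filter size are illustrative values to be tuned per workload:</p><pre><code class="language-sql">-- NGram BloomFilter index on the column that LIKE filters on
CREATE TABLE table1
(
    `ent_id`        VARCHAR(64),
    `event_date`    DATE,
    `file_level`    INT,
    `rule_group_id` VARCHAR(100),
    `event_time`    DATETIME,
    INDEX idx_rule_group (`rule_group_id`) USING NGRAM_BF
        PROPERTIES("gram_size" = "3", "bf_size" = "256")
)
DUPLICATE KEY(`ent_id`)
...
</code></pre><p><strong>3. 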
Optimizations for Top-N queries: Q19~Q29</strong></p><p>Example: Q22 was sped up 50.3 times.</p><pre><code class="language-sql">SELECT * FROM table1
where event_date = '2023-08-27' and file_level = 70 
 and ent_id = 'nnnnnnn' and file_name = 'xxx.exe'
order by event_time limit 100;
</code></pre><p>Top-N queries retrieve the N records that fit the specified conditions. They are a common type of query in log analysis, with the SQL typically looking like <code>SELECT * FROM t WHERE xxx ORDER BY xx LIMIT n</code>. Apache Doris has optimized itself for this pattern. Based on the intermediate status of queries, it figures out the dynamic range of the ranking field and implements automatic predicate pushdown to reduce data scanning. In some cases, this can decrease the scanned data volume by an order of magnitude.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="visualized-operation--maintenance">Visualized operation &amp; maintenance</h3><p>For more efficient cluster maintenance, VeloDB, the commercial supporter of Apache Doris, has contributed a visualized cluster management tool called <a href="https://github.com/apache/doris-manager" target="_blank" rel="noopener noreferrer">Doris Manager</a> to the Apache Doris project. Everyday management and maintenance operations can be done via Doris Manager, including cluster monitoring, inspection, configuration modification, scaling, and upgrading. The visualized tool can save a lot of manual effort and avoid the risk of misoperation on Doris.</p><p><img loading="lazy" alt="doris-manager-for-visualized-operation-and-maintenance" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-manager-for-visualized-operation-and-maintenance-b1f63cbae23f025b6ac4d49bf6b9ca36.png" width="1280" height="642" class="img_ev3q"></p><p>Apart from cluster management, Doris Manager provides a visualized WebUI for log analysis (think of Kibana), so it's very friendly to users who are familiar with the ELK Stack. 
It supports keyword searches, trend charts, field filtering, and detailed data listing with collapsible display, enabling interactive analysis and easy drill-down into logs.</p><p><img loading="lazy" alt="doris-manager-webui-showcase" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-manager-webui-showcase-cba1b2b240ff03357c833aae15e614da.png" width="1280" height="687" class="img_ev3q"></p><p>After a month-long trial run, they officially replaced their old LSAS with the Apache Doris-based system in production, and achieved the great results they expected. Now, they ingest hundreds of billions of new log entries every day via the <a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual/" target="_blank" rel="noopener noreferrer">Routine Load</a> method at 3 times the previous speed. On top of the 7x overall query performance increase, they benefit from a speedup of over 20 times in full-text searches. And they enjoy easier maintenance and interactive analysis. Their next step is to expand the coverage of the JSON data type and delve into semi-structured data analysis. Luckily, the upcoming Apache Doris 2.1 will provide more schema-free support. It will have a new Variant data type, support JSON data of any structure, and allow for flexible changes in the number and types of fields. Relevant updates will be released on the <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris website</a> and in the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[How big data is saving lives in real time: IoV data analytics helps prevent accidents]]></title>
<id>https://doris.apache.org/zh-CN/blog/how-big-data-is-saving-lives-in-real-time-iov-data-analytics-helps-prevent-accidents</id>
<link href="https://doris.apache.org/zh-CN/blog/how-big-data-is-saving-lives-in-real-time-iov-data-analytics-helps-prevent-accidents"/>
<updated>2023-11-29T00:00:00.000Z</updated>
<summary type="html"><![CDATA[What needs to be taken care of in IoV data analysis? What's the difference between a near real-time analytic data platform and an actual real-time analytic data platform?]]></summary>
<content type="html"><![CDATA[<p>Internet of Vehicles, or IoV, is the product of the marriage between the automotive industry and IoT. IoV data is expected to get larger and larger, especially with electric vehicles being the new growth engine of the auto market. The question is: Is your data platform ready for that? This post shows you what an OLAP solution for IoV looks like.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-special-about-iov-data">What is special about IoV data?<a href="#what-is-special-about-iov-data" class="hash-link" aria-label="What is special about IoV data?的直接链接" title="What is special about IoV data?的直接链接"></a></h2><p>The idea of IoV is intuitive: to create a network so vehicles can share information with each other or with urban infrastructure. What‘s often under-explained is the network within each vehicle itself. On each car, there is something called Controller Area Network (CAN) that works as the communication center for the electronic control systems. For a car traveling on the road, the CAN is the guarantee of its safety and functionality, because it is responsible for:</p><ul><li><strong>Vehicle system monitoring</strong>: The CAN is the pulse of the vehicle system. For example, sensors send the temperature, pressure, or position they detect to the CAN; controllers issue commands (like adjusting the valve or the drive motor) to the executor via the CAN. </li><li><strong>Real-time feedback</strong>: Via the CAN, sensors send the speed, steering angle, and brake status to the controllers, which make timely adjustments to the car to ensure safety. </li><li><strong>Data sharing and coordination</strong>: The CAN allows for data exchange (such as status and commands) between various devices, so the whole system can be more performant and efficient.</li><li><strong>Network management and troubleshooting</strong>: The CAN keeps an eye on devices and components in the system. It recognizes, configures, and monitors the devices for maintenance and troubleshooting.</li></ul><p>With the CAN being that busy, you can imagine the data size that is traveling through the CAN every day. In the case of this post, we are talking about a car manufacturer who connects 4 million cars together and has to process 100 billion pieces of CAN data every day. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="iov-data-processing">IoV data processing<a href="#iov-data-processing" class="hash-link" aria-label="IoV data processing的直接链接" title="IoV data processing的直接链接"></a></h2><p>To turn this huge data size into valuable information that guides product development, production, and sales is the juicy part. Like most data analytic workloads, this comes down to data writing and computation, which are also where challenges exist:</p><ul><li><strong>Data writing at scale</strong>: Sensors are everywhere in a car: doors, seats, brake lights... Plus, many sensors collect more than one signal. The 4 million cars add up to a data throughput of millions of TPS, which means dozens of terabytes every day. With increasing car sales, that number is still growing. </li><li><strong>Real-time analysis</strong>: This is perhaps the best manifestation of "time is life". Car manufacturers collect the real-time data from their vehicles to identify potential malfunctions, and fix them before any damage happens.</li><li><strong>Low-cost computation and storage</strong>: It's hard to talk about huge data size without mentioning its costs. 
Low cost makes big data processing sustainable.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="from-apache-hive-to-apache-doris-a-transition-to-real-time-analysis">From Apache Hive to Apache Doris: a transition to real-time analysis</h2><p>Like Rome, a real-time data processing platform is not built in a day. The car manufacturer used to rely on the combination of a batch analytic engine (Apache Hive) and some streaming frameworks and engines (Apache Flink, Apache Kafka) to gain near real-time data analysis performance. They didn't realize they needed real-time that badly until it became a problem.</p><p><strong>Near Real-Time Data Analysis Platform</strong></p><p>This is what used to work for them:</p><p><img loading="lazy" alt="IoV-Hive-based-data-warehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/IoV-Hive-based-data-warehouse-1bbef26f4fbb3012d0ae17fc3b1c4fa5.png" width="1280" height="766" class="img_ev3q"></p><p>Data from the CAN and vehicle sensors is uploaded via the 4G network to the cloud gateway, which writes the data into Kafka. Then, Flink processes this data and forwards it to Hive. After going through several data warehousing layers in Hive, the aggregated data is exported to MySQL. In the end, Hive and MySQL provide data to the application layer for data analysis, dashboarding, etc.</p><p>Since Hive is primarily designed for batch processing rather than real-time analytics, you can see the mismatch in this use case.</p><ul><li><strong>Data writing</strong>: With such a huge data size, the data ingestion time from Flink into Hive was noticeably long. In addition, Hive only supports data updating at the granularity of partitions, which is not enough for some cases.</li><li><strong>Data analysis</strong>: The Hive-based analytic solution delivered high query latency, which was a multi-factor issue. Firstly, Hive was slower than expected when handling large tables with 1 billion rows. Secondly, within Hive, data is extracted from one layer to another by the execution of Spark SQL, which could take a while. Thirdly, as Hive needed to work with MySQL to serve all needs from the application side, data transfer between Hive and MySQL also added to the query latency.</li></ul><p><strong>Real-Time Data Analysis Platform</strong></p><p>This is what happens when they add a real-time analytic engine to the picture:</p><p><img loading="lazy" alt="IoV-Doris-based-data-warehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/IoV-Doris-based-data-warehouse-6eb6329ab3bedda6ed707f02219d85c7.png" width="1280" height="1058" class="img_ev3q"></p><p>Compared to the old Hive-based platform, the new one is more efficient in three ways:</p><ul><li><strong>Data writing</strong>: Data ingestion into <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a> is quick and easy, without complicated configurations or the introduction of extra components. It supports a variety of data ingestion methods. 
For example, in this case, data is written from Kafka into Doris via <a href="https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual" target="_blank" rel="noopener noreferrer">Stream Load</a>, and from Hive into Doris via <a href="https://doris.apache.org/docs/data-operate/import/import-way/broker-load-manual" target="_blank" rel="noopener noreferrer">Broker Load</a>. </li><li><strong>Data analysis</strong>: As an example of the query speed of Apache Doris: it can return a 10-million-row result set within seconds in a cross-table join query. Also, it can work as a <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">unified query gateway</a> with its quick access to external data (Hive, MySQL, Iceberg, etc.), so analysts don't have to juggle multiple components.</li><li><strong>Computation and storage costs</strong>: Apache Doris provides the Zstandard algorithm, which can bring a 3~5 times higher data compression ratio. That's how it helps reduce costs in data computation and storage. Moreover, the compression is done solely in Doris, so it won't consume resources from Flink.</li></ul><p>A good real-time analytic solution not only stresses data processing speed; it also looks all the way along your data pipeline and smooths every step of it. Here are two examples:</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-the-arrangement-of-can-data">1. The arrangement of CAN data<a href="#1-the-arrangement-of-can-data" class="hash-link" aria-label="1. The arrangement of CAN data的直接链接" title="1. The arrangement of CAN data的直接链接"></a></h3><p>In Kafka, CAN data was arranged by the dimension of CAN ID. However, for the sake of data analysis, analysts had to compare signals from various vehicles, which meant concatenating data of different CAN IDs into a flat table and aligning it by timestamp. From that flat table, they could derive different tables for different analytic purposes. Such transformation was implemented using Spark SQL, which was time-consuming in the old Hive-based architecture, and the SQL statements were high-maintenance. Moreover, the data was updated by batch on a daily basis, which meant they could only get data from a day ago. </p><p>In Apache Doris, all they need to do is build the tables with the <a href="https://doris.apache.org/docs/data-table/data-model#aggregate-model" target="_blank" rel="noopener noreferrer">Aggregate Key model</a>, specify VIN (Vehicle Identification Number) and timestamp as the Aggregate Key, and define the other data fields with <code>REPLACE_IF_NOT_NULL</code> (a minimal sketch follows shortly). With Doris, they don't have to take care of the SQL statements or the flat table, but are able to extract real-time insights from real-time data.</p><p><img loading="lazy" alt="IoV-CAN-data" src="https://cdnd.selectdb.com/zh-CN/assets/images/IoV-CAN-data-21c4722dff0b60c64dd2286cbf3df3be.jpeg" width="1280" height="937" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-dtc-data-query">2. DTC data query<a href="#2-dtc-data-query" class="hash-link" aria-label="2. DTC data query的直接链接" title="2. DTC data query的直接链接"></a></h3><p>Of all CAN data, DTC (Diagnostic Trouble Code) deserves high attention and separate storage, because it tells you what's wrong with a car. Each day, the manufacturer receives around 1 billion pieces of DTC data. 
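</p><p>Before moving on to how DTC data is handled, here is the minimal sketch promised in the previous section: an Aggregate Key table for flattened CAN signals. The column names are illustrative, not the manufacturer's actual schema:</p><pre><code class="language-sql">-- Sketch of a flat CAN-signal table on the Aggregate Key model.
-- Rows sharing the same (vin, signal_time) are merged at ingestion,
-- and REPLACE_IF_NOT_NULL keeps the latest non-null value per signal.
CREATE TABLE can_signals (
    vin           VARCHAR(32)  COMMENT "Vehicle Identification Number",
    signal_time   DATETIME     COMMENT "signal timestamp",
    vehicle_speed DOUBLE       REPLACE_IF_NOT_NULL,
    motor_temp    DOUBLE       REPLACE_IF_NOT_NULL,
    brake_status  INT          REPLACE_IF_NOT_NULL
)
AGGREGATE KEY(vin, signal_time)
DISTRIBUTED BY HASH(vin) BUCKETS 32;
</code></pre><p>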
To capture life-saving information from the DTC, data engineers need to relate the DTC data to a DTC configuration table in MySQL.</p><p>What they used to do was to write the DTC data into Kafka every day, process it in Flink, and store the results in Hive. In this way, the DTC data and the DTC configuration table were stored in two different components. That caused a dilemma: a 1-billion-row DTC table was hard to write into MySQL, while querying from Hive was slow. As the DTC configuration table was also constantly updated, engineers could only import a version of it into Hive on a regular basis. That meant they didn't always get to relate the DTC data to the latest DTC configurations. </p><p>As mentioned, Apache Doris can work as a unified query gateway. This is supported by its <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> feature. They import their DTC data from Hive into Doris, and then they create a MySQL Catalog in Doris to map to the DTC configuration table in MySQL. When all this is done, they can simply join the two tables within Doris and get real-time query responses.</p><p><img loading="lazy" alt="IoV-DTC-data-query" src="https://cdnd.selectdb.com/zh-CN/assets/images/IoV-DTC-data-query-7e0534a9aafd3005e1e08439acb288fc.png" width="1280" height="523" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>This is an actual real-time analytic solution for IoV. It is designed for data at a really large scale, and it is now supporting a car manufacturer who receives 10 billion rows of new data every day in improving driving safety and experience.</p><p>Building a data platform that suits your use case is not easy. I hope this post helps you in building your own analytic solution.</p><p>Apache Doris <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">GitHub repo</a></p><p>Find Apache Doris makers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a></p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Less components, higher performance: Apache Doris instead of ClickHouse, MySQL, Presto, and HBase]]></title>
<id>https://doris.apache.org/zh-CN/blog/less-components-higher-performance-apache-doris-instead-of-clickhouse-mysql-presto-and-hbase</id>
<link href="https://doris.apache.org/zh-CN/blog/less-components-higher-performance-apache-doris-instead-of-clickhouse-mysql-presto-and-hbase"/>
<updated>2023-11-22T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This post is about building a unified OLAP platform. An insurance company tries to build a data warehouse that can undertake all their customer-facing, analyst-facing, and management-facing data analysis workloads.]]></summary>
<content type="html"><![CDATA[<p>This post is about building a unified OLAP platform. An insurance company tries to build a data warehouse that can undertake all their customer-facing, analyst-facing, and management-facing data analysis workloads. The main tasks include: </p><ul><li><strong>Self-service insurance contract query</strong>: This is for insurance customers to check their contract details by their contract ID. It should also support filters such as coverage period, insurance types, and claim amount. </li><li><strong>Multi-dimensional analysis</strong>: Analysts develop their reports based on different data dimensions as they need, so they can extract insights to facilitate product innovation and their anti-fraud efforts. </li><li><strong>Dashboarding</strong>: This is to create a visual overview of the insurance sales trends and the horizontal and vertical comparison of different metrics.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="component-heavy-data-architecture">Component-Heavy Data Architecture<a href="#component-heavy-data-architecture" class="hash-link" aria-label="Component-Heavy Data Architecture的直接链接" title="Component-Heavy Data Architecture的直接链接"></a></h2><p>The user started with the Lambda architecture, splitting their data pipeline into a batch processing link and a stream processing link. For real-time data streaming, they apply Flink CDC; for batch import, they incorporate Sqoop, Python, and DataX to build their own data integration tool named Hisen. </p><p><img loading="lazy" alt="multi-component-data-warehouse-mysql-clickhouse-hbase-hive-presto" src="https://cdnd.selectdb.com/zh-CN/assets/images/multi-component-data-warehouse-mysql-clickhouse-hbase-hive-presto-6e3dbac016295bce3108943b4bddcf4c.png" width="1280" height="640" class="img_ev3q"></p><p>Then, the real-time and offline data meet in the data warehousing layer, which is made up of five components.</p><p><strong>ClickHouse</strong></p><p>The data warehouse is of flat table design, and ClickHouse is superb at flat table reading. But as the business evolves, things become challenging in two ways:</p><ul><li>To support cross-table joins and point queries, the user requires the star schema, but that's difficult to implement in ClickHouse.</li><li>Changes in insurance contracts need to be updated in the data warehouse in real time. In ClickHouse, that is done by recreating a flat table to overwrite the old one, which is not fast enough.</li></ul><p><strong>MySQL</strong></p><p>After calculation, data metrics are stored in MySQL, but as the data size grows, MySQL starts to struggle, with problems emerging like prolonged execution times and errors being thrown.</p><p><strong>Apache Hive + Presto</strong></p><p>Hive is the main executor in the batch processing link. It can transform, aggregate, and query offline data. Presto is a complement to Hive for interactive analysis.</p><p><strong>Apache HBase</strong></p><p>HBase undertakes primary key queries. It reads customer status from MySQL and Hive, including customer credits, coverage period, and sum insured. However, since HBase does not support secondary indexes, it has limited capability in reading non-primary key columns. Plus, as a NoSQL database, HBase does not support SQL statements.</p><p>The components have to work in conjunction to serve all needs, making the data warehouse a lot to take care of. It is not easy to get started with because engineers must be trained on all these components. 
Also, the complexity of the architecture adds to the risk of latency. </p><p>So the user tried to look for a tool that ticks more boxes in fulfilling their requirements. The first thing they need is real-time capabilities, including real-time writing, real-time updating, and real-time response to data queries. Secondly, they need more flexibility in data analysis to support customer-facing self-service queries, like multi-dimensional analysis, join queries of large tables, primary key indexes, roll-ups, and drill-downs. Then, for batch processing, they also want high throughput in data writing.</p><p>They eventually made up their minds and went with <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a>. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="replacing-four-components-with-apache-doris">Replacing Four Components with Apache Doris<a href="#replacing-four-components-with-apache-doris" class="hash-link" aria-label="Replacing Four Components with Apache Doris的直接链接" title="Replacing Four Components with Apache Doris的直接链接"></a></h2><p> Apache Doris is capable of both real-time and offline data analysis, and it supports both high-throughput interactive analysis and high-concurrency point queries. That's why it can replace ClickHouse, MySQL, Presto, and Apache HBase and work as the unified query gateway for the entire data system. </p><p><img loading="lazy" alt="unified-data-warehouse-kafka-apache-doris-hive" src="https://cdnd.selectdb.com/zh-CN/assets/images/unified-data-warehouse-kafka-apache-doris-hive-0c1accc90b4280a26b81be17b31e5a63.png" width="1280" height="686" class="img_ev3q"></p><p>The improved data pipeline is a much cleaner Lambda architecture. </p><p>Apache Doris provides a wide range of data ingestion methods. It's quick in data writing. On top of this, it also implements Merge-on-Write to improve its performance on concurrent point queries. </p><p><strong>Reduced Cost</strong></p><p>The new architecture has reduced the user's cost in human effort. For one thing, the much simpler data architecture leads to much easier maintenance; for another, developers no longer need to join the real-time and offline data in the data serving API.</p><p>The user can also save money with Doris because it supports tiered storage. It allows the user to put their huge amount of rarely accessed historical data in object storage, which is a much cheaper place to keep data.</p><p><strong>Higher Efficiency</strong></p><p>Apache Doris can reach tens of thousands of QPS and respond to point queries on billions of rows within milliseconds, so the customer-facing queries are easy for it to handle. Tiered storage that separates hot data from cold data also increases their query efficiency.</p><p><strong>Service Availability</strong></p><p>As a unified data warehouse for storage, computation, and data services, Apache Doris allows for easy disaster recovery. With fewer components, they don't have to worry about data loss or duplication. </p><p>An important guarantee of service availability for the user is the Cross-Cluster Replication (CCR) capability of Apache Doris. It can synchronize data from cluster to cluster within minutes or even seconds, and it implements two mechanisms to ensure data reliability:</p><ul><li><strong>Binlog</strong>: This mechanism can automatically log the data changes and generate a LogID for each data modification operation. 
The incremental LogIDs make sure that data changes are traceable and ordered.</li><li><strong>Data persistence</strong>: In the case of a system meltdown or other emergencies, data is persisted to disk.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-deeper-look-into-apache-doris">A Deeper Look into Apache Doris<a href="#a-deeper-look-into-apache-doris" class="hash-link" aria-label="A Deeper Look into Apache Doris的直接链接" title="A Deeper Look into Apache Doris的直接链接"></a></h2><p>Apache Doris can replace ClickHouse, MySQL, Presto, and HBase because it has a comprehensive collection of capabilities all along the data processing pipeline. In data ingestion, it enables low-latency real-time writing based on its support for Flink CDC and Merge-on-Write. It guarantees Exactly-Once writing by its Label mechanism and transactional loading. In data queries, it supports both the star schema and flat table aggregation, so it can provide high performance in both multi-table joins and large single-table queries. It also provides various ways to speed up different queries, like <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index/" target="_blank" rel="noopener noreferrer">inverted index</a> for full-text search and range queries, and short-circuit plans and prepared statements for point queries.</p>]]></content>
<author>
<name>CIGNA &amp; CMB</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris Summit Asia 2023: what can you expect from apache doris as a data warehouse?]]></title>
<id>https://doris.apache.org/zh-CN/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse</id>
<link href="https://doris.apache.org/zh-CN/blog/apache-doris-summit-asia-2023-what-can-you-expect-from-apache-doris-as-a-data-warehouse"/>
<updated>2023-11-10T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The past year marks a breakthrough of Apache Doris, an open-source real-time data warehouse that has just undergone an overall upgrade after long consistent incremental optimizations.]]></summary>
<content type="html"><![CDATA[<p>When cranberry and pumpkin season came around, we held the unforgettable Apache Doris Summit Asia 2023 with our remarkable committers, users, and community partners, to honor what we have achieved in the past year and provide a preview of where we are going next.</p><p>The past year marks a breakthrough for <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a>, an open-source real-time data warehouse that has just undergone an overall upgrade after long and consistent incremental optimization:</p><p><strong>More</strong></p><p>Thanks to the hard work of 275 committers, the <a href="https://doris.apache.org/blog/release-note-2.0.0" target="_blank" rel="noopener noreferrer">Apache Doris 2.0</a> milestone has merged over 4100 pull requests, representing a 70% increase from version 1.2 last year and a 10-fold increase from 1.1. </p><p><strong>Faster</strong> </p><p>This year, Apache Doris has attained a 10-fold performance increase in blind benchmarking and single-table queries, a 13-fold increase in multi-table joins, and a 20-fold increase in concurrent point queries. The high query performance is supported by the smart design of Apache Doris, including a vectorized execution engine, the Merge-on-Write mechanism, the Light Schema Change feature, a self-adaptive parallel execution model, and a <a href="https://doris.apache.org/docs/query-acceleration/nereids?_highlight=nereids" target="_blank" rel="noopener noreferrer">new query optimizer</a>.</p><p><strong>Wider</strong></p><p>We have built Apache Doris into more than just a powerful OLAP engine: it is now a data warehouse for a wider range of use cases, including log analysis and high-concurrency data services. To expand the data warehousing capabilities of Apache Doris, we have introduced <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> to connect Doris to a wide array of data sources.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="one-of-the-most-active-open-source-big-data-projects">One of the most active open source big data projects<a href="#one-of-the-most-active-open-source-big-data-projects" class="hash-link" aria-label="One of the most active open source big data projects的直接链接" title="One of the most active open source big data projects的直接链接"></a></h2><p>Apache Doris has become one of the world's most active open-source big data projects in all aspects:</p><ul><li>It has hit <strong>10K stars</strong> on <a href="https://github.com/apache/doris/" target="_blank" rel="noopener noreferrer">GitHub</a>, a year-on-year growth of 70%, and the momentum keeps going.</li><li>The community now includes almost 600 contributors and welcomes new faces every week.</li><li>With <strong>120 monthly active contributors</strong>, Apache Doris has become a more active project than Apache Spark, Elasticsearch, Trino, and Apache Druid.</li><li>Over <strong>160 pull requests</strong> are created every week. Meanwhile, we have established a mature code review pipeline, making sure that every pull request stands the test of 3000 use cases. 
This is how we guarantee stability in the midst of agile iteration.</li></ul><p><img loading="lazy" alt="Apache-Doris-monthly-active-contributors" src="https://cdnd.selectdb.com/zh-CN/assets/images/Apache-Doris-monthly-active-contributors-1d7fa091149e4e022453d084f7ad9020.png" width="1190" height="720" class="img_ev3q"></p><p>Along with such growth, we've also witnessed higher diversity among contributors. They are engineers from tech giants and database unicorns, like VeloDB, the commercial company built on Apache Doris. Many cloud service providers, including Alibaba Cloud, Tencent Cloud, Huawei Cloud, AWS and GCP (coming soon), have also jumped on the bandwagon and provide Doris-based data warehouse cloud hosting services.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="fast-expanding-user-base">Fast-expanding user base<a href="#fast-expanding-user-base" class="hash-link" aria-label="Fast-expanding user base的直接链接" title="Fast-expanding user base的直接链接"></a></h2><p>Apache Doris now has a user base of over 30,000 data engineers from more than <strong>4000 enterprises</strong>, including those from the tech sector, finance, telecom, manufacturing, logistics, and retail. The great majority of them keep in close touch with the Apache Doris developers, committing code, getting involved in tests, and sharing experience and feedback with the community. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="fruit-that-have-been-reaped">Fruits that have been reaped<a href="#fruit-that-have-been-reaped" class="hash-link" aria-label="Fruits that have been reaped的直接链接" title="Fruits that have been reaped的直接链接"></a></h2><p>We aim to make Apache Doris the first choice for people in real-time data analysis. What we have done in the past year can be summed up in three keywords:</p><ul><li><strong>Real-time</strong>: We have realized high-throughput real-time data writing and updates, as well as low query latency.</li><li><strong>Unified</strong>: As we've been trying to make Doris an all-in-one platform that can undertake most of the analytic workloads for users, we have expanded and enhanced the data lakehousing capabilities of Doris, enabled faster log analysis, faster ELT/ETL, and faster response to point queries.</li><li><strong>Cloud-native</strong>: This is a leap towards cloud infrastructure. Apache Doris can now be deployed and run on Kubernetes to reduce storage and computation costs.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-response-to-queries">Real-time response to queries<a href="#real-time-response-to-queries" class="hash-link" aria-label="Real-time response to queries的直接链接" title="Real-time response to queries的直接链接"></a></h3><p>As mentioned, Apache Doris 2.0 delivers 10 times faster query speed than the previous versions, but what is the key accelerator behind such high performance? It is the <a href="https://doris.apache.org/docs/query-acceleration/nereids/" target="_blank" rel="noopener noreferrer">cost-based query optimizer</a> and the self-adaptive <a href="https://doris.apache.org/docs/query-acceleration/pipeline-execution-engine/" target="_blank" rel="noopener noreferrer">pipeline parallel execution model</a> of Apache Doris. </p><p>In traditional data reporting, data is often arranged in flat tables. The idea of flat tables and pre-aggregated tables is to trade storage space for query speed. In these cases, the key to high performance is to accelerate data scanning and aggregation. 
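</p><p>As a simplified illustration of that trade-off, a pre-aggregated table in Doris can be declared on the Aggregate Key model, so that rows are merged at ingestion and report queries scan far less data (the schema below is purely illustrative):</p><pre><code class="language-sql">-- Rows with the same (dt, region) are merged as data is ingested,
-- so a report query reads pre-aggregated values instead of raw rows.
CREATE TABLE sales_report (
    dt      DATE,
    region  VARCHAR(32),
    revenue BIGINT SUM,
    orders  BIGINT SUM
)
AGGREGATE KEY(dt, region)
DISTRIBUTED BY HASH(region) BUCKETS 8;
</code></pre><p>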
However, since today's data analytic workloads involve more complex computations with more and larger batch processing, data engineers often have to fine-tune the database and rewrite the SQL before they can enjoy satisfactory query speeds. That's why we have refactored the query optimizer in Apache Doris. The new query optimizer can figure out the most efficient query execution plan for a thousand-line SQL statement or a join query that relates dozens of tables, saving engineers lots of effort.</p><p>Similarly, the new version of Doris has automated another engineering-intensive process: adjusting the compute instance execution concurrency in the backend. What bothered our users was that when queries of different sizes happened concurrently, these queries tended to fight for resources and thus required human intervention. To solve that, we have introduced a pipeline execution model. It automatically decides the execution concurrency for the current situation to make sure queries of all sizes are executed smoothly. As a result, Doris now has more efficient CPU usage and higher system stability during query execution.</p><p>For <strong><a href="https://doris.apache.org/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times" target="_blank" rel="noopener noreferrer">high concurrency point queries</a></strong>, Apache Doris 2.0 reached a throughput of 30,000 QPS. It is a 20-fold improvement driven by optimizations in data storage, reading, and query execution. As a column-oriented DBMS, Apache Doris has relatively low row reading efficiency, so we have introduced row/column hybrid storage and <a href="https://doris.apache.org/docs/query-acceleration/hight-concurrent-point-query/" target="_blank" rel="noopener noreferrer">row cache</a> to make up for that. We have also enabled the short circuit plan and prepared statements in Apache Doris. The former allows simple queries to skip the query planner for faster execution, and the latter allows users to reuse SQL for similar queries and thus reduce frontend overhead.</p><p><img loading="lazy" alt="hybrid-column-row-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/hybrid-column-row-storage-e27d3b35a9c082d9552e2b003e46c3a5.png" width="1280" height="490" class="img_ev3q"></p><p>For <strong>multi-dimensional data analysis</strong>, we introduced the <a href="https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch" target="_blank" rel="noopener noreferrer">inverted index</a> to accelerate fuzzy keyword queries, equivalence queries, and range queries.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-data-writing-and-update">Real-time data writing and update<a href="#real-time-data-writing-and-update" class="hash-link" aria-label="Real-time data writing and update的直接链接" title="Real-time data writing and update的直接链接"></a></h3><p>Data writing is another side of the real-time story, so we also spent great effort improving the data ingestion speed of Apache Doris. After optimizations like Memtable parallel flushing and single-copy ingestion, Apache Doris is now 2~8 times faster in data writing. 
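</p><p>For a sense of what continuous ingestion looks like in practice, here is a minimal sketch of a Routine Load job that keeps pulling data from Kafka; the broker, topic, table, and job names are placeholders:</p><pre><code class="language-sql">-- Continuously consume a Kafka topic into a Doris table.
CREATE ROUTINE LOAD example_db.events_job ON events
COLUMNS TERMINATED BY ","
PROPERTIES (
    "desired_concurrent_number" = "3",
    "max_batch_interval" = "20"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "events_topic",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
</code></pre><p>The chart below shows the measured write performance gains.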
</p><p><img loading="lazy" alt="data-writing-efficiency" src="https://cdnd.selectdb.com/zh-CN/assets/images/data-writing-efficiency-c0b378a1da2df226e13f0f3edc27b8d5.png" width="1280" height="460" class="img_ev3q"></p><p>The <strong><a href="https://doris.apache.org/docs/data-table/data-model#merge-on-write" target="_blank" rel="noopener noreferrer">Merge-on-Write</a></strong> mechanism has been upgraded in version 2.0. It enables an upsert throughput of nearly 1 million rows per second, and it now supports a wider range of updating operations, including partial column updates.</p><p><img loading="lazy" alt="merge-on-write" src="https://cdnd.selectdb.com/zh-CN/assets/images/merge-on-write-08cf65ceba2e5a402bdd3a7159faca46.png" width="1280" height="411" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-for-more-use-cases">Support for more use cases<a href="#support-for-more-use-cases" class="hash-link" aria-label="Support for more use cases的直接链接" title="Support for more use cases的直接链接"></a></h3><p>For <strong><a href="https://doris.apache.org/blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance" target="_blank" rel="noopener noreferrer">data lakehousing</a></strong>, our last big move was to introduce <a href="https://doris.apache.org/docs/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">Multi-Catalog</a> for auto-mapping and auto-synchronization of heterogeneous data sources. In 2.0, we have further enhanced that. It now supports even more data sources, and it is also much faster in various production environments. With Multi-Catalog, users can ingest their multi-source data into Doris using the simple <code>insert into select</code> operation. </p><p>For <strong><a href="https://doris.apache.org/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch" target="_blank" rel="noopener noreferrer">log analysis</a></strong>, Doris 2.0 provides native support for semi-structured data, which can be arranged in data types like JSON, Array, and Map. On the basis of Light Schema Change, it allows Schema Evolution. In addition to the foregoing inverted index, Doris 2.0 comes with a high-performance text analysis algorithm. Built on its high-throughput data writing and low-cost storage capabilities, Apache Doris is 10 times more cost-effective than the common log analytic solutions on the market.</p><p>For different analytic workloads in one single cluster, the Doris solution to <strong>resource isolation</strong> is <a href="https://doris.apache.org/docs/admin-manual/workload-group" target="_blank" rel="noopener noreferrer">Workload Group</a>. As the name implies, it divides various workloads into groups, allowing more flexible use of memory and CPU resources. Users can limit the number of queries that a workload group can handle concurrently, so when there are too many query requests, the excess ones will wait in a queue. This is a way to relieve system pressure. 
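</p><p>A minimal sketch of such a group, assuming the property names documented for Workload Group and purely illustrative values:</p><pre><code class="language-sql">-- Queries beyond max_concurrency wait in a queue of up to
-- max_queue_size entries, for at most queue_timeout milliseconds.
CREATE WORKLOAD GROUP IF NOT EXISTS adhoc_group
PROPERTIES (
    "cpu_share" = "10",
    "memory_limit" = "30%",
    "max_concurrency" = "10",
    "max_queue_size" = "100",
    "queue_timeout" = "60000"
);
</code></pre><p>The figure below illustrates how workload groups divide cluster resources.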
</p><p><img loading="lazy" alt="resource-isolation-workload-group" src="https://cdnd.selectdb.com/zh-CN/assets/images/resource-isolation-workload-group-07626534ed91bd98298fbc854fccd310.png" width="1280" height="396" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="low-cost-and-high-availability">Low cost and high availability<a href="#low-cost-and-high-availability" class="hash-link" aria-label="Low cost and high availability的直接链接" title="Low cost and high availability的直接链接"></a></h3><p>Apache Doris provides <strong><a href="https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How" target="_blank" rel="noopener noreferrer">tiered storage</a></strong>. The less frequently accessed data, namely cold data, is put into object storage to reduce costs. Moreover, since object storage only requires a single copy of data, storage costs are further cut by 2/3 compared to 3-replica storage. Calculations based on AWS pricing show that tiered storage can save you 70% of your cloud disk expenditure.</p><p><img loading="lazy" alt="tiered-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/tiered-storage-845f28e641725352de42e77820f704a3.png" width="1684" height="644" class="img_ev3q"></p><p>To facilitate Kubernetes deployment, we have built a <strong>Kubernetes Operator</strong>. With it, users can easily deploy, scale, inspect, and maintain all Apache Doris nodes (frontends, backends, compute nodes, brokers) on Kubernetes. The compute node is a variant of the backend node that does not store any data, which is why it is a good fit for auto-scaling of clusters. During computation peaks, compute nodes can flexibly join the cluster and share the burden. Auto-scaling has been under active testing and will soon be released in upcoming versions of Apache Doris.</p><p><img loading="lazy" alt="kubernetes-operator-for-apache-doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/kubernetes-operator-for-apache-doris-802ef820181d3938e390b06efb34bd22.png" width="1920" height="1118" class="img_ev3q"></p><p>To guarantee service availability, Apache Doris 2.0 supports <strong>Cross-Cluster Replication (CCR)</strong>. As a disaster recovery solution, it supports read-write separation and multi-data center backup. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="reach-for-the-stars">Reach for the stars<a href="#reach-for-the-stars" class="hash-link" aria-label="Reach for the stars的直接链接" title="Reach for the stars的直接链接"></a></h2><p>In the foreseeable future, Apache Doris will go further in the aforementioned three directions: real-time, unified, and cloud-native. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="get-even-faster">Get even faster<a href="#get-even-faster" class="hash-link" aria-label="Get even faster的直接链接" title="Get even faster的直接链接"></a></h3><p>In the upcoming Apache Doris 2.1, the <strong>cost-based query optimizer (CBO)</strong> will be able to automatically collect execution statistics and provide support for hint syntax. It will also allow users to adjust the optimizing rules. To fully demonstrate the performance of our CBO, we will release TPC-DS benchmark results. </p><p>In addition, Doris 2.1 will support <strong>multi-table materialized views</strong> and <strong>writing intermediate results to disks</strong>. Meanwhile, a Union All operator will be added to accelerate the ETL process in Apache Doris. That means users will experience higher performance and stability when processing large batches of data. 
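</p><p>To make the multi-table materialized view idea concrete, here is a rough sketch of what such a definition could look like; since the feature is still upcoming, treat the syntax, refresh options, and names as illustrative rather than a final spec:</p><pre><code class="language-sql">-- An asynchronously refreshed view over a two-table join.
CREATE MATERIALIZED VIEW daily_region_sales
BUILD IMMEDIATE
REFRESH COMPLETE ON SCHEDULE EVERY 1 HOUR
DISTRIBUTED BY HASH(region) BUCKETS 8
AS
SELECT o.dt, s.region, SUM(o.amount) AS revenue
FROM orders o JOIN stores s ON o.store_id = s.store_id
GROUP BY o.dt, s.region;
</code></pre><p>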
You can also expect a new Join algorithm that can double the execution speed of multi-table join queries.</p><p>In terms of <strong>data writing</strong>, we are trying to make it simpler and more intuitive for you, with efforts in three aspects. </p><ol><li>In future versions, data streams, local files, and data from relational databases or data lakes will all be modeled as relational tables, and they can all be written into Doris using the simple <code>insert into</code> statement. </li><li>We will simplify the data writing pipeline. Data writing will be implemented by the built-in job scheduling mechanism, so users won't need an extra data synchronization component. </li><li>When there is frequent data writing, Doris will wait until the data accumulates into a sizable batch at the server end, so as to reduce the pressure caused by small file merging.</li></ol><p>In terms of <strong>data updating</strong>, as the Merge-on-Write mechanism advances towards maturity, it will be enabled in Doris by default. Users will be able to flexibly update or modify any columns in tables as they want. Also, based on Merge-on-Write, we will build a one-size-fits-all data model, so users don't have to rack their brains choosing the right data model for various use cases.</p><p>Apache Doris 2.1 will have enhanced <strong>observability</strong>. It will provide a brand new Profile for users to monitor operator execution, and visualize the query execution status with the aid of <a href="https://github.com/apache/doris-manager" target="_blank" rel="noopener noreferrer">Doris Manager</a>.</p><p><img loading="lazy" alt="doris-manager" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris-manager-a10a321e0a80f15c575b70a5159c5039.png" width="1280" height="378" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-more-analytic-scenarios">Support more analytic scenarios<a href="#support-more-analytic-scenarios" class="hash-link" aria-label="Support more analytic scenarios的直接链接" title="Support more analytic scenarios的直接链接"></a></h3><p>The above-mentioned multi-table materialized view and built-in job scheduling mechanism will also benefit the <strong>data lakehousing</strong> capability of Doris. From heterogeneous data sources to the data warehouse, users won't need a second component to do ETL and data warehouse layering. </p><p>In version 2.0, we support data writeback to JDBC sources, and we are going to expand that functionality to more data sources, including Apache Iceberg, Apache Hudi, Delta Lake, and Apache Paimon.</p><p><img loading="lazy" alt="apache-doris-data-warehouse-layers" src="https://cdnd.selectdb.com/zh-CN/assets/images/apache-doris-data-warehouse-layers-34be5c71b95e444222c5144817f7d0df.png" width="1254" height="720" class="img_ev3q"></p><p>For data ingestion from data lakes, Apache Doris currently adopts the MySQL protocol. In large-scale data reading or data science use cases (like those involving Pandas), this might be a throughput bottleneck. Thus, what we are doing is introducing an Arrow Flight-based high-speed reading interface, which transfers data via the Doris backends directly. 
<strong>In our tests, the new interface delivers a data transfer throughput that is 100 times higher.</strong></p><p><img loading="lazy" alt="writing-throughput" src="https://cdnd.selectdb.com/zh-CN/assets/images/writing-throughput-fff5fa2be3ed27977d8e846659f9cb8e.png" width="1280" height="405" class="img_ev3q"></p><p>For <strong>log analysis</strong>, the inverted index will support more complicated data types, such as Array, Map, and GEO. We will also introduce a new data type named Variant to provide <strong>schema-free support</strong>. This means users can not only put JSON data of any shape and type in the table fields, but also easily handle schema changes without any DDL operations.</p><p><img loading="lazy" alt="schemaless-variant-data-type" src="https://cdnd.selectdb.com/zh-CN/assets/images/schemaless-variant-data-type-c003f11f5699fda02d7688af1bd5b94a.png" width="1280" height="520" class="img_ev3q"></p><p>For <strong>workload management</strong>, we will enable higher flexibility. Users will be able to use SQL to create, manage, and allocate resources for their Workload Groups. We will continue to maximize resource utilization while ensuring resource isolation between workload groups.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="cloud-nativeness-and-storage-compute-separation">Cloud-nativeness and storage-compute separation<a href="#cloud-nativeness-and-storage-compute-separation" class="hash-link" aria-label="Cloud-nativeness and storage-compute separation的直接链接" title="Cloud-nativeness and storage-compute separation的直接链接"></a></h3><p>When Apache Doris 2.0 was released, we previewed the merging of the SelectDB Cloud storage-compute separation solution into the Apache Doris project. After some intense code refactoring and compatibility building, this functionality will be ready in Apache Doris 2.2, and users will be able to experience its elastic computation capability. </p><p><img loading="lazy" alt="storage-compute-separation" src="https://cdnd.selectdb.com/zh-CN/assets/images/storage-compute-separation-95c9fb0f82f930be25c189ee564f7b41.png" width="2000" height="1018" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="stick-to-innovation">Stick to Innovation<a href="#stick-to-innovation" class="hash-link" aria-label="Stick to Innovation的直接链接" title="Stick to Innovation的直接链接"></a></h2><p>As Apache Doris ramps up, we look back on its ten-year development and ask ourselves: <strong>what injects vitality into this great project and keeps it vibrant for this long?</strong> The answer is that we have been working with innovators.</p><p>Back when SQL on Hadoop gained currency, Apache Doris chose to stay outside the Hadoop ecosystem. It relies on neither HDFS for data storage nor ZooKeeper for distributed coordination, but insists on providing high availability through its own scalable processes. While the major databases on the market go by their own syntaxes, Apache Doris adopts standard SQL and the MySQL protocol, in order to lower the threshold for users. 
</p><p>From the self-developed pre-aggregation storage engine, materialized views, and the MPP framework, to inverted index, row/column hybrid storage, Light Schema Change, Merge-on-Write, and the Variant data type, Apache Doris never stops breaking new ground to provide better performance and user experience, which is also what we are going to do next:</p><ul><li>We want to work with more open-source enthusiasts to make a difference to the world.</li><li>We want to keep inspiring the data world by presenting more use cases.</li><li>We want to provide more and better choices for users by collaborating with partners along the data pipeline and cloud service providers.</li></ul><p>By choosing Apache Doris, you choose to stay in the heartbeat of innovation. The <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a> awaits newcomers.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Top News" term="Top News"/>
</entry>
<entry>
<title type="html"><![CDATA[Data analysis for live streaming: what happens in real time is analyzed in real time]]></title>
<id>https://doris.apache.org/zh-CN/blog/data-analysis-for-live-streaming-what-happens-in-real-time-is-analyzed-in-real-time</id>
<link href="https://doris.apache.org/zh-CN/blog/data-analysis-for-live-streaming-what-happens-in-real-time-is-analyzed-in-real-time"/>
<updated>2023-10-30T00:00:00.000Z</updated>
<summary type="html"><![CDATA[As live streaming emerges as a way of doing business, the need for data analysis follows up. This post is about how a live streaming service provider with 800 million end users found the right database to support its analytic solution.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="whats-different-about-data-analytics-in-live-streaming">What's different about data analytics in live streaming?<a href="#whats-different-about-data-analytics-in-live-streaming" class="hash-link" aria-label="What's different about data analytics in live streaming?的直接链接" title="What's different about data analytics in live streaming?的直接链接"></a></h2><p>Live streaming is one typical use case for real-time data analysis, because it stresses speed. Livestream organizers need to keep abreast of the latest data to see what is happening and maximize effectiveness. Realizing that requires high efficiency in every step of data processing:</p><ul><li><strong>Data writing</strong>: A live event churns out huge amounts of data every second, so the database should be able to ingest such high throughput stably.</li><li><strong>Data update</strong>: Like life itself, live streaming entails a lot of data changes, so there should be a quick and reliable data updating mechanism to absorb the changes. </li><li><strong>Data queries</strong>: Data should be ready and accessible as soon as analysts want it. Mostly that means real-time visibility.</li><li><strong>Maintenance</strong>: What's special about live streaming is that the data stream has prominent highs and lows. The analytic system should be able to ensure stability during peak times, and allow scaling down in off-peak times in order to improve resource utilization. If possible, it should also provide disaster recovery services to guarantee system availability, since the worst case in live streaming is interruption. </li></ul><p>The rest of this post is about how a live streaming service provider with 800 million end users found the right database to support its analytic solution.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="simplify-the-components">Simplify the Components<a href="#simplify-the-components" class="hash-link" aria-label="Simplify the Components的直接链接" title="Simplify the Components的直接链接"></a></h2><p>In this case, the live streaming data analytic platform adopts the Lambda architecture, which consists of a batch processing pipeline and a streaming pipeline, the former for user profile information and the latter for real-time generated data, including metrics like real-time subscriptions, visitor counts, comments, and responses. </p><ul><li><strong>Batch processing</strong>: The basic user information stored in HDFS is written into HBase to form a table.</li><li><strong>Streaming</strong>: Real-time generated data from MySQL, collected via Flink CDC, goes into Apache Kafka. Flink works as the computation engine, and then the data is stored in Redis.</li></ul><p><img loading="lazy" alt="database-for-live-shopping-Elasticsearch-HBase" src="https://cdnd.selectdb.com/zh-CN/assets/images/xiaoe-tech-1-85a1ce0c20ef5cee50ca0b3c908f9ee0.png" width="1898" height="966" class="img_ev3q"></p><p>The real-time metrics are combined with the user profile information to form a flat table, and Elasticsearch works as the query engine.</p><p>As their business burgeons, the expanding data size becomes unbearable for this platform, with problems like:</p><ul><li><strong>Delayed data writing</strong>: The multiple components result in multiple steps in data writing, and inevitably lead to prolonged write latency, especially during peak times. 
</li><li><strong>Complicated updating mechanism</strong>: Every time there is a data change, such as a change in user subscription information, it must be updated into the main tables and dimensional tables, and then the tables are correlated to generate a new flat table. And don't forget that this long process has to be executed across multiple components. So just imagine the complexity.</li><li><strong>Slow queries</strong>: As the query engine, Elasticsearch struggles with concurrent query requests and data accesses. It is also not flexible enough to deal with join queries.</li><li><strong>Time-consuming maintenance</strong>: All engineers developing or maintaining this platform need to master all the components. That's a lot of training. And adding new metrics to the data pool is labor-intensive.</li></ul><p>So to sum up, the main problem for this architecture is its complexity. Reducing the components means finding a database that is not only capable of most workloads, but also performant in data writing and queries. After 6 months of testing, they finally upgraded their live streaming analytic platform with <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a>. </p><p>They converge the streaming and the batch processing pipelines at Apache Doris. It can undertake analytic workloads and also provides a storage layer, so data doesn't have to shuffle back to Elasticsearch and HBase as it did in the old architecture.</p><p>With Apache Doris as the data warehouse, the platform architecture becomes neater.</p><p><img loading="lazy" alt="database-for-live-shopping-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/xiaoe-tech-2-53446135cfc264b66e055259af6ff08b.png" width="1908" height="936" class="img_ev3q"></p><ul><li><strong>Smooth data writing</strong>: Raw data is processed by Flink and written into Apache Doris in real time. The Doris community provides a <a href="https://github.com/apache/doris-flink-connector" target="_blank" rel="noopener noreferrer">Flink-Doris-Connector</a> with built-in Flink CDC.</li><li><strong>Flexible data update</strong>: For data changes, Apache Doris implements <a href="https://doris.apache.org/docs/data-table/data-model/#merge-on-write" target="_blank" rel="noopener noreferrer">Merge-on-Write</a>. This is especially useful in small-batch real-time writing because you don't have to renew the entire flat table. It also supports partial column updates, which is another way to make data updates more lightweight. In this case, Apache Doris is able to finish Upsert or Insert Overwrite operations at <strong>200,000 rows per second</strong>, and these are all done in large tables, with the biggest ones reaching billions of rows. </li><li><strong>Faster queries</strong>: For join queries, Apache Doris can easily join multiple large tables (10 billion rows). It can respond to a rich variety of queries within seconds or even milliseconds, including tag retrievals, fuzzy queries, ranking, and paginated queries.</li><li><strong>Easier maintenance</strong>: As for Apache Doris itself, the frontend and backend nodes are both flexibly scalable. It is compatible with the MySQL protocol. What took the developers a month can now be finished within a week, which allows for more agile iteration of metrics. </li></ul><p>The above shows how Apache Doris speeds up the entire data processing pipeline with its all-in-one capabilities. 
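</p><p>To make the Merge-on-Write point concrete, here is a minimal sketch of a Unique Key table with the mechanism enabled; the schema is illustrative rather than the provider's actual table:</p><pre><code class="language-sql">-- Upserts on u_id replace old rows at write time (Merge-on-Write),
-- so reads never pay the cost of merging row versions.
CREATE TABLE viewer_metrics (
    u_id          VARCHAR(64),
    watch_seconds INT,
    comment_count INT
)
UNIQUE KEY(u_id)
DISTRIBUTED BY HASH(u_id) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true"
);
</code></pre><p>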
Beyond that, it has some delightful features that can increase query efficiency and ensure service reliability in the case of live streaming. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="disaster-recovery">Disaster Recovery<a href="#disaster-recovery" class="hash-link" aria-label="Disaster Recovery的直接链接" title="Disaster Recovery的直接链接"></a></h2><p>The last thing you want in live streaming is service breakdown, so disaster recovery is necessary.</p><p>Before the live streaming platform had Apache Doris in place, they only backed up their data to object storage. It took an hour from when a failure was reported to when it was fixed. That one-hour window is fatal for live commerce because viewers will leave immediately. Thus, disaster recovery must be quick.</p><p>Now, with Apache Doris, they have a dual-cluster solution: a primary cluster and a backup cluster. This is for hot backup. Besides that, they have a cold backup plan, which is the same as before: backing up their everyday data to object storage via Backup and Freeze policies.</p><p>This is how they did hot backup before <a href="https://doris.apache.org/zh-CN/blog/release-note-2.0.0" target="_blank" rel="noopener noreferrer">Apache Doris 2.0</a>: </p><ul><li><strong>Data dual-write</strong>: Write data to both the primary cluster and the backup cluster. </li><li><strong>Load balancing</strong>: In case there is something wrong with one cluster, query requests can be directed to the other cluster via a reverse proxy.</li><li><strong>Monitoring</strong>: Regularly check the data consistency between the two clusters. </li></ul><p>Apache Doris 2.0 supports <a href="https://doris.apache.org/zh-CN/blog/release-note-2.0.0#support-for-cross-cluster-replication-ccr" target="_blank" rel="noopener noreferrer">Cross Cluster Replication (CCR)</a>, which can automate the above processes to reduce maintenance costs and the risk of inconsistency due to human factors.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-visualization">Data Visualization<a href="#data-visualization" class="hash-link" aria-label="Data Visualization的直接链接" title="Data Visualization的直接链接"></a></h2><p>In addition to reporting, dashboarding, and ad-hoc queries, the platform also allows analysts to configure various data sources to produce their own visualized data lists. </p><p>Apache Doris is compatible with most BI tools on the market, so the platform developers can tap into that and provide a broader set of functionalities for live streamers.</p><p>Also, built on the real-time capabilities and quick computation of Apache Doris, live streamers can view data and see what happens in real time, instead of waiting a day for data analysis.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bitmap-index-to-accelerate-tag-queries">Bitmap Index to Accelerate Tag Queries<a href="#bitmap-index-to-accelerate-tag-queries" class="hash-link" aria-label="Bitmap Index to Accelerate Tag Queries的直接链接" title="Bitmap Index to Accelerate Tag Queries的直接链接"></a></h2><p>A big part of data analysis in live streaming is viewer profiling. Viewers are divided into groups based on their online footprint. They are given tags like "watched for over one minute" and "visited during the past minute". As the show goes on, viewers are constantly tagged and untagged. In the data warehouse, that means frequent data insertion and deletion. Plus, one viewer is given multiple tags. 
Gaining an overall understanding of users entails join queries, which is why the join performance of the data warehouse is important. </p><p>The following snippets give you a general idea of how to tag users and conduct tag queries in Apache Doris.</p><p><strong>Create a Tag Table</strong></p><p>A tag table lists all the tags that are given to the viewers, and maps the tags to the corresponding viewer ID.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">create table db.tags ( </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">u_id string, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">version string, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">tags string</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) with ( </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'connector' = 'doris', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'fenodes' = '', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'table.identifier' = 'db.tags', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'username' = '', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'password' = '', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.format' = 'json', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.strip_outer_array' = 'true', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.fuzzy_parse' = 'true', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.columns' = 'u_id,version,tags,tags=bitmap_from_string(tags)', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.batch.interval' = '10s', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.batch.size' = '100000' </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Create a Tag Version Table</strong></p><p>The tag table is constantly changing, so there are different versions of it as time goes by.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" 
style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">create table db.tags_version ( </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">id string, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">u_id string, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">version string </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) with ( </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'connector' = 'doris', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'fenodes' = '', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'table.identifier' = 'db.tags_version', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'username' = '', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'password' = '', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.format' = 'json', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.strip_outer_array' = 'true', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.fuzzy_parse' = 'true', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.properties.columns' = 'id,u_id,version', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.batch.interval' = '10s', </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">'sink.batch.size' = '100000' </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Write Data into Tag Table and Tag Version Table</strong></p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">insert into db.tags</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">u_id, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">last_timestamp as version,</span><br></span><span 
class="token-line" style="color:#F8F8F2"><span class="token plain">tags</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from db.source; </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">insert into rtime_db.tags_version</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">u_id, </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">last_timestamp as version</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from db.source;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Tag Queries Accelerated by Bitmap Index</strong></p><p>For example, analysts need to find out the latest tags related to a certain viewer with the last name Thomas. Apache Doris will run the LIKE operator in the user information table to find all "Thomas". Then it creates bitmap indexes for the tags. Lastly, it relates all user information table, tag table, and tag version table to return the result.</p><p><strong>Of almost a billion viewers and each of them has over a thousand tags, the bitmap index can help reduce the query response time to less than one second.</strong></p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">with t_user as (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> u_id,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from db.user</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where partition_id = 1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and name like '%Thomas%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t_tags as (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> u_id, </span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> version</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from db.tags</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> bitmap_and_count(a_tags, bitmap_from_string("123,124,125,126,333")) &gt; 0 </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ),</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t_tag_version as (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select id, u_id, version</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from db.tags_version</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t1.u_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t1.name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from t_user t1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">join t_tags t2 on t1.u_id = t2.u_id</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">join t_tag_version t3 on t2.u_id = t3.u_id and t2.version = t3.version</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by t1.u_id desc</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">limit 1,10;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>Data analysis in live streaming is challenging for the underlying database, but it is also where the key competitiveness of Apache Doris comes to play. First of all, Apache Doris can handle most data processing workloads, so platform builders don't have to worry about putting many components together and consequential maintenance issues. Secondly, it has a lot of query-accelerating features, including but not limited to indexes. 
After tackling the speed issues, the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris developer community</a> has been exploring its boundaries, such as introducing a more efficient cost-based query optimizer in version 2.0 and an inverted index for text searches, fuzzy queries, and range queries. The live streaming service provider has embraced these features: they are actively testing them and planning to move their log analytics workloads to Apache Doris, too.</p>]]></content>
<author>
<name>He Gong</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 2.0.2]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-2.0.2</id>
<link href="https://doris.apache.org/zh-CN/blog/release-2.0.2"/>
<updated>2023-10-13T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Thanks to our community users and developers, 489 improvements and bug fixes have been made in Doris 2.0.2.]]></summary>
<content type="html"><![CDATA[<h1>Release 2.0.2</h1><p>Thanks to our community users and developers, 489 improvements and bug fixes have been made in Doris 2.0.2.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changes">Behavior Changes<a href="#behavior-changes" class="hash-link" aria-label="Behavior Changes的直接链接" title="Behavior Changes的直接链接"></a></h2><ul><li><p><a href="https://github.com/apache/doris/pull/24679" target="_blank" rel="noopener noreferrer">Remove json -&gt; operator convert to json_extract #24679</a></p><p>Remove json '-&gt;' operator since it is conflicted with lambda function syntax. It's a syntax sugar for function json_extract and can be replaced with the former.</p></li><li><p><a href="https://github.com/apache/doris/pull/24308" target="_blank" rel="noopener noreferrer">Start the script to set metadata_failure_recovery #24308</a></p><p>Move metadata_failure_recovery from fe.conf to start_fe.sh argument to prevent being used unexpectedly.</p></li><li><p><a href="https://github.com/apache/doris/pull/24207" target="_blank" rel="noopener noreferrer">Change ordinary type null value is \N,complex type null value is null #24207</a></p></li><li><p><a href="https://github.com/apache/doris/pull/23795" target="_blank" rel="noopener noreferrer">Optimize priority_ network matching logic for be #23795</a></p></li><li><p><a href="https://github.com/apache/doris/pull/17730" target="_blank" rel="noopener noreferrer">Fix cancel load failed because Job could not be cancelled… #17730</a></p><p>Allow cancel a retrying load job.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="easier-to-use">Easier to use<a href="#easier-to-use" class="hash-link" aria-label="Easier to use的直接链接" title="Easier to use的直接链接"></a></h3><ul><li><p><a href="https://github.com/apache/doris/pull/23887" target="_blank" rel="noopener noreferrer">Support custom lib dir to save custom libs #23887</a></p><p>Add a custom_lib dir to allow users place custom lib files and custom_lib will not be replaced.</p></li><li><p><a href="https://github.com/apache/doris/pull/23784" target="_blank" rel="noopener noreferrer">Optimize priority_ network matching logic #23784</a> </p><p>Optimize priority_network logic to avoid error when this config is wrong or not configured.</p></li><li><p><a href="https://github.com/apache/doris/pull/23022" target="_blank" rel="noopener noreferrer">Row policy support role #23022</a> </p><p>Support role based auth for row policy.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-optimizer-nereids-statistics-collection-improvement">New optimizer Nereids statistics collection improvement<a href="#new-optimizer-nereids-statistics-collection-improvement" class="hash-link" aria-label="New optimizer Nereids statistics collection improvement的直接链接" title="New optimizer Nereids statistics collection improvement的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/23663" target="_blank" rel="noopener noreferrer">Disable file cache while running analysis tasks. #23663</a></li><li><a href="https://github.com/apache/doris/pull/23703" target="_blank" rel="noopener noreferrer">Show column stats even when error occurred. 
#23703</a></li><li><a href="https://github.com/apache/doris/pull/23965" target="_blank" rel="noopener noreferrer">Support basic jdbc external table stats collection. #23965</a></li><li><a href="https://github.com/apache/doris/pull/24625" target="_blank" rel="noopener noreferrer">Skip unknown col stats check on __internal_scheam and information_schema #24625</a></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="better-support-for-jdbc-hdfs-hive-mysql-max-compute-multi-catalog">Better support for JDBC, HDFS, Hive, MySQL, Max Compute, Multi-Catalog<a href="#better-support-for-jdbc-hdfs-hive-mysql-max-compute-multi-catalog" class="hash-link" aria-label="Better support for JDBC, HDFS, Hive, MySQL, Max Compute, Multi-Catalog的直接链接" title="Better support for JDBC, HDFS, Hive, MySQL, Max Compute, Multi-Catalog的直接链接"></a></h3><ul><li><p><a href="https://github.com/apache/doris/pull/24168" target="_blank" rel="noopener noreferrer">Support hadoop viewfs. #24168</a></p></li><li><p><a href="https://github.com/apache/doris/pull/22369" target="_blank" rel="noopener noreferrer">Avoid calling checksum when replaying creating jdbc catalog and fix ranger issue #22369</a></p></li><li><p><a href="https://github.com/apache/doris/pull/23868" target="_blank" rel="noopener noreferrer">Optimize the JDBC Catalog connection error message #23868</a> </p><p>Improve property check and error message for JDBC catalog</p></li><li><p><a href="https://github.com/apache/doris/pull/24242" target="_blank" rel="noopener noreferrer">Fix mc decimal type parse, fix wrong obj location #24242</a> </p><p>Fix some issues for Max Compute catalog</p></li><li><p><a href="https://github.com/apache/doris/pull/23391" target="_blank" rel="noopener noreferrer">Support sql cache for hms catalog #23391</a> </p><p>SQL cache for Hive catalog</p></li><li><p><a href="https://github.com/apache/doris/pull/22869" target="_blank" rel="noopener noreferrer">Merge hms partition events. #22869</a> </p><p>Improve performance for Hive metadata sync</p></li><li><p><a href="https://github.com/apache/doris/pull/22702" target="_blank" rel="noopener noreferrer">Add metadata_name_ids for quickly get catlogs,db,table and add profiling table in order to Compatible with mysql #22702</a></p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-for-inverted-index-query">Performance for inverted index query<a href="#performance-for-inverted-index-query" class="hash-link" aria-label="Performance for inverted index query的直接链接" title="Performance for inverted index query的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/23952" target="_blank" rel="noopener noreferrer">Add bkd index query cache to improve perf #23952</a></li><li><a href="https://github.com/apache/doris/pull/24678" target="_blank" rel="noopener noreferrer">Improve performance for count on index other than match #24678</a></li><li><a href="https://github.com/apache/doris/pull/24751" target="_blank" rel="noopener noreferrer">Improve match performance without index #24751</a></li><li><a href="https://github.com/apache/doris/pull/23871" target="_blank" rel="noopener noreferrer">Optimize multiple terms conjunction query #23871</a>
Improve performance of MATCH_ALL</li><li><a href="https://github.com/apache/doris/pull/24389" target="_blank" rel="noopener noreferrer">Optimize unnecessary conversions #24389</a>
Improve performance of MATCH</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="improve-array-functions">Improve Array functions<a href="#improve-array-functions" class="hash-link" aria-label="Improve Array functions的直接链接" title="Improve Array functions的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/23630" target="_blank" rel="noopener noreferrer">Fix old optimizer with some array literal functions #23630</a></li><li><a href="https://github.com/apache/doris/pull/24327" target="_blank" rel="noopener noreferrer">Improve array union support multi params #24327</a></li><li><a href="https://github.com/apache/doris/pull/24455" target="_blank" rel="noopener noreferrer">Improve explode func with array nested complex type #24455</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="important-bug-fixes">Important Bug fixes<a href="#important-bug-fixes" class="hash-link" aria-label="Important Bug fixes的直接链接" title="Important Bug fixes的直接链接"></a></h2><ul><li><a href="https://github.com/apache/doris/pull/23601" target="_blank" rel="noopener noreferrer">The parameter positions of timestamp diff function to sql are reversed #23601</a></li><li><a href="https://github.com/apache/doris/pull/23630" target="_blank" rel="noopener noreferrer">Fix old optimizer with some array literal functions #23630</a></li><li><a href="https://github.com/apache/doris/pull/23555" target="_blank" rel="noopener noreferrer">Fix query cache returns wrong result after deleting partitions. #23555</a></li><li><a href="https://github.com/apache/doris/pull/17644" target="_blank" rel="noopener noreferrer">Fix potential data loss when clone task's dst tablet is cooldown replica #17644</a></li><li><a href="https://github.com/apache/doris/pull/23779" target="_blank" rel="noopener noreferrer">Fix array map batch append data with right next_array_item_rowid #23779</a></li><li><a href="https://github.com/apache/doris/pull/23940" target="_blank" rel="noopener noreferrer">Fix or to in rule #23940</a></li><li><a href="https://github.com/apache/doris/pull/23860" target="_blank" rel="noopener noreferrer">Fix 'char' function's toSql implementation is wrong #23860</a></li><li><a href="https://github.com/apache/doris/pull/23973" target="_blank" rel="noopener noreferrer">Record wrong best plan properties #23973</a></li><li><a href="https://github.com/apache/doris/pull/24020" target="_blank" rel="noopener noreferrer">Make TVF's distribution spec always be RANDOM #24020</a></li><li><a href="https://github.com/apache/doris/pull/24039" target="_blank" rel="noopener noreferrer">External scan use STORAGE_ANY instead of ANY as distibution #24039</a></li><li><a href="https://github.com/apache/doris/pull/23958" target="_blank" rel="noopener noreferrer">Runtimefilter target is not SlotReference #23958</a></li><li><a href="https://github.com/apache/doris/pull/24104" target="_blank" rel="noopener noreferrer">mv in select materialized_view should disable show table #24104</a></li><li><a href="https://github.com/apache/doris/pull/24097" target="_blank" rel="noopener noreferrer">Fail over to remote file reader if local cache failed #24097</a></li><li><a href="https://github.com/apache/doris/pull/23852" target="_blank" rel="noopener noreferrer">Fix revoke role operation cause fe down #23852</a></li><li><a href="https://github.com/apache/doris/pull/24139" target="_blank" rel="noopener noreferrer">Handle status code correctly and add a new error code <code>ENTRY_NOT_FOUND</code> #24139</a></li><li><a 
href="https://github.com/apache/doris/pull/24165" target="_blank" rel="noopener noreferrer">Fix leaky abstraction and shield the status code <code>END_OF_FILE</code> from upper layers #24165</a></li><li><a href="https://github.com/apache/doris/pull/24164" target="_blank" rel="noopener noreferrer">Fix bug that Read garbled files caused be crash. #24164</a></li><li><a href="https://github.com/apache/doris/pull/24369" target="_blank" rel="noopener noreferrer">Fix be core when user sepcified empty <code>column_separator</code> using hdfs tvf #24369</a></li><li><a href="https://github.com/apache/doris/pull/24372" target="_blank" rel="noopener noreferrer">Fix need to restart BE after replacing the jar package in java-udf #24372</a></li><li><a href="https://github.com/apache/doris/pull/24381" target="_blank" rel="noopener noreferrer">Need to call 'set_version' in nested functions #24381</a></li><li><a href="https://github.com/apache/doris/pull/24385" target="_blank" rel="noopener noreferrer">windown_funnel compatibility issue with multi backends #24385</a></li><li><a href="https://github.com/apache/doris/pull/24290" target="_blank" rel="noopener noreferrer">correlated anti join shouldn't be translated to null aware anti join #24290</a></li><li><a href="https://github.com/apache/doris/pull/24207" target="_blank" rel="noopener noreferrer">Change ordinary type null value is \N,complex type null value is null #24207</a></li><li><a href="https://github.com/apache/doris/pull/24521" target="_blank" rel="noopener noreferrer">Fix analyze failed when there are thousands of partitions. #24521</a></li><li><a href="https://github.com/apache/doris/pull/24460" target="_blank" rel="noopener noreferrer">Do not use enum as the data type for JavaUdfDataType. #24460</a></li><li><a href="https://github.com/apache/doris/pull/24568" target="_blank" rel="noopener noreferrer">Fix multi window projection issue temporarily #24568</a></li><li><a href="https://github.com/apache/doris/pull/24610" target="_blank" rel="noopener noreferrer">Make metadata compatible with 2.0.3 #24610</a></li><li><a href="https://github.com/apache/doris/pull/24595" target="_blank" rel="noopener noreferrer">Select outfile column order is wrong #24595</a></li><li><a href="https://github.com/apache/doris/pull/24616" target="_blank" rel="noopener noreferrer">Incorrect result of semi/anti mark join #24616</a></li><li><a href="https://github.com/apache/doris/pull/24635" target="_blank" rel="noopener noreferrer">Fix broker read issue #24635</a></li><li><a href="https://github.com/apache/doris/pull/24625" target="_blank" rel="noopener noreferrer">Skip unknown col stats check on __internal_scheam and information_schema #24625</a></li><li><a href="https://github.com/apache/doris/pull/24572" target="_blank" rel="noopener noreferrer">Fixed bug when parsing multi-character delimiters. 
#24572</a></li><li><a href="https://github.com/apache/doris/pull/24578" target="_blank" rel="noopener noreferrer">Fix timezone parse when there is no tzfile #24578</a></li><li><a href="https://github.com/apache/doris/pull/23943" target="_blank" rel="noopener noreferrer">We need to issue an error when starting FE without setting the Java home environment #23943</a></li><li><a href="https://github.com/apache/doris/pull/24697" target="_blank" rel="noopener noreferrer">Enable_unique_key_partial_update should be forwarded to master #24697</a></li><li><a href="https://github.com/apache/doris/pull/24681" target="_blank" rel="noopener noreferrer">Fix paimon file catalog meta issue and replication num analysis issue #24681</a></li><li><a href="https://github.com/apache/doris/pull/24617" target="_blank" rel="noopener noreferrer">Add more log for ingest_binlog &amp;&amp; Fix ingest_binlog not rewrite rowset_meta tablet_uid #24617</a></li><li><a href="https://github.com/apache/doris/pull/24692" target="_blank" rel="noopener noreferrer">Do not abort when a disk is broken #24692</a></li><li><a href="https://github.com/apache/doris/pull/24700" target="_blank" rel="noopener noreferrer">colocate join could not work well on full outer join #24700</a></li><li><a href="https://github.com/apache/doris/pull/24389" target="_blank" rel="noopener noreferrer">Optimize unnecessary conversions #24389</a></li><li><a href="https://github.com/apache/doris/pull/24698" target="_blank" rel="noopener noreferrer">Optimize the reading efficiency of nullable (string) columns. #24698</a></li><li><a href="https://github.com/apache/doris/pull/24778" target="_blank" rel="noopener noreferrer">Fix segment cache core when output rowset is nullptr #24778</a></li><li><a href="https://github.com/apache/doris/pull/24782" target="_blank" rel="noopener noreferrer">Fix duplicate key in schema change #24782</a></li><li><a href="https://github.com/apache/doris/pull/24800" target="_blank" rel="noopener noreferrer">Make metadata compatible for future version after 2.0.2 #24800</a></li><li><a href="https://github.com/apache/doris/pull/24808" target="_blank" rel="noopener noreferrer">Fix map/array deserialize string with quote pair #24808</a></li><li><a href="https://github.com/apache/doris/pull/24636" target="_blank" rel="noopener noreferrer">Failed on arm platform, with clang compiler and pch on, close #24633 #24636</a></li><li><a href="https://github.com/apache/doris/pull/24981" target="_blank" rel="noopener noreferrer">Table column order is changed if add a column and do truncate #24981</a></li><li><a href="https://github.com/apache/doris/pull/24949" target="_blank" rel="noopener noreferrer">Make parser mode coarse grained by default #24949</a></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/issues?q=label%3Adev%2F2.0.2-merged+is%3Aclosed" target="_blank" rel="noopener noreferrer">github</a> .</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks<a href="#big-thanks" class="hash-link" aria-label="Big Thanks的直接链接" title="Big Thanks的直接链接"></a></h2><p>Thanks all who contribute to this release:</p><p><a href="https://github.com/adonis0147" target="_blank" rel="noopener noreferrer">@adonis0147</a> <a href="https://github.com/airborne12" target="_blank" rel="noopener noreferrer">@airborne12</a> <a href="https://github.com/amorynan" target="_blank" rel="noopener noreferrer">@amorynan</a> <a href="https://github.com/AshinGau" target="_blank" rel="noopener 
noreferrer">@AshinGau</a> <a href="https://github.com/BePPPower" target="_blank" rel="noopener noreferrer">@BePPPower</a> <a href="https://github.com/BiteTheDDDDt" target="_blank" rel="noopener noreferrer">@BiteTheDDDDt</a> <a href="https://github.com/bobhan1" target="_blank" rel="noopener noreferrer">@bobhan1</a> <a href="https://github.com/ByteYue" target="_blank" rel="noopener noreferrer">@ByteYue</a> <a href="https://github.com/caiconghui" target="_blank" rel="noopener noreferrer">@caiconghui</a> <a href="https://github.com/CalvinKirs" target="_blank" rel="noopener noreferrer">@CalvinKirs</a> <a href="https://github.com/cambyzju" target="_blank" rel="noopener noreferrer">@cambyzju</a> <a href="https://github.com/ChengDaqi2023" target="_blank" rel="noopener noreferrer">@ChengDaqi2023</a> <a href="https://github.com/ChinaYiGuan" target="_blank" rel="noopener noreferrer">@ChinaYiGuan</a> <a href="https://github.com/CodeCooker17" target="_blank" rel="noopener noreferrer">@CodeCooker17</a> <a href="https://github.com/csun5285" target="_blank" rel="noopener noreferrer">@csun5285</a> <a href="https://github.com/dataroaring" target="_blank" rel="noopener noreferrer">@dataroaring</a> <a href="https://github.com/deadlinefen" target="_blank" rel="noopener noreferrer">@deadlinefen</a> <a href="https://github.com/DongLiang-0" target="_blank" rel="noopener noreferrer">@DongLiang-0</a> <a href="https://github.com/Doris-Extras" target="_blank" rel="noopener noreferrer">@Doris-Extras</a> <a href="https://github.com/dutyu" target="_blank" rel="noopener noreferrer">@dutyu</a> <a href="https://github.com/eldenmoon" target="_blank" rel="noopener noreferrer">@eldenmoon</a> <a href="https://github.com/englefly" target="_blank" rel="noopener noreferrer">@englefly</a> <a href="https://github.com/freemandealer" target="_blank" rel="noopener noreferrer">@freemandealer</a> <a href="https://github.com/Gabriel39" target="_blank" rel="noopener noreferrer">@Gabriel39</a> <a href="https://github.com/gnehil" target="_blank" rel="noopener noreferrer">@gnehil</a> <a href="https://github.com/GoGoWen" target="_blank" rel="noopener noreferrer">@GoGoWen</a> <a href="https://github.com/gohalo" target="_blank" rel="noopener noreferrer">@gohalo</a> <a href="https://github.com/HappenLee" target="_blank" rel="noopener noreferrer">@HappenLee</a> <a href="https://github.com/hello-stephen" target="_blank" rel="noopener noreferrer">@hello-stephen</a> <a href="https://github.com/HHoflittlefish777" target="_blank" rel="noopener noreferrer">@HHoflittlefish777</a> <a href="https://github.com/hubgeter" target="_blank" rel="noopener noreferrer">@hubgeter</a> <a href="https://github.com/hust-hhb" target="_blank" rel="noopener noreferrer">@hust-hhb</a> <a href="https://github.com/ixzc" target="_blank" rel="noopener noreferrer">@ixzc</a> <a href="https://github.com/JackDrogon" target="_blank" rel="noopener noreferrer">@JackDrogon</a> <a href="https://github.com/jacktengg" target="_blank" rel="noopener noreferrer">@jacktengg</a> <a href="https://github.com/jackwener" target="_blank" rel="noopener noreferrer">@jackwener</a> <a href="https://github.com/Jibing-Li" target="_blank" rel="noopener noreferrer">@Jibing-Li</a> <a href="https://github.com/JNSimba" target="_blank" rel="noopener noreferrer">@JNSimba</a> <a href="https://github.com/kaijchen" target="_blank" rel="noopener noreferrer">@kaijchen</a> <a href="https://github.com/kaka11chen" target="_blank" rel="noopener noreferrer">@kaka11chen</a> <a href="https://github.com/Kikyou1997" 
target="_blank" rel="noopener noreferrer">@Kikyou1997</a> <a href="https://github.com/Lchangliang" target="_blank" rel="noopener noreferrer">@Lchangliang</a> <a href="https://github.com/LemonLiTree" target="_blank" rel="noopener noreferrer">@LemonLiTree</a> <a href="https://github.com/liaoxin01" target="_blank" rel="noopener noreferrer">@liaoxin01</a> <a href="https://github.com/LiBinfeng-01" target="_blank" rel="noopener noreferrer">@LiBinfeng-01</a> <a href="https://github.com/liugddx" target="_blank" rel="noopener noreferrer">@liugddx</a> <a href="https://github.com/luwei16" target="_blank" rel="noopener noreferrer">@luwei16</a> <a href="https://github.com/mongo360" target="_blank" rel="noopener noreferrer">@mongo360</a> <a href="https://github.com/morningman" target="_blank" rel="noopener noreferrer">@morningman</a> <a href="https://github.com/morrySnow" target="_blank" rel="noopener noreferrer">@morrySnow</a> @mrhhsg @Mryange @mymeiyi @neuyilan @pingchunzhang @platoneko @qidaye @realize096 @RYH61 @shuke987 @sohardforaname @starocean999 @SWJTU-ZhangLei @TangSiyang2001 @Tech-Circle-48 @w41ter @wangbo @wsjz @wuwenchi @wyx123654 @xiaokang @XieJiann @xinyiZzz @XuJianxu @xutaoustc @xy720 @xyfsjq @xzj7019 @yiguolei @yujun777 @Yukang-Lian @Yulei-Yang @zclllyybb @zddr @zhangguoqiang666 @zhangstar333 @ZhangYu0123 @zhannngchen @zxealous @zy-kkk @zzzxl1993 @zzzzzzzs</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Migrating from ClickHouse to Apache Doris: what happened?]]></title>
<id>https://doris.apache.org/zh-CN/blog/migrating-from-clickhouse-to-apache-doris-what-happened</id>
<link href="https://doris.apache.org/zh-CN/blog/migrating-from-clickhouse-to-apache-doris-what-happened"/>
<updated>2023-10-11T00:00:00.000Z</updated>
<summary type="html"><![CDATA[A user of Apache Doris has written down their migration process from ClickHouse to Doris, including why they need the change, what needs to be taken care of, and how they compare the performance of the two databases in their environment. ]]></summary>
<content type="html"><![CDATA[<p>Migrating from one OLAP database to another is huge. Even if you're unhappy with your current data tool and have found some promising candidate, you might still hesitate to do the big surgery on your data architecture, because you're uncertain about how things are going to work. So you need experience shared by someone who has walked the path. </p><p>Luckily, a user of Apache Doris has written down their migration process from ClickHouse to Doris, including why they need the change, what needs to be taken care of, and how they compare the performance of the two databases in their environment. </p><p>To decide whether you want to continue reading, check if you tick one of the following boxes:</p><ul><li>You need your join queries to be executed faster.</li><li>You need flexible data updates.</li><li>You need real-time data analysis.</li><li>You need to minimize your components.</li></ul><p>If you do, this post might be of some help to you.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="replacing-kylin-clickhouse-and-druid-with-apache-doris">Replacing Kylin, ClickHouse, and Druid with Apache Doris<a href="#replacing-kylin-clickhouse-and-druid-with-apache-doris" class="hash-link" aria-label="Replacing Kylin, ClickHouse, and Druid with Apache Doris的直接链接" title="Replacing Kylin, ClickHouse, and Druid with Apache Doris的直接链接"></a></h2><p>The user undergoing this change is an e-commerce SaaS provider. Its data system serves realtime and offline reporting, customer segmentation, and log analysis. Initially, they used different OLAP engines for these various purposes:</p><ul><li><strong>Apache Kylin for offline reporting</strong>: The system provides offline reporting services for over 5 million sellers. The big ones among them have more than 10 million registered members and 100,000 SKU, and the detailed information is put into over 400 data cubes on the platform. </li><li><strong>ClickHouse for customer segmentation and Top-N log queries</strong>: This entails high-frequency updates, high QPS, and complicated SQL.</li><li><strong>Apache Druid for real-time reporting</strong>: Sellers extract data they need by combining different dimensions, and such real-time reporting requires quick data updates, quick query response, and strong stability of the system. </li></ul><p><img loading="lazy" alt="ClickHouse-Druid-Apache-Kylin" src="https://cdnd.selectdb.com/zh-CN/assets/images/youzan-1-21f1d14ff97ac4bbf038e58c72a95e85.png" width="1280" height="529" class="img_ev3q"></p><p>The three components have their own sore spots.</p><ul><li><strong>Apache Kylin</strong> runs well with a fixed table schema, but every time you want to add a dimension, you need to create a new data cube and refill the historical data in it.</li><li><strong>ClickHouse</strong> is not designed for multi-table processing, so you might need an extra solution for federated queries and multi-table join queries. And in this case, it was below expectation in high-concurrency scenarios.</li><li><strong>Apache Druid</strong> implements idempotent writing so it does not support data updating or deletion itself. That means when there is something wrong at the upstream, you will need a full data replacement. And such data fixing is a multi-step process if you think it all the way through, because of all the data backups and movements. Plus, newly ingested data will not be accessible for queries until it is put in segments in Druid. 
That means a longer window of data inconsistency between upstream and downstream.</li></ul><p>Working with these components together is demanding: it requires knowledge of all of them in terms of development, monitoring, and maintenance. Also, every time the user scales a cluster, they must stop the current cluster and migrate all databases and tables, which is not only a big undertaking but also a huge interruption to business.</p><p><img loading="lazy" alt="Replace-ClickHouse-Druid-Apache-Kylin-with-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/youzan-2-2f605efbaf41cb9b534ea86c82b209a8.png" width="1280" height="529" class="img_ev3q"></p><p>Apache Doris fills these gaps.</p><ul><li><strong>Query performance</strong>: Doris is good at high-concurrency queries and join queries, and it is now equipped with an inverted index to speed up searches in logs.</li><li><strong>Data update</strong>: The Unique Key model of Doris supports both large-volume updates and high-frequency real-time writing, and the Duplicate Key model and Unique Key model support partial column updates. It also provides an exactly-once guarantee in data writing and ensures consistency between base tables, materialized views, and replicas.</li><li><strong>Maintenance</strong>: Doris is MySQL-compatible. It supports easy scaling and light schema change. It comes with its own integration tools such as Flink-Doris-Connector and Spark-Doris-Connector. </li></ul><p>So they planned the migration.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-replacement-surgery">The Replacement Surgery<a href="#the-replacement-surgery" class="hash-link" aria-label="The Replacement Surgery的直接链接" title="The Replacement Surgery的直接链接"></a></h2><p>ClickHouse was the main performance bottleneck in the old data architecture and the reason the user wanted the change in the first place, so they started with ClickHouse.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="changes-in-sql-statements">Changes in SQL statements<a href="#changes-in-sql-statements" class="hash-link" aria-label="Changes in SQL statements的直接链接" title="Changes in SQL statements的直接链接"></a></h3><p><strong>Table creation statements</strong></p><p><img loading="lazy" alt="table-creation-statements-in-ClickHouse-and-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/youzan-3-80a2c58fe513a6bbf303d5b95a023fd9.png" width="1280" height="426" class="img_ev3q"></p><p>The user built their own SQL rewriting tool that can convert a ClickHouse table creation statement into a Doris table creation statement. The tool can automate the following changes:</p><ul><li><strong>Mapping the field types</strong>: It converts ClickHouse field types into the corresponding ones in Doris. For example, it converts String as a Key into Varchar, and String as a partitioning field into Date V2.</li><li><strong>Setting the number of historical partitions in dynamic partitioning tables</strong>: Some tables have historical partitions, and the number of partitions should be specified upon table creation in Doris, otherwise a "No Partition" error will be thrown.</li><li><strong>Determining the number of buckets</strong>: It decides the number of buckets based on the data volume of historical partitions; for non-partitioned tables, it decides the bucketing configurations based on the historical data volume.</li><li><strong>Determining TTL</strong>: It decides the time to live of partitions in dynamic partitioning tables.</li><li><strong>Setting the import sequence</strong>: For the Unique Key model of Doris, it can specify the data import order based on the Sequence column to ensure orderliness in data ingestion.</li></ul><p><img loading="lazy" alt="changes-in-table-creation-statements-from-ClickHouse-to-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/youzan-4-3ee70bd47be6c98aeef15c24027bfb07.png" width="1280" height="881" class="img_ev3q"></p><p><strong>Query statements</strong></p><p>Similarly, they have their own tool to transform ClickHouse query statements into Doris query statements. This is to prepare for the comparison test between ClickHouse and Doris. The key considerations in the conversions include:</p><ul><li><strong>Conversion of table names</strong>: This is simple given the mapping rules in table creation statements.</li><li><strong>Conversion of functions</strong>: For example, the <code>COUNTIF</code> function in ClickHouse is equivalent to <code>SUM(CASE WHEN … THEN 1 ELSE 0 END)</code> (see the sketch after this list), <code>Array Join</code> is equivalent to <code>Explode</code> and <code>Lateral View</code>, and <code>ORDER BY</code> and <code>GROUP BY</code> should be converted to window functions.</li><li><strong>Difference</strong> <strong>in semantics</strong>: ClickHouse goes by its own protocol while Doris is MySQL-compatible, so subqueries need aliases. In this use case, subqueries are common in customer segmentation, so they use <code>sqlparse</code> to rewrite them.</li></ul>
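<p>A minimal sketch of the function conversion mentioned above (the table and column names are made up for illustration):</p><pre><code class="language-sql">-- ClickHouse:
-- SELECT countIf(status = 200) FROM access_log;

-- Doris equivalent:
SELECT SUM(CASE WHEN status = 200 THEN 1 ELSE 0 END) FROM access_log;
</code></pre>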
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="changes-in-data-ingestion-methods">Changes in data ingestion methods<a href="#changes-in-data-ingestion-methods" class="hash-link" aria-label="Changes in data ingestion methods的直接链接" title="Changes in data ingestion methods的直接链接"></a></h3><p><img loading="lazy" alt="changes-in-data-ingestion-methods-from-ClickHouse-to-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/youzan-5-8223b76f140f27992ef2d3843ed7d572.png" width="1280" height="642" class="img_ev3q"></p><p>Apache Doris provides a broad range of data writing methods. For the real-time link, the user adopts Stream Load to ingest data from NSQ and Kafka. </p><p>For the sizable offline data, the user tested different methods and here are the takeaways:</p><ol><li><strong>Insert Into</strong></li></ol><p>Using Multi-Catalog to read external data sources and ingesting with Insert Into can serve most needs in this use case, as sketched below.</p>
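<p>A minimal sketch of this Multi-Catalog ingestion path, assuming a Hive catalog named <code>hive_catalog</code> has already been created with <code>CREATE CATALOG</code> (all table and column names here are hypothetical):</p><pre><code class="language-sql">-- Read from the external catalog and write into an internal Doris table.
INSERT INTO internal.ods.orders
SELECT order_id, seller_id, amount, created_at
FROM hive_catalog.ods.orders_src
WHERE created_at &gt;= '2023-01-01';
</code></pre>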
<ol start="2"><li><strong>Stream Load</strong></li></ol><p>The Spark-Doris-Connector is a more general method. It can handle large data volumes and ensure writing stability. The key is to find the right writing pace and parallelism.</p><p>The Spark-Doris-Connector also supports Bitmap. It allows you to offload the computation workload of Bitmap data to Spark clusters. </p><p>Both the Spark-Doris-Connector and the Flink-Doris-Connector rely on Stream Load. CSV is the recommended format choice. Tests on the user's billions of rows showed that CSV was 40% faster than JSON. </p><ol start="3"><li><strong>Spark Load</strong></li></ol><p>The Spark Load method utilizes Spark resources for data shuffling and ranking. The computation results are put in HDFS, and then Doris reads the files from HDFS directly (via Broker Load). This approach is ideal for huge data ingestion. The more data there is, the faster and more resource-efficient the ingestion is. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="pressure-test">Pressure Test<a href="#pressure-test" class="hash-link" aria-label="Pressure Test的直接链接" title="Pressure Test的直接链接"></a></h2><p>The user compared the performance of the two databases in their SQL and join query scenarios, and measured the CPU and memory consumption of Apache Doris.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="sql-query-performance">SQL query performance<a href="#sql-query-performance" class="hash-link" aria-label="SQL query performance的直接链接" title="SQL query performance的直接链接"></a></h3><p>Apache Doris outperformed ClickHouse in 10 of the 16 SQL queries, and the biggest performance gap was a ratio of almost 30. Overall, Apache Doris was 2~3 times faster than ClickHouse. </p><p><img loading="lazy" alt="SQL-query-performance-ClickHouse-VS-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/youzan-6-a4a80e719c4ef27b9db683b502796fce.png" width="1313" height="1057" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="join-query-performance">Join query performance<a href="#join-query-performance" class="hash-link" aria-label="Join query performance的直接链接" title="Join query performance的直接链接"></a></h3><p>For join query tests, the user used different sizes of main tables and dimension tables.</p><ul><li><strong>Primary tables</strong>: user activity table (4 billion rows), user attribute table (25 billion rows), and user attribute table (96 billion rows)</li><li><strong>Dimension tables</strong>: 1 million rows, 10 million rows, 50 million rows, 100 million rows, 500 million rows, 1 billion rows, and 2.5 billion rows.</li></ul><p>The tests include <strong>full join queries</strong> and <strong>filtering join queries</strong>. Full join queries join all rows of the primary table and dimension tables, while filtering join queries retrieve data of a certain seller ID with a <code>WHERE</code> filter. The results are concluded as follows:</p><p><strong>Primary table (4 billion rows):</strong></p><ul><li>Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. The performance gap widened as the dimension tables got larger. The largest difference was a ratio of 5.</li><li>Filtering join queries: Based on the seller ID, the filter screened out 41 million rows from the primary table. With small dimension tables, Doris was 2~3 times faster than ClickHouse; with large dimension tables, Doris was over 10 times faster; with dimension tables larger than 100 million rows, ClickHouse threw an OOM error while Doris functioned normally. </li></ul><p><strong>Primary table (25 billion rows):</strong></p><ul><li>Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. ClickHouse produced an OOM error with dimension tables larger than 50 million rows.</li><li>Filtering join queries: The filter screened out 570 million rows from the primary table. 
Doris responded within seconds, while ClickHouse took minutes and broke down when joining large dimension tables.</li></ul><p><strong>Primary table (96 billion rows):</strong></p><p>Doris delivered relatively quick performance in all queries, while ClickHouse was unable to execute them all.</p><p>In terms of CPU and memory consumption, Apache Doris maintained stable cluster loads in all sizes of join queries.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="future-directions">Future Directions<a href="#future-directions" class="hash-link" aria-label="Future Directions的直接链接" title="Future Directions的直接链接"></a></h2><p>As the migration goes on, the user works closely with the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Doris community</a>, and their feedback has contributed to the making of <a href="https://doris.apache.org/docs/dev/releasenotes/release-2.0.0/" target="_blank" rel="noopener noreferrer">Apache Doris 2.0.0</a>. We will continue assisting them in their migration from Kylin and Druid to Doris, and we look forward to seeing their Doris-based unified data platform come into being.</p>]]></content>
<author>
<name>Chuang Li</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Introduction to Apache Doris: a next-generation real-time data warehouse]]></title>
<id>https://doris.apache.org/zh-CN/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse</id>
<link href="https://doris.apache.org/zh-CN/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse"/>
<updated>2023-10-03T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This is a technical overview of Apache Doris, introducing how it enables fast query performance with its architectural design, features, and mechanisms.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-apache-doris">What is Apache Doris?<a href="#what-is-apache-doris" class="hash-link" aria-label="What is Apache Doris?的直接链接" title="What is Apache Doris?的直接链接"></a></h2><p><a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a> is an open-source real-time data warehouse. It can collect data from various data sources, including relational databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.), logs, and time series data from IoT devices. It is capable of reporting, ad-hoc analysis, federated queries, and log analysis, so it can be used to support dashboarding, self-service BI, A/B testing, user behavior analysis and the like.</p><p>Apache Doris supports both batch import and stream writing. It can be well integrated with Apache Spark, Apache Hive, Apache Flink, Airbyte, DBT, and Fivetran. It can also connect to data lakes such as Apache Hive, Apache Hudi, Apache Iceberg, Delta Lake, and Apache Paimon.</p><p><img loading="lazy" alt="What-Is-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_1-d906db31b97b75340d9e0cf7fe267dfc.png" width="1280" height="654" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="performance">Performance<a href="#performance" class="hash-link" aria-label="Performance的直接链接" title="Performance的直接链接"></a></h2><p>As a real-time OLAP engine, Apache Doris hasn a competitive edge in query speed. According to the TPC-H and SSB-Flat benchmarking results, Doris can deliver much faster performance than Presto, Greenplum, and ClickHouse.</p><p>As for its self-volution, it has increased its query speed by over 10 times in the past two years, both in complex queries and flat table analysis.</p><p><img loading="lazy" alt="Apache-Doris-VS-Presto-Greenplum-ClickHouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_2-3b95346f511caf321269de6c7ea692cd.png" width="1280" height="616" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="architectural-design">Architectural Design<a href="#architectural-design" class="hash-link" aria-label="Architectural Design的直接链接" title="Architectural Design的直接链接"></a></h2><p>Behind the fast speed of Apache Doris is the architectural design, features, and mechanisms that contribute to the performance of Doris. </p><p>First of all, Apache Doris has a cost-based optimizer (CBO) that can figure out the most efficient execution plan for complicated big queries. It has a fully vectorized execution engine so it can reduce virtual function calls and cache misses. It is MPP-based (Massively Parallel Processing) so it can give full play to the user's machines and cores. In Doris, query execution is data-driven, which means whether a query gets executed is determined by whether its relevant data is ready, and this enables more efficient use of CPUs. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="fast-point-queries-for-a-column-oriented-database">Fast Point Queries for A Column-Oriented Database<a href="#fast-point-queries-for-a-column-oriented-database" class="hash-link" aria-label="Fast Point Queries for A Column-Oriented Database的直接链接" title="Fast Point Queries for A Column-Oriented Database的直接链接"></a></h2><p>Apache Doris is a column-oriented database so it can make data compression and data sharding easier and faster. But this might not be suitable for cases such as customer-facing services. 
<p>In addition, since point queries are all simple queries, it would be unnecessary and wasteful to invoke the query planner for them, so Doris executes a short circuit plan for them to reduce overhead. </p><p>Another big source of overhead in high-concurrency point queries is SQL parsing. For that, Doris provides prepared statements. The idea is to pre-compile SQL statements and cache them, so they can be reused for similar queries.</p><p><img loading="lazy" alt="prepared-statement-and-short-circuit-plan" src="https://cdnd.selectdb.com/zh-CN/assets/images/Introduction_4-5daaa3b40e9f7b62a6f30bf10a039f94.png" width="1902" height="392" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-ingestion">Data Ingestion<a href="#data-ingestion" class="hash-link" aria-label="Data Ingestion的直接链接" title="Data Ingestion的直接链接"></a></h2><p>Apache Doris provides a range of methods for data ingestion.</p><p><strong>Real-Time stream writing</strong>:</p><ul><li><strong><a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/stream-load-manual?_highlight=stream&amp;_highlight=loa" target="_blank" rel="noopener noreferrer">Stream Load</a></strong>: You can apply this method to write local files or data streams via HTTP. It is linearly scalable and can reach a throughput of 10 million records per second in some use cases.</li><li><strong><a href="https://doris.apache.org/docs/1.2/ecosystem/flink-doris-connector/" target="_blank" rel="noopener noreferrer">Flink-Doris-Connector</a></strong>: With built-in Flink CDC, this Connector ingests data from OLTP databases to Doris. So far, we have realized auto-synchronization of data from MySQL and Oracle to Doris.</li><li><strong><a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual" target="_blank" rel="noopener noreferrer">Routine Load</a></strong>: This is to subscribe to data from Kafka message queues (see the sketch after this list). </li><li><strong><a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/insert-into-manual" target="_blank" rel="noopener noreferrer">Insert Into</a></strong>: This is especially useful when you try to do ETL in Doris internally, like writing data from one Doris table to another.</li></ul>
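<p>A minimal Routine Load sketch for the Kafka case above (the broker address, topic, and table names are placeholders):</p><pre><code class="language-sql">CREATE ROUTINE LOAD demo_db.load_events ON events
COLUMNS TERMINATED BY ","
PROPERTIES (
    "desired_concurrent_number" = "3",
    "format" = "csv"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "events_topic",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
</code></pre>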
<p><strong>Batch writing</strong>:</p><ul><li><strong><a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/spark-load-manual" target="_blank" rel="noopener noreferrer">Spark Load</a></strong>: With this method, you can leverage Spark resources to pre-process data from HDFS and object storage before writing to Doris.</li><li><strong><a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/broker-load-manual" target="_blank" rel="noopener noreferrer">Broker Load</a></strong>: This supports the HDFS and S3 protocols.</li><li><code>insert into &lt;internal table&gt; select from &lt;external table&gt;</code>: This simple statement allows you to connect Doris to various storage systems, data lakes, and databases.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-update">Data Update<a href="#data-update" class="hash-link" aria-label="Data Update的直接链接" title="Data Update的直接链接"></a></h2><p>For data updates, Apache Doris supports both Merge on Read and Merge on Write, the former for low-frequency batch updates and the latter for real-time writing. With Merge on Write, the latest data will be ready by the time you execute queries, which is why it can improve your query speed by 5 to 10 times compared to Merge on Read. </p><p>From an implementation perspective, these are a few common data update operations, and Doris supports them all: </p><ul><li><strong>Upsert</strong>: to replace or update a whole row</li><li><strong>Partial column update</strong>: to update just a few columns in a row</li><li><strong>Conditional updating</strong>: to filter out some data by combining a few conditions in order to replace or delete it</li><li><strong>Insert Overwrite</strong>: to rewrite a table or partition</li></ul><p>In some cases, data updates happen concurrently, which means a large amount of new data is coming in and trying to modify existing records, so the updating order matters a lot. That's why Doris allows you to decide the order, either by the order of transaction commit or by that of the sequence column (something that you specify in the table in advance). Doris also supports data deletion based on a specified predicate, and that's how conditional updating is done.</p>
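<p>A minimal sketch of a Merge-on-Write Unique Key table with a sequence column, assuming a hypothetical orders table (<code>"function_column.sequence_col"</code> makes the row with the latest <code>update_time</code> win when updates arrive out of order):</p><pre><code class="language-sql">CREATE TABLE orders (
    order_id BIGINT,
    status VARCHAR(16),
    update_time DATETIME
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "update_time"
);
</code></pre>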
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="service-availability--data-reliability">Service Availability &amp; Data Reliability<a href="#service-availability--data-reliability" class="hash-link" aria-label="Service Availability &amp; Data Reliability的直接链接" title="Service Availability &amp; Data Reliability的直接链接"></a></h2><p>Apart from fast performance in queries and data ingestion, Apache Doris also provides a service availability guarantee, and this is how: </p><p>Architecturally, Doris has two processes: frontend and backend. Both of them are easily scalable. The frontend nodes manage the clusters and metadata and handle user requests; the backend nodes execute the queries and are capable of auto data balancing and auto-restoration. Doris supports cluster upgrading and scaling to avoid interruption to services.</p><p><img loading="lazy" alt="architecture-design-of-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_5-69ce643bd14fb428099a545039a4e18e.png" width="1118" height="720" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="cross-cluster-replication">Cross Cluster Replication<a href="#cross-cluster-replication" class="hash-link" aria-label="Cross Cluster Replication的直接链接" title="Cross Cluster Replication的直接链接"></a></h2><p>Enterprise users, especially those in finance or e-commerce, will need to back up their clusters or their entire data center, just in case of force majeure. So Doris 2.0 provides Cross Cluster Replication (CCR). With CCR, users can do a lot:</p><ul><li><strong>Disaster recovery</strong>: for quick restoration of data services</li><li><strong>Read-write separation</strong>: master cluster + slave cluster; one for reading, one for writing</li><li><strong>Isolated upgrade of clusters</strong>: For cluster scaling, CCR allows users to pre-create a backup cluster for a trial run so they can clear out the possible incompatibility issues and bugs.</li></ul><p>Tests show that Doris CCR can reach a data latency of minutes. In the best case, it can reach the upper speed limit of the hardware environment.</p><p><img loading="lazy" alt="Cross-Cluster-Replication-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_6-0a799bae221f0af5ddfe41901743df98.png" width="1280" height="297" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="multi-tenant-management">Multi-Tenant Management<a href="#multi-tenant-management" class="hash-link" aria-label="Multi-Tenant Management的直接链接" title="Multi-Tenant Management的直接链接"></a></h2><p>Apache Doris has sophisticated Role-Based Access Control, and it allows fine-grained privilege control on the level of databases, tables, rows, and columns. </p><p><img loading="lazy" alt="multi-tenant-management-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_7-97a2ac7fbb42fff86d99e46f92e184e5.png" width="1095" height="720" class="img_ev3q"></p><p>For resource isolation, Doris used to implement a hard isolation plan, which was to divide the backend nodes into Resource Groups and assign the Resource Groups to different workloads. It was simple and neat. But sometimes users cannot make the most of their computing resources because some Resource Groups sit idle.</p><p><img loading="lazy" alt="resource-group-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_8-c25066059694cad6e5e4015a4e4e8976.png" width="1280" height="685" class="img_ev3q"></p><p>Thus, instead of Resource Groups, Doris 2.0 introduces Workload Groups. A soft limit is set for each Workload Group on how many resources it can use. When that soft limit is hit and there are idle resources available, the idle resources will be shared across the workload groups. Users can also prioritize the workload groups in terms of their access to idle resources.</p><p><img loading="lazy" alt="workload-group-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_9-aeadca64cd04fe431a0120cec1fd6b20.png" width="1166" height="362" class="img_ev3q"></p>
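<p>A minimal sketch of a Workload Group with a soft memory limit (the group name and values are illustrative):</p><pre><code class="language-sql">CREATE WORKLOAD GROUP etl_group
PROPERTIES (
    "cpu_share" = "10",
    "memory_limit" = "30%",
    "enable_memory_overcommit" = "true"  -- soft limit: may borrow idle memory
);
</code></pre>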
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="easy-to-use">Easy to Use</h2><p>For all the capabilities Apache Doris provides, it remains easy to use. It supports standard SQL and is compatible with the MySQL protocol and most BI tools on the market.</p><p>Another effort that we've made to improve usability is a feature called Light Schema Change. It means that if users need to add or delete some columns in a table, they just need to update the metadata in the frontend and don't have to modify all the data files. Light Schema Change can be done within milliseconds. It also allows changes to indexes and to the data types of columns. The combination of Light Schema Change and the Flink-Doris-Connector means synchronization of upstream tables within milliseconds.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="semi-structured-data-analysis">Semi-Structured Data Analysis</h2><p>Common examples of semi-structured data include logs, observability data, and time series data. These cases require schema-free support, lower cost, and capabilities in multi-dimensional analysis and full-text search.</p><p>In text analysis, people mostly use the LIKE operator, so we put a lot of effort into improving its performance, including pushing the LIKE operator down to the storage layer (to reduce data scanning) and introducing the NGram BloomFilter, the Hyperscan regex matching library, and the Volnitsky algorithm (for sub-string matching).</p><p><img loading="lazy" alt="LIKE-operator" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_10-f4a60eac6572f70e2d4cedb5046e2b60.png" width="1280" height="335" class="img_ev3q"></p><p>We have also introduced the inverted index for text tokenization. It is a powerful tool for fuzzy keyword search, full-text search, equivalence queries, and range queries.</p>
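<p>As a rough sketch of how these accelerators are declared on a table, assuming a hypothetical log table <code>app_log</code>; the NGram parameters are illustrative starting points, not tuned values:</p><pre><code class="language-SQL">-- Hypothetical table combining an NGram BloomFilter index (helps LIKE '%...%')
-- with an inverted index (tokenized full-text search).
CREATE TABLE app_log
(
    `ts`  DATETIME,
    `msg` TEXT,
    INDEX idx_msg_ngram (`msg`) USING NGRAM_BF PROPERTIES("gram_size" = "3", "bf_size" = "256"),
    INDEX idx_msg_inv (`msg`) USING INVERTED PROPERTIES("parser" = "english")
)
DUPLICATE KEY(`ts`)
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES ("replication_num" = "1");

-- The NGram BloomFilter prunes data for sub-string matching like this.
SELECT count(*) FROM app_log WHERE msg LIKE '%timeout%';
</code></pre>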
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-lakehouse">Data Lakehouse</h2><p>For users to build a high-performing data lakehouse and a unified query gateway, Doris can map, cache, and auto-refresh the metadata from external sources. It supports Hive Metastore and almost all open data lakehouse formats. You can connect it to relational databases, Elasticsearch, and many other sources. And it allows you to reuse your own authentication systems, like Kerberos and Apache Ranger, on the external tables.</p><p>Benchmark results show that Apache Doris is 3~5 times faster than Trino in queries on Hive tables. It is the joint result of a few features: </p><ol><li>Efficient query engine</li><li>Hot data caching mechanism</li><li>Compute nodes</li><li>Views in Doris</li></ol><p>The <a href="https://doris.apache.org/docs/dev/advanced/compute-node" target="_blank" rel="noopener noreferrer">Compute Nodes</a> are a newly introduced solution in version 2.0 for data lakehousing. Unlike normal backend nodes, Compute Nodes are stateless and do not store any data. Neither are they involved in data balancing during cluster scaling. Thus, they can join the cluster flexibly and easily during computation peak times. </p><p>Also, Doris allows you to write the computation results of external tables into Doris to form a view. The thinking is similar to materialized views: trading space for speed. After a query on external tables is executed, the results can be stored in Doris internally. When similar queries follow up, the system can directly read the results of previous queries from Doris, and that speeds things up.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="tiered-storage">Tiered Storage</h2><p>The main purpose of tiered storage is to save money. <a href="https://doris.apache.org/docs/dev/advanced/cold-hot-separation?_highlight=cold" target="_blank" rel="noopener noreferrer">Tiered storage</a> separates hot data and cold data into different storage media, with hot data being the data that is frequently accessed and cold data the data that isn't. It allows users to put hot data on fast but expensive disks (such as SSD and HDD), and cold data in object storage.</p><p><img loading="lazy" alt="tiered-storage-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/introduction_11-a841a4fd892e69d61184e21b7c34246b.png" width="1150" height="622" class="img_ev3q"></p><p>Roughly speaking, for a data asset consisting of 80% cold data, tiered storage will reduce your storage cost by 70%.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-apache-doris-community">The Apache Doris Community</h2><p>This is an overview of Apache Doris, an open-source real-time data warehouse. It is actively evolving with an agile release schedule, and the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">community</a> embraces any questions, ideas, and feedback.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Log analysis: Elasticsearch vs Apache Doris]]></title>
<id>https://doris.apache.org/zh-CN/blog/log-analysis-elasticsearch-vs-apache-doris</id>
<link href="https://doris.apache.org/zh-CN/blog/log-analysis-elasticsearch-vs-apache-doris"/>
<updated>2023-09-28T00:00:00.000Z</updated>
<summary type="html"><![CDATA[As a major part of a company's data asset, logs brings values to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine where you can extract information that points to business growth.]]></summary>
<content type="html"><![CDATA[<p>As a major part of a company's data asset, logs brings values to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine where you can extract information that points to business growth.</p><p>Logs are the sequential records of events in the computer system. If you think about how logs are generated and used, you will know what an ideal log analysis system should look like:</p><ul><li><strong>It should have schema-free support.</strong> Raw logs are unstructured free texts and basically impossible for aggregation and calculation, so you needed to turn them into structured tables (the process is called "ETL") before putting them into a database or data warehouse for analysis. If there was a schema change, lots of complicated adjustments needed to put into ETL and the structured tables. Therefore, semi-structured logs, mostly in JSON format, emerged. You can add or delete fields in these logs and the log storage system will adjust the schema accordingly. </li><li><strong>It should be low-cost.</strong> Logs are huge and they are generated continuously. A fairly big company produces 10~100 TBs of log data. For business or compliance reasons, it should keep the logs around for half a year or longer. That means to store a log size measured in PB, so the cost is considerable.</li><li><strong>It should be capable of real-time processing.</strong> Logs should be written in real time, otherwise engineers won't be able to catch the latest events in troubleshooting and security tracking. Plus, a good log system should provide full-text searching capabilities and respond to interactive queries quickly.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-elasticsearch-based-log-analysis-solution">The Elasticsearch-Based Log Analysis Solution<a href="#the-elasticsearch-based-log-analysis-solution" class="hash-link" aria-label="The Elasticsearch-Based Log Analysis Solution的直接链接" title="The Elasticsearch-Based Log Analysis Solution的直接链接"></a></h2><p>A popular log processing solution within the data industry is the <strong>ELK stack: Elasticsearch, Logstash, and Kibana</strong>. 
The pipeline can be split into five modules:</p><ul><li><strong>Log collection</strong>: Filebeat collects local log files and writes them to a Kafka message queue.</li><li><strong>Log transmission</strong>: The Kafka message queue gathers and caches logs.</li><li><strong>Log transfer</strong>: Logstash filters and transfers the log data in Kafka.</li><li><strong>Log storage</strong>: Logstash writes logs in JSON format into Elasticsearch for storage.</li><li><strong>Log query</strong>: Users search for logs via Kibana visualization or send a query request via the Elasticsearch DSL API.</li></ul><p><img loading="lazy" alt="ELK-Stack" src="https://cdnd.selectdb.com/zh-CN/assets/images/LAS_1-5855470fa53156592103937c4c267847.png" width="1280" height="230" class="img_ev3q"></p><p>The ELK stack has outstanding real-time processing capabilities, but frictions exist.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="inadequate-schema-free-support">Inadequate Schema-Free Support</h3><p>The Index Mapping in Elasticsearch defines the table schema, which includes the field names, data types, and whether to enable index creation.</p><p><img loading="lazy" alt="index-mapping-in-Elasticsearch" src="https://cdnd.selectdb.com/zh-CN/assets/images/LAS_2-426b7a38106af2b1b439180d5ea5be47.png" width="796" height="772" class="img_ev3q"></p><p>Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds fields to the Mapping according to the input JSON data. This provides some sort of schema-free support, but it's not enough because:</p><ul><li>Dynamic Mapping often creates too many fields when processing dirty data, which interrupts the whole system.</li><li>The data type of a field is immutable. To ensure compatibility, users often configure "text" as the data type, but that results in much slower query performance than binary data types such as integer.</li><li>The index of a field is immutable, too. Users cannot add or delete indexes for a certain field, so they often create indexes for all fields to facilitate data filtering in queries. But too many indexes require extra storage space and slow down data ingestion.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="inadequate-analytic-capability">Inadequate Analytic Capability</h3><p>Elasticsearch has its unique Domain Specific Language (DSL), which is very different from the tech stack that most data engineers and analysts are familiar with, so there is a steep learning curve. Moreover, Elasticsearch has a relatively closed ecosystem, so integrating it with BI tools can meet strong resistance. 
Most importantly, Elasticsearch only supports single-table analysis and lags behind the modern OLAP demands for multi-table joins, sub-queries, and views.</p><p><img loading="lazy" alt="Elasticsearch-DSL" src="https://cdnd.selectdb.com/zh-CN/assets/images/LAS_3-8dda5ac77b54b4a58682c977efcd3817.png" width="550" height="764" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="high-cost--low-stability">High Cost &amp; Low Stability</h3><p>Elasticsearch users have been complaining about the computation and storage costs. The root reason lies in the way Elasticsearch works.</p><ul><li><strong>Computation cost</strong>: In data writing, Elasticsearch executes compute-intensive operations including inverted index creation, tokenization, and inverted index ranking. Under these circumstances, data is written into Elasticsearch at a speed of around 2MB/s per core. When CPU resources are tight, data writing requests often get rejected during peak times, which further leads to higher latency. </li><li><strong>Storage cost</strong>: To speed up retrieval, Elasticsearch stores the forward indexes, inverted indexes, and docvalues of the original data, consuming a lot more storage space. The compression ratio of a single data copy is only 1.5:1, compared to the 5:1 in most log solutions.</li></ul><p>As data and cluster size grow, maintaining stability can be another issue:</p><ul><li><p><strong>During data writing peaks</strong>: Clusters are prone to overload.</p></li><li><p><strong>During queries</strong>: Since all queries are processed in memory, big queries can easily lead to JVM OOM.</p></li><li><p><strong>Slow recovery</strong>: After a cluster failure, Elasticsearch must reload indexes, which is resource-intensive, so it takes many minutes to recover. That challenges the service availability guarantee.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-more-cost-effective-option">A More Cost-Effective Option</h2><p>Reflecting on the strengths and limitations of the Elasticsearch-based solution, the Apache Doris developers have optimized Apache Doris for log processing. </p><ul><li><strong>Increase writing throughput</strong>: The performance of Elasticsearch is bottlenecked by data parsing and inverted index creation, so we improved Apache Doris on these fronts: we sped up data parsing and index creation with SIMD CPU vector instructions, and we removed data structures unnecessary for log analysis scenarios, such as forward indexes, to simplify index creation.</li><li><strong>Reduce storage costs</strong>: We removed forward indexes, which represented 30% of index data. We adopted columnar storage and the ZSTD compression algorithm, and thus achieved a compression ratio of 5:1 to 10:1. Given that a large part of the historical logs are rarely accessed, we introduced tiered storage to separate hot and cold data (see the sketch after this list). Logs that are older than a specified time period will be moved to object storage, which is much less expensive. This can reduce storage costs by around 70%. </li></ul>
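<p>A hedged sketch of how these two storage levers are switched on per table; the table name is hypothetical, and <code>log_s3_policy</code> stands in for whatever storage policy you have defined (the hands-on guide below shows one):</p><pre><code class="language-SQL">-- Hypothetical table: ZSTD compression plus a cold-data storage policy.
CREATE TABLE raw_log
(
    `ts`  DATETIME,
    `msg` TEXT
)
DUPLICATE KEY(`ts`)
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES (
    "replication_num" = "1",
    "compression" = "zstd",
    "storage_policy" = "log_s3_policy"
);
</code></pre>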
<p>Benchmark tests with ES Rally, the official testing tool for Elasticsearch, showed that Apache Doris was around 5 times as fast as Elasticsearch in data writing, 2.3 times as fast in queries, and it consumed only 1/5 of the storage space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a writing speed of 550 MB/s and a compression ratio of 10:1.</p><p><img loading="lazy" alt="Elasticsearch-VS-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/LAS_4-49afc2c61d8929daeb0f51c39686e305.png" width="1280" height="725" class="img_ev3q"></p><p>The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:</p><ul><li><strong>Ingestion</strong>: Apache Doris supports various ingestion methods for log data. You can push logs to Doris via HTTP Output using Logstash, you can use Flink to pre-process the logs before you write them into Doris, or you can load logs from Kafka or object storage into Doris via Routine Load and S3 Load. </li><li><strong>Analysis</strong>: You can put log data in Doris and conduct join queries across logs and other data in the data warehouse.</li><li><strong>Application</strong>: Apache Doris is compatible with the MySQL protocol, so you can integrate a wide variety of data analytic tools and clients with Doris, such as Grafana and Tableau. You can also connect applications to Doris via the JDBC and ODBC APIs. We are planning to build a Kibana-like system to visualize logs.</li></ul><p><img loading="lazy" alt="Apache-Doris-log-analysis-stack" src="https://cdnd.selectdb.com/zh-CN/assets/images/LAS_5-cdbff999ed0de15553dee4514a869fd4.png" width="1272" height="432" class="img_ev3q"></p><p>Moreover, Apache Doris has better schema-free support and a more user-friendly analytic engine.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="native-support-for-semi-structured-data">Native Support for Semi-Structured Data</h3><p><strong>Firstly, we worked on the data types.</strong> We optimized string search and regular expression matching for "text" through vectorization and brought a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by 4 times. We also added a new data type for complicated data: Array Map. It can structure concatenated strings to allow for a higher compression ratio and faster queries.</p><p><strong>Secondly, Apache Doris supports schema evolution.</strong> This means you can adjust the schema as your business changes. You can add or delete fields and indexes, and change the data types of fields.</p><p>Apache Doris provides Light Schema Change capabilities, so you can add or delete fields within milliseconds:</p><pre><code class="language-SQL">-- Add a column. The result will be returned in milliseconds.
ALTER TABLE lineitem ADD COLUMN l_new_column INT;
</code></pre><p>You can also add indexes only for your target fields, so you can avoid the overhead of unnecessary index creation. After you add an index, by default, the system will generate the index for all incremental data, and you can specify which historical data partitions need the index.</p><pre><code class="language-SQL">-- Add an inverted index. Doris will generate the inverted index for all new data afterward.
ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;

-- Build the index for the specified historical data partitions.
BUILD INDEX index_name ON table_name PARTITIONS(partition_name1, partition_name2);
</code></pre><h3 class="anchor anchorWithStickyNavbar_LWe7" id="sql-based-analytic-engine">SQL-Based Analytic Engine</h3><p>The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table joins, sub-queries, UDFs, logical views, and materialized views to serve their own needs, as the join sketch below illustrates.</p>
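<p>For instance, here is a hedged sketch of a multi-table join that Elasticsearch DSL cannot express in a single query. The <code>log_table</code> columns follow the example later in this post, while <code>user_dim</code> is a made-up dimension table:</p><pre><code class="language-SQL">-- Correlate server errors in the logs with a hypothetical dimension table.
SELECT u.department, COUNT(*) AS error_cnt
FROM log_table l
JOIN user_dim u ON l.clientip = u.ip
WHERE l.status &gt;= 500
GROUP BY u.department
ORDER BY error_cnt DESC
LIMIT 10;
</code></pre>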
<p>With MySQL compatibility, Apache Doris can be integrated with most GUI and BI tools in the big data ecosystem, so users can realize more complex and diversified data analysis.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-in-use-case">Performance in Use Case</h3><p>A gaming company has transitioned from the ELK stack to the Apache Doris solution. Their Doris-based log system used 1/6 of the storage space that they previously needed. </p><p>A cybersecurity company that built its log analysis system utilizing the inverted index in Apache Doris supported a data writing speed of 300,000 rows per second with 1/5 of the server resources it formerly used. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="hands-on-guide">Hands-On Guide</h2><p>Now let's go through the three steps of building a log analysis system with Apache Doris. </p><p>Before you start, <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">download</a> Apache Doris 2.0 or a newer version from the website and <a href="https://doris.apache.org/docs/dev/install/standard-deployment/" target="_blank" rel="noopener noreferrer">deploy</a> clusters.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="step-1-create-tables">Step 1: Create Tables</h3><p>This is an example of table creation.</p><p>Explanations for the configurations:</p><ul><li>The DATETIMEV2 time field is specified as the Key in order to speed up queries for the latest N log records.</li><li>Indexes are created for the frequently accessed fields, and fields that require full-text search are specified with Parser parameters.</li><li>"PARTITION BY RANGE" partitions the data by RANGE based on the time field, and <a href="https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition/" target="_blank" rel="noopener noreferrer">Dynamic Partition</a> is enabled for auto-management.</li><li>"DISTRIBUTED BY RANDOM BUCKETS AUTO" distributes the data into buckets randomly, and the system automatically decides the number of buckets based on the cluster size and data volume.</li><li>"log_policy_1day" and "log_s3" mean that logs older than 1 day are moved to S3 storage.</li></ul><pre><code class="language-SQL">CREATE DATABASE log_db;
USE log_db;

CREATE RESOURCE "log_s3"
PROPERTIES
(
    "type" = "s3",
    "s3.endpoint" = "your_endpoint_url",
    "s3.region" = "your_region",
    "s3.bucket" = "your_bucket",
    "s3.root.path" = "your_path",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);

CREATE STORAGE POLICY log_policy_1day
PROPERTIES(
    "storage_resource" = "log_s3",
    "cooldown_ttl" = "86400"
);

CREATE TABLE log_table
(
    `ts` DATETIMEV2,
    `clientip` VARCHAR(20),
    `request` TEXT,
    `status` INT,
    `size` INT,
    INDEX idx_size (`size`) USING INVERTED,
    INDEX idx_status (`status`) USING INVERTED,
    INDEX idx_clientip (`clientip`) USING INVERTED,
    INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
)
ENGINE = OLAP
DUPLICATE KEY(`ts`)
PARTITION BY RANGE(`ts`) ()
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES (
    "replication_num" = "1",
    "storage_policy" = "log_policy_1day",
    "deprecated_dynamic_schema" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "7",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "AUTO",
    "dynamic_partition.replication_num" = "1"
);
</code></pre><h3 class="anchor anchorWithStickyNavbar_LWe7" id="step-2-ingest-the-logs">Step 2: Ingest the Logs</h3><p>Apache Doris supports various ingestion methods. For real-time logs, we recommend the following three methods:</p><ul><li>Pull logs from a Kafka message queue: Routine Load</li><li>Logstash: write logs into Doris via HTTP API</li><li>Self-defined writing program: write logs into Doris via HTTP API</li></ul><p><strong>Ingest from Kafka</strong></p><p>For JSON logs that are written into Kafka message queues, create a <a href="https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual/" target="_blank" rel="noopener noreferrer">Routine Load</a> so Doris will pull data from Kafka. The following is an example. 
The <code>property.*</code> configurations are optional:</p><pre><code class="language-SQL">-- Prepare the Kafka cluster and topic ("log_topic")

-- Create a Routine Load to load data from the Kafka log_topic to "log_table"
CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
COLUMNS(ts, clientip, request, status, size)
PROPERTIES (
    "max_batch_interval" = "10",
    "max_batch_rows" = "1000000",
    "max_batch_size" = "109715200",
    "strict_mode" = "false",
    "format" = "json"
)
FROM KAFKA (
    "kafka_broker_list" = "host:port",
    "kafka_topic" = "log_topic",
    "property.group.id" = "your_group_id",
    "property.security.protocol" = "SASL_PLAINTEXT",
    "property.sasl.mechanism" = "GSSAPI",
    "property.sasl.kerberos.service.name" = "kafka",
    "property.sasl.kerberos.keytab" = "/path/to/xxx.keytab",
    "property.sasl.kerberos.principal" = "xxx@yyy.com"
);
</code></pre><p>You can check how the Routine Load runs via the <code>SHOW ROUTINE LOAD</code> command. </p><p><strong>Ingest via Logstash</strong></p><p>Configure HTTP Output for Logstash, and then data will be sent to Doris via HTTP Stream Load.</p><ol><li>Specify the batch size and batch delay in <code>logstash.yml</code> to improve data writing performance.</li></ol><pre><code class="language-Plain">pipeline.batch.size: 100000
pipeline.batch.delay: 10000
</code></pre><ol start="2"><li>Add HTTP Output to the log collection configuration file <code>testlog.conf</code>, with <code>url</code> set to the Stream Load address in Doris.</li></ol><ul><li>Since Logstash does not support HTTP redirection, you should use a backend address instead of a FE address.</li><li>Authorization in the headers is <code>http basic auth</code>. 
It is computed with <code>echo -n 'username:password' | base64</code>.</li><li>The <code>load_to_single_tablet</code> header can reduce the number of small files in data ingestion.</li></ul><pre><code class="language-Plain">output {
    http {
        follow_redirects =&gt; true
        keepalive =&gt; false
        http_method =&gt; "put"
        url =&gt; "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
        headers =&gt; [
            "format", "json",
            "strip_outer_array", "true",
            "load_to_single_tablet", "true",
            "Authorization", "Basic cm9vdDo=",
            "Expect", "100-continue"
        ]
        format =&gt; "json_batch"
    }
}
</code></pre><p><strong>Ingest via self-defined program</strong></p><p>This is an example of ingesting data into Doris via HTTP Stream Load.</p><p>Notes:</p><ul><li>Use <code>basic auth</code> for HTTP authorization, computed with <code>echo -n 'username:password' | base64</code></li><li><code>http header "format:json"</code>: the data type is specified as JSON</li><li><code>http header "read_json_by_line:true"</code>: each line is a JSON record</li><li><code>http header "load_to_single_tablet:true"</code>: write to one tablet each time</li><li>For the data writing clients, we recommend a batch size of 100MB~1GB. Future versions will enable Group Commit at the server end and reduce the batch size needed from clients.</li></ul><pre><code class="language-Bash">curl \
--location-trusted \
-u username:password \
-H "format:json" \
-H "read_json_by_line:true" \
-H "load_to_single_tablet:true" \
-T logfile.json \
http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
</code></pre><h3 class="anchor anchorWithStickyNavbar_LWe7" id="step-3-execute-queries">Step 3: Execute Queries</h3><p>Apache Doris supports standard SQL, so you can connect to Doris via a MySQL client or JDBC and then execute SQL queries.</p><pre><code class="language-Bash">mysql -h fe_host -P fe_mysql_port -u root -Dlog_db
</code></pre><p><strong>A few common queries in log analysis:</strong></p><ul><li>Check the latest 10 records.</li></ul><pre><code class="language-SQL">SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;
</code></pre><ul><li>Check the latest 10 records of Client IP "8.8.8.8".</li></ul><pre><code class="language-SQL">SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;
</code></pre><ul><li>Retrieve the latest 10 records with "error" or "404" in the "request" field. <strong>MATCH_ANY</strong> is a SQL syntax keyword for full-text search in Doris. It means to find the records that include any one of the specified keywords.</li></ul><pre><code class="language-SQL">SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;
</code></pre><ul><li>Retrieve the latest 10 records with "image" and "faq" in the "request" field. <strong>MATCH_ALL</strong> is also a SQL syntax keyword for full-text search in Doris. 
It means to find the records that include all of the specified keywords.</li></ul><pre><code class="language-SQL">SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;
</code></pre><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion</h2><p>If you are looking for an efficient log analytic solution, Apache Doris is friendly to anyone equipped with SQL knowledge; if you find friction with the ELK stack, try Apache Doris: it provides better schema-free support, enables faster data writing and queries, and brings a much lighter storage burden.</p><p>But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to add more complicated data types to the inverted index, and to support the BKD index to make Apache Doris a fit for geo data analysis. We also plan to expand our capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">user feedback and development advice</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Log analysis: how to digest 15 billion logs per day and keep big queries within 1 second]]></title>
<id>https://doris.apache.org/zh-CN/blog/Log-Analysis-How-to-Digest-15-Billion-Logs-Per-Day-and-Keep-Big-Queries-Within-1-Second</id>
<link href="https://doris.apache.org/zh-CN/blog/Log-Analysis-How-to-Digest-15-Billion-Logs-Per-Day-and-Keep-Big-Queries-Within-1-Second"/>
<updated>2023-09-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article describes a large-scale data warehousing use case to provide reference for data engineers who are looking for log analytic solutions. It introduces the log processing architecture and real case practice in data ingestion, storage, and queries.]]></summary>
<content type="html"><![CDATA[<p>This data warehousing use case is about <strong>scale</strong>. The user is <a href="https://en.wikipedia.org/wiki/China_Unicom" target="_blank" rel="noopener noreferrer">China Unicom</a>, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support their 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. For the need of real-time monitoring, threat tracing, and alerting, they require a log analytic system that can automatically collect, store, analyze, and visualize logs and event records.</p><p>From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs, and of course, be scalable to support the huge and ever-enlarging data size. The rest of this post is about what their log processing architecture looks like, and how they realize stable data ingestion, low-cost storage, and quick queries with it.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="system-architecture">System Architecture<a href="#system-architecture" class="hash-link" aria-label="System Architecture的直接链接" title="System Architecture的直接链接"></a></h2><p>This is an overview of their data pipeline. The logs are collected into the data warehouse, and go through several layers of processing.</p><p><img loading="lazy" alt="real-time-data-warehouse-2.0" src="https://cdnd.selectdb.com/zh-CN/assets/images/Unicom-1-0c734fbe7faf4875c3a647ac5136cce9.png" width="1280" height="609" class="img_ev3q"></p><ul><li><strong>ODS</strong>: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them will be stored in HDFS for data verification or replay.</li><li><strong>DWD</strong>: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and write it back to Kafka. These fact tables will also be put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are not averse to duplication, the fact tables will be arranged in the <a href="https://doris.apache.org/docs/dev/data-table/data-model#duplicate-model" target="_blank" rel="noopener noreferrer">Duplicate Key model</a> of Apache Doris. </li><li><strong>DWS</strong>: This layer aggregates data from DWD and lays the foundation for queries and analysis.</li><li><strong>ADS</strong>: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model, and auto-updates data with its Unique Key model. </li></ul><p>Architecture 2.0 evolves from Architecture 1.0, which is supported by ClickHouse and Apache Hive. The transition arised from the user's needs for real-time data processing and multi-table join queries. 
<p>Architecture 2.0 evolves from Architecture 1.0, which was supported by ClickHouse and Apache Hive. The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins.</p><p><img loading="lazy" alt="real-time-data-warehouse-1.0" src="https://cdnd.selectdb.com/zh-CN/assets/images/Unicom-2-6b242b382e769bf8acd4f0e08471045f.png" width="1280" height="607" class="img_ev3q"></p><p>Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="real-case-practice">Real-Case Practice</h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="stable-ingestion-of-15-billion-logs-per-day">Stable ingestion of 15 billion logs per day</h3><p>In the user's case, their business churns out 15 billion logs every day. Ingesting such a data volume quickly and stably is a real challenge. With Apache Doris, the recommended way is to use the Flink-Doris-Connector, developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second without interrupting the data analytic workloads (a sketch of such a sink definition follows this section).</p><p>A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations:</p><ul><li><strong>Flink Checkpoint</strong>: They increased the checkpoint interval from 15s to 60s to reduce the writing frequency and the number of transactions processed by Doris per unit of time. This relieves data writing pressure and avoids generating too many data versions.</li><li><strong>Data Pre-Aggregation</strong>: For data that shares the same ID but comes from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid the excessive resource consumption caused by multi-source data writing.</li><li><strong>Doris Compaction</strong>: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets bring huge overheads), and dialing up <code>max_tablet_version_num</code> to avoid version accumulation.</li></ul><p>These measures together ensure daily ingestion stability. The user has witnessed stable performance and a low compaction score in the Doris backends. In addition, the combination of data pre-processing in Flink and the <a href="https://doris.apache.org/docs/dev/data-table/data-model#unique-model" target="_blank" rel="noopener noreferrer">Unique Key model</a> in Doris can ensure quicker data updates.</p>
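<p>For reference, here is a hedged sketch of a Flink SQL sink backed by the Flink-Doris-Connector; the connection values are placeholders, and the option names should be checked against the connector version in use:</p><pre><code class="language-SQL">-- Hypothetical Flink SQL sink definition for writing logs into Doris.
CREATE TABLE doris_log_sink (
    ts     TIMESTAMP(3),
    src_ip STRING,
    event  STRING
) WITH (
    'connector'         = 'doris',
    'fenodes'           = 'fe_host:8030',
    'table.identifier'  = 'log_db.dwd_security_log',
    'username'          = 'user',
    'password'          = 'password',
    'sink.label-prefix' = 'doris_log'
);
</code></pre>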
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="storage-strategies-to-reduce-costs-by-50">Storage strategies to reduce costs by 50%<a href="#storage-strategies-to-reduce-costs-by-50" class="hash-link" aria-label="Storage strategies to reduce costs by 50%的直接链接" title="Storage strategies to reduce costs by 50%的直接链接"></a></h3><p>The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs.</p><ul><li><strong>ZSTD (ZStandard) compression algorithm</strong>: For tables larger than 1TB, they specify ZSTD as the compression method upon table creation, which realizes a compression ratio of 10:1 (see the DDL sketch below).</li><li><strong>Tiered storage of hot and cold data</strong>: This is supported by the <a href="https://blog.devgenius.io/hot-cold-data-separation-what-why-and-how-5f7c73e7a3cf" target="_blank" rel="noopener noreferrer">new feature</a> of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) is stored on SSD. As hot data "cools down" (gets older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder", it is moved to object storage for much lower storage costs. Plus, in object storage, data is stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage.</li><li><strong>Differentiated replica numbers for different data partitions</strong>: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for older ones. In their case, data from the past 3 months is frequently accessed, so this partition has 3 replicas. Data that is 3~6 months old has 2 replicas, and data from over 6 months ago has a single copy.</li></ul><p>With these three strategies, the user has reduced their storage costs by 50%.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="differentiated-query-strategies-based-on-data-size">Differentiated query strategies based on data size<a href="#differentiated-query-strategies-based-on-data-size" class="hash-link" aria-label="Differentiated query strategies based on data size的直接链接" title="Differentiated query strategies based on data size的直接链接"></a></h3><p>Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes:</p><ul><li><strong>Less than 100G</strong>: The user utilizes the dynamic partitioning feature of Doris. Small tables are partitioned by date and large tables by hour. This avoids data skew. To further ensure balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data of the recent 20 days is kept. In this way, they find the balance point between data backlog and analytic needs.</li><li><strong>100G~1T</strong>: These tables have their materialized views, which are pre-computed result sets stored in Doris. Thus, queries on these tables are much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle.</li><li><strong>More than 1T</strong>: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregated there. <strong>In this way, queries on 2 billion log records can be done in 1~2s.</strong></li></ul><p>These strategies have shortened the response time of queries. For example, a query of a specific data item used to take minutes, but now it can be finished in milliseconds.</p>
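<p>As a DDL sketch, the compression, dynamic-partitioning, and bucketing settings above might combine as follows. All names are hypothetical, and the 20-day retention mirrors the starting offset mentioned earlier.</p><pre><code class="language-sql">CREATE TABLE dwd_access_log (
    log_time DATETIME NOT NULL,
    trace_id BIGINT   NOT NULL,   -- snowflake ID used as the bucketing field
    message  STRING
)
DUPLICATE KEY(log_time, trace_id)
PARTITION BY RANGE(log_time) ()
DISTRIBUTED BY HASH(trace_id) BUCKETS 32
PROPERTIES (
    "compression"                 = "zstd",  -- ~10:1 ratio on large tables
    "replication_num"             = "3",
    "dynamic_partition.enable"    = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start"     = "-20",   -- keep the recent 20 days
    "dynamic_partition.end"       = "3",
    "dynamic_partition.prefix"    = "p"
);
</code></pre>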
<p>In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ongoing-plans">Ongoing Plans<a href="#ongoing-plans" class="hash-link" aria-label="Ongoing Plans的直接链接" title="Ongoing Plans的直接链接"></a></h2><p>The user is now testing the newly added <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index?_highlight=inverted" target="_blank" rel="noopener noreferrer">inverted index</a> in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries on numerics and datetime. They have also provided valuable feedback about the auto-bucketing logic in Doris: currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime and little at night. So in their case, Doris creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope for a new auto-bucketing logic in which Doris decides the number of buckets based on the data size and distribution of the previous day. They've come to the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a> and we are now working on this optimization.</p>
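<p>For reference, enabling the inverted index they are testing looks roughly like this (a sketch with hypothetical table and column names):</p><pre><code class="language-sql">-- build an inverted index for full-text search on the message column
CREATE INDEX idx_message ON dwd_security_log (message)
USING INVERTED PROPERTIES ("parser" = "english");

-- keyword searches can then use the MATCH_ANY operator
SELECT * FROM dwd_security_log WHERE message MATCH_ANY 'timeout error';
</code></pre>]]></content>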
<author>
<name>Yuqi Liu</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.7]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.7</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.7"/>
<updated>2023-09-04T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.2.7 is now available, with several enhancements and bug fixes based on 1.2.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><ul><li>Fixed some query issues.</li><li>Fix some storage issues.</li><li>Fix some decimal precision issues.</li><li>Fix query error caused by invalid <code>sql_select_limit</code> session variable's value.</li><li>Fix the problem that hdfs short-circuit read cannot be used.</li><li>Fix the problem that Tencent Cloud cosn cannot be accessed.</li><li>Fix several issues with hive catalog kerberos access.</li><li>Fix the problem that stream load profile cannot be used.</li><li>Fix promethus monitoring parameter format problem.</li><li>Fix the table creation timeout issue when creating a large number of tablets.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-features">New Features<a href="#new-features" class="hash-link" aria-label="New Features的直接链接" title="New Features的直接链接"></a></h2><ul><li>Unique Key model supports array type as value column</li><li>Added <code>have_query_cache</code> variable for compatibility with MySQL ecosystem.</li><li>Added <code>enable_strong_consistency_read</code> to support strong consistent read between sessions</li><li>FE metrics supports user-level query counter</li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 2.0.1]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.1</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.1"/>
<updated>2023-09-04T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris has fixed 383 issues or performance improvements in version 2.0.1 based on 2.0.0, enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<p>Thanks to our community users and developers, 383 improvements and bug fixes have been made in Doris 2.0.1.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changes">Behavior Changes<a href="#behavior-changes" class="hash-link" aria-label="Behavior Changes的直接链接" title="Behavior Changes的直接链接"></a></h2><ul><li><a href="https://github.com/apache/doris/pull/21302" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/21302</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="functionality-and-stability-of-array-and-map-datatypes">functionality and stability of array and map datatypes<a href="#functionality-and-stability-of-array-and-map-datatypes" class="hash-link" aria-label="functionality and stability of array and map datatypes的直接链接" title="functionality and stability of array and map datatypes的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/22793" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22793</a></li><li><a href="https://github.com/apache/doris/pull/22927" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22927</a></li><li><a href="https://github.com/apache/doris/pull/22738" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22738</a></li><li><a href="https://github.com/apache/doris/pull/22347" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22347</a></li><li><a href="https://github.com/apache/doris/pull/23250" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23250</a></li><li><a href="https://github.com/apache/doris/pull/22300" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22300</a></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-for-inverted-index-query">performance for inverted index query<a href="#performance-for-inverted-index-query" class="hash-link" aria-label="performance for inverted index query的直接链接" title="performance for inverted index query的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/22836" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22836</a></li><li><a href="https://github.com/apache/doris/pull/23381" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23381</a></li><li><a href="https://github.com/apache/doris/pull/23389" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23389</a></li><li><a href="https://github.com/apache/doris/pull/22570" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22570</a></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-for-bitmap-like-scan-agg-functions">performance for bitmap, like, scan, agg functions<a href="#performance-for-bitmap-like-scan-agg-functions" class="hash-link" aria-label="performance for bitmap, like, scan, agg functions的直接链接" title="performance for bitmap, like, scan, agg functions的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/23172" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23172</a></li><li><a href="https://github.com/apache/doris/pull/23495" target="_blank" rel="noopener 
noreferrer">https://github.com/apache/doris/pull/23495</a></li><li><a href="https://github.com/apache/doris/pull/23476" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23476</a></li><li><a href="https://github.com/apache/doris/pull/23396" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23396</a></li><li><a href="https://github.com/apache/doris/pull/23182" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23182</a></li><li><a href="https://github.com/apache/doris/pull/22216" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22216</a></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="functionality-and-stability-of-ccr">functionality and stability of CCR<a href="#functionality-and-stability-of-ccr" class="hash-link" aria-label="functionality and stability of CCR的直接链接" title="functionality and stability of CCR的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/22447" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22447</a></li><li><a href="https://github.com/apache/doris/pull/22559" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22559</a></li><li><a href="https://github.com/apache/doris/pull/22173" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22173</a></li><li><a href="https://github.com/apache/doris/pull/22678" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22678</a></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="merge-on-write-unique-table">merge on write unique table<a href="#merge-on-write-unique-table" class="hash-link" aria-label="merge on write unique table的直接链接" title="merge on write unique table的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/22282" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22282</a></li><li><a href="https://github.com/apache/doris/pull/22984" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22984</a></li><li><a href="https://github.com/apache/doris/pull/21933" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/21933</a></li><li><a href="https://github.com/apache/doris/pull/22874" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22874</a></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimizer-table-stats-and-analyze">optimizer table stats and analyze<a href="#optimizer-table-stats-and-analyze" class="hash-link" aria-label="optimizer table stats and analyze的直接链接" title="optimizer table stats and analyze的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/22658" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22658</a></li><li><a href="https://github.com/apache/doris/pull/22211" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22211</a></li><li><a href="https://github.com/apache/doris/pull/22775" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22775</a></li><li><a href="https://github.com/apache/doris/pull/22896" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22896</a></li><li><a href="https://github.com/apache/doris/pull/22788" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22788</a></li><li><a href="https://github.com/apache/doris/pull/22882" 
target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22882</a></li><li></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="functionality-and-performance-of-multi-catalog">functionality and performance of multi catalog<a href="#functionality-and-performance-of-multi-catalog" class="hash-link" aria-label="functionality and performance of multi catalog的直接链接" title="functionality and performance of multi catalog的直接链接"></a></h3><ul><li><a href="https://github.com/apache/doris/pull/22949" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22949</a></li><li><a href="https://github.com/apache/doris/pull/22923" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22923</a></li><li><a href="https://github.com/apache/doris/pull/22336" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22336</a></li><li><a href="https://github.com/apache/doris/pull/22915" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22915</a></li><li><a href="https://github.com/apache/doris/pull/23056" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23056</a></li><li><a href="https://github.com/apache/doris/pull/23297" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23297</a></li><li><a href="https://github.com/apache/doris/pull/23279" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23279</a></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="important-bug-fixes">Important Bug fixes<a href="#important-bug-fixes" class="hash-link" aria-label="Important Bug fixes的直接链接" title="Important Bug fixes的直接链接"></a></h2><ul><li><a href="https://github.com/apache/doris/pull/22673" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22673</a></li><li><a href="https://github.com/apache/doris/pull/22656" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22656</a></li><li><a href="https://github.com/apache/doris/pull/22892" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22892</a></li><li><a href="https://github.com/apache/doris/pull/22959" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22959</a></li><li><a href="https://github.com/apache/doris/pull/22902" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22902</a></li><li><a href="https://github.com/apache/doris/pull/22976" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22976</a></li><li><a href="https://github.com/apache/doris/pull/22734" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22734</a></li><li><a href="https://github.com/apache/doris/pull/22840" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22840</a></li><li><a href="https://github.com/apache/doris/pull/23008" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23008</a></li><li><a href="https://github.com/apache/doris/pull/23003" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23003</a></li><li><a href="https://github.com/apache/doris/pull/22966" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22966</a></li><li><a href="https://github.com/apache/doris/pull/22965" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22965</a></li><li><a 
href="https://github.com/apache/doris/pull/22784" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22784</a></li><li><a href="https://github.com/apache/doris/pull/23049" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23049</a></li><li><a href="https://github.com/apache/doris/pull/23084" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23084</a></li><li><a href="https://github.com/apache/doris/pull/22947" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22947</a></li><li><a href="https://github.com/apache/doris/pull/22919" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22919</a></li><li><a href="https://github.com/apache/doris/pull/22979" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22979</a></li><li><a href="https://github.com/apache/doris/pull/23096" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23096</a></li><li><a href="https://github.com/apache/doris/pull/23113" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23113</a></li><li><a href="https://github.com/apache/doris/pull/23062" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23062</a></li><li><a href="https://github.com/apache/doris/pull/22918" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/22918</a></li><li><a href="https://github.com/apache/doris/pull/23026" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23026</a></li><li><a href="https://github.com/apache/doris/pull/23175" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23175</a></li><li><a href="https://github.com/apache/doris/pull/23167" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23167</a></li><li><a href="https://github.com/apache/doris/pull/23015" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23015</a></li><li><a href="https://github.com/apache/doris/pull/23165" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23165</a></li><li><a href="https://github.com/apache/doris/pull/23264" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23264</a></li><li><a href="https://github.com/apache/doris/pull/23246" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23246</a></li><li><a href="https://github.com/apache/doris/pull/23198" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23198</a></li><li><a href="https://github.com/apache/doris/pull/23221" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23221</a></li><li><a href="https://github.com/apache/doris/pull/23277" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23277</a></li><li><a href="https://github.com/apache/doris/pull/23249" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23249</a></li><li><a href="https://github.com/apache/doris/pull/23272" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23272</a></li><li><a href="https://github.com/apache/doris/pull/23383" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23383</a></li><li><a href="https://github.com/apache/doris/pull/23372" target="_blank" rel="noopener 
noreferrer">https://github.com/apache/doris/pull/23372</a></li><li><a href="https://github.com/apache/doris/pull/23399" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23399</a></li><li><a href="https://github.com/apache/doris/pull/23295" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23295</a></li><li><a href="https://github.com/apache/doris/pull/23446" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23446</a></li><li><a href="https://github.com/apache/doris/pull/23406" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23406</a></li><li><a href="https://github.com/apache/doris/pull/23387" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23387</a></li><li><a href="https://github.com/apache/doris/pull/23421" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23421</a></li><li><a href="https://github.com/apache/doris/pull/23456" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23456</a></li><li><a href="https://github.com/apache/doris/pull/23361" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23361</a></li><li><a href="https://github.com/apache/doris/pull/23402" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23402</a></li><li><a href="https://github.com/apache/doris/pull/23369" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23369</a></li><li><a href="https://github.com/apache/doris/pull/23245" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23245</a></li><li><a href="https://github.com/apache/doris/pull/23532" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23532</a></li><li><a href="https://github.com/apache/doris/pull/23529" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23529</a></li><li><a href="https://github.com/apache/doris/pull/23601" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/23601</a></li></ul><p>See the complete list of improvements and bug fixes on <a href="https://github.com/apache/doris/issues?q=label%3Adev%2F2.0.1-merged+is%3Aclosed" target="_blank" rel="noopener noreferrer">github</a> .</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks<a href="#big-thanks" class="hash-link" aria-label="Big Thanks的直接链接" title="Big Thanks的直接链接"></a></h2><p>Thanks all who contribute to this release:</p><p>@adonis0147
@airborne12
@amorynan
@AshinGau
@BePPPower
@BiteTheDDDDt
@bobhan1
@ByteYue
@caiconghui
@CalvinKirs
@csun5285
@DarvenDuan
@deadlinefen
@DongLiang-0
@Doris-Extras
@dutyu
@englefly
@freemandealer
@Gabriel39
@GoGoWen
@HappenLee
@hello-stephen
@HHoflittlefish777
@hubgeter
@hust-hhb
@JackDrogon
@jacktengg
@jackwener
@Jibing-Li
@kaijchen
@kaka11chen
@Kikyou1997
@Lchangliang
@LemonLiTree
@liaoxin01
@LiBinfeng-01
@lsy3993
@luozenglin
@morningman
@morrySnow
@mrhhsg
@Mryange
@mymeiyi
@shuke987
@sohardforaname
@starocean999
@TangSiyang2001
@Tanya-W
@ucasfl
@vinlee19
@wangbo
@wsjz
@wuwenchi
@xiaokang
@XieJiann
@xinyiZzz
@yujun777
@Yukang-Lian
@Yulei-Yang
@zclllyybb
@zddr
@zenoyang
@zgxme
@zhangguoqiang666
@zhangstar333
@zhannngchen
@zhiqiang-hhhh
@zxealous
@zy-kkk
@zzzxl1993
@zzzzzzzs</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[LLM-powered OLAP: the Tencent application with Apache Doris]]></title>
<id>https://doris.apache.org/zh-CN/blog/Tencent-LLM</id>
<link href="https://doris.apache.org/zh-CN/blog/Tencent-LLM"/>
<updated>2023-08-29T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The exploration of a LLM+OLAP solution is a bumpy journey, but phew, it now works well for the Tencent case, and they're writing down every lesson learned to share with you.]]></summary>
<content type="html"><![CDATA[<p>Six months ago, I wrote about <a href="https://doris.apache.org/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris" target="_blank" rel="noopener noreferrer">why we replaced ClickHouse with Apache Doris as an OLAP engine</a> for our data management system. Back then, we were struggling with the auto-generation of SQL statements. As days pass, we have made progresses big enough to be references for you (I think), so here I am again. </p><p>We have adopted Large Language Models (LLM) to empower our Doris-based OLAP services.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="llm--olap">LLM + OLAP<a href="#llm--olap" class="hash-link" aria-label="LLM + OLAP的直接链接" title="LLM + OLAP的直接链接"></a></h2><p>Our incentive was to save our internal staff from the steep learning curve of SQL writing. Thus, we used LLM as an intermediate. It transforms natural language questions into SQL statements and sends the SQLs to the OLAP engine for execution.</p><p><img loading="lazy" alt="LLM-OLAP-solution" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_1-6672112c0d09d75171d8ed9a749ff196.png" width="1280" height="253" class="img_ev3q"></p><p>Like every AI-related experience, we came across some friction:</p><ol><li>LLM does not understand data jargons, like "fields", "rows", "columns" and "tables". Instead, they can perfectly translate business terms like "corporate income" and "DAU", which are basically what the fields/rows/columns are about. That means it can work well only if the analysts use the exact right word to refer to the metric they need when typing their questions.</li><li>The LLM we are using is slow in inference. It takes over 10 seconds to respond. As it charges fees by token, cost-effectiveness becomes a problem.</li><li>Although the LLM is trained on a large collection of public datasets, it is under-informed of niche knowledge. In our case, the LLM is super unfamiliar with indie songs, so even if the songs are included in our database, the LLM will not able to identify them properly. </li><li>Sometimes our input questions require adequate and latest legal, political, financial, and regulatory information, which is hard to be included in a training dataset or knowledge base. We need to connect the LLM to wider info bases in order to perform more diversified tasks.</li></ol><p>We knock these problems down one by one.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-a-semantic-layer">1. A semantic layer<a href="#1-a-semantic-layer" class="hash-link" aria-label="1. A semantic layer的直接链接" title="1. A semantic layer的直接链接"></a></h3><p>For problem No.1, we introduce a semantic layer between the LLM and the OLAP engine. This layer translates business terms into the corresponding data fields. It can identify data filtering conditions from the various natural language wordings, relate them to the metrics involved, and then generate SQL statements. </p><p>Besides that, the semantic layer can optimize the computation logic. When analysts input a question that involves a complicated query, let's say, a multi-table join, the semantic layer can split that into multiple single-table queries to reduce semantic distortion.</p><p><img loading="lazy" alt="LLM-OLAP-semantic-layer" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_2-bb2fdaed64ef15214c0542204dd45832.png" width="1280" height="289" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-llm-parsing-rules">2. 
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-llm-parsing-rules">2. LLM parsing rules<a href="#2-llm-parsing-rules" class="hash-link" aria-label="2. LLM parsing rules的直接链接" title="2. LLM parsing rules的直接链接"></a></h3><p>To increase cost-effectiveness in using the LLM, we evaluated the computation complexity of all scenarios, such as metric computation, detailed record retrieval, and user segmentation. Then, we created rules and reserved the LLM-parsing step for complicated tasks only. That means the parsing is skipped for simple computation tasks. </p><p>For example, when an analyst inputs "tell me the earnings of the major musical platforms", the LLM identifies that this question only entails several metrics or dimensions, so it does not parse it further but sends it straight for SQL generation and execution. This can largely shorten query response time and reduce API expenses. </p><p><img loading="lazy" alt="LLM-OLAP-parsing-rules" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_3-3ab023081e1acb069d34a4ce24aef010.png" width="1280" height="406" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-schema-mapper-and-external-knowledge-base">3. Schema Mapper and external knowledge base<a href="#3-schema-mapper-and-external-knowledge-base" class="hash-link" aria-label="3. Schema Mapper and external knowledge base的直接链接" title="3. Schema Mapper and external knowledge base的直接链接"></a></h3><p>To empower the LLM with niche knowledge, we added a Schema Mapper upstream from the LLM. The Schema Mapper maps the input question to an external knowledge base, and then the LLM does the parsing.</p><p>We are constantly testing and optimizing the Schema Mapper. We categorize and rate content in the external knowledge base, and do various levels of mapping (full-text mapping and fuzzy mapping) to enable better semantic parsing.</p><p><img loading="lazy" alt="LLM-OLAP-schema-mapper" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_4-261ee680cf77335b25f32e41d7a4924b.png" width="2001" height="647" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="4-plugins">4. Plugins<a href="#4-plugins" class="hash-link" aria-label="4. Plugins的直接链接" title="4. Plugins的直接链接"></a></h3><p>We use plugins to connect the LLM to wider fields of information, with different integration methods for different types of plugins:</p><ul><li><strong>Embedding local files</strong>: This is especially useful when we need to "teach" the LLM the latest regulatory policies, which are often text files. First, the system vectorizes the local text file, executes semantic searches to find matching or similar terms in it, extracts the relevant content, and puts it into the LLM parsing window to generate output. </li><li><strong>Third-party plugins</strong>: The marketplace is full of third-party plugins that are designed for all kinds of sectors. With them, the LLM is able to deal with wide-ranging topics. Each plugin has its own prompts and calling function. 
Once the input question hits a prompt, the relevant plugin is called.</li></ul><p><img loading="lazy" alt="LLM-OLAP-plugins" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_5-70a170e771dd9eadcc1488b94d892478.png" width="2001" height="645" class="img_ev3q"></p><p>After the above four optimizations, the SuperSonic framework came into being.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-supersonic-framework">The SuperSonic framework<a href="#the-supersonic-framework" class="hash-link" aria-label="The SuperSonic framework的直接链接" title="The SuperSonic framework的直接链接"></a></h2><p>Now let me walk you through this <a href="https://github.com/tencentmusic/supersonic" target="_blank" rel="noopener noreferrer">framework</a>:</p><p><img loading="lazy" alt="LLM-OLAP-supersonic-framework" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_6-cbbbb25041c807376b2b9d14609e82c8.png" width="1280" height="1117" class="img_ev3q"></p><ul><li>An analyst inputs a question.</li><li>The Schema Mapper maps the question to an external knowledge base.</li><li>If there are matching fields in the external knowledge base, the question will not be parsed by the LLM. Instead, a metric computation formula will trigger the OLAP engine to start querying. If there is no matching field, the question will enter the LLM.</li><li>Based on the pre-defined rules, the LLM rates the complexity level of the question. If it is a simple query, it will go directly to the OLAP engine; if it is a complicated query, it will be semantically parsed and converted to a DSL statement.</li><li>At the Semantic Layer, the DSL statement will be split based on its query scenario. For example, if it is a multi-table join query, this layer will generate multiple single-table query SQL statements.</li><li>If the question involves external knowledge, the LLM will call a third-party plugin.</li></ul><p><strong>Example</strong></p><p><img loading="lazy" alt="LLM-OLAP-chatbot-query-interface" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_7-c20b3cc2b0b00b32bc2825c1d62b1d5d.png" width="2001" height="1126" class="img_ev3q"></p><p>To answer whether a certain song can be performed on variety shows, the system queries the OLAP data warehouse for details about the song, and presents them alongside results from the Commercial Use Query third-party plugin.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="olap-architecture">OLAP Architecture<a href="#olap-architecture" class="hash-link" aria-label="OLAP Architecture的直接链接" title="OLAP Architecture的直接链接"></a></h2><p>As for the OLAP part of this framework, after several rounds of architectural evolution, this is what our current OLAP pipeline looks like. </p><p>Raw data is sorted into tags and metrics, which are custom-defined by the analysts. The tags and metrics are under unified management in order to avoid inconsistent definitions. Then, they are combined into various tagsets and metricsets for various queries. </p><p><img loading="lazy" alt="LLM-OLAP-architecture" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tencent_LLM_8-6d517a787c782510bf3869176730ce3a.png" width="1709" height="1119" class="img_ev3q"></p><p>We have drawn two main takeaways for you from our architectural optimization experience.</p><p><strong>1. Streamline the links</strong></p><p>Before we adopted Apache Doris, we used ClickHouse to accelerate the computation of tags and metrics, and Elasticsearch to process dimensional data. That meant two analytic engines, and it required us to adapt the query statements to both of them. It was high-maintenance.</p><p>Thus, we replaced ClickHouse with Apache Doris, and utilized the <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/es" target="_blank" rel="noopener noreferrer">Elasticsearch Catalog</a> functionality to connect Elasticsearch data to Doris. In this way, we made Doris our unified query gateway. </p>
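<p>As a sketch of that gateway setup (the endpoint is a placeholder and the index/table names are hypothetical), mounting Elasticsearch as a catalog in Doris looks roughly like this:</p><pre><code class="language-sql">-- register the Elasticsearch cluster as a catalog in Doris
CREATE CATALOG es_catalog PROPERTIES (
    "type"  = "es",
    "hosts" = "http://127.0.0.1:9200"
);

-- dimensional data in Elasticsearch can then be queried through Doris
SELECT * FROM es_catalog.default_db.dim_songs LIMIT 10;
</code></pre>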
<p><strong>2. Split the flat tables</strong></p><p>In early versions of our OLAP architecture, we used to put data into flat tables, which made things tricky. For one thing, flat tables absorbed all the writing latency from upstream, and that added up to a considerable loss in data timeliness. For another, 50% of the data in a flat table was dimensional data, which was rarely updated. With every new flat table came some bulky dimensional data that consumed lots of storage space. </p><p>Therefore, we split the flat tables into metric tables and dimension tables. As they are updated at different paces, we put them into different data models (see the DDL sketch below).</p><ul><li><strong>Metric tables</strong>: We arrange metric data in the Aggregate Key model of Apache Doris, which means new data will be merged with the old data by way of SUM, MAX, MIN, etc.</li><li><strong>Dimension tables</strong>: These tables are in the Unique Key model of Apache Doris, which means a new data record will replace the old one. This can greatly increase performance in our query scenarios.</li></ul><p>You might ask, does this cause trouble in queries, since most queries require data from both types of tables? Don't worry, we address that with the Rollup feature of Doris. On the basis of the base tables, we can select the dimensions we need to create Rollup views, which will automatically execute <code>GROUP BY</code>. This relieves us of the need to define tags for each Rollup view and largely speeds up queries.</p>
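<p>A minimal sketch of the two models (hypothetical schema, not our production tables):</p><pre><code class="language-sql">-- metric table in the Aggregate Key model: new rows merge into old ones via SUM
CREATE TABLE metric_song_play (
    song_id    BIGINT,
    dt         DATE,
    play_count BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY(song_id, dt)
DISTRIBUTED BY HASH(song_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

-- dimension table in the Unique Key model: a new record replaces the old one
CREATE TABLE dim_song (
    song_id BIGINT,
    title   VARCHAR(256),
    artist  VARCHAR(128)
)
UNIQUE KEY(song_id)
DISTRIBUTED BY HASH(song_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");
</code></pre>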
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="other-tricks">Other Tricks<a href="#other-tricks" class="hash-link" aria-label="Other Tricks的直接链接" title="Other Tricks的直接链接"></a></h2><p>In our experience with Apache Doris, we also found some other functionalities handy, so I list them here for you, too:</p><p><strong>1. Materialized View</strong></p><p>A Materialized View is a pre-computed dataset. It is a way to accelerate queries when you frequently need to access data of certain dimensions. In these scenarios, we define derived tags and metrics based on the original ones. For example, we create a derived metric by combining Metric 1, Metric 2, and Metric 3: <code>sum(m1+m2+m3)</code>. Then, we can create a Materialized View for it. According to the Doris release schedule, version 2.1 will support multi-table Materialized Views, and we look forward to that.</p><p><strong>2. Flink-Doris-Connector</strong></p><p>This is for the Exactly-Once guarantee in data ingestion. The Flink-Doris-Connector implements a checkpoint mechanism and two-phase commit, and allows for auto data synchronization from relational databases to Doris.</p><p><strong>3. Compaction</strong></p><p>When the number of aggregation tasks or the data volume becomes overwhelming for Flink, there might be huge latency in data compaction. We solve that with Vertical Compaction and Segment Compaction. Vertical Compaction supports loading only part of the columns, so it can reduce storage consumption when compacting flat tables. Segment Compaction avoids generating too many segments during data writing, and allows compaction to run simultaneously with writing. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="whats-next">What's Next<a href="#whats-next" class="hash-link" aria-label="What's Next的直接链接" title="What's Next的直接链接"></a></h2><p>With an aim to reduce costs and increase service availability, we plan to test the newly released Storage-Compute Separation and Cross-Cluster Replication of Doris, and we welcome any ideas and input about the SuperSonic framework and the Apache Doris project.</p>]]></content>
<author>
<name>Jun Zhang &amp; Lei Luo</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Choosing an OLAP engine for financial risk management: what to consider?]]></title>
<id>https://doris.apache.org/zh-CN/blog/Choosing-an-OLAP-Engine-for-Financial-Risk-Management-What-to-Consider</id>
<link href="https://doris.apache.org/zh-CN/blog/Choosing-an-OLAP-Engine-for-Financial-Risk-Management-What-to-Consider"/>
<updated>2023-08-17T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This post provides reference for what you should take into account when choosing an OLAP engine in a financial scenario.]]></summary>
<content type="html"><![CDATA[<p>From a data engineer's point of view, financial risk management is a series of data analysis activities on financial data. The financial sector imposes its unique requirements on data engineering. This post explains them with a use case of Apache Doris, and provides reference for what you should take into account when choosing an OLAP engine in a financial scenario. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-must-be-combined">Data Must Be Combined<a href="#data-must-be-combined" class="hash-link" aria-label="Data Must Be Combined的直接链接" title="Data Must Be Combined的直接链接"></a></h2><p>The financial data landscape is evolving from standalone to distributed, heterogeneous systems. For example, in this use case scenario, the fintech service provider needs to connect the various transaction processing (TP) systems (MySQL, Oracle, and PostgreSQL) of its partnering banks. Before they adopted an OLAP engine, they were using Kettle to collect data. The ETL tool did not support join queries across different data sources and it could not store data. The ever-enlarging data size at the source end was pushing the system towards latency and instability. That's when they decided to introduce an OLAP engine.</p><p>The financial user's main pursuit is quick queries on large data volume with as least engineering and maintenance efforts as possible, so when it comes to the choice of OLAP engines, SQL on Hadoop should be crossed off the list due to its huge ecosystem and complicated components. One reason that they landed on Apache Doris was the metadata management capability. Apache Doris collects metadata of various data sources via API so it is a fit for the case which requires combination of different TP systems. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="high-concurrency--high-throughput">High Concurrency &amp; High Throughput<a href="#high-concurrency--high-throughput" class="hash-link" aria-label="High Concurrency &amp; High Throughput的直接链接" title="High Concurrency &amp; High Throughput的直接链接"></a></h2><p>Financial risk control is based on analysis of large amounts of transaction data. Sometimes analysts identify abnormalities by combining data from different large tables, and often times they need to check a certain data record, which comes in the form of concurrent point queries in the data system. Thus, the OLAP engine should be able to handle both high-throughput queries and high-concurrency queries. </p><p>To speed up the highly concurrent point queries, you can create <a href="https://doris.apache.org/docs/dev/query-acceleration/materialized-view/" target="_blank" rel="noopener noreferrer">Materialized Views</a> in Apache Doris. A Materialized View is a pre-computed data set stored in Apache Doris so that the system can respond much faster to queries that are frequently conducted. </p><p>To facilitate queries on large tables, you can leverage the <a href="https://doris.apache.org/docs/dev/query-acceleration/join-optimization/colocation-join/" target="_blank" rel="noopener noreferrer">Colocation Join</a> mechanism. Colocation Join minimizes data transfer between computation nodes to reduce overheads brought by data movement. 
<p><img loading="lazy" alt="colocation-join" src="https://cdnd.selectdb.com/zh-CN/assets/images/Xingyun_1-d07e739500944ff34d4ad3c75968850b.png" width="1280" height="687" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="log-analysis">Log Analysis<a href="#log-analysis" class="hash-link" aria-label="Log Analysis的直接链接" title="Log Analysis的直接链接"></a></h2><p>Log analysis is important in financial data processing. Real-time processing and monitoring of logs can expose risks in time. Apache Doris provides data storage and analytics capabilities to make log analysis easier and more efficient. As logs are bulky, Apache Doris can deliver a high data compression rate to lower storage costs. </p><p>Retrieval is a major part of log analysis, so <a href="https://doris.apache.org/docs/dev/releasenotes/release-2.0.0" target="_blank" rel="noopener noreferrer">Apache Doris 2.0</a> supports inverted index, which is a way to accelerate text searching and equivalence/range queries on numerics and datetime. It allows users to quickly locate the log records they need among massive data. The JSON storage feature in Apache Doris is reported to reduce storage costs of user activity logs by 70%, and the variety of parse functions provided can save data engineers from developing their own SQL functions. </p><p><img loading="lazy" alt="log-analysis" src="https://cdnd.selectdb.com/zh-CN/assets/images/Xingyun_2-84440a0d5bfc678448d3a3e3063bd7f9.png" width="1280" height="473" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="easy-maintenance">Easy Maintenance<a href="#easy-maintenance" class="hash-link" aria-label="Easy Maintenance的直接链接" title="Easy Maintenance的直接链接"></a></h2><p>In addition to easy deployment, Apache Doris has a few mechanisms that are designed to save maintenance effort. For example, it ensures high availability of cluster nodes with Systemd, and high availability of data with multi-replica storage and auto-balancing of replicas, so all the maintenance required is to back up metadata on a regular basis. Apache Doris also supports <a href="https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition/" target="_blank" rel="noopener noreferrer">dynamic partitioning of data</a>, which means it will automatically create or delete data partitions according to the rules specified by the user. This saves effort in partition management and eliminates possible errors caused by manual management.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="architecture-overview">Architecture Overview<a href="#architecture-overview" class="hash-link" aria-label="Architecture Overview的直接链接" title="Architecture Overview的直接链接"></a></h2><p>This is the overall data architecture in this case. The user utilizes Apache Flume for log data collection, and DataX for data updates. Data from multiple sources is collected into Apache Doris to form a data mart, from which analysts extract information to generate reports and dashboards for reference in risk control and business decisions. As for the stability of the data mart itself, Grafana and Prometheus are used to monitor the memory usage, compaction score, and query response time of Apache Doris to make sure it is running well.</p><p><img loading="lazy" alt="data-architecture" src="https://cdnd.selectdb.com/zh-CN/assets/images/Xingyun_3-ef9c50ef508df963514a76a7365b0490.png" width="1280" height="792" class="img_ev3q"></p>]]></content>
<author>
<name>Jianbo Liu</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Auto synchronization of an entire MySQL database for data analysis]]></title>
<id>https://doris.apache.org/zh-CN/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis</id>
<link href="https://doris.apache.org/zh-CN/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis"/>
<updated>2023-08-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Flink-Doris-Connector 1.4.0 allows users to ingest a whole database (MySQL or Oracle) that contains thousands of tables into Apache Doris, in one step.]]></summary>
<content type="html"><![CDATA[<p>Flink-Doris-Connector 1.4.0 allows users to ingest a whole database (<strong>MySQL</strong> or <strong>Oracle</strong>) that contains thousands of tables into <a href="https://doris.apache.org/zh-CN/" target="_blank" rel="noopener noreferrer">Apache Doris</a>, a real-time analytic database, <strong>in one step</strong>.</p><p>With built-in Flink CDC, the Connector can directly synchronize the table schema and data from the upstream source to Apache Doris, which means users no longer have to write a DataStream program or pre-create mapping tables in Doris. </p><p>When a Flink job starts, the Connector automatically checks for data equivalence between the source database and Apache Doris. If the data source contains tables which do not exist in Doris, the Connector will automatically create those same tables in Doris, and utilizes the side outputs of Flink to facilitate the ingestion of multiple tables at once; if there is a schema change in the source, it will automatically obtain the DDL statement and make the same schema change in Doris. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="quick-start">Quick Start<a href="#quick-start" class="hash-link" aria-label="Quick Start的直接链接" title="Quick Start的直接链接"></a></h2><p>Download Flink Doris Connector: <a href="https://doris.apache.org/download/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download/</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-use-it">How to Use It<a href="#how-to-use-it" class="hash-link" aria-label="How to Use It的直接链接" title="How to Use It的直接链接"></a></h2><p>For example, to ingest a whole MySQL database <code>mysql_db</code> into Doris (the MySQL table names start with <code>tbl</code> or <code>test</code>), simply execute the following command (no need to create the tables in Doris in advance):</p><div class="language-Shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">&lt;FLINK_HOME&gt;/bin/flink run \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> -Dexecution.checkpointing.interval=10s \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> -Dparallelism.default=1 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> -c org.apache.doris.flink.tools.cdc.CdcTools \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lib/flink-doris-connector-1.16-1.4.0.jar \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> mysql-sync-database \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --database test_db \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --mysql-conf hostname=127.0.0.1 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --mysql-conf username=root \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --mysql-conf password=123456 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --mysql-conf database-name=mysql_db \</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> --including-tables "tbl|test.*" \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --sink-conf fenodes=127.0.0.1:8030 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --sink-conf username=root \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --sink-conf password=123456 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --sink-conf jdbc-url=jdbc:mysql://127.0.0.1:9030 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --sink-conf sink.label-prefix=label1 \</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> --table-conf replication_num=1 </span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>To ingest an Oracle database: please refer to the <a href="https://github.com/apache/doris-flink-connector/pull/156" target="_blank" rel="noopener noreferrer">example code</a>.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-it-performs">How It Performs<a href="#how-it-performs" class="hash-link" aria-label="How It Performs的直接链接" title="How It Performs的直接链接"></a></h2><p>When it comes to synchronizing a whole database (containing hundreds or even thousands of tables, active or inactive), most users want it to be done within seconds. So we tested the Connector to see if it came up to scratch:</p><ul><li>1000 MySQL tables, each having 100 fields. All tables were active (which meant they were continuously updated and each data writing involved over a hundred rows)</li><li>Flink job checkpoint: 10s</li></ul><p>Under pressure test, the system showed high stability, with key metrics as follows:</p><p><img loading="lazy" alt="Flink-Doris-Connector" src="https://cdnd.selectdb.com/zh-CN/assets/images/FDC_1-ce2b3c35d3126c743a9b9df1105dd1ce.png" width="1280" height="243" class="img_ev3q"></p><p><img loading="lazy" alt="Flink-CDC" src="https://cdnd.selectdb.com/zh-CN/assets/images/FDC_2-18b4e1b3346d90e6430b992d74e9a64f.png" width="1280" height="487" class="img_ev3q"></p><p><img loading="lazy" alt="Doris-Cluster-Compaction-Score" src="https://cdnd.selectdb.com/zh-CN/assets/images/FDC_3-5e973914e448c11df5e3e408823f2ded.png" width="1280" height="306" class="img_ev3q"></p><p>According to feedback from early adopters, the Connector has also delivered high performance and system stability in 10,000-table database synchronization in their production environment. 
This proves that the combination of Apache Doris and Flink CDC is capable of large-scale data synchronization with high efficiency and reliability.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-it-benefits-data-engineers">How It Benefits Data Engineers<a href="#how-it-benefits-data-engineers" class="hash-link" aria-label="How It Benefits Data Engineers的直接链接" title="How It Benefits Data Engineers的直接链接"></a></h2><p>Engineers no longer have to worry about table creation or table schema maintenance, saving them days of tedious and error-prone work. Previously in Flink CDC, you needed to create a Flink job for each table and build a log parsing link at the source end, but now with whole-database ingestion, resource consumption in the source database is largely reduced. It is also a unified solution for incremental update and full update.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="other-features">Other Features<a href="#other-features" class="hash-link" aria-label="Other Features的直接链接" title="Other Features的直接链接"></a></h2><p><strong>1. Joining dimension table and fact table</strong></p><p>The common practice is to put dimension tables in Doris and run join queries via the real-time stream of Flink. Based on the <a href="https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/asyncio/" target="_blank" rel="noopener noreferrer">Async I/O of Flink</a>, Flink-Doris-Connector 1.4.0 implements asynchronous Lookup Join, so the Flink real-time stream won't be blocked by queries. Also, the Connector allows you to combine multiple queries into one big query and send it to Doris at once for processing. This improves the efficiency and throughput of such join queries.</p><p><strong>2. Thrift</strong> <strong>SDK</strong></p><p>We introduced Thrift-Service SDK into the Connector so users no longer have to use Thrift plug-ins or configure a Thrift environment in compilation. This makes the compilation process much simpler.</p><p><strong>3. On-Demand Stream Load</strong></p><p>During data synchronization, when there is no new data ingestion, no Stream Load requests will be issued. This avoids unnecessary consumption of cluster resources.</p><p><strong>4. Polling of Backend Nodes</strong></p><p>For data ingestion, Doris calls a frontend node to obtain a list of the backend nodes, and randomly chooses one to launch an ingestion request. That backend node will be the Coordinator. Flink-Doris-Connector 1.4.0 allows users to enable the polling mechanism, which has a different backend node serve as the Coordinator at each Flink checkpoint to avoid prolonged pressure on a single backend node.</p><p><strong>5. Support for More Data Types</strong></p><p>In addition to the common data types, Flink-Doris-Connector 1.4.0 supports DecimalV3/DateV2/DateTimev2/Array/JSON in Doris.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="example-usage">Example Usage<a href="#example-usage" class="hash-link" aria-label="Example Usage的直接链接" title="Example Usage的直接链接"></a></h2><p><strong>Read from Apache Doris:</strong> </p><p>You can read data from Doris via DataStream or FlinkSQL (bounded stream). 
Predicate pushdown is supported.</p><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE flink_doris_source (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> name STRING,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> age INT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> score DECIMAL(5,2)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> WITH (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'connector' = 'doris',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'fenodes' = '127.0.0.1:8030',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'table.identifier' = 'database.table',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'username' = 'root',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'password' = 'password',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'doris.filter.query' = 'age=18'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * FROM flink_doris_source;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Join dimension table and fact table</strong>:</p><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE fact_table (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `id` BIGINT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `name` STRING,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `city` STRING,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `process_time` as proctime()</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) WITH 
(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'connector' = 'kafka',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ...</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">create table dim_city(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `city` STRING,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `level` INT ,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `province` STRING,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `country` STRING</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) WITH (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'connector' = 'doris',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'fenodes' = '127.0.0.1:8030',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'jdbc-url' = 'jdbc:mysql://127.0.0.1:9030',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'lookup.jdbc.async' = 'true',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'table.identifier' = 'dim.dim_city',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'username' = 'root',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'password' = ''</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT a.id, a.name, a.city, c.province, c.country,c.level </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM fact_table a</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">LEFT JOIN dim_city FOR SYSTEM_TIME AS OF a.process_time AS c</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ON a.city = c.city</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Write to Apache Doris</strong>: </p><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code 
class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE doris_sink (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> name STRING,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> age INT,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> score DECIMAL(5,2)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> WITH (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'connector' = 'doris',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'fenodes' = '127.0.0.1:8030',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'table.identifier' = 'database.table',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'username' = 'root',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'password' = '',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'sink.label-prefix' = 'doris_label',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> //json write in</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'sink.properties.format' = 'json',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'sink.properties.read_json_by_line' = 'true'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>If you've got any questions, find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[New milestone: Apache Doris 2.0.0 just released]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-2.0.0</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-2.0.0"/>
<updated>2023-08-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, we are excited to announce that Apache Doris 2.0.0 is now production-ready on August 16, 2023]]></summary>
<content type="html"><![CDATA[<p>We are more than excited to announce that, after six months of coding, testing, and fine-tuning, Apache Doris 2.0.0 is now production-ready. Special thanks to the 275 committers who altogether contributed over 4100 optimizations and fixes to the project. </p><p>This new version highlights:</p><ul><li>10 times faster data queries</li><li>Enhanced log analytic and federated query capabilities</li><li>More efficient data writing and updates</li><li>Improved multi-tenant and resource isolation mechanisms</li><li>Progresses in elastic scaling of resources and storage-compute separation</li><li>Enterprise-facing features for higher usability</li></ul><blockquote><p>Download: <a href="https://doris.apache.org/download" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download</a></p><p>GitHub source code: <a href="https://github.com/apache/doris/releases/tag/2.0.0-rc04" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/releases/tag/2.0.0-rc04</a></p></blockquote><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-10-times-performance-increase"><strong>A 10 Times Performance Increase</strong><a href="#a-10-times-performance-increase" class="hash-link" aria-label="a-10-times-performance-increase的直接链接" title="a-10-times-performance-increase的直接链接"></a></h2><p>In SSB-Flat and TPC-H benchmarking, Apache Doris 2.0.0 delivered <strong>over 10-time faster query performance</strong> compared to an early version of Apache Doris.</p><p><img loading="lazy" alt="TPCH-benchmark-SSB-Flat-benchmark" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-1-876d7d5b2307bad4190500815eca710a.png" width="1724" height="744" class="img_ev3q"></p><p>This is realized by the introduction of a smarter query optimizer, inverted index, a parallel execution model, and a series of new functionalities to support high-concurrency point queries.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="a-smarter-query-optimizer">A smarter query optimizer<a href="#a-smarter-query-optimizer" class="hash-link" aria-label="A smarter query optimizer的直接链接" title="A smarter query optimizer的直接链接"></a></h3><p>The brand new query optimizer, Nereids, has a richer statistical base and adopts the Cascades framework. It is capable of self-tuning in most query scenarios and supports all 99 SQLs in TPC-DS, so users can expect high performance without any fine-tuning or SQL rewriting.</p><p>TPC-H tests showed that Nereids, with no human intervention, outperformed the old query optimizer by a wide margin. Over 100 users have tried Apache Doris 2.0.0 in their production environment and the vast majority of them reported huge speedups in query execution.</p><p><img loading="lazy" alt="Nereids-optimizer-TPCH" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-2-01156d489e9907a7bc6c424d0bcd41c9.png" width="1280" height="631" class="img_ev3q"></p><p><strong>Doc</strong>: <a href="https://doris.apache.org/docs/dev/query-acceleration/nereids/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/query-acceleration/nereids/</a></p><p>Nereids is enabled by default in Apache Doris 2.0.0: <code>SET enable_nereids_planner=true</code>. 
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="inverted-index">Inverted Index<a href="#inverted-index" class="hash-link" aria-label="Inverted Index的直接链接" title="Inverted Index的直接链接"></a></h3><p>In Apache Doris 2.0.0, we introduced <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index?_highlight=inverted" target="_blank" rel="noopener noreferrer">inverted index</a> to better support fuzzy keyword search, equivalence queries, and range queries.</p><p>A smartphone manufacturer tested Apache Doris 2.0.0 in their user behavior analysis scenarios. With inverted index enabled, v2.0.0 was able to finish the queries within milliseconds and maintain stable performance as the query concurrency level went up. In this case, it was 5 to 90 times faster than the older version. </p><p><img loading="lazy" alt="inverted-index-performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-3-9abcfa6e282014f53bb04009b1d2623d.png" width="1718" height="736" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="20-times-higher-concurrency-capability">20 times higher concurrency capability<a href="#20-times-higher-concurrency-capability" class="hash-link" aria-label="20 times higher concurrency capability的直接链接" title="20 times higher concurrency capability的直接链接"></a></h3><p>In scenarios like e-commerce order queries and express tracking, a huge number of end users search for certain data records simultaneously. These are what we call high-concurrency point queries, which can bring huge pressure on the system. A traditional solution is to introduce Key-Value stores like Apache HBase for such queries, and Redis as a cache layer to ease the burden, but that means redundant storage and higher maintenance costs.</p><p>For a column-oriented DBMS like Apache Doris, the I/O usage of point queries is multiplied, so we needed a leaner execution path. Thus, on the basis of columnar storage, we added a row storage format and row cache to increase row reading efficiency, short-circuit plans to speed up data retrieval, and prepared statements to reduce frontend overheads.</p><p>After these optimizations, Apache Doris 2.0 reached a concurrency level of <strong>30,000 QPS per node</strong> on YCSB on a 16 Core 64G cloud server with 4×1T hard drives, representing an improvement of <strong>20 times</strong> compared to its older version. This makes Apache Doris a good alternative to HBase in high-concurrency scenarios, so that users don't need to endure extra maintenance costs and redundant storage brought by complicated tech stacks.</p>
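<p>For a concrete picture, below is a minimal sketch of a table tuned for such point queries; the table name, columns, and bucket count are illustrative, and the two key properties follow the 2.0 documentation:</p><pre><code class="language-SQL">-- A Unique Key (Merge-on-Write) table that also keeps a row-format copy
-- of each record, so point lookups avoid stitching columns back together
CREATE TABLE orders (
    order_id BIGINT,
    status VARCHAR(32),
    amount DECIMAL(10, 2)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 16
PROPERTIES (
    "replication_num" = "1",
    "enable_unique_key_merge_on_write" = "true",
    "store_row_column" = "true"
);

-- A primary-key point query of this shape can take the short-circuit path
SELECT * FROM orders WHERE order_id = 10001;
</code></pre>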
<p>Read more: <a href="https://doris.apache.org/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times" target="_blank" rel="noopener noreferrer">https://doris.apache.org/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="a-self-adaptive-parallel-execution-model">A self-adaptive parallel execution model<a href="#a-self-adaptive-parallel-execution-model" class="hash-link" aria-label="A self-adaptive parallel execution model的直接链接" title="A self-adaptive parallel execution model的直接链接"></a></h3><p>Apache Doris 2.0 brought in a Pipeline execution model for higher efficiency and stability in hybrid analytic workloads. In this model, the execution of queries is driven by data. The blocking operators in all query execution processes are split into pipelines. Whether a pipeline gets an execution thread depends on whether its relevant data is ready. This enables asynchronous blocking operations and more flexible system resource management. Also, this improves CPU efficiency as the system doesn't have to create and destroy threads as frequently.</p><p>Doc: <a href="https://doris.apache.org/docs/dev/query-acceleration/pipeline-execution-engine/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/query-acceleration/pipeline-execution-engine/</a></p><p><strong>How to enable the Pipeline execution model</strong></p><ul><li>The Pipeline execution engine is enabled by default in Apache Doris 2.0: <code>Set enable_pipeline_engine = true</code>.</li><li><code>parallel_pipeline_task_num</code> represents the number of pipeline tasks that are executed in parallel in SQL queries. Its default value is <code>0</code>, which means Apache Doris will automatically set the concurrency level to half the number of CPUs in each backend node. Users can change this value as needed. </li><li>For those who are upgrading to Apache Doris 2.0 from an older version, it is recommended to set the value of <code>parallel_pipeline_task_num</code> to that of <code>parallel_fragment_exec_instance_num</code> in the old version.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-unified-platform-for-multiple-analytic-workloads">A Unified Platform for Multiple Analytic Workloads<a href="#a-unified-platform-for-multiple-analytic-workloads" class="hash-link" aria-label="A Unified Platform for Multiple Analytic Workloads的直接链接" title="A Unified Platform for Multiple Analytic Workloads的直接链接"></a></h2><p>Apache Doris has been pushing its boundaries. Starting as an OLAP engine for reporting, it is now a data warehouse capable of ETL/ELT and more. Version 2.0 is making advancements in its log analysis and data lakehousing capabilities. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="a-10-times-more-cost-effective-log-analysis-solution">A 10 times more cost-effective log analysis solution<a href="#a-10-times-more-cost-effective-log-analysis-solution" class="hash-link" aria-label="A 10 times more cost-effective log analysis solution的直接链接" title="A 10 times more cost-effective log analysis solution的直接链接"></a></h3><p>Apache Doris 2.0.0 provides native support for semi-structured data. In addition to JSON and Array, it now supports a complex data type: Map. Based on Light Schema Change, it also supports Schema Evolution, which means you can adjust the schema as your business changes. You can add or delete fields and indexes, and change the data types of fields. With the newly introduced inverted index and a high-performance text analysis algorithm, it can execute full-text searches and dimensional analysis of logs more efficiently. With faster data writing, faster queries, and lower storage costs, it is 10 times more cost-effective than common log analysis solutions in the industry.</p><p><img loading="lazy" alt="Apache-Doris-VS-Elasticsearch" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-4-1e3ed90c197a9f3a5e853cb854f8f56e.png" width="1280" height="725" class="img_ev3q"></p>
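<p>To make this concrete, here is a minimal sketch of a log table carrying an inverted index for full-text search; the names are illustrative, and the index syntax follows the 2.0 documentation:</p><pre><code class="language-SQL">-- A log table whose message column carries an inverted index
-- with an English tokenizer for full-text search
CREATE TABLE app_logs (
    ts DATETIME,
    level VARCHAR(8),
    message STRING,
    INDEX idx_message (message) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Full-text match: rows whose message contains either keyword
SELECT count(*) FROM app_logs WHERE message MATCH_ANY 'error timeout';
</code></pre>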
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="enhanced-data-lakehousing-capabilities">Enhanced data lakehousing capabilities<a href="#enhanced-data-lakehousing-capabilities" class="hash-link" aria-label="Enhanced data lakehousing capabilities的直接链接" title="Enhanced data lakehousing capabilities的直接链接"></a></h3><p>In Apache Doris 1.2, we introduced Multi-Catalog to allow for auto-mapping and auto-synchronization of data from heterogeneous sources. In version 2.0.0, we extended the list of supported data sources and optimized Doris based on users' needs in production environments.</p><p><img loading="lazy" alt="Apache-Doris-data-lakehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-5-f438663781fa9c4da57e1b08cd360d7a.png" width="1708" height="724" class="img_ev3q"></p><p>Apache Doris 2.0.0 supports dozens of data sources including Hive, Hudi, Iceberg, Paimon, MaxCompute, Elasticsearch, Trino, ClickHouse, and almost all open lakehouse formats. It also supports snapshot queries on Hudi Copy-on-Write tables and read-optimized queries on Hudi Merge-on-Read tables. It allows for authorization of Hive Catalog using Apache Ranger, so users can reuse their existing privilege control system. Besides, it supports extensible authorization plug-ins to enable user-defined authorization methods for any catalog. </p><p>TPC-H benchmark tests showed that Apache Doris 2.0.0 is 3~5 times faster than Presto/Trino in queries on Hive tables. This is realized by the all-around optimizations (in small file reading, flat table reading, local file cache, ORC/Parquet file reading, Compute Nodes, and information collection of external tables) finished in this development cycle, together with the distributed execution framework, vectorized execution engine, and query optimizer of Apache Doris. </p><p><img loading="lazy" alt="Apache-Doris-VS-Trino" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-6-e12c23ddbbd23aa017fe3cecf1847f48.png" width="1722" height="728" class="img_ev3q"></p><p>All this gives Apache Doris 2.0.0 an edge in data lakehousing scenarios. With Doris, you can do incremental or overall synchronization of multiple upstream data sources in one place, and expect much higher data query performance than other query engines. The processed data can be written back to the sources or provided for downstream systems. In this way, you can make Apache Doris your unified data analytics gateway.</p>
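<p>Mapping an external source takes one statement; here is a minimal sketch for a Hive Metastore catalog, with a placeholder metastore URI and hypothetical database and table names:</p><pre><code class="language-SQL">-- Register a Hive data source as a catalog
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

-- Switch to it and query Hive tables as if they were local
SWITCH hive_catalog;
SELECT * FROM hive_db.hive_table LIMIT 10;
</code></pre>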
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="efficient-data-update">Efficient Data Update<a href="#efficient-data-update" class="hash-link" aria-label="Efficient Data Update的直接链接" title="Efficient Data Update的直接链接"></a></h2><p>Data update is important in real-time analysis, since users always want access to the latest data and the ability to update data flexibly, such as updating a row or just a few columns, batch-updating or deleting specified data, or even overwriting a whole data partition.</p><p>Efficient data updating has been another hill to climb in data analysis. Apache Hive only supports updates on the partition level, while Hudi and Iceberg do better in low-frequency batch updates than in real-time updates due to their Merge-on-Read and Copy-on-Write implementations.</p><p>As for data updating, Apache Doris 2.0.0 is capable of:</p><ul><li><strong>Faster data writing</strong>: In the pressure tests with an online payment platform, under 20 concurrent data writing tasks, Doris reached a writing throughput of 300,000 records per second and maintained stability throughout the over 10-hour continuous writing process.</li><li><strong>Partial column update</strong>: Older versions of Doris implement partial column update by <code>replace_if_not_null</code> in the Aggregate Key model. In 2.0.0, we enable partial column updates in the Unique Key model. That means you can directly write data from multiple source tables into a flat table, without having to concatenate them into one output stream using Flink before writing. This method avoids a complicated processing pipeline and extra resource consumption. You can simply specify the columns you need to update (see the sketch after this list).</li><li><strong>Conditional update and deletion</strong>: In addition to the simple Update and Delete operations, we realized complicated conditional update and delete operations on the basis of Merge-on-Write. </li></ul>
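<p>As referenced above, below is a minimal sketch of a partial column update in the Unique Key model; the session variable follows the 2.0 documentation, and the table and column names are placeholders:</p><pre><code class="language-SQL">-- Allow INSERT INTO to touch only a subset of columns
-- on a Merge-on-Write Unique Key table
SET enable_unique_key_partial_update = true;

-- Only the status column of row 10001 is rewritten;
-- all other columns keep their existing values
INSERT INTO orders (order_id, status) VALUES (10001, 'paid');
</code></pre>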
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="faster-stabler-and-smarter-data-writing">Faster, Stabler, and Smarter Data Writing<a href="#faster-stabler-and-smarter-data-writing" class="hash-link" aria-label="Faster, Stabler, and Smarter Data Writing的直接链接" title="Faster, Stabler, and Smarter Data Writing的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="higher-speed-in-data-writing">Higher speed in data writing<a href="#higher-speed-in-data-writing" class="hash-link" aria-label="Higher speed in data writing的直接链接" title="Higher speed in data writing的直接链接"></a></h3><p>As part of our continuing effort to strengthen the real-time analytic capability of Apache Doris, we have improved the end-to-end real-time data writing capability of version 2.0.0. Benchmark tests reported higher throughput in various writing methods:</p><ul><li>Stream Load, TPC-H 144G lineitem table, 48-bucket Duplicate table, triple-replica writing: throughput increased by 100%</li><li>Stream Load, TPC-H 144G lineitem table, 48-bucket Unique Key table, triple-replica writing: throughput increased by 200%</li><li>Insert Into Select, TPC-H 144G lineitem table, 48-bucket Duplicate table: throughput increased by 50%</li><li>Insert Into Select, TPC-H 144G lineitem table, 48-bucket Unique Key table: throughput increased by 150%</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="greater-stability-in-high-concurrency-data-writing">Greater stability in high-concurrency data writing<a href="#greater-stability-in-high-concurrency-data-writing" class="hash-link" aria-label="Greater stability in high-concurrency data writing的直接链接" title="Greater stability in high-concurrency data writing的直接链接"></a></h3><p>The sources of system instability often include small file merging, write amplification, and the consequential disk I/O and CPU overheads. Hence, we introduced Vertical Compaction and Segment Compaction in version 2.0.0 to eliminate OOM errors in compaction and avoid the generation of too many segment files during data writing. After such improvements, Apache Doris can write data 50% faster while <strong>using only 10% of the memory that it previously used</strong>.</p><p>Read more: <a href="https://doris.apache.org/blog/Understanding-Data-Compaction-in-3-Minutes/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/blog/Understanding-Data-Compaction-in-3-Minutes/</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="auto-synchronization-of-table-schema">Auto-synchronization of table schema<a href="#auto-synchronization-of-table-schema" class="hash-link" aria-label="Auto-synchronization of table schema的直接链接" title="Auto-synchronization of table schema的直接链接"></a></h3><p>The latest Flink-Doris-Connector allows users to synchronize an entire database (such as MySQL or Oracle) to Apache Doris in one simple step. According to our test results, one single synchronization task can support the real-time concurrent writing of thousands of tables. Users no longer need to go through a complicated synchronization procedure because Apache Doris has automated the process. Changes in the upstream data schema will be automatically captured and dynamically updated to Apache Doris in a seamless manner.</p><p>Read more: <a href="https://doris.apache.org/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis" target="_blank" rel="noopener noreferrer">https://doris.apache.org/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-new-multi-tenant-resource-isolation-solution">A New Multi-Tenant Resource Isolation Solution<a href="#a-new-multi-tenant-resource-isolation-solution" class="hash-link" aria-label="A New Multi-Tenant Resource Isolation Solution的直接链接" title="A New Multi-Tenant Resource Isolation Solution的直接链接"></a></h2><p>The purpose of multi-tenant resource isolation is to avoid resource preemption in the case of heavy loads. To that end, older versions of Apache Doris adopted a hard isolation plan based on Resource Groups: backend nodes of the same Doris cluster would be tagged, and those with the same tag formed a Resource Group. As data was ingested into the database, different data replicas would be written into different Resource Groups, which would be responsible for different workloads. For example, data reading and writing would be conducted on different data tablets, so as to realize read-write separation. Similarly, you could also put online and offline business on different Resource Groups. </p><p><img loading="lazy" alt="resource-isolation" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-7-55f4a01b41924f345f92ce1f8315dca9.png" width="1823" height="977" class="img_ev3q"></p><p>This is an effective solution, but in practice, it happens that some Resource Groups are heavily occupied while others are idle. We wanted a more flexible way to reduce the vacancy rate of resources. Thus, in 2.0.0, we introduced the Workload Group resource soft limit.</p><p><img loading="lazy" alt="workload-group" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-2.0.0-8-786bcb555e7791f51a591291ed105021.png" width="1166" height="362" class="img_ev3q"></p><p>The idea is to divide workloads into groups to allow for flexible management of CPU and memory resources. Apache Doris associates a query with a Workload Group, and limits the percentage of CPU and memory that a single query can use on a backend node. The memory soft limit can be configured and enabled by the user.</p>
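<p>Creating a group takes a single statement; here is a minimal sketch with illustrative limits, using the property names from the 2.0 documentation:</p><pre><code class="language-SQL">-- A Workload Group with a CPU share and a memory soft limit;
-- memory overcommit lets it borrow idle memory beyond the limit
CREATE WORKLOAD GROUP IF NOT EXISTS g1
PROPERTIES (
    "cpu_share" = "10",
    "memory_limit" = "30%",
    "enable_memory_overcommit" = "true"
);
</code></pre>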
<p>When there is a cluster resource shortage, the system will kill the largest memory-consuming query tasks; when there are sufficient cluster resources, once a Workload Group uses more resources than expected, the idle cluster resources will be shared among all the Workload Groups to make full use of system memory and ensure stable execution of queries. You can also prioritize the Workload Groups in terms of resource allocation. In other words, you can decide which tasks are assigned adequate resources and which are not.</p><p>Meanwhile, we introduced Query Queue in 2.0.0. Upon Workload Group creation, you can set a maximum query number for a query queue. Queries beyond that limit will wait for execution in the queue. This is to reduce system burden under heavy workloads.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="elastic-scaling-and-storage-compute-separation">Elastic Scaling and Storage-Compute Separation<a href="#elastic-scaling-and-storage-compute-separation" class="hash-link" aria-label="Elastic Scaling and Storage-Compute Separation的直接链接" title="Elastic Scaling and Storage-Compute Separation的直接链接"></a></h2><p>When it comes to computation and storage resources, what do users want?</p><ul><li><strong>Elastic scaling of computation resources</strong>: Scale up resources quickly in peak times to increase efficiency and scale down in valley times to reduce costs.</li><li><strong>Lower storage costs</strong>: Use low-cost storage media and separate storage from computation.</li><li><strong>Separation of workloads</strong>: Isolate the computation resources of different workloads to avoid preemption.</li><li><strong>Unified management of data</strong>: Simply manage catalogs and data in one place.</li></ul><p>Separating storage and computation is one way to realize elastic scaling of resources, but it demands more effort in maintaining storage stability, which determines the stability and continuity of OLAP services. To ensure storage stability, we introduced mechanisms including cache management, computation resource management, and garbage collection.</p><p>In this respect, after investigation, we divided our users into three groups:</p><ol><li>Users with no need for resource scaling</li><li>Users requiring resource scaling, low storage costs, and workload separation from Apache Doris</li><li>Users who already have a stable large-scale storage system and thus require an advanced compute-storage-separated architecture for efficient resource scaling</li></ol><p>Apache Doris 2.0 provides two solutions to address the needs of the first two types of users.</p><ol><li><strong>Compute nodes</strong>. We introduced stateless compute nodes in version 2.0. Unlike the mixed nodes, the compute nodes do not store any data and are not involved in workload balancing of data tablets during cluster scaling. Thus, they are able to quickly join the cluster and share the computing pressure during peak times. In addition, in data lakehouse analysis, these nodes will be the first ones to execute queries on remote storage (HDFS/S3) so there will be no resource competition between internal tables and external tables.<ol><li><a href="https://doris.apache.org/docs/2.0/admin-manual/resource-admin/compute-node/" target="_blank" rel="noopener noreferrer">Read more in Docs</a></li></ol></li><li><strong>Hot-cold data separation</strong>. Hot/cold data refers to data that is frequently/seldom accessed, respectively. Generally, it makes more sense to store cold data in low-cost storage. Older versions of Apache Doris supported lifecycle management of table partitions: as hot data cooled down, it would be moved from SSD to HDD. However, data was stored with multiple replicas on HDD, which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in object storage, which is even cheaper and allows single-copy storage. That reduces the storage costs by 70% and cuts down the computation and network overheads that come with storage (see the sketch after this list).<ol><li><a href="https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How" target="_blank" rel="noopener noreferrer">Read more in Docs</a></li></ol></li></ol>
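<p>As referenced in the list above, here is a minimal sketch of moving cold data to object storage; the resource and policy names, endpoint, bucket, and credentials are all placeholders:</p><pre><code class="language-SQL">-- An S3 resource that will hold cooled-down data
CREATE RESOURCE "remote_s3" PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.bucket" = "doris-cold-data",
    "s3.root.path" = "cold/",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);

-- Data is moved to the resource one day after ingestion
CREATE STORAGE POLICY cold_policy PROPERTIES (
    "storage_resource" = "remote_s3",
    "cooldown_ttl" = "86400"
);

-- Attach the policy to an existing table
ALTER TABLE orders SET ("storage_policy" = "cold_policy");
</code></pre>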
<p>For a cleaner separation of computation and storage, the VeloDB team is going to contribute its Cloud Compute-Storage-Separation solution to the Apache Doris project. Its performance and stability have stood the test of hundreds of companies in their production environments. The merging of code will be finished by October this year, and all Apache Doris users will be able to get an early taste of it in September.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="enhanced-usability">Enhanced Usability<a href="#enhanced-usability" class="hash-link" aria-label="Enhanced Usability的直接链接" title="Enhanced Usability的直接链接"></a></h2><p>Apache Doris 2.0.0 also highlights some enterprise-facing functionalities.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-for-kubernetes-deployment">Support for Kubernetes Deployment<a href="#support-for-kubernetes-deployment" class="hash-link" aria-label="Support for Kubernetes Deployment的直接链接" title="Support for Kubernetes Deployment的直接链接"></a></h3><p>Older versions of Apache Doris communicated based on IP addresses, so in a Kubernetes deployment, any host failure that caused a Pod IP drift would lead to cluster unavailability. Now, version 2.0 supports FQDN. That means failed Doris nodes can recover automatically without human intervention, which lays the foundation for Kubernetes deployment and elastic scaling. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-for-cross-cluster-replication-ccr">Support for Cross-Cluster Replication (CCR)<a href="#support-for-cross-cluster-replication-ccr" class="hash-link" aria-label="Support for Cross-Cluster Replication (CCR)的直接链接" title="Support for Cross-Cluster Replication (CCR)的直接链接"></a></h3><p>Apache Doris 2.0.0 supports cross-cluster replication (CCR). Data changes at the database/table level in the source cluster will be synchronized to the target cluster. You can choose to replicate the incremental data or the overall data. </p><p>It also supports synchronization of DDL, which means DDL statements executed by the source cluster can also be automatically replicated to the target cluster. </p><p>It is simple to configure and use CCR in Doris. 
Leveraging this functionality, you can implement read-write separation and multi-datacenter replication </p><p>This feature allows for higher availability of data, read/write workload separation, and cross-data-center replication more efficiently.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-change">Behavior Change<a href="#behavior-change" class="hash-link" aria-label="Behavior Change的直接链接" title="Behavior Change的直接链接"></a></h2><ul><li>Use rolling upgrade from 1.2-ITS to 2.0.0, and restart upgrade from preview versions of 2.0 to 2.0.0;</li><li>The new query optimizer (Nereids) is enabled by default: <code>enable_nereids_planner=true</code>;</li><li>All non-vectorized code has been removed from the system, so the <code>enable_vectorized_engine</code> parameter no long works;</li><li>A new parameter <code>enable_single_replica_compaction</code> has been added;</li><li>datev2, datetimev2, and decimalv3 are the default data types in table creation; datav1, datetimev1, and decimalv2 are not supported in table creation;</li><li>decimalv3 is the default data type for JDBC and Iceberg Catalog;</li><li>A new data type <code>AGG_STATE</code> has been added;</li><li>The cluster column has been removed from backend tables;</li><li>For better compatibility with BI tools, datev2 and datetimev2 are displayed as date and datetime when <code>show create table</code>;</li><li>max_openfiles and swaps checks are added to the backend startup script so inappropriate system configuration might lead to backend failure;</li><li>Password-free login is not allowed when accessing frontend on localhost;</li><li>If there is a Multi-Catalog in the system, by default, only data of the internal catalog will be displayed when querying information schema;</li><li>A limit has been imposed on the depth of the expression tree. The default value is 200;</li><li>The single quote in the return value of array string has been changed to double quote;</li><li>The Doris processes are renamed to DorisFE and DorisBE.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="embarking-on-the-200-journey">Embarking on the 2.0.0 Journey<a href="#embarking-on-the-200-journey" class="hash-link" aria-label="Embarking on the 2.0.0 Journey的直接链接" title="Embarking on the 2.0.0 Journey的直接链接"></a></h2><p>To make Apache Doris 2.0.0 production-ready, we invited hundreds of enterprise users to engage in the testing and optimized it for better performance, stability, and usability. In the next phase, we will continue responding to user needs with agile release planning. We plan to launch 2.0.1 in late August and 2.0.2 in September, as we keep fixing bugs and adding new features. We also plan to release an early version of 2.1 in September to bring a few long-requested capabilities to you. For example, in Doris 2.1, the Variant data type will better serve the schema-free analytic needs of semi-structured data; the multi-table materialized views will be able to simplify the data scheduling and processing link while speeding up queries; more and neater data ingestion methods will be added and nested composite data types will be realized.</p><p>If you have any questions or ideas when investigating, testing, and deploying Apache Doris, please find us on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>. Our developers will be happy to hear them and provide targeted support.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Database in fintech: how to support 10,000 dashboards without causing a mess]]></title>
<id>https://doris.apache.org/zh-CN/blog/Database-in-Fintech-How-to-Support-ten-thousand-Dashboards-Without-Causing-a-Mess</id>
<link href="https://doris.apache.org/zh-CN/blog/Database-in-Fintech-How-to-Support-ten-thousand-Dashboards-Without-Causing-a-Mess"/>
<updated>2023-08-05T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article introduces the lifecycle of financial metrics in a database, from how they're produced to how they're efficiently presented in data reports.]]></summary>
<content type="html"><![CDATA[<p>In a data-intensive industry like finance, data comes from numerous entries and goes to numerous exits. Such status quo can easily, and almost inevitably, lead to chaos in data analysis and management. For example, analysts from different business lines define their own financial metrics in data reports. When you pool these countless reports together in your data architecture, you will find that many metrics overlap or even contradict each other in definition. The consequence is, developing a simple data report will require lots of clarification back and forth, making the process more complicated and time-consuming than it should be.</p><p>As your business grows, your data management will arrive at a point when "standardization" is needed. In terms of data engineering, that means you need a data platform where you can produce and manage all metrics. That's your architectural prerequisite to provide efficient financial services. </p><p>This article introduces the lifecycle of financial metrics in a database (in this case, <a href="https://doris.apache.org/" target="_blank" rel="noopener noreferrer">Apache Doris</a>), from how they're produced to how they're efficiently presented in data reports. You will get an inside view of what's behind those fancy financial dashboards. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="define-new-metrics--add-them-to-your-database">Define New Metrics &amp; Add Them to Your Database<a href="#define-new-metrics--add-them-to-your-database" class="hash-link" aria-label="Define New Metrics &amp; Add Them to Your Database的直接链接" title="Define New Metrics &amp; Add Them to Your Database的直接链接"></a></h2><p>Fundamentally, metrics are fields in a table. To provide a more concrete idea of them, I will explain with an example in the banking industry. </p><p>Banks measure the assets of customers by AUM (Assets Under Management). In this scenario, AUM is an <strong>atomic metric</strong>, which is often a field in the source data table. On the basis of AUM, analysts derive a series of <strong>derivative metrics</strong>, such as "year-on-year AUM growth", "month-on-month AUM growth", and "AUM per customer".</p><p>Once you define the new metrics, you add them to your data reports, which involves a few simple configurations in Apache Doris:</p><p>Developers update the metadata accordingly, register the base table where the metrics are derived, configure the data granularity and update frequency of intermediate tables, and input the metric name and definition. Some engineers will also monitor the metrics to identify abnormalities and remove redundant metrics based on a metric evaluation system.</p><p>When the metrics are soundly put in place, you can ingest new data into your database to get your data reports. For example, if you ingest CSV files, we recommend the Stream Load method of Apache Doris and a file size of 1~10G per batch. Eventually, these metrics will be visualized in data charts. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="calculate-your-metrics">Calculate Your Metrics<a href="#calculate-your-metrics" class="hash-link" aria-label="Calculate Your Metrics的直接链接" title="Calculate Your Metrics的直接链接"></a></h2><p>As is mentioned, some metrics are produced by combining multiple fields in the source table. In data engineering, that is a multi-table join query. Based on the optimization experience of an Apache Doris user, we recommend flat tables instead of Star/Snowflake Schema. 
<p><img loading="lazy" alt="join-queries" src="https://cdnd.selectdb.com/zh-CN/assets/images/Pingan_1-ca53619302ca8b80b8fdb1c73a5c39c9.png" width="1280" height="642" class="img_ev3q"></p><p>The flat table solution also eliminates jitter.</p><p><img loading="lazy" alt="reduced-jitter" src="https://cdnd.selectdb.com/zh-CN/assets/images/Pingan_2-325bffe3684325c0fd1970d82aadf4ff.png" width="1280" height="283" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="enable-sql-caching-to-reduce-resource-consumption">Enable SQL Caching to Reduce Resource Consumption<a href="#enable-sql-caching-to-reduce-resource-consumption" class="hash-link" aria-label="Enable SQL Caching to Reduce Resource Consumption的直接链接" title="Enable SQL Caching to Reduce Resource Consumption的直接链接"></a></h2><p>Analysts often check data reports of the same metrics on a regular basis. These reports are produced by the same SQL, so one way to further improve query speed is SQL caching. Here is how it turns out in a use case with SQL caching enabled.</p><ul><li>All queries are responded to within 10ms;</li><li>When computing 30 metrics simultaneously (over 120 SQL commands), results can be returned within 600ms;</li><li>A TPS (Transactions Per Second) of 300 is reached, with CPU, memory, disk, and I/O usage under 80%;</li><li>Under the recommended cluster size, over 10,000 metrics can be cached, which means you can save a lot of computation resources.</li></ul><p><img loading="lazy" alt="reduced-computation-resources" src="https://cdnd.selectdb.com/zh-CN/assets/images/Pingan_3-6f36c1669284dcc3672824c3fa772c55.png" width="1280" height="1212" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>The complexity of data analysis in the financial industry lies in the data itself rather than the engineering side. Thus, the underlying data architecture should focus on facilitating the unified and efficient management of data. Apache Doris provides the flexibility of simple metric registration and the ability to compute metrics quickly and resource-efficiently. In this case, the user is able to handle 10,000 active financial metrics in 10,000 dashboards with 30% less ETL effort.</p><p>Find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>.</p>]]></content>
<author>
<name>Hou Lan</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[For entry-level data engineers: how to build a simple but solid data architecture]]></title>
<id>https://doris.apache.org/zh-CN/blog/For-Entry-Level-Data-Engineers-How-to-Build-a-Simple-but-Solid-Data-Architecture</id>
<link href="https://doris.apache.org/zh-CN/blog/For-Entry-Level-Data-Engineers-How-to-Build-a-Simple-but-Solid-Data-Architecture"/>
<updated>2023-07-31T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article aims to provide reference for non-tech companies who are seeking to empower your business with data analytics.]]></summary>
<content type="html"><![CDATA[<p>This article aims to provide reference for non-tech companies who are seeking to empower your business with data analytics. You will learn the basics about how to build an efficient and easy-to-use data system, and I will walk you through every aspect of it with a use case of Apache Doris, an MPP-based analytic data warehouse. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-you-need">What You Need<a href="#what-you-need" class="hash-link" aria-label="What You Need的直接链接" title="What You Need的直接链接"></a></h2><p>This case is about a ticketing service provider who want a data platform that boasts quick processing, low maintenance costs, and ease of use, and I think they speak for the majority of entry-level database users.</p><p>A prominent feature of ticketing services is the periodic spikes in ticket orders, you know, before the shows go on. So from time to time, the company has a huge amount of new data rushing in and requires real-time processing of it, so they can make timely adjustments during the short sales window. But in other time, they don't want to spend too much energy and funds on maintaining the data system. Furthermore, for a beginner of digital operation who only require basic analytic functions, it is better to have a data architecture that is easy-to-grasp and user-friendly. After research and comparison, they came to the Apache Doris community and we help them build a Doris-based data architecture.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="simple-architecture">Simple Architecture<a href="#simple-architecture" class="hash-link" aria-label="Simple Architecture的直接链接" title="Simple Architecture的直接链接"></a></h2><p>The building blocks of this architecture are simple. You only need Apache Flink and Apache Kafka for data ingestion, and Apache Doris as an analytic data warehouse. </p><p><img loading="lazy" alt="simple-data-architecture-with-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/Poly_1-4657c20d910093fd2ab45c5355bf13dc.png" width="1280" height="599" class="img_ev3q"></p><p>Connecting data sources to the data warehouse is simple, too. The key component, Apache Doris, supports various data loading methods to fit with different data sources. You can perform column mapping, transforming, and filtering during data loading to avoid duplicate collection of data. To ingest a table, users only need to add the table name to the configurations, instead of writing a script themselves. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-update">Data Update<a href="#data-update" class="hash-link" aria-label="Data Update的直接链接" title="Data Update的直接链接"></a></h2><p>Flink CDC was found to be the optimal choice if you are looking for higher stability in data ingestion. It also allows you to update the dynamically changing tables in real time. The process includes the following steps:</p><ul><li>Configure Flink CDC for the source MySQL database, so that it allows dynamic updating of the table management configurations (which you can think of as the "metadata").</li><li>Create two CDC jobs in Flink, one to capture the changed data (the Forward stream), the other to update the table management configurations (the Broadcast stream).</li><li>Configure all tables of the source database at the Sink end (the output end of Flink CDC). When there is newly added table in the source database, the Broadcast stream will be triggered to update the table management configurations. 
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="reporting">Reporting<a href="#reporting" class="hash-link" aria-label="Reporting的直接链接" title="Reporting的直接链接"></a></h2><p>Like many non-tech businesses, the ticketing service provider needs a data warehouse mainly for reporting. They derive trends and patterns from all kinds of data reports, and then figure out ways to improve management efficiency and increase sales. Specifically, this is the information they are observing in their reports:</p><ul><li><strong>Statistical Reporting</strong>: These are the most frequently used reports, including sales reports by theater, distribution channel, sales representative, and show.</li><li><strong>Agile Reporting</strong>: These are reports developed for specific purposes, such as daily and weekly project data reports, sales summary reports, GMV reports, and settlement reports.</li><li><strong>Data Analysis</strong>: This involves data such as membership orders, attendance rates, and user portraits.</li><li><strong>Dashboarding</strong>: This is to visually display sales data.</li></ul><p><img loading="lazy" alt="Real-Time-Data-Warehouse-and-Reporting" src="https://cdnd.selectdb.com/zh-CN/assets/images/Poly_3-8dbc669ac5f492a38335618a36ef214f.png" width="1280" height="584" class="img_ev3q"></p><p>These are all entry-level tasks in data analytics. One of the biggest burdens for the data engineers was to quickly develop new reports as the internal analysts required. The <a href="https://doris.apache.org/docs/dev/data-table/data-model#aggregate-model" target="_blank" rel="noopener noreferrer">Aggregate Key Model</a> of Apache Doris is designed for this. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="quick-aggregation-to-produce-reports-on-demand">Quick aggregation to produce reports on demand<a href="#quick-aggregation-to-produce-reports-on-demand" class="hash-link" aria-label="Quick aggregation to produce reports on demand的直接链接" title="Quick aggregation to produce reports on demand的直接链接"></a></h3><p>For example, suppose analysts want a sales report by sales representatives; data engineers can produce that by simple configuration:</p><ol><li>Put the original data in the Aggregate Key Model</li><li>Specify the sales representative ID column and the payment date column as the Key columns, and the order amount column as the Value column</li></ol><p>Then, order amounts of the same sales representative within the specified period of time will be auto-aggregated (see the sketch below). Bam! That's the report you need! </p>
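<p>Below is a minimal sketch of such an Aggregate Key table; the table, column names, and bucket count are illustrative:</p><pre><code class="language-SQL">-- Rows with the same (rep_id, pay_date) are merged on ingestion,
-- so order amounts are pre-summed at write time
CREATE TABLE sales_report (
    rep_id BIGINT,
    pay_date DATE,
    order_amount DECIMAL(18, 2) SUM
)
AGGREGATE KEY(rep_id, pay_date)
DISTRIBUTED BY HASH(rep_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Reading the table already yields the per-representative daily totals
SELECT rep_id, pay_date, order_amount FROM sales_report;
</code></pre>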
<p>According to the user, this whole process only takes them 10~30 minutes, depending on the complexity of the report required. So the Aggregate Key Model largely relieves data engineers of the pressure of report development.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="quick-response-to-data-queries">Quick response to data queries<a href="#quick-response-to-data-queries" class="hash-link" aria-label="Quick response to data queries的直接链接" title="Quick response to data queries的直接链接"></a></h3><p>Most data analysts would just want their target data to be returned the second they need it. In this case, the user often leverages two capabilities of Apache Doris to realize quick query response.</p><p>Firstly, Apache Doris is famously fast in Join queries. So if you need to extract information across multiple tables, you are in good hands. Secondly, in data analysis, it often happens that analysts frequently input the same request. For example, they frequently want to check the sales data of different theaters. In this scenario, Apache Doris allows you to create a <a href="https://doris.apache.org/docs/dev/query-acceleration/materialized-view/" target="_blank" rel="noopener noreferrer">Materialized View</a>, which means you pre-aggregate the sales data of each theater, and store this table in isolation from the original tables. In this way, every time you need to check the sales data by theater, the system directly goes to the Materialized View and reads data from there, instead of scanning the original table all over again. This can increase query speed by orders of magnitude.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>This is the overview of a simple data architecture and how it can provide the data services you need. It ensures data ingestion stability and quality with Flink CDC, and quick data analysis with Apache Doris. The deployment of this architecture is simple, too. If you plan a data analytics upgrade for your business, you might refer to this case. If you need advice and help, you may join our <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">community here</a>.</p>]]></content>
<author>
<name>Zhenwei Liu</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Is your latest data really the latest? check the data update mechanism of your database]]></title>
<id>https://doris.apache.org/zh-CN/blog/Is-Your-Latest-Data-Really-the-Latest-Check-the-Data-Update-Mechanism-of-Your-Database</id>
<link href="https://doris.apache.org/zh-CN/blog/Is-Your-Latest-Data-Really-the-Latest-Check-the-Data-Update-Mechanism-of-Your-Database"/>
<updated>2023-07-24T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This is about how to support both row update and partial column update in a database in a way that is simple in execution and efficient in data quality guarantee.]]></summary>
<content type="html"><![CDATA[<p>In databases, data update is to add, delete, or modify data. Timely data update is an important part of high quality data services.</p><p>Technically speaking, there are two types of data updates: you either update a whole row (<strong>Row Update</strong>) or just update part of the columns (<strong>Partial Column Update</strong>). Many databases supports both of them, but in different ways. This post is about one of them, which is simple in execution and efficient in data quality guarantee. </p><p>As an open source analytic database, Apache Doris supports both Row Update and Partial Column Update with one data model: the <a href="https://doris.apache.org/docs/dev/data-table/data-model#unique-model" target="_blank" rel="noopener noreferrer"><strong>Unique Key Model</strong></a>. It is where you put data that doesn't need to be aggregated. In the Unique Key Model, you can specify one column or the combination of several columns as the Unique Key (a.k.a. Primary Key). For one Unique Key, there will always be one row of data: the newly ingested data record replaces the old. That's how data updates work.</p><p>The idea is straightforward, but in real-life implementation, it happens that the latest data does not arrive the last or doesn't even get written at all, so I'm going to show you how Apache Doris implements data update and avoids messups with its Unique Key Model. </p><p><img loading="lazy" alt="data-update" src="https://cdnd.selectdb.com/zh-CN/assets/images/Dataupdate_1-f213a24dcaaac700ff9f45906687c4a9.png" width="1280" height="705" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="row-update">Row Update<a href="#row-update" class="hash-link" aria-label="Row Update的直接链接" title="Row Update的直接链接"></a></h2><p>For data writing to the Unique Key Model, Apache Doris adopts the <strong>Upsert</strong> semantics, which means <strong>Update or Insert</strong>. If the new data record includes a Unique Key that already exists in the table, the new record will replace the old record; if it includes a brand new Unique Key, the new record will be inserted into the table as a whole. 
The Upsert operation can provide high throughput and guarantee data reliability.</p><p><strong>Example</strong>:</p><p>In the following table, the Unique Key is the combination of three columns: <code>user_id, date, group_id</code>.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; desc test_table;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-------------+--------------+------+-------+---------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| Field | Type | Null | Key | Default | Extra |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-------------+--------------+------+-------+---------+-------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| user_id | BIGINT | Yes | true | NULL | |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| date | DATE | Yes | true | NULL | |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| group_id | BIGINT | Yes | true | NULL | |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| modify_date | DATE | Yes | false | NULL | NONE |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| keyword | VARCHAR(128) | Yes | false | NULL | NONE |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+-------------+--------------+------+-------+---------+-------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Execute <code>insert into</code> to write a data record. 
Since the table was empty, by the Upsert semantics, this means adding a new row to the table.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; insert into test_table values (1, "2023-04-28", 2, "2023-04-28", "foo");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 1 row affected (0.05 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{'label':'insert_2fb45d1833db4348_b612b8791c97b467', 'status':'VISIBLE', 'txnId':'343'}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select * from test_table;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| user_id | date | group_id | modify_date | keyword |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 2023-04-28 | 2 | 2023-04-28 | foo |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Then insert two more data records, one of which has the same Unique Key as the previously inserted row. 
Now, by the Upsert semantics, the new record replaces the old row of the same Unique Key, and the record with the brand new Unique Key is inserted.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; insert into test_table values (1, "2023-04-28", 2, "2023-04-29", "foo"), (2, "2023-04-29", 2, "2023-04-29", "bar");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 2 rows affected (0.04 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{'label':'insert_7dd3954468aa4ac1_a63a3852e3573b4c', 'status':'VISIBLE', 'txnId':'344'}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select * from test_table;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| user_id | date | group_id | modify_date | keyword |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 2 | 2023-04-29 | 2 | 2023-04-29 | bar |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="partial-column-update">Partial Column Update<a href="#partial-column-update" class="hash-link" aria-label="Partial Column Update的直接链接" title="Partial Column Update的直接链接"></a></h2><p>Besides row update, under many circumstances, data analysts require the convenience of partial column update. For example, in user portraits, they would like to update certain dimensions of their users in real time. Or, if they need to maintain a flat table that is made of data from various source tables, they will prefer partial column update to complicated join operations as a way of updating data. </p><p>Apache Doris supports partial column update with the UPDATE statement. It filters the rows that need to be modified, reads them, changes a few values, and writes the rows back to the table. 
</p><p><strong>Example</strong>:</p><p>Suppose that there is an order table, in which the Order ID is the Unique Key.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+--------------+-----------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| order_id | order_amount | order_status |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+--------------+-----------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 100 | Payment Pending |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+--------------+-----------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.01 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>When the buyer completes the payment, Apache Doris should change the order status of Order ID 1 from "Payment Pending" to "Delivery Pending". 
This is when the Update command comes into play.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; UPDATE test_order SET order_status = 'Delivery Pending' WHERE order_id = 1;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 1 row affected (0.11 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{'label':'update_20ae22daf0354fe0-b5aceeaaddc666c5', 'status':'VISIBLE', 'txnId':'33', 'queryId':'20ae22daf0354fe0-b5aceeaaddc666c5'}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>This is the table after updating.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+--------------+------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| order_id | order_amount | order_status |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+--------------+------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 100 | Delivery Pending |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+----------+--------------+------------------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.01 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>The execution of the Update command consists of three steps in the system:</p><ul><li>Step One: Read the row where Order ID = 1 (1, 100, 'Payment Pending')</li><li>Step Two: Modify the order status from "Payment Pending" to "Delivery Pending" (1, 100, 'Delivery Pending')</li><li>Step Three: Insert the new row into the table</li></ul><p><img loading="lazy" alt="partial-column-update-1" 
src="https://cdnd.selectdb.com/zh-CN/assets/images/Dataupdate_2-9a653bfdd528301c5b147351f157da3a.png" width="1484" height="296" class="img_ev3q"></p><p>The table is in the Unique Key Model, which means for rows of the same Unique Key, only the last inserted one will be reserved, so this is what the table will finally look like:</p><p><img loading="lazy" alt="partial-column-update-2" src="https://cdnd.selectdb.com/zh-CN/assets/images/Dataupdate_3-0af75c350522fdc2c1074db4b2235711.png" width="1500" height="260" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="order-of-data-updates">Order of Data Updates<a href="#order-of-data-updates" class="hash-link" aria-label="Order of Data Updates的直接链接" title="Order of Data Updates的直接链接"></a></h2><p>So far this sounds simple, but in the actual world, data update might fail due to reasons such as data format errors, and thus mess up the data writing order. The order of data update matters more than you imagine. For example, in financial transactions, messed-up data writing order might lead to transaction data losses, errors, or duplication, which further leads to bigger problems.</p><p>Apache Doris provides two options for users to guarantee that their data is updated in the correct order:</p><p><strong>1. Update by the order of transaction commit</strong> </p><p>In Apache Doris, each data ingestion task is a transaction. Each successfully ingested task will be given a data version and the number of data versions is strictly increasing. If the ingestion fails, the transaction will be rolled back, and no new data version will be generated.</p><p> By default, the Upsert semantics follows the order of the transaction commits. If there are two data ingestion tasks involving the same Unique Key, the first task generating data version 2 and the second, data version 3, then according to transaction commit order, data version 3 will replace data version 2.</p><p><strong>2. Update by the user-defined order</strong></p><p>In real-time data analytics, data updates often happen in high concurrency. It is possible that there are multiple data ingestion tasks updating the same row, but these tasks are committed in unknown order, so the last saved update remains unknown, too.</p><p>For example, these are two data updates, with "2023-04-30" and "2023-05-01" as the <code>modify_data</code>, respectively. 
If they are written into the system concurrently, but the "2023-05-01" one is successfully committed first and the other later, then the "2023-04-30" record will be saved due to its higher data version number, but we know it is not the latest one.</p><div class="language-Plain codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Plain codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 1 row affected (0.04 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{'label':'insert_e2daf8cea5524ee1_94e5c87e7bb74d67', 'status':'VISIBLE', 'txnId':'345'}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 1 row affected (0.03 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{'label':'insert_ef906f685a7049d0_b135b6cfee49fb98', 'status':'VISIBLE', 'txnId':'346'}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select * from test_table;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| user_id | date | group_id | modify_date | keyword |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 2 | 2023-04-29 | 2 | 2023-04-30 | aaa |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>That's why in high-concurrency scenarios, Apache Doris allows data update in user-defined order. Users can designate a column as the Sequence Column. 
In this way, the system will identify and save the latest data version based on the value in the Sequence Column.</p><p><strong>Example:</strong></p><p>You can designate a Sequence Column by specifying the <code>function_column.sequence_col</code> property upon table creation.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE test.test_table</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> user_id bigint,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> date date,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> group_id bigint,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> modify_date date,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> keyword VARCHAR(128)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">UNIQUE KEY(user_id, date, group_id)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH (user_id) BUCKETS 32</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "function_column.sequence_col" = 'modify_date',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "replication_num" = "1",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> "in_memory" = "false"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Then check and see: the data record with the highest value in the Sequence Column will be saved:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 1 row affected (0.03 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">{'label':'insert_3aac37ae95bc4b5d_b3839b49a4d1ad6f', 'status':'VISIBLE', 'txnId':'349'}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">Query OK, 1 row affected (0.03 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">{'label':'insert_419d4008768d45f3_a6912e584cf1b500', 'status':'VISIBLE', 'txnId':'350'}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; select * from test_table;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| user_id | date | group_id | modify_date | keyword |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 2 | 2023-04-29 | 2 | 2023-05-01 | bbb |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+------------+----------+-------------+---------+</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>Congratulations. Now you've gained an overview of how data updates are implemented in Apache Doris. With this knowledge, you can basically guarantee efficiency and accuracy of data updating. But wait, there is so much more about that. As Apache Doris 2.0 is going to provide more powerful Partial Column Update capabilities, with improved execution of the Update statement and the support for more complicated multi-table Join queries, I will show you how to take advantage of them in details in my follow-up writings. <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">We</a> are constantly updating our data updates!</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.6]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.6</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.6"/>
<updated>2023-07-17T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.2.6 is now available, with several enhancements and bug fixes based on 1.2.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-change">Behavior Change<a href="#behavior-change" class="hash-link" aria-label="Behavior Change的直接链接" title="Behavior Change的直接链接"></a></h2><ul><li>Add a BE configuration item <code>allow_invalid_decimalv2_literal</code> to control whether can import data that exceeding the decimal's precision, for compatibility with previous logic.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="query">Query<a href="#query" class="hash-link" aria-label="Query的直接链接" title="Query的直接链接"></a></h2><ul><li>Fix several query planning issues.</li><li>Support <code>sql_select_limit</code> session variable.</li><li>Optimize query cold run performance.</li><li>Fix expr context memory leak.</li><li>Fix the issue that the <code>explode_split</code> function was executed incorrectly in some cases.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="multi-catalog">Multi Catalog<a href="#multi-catalog" class="hash-link" aria-label="Multi Catalog的直接链接" title="Multi Catalog的直接链接"></a></h3><ul><li>Fix the issue that synchronizing hive metadata caused FE replay edit log to fail.</li><li>Fix <code>refresh catalog</code> operation causing FE OOM.</li><li>Fix the issue that jdbc catalog cannot handle <code>0000-00-00</code> correctly.</li><li>Fixed the issue that the kerberos ticket cannot be refreshed automatically.</li><li>Optimize the partition pruning performance of hive.</li><li>Fix the inconsistent behavior of trino and presto in jdbc catalog.</li><li>Fix the issue that hdfs short-circuit read could not be used to improve query efficiency in some environments.</li><li>Fix the issue that the iceberg table on CHDFS could not be read.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="storage">Storage<a href="#storage" class="hash-link" aria-label="Storage的直接链接" title="Storage的直接链接"></a></h2><ul><li>Fix the wrong calculation of delete bitmap in MOW table.</li><li>Fix several BE memory issues.</li><li>Fix snappy compression issue.</li><li>Fix the issue that jemalloc may cause BE to crash in some cases.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="others">Others<a href="#others" class="hash-link" aria-label="Others的直接链接" title="Others的直接链接"></a></h2><ul><li>Fix several java udf related issues.</li><li>Fix the issue that the <code>recover table</code> operation incorrectly triggered the creation of dynamic partitions.</li><li>Fix timezone when importing orc files via broker load.</li><li>Fix the issue that the newly added <code>PERCENT</code> keyword caused the replay metadata of the routine load job to fail.</li><li>Fix the issue that the <code>truncate</code> operation failed to acts on a non-partitioned table.</li><li>Fix the issue that the mysql connection was lost due to the <code>show snapshot</code> operation.</li><li>Optimize the lock logic to reduce the probability of lock timeout errors when creating tables.</li><li>Add session variable <code>have_query_cache</code> to be compatible with some old mysql clients.</li><li>Optimize the error message when encountering an error of loading.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks<a href="#big-thanks" class="hash-link" aria-label="Big Thanks的直接链接" title="Big Thanks的直接链接"></a></h2><p>Thanks all who contribute to this 
release:</p><p>@amorynan</p><p>@BiteTheDDDDt</p><p>@caoliang-web</p><p>@dataroaring</p><p>@Doris-Extras</p><p>@dutyu</p><p>@Gabriel39</p><p>@HHoflittlefish777</p><p>@htyoung</p><p>@jacktengg</p><p>@jeffreys-cat</p><p>@kaijchen</p><p>@kaka11chen</p><p>@Kikyou1997</p><p>@KnightLiJunLong</p><p>@liaoxin01</p><p>@LiBinfeng-01</p><p>@morningman</p><p>@mrhhsg</p><p>@sohardforaname</p><p>@starocean999</p><p>@vinlee19</p><p>@wangbo</p><p>@wsjz</p><p>@xiaokang</p><p>@xinyiZzz</p><p>@yiguolei</p><p>@yujun777</p><p>@Yulei-Yang</p><p>@zhangstar333</p><p>@zy-kkk</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Database dissection: how fast data queries are implemented]]></title>
<id>https://doris.apache.org/zh-CN/blog/Database-Dissection-How-Fast-Data-Queries-Are-Implemented</id>
<link href="https://doris.apache.org/zh-CN/blog/Database-Dissection-How-Fast-Data-Queries-Are-Implemented"/>
<updated>2023-07-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[What's more important than quick performance itself is the architectural design and mechanism that enable it.]]></summary>
<content type="html"><![CDATA[<p>In data analytics, fast query performance is more of a result than a guarantee. What's more important than the result itself is the architectural design and mechanism that enables quick performance. This is exactly what this post is about. I will put you into context with a typical use case of Apache Doris, an open-source MPP-based analytic database.</p><p>The user in this case is an all-category Q&amp;A website. As a billion-dollar listed company, they have their own data management platform. What Doris does is to support the data filtering, packaging, analyzing, and monitoring workloads of that platform. Based on their huge data size, the user demands quick data loading and quick response to queries. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-enable-quick-queries-on-huge-dataset">How to Enable Quick Queries on Huge Dataset<a href="#how-to-enable-quick-queries-on-huge-dataset" class="hash-link" aria-label="How to Enable Quick Queries on Huge Dataset的直接链接" title="How to Enable Quick Queries on Huge Dataset的直接链接"></a></h2><ul><li><strong>Scenario</strong>: user segmentation for the website</li><li><strong>Data size</strong>: 100 billion data objects, 2.4 million tags</li><li><strong>Requirements</strong>: query response time &lt; 1 second; result packaging &lt; 10 seconds</li></ul><p>For these goals, the engineers have made three critical changes in their data processing pipeline.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1distribute-the-data">1.Distribute the data<a href="#1distribute-the-data" class="hash-link" aria-label="1.Distribute the data的直接链接" title="1.Distribute the data的直接链接"></a></h3><p>User segmentation is when analysts pick out a group of website users that share certain characteristics (tags). In the database system, this process is implemented by a bunch of set operations (union, intersection, and difference). </p><p><strong>Narration from the engineers:</strong></p><p>We realize that instead of executing set operations on one big dataset, we can divide our dataset into smaller ones, execute set operations on each of them, and then merge all the results. In this way, each small dataset is computed by one thread/queue. Then we have a queue to do the final merging. It's simple distributed computing thinking.</p><p><img loading="lazy" alt="distributed-computing-in-database" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_1-7c5ee52877c98c9502ba57d03becdd9b.png" width="1280" height="651" class="img_ev3q"></p><p>Example:</p><ol><li>Every 1 million users are put into one group with a <code>group_id</code>.</li><li>All user tags in that same group will relate to the corresponding <code>group_id</code>.</li><li>Calculate the union/intersection/difference within each group. (Enable multi-thread mode to increase computation efficiency.)</li><li>Merge the results from the groups.</li></ol><p>The problem here is, since user tags are randomly distributed across various machines, the computation entails multi-time shuffling, which brings huge network overhead. That leads to the second change.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2pre-bind-a-data-group-to-a-machine">2.Pre-bind a data group to a machine<a href="#2pre-bind-a-data-group-to-a-machine" class="hash-link" aria-label="2.Pre-bind a data group to a machine的直接链接" title="2.Pre-bind a data group to a machine的直接链接"></a></h3><p>This is enabled by the Colocate mechanism of Apache Doris. 
The idea of Colocate is to place data chunks that are often accessed together onto the same node, so as to reduce cross-node data transfer and thus, get lower latency.</p><p><img loading="lazy" alt="colocate-mechanism" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_2-6f75c0c47ef7106018774d6a70bf0e99.png" width="1280" height="331" class="img_ev3q"></p><p>The implementation is simple: Bind one group key to one machine. Then naturally, data corresponding to that group key will be pre-bound to that machine. </p><p>The following is the query plan before we adopted Colocate. It is complicated, with a lot of data shuffling.</p><p><img loading="lazy" alt="complicated-data-shuffling" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_3-a6af7fe391aa9eaa717e558112e38d18.png" width="720" height="765" class="img_ev3q"></p><p>This is the query plan after. It is much simpler, which is why queries are much faster and less costly.</p><p><img loading="lazy" alt="simpler-query-plan-after-colocation-join" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_4-ad4a6e9be6d812a88220544a77ce1c73.png" width="1280" height="616" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3merge-the-operators">3.Merge the operators<a href="#3merge-the-operators" class="hash-link" aria-label="3.Merge the operators的直接链接" title="3.Merge the operators的直接链接"></a></h3><p>In data queries, the engineers realized that they often use a couple of functions in combination, so they decided to develop compound functions to further improve execution efficiency. They came to the Doris <a href="https://t.co/XD4uUSROft" target="_blank" rel="noopener noreferrer">community</a> and talked about their thoughts. The Doris developers provided support for them and soon the compound functions were ready for use on Doris. 
These are a few examples:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">bitmap_and_count == bitmap_count(bitmap_and(bitmap1, bitmap2))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">bitmap_and_not_count == bitmap_count(bitmap_not(bitmap1, bitmap_and(bitmap1, bitmap2)))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">orthogonal_bitmap_union_count == bitmap_count(bitmap_or(bitmap1, bitmap_or(bitmap2, bitmap3)))</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Query execution with one compound function is much faster than that with a chain of simple functions, as you can tell from the lengths of the flow charts:</p><p><img loading="lazy" alt="operator-merging" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_5-8ad26e082d2a60188e8928ab82192330.png" width="1280" height="396" class="img_ev3q"></p><ul><li><strong>Multiple Simple functions</strong>: This involves three function executions and two intermediate storage steps. It's a long and slow process.</li><li><strong>One compound function</strong>: Simple in and out.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-quickly-ingest-large-amounts-of-data">How to Quickly Ingest Large Amounts of Data<a href="#how-to-quickly-ingest-large-amounts-of-data" class="hash-link" aria-label="How to Quickly Ingest Large Amounts of Data的直接链接" title="How to Quickly Ingest Large Amounts of Data的直接链接"></a></h2><p>This is about putting the right workload on the right component. Apache Doris supports a variety of data loading methods. After trial and error, the user settled on Spark Load and thus decreased their data loading time by 90%. </p><p><strong>Narration from the engineers:</strong></p><p>In offline data ingestion, we used to perform most computation in Apache Hive, write the data files to HDFS, and pull data regularly from HDFS to Apache Doris. However, after Doris obtains parquet files from HDFS, it performs a series of operations on them before it can turn them into segment files: decompressing, bucketing, sorting, aggregating, and compressing. These workloads will be borne by Doris backends, which have to undertake a few bitmap operations at the same time. So there is huge pressure on the CPU. </p><p><img loading="lazy" alt="Broker-Load" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_6-10aa0935e2acd8774b0cb1f70d7013e8.png" width="1280" height="629" class="img_ev3q"></p><p>So we decided on the Spark Load method. It allows us to split the ingestion process into two parts: computation and storage, so we can move all the bucketing, sorting, aggregating, and compressing to Spark clusters. 
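</p><p>A Spark Load job is declared roughly as in the hedged sketch below; the label, HDFS path, target table, and the <code>spark0</code> resource are all assumptions for illustration:</p><pre><code class="language-SQL">-- Sketch: Spark does the heavy ETL, Doris only ingests the prepared output
LOAD LABEL example_db.user_tag_load_20230716
(
    DATA INFILE("hdfs://namenode:8020/warehouse/user_tags/*")
    INTO TABLE user_tags
    FORMAT AS "parquet"
)
WITH RESOURCE 'spark0'
(
    "spark.executor.memory" = "4g"
)
PROPERTIES
(
    "timeout" = "3600"
);
</code></pre><p>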
Then Spark writes the output to HDFS, from which Doris pulls data and flushes it to the local disks.</p><p><img loading="lazy" alt="Spark-Load" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_7-5eacf11ecef47a4bdebd2b820d1f2bd6.png" width="1280" height="372" class="img_ev3q"></p><p>When ingesting 1.2 TB data (that's 110 billion rows), the Spark Load method only took 55 minutes. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="a-vectorized-execution-engine">A Vectorized Execution Engine<a href="#a-vectorized-execution-engine" class="hash-link" aria-label="A Vectorized Execution Engine的直接链接" title="A Vectorized Execution Engine的直接链接"></a></h2><p>In addition to the above changes, a large part of the performance of a database relies on its execution engine. In the case of Apache Doris, it has fully vectorized its storage and computation layers since version 1.1. The longtime user also witnessed this revolution, so we invited them to test how the vectorized engine worked.</p><p>They compared query response time before and after the vectorization in seven of its frequent scenarios:</p><ul><li>Scenario 1: Simple user segmentation (hundreds of filtering conditions), data packaging of a multi-million user group.</li><li>Scenario 2: Complicated user segmentation (thousands of filtering conditions), data packaging of a tens-of-million user group.</li><li>Scenario 3: Multi-dimensional filtering (6 dimensions), single-table query, <strong>single-date flat table</strong>, data aggregation, 180 million rows per day.</li><li>Scenario 4: Multi-dimensional filtering (6 dimensions), single-table query, <strong>multi-date flat table</strong>, data aggregation, 180 million rows per day.</li><li>Scenario 5: <strong>Single-table query</strong>, COUNT, 180 million rows per day.</li><li>Scenario 6: <strong>Multi-table query</strong>, (Table A: 180 million rows, SUM, COUNT; Table B: 1.5 million rows, bitmap aggregation), aggregate Table A and Table B, join them with Table C, and then join the sub-tables, six joins in total.</li><li>Scenario 7: Single-table query, 500 million rows of itemized data</li></ul><p>The results are as below:</p><p><img loading="lazy" alt="performance-after-vectorization" src="https://cdnd.selectdb.com/zh-CN/assets/images/Zhihu_8-db8b7d375c494f0e806a2286ea9144b0.png" width="1280" height="591" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>In short, what contributed to the fast data loading and data queries in this case?</p><ul><li>The Colocate mechanism that's designed for distributed computing</li><li>Collaboration between database users and <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">developers</a> that enables the operator merging</li><li>Support for a wide range of data loading methods to choose from</li><li>A vectorized engine that brings overall performance increase</li></ul><p>It takes efforts from both the database developers and users to make fast performance possible. The user's experience and knowledge of their own status quo will allow them to figure out the quickest path, while a good database design will help pave the way and make users' life easier.</p>]]></content>
<author>
<name>Rong Hou</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Listen to that poor BI engineer: we need fast joins]]></title>
<id>https://doris.apache.org/zh-CN/blog/Listen-to-That-Poor-BI-Engineer-We-Need-Fast-Joins</id>
<link href="https://doris.apache.org/zh-CN/blog/Listen-to-That-Poor-BI-Engineer-We-Need-Fast-Joins"/>
<updated>2023-07-10T00:00:00.000Z</updated>
<summary type="html"><![CDATA[JOIN queries are always a hassle, but yes, you can expect fast joins from a relational database. Read this and learn how.]]></summary>
<content type="html"><![CDATA[<p>Business intelligence (BI) tool is often the last stop of a data processing pipeline. It is where data is visualized for analysts who then extract insights from it. From the standpoint of a SaaS BI provider, what are we looking for in a database? In my job, we are in urgent need of support for fast join queries.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-join-query-matters">Why JOIN Query Matters<a href="#why-join-query-matters" class="hash-link" aria-label="Why JOIN Query Matters的直接链接" title="Why JOIN Query Matters的直接链接"></a></h2><p>I work as an engineer that supports a human resource management system. One prominent selling point of our services is <strong>self-service</strong> <strong>BI</strong>. That means we allow users to customize their own dashboards: they can choose the fields they need and relate them to form the dataset as they want. </p><p><img loading="lazy" alt="self-service-BI" src="https://cdnd.selectdb.com/zh-CN/assets/images/Moka_1-6653b0bedab8b84497aad6667ab2db9c.png" width="1280" height="709" class="img_ev3q"></p><p>Join query is a more efficient way to realize self-service BI. It allows people to break down their data assets into many smaller tables instead of putting it all in a flat table. This would make data updates much faster and more cost-effective, because updating the whole flat table is not always the optimal choice when you have plenty of new data flowing in and old data being updated or deleted frequently, as is the case for most data input.</p><p>In order to maximize the time value of data, we need data updates to be executed really quickly. For this purpose, we looked into three OLAP databases on the market. They are all fast in some way but there are some differences.</p><p><img loading="lazy" alt="Apache-Doris-VS-ClickHouse-VS-Greenplum" src="https://cdnd.selectdb.com/zh-CN/assets/images/Moka_2-fe0c3aef14ac2449ef661d83ca293e8d.png" width="1280" height="627" class="img_ev3q"></p><p>Greenplum is really quick in data loading and batch DML processing, but it is not good at handling high concurrency. There is a steep decline in performance as query concurrency rises. This can be risky for a BI platform that tries to ensure stable user experience. ClickHouse is mind-blowing in single-table queries, but it only allows batch update and batch delete, so that's less timely.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="welcome-to-join-hell">Welcome to JOIN Hell<a href="#welcome-to-join-hell" class="hash-link" aria-label="Welcome to JOIN Hell的直接链接" title="Welcome to JOIN Hell的直接链接"></a></h2><p>JOIN, my old friend JOIN, is always a hassle. Join queries are demanding for both engineers and the database system. Firstly, engineers must have a thorough grasp of the schema of all tables. Secondly, these queries are resource-intensive, especially when they involve large tables. Some of the reports on our platform entail join queries across up to 20 tables. Just imagine the mess.</p><p>We tested our candidate OLAP engines with our common join queries and our most notorious slow queries. </p><p><img loading="lazy" alt="Apache-Doris-VS-ClickHouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/Moka_3-dab994e57f63d5b0b6c72b18de3a562b.png" width="1280" height="726" class="img_ev3q"></p><p>As the number of tables joined grows, we witness a widening performance gap between Apache Doris and ClickHouse. In most join queries, Apache Doris was about 5 times faster than ClickHouse. 
In terms of slow queries, Apache Doris responded to most of them in less than 1 second, while the performance of ClickHouse fluctuated within a relatively large range. </p><p>And just like that, we decided to upgrade our data architecture with Apache Doris. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="architecture-that-supports-our-bi-services">Architecture that Supports Our BI Services<a href="#architecture-that-supports-our-bi-services" class="hash-link" aria-label="Architecture that Supports Our BI Services的直接链接" title="Architecture that Supports Our BI Services的直接链接"></a></h2><p><strong>Data Input:</strong> </p><p>Our business data flows into DBLE, a distributed middleware based on MySQL. Then the DBLE binlogs are written into Flink, getting deduplicated, merged, and then put into Kafka. Finally, Apache Doris reads data from Kafka via its Routine Load approach. We apply the "delete" configuration in Routine Load to enable real-time deletion. The combination of Apache Flink and the idempotent write mechanism of Apache Doris is how we get exactly-once guarantee. We have a data size of billions of rows per table, and this architecture is able to finish data updates in one minute. </p><p>In addition, taking advantage of Apache Kafka and the Routine Load method, we are able to shave the traffic peaks and maintain cluster stability. Kafka also allows us to have multiple consumers of data and recompute intermediate data by resetting the offsets.</p><p><strong>Data Output</strong>: </p><p>As a self-service BI platform, we allow users to customize their own reports by configuring the rows, columns, and filters as they need. This is supported by Apache Doris with its capabilities in join queries. </p><p>In total, we have 400 data tables, of which 50 have over 100 million rows. That adds up to a data size measured in TB. We put all our data into two Doris clusters on 40 servers.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="no-longer-stalled-by-privileged-access-queries">No Longer Stalled by Privileged Access Queries<a href="#no-longer-stalled-by-privileged-access-queries" class="hash-link" aria-label="No Longer Stalled by Privileged Access Queries的直接链接" title="No Longer Stalled by Privileged Access Queries的直接链接"></a></h2><p>On our BI platform, privileged queries are often much slower than non-privileged queries. Timeouts are common, even more so for queries on large datasets.</p><p>Human resource data is subject to very strict and fine-grained access control policies. The role and position of users and the confidentiality level of data determine who has access to what (the access granularity here goes down to individual fields in a table). Occasionally, we need to separately grant a certain privilege to a particular person. On top of that, we need to ensure data isolation between the multiple tenants on our platform.</p><p>How does all this add to complexity in engineering? Any user who inputs a query on our BI platform must go through multi-factor authentication, and the authenticated information will all be inserted into the SQL via <code>in</code> and then passed on to the OLAP engine. Therefore, the more fine-grained the privilege controls are, the longer the SQL will be, and the more time the OLAP system will spend on ID filtering. 
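</p><p>To illustrate, a privileged query ends up looking something like the hedged sketch below; the table, columns, and ID values are hypothetical, and the second statement previews the fix described next:</p><pre><code class="language-SQL">-- Sketch: authenticated user IDs are injected into the SQL as a long IN list
SELECT department, COUNT(*) AS headcount
FROM hr_report
WHERE user_id IN (10001, 10002, 10003 /* ... thousands more IDs ... */)
GROUP BY department;

-- Adding a Bloom Filter index on the high-cardinality ID column speeds up such filtering
ALTER TABLE hr_report SET ("bloom_filter_columns" = "user_id");
</code></pre><p>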
That's why our users are often tortured by high latency.</p><p><img loading="lazy" alt="privileged-access-queries" src="https://cdnd.selectdb.com/zh-CN/assets/images/Moka_4-64db81a5dd0659c2fe09805142c25b39.png" width="1396" height="650" class="img_ev3q"></p><p>So how did we fix that? We use the <a href="https://doris.apache.org/docs/dev/data-table/index/bloomfilter/" target="_blank" rel="noopener noreferrer">Bloom Filter index</a> in Apache Doris. </p><p><img loading="lazy" alt="BloomFilter-index" src="https://cdnd.selectdb.com/zh-CN/assets/images/Moka_5-666c3e530937abfa6243f0f3bb1f645c.png" width="1280" height="118" class="img_ev3q"></p><p>By adding Bloom Filter indexes to the relevant ID fields, we improved the speed of privileged queries by 30% and basically eliminated timeout errors.</p><p><img loading="lazy" alt="faster-privileged-access-queries" src="https://cdnd.selectdb.com/zh-CN/assets/images/Moka_6-946cd1d988bc4d2cd18f580775cb89a7.png" width="1852" height="863" class="img_ev3q"></p><p>Tips on when you should use the Bloom Filter index:</p><ul><li>For non-prefix filtering</li><li>For <code>in</code> and <code>=</code> filters on a particular column</li><li>For filtering on high-cardinality columns, such as UserID. In essence, the Bloom Filter index is used to check if a certain value exists in a dataset. There is no point in using the Bloom Filter index for a low-cardinality column, like "gender", for example, because almost every data block contains all the gender values.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="to-all-bi-engineers">To All BI Engineers<a href="#to-all-bi-engineers" class="hash-link" aria-label="To All BI Engineers的直接链接" title="To All BI Engineers的直接链接"></a></h2><p>We believe self-service BI is the future in the BI landscape, just like AGI is the future for artificial intelligence. Fast join queries are the way towards it, and the foregoing architectural upgrade is part of our ongoing effort to empower that. May there be less painful JOINs in the BI world. Cheers.</p><p>Find the Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a></p>]]></content>
<author>
<name>Baoming Zhang</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Replacing Apache Hive, Elasticsearch and PostgreSQL with Apache Doris]]></title>
<id>https://doris.apache.org/zh-CN/blog/Replacing-Apache-Hive-Elasticsearch-and-PostgreSQL-with-Apache-Doris</id>
<link href="https://doris.apache.org/zh-CN/blog/Replacing-Apache-Hive-Elasticsearch-and-PostgreSQL-with-Apache-Doris"/>
<updated>2023-07-01T00:00:00.000Z</updated>
<summary type="html"><![CDATA[How does a data service company build its data warehouse? Simplicity is the best policy. See how a due diligence platform increased data writing efficiency by 75%.]]></summary>
<content type="html"><![CDATA[<p>How does a data service company build its data warehouse? I worked as a real-time computing engineer for a due diligence platform, which is designed to allow users to search for a company's business, financial, and legal details. It has collected information on over 300 million entities in more than 300 dimensions. My colleagues and I are responsible for ensuring real-time updates of such data, so we can provide up-to-date information for our registered users. That's the customer-facing function of our data warehouse. Other than that, it needs to support our internal marketing and operation teams in ad-hoc queries and user segmentation, which is a new demand that emerged with our growing business. </p><p>Our old data warehouse consisted of the most popular components of the time, including <strong>Apache</strong> <strong>Hive</strong>, <strong>MySQL</strong>, <strong>Elasticsearch</strong>, and <strong>PostgreSQL</strong>. They supported the data computing and data storage layers of our data warehouse: </p><ul><li><strong>Data Computing</strong>: Apache Hive serves as the computation engine.</li><li><strong>Data Storage</strong>: <strong>MySQL</strong> provides data for DataBank, Tableau, and our customer-facing applications. <strong>Elasticsearch</strong> and <strong>PostgreSQL</strong> serve our DMP user segmentation system: the former stores user profiling data, and the latter stores user group data packets. </li></ul><p>As you can imagine, such a long and complicated data pipeline is high-maintenance and detrimental to development efficiency. Moreover, it was not capable of ad-hoc queries. So, as an upgrade to our data warehouse, we replaced most of these components with <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a>, a unified analytic database.</p><p><img loading="lazy" alt="replace-MySQL-Elasticsearch-PostgreSQL-with-Apache-Doris-before" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_1-9cc7124fc979257cf029e086ce018e78.png" width="1280" height="640" class="img_ev3q"></p><p><img loading="lazy" alt="replace-MySQL-Elasticsearch-PostgreSQL-with-Apache-Doris-after" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_2-56765f2ef0a2d26069c3cd115e694882.png" width="1280" height="548" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-flow">Data Flow<a href="#data-flow" class="hash-link" aria-label="Data Flow的直接链接" title="Data Flow的直接链接"></a></h2><p>This is a lateral view of our data warehouse, from which you can see how the data flows.</p><p><img loading="lazy" alt="data-flow" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_3-733959d2cc60e873ec5b3b9fc06d9e0e.png" width="1280" height="489" class="img_ev3q"></p><p>For starters, binlogs from MySQL are ingested into Kafka via Canal, while user activity logs are transferred to Kafka via Apache Flume. In Kafka, data is cleaned and organized into flat tables, which are later turned into aggregated tables. Then, data is passed from Kafka to Apache Doris, which serves as the storage and computing engine.</p>
<p>We adopt different data models in Apache Doris for different scenarios: data from MySQL is arranged in the <a href="https://doris.apache.org/docs/dev/data-table/data-model/#unique-model" target="_blank" rel="noopener noreferrer">Unique model</a>, log data is put in the <a href="https://doris.apache.org/docs/dev/data-table/data-model/#duplicate-model" target="_blank" rel="noopener noreferrer">Duplicate model</a>, and data in the DWS layer is merged in the <a href="https://doris.apache.org/docs/dev/data-table/data-model/#aggregate-model" target="_blank" rel="noopener noreferrer">Aggregate model</a>.</p>
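<p>As a minimal, hypothetical sketch of the first of these (all column names are invented for illustration), a Unique model table keyed on the entity ID lets rows derived from MySQL binlogs overwrite earlier versions of the same entity:</p><pre><code>-- Hypothetical example: Unique Key model, so ingested rows replace
-- earlier rows that share the same key.
CREATE TABLE dwd_company_profile (
    company_id BIGINT NOT NULL,
    company_name VARCHAR(256),
    update_time DATETIME
)
UNIQUE KEY(company_id)
DISTRIBUTED BY HASH(company_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");
</code></pre>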
<p>This is how Apache Doris replaced the roles of Hive, Elasticsearch, and PostgreSQL in our data warehouse. The transformation has saved us a great deal of development and maintenance effort. It has also made ad-hoc queries possible and our user segmentation more efficient. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ad-hoc-queries">Ad-Hoc Queries<a href="#ad-hoc-queries" class="hash-link" aria-label="Ad-Hoc Queries的直接链接" title="Ad-Hoc Queries的直接链接"></a></h2><p><strong>Before</strong>: Every time a new request was raised, we developed and tested the data model in Hive, and wrote the scheduling task in MySQL so that our customer-facing application platforms could read results from MySQL. It was a complicated process that took a lot of time and development work. </p><p><strong>After</strong>: Since Apache Doris has all the itemized data, whenever it is faced with a new request, it can simply pull the metadata and configure the query conditions. Then it is ready for ad-hoc queries. In short, it only requires low-code configuration to respond to new requests. </p><p><img loading="lazy" alt="ad-hoc-queries" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_4-9a9132537dbc478b0aa9948131184564.png" width="1280" height="712" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="user-segmentation">User Segmentation<a href="#user-segmentation" class="hash-link" aria-label="User Segmentation的直接链接" title="User Segmentation的直接链接"></a></h2><p><strong>Before</strong>: After a user segmentation task was created based on metadata, the relevant user IDs would be written into the PostgreSQL profile list and the MySQL task list. Meanwhile, Elasticsearch would execute the query according to the task conditions; after the results were produced, it would update the status in the task list and write the user group bitmap package into PostgreSQL. (The PostgreSQL plug-in is capable of computing the intersection, union, and difference set of bitmaps.) Then PostgreSQL would provide user group packets for downstream operation platforms.</p><p>Tables in Elasticsearch and PostgreSQL were not reusable, making this architecture cost-inefficient. Plus, we had to pre-define the user tags before we could execute a new type of query. That slowed things down. </p><p><strong>After</strong>: The user IDs are only written into the MySQL task list. For first-time segmentation, Apache Doris executes the <strong>ad-hoc query</strong> based on the task conditions. In subsequent segmentation tasks, Apache Doris performs <strong>micro-batch rolling</strong>, computes the difference set compared with the previously produced user group packet, and notifies downstream platforms of any updates. (This is realized by the <a href="https://doris.apache.org/docs/dev/sql-manual/sql-functions/bitmap-functions/bitmap_union" target="_blank" rel="noopener noreferrer">bitmap functions</a> in Apache Doris.) </p><p>In this Doris-centered user segmentation process, we don't have to pre-define new tags. Instead, tags can be auto-generated based on the task conditions. This gives the processing pipeline the flexibility to make our user-group-based A/B testing easier. Also, as both the itemized data and the user group packets are in Apache Doris, we don't have to attend to the read and write complexity between multiple components.</p><p><img loading="lazy" alt="user-segmentation-pipeline" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_5-82288dba1ffdb438be29168a2eafd7f9.png" width="1280" height="688" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="trick-to-speed-up-user-segmentation-by-70">Trick to Speed up User Segmentation by 70%<a href="#trick-to-speed-up-user-segmentation-by-70" class="hash-link" aria-label="Trick to Speed up User Segmentation by 70%的直接链接" title="Trick to Speed up User Segmentation by 70%的直接链接"></a></h2><p>For risk aversion reasons, many companies generate <code>user_id</code> values randomly, but that produces sparse and non-consecutive user IDs in user group packets. Using these IDs in user segmentation, we had to endure a long wait for bitmaps to be generated. </p><p>To solve that, we created consecutive and dense mappings for these user IDs. <strong>In this way, we decreased our user segmentation latency by 70%.</strong></p><p><img loading="lazy" alt="user-segmentation-latency-1" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_6-22694f7b8d5e06aa2c8c4757c52c8c05.png" width="1030" height="218" class="img_ev3q"></p><p><img loading="lazy" alt="user-segmentation-latency-2" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_7-e5d5d3312ade5d026533922a01207660.png" width="1280" height="698" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="example">Example<a href="#example" class="hash-link" aria-label="Example的直接链接" title="Example的直接链接"></a></h3><p><strong>Step 1: Create a user ID mapping table:</strong></p><p>We adopt the Unique model for user ID mapping tables, where the user ID is the unique key. The mapped consecutive IDs usually start from 1 and are strictly increasing. </p><p><img loading="lazy" alt="create-user-ID-mapping-table" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_8-74c77b6500d66dfb6aa2fc8ba742868c.png" width="1280" height="540" class="img_ev3q"></p><p><strong>Step 2: Create a user group table:</strong></p><p>We adopt the Aggregate model for user group tables, where user tags serve as the aggregation keys. </p><p><img loading="lazy" alt="create-user-group-table" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_9-76a30c385266aadc57e8ab898cc53bce.png" width="1280" height="604" class="img_ev3q"></p>
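<p>The exact table definitions are in the screenshots above; the following is only a hedged sketch of the same idea, with hypothetical column names, to make the mapping step concrete:</p><pre><code>-- Hypothetical sketch: map sparse, random user IDs to dense, consecutive ones.
CREATE TABLE tyc_user_id_mapping (
    tyc_user_id BIGINT NOT NULL,
    tyc_user_id_continuous BIGINT
)
UNIQUE KEY(tyc_user_id)
DISTRIBUTED BY HASH(tyc_user_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

-- Dense IDs compress far better into bitmaps, so a segmentation query
-- like this one (counting a user group) runs much faster on the mapped IDs.
SELECT bitmap_count(bitmap_union(to_bitmap(tyc_user_id_continuous)))
FROM tyc_user_id_mapping
WHERE tyc_user_id_continuous BETWEEN 0 AND 2000000;
</code></pre>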
<p>Supposing that we need to pick out the users whose IDs are between 0 and 2000000, the following snippets use non-consecutive (<code>tyc_user_id</code>) and consecutive (<code>tyc_user_id_continuous</code>) user IDs for user segmentation, respectively. There is a big gap between their <strong>response times:</strong></p><ul><li>Non-Consecutive User IDs: <strong>1843ms</strong></li><li>Consecutive User IDs: <strong>543ms</strong> </li></ul><p><img loading="lazy" alt="response-time-of-consecutive-and-non-consecutive-user-IDs" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_10-c239e3a39b72d21c1d65fc74858b36a3.png" width="1920" height="736" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>We have 2 Apache Doris clusters accommodating tens of TBs of data, with almost a billion new rows flowing in every day. We used to witness a steep decline in data ingestion speed as the data volume expanded, but after upgrading our data warehouse with Apache Doris, we increased our data writing efficiency by 75%. Also, in user segmentation with a result set of less than 5 million users, it is able to respond within milliseconds. Most importantly, our data warehouse is now simpler and friendlier to developers and maintainers. </p><p><img loading="lazy" alt="user-segmentation-latency-3" src="https://cdnd.selectdb.com/zh-CN/assets/images/Tianyancha_11-3fe828cadbc9a5972a82bbbd2a0b473e.png" width="1280" height="667" class="img_ev3q"></p><p>Lastly, I would like to share with you what interested us most when we first talked to the <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Apache Doris community</a>:</p><ul><li>Apache Doris supports data ingestion transactions, so it can ensure data is written <strong>exactly once</strong>.</li><li>It is well integrated with the data ecosystem and can smoothly interface with most data sources and data formats.</li><li>It allows us to implement elastic scaling of clusters using the command line interface.</li><li>It outperforms ClickHouse in <strong>join queries</strong>.</li></ul><p>Find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog" target="_blank" rel="noopener noreferrer">Slack</a></p>]]></content>
<author>
<name>Tao Wang</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Tiered storage for hot and cold data: what, why, and how?]]></title>
<id>https://doris.apache.org/zh-CN/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How</id>
<link href="https://doris.apache.org/zh-CN/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How"/>
<updated>2023-06-23T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Hot data is the frequently accessed data, while cold data is the one you seldom visit but still need. Separating them is for higher efficiency in computation and storage.]]></summary>
<content type="html"><![CDATA[<p>Apparently tiered storage is hot now. But first of all:</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-is-hotcold-data">What is Hot/Cold Data?<a href="#what-is-hotcold-data" class="hash-link" aria-label="What is Hot/Cold Data?的直接链接" title="What is Hot/Cold Data?的直接链接"></a></h2><p>In simple terms, hot data is the frequently accessed data, while cold data is the data you seldom visit but still need. Normally in data analytics, data is "hot" when it is new, and gets "colder" and "colder" as time goes by. </p><p>For example, orders of the past six months are "hot" and logs from years ago are "cold". But no matter how cold the logs are, you still need them to be somewhere you can find them. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="why-separate-hot-and-cold-data">Why Separate Hot and Cold Data?<a href="#why-separate-hot-and-cold-data" class="hash-link" aria-label="Why Separate Hot and Cold Data?的直接链接" title="Why Separate Hot and Cold Data?的直接链接"></a></h2><p>Tiered storage is an idea often seen in real life: You put your favorite book on your bedside table, your Christmas ornament in the attic, and your childhood art project in the garage or a cheap self-storage space on the other side of town. The purpose is a tidy and efficient life.</p><p>Similarly, companies separate hot and cold data for more efficient computation and more cost-effective storage, because storage that allows quick reads and writes, like SSD, is always expensive, while HDD is cheaper but slower. So it is more sensible to put hot data on SSD and cold data on HDD. If you are looking for an even lower-cost option, you can go for object storage.</p><p>In data analytics, hot-cold separation is implemented by a tiered storage mechanism in the database. For example, Apache Doris supports three-tiered storage: SSD, HDD, and object storage. Newly ingested data turns from hot into cold after a specified cooldown period and is then moved to object storage. In addition, object storage only preserves one copy of the data, which further cuts down storage costs and the relevant computation/network overheads.</p><p><img loading="lazy" alt="tiered-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/HCDS_1-4bae127a675df686b358d72a242fc193.png" width="1528" height="722" class="img_ev3q"></p><p>How much can you save by tiered storage? Here is some math.</p><p>In public cloud services, cloud disks generally cost 5~10 times as much as object storage.
If 80% of your data asset is cold data and you put it in object storage instead of cloud disks, you can expect a cost reduction of around 70%.</p><p>Let the percentage of cold data be "rate", the price of object storage be "OS", and the price of cloud disks be "CloudDisk"; this is how much you can save by tiered storage instead of putting all your data on cloud disks: </p><p><img loading="lazy" alt="cost-calculation-of-tiered-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/HCDS_2-c03620863295956acc180ad3a590bcf6.png" width="1532" height="226" class="img_ev3q"></p><p>Now let's put real-world numbers into this formula: </p><p>AWS pricing, US East (Ohio):</p><ul><li><strong>S3 Standard Storage</strong>: 23 USD per TB per month</li><li><strong>Throughput Optimized HDD (st1)</strong>: 102 USD per TB per month</li><li><strong>General Purpose SSD (gp2)</strong>: 158 USD per TB per month</li></ul><p><img loading="lazy" alt="cost-reduction-by-tiered-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/HCDS_3-631232bc5ad459445ffd5c51313a23d2.png" width="1280" height="590" class="img_ev3q"></p>
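<p>As a quick sanity check (assuming the formula above compares an all-cloud-disk deployment against a tiered one, with 80% cold data): moving the cold 80% from st1 HDD to S3 saves about 0.8 × (102 − 23) / 102 ≈ 62% of the storage bill, and moving it off gp2 SSD saves about 0.8 × (158 − 23) / 158 ≈ 68%, which is where the "around 70%" estimate comes from.</p>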
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-is-tiered-storage-implemented">How Is Tiered Storage Implemented?<a href="#how-is-tiered-storage-implemented" class="hash-link" aria-label="How Is Tiered Storage Implemented?的直接链接" title="How Is Tiered Storage Implemented?的直接链接"></a></h2><p>Till now, hot-cold separation sounds nice, but the biggest concern is: how can we implement it without compromising query performance? This can be broken down into three questions:</p><ul><li>How to enable quick reading of cold data?</li><li>How to ensure high availability of data?</li><li>How to reduce I/O and CPU overheads?</li></ul><p>In what follows, I will show you how Apache Doris addresses them one by one.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="quick-reading-of-cold-data">Quick Reading of Cold Data<a href="#quick-reading-of-cold-data" class="hash-link" aria-label="Quick Reading of Cold Data的直接链接" title="Quick Reading of Cold Data的直接链接"></a></h3><p>Accessing cold data from object storage will indeed be slow. One solution is to cache cold data on local disks for use in queries. In Apache Doris 2.0, when a query requests cold data, only the first-time access entails a full network I/O operation from object storage. Subsequent queries can read the data directly from the local cache.</p><p>The granularity of caching matters, too. A coarse granularity might lead to a waste of cache space, while a fine granularity could lead to low I/O efficiency. Apache Doris bases its caching on data blocks: it downloads cold data blocks from object storage onto the local Block Cache. This is the "pre-heating" process. With cold data fully pre-heated, queries on tables with tiered storage are basically as fast as those on tables without. We drew this conclusion from test results on Apache Doris:</p><p><img loading="lazy" alt="query-performance-with-tiered-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/HCDS_4-87151098d2bd6aa0e665ba9766dc8e19.png" width="1280" height="854" class="img_ev3q"></p><ul><li><strong>Test Data</strong>: SSB SF100 dataset</li><li><strong>Configuration</strong>: 3 × 16C 64G, a cluster of 1 frontend and 3 backends</li></ul><p>P.S. Block Cache adopts the LRU algorithm, so the more frequently accessed data stays in Block Cache longer.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="high-availability-of-data">High Availability of Data<a href="#high-availability-of-data" class="hash-link" aria-label="High Availability of Data的直接链接" title="High Availability of Data的直接链接"></a></h3><p>In object storage, only one copy of cold data is preserved. Within Apache Doris, hot data and metadata are put on the backend nodes, with multiple replicas across different backend nodes in order to ensure high data availability. These replicas are called "local replicas". The metadata of cold data is synchronized to all local replicas, so Doris can ensure high availability of cold data without using too much storage space.</p><p>Implementation-wise, the Doris frontend picks a local replica as the Leader. Updates to the Leader are synchronized to all other local replicas via a regular report mechanism. Also, as the Leader uploads data to object storage, the relevant metadata is updated on the other local replicas, too.</p><p><img loading="lazy" alt="data-availability-with-tiered-storage" src="https://cdnd.selectdb.com/zh-CN/assets/images/HCDS_5-91d506621f633b43cc8fdc41fb7c9aaa.png" width="1280" height="1041" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="reduced-io-and-cpu-overhead">Reduced I/O and CPU Overhead<a href="#reduced-io-and-cpu-overhead" class="hash-link" aria-label="Reduced I/O and CPU Overhead的直接链接" title="Reduced I/O and CPU Overhead的直接链接"></a></h3><p>This is realized by cold data <a href="https://medium.com/gitconnected/understanding-data-compaction-in-3-minutes-d2b5a1f7446f" target="_blank" rel="noopener noreferrer">compaction</a>. Some scenarios require large-scale updates of historical data, in which case part of the cold data in object storage should be deleted. Apache Doris 2.0 supports cold data compaction, which ensures that updated cold data is reorganized and compacted so that it takes up less storage space.</p><p>A thread in the Doris backend regularly picks N tablets from the cold data and starts compaction. Every tablet has a CooldownReplica, and only the CooldownReplica executes cold data compaction for the tablet. Every time 5MB of data is compacted, it is uploaded to object storage to clear up space locally. Once the compaction is done, the CooldownReplica updates the new metadata to object storage. Other replicas only need to synchronize the metadata from object storage. This is how I/O and CPU overheads are reduced.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="tutorial">Tutorial<a href="#tutorial" class="hash-link" aria-label="Tutorial的直接链接" title="Tutorial的直接链接"></a></h2><p>Hot-cold data separation in storage is a huge cost saver, and there are ways to keep query performance intact. Executing hot-cold data separation is a simple six-step process, so you can find out how it works yourself:</p><p>To begin with, you need <strong>an object storage bucket</strong> and the relevant <strong>Access Key (AK)</strong> and <strong>Secret Access Key (SK)</strong>.</p><p>Then you can start cold/hot data separation by following these six steps.</p>
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-create-resource">1. Create Resource<a href="#1-create-resource" class="hash-link" aria-label="1. Create Resource的直接链接" title="1. Create Resource的直接链接"></a></h3><p>You can create a resource using the object storage bucket with the AK and SK. Apache Doris supports object storage on various cloud service providers including AWS, Azure, and Alibaba Cloud.</p><pre><code>CREATE RESOURCE IF NOT EXISTS "${resource_name}"
PROPERTIES(
    "type"="s3",
    "s3.endpoint" = "${S3Endpoint}",
    "s3.region" = "${S3Region}",
    "s3.root.path" = "path/to/root",
    "s3.access_key" = "${S3AK}",
    "s3.secret_key" = "${S3SK}",
    "s3.connection.maximum" = "50",
    "s3.connection.request.timeout" = "3000",
    "s3.connection.timeout" = "1000",
    "s3.bucket" = "${S3BucketName}"
);
</code></pre><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-create-storage-policy">2. Create Storage Policy<a href="#2-create-storage-policy" class="hash-link" aria-label="2. Create Storage Policy的直接链接" title="2. Create Storage Policy的直接链接"></a></h3><p>With the Storage Policy, you can specify the cooling-down period of data (including the absolute cooling-down period and the relative cooling-down period).</p><pre><code>CREATE STORAGE POLICY testPolicy
PROPERTIES(
    "storage_resource" = "remote_s3",
    "cooldown_ttl" = "1d"
);
</code></pre><p>In the above snippet, the Storage Policy is named <code>testPolicy</code>, and data will start to cool down one day after it is ingested. The cold data will be moved under the <code>root path</code> of the object storage <code>remote_s3</code>. Apart from setting the TTL, you can also specify the timepoint when the cooling down starts.</p><pre><code>CREATE STORAGE POLICY testPolicyForTTlDatatime
PROPERTIES(
    "storage_resource" = "remote_s3",
    "cooldown_datetime" = "2023-06-07 21:00:00"
);
</code></pre><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-specify-storage-policy-for-a-tablepartition">3. Specify Storage Policy for a Table/Partition<a href="#3-specify-storage-policy-for-a-tablepartition" class="hash-link" aria-label="3. Specify Storage Policy for a Table/Partition的直接链接" title="3. Specify Storage Policy for a Table/Partition的直接链接"></a></h3><p>With an established Resource and a Storage Policy, you can set a Storage Policy for a data table or a specific data partition.</p><p>The following snippet uses the lineitem table in the TPC-H dataset as an example. To set a Storage Policy for the whole table, specify the PROPERTIES as follows:</p><pre><code>CREATE TABLE IF NOT EXISTS lineitem1 (
    L_ORDERKEY INTEGER NOT NULL,
    L_PARTKEY INTEGER NOT NULL,
    L_SUPPKEY INTEGER NOT NULL,
    L_LINENUMBER INTEGER NOT NULL,
    L_QUANTITY DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT DECIMAL(15,2) NOT NULL,
    L_TAX DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG CHAR(1) NOT NULL,
    L_LINESTATUS CHAR(1) NOT NULL,
    L_SHIPDATE DATEV2 NOT NULL,
    L_COMMITDATE DATEV2 NOT NULL,
    L_RECEIPTDATE DATEV2 NOT NULL,
    L_SHIPINSTRUCT CHAR(25) NOT NULL,
    L_SHIPMODE CHAR(10) NOT NULL,
    L_COMMENT VARCHAR(44) NOT NULL
)
DUPLICATE KEY(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER)
PARTITION BY RANGE(`L_SHIPDATE`)
(
    PARTITION `p202301` VALUES LESS THAN ("2017-02-01"),
    PARTITION `p202302` VALUES LESS THAN ("2017-03-01")
)
DISTRIBUTED BY HASH(L_ORDERKEY) BUCKETS 3
PROPERTIES (
    "replication_num" = "3",
    "storage_policy" = "${policy_name}"
)
</code></pre><p>You can check the Storage Policy of a tablet via the <code>show tablets</code> command. If the <code>CooldownReplicaId</code> is anything other than <code>-1</code> and the <code>CooldownMetaId</code> is not null, the current tablet has been assigned a Storage Policy.</p><pre><code>               TabletId: 3674797
              ReplicaId: 3674799
              BackendId: 10162
             SchemaHash: 513232100
                Version: 1
      LstSuccessVersion: 1
       LstFailedVersion: -1
          LstFailedTime: NULL
          LocalDataSize: 0
         RemoteDataSize: 0
               RowCount: 0
                  State: NORMAL
LstConsistencyCheckTime: NULL
           CheckVersion: -1
           VersionCount: 1
              QueryHits: 0
               PathHash: 8030511811695924097
                MetaUrl: http://172.16.0.16:6781/api/meta/header/3674797
       CompactionStatus: http://172.16.0.16:6781/api/compaction/show?tablet_id=3674797
      CooldownReplicaId: 3674799
         CooldownMetaId: TUniqueId(hi:-8987737979209762207, lo:-2847426088899160152)
</code></pre><p>To set a Storage Policy for a specific partition, add the policy name to the partition PROPERTIES as follows:</p><pre><code>CREATE TABLE IF NOT EXISTS lineitem1 (
    L_ORDERKEY INTEGER NOT NULL,
    L_PARTKEY INTEGER NOT NULL,
    L_SUPPKEY INTEGER NOT NULL,
    L_LINENUMBER INTEGER NOT NULL,
    L_QUANTITY DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT DECIMAL(15,2) NOT NULL,
    L_TAX DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG CHAR(1) NOT NULL,
    L_LINESTATUS CHAR(1) NOT NULL,
    L_SHIPDATE DATEV2 NOT NULL,
    L_COMMITDATE DATEV2 NOT NULL,
    L_RECEIPTDATE DATEV2 NOT NULL,
    L_SHIPINSTRUCT CHAR(25) NOT NULL,
    L_SHIPMODE CHAR(10) NOT NULL,
    L_COMMENT VARCHAR(44) NOT NULL
)
DUPLICATE KEY(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER)
PARTITION BY RANGE(`L_SHIPDATE`)
(
    PARTITION `p202301` VALUES LESS THAN ("2017-02-01") ("storage_policy" = "${policy_name}"),
    PARTITION `p202302` VALUES LESS THAN ("2017-03-01")
)
DISTRIBUTED BY HASH(L_ORDERKEY) BUCKETS 3
PROPERTIES (
    "replication_num" = "3"
)
</code></pre><p><strong>This is how you can confirm that only the target partition is set with a Storage Policy:</strong></p><p>In the above example, Table Lineitem1 has 2 partitions, each partition has 3 buckets, and <code>replication_num</code> is set to "3". That means there are 2 × 3 = 6 tablets and 6 × 3 = 18 replicas in total.</p><p>Now, if you check the replica information of all tablets via the <code>show tablets</code> command, you will see that only the replicas of tablets of the target partition have a CooldownReplicaId and a CooldownMetaId. (For a clear comparison, you can check the replica information of a specific partition via the <code>ADMIN SHOW REPLICA STATUS FROM TABLE PARTITION(PARTITION)</code> command.)</p><p>For instance, Tablet 3691990 belongs to Partition p202301, which is the target partition, so the 3 replicas of this tablet have a CooldownReplicaId and a CooldownMetaId:</p><pre><code>*****************************************************************
         TabletId: 3691990
        ReplicaId: 3691991
CooldownReplicaId: 3691993
   CooldownMetaId: TUniqueId(hi:-7401335798601697108, lo:3253711199097733258)
*****************************************************************
         TabletId: 3691990
        ReplicaId: 3691992
CooldownReplicaId: 3691993
   CooldownMetaId: TUniqueId(hi:-7401335798601697108, lo:3253711199097733258)
*****************************************************************
         TabletId: 3691990
        ReplicaId: 3691993
CooldownReplicaId: 3691993
   CooldownMetaId: TUniqueId(hi:-7401335798601697108, lo:3253711199097733258)
</code></pre><p>Also, the above snippet shows that all these 3 replicas have been specified with the same CooldownReplica: 3691993, so only the data in Replica 3691993 will be stored in the Resource.</p>
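<p>As a side note, a Storage Policy does not have to be fixed at table creation. The following is a minimal sketch (reusing the policy and table names from the examples above) of attaching a policy to an existing table; check the exact syntax against the Doris version in use:</p><pre><code>-- Hypothetical example: apply the Storage Policy created in Step 2
-- to an existing table instead of declaring it at CREATE TABLE time.
ALTER TABLE lineitem1 SET ("storage_policy" = "testPolicy");
</code></pre>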
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="4-view-tablet-details">4. View Tablet Details<a href="#4-view-tablet-details" class="hash-link" aria-label="4. View Tablet Details的直接链接" title="4. View Tablet Details的直接链接"></a></h3><p>You can view the detailed information of Table Lineitem1 via a <code>show tablets from lineitem1</code> command. Among all the properties, <code>LocalDataSize</code> represents the size of locally stored data and <code>RemoteDataSize</code> represents the size of cold data in object storage.</p><p>For example, when the data is newly ingested into the Doris backends, you can see that all data is stored locally:</p><pre><code>*************************** 1. row ***************************
               TabletId: 2749703
              ReplicaId: 2749704
              BackendId: 10090
             SchemaHash: 1159194262
                Version: 3
      LstSuccessVersion: 3
       LstFailedVersion: -1
          LstFailedTime: NULL
          LocalDataSize: 73001235
         RemoteDataSize: 0
               RowCount: 1996567
                  State: NORMAL
LstConsistencyCheckTime: NULL
           CheckVersion: -1
           VersionCount: 3
              QueryHits: 0
               PathHash: -8567514893400420464
                MetaUrl: http://172.16.0.8:6781/api/meta/header/2749703
       CompactionStatus: http://172.16.0.8:6781/api/compaction/show?tablet_id=2749703
      CooldownReplicaId: 2749704
         CooldownMetaId:
</code></pre><p>When the data has cooled down, you will see that it has been moved to remote object storage:</p><pre><code>*************************** 1. row ***************************
               TabletId: 2749703
              ReplicaId: 2749704
              BackendId: 10090
             SchemaHash: 1159194262
                Version: 3
      LstSuccessVersion: 3
       LstFailedVersion: -1
          LstFailedTime: NULL
          LocalDataSize: 0
         RemoteDataSize: 73001235
               RowCount: 1996567
                  State: NORMAL
LstConsistencyCheckTime: NULL
           CheckVersion: -1
           VersionCount: 3
              QueryHits: 0
               PathHash: -8567514893400420464
                MetaUrl: http://172.16.0.8:6781/api/meta/header/2749703
       CompactionStatus: http://172.16.0.8:6781/api/compaction/show?tablet_id=2749703
      CooldownReplicaId: 2749704
         CooldownMetaId: TUniqueId(hi:-8697097432131255833, lo:9213158865768502666)
</code></pre><p>You can also check your cold data from the object storage side by finding the data files under the path specified in the Storage Policy.</p><p>Data in object storage only has a single copy.</p><p><img loading="lazy" src="https://miro.medium.com/v2/resize:fit:1400/1*jao2TrbhDI2h6S04W0x95Q.png" alt="img" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="5-execute-queries">5. Execute Queries<a href="#5-execute-queries" class="hash-link" aria-label="5. Execute Queries的直接链接" title="5. Execute Queries的直接链接"></a></h3><p>When all data in Table Lineitem1 has been moved to object storage and a query requests data from it, Apache Doris follows the root path specified in the Storage Policy of the relevant data partition and downloads the requested data for local computation.</p><p>Apache Doris 2.0 has been optimized for cold data queries: only the first-time access to cold data entails a full network I/O operation from object storage. After that, the downloaded data is put in the cache for subsequent queries, so as to improve query speed.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="6-update-cold-data">6. Update Cold Data<a href="#6-update-cold-data" class="hash-link" aria-label="6. Update Cold Data的直接链接" title="6. Update Cold Data的直接链接"></a></h3><p>In Apache Doris, each data ingestion leads to the generation of a new Rowset, so updates of historical data are put in a Rowset separate from those of newly loaded data. That's how Doris makes sure the updating of cold data does not interfere with the ingestion of hot data. Once the rowsets cool down, they are moved to S3 and deleted locally, and the updated historical data goes to the partition where it belongs.</p><p>If you have any questions, come find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>. We will be happy to provide targeted support.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.5]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.5</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.5"/>
<updated>2023-06-18T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, the Apache Doris team has delivered nearly 210 bug fixes and performance improvements in version 1.2.5 compared to the previous version]]></summary>
<content type="html"><![CDATA[<p>In version 1.2.5, the Doris team has delivered nearly 210 bug fixes and performance improvements since the release of version 1.2.4. Version 1.2.5 is an iterative release on top of version 1.2.4 with higher stability, and we recommend that all users upgrade to it.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior Changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior Changed的直接链接" title="Behavior Changed的直接链接"></a></h2><ul><li><p>The <code>start_be.sh</code> script now checks that the maximum number of file handles in the system is greater than or equal to 65536; otherwise, startup will fail.</p></li><li><p>The BE configuration item <code>enable_quick_compaction</code> is set to true by default, so Quick Compaction is now enabled by default. This feature mitigates the small-file problem caused by large batch imports.</p></li><li><p>Modifications to a table's dynamic partition attributes no longer take effect immediately; they wait for the next scheduling round of the dynamic partition task, in order to avoid certain deadlock problems.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><ul><li><p>Optimized the use of bthread and pthread to reduce RPC blocking during query execution.</p></li><li><p>Added a button for downloading the Profile to the Profile page of the FE web UI.</p></li><li><p>Added the FE configuration <code>recover_with_skip_missing_version</code>, which allows queries to skip problematic replicas under certain failure conditions.</p></li><li><p>The row-level permission function now supports external Catalogs.</p></li><li><p>Hive Catalog supports automatic refreshing of Kerberos tickets on the BE side, with no manual refreshing required.</p></li><li><p>JDBC Catalog supports tables under the MySQL/ClickHouse system database (<code>information_schema</code>).</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><ul><li><p>Fixed incorrect query results caused by low-cardinality column optimization.</p></li><li><p>Fixed several authentication and compatibility issues when accessing HDFS.</p></li><li><p>Fixed several issues with float/double and decimal types.</p></li><li><p>Fixed several issues with date/datetimev2 types.</p></li><li><p>Fixed several query execution and planning issues.</p></li><li><p>Fixed several issues with JDBC Catalog.</p></li><li><p>Fixed several query-related issues with Hive Catalog, as well as Hive Metastore metadata synchronization issues.</p></li><li><p>Fixed incorrect results of the <code>SHOW LOAD PROFILE</code> statement.</p></li><li><p>Fixed several memory-related issues.</p></li><li><p>Fixed several issues with the <code>CREATE TABLE AS SELECT</code> functionality.</p></li><li><p>Fixed a BE crash caused by the jsonb type on CPUs that do not support AVX2.</p></li><li><p>Fixed several issues with dynamic partitions.</p></li><li><p>Fixed several issues with TOPN query optimization.</p></li><li><p>Fixed several issues with the Unique Key Merge-on-Write table model.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks<a href="#big-thanks" class="hash-link" aria-label="Big Thanks的直接链接" title="Big 
Thanks的直接链接"></a></h2><p>58 contributors participated in the improvement and release of 1.2.5, and thank them for their hard work and dedication:</p><p>@adonis0147</p><p>@airborne12</p><p>@AshinGau</p><p>@BePPPower</p><p>@BiteTheDDDDt</p><p>@caiconghui</p><p>@CalvinKirs</p><p>@cambyzju</p><p>@caoliang-web</p><p>@dataroaring</p><p>@Doris-Extras</p><p>@dujl</p><p>@dutyu</p><p>@fsilent</p><p>@Gabriel39</p><p>@gitccl</p><p>@gnehil</p><p>@GoGoWen</p><p>@gongzexin</p><p>@HappenLee</p><p>@herry2038</p><p>@jacktengg</p><p>@Jibing-Li</p><p>@kaka11chen</p><p>@Kikyou1997</p><p>@LemonLiTree</p><p>@liaoxin01</p><p>@LiBinfeng-01</p><p>@luwei16</p><p>@Moonm3n</p><p>@morningman</p><p>@mrhhsg</p><p>@Mryange</p><p>@nextdreamblue</p><p>@nsnhuang</p><p>@qidaye</p><p>@Shoothzj</p><p>@sohardforaname</p><p>@stalary</p><p>@starocean999</p><p>@SWJTU-ZhangLei</p><p>@wsjz</p><p>@xiaokang</p><p>@xinyiZzz</p><p>@yangzhg</p><p>@yiguolei</p><p>@yixiutt</p><p>@yujun777</p><p>@Yulei-Yang</p><p>@yuxuan-luo</p><p>@zclllyybb</p><p>@zddr</p><p>@zenoyang</p><p>@zhangstar333</p><p>@zhannngchen</p><p>@zxealous</p><p>@zy-kkk</p><p>@zzzzzzzs</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Say goodbye to OOM crashes]]></title>
<id>https://doris.apache.org/zh-CN/blog/Say-Goodbye-to-OOM-Crashes</id>
<link href="https://doris.apache.org/zh-CN/blog/Say-Goodbye-to-OOM-Crashes"/>
<updated>2023-06-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[A more robust and flexible memory management solution with optimizations in memory allocation, memory tracking, and memory limit.]]></summary>
<content type="html"><![CDATA[<p>What guarantees system stability in large data query tasks? It is an effective memory allocation and monitoring mechanism. It is how you speed up computation, avoid memory hotspots, promptly respond to insufficient memory, and minimize OOM errors. </p><p><img loading="lazy" alt="memory-allocator" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_1-cbcc6d864b892831d6e8e3acf37a356f.png" width="1226" height="214" class="img_ev3q"></p><p>From a database user's perspective, how do they suffer from bad memory management? This is a list of things that used to bother our users:</p><ul><li>OOM errors cause backend processes to crash. To quote one of our community members: Hi, Apache Doris, it's okay to slow things down or fail a few tasks when you are short of memory, but throwing a downtime is just not cool.</li><li>Backend processes consume too much memory space, but there is no way to find the exact task to blame or limit the memory usage for a single query.</li><li>It is hard to set a proper memory size for each query, so chances are that a query gets canceled even when there is plenty of memory space.</li><li>High-concurrency queries are disproportionately slow, and memory hotspots are hard to locate.</li><li>Intermediate data during HashTable creation cannot be flushed to disks, so join queries between two large tables often fail due to OOM. </li></ul><p>Luckily, those dark days are behind us, because we have improved our memory management mechanism from the bottom up. Now get ready, things are going to be intensive.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="memory-allocation">Memory Allocation<a href="#memory-allocation" class="hash-link" aria-label="Memory Allocation的直接链接" title="Memory Allocation的直接链接"></a></h2><p>In Apache Doris, we have a one-and-only interface for memory allocation: <strong>Allocator</strong>. It will make adjustments as it sees appropriate to keep memory usage efficient and under control. Also, MemTrackers are in place to track the allocated or released memory size, and three different data structures are responsible for large memory allocation in operator execution (we will get to them immediately). </p><p><img loading="lazy" alt="memory-tracker" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_2-2804f7f38fc4bfec4b20bb6f1ce2416e.png" width="1228" height="568" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="data-structures-in-memory">Data Structures in Memory<a href="#data-structures-in-memory" class="hash-link" aria-label="Data Structures in Memory的直接链接" title="Data Structures in Memory的直接链接"></a></h3><p>As different queries have different memory hotspot patterns in execution, Apache Doris provides three different in-memory data structures: <strong>Arena</strong>, <strong>HashTable</strong>, and <strong>PODArray</strong>. They are all under the reign of the Allocator.</p><p><img loading="lazy" alt="data-structures" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_3-9450598001ffa0bb838abad7bc62efb6.png" width="1222" height="758" class="img_ev3q"></p><p><strong>1. Arena</strong></p><p>The Arena is a memory pool that maintains a list of chunks, which are to be allocated upon request from the Allocator. The chunks support memory alignment. They exist throughout the lifespan of the Arena, and will be freed up upon destruction (usually when the query is completed). 
<p><strong>2. HashTable</strong></p><p>HashTables are applicable to Hash Joins, aggregations, set operations, and window functions. The PartitionedHashTable structure supports no more than 16 sub-HashTables. It also supports the parallel merging of HashTables, and each sub-HashTable in a Hash Join can be scaled independently. These measures reduce overall memory usage and the latency caused by scaling.</p><p>The scaling policy is as follows:</p><ul><li>If the current HashTable is smaller than 8M, it will be scaled by a factor of 4;</li><li>If it is larger than 8M, it will be scaled by a factor of 2;</li><li>If it is smaller than 2G, it will be scaled when it is 50% full;</li><li>If it is larger than 2G, it will be scaled when it is 75% full.</li></ul><p>Newly created HashTables are pre-scaled based on how much data they are going to hold. We also provide different types of HashTables for different scenarios. For example, for aggregations, you can apply PHmap.</p><p><strong>3. PODArray</strong></p><p>PODArray, as the name suggests, is a dynamic array of POD (Plain Old Data). The difference between it and <code>std::vector</code> is that PODArray does not initialize its elements. It supports memory alignment and some interfaces of <code>std::vector</code>, and it is scaled by a factor of 2. On destruction, instead of calling the destructor for each element, it releases the memory of the whole PODArray at once. PODArray is mainly used to store strings in columns and is applicable in function computation and expression filtering.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="memory-interface">Memory Interface<a href="#memory-interface" class="hash-link" aria-label="Direct link to Memory Interface" title="Direct link to Memory Interface"></a></h3><p>As the only interface that coordinates Arena, PODArray, and HashTable, the Allocator executes memory mapping (MMAP) allocation for requests larger than 64M. Requests smaller than 4K are directly allocated from the system via malloc/free, and those in between are accelerated by a general-purpose caching ChunkAllocator, which brings a 10% performance increase according to our benchmarking results. The ChunkAllocator first tries to retrieve a chunk of the specified size from the FreeList of the current core in a lock-free manner; if such a chunk doesn't exist, it tries other cores in a lock-based manner; if that still fails, it requests the specified memory size from the system and encapsulates it into a chunk.</p>
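<p>Here is an illustrative sketch of that size-based routing, assuming Linux (<code>mmap</code>). The <code>ChunkAllocateSketch</code> stand-in hides the per-core FreeList logic, and none of these names are the real Doris functions:</p><pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdlib&gt;
#include &lt;sys/mman.h&gt;

// Stand-in for the general-purpose caching ChunkAllocator: the real one
// first tries the current core's FreeList lock-free, then other cores'
// lists with a lock, and only then falls back to the system.
void* ChunkAllocateSketch(size_t size) { return malloc(size); }

// Size-based routing: large requests go through mmap, tiny ones straight
// to malloc, and mid-sized ones through the caching chunk allocator.
void* AllocateSketch(size_t size) {
    constexpr size_t kMmapThreshold  = 64ULL &lt;&lt; 20;  // 64M
    constexpr size_t kSmallThreshold = 4096;         // 4K

    if (size &gt; kMmapThreshold) {
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? nullptr : p;
    }
    if (size &lt; kSmallThreshold) {
        return malloc(size);
    }
    return ChunkAllocateSketch(size);
}
</code></pre>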
<p>We chose Jemalloc over TCMalloc after experimenting with both of them. We tried TCMalloc in our high-concurrency tests and noticed that the Spin Lock in its CentralFreeList took up 40% of the total query time. Disabling "aggressive memory decommit" made things better, but it brought much more memory usage, so we had to use an individual thread to regularly recycle cache. Jemalloc, on the other hand, was more performant and stable in high-concurrency queries. After fine-tuning for other scenarios, it delivered the same performance as TCMalloc but consumed less memory.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="memory-reuse">Memory Reuse<a href="#memory-reuse" class="hash-link" aria-label="Direct link to Memory Reuse" title="Direct link to Memory Reuse"></a></h3><p>Memory reuse is widely applied on the execution layer of Apache Doris. For example, data blocks are reused throughout the execution of a query. During Shuffle, there are two blocks at the Sender end that work alternately: one receives data while the other is in RPC transport. When reading a tablet, Doris reuses the predicate column: it reads cyclically, filters, copies the filtered data to the upper block, and then clears the column for the next round. When ingesting data into an Aggregate Key table, once the MemTable that caches the data reaches a certain size, it is pre-aggregated and then more data is written in. </p><p>Memory reuse is applied in data scanning, too. Before the scanning starts, a number of free blocks (depending on the number of scanners and threads) are allocated to the scanning task. During each scanner scheduling, one of the free blocks is passed to the storage layer for data reading. After data reading, the block is put into the producer queue for consumption by the upper operators in subsequent computation. Once an upper operator has copied the computation data from the block, the block goes back into the free blocks for the next scanner scheduling. The thread that preallocates the free blocks is also responsible for freeing them up after data scanning, so there are no extra overheads. The number of free blocks effectively determines the concurrency of data scanning.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="memory-tracking">Memory Tracking<a href="#memory-tracking" class="hash-link" aria-label="Direct link to Memory Tracking" title="Direct link to Memory Tracking"></a></h2><p>Apache Doris uses MemTrackers to follow up on the allocation and release of memory while analyzing memory hotspots. MemTrackers keep records of each data query, data ingestion, and data compaction task, as well as the memory size of each global object, such as Cache and TabletMeta. Both manual counting and MemHook auto-tracking are supported. Users can view the real-time memory usage of a Doris backend on a web page. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="structure-of-memtrackers">Structure of MemTrackers<a href="#structure-of-memtrackers" class="hash-link" aria-label="Direct link to Structure of MemTrackers" title="Direct link to Structure of MemTrackers"></a></h3><p>The MemTracker system before Apache Doris 1.2.0 was a hierarchical tree structure, consisting of process_mem_tracker, query_pool_mem_tracker, query_mem_tracker, instance_mem_tracker, ExecNode_mem_tracker, and so on. MemTrackers of two neighbouring layers are in a parent-child relationship, so any counting mistake in a child MemTracker would accumulate all the way up and undermine the credibility of the whole tree. </p><p><img loading="lazy" alt="MemTrackers" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_4-90b44adf02bc4653708948cf5e65d50e.png" width="1280" height="419" class="img_ev3q"></p><p>In Apache Doris 1.2.0 and newer, we made the structure of MemTrackers much simpler. MemTrackers are only divided into two types based on their roles: <strong>MemTracker Limiter</strong> and the others. MemTracker Limiter, which monitors memory usage, is unique in every query/ingestion/compaction task and global object, while the other MemTrackers trace the memory hotspots in query execution, such as HashTables in Join/Aggregation/Sort/Window functions and intermediate data in serialization, to give a picture of how memory is used in different operators or to provide a reference for memory control in data flushing.</p><p>The parent-child relationship between MemTracker Limiter and the other MemTrackers is only manifested in snapshot printing. You can think of such a relationship as a symbolic link. They are not consumed at the same time, and the lifecycle of one does not affect that of the other. This makes it much easier for developers to understand and use them. </p><p>MemTrackers (including MemTracker Limiter and the others) are put into a group of Maps. These allow users to print overall MemTracker-type snapshots and Query/Load/Compaction task snapshots, and to find out the Query/Load with the most memory usage or the most memory overusage. </p><p><img loading="lazy" alt="Structure-of-MemTrackers" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_5-beba9f4c16d66d0df644f9e69a3b7db3.png" width="1280" height="1063" class="img_ev3q"></p>
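<p>As a rough illustration of this two-type design (hypothetical names, not the actual Doris classes): a limiter exists per task or global object and can enforce a limit, plain trackers only count, and limiters sit in a map so snapshots and top consumers can be listed:</p><pre><code class="language-cpp">#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;map&gt;
#include &lt;memory&gt;
#include &lt;string&gt;

// Plain tracker: counts bytes for a hotspot (e.g. a Join HashTable).
struct MemTrackerSketch {
    std::string label;
    std::atomic&lt;int64_t&gt; consumption{0};  // bytes currently tracked
    void Consume(int64_t bytes) { consumption += bytes; }
    void Release(int64_t bytes) { consumption -= bytes; }
};

// Limiter: one per query/ingestion/compaction task or global object.
struct MemTrackerLimiterSketch : MemTrackerSketch {
    int64_t limit = -1;  // -1: track only, no limit enforced
    bool LimitExceeded() const { return limit &gt;= 0 &amp;&amp; consumption &gt; limit; }
};

// Registry of limiters, grouped so that per-type snapshots can be printed
// and the biggest consumers or overcommitters can be found quickly.
std::map&lt;std::string, std::shared_ptr&lt;MemTrackerLimiterSketch&gt;&gt; g_limiters;
</code></pre>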
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="how-memtracker-works">How MemTracker Works<a href="#how-memtracker-works" class="hash-link" aria-label="Direct link to How MemTracker Works" title="Direct link to How MemTracker Works"></a></h3><p>To calculate the memory usage of a certain execution, a MemTracker is added to a stack in the Thread Local storage of the current thread. By hooking the malloc/free/realloc calls in Jemalloc or TCMalloc, MemHook obtains the actual size of the memory allocated or released and records it in the Thread Local storage of the current thread. When an execution is done, the relevant MemTracker is removed from the stack. At the bottom of the stack is the MemTracker that records memory usage during the whole query/load execution process.</p><p>Now let me explain with a simplified query execution process.</p><ul><li>After a Doris backend node starts, the memory usage of all threads will be recorded in the Process MemTracker.</li><li>When a query is submitted, a <strong>Query MemTracker</strong> will be added to the Thread Local Storage (TLS) stack in the fragment execution thread.</li><li>Once a ScanNode is scheduled, a <strong>ScanNode MemTracker</strong> will be added to the Thread Local Storage (TLS) stack in the fragment execution thread. Then, any memory allocated or released in this thread will be recorded into both the Query MemTracker and the ScanNode MemTracker.</li><li>After a Scanner is scheduled, a Query MemTracker and a <strong>Scanner MemTracker</strong> will be added to the TLS stack of the Scanner thread.</li><li>When the scanning is done, all MemTrackers in the Scanner thread TLS stack will be removed. When the ScanNode scheduling is done, the ScanNode MemTracker will be removed from the fragment execution thread. Then, similarly, when an aggregation node is scheduled, an <strong>AggregationNode MemTracker</strong> will be added to the fragment execution thread TLS stack, and will be removed after the scheduling is done.</li><li>If the query is completed, the Query MemTracker will be removed from the fragment execution thread TLS stack. At this point, this stack should be empty. Then, from the QueryProfile, you can view the peak memory usage during the whole query execution as well as in each phase (scanning, aggregation, etc.). A minimal sketch of this stack-and-hook mechanism follows the figure below.</li></ul><p><img loading="lazy" alt="How-MemTrackers-Works" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_6-cf0ef627ae1b9d5448b54ab92f9c3180.png" width="1280" height="424" class="img_ev3q"></p>
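<p>As promised, here is a minimal sketch of that stack-and-hook mechanism (hypothetical names; the real MemHook wraps the allocator itself):</p><pre><code class="language-cpp">#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Minimal tracker for this sketch; see the registry sketch earlier.
struct TrackerSketch {
    int64_t consumption = 0;
    void Consume(int64_t b) { consumption += b; }
    void Release(int64_t b) { consumption -= b; }
};

// One stack per thread: the bottom entry tracks the whole query/load,
// and the entries above it track the operator currently executing.
thread_local std::vector&lt;TrackerSketch*&gt; tls_tracker_stack;

// What a hook around malloc/free/realloc would do with the actual size
// of each allocation: charge every tracker on this thread's stack.
void OnAlloc(int64_t bytes) {
    for (auto* t : tls_tracker_stack) t-&gt;Consume(bytes);
}
void OnFree(int64_t bytes) {
    for (auto* t : tls_tracker_stack) t-&gt;Release(bytes);
}

// Scoped helper: push a tracker while an operator (say, a ScanNode)
// runs on this thread, and pop it when the operator is done.
struct ScopedTrackerSketch {
    explicit ScopedTrackerSketch(TrackerSketch* t) { tls_tracker_stack.push_back(t); }
    ~ScopedTrackerSketch() { tls_tracker_stack.pop_back(); }
};
</code></pre>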
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-use-memtracker">How to Use MemTracker<a href="#how-to-use-memtracker" class="hash-link" aria-label="Direct link to How to Use MemTracker" title="Direct link to How to Use MemTracker"></a></h3><p>The Doris backend web page shows real-time memory usage, divided into types: Query/Load/Compaction/Global. Current memory consumption and peak consumption are shown. </p><p><img loading="lazy" alt="How-to-use-MemTrackers" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_7-4fef601c6c9f9a5fcc53b785485057d3.png" width="1280" height="562" class="img_ev3q"></p><p>The Global types include MemTrackers of Cache and TabletMeta.</p><p><img loading="lazy" alt="memory-usage-by-subsystem-1" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_8-a007e2d4fa06263628f5d603471a3f79.png" width="1280" height="489" class="img_ev3q"></p><p>From the Query types, you can see the current memory consumption and peak consumption of the current query and the operators it involves (you can tell how they are related from the labels). For memory statistics of historical queries, you can check the Doris FE audit logs or BE INFO logs.</p><p><img loading="lazy" alt="memory-usage-by-subsystem-2" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_9-32f7ad3c6b10088f0735cfd1ff0e1e39.png" width="1280" height="762" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="memory-limit">Memory Limit<a href="#memory-limit" class="hash-link" aria-label="Direct link to Memory Limit" title="Direct link to Memory Limit"></a></h2><p>With memory tracking widely implemented in Doris backends, we are one step closer to eliminating OOM, the cause of backend downtime and large-scale query failures. The next step is to optimize the memory limits on queries and processes to keep memory usage under control.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="memory-limit-on-query">Memory Limit on Query<a href="#memory-limit-on-query" class="hash-link" aria-label="Direct link to Memory Limit on Query" title="Direct link to Memory Limit on Query"></a></h3><p>Users can put a memory limit on every query. If that limit is exceeded during execution, the query is canceled. But since version 1.2, we have allowed Memory Overcommit, which is a more flexible memory limit control. If there are sufficient memory resources, a query can consume more memory than the limit without being canceled, so users don't have to pay extra attention to memory usage; if there are not, the query will wait until new memory space is freed up; only when the newly freed-up memory is not enough for the query will the query be canceled.</p><p>In Apache Doris 2.0, we have also implemented exception safety for queries. That means any insufficient memory allocation will immediately cause the query to be canceled, which saves the trouble of checking for "Cancel" status in subsequent steps.</p>
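<p>A simplified reading of this overcommit policy, expressed as a decision function (illustrative only; the real logic is more involved):</p><pre><code class="language-cpp">#include &lt;cstdint&gt;

enum class AllocDecision { kProceed, kWait, kCancel };

// Overcommit policy sketch: a query may exceed its own limit while the
// process still has headroom; otherwise it waits for memory to be freed,
// and is cancelled only when what gets freed is still not enough.
AllocDecision OnQueryAllocation(int64_t query_used, int64_t query_limit,
                                int64_t requested, int64_t process_headroom,
                                bool overcommit_enabled) {
    if (query_used + requested &lt;= query_limit) return AllocDecision::kProceed;
    if (!overcommit_enabled) return AllocDecision::kCancel;
    if (process_headroom &gt;= requested) return AllocDecision::kProceed;
    return AllocDecision::kWait;  // retry after GC; cancel as a last resort
}
</code></pre>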
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="memory-limit-on-process">Memory Limit on Process<a href="#memory-limit-on-process" class="hash-link" aria-label="Direct link to Memory Limit on Process" title="Direct link to Memory Limit on Process"></a></h3><p>On a regular basis, the Doris backend retrieves the physical memory of processes and the currently available memory size from the system. Meanwhile, it collects MemTracker snapshots of all Query/Load/Compaction tasks. If a backend process exceeds its memory limit or there is insufficient memory, Doris will free up some memory space by clearing caches and cancelling a number of queries or data ingestion tasks. These operations are executed regularly by an individual GC thread.</p><p><img loading="lazy" alt="memory-limit-on-process" src="https://cdnd.selectdb.com/zh-CN/assets/images/OOM_10-4926bcb7439768952d6d973697de2468.png" width="1280" height="610" class="img_ev3q"></p><p>If the process memory consumed is over the SoftMemLimit (81% of total system memory, by default), or the available system memory drops below the Warning Water Mark (less than 3.2GB), <strong>Minor GC</strong> will be triggered. At this moment, query execution will be paused at the memory allocation step, the cached data in data ingestion tasks will be force-flushed, and part of the Data Page Cache and the outdated Segment Cache will be released. If the newly released memory does not cover 10% of the process memory, with Memory Overcommit enabled, Doris will start cancelling the queries that are the biggest "overcommitters" until the 10% target is met or all queries are canceled. Then, Doris will shorten the system memory checking interval and the GC interval. The queries will be continued after more memory becomes available.</p><p>If the process memory consumed is beyond the MemLimit (90% of total system memory, by default), or the available system memory drops below the Low Water Mark (less than 1.6GB), <strong>Full GC</strong> will be triggered. At this time, data ingestion tasks will be stopped, and all of the Data Page Cache and most other caches will be released. If, after all these steps, the newly released memory does not cover 20% of the process memory, Doris will look into all MemTrackers, find the most memory-consuming queries and ingestion tasks, and cancel them one by one. Only after the 20% target is met will the system memory checking interval and the GC interval be extended, and the queries and ingestion tasks be continued. (One garbage collection operation usually takes hundreds of μs to dozens of ms.)</p>
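<p>The default thresholds above can be summed up in one illustrative trigger check (the numbers are the defaults quoted in this post; the function is a sketch, not the actual implementation):</p><pre><code class="language-cpp">#include &lt;cstdint&gt;

enum class GcLevel { kNone, kMinor, kFull };

// Minor GC at the SoftMemLimit / Warning Water Mark;
// Full GC at the MemLimit / Low Water Mark.
GcLevel CheckGcSketch(int64_t process_used, int64_t sys_total,
                      int64_t sys_available) {
    const int64_t soft_mem_limit = sys_total * 81 / 100;        // 81%
    const int64_t mem_limit      = sys_total * 90 / 100;        // 90%
    const int64_t warn_mark      = int64_t(3.2 * (1LL &lt;&lt; 30));  // ~3.2GB
    const int64_t low_mark       = int64_t(1.6 * (1LL &lt;&lt; 30));  // ~1.6GB

    if (process_used &gt; mem_limit || sys_available &lt; low_mark) {
        return GcLevel::kFull;   // stop ingestion, drop caches, cancel top consumers
    }
    if (process_used &gt; soft_mem_limit || sys_available &lt; warn_mark) {
        return GcLevel::kMinor;  // pause allocations, flush MemTables, trim caches
    }
    return GcLevel::kNone;
}
</code></pre>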
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="influences-and-outcomes">Influences and Outcomes<a href="#influences-and-outcomes" class="hash-link" aria-label="Direct link to Influences and Outcomes" title="Direct link to Influences and Outcomes"></a></h2><p>After optimizations in memory allocation, memory tracking, and memory limits, we have substantially increased the stability and high-concurrency performance of Apache Doris as a real-time analytical data warehouse platform. OOM crashes in the backend are a rare scene now. Even if an OOM occurs, users can locate the root cause based on the logs and then fix it. In addition, with more flexible memory limits on queries and data ingestion, users don't have to spend extra effort taking care of memory when memory space is adequate.</p><p>In the next phase, we plan to ensure the completion of queries under memory overcommitment, which means fewer queries will have to be canceled due to memory shortage. We have broken this objective into specific directions of work: exception safety, memory isolation between resource groups, and the flushing mechanism for intermediate data. If you want to meet our developers, <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">this is where you find us</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Understanding data compaction in 3 minutes]]></title>
<id>https://doris.apache.org/zh-CN/blog/Understanding-Data-Compaction-in-3-Minutes</id>
<link href="https://doris.apache.org/zh-CN/blog/Understanding-Data-Compaction-in-3-Minutes"/>
<updated>2023-06-09T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Think of your disks as a warehouse: The compaction mechanism is like a team of storekeepers who help put away the incoming data.]]></summary>
<content type="html"><![CDATA[<p>What is compaction in database? Think of your disks as a warehouse: The compaction mechanism is like a team of storekeepers (with genius organizing skills like Marie Kondo) who help put away the incoming data. </p><p>In particular, the data (which is the inflowing cargo in this metaphor) comes in on a "conveyor belt", which does not allow cutting in line. This is how the <strong>LSM-Tree</strong> (Log Structured-Merge Tree) works: In data storage, data is written into <strong>MemTables</strong> in an append-only manner, and then the MemTables are flushed to disks to form files. (These files go by different names in different databases. In my community, we call them <strong>Rowsets</strong>). Just like putting small boxes of cargo into a large container, compaction means merging multiple small rowset files into a big one, but it does much more than that. Like I said, the compaction mechanism is an organizing magician: </p><ul><li>Although the items (data) in each box (rowset) are orderly arranged, the boxes themselves are not. Hence, one thing that the "storekeepers" do is to sort the boxes (rowsets) in a certain order so they can be quickly found once needed (quickening data reading).</li><li>If an item needs to be discarded or replaced, since no line-jump is allowed on the conveyor belt (append-only), you can only put a "note" (together with the substitution item) at the end of the queue on the belt to remind the "storekeepers", who will later perform replacing or discarding for you.</li><li>If needed, the "storekeepers" are even kind enough to pre-process the cargo for you (pre-aggregating data to reduce computation burden during data reading). </li></ul><p><img loading="lazy" alt="MemTable-rowset" src="https://cdnd.selectdb.com/zh-CN/assets/images/Compaction_1-ee87990e8968b1976f60ae8c76c1f224.png" width="1279" height="670" class="img_ev3q"></p><p>As helpful as the "storekeepers" are, they can be troublemakers at times — that's why "team management" matters. For the compaction mechanism to work efficiently, you need wise planning and scheduling, or else you might need to deal with high memory and CPU usage, if not OOM in the backend or write error.</p><p>Specifically, efficient compaction is added up by quick triggering of compaction tasks, controllable memory and CPU overheads, and easy parameter adjustment from the engineer's side. That begs the question: <strong>How</strong>? In this post, I will show you our way, including how we trigger, execute, and fine-tune compaction for faster and less resource-hungry execution.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="trigger-strategies">Trigger Strategies<a href="#trigger-strategies" class="hash-link" aria-label="Trigger Strategies的直接链接" title="Trigger Strategies的直接链接"></a></h2><p>The overall objective here is to trigger compaction tasks timely with the least resource consumption possible.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="active-trigger">Active Trigger<a href="#active-trigger" class="hash-link" aria-label="Active Trigger的直接链接" title="Active Trigger的直接链接"></a></h3><p>The most intuitive way to ensure timely compaction is to scan for potential compaction tasks upon data ingestion. Every time a new data tablet version is generated, a compaction task is triggered immediately, so you will never have to worry about version buildup. But this only works for newly ingested data. 
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="execution">Execution<a href="#execution" class="hash-link" aria-label="Direct link to Execution" title="Direct link to Execution"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="vertical-compaction-for-columnar-storage">Vertical Compaction for Columnar Storage<a href="#vertical-compaction-for-columnar-storage" class="hash-link" aria-label="Direct link to Vertical Compaction for Columnar Storage" title="Direct link to Vertical Compaction for Columnar Storage"></a></h3><p>As columnar storage is the future for analytic databases, the execution of compaction should adapt to it. We call this vertical compaction. Let me illustrate the mechanism with the figure below:</p><p><img loading="lazy" alt="vertical-compaction" src="https://cdnd.selectdb.com/zh-CN/assets/images/Compaction_2-a6a4e08e1d33d0c489208d8fb607abb9.png" width="1536" height="826" class="img_ev3q"></p><p>Hope all these tiny blocks and numbers don't make you dizzy. Actually, vertical compaction can be broken down into four simple steps; a code sketch of steps 2 and 3 follows below:</p><ol><li><strong>Separate key columns and value columns</strong>. Split out all key columns from the input rowsets and put them into one group, and all value columns into N groups.</li><li><strong>Merge the key columns</strong>. Heap sort is used in this step. The product here is a merged and ordered key column as well as a global sequence marker (<strong>RowSources</strong>).</li><li><strong>Merge the value columns</strong>. The value columns are merged and organized based on the sequence in <strong>RowSources</strong>. </li><li><strong>Write the data</strong>. All columns are assembled together and form one big rowset.</li></ol><p>As a supporting technique for columnar storage, vertical compaction avoids the need to load all columns in every merging operation, which means it can vastly reduce memory usage compared to traditional row-oriented compaction.</p>
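<p>As promised above, here is a stripped-down sketch of steps 2 and 3, assuming one int key column, pre-loaded runs, and a linear scan in place of the heap (illustrative only, not the Doris implementation):</p><pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Step 2: merge sorted key runs, recording which input run each output
// row came from -- the global sequence marker (RowSources).
std::vector&lt;uint16_t&gt; MergeKeys(const std::vector&lt;std::vector&lt;int&gt;&gt;&amp; runs,
                                std::vector&lt;int&gt;&amp; merged_keys) {
    std::vector&lt;uint16_t&gt; row_sources;
    std::vector&lt;size_t&gt; pos(runs.size(), 0);
    for (;;) {
        int best = -1;  // a heap is used in practice instead of this scan
        for (size_t i = 0; i &lt; runs.size(); ++i)
            if (pos[i] &lt; runs[i].size() &amp;&amp;
                (best &lt; 0 || runs[i][pos[i]] &lt; runs[best][pos[best]]))
                best = int(i);
        if (best &lt; 0) break;
        merged_keys.push_back(runs[best][pos[best]++]);
        row_sources.push_back(uint16_t(best));  // remember the origin run
    }
    return row_sources;
}

// Step 3: replay RowSources to merge each value-column group in key
// order, without ever comparing the value columns themselves.
std::vector&lt;int&gt; MergeValues(const std::vector&lt;std::vector&lt;int&gt;&gt;&amp; runs,
                             const std::vector&lt;uint16_t&gt;&amp; row_sources) {
    std::vector&lt;size_t&gt; pos(runs.size(), 0);
    std::vector&lt;int&gt; out;
    for (uint16_t src : row_sources) out.push_back(runs[src][pos[src]++]);
    return out;
}
</code></pre>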
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="segment-compaction-to-avoid-jams">Segment Compaction to Avoid "Jams"<a href="#segment-compaction-to-avoid-jams" class="hash-link" aria-label="Direct link to Segment Compaction to Avoid &quot;Jams&quot;" title="Direct link to Segment Compaction to Avoid &quot;Jams&quot;"></a></h3><p>As described in the beginning, in data ingestion, data is first piled up in memory until it reaches a certain size, and is then flushed to disks and stored in the form of files. Therefore, if you ingest one huge batch of data at a time, you will have a large number of newly generated files on the disks. That adds to the scanning burden during data reading, and thus slows down data queries. (Imagine that you suddenly have to look into 50 boxes instead of 5 to find the item you need. That's overwhelming.) In some databases, such an explosion of files could even trigger a protection mechanism that suspends data ingestion.</p><p>Segment compaction is the way to avoid that. It allows you to compact data at the same time you ingest it, so that the system can ingest a larger data size quickly without generating too many files. </p><p>This is a flow chart that explains how segment compaction works:</p><p><img loading="lazy" alt="segment-compaction" src="https://cdnd.selectdb.com/zh-CN/assets/images/Compaction_3-6558067933ac6eb62900cd50b18f09fa.png" width="1030" height="950" class="img_ev3q"></p><p>Segment compaction is triggered once the number of newly generated files exceeds a certain limit (let's say, 10). It is executed asynchronously by a specialized merging thread. Every 10 files are merged into one, and the original 10 files are deleted. Segment compaction does not prolong the data ingestion process by much, but it can largely accelerate data queries.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="ordered-data-compaction">Ordered Data Compaction<a href="#ordered-data-compaction" class="hash-link" aria-label="Direct link to Ordered Data Compaction" title="Direct link to Ordered Data Compaction"></a></h3><p>Time series data analysis is an increasingly common analytic scenario. </p><p>Time series data is "born orderly". It is already arranged chronologically, it is written at a regular pace, and every batch of it is of similar size. It is like the least-worried-about child in the family. Correspondingly, we have a tailored compaction method for it: ordered data compaction.</p><p><img loading="lazy" alt="ordered-data-compaction" src="https://cdnd.selectdb.com/zh-CN/assets/images/Compaction_4-e2cea4ec600d74073084b6ecbd1e600f.png" width="653" height="443" class="img_ev3q"></p><p>Ordered data compaction is even simpler:</p><ol><li><strong>Upload</strong>: Jot down the Min/Max Keys of the input rowset files.</li><li><strong>Check</strong>: Check whether the rowset files are organized correctly based on the Min/Max Keys and the file size.</li><li><strong>Merge</strong>: Hard-link the input rowsets to the new rowset, and create metadata for the new rowset (including the number of rows, file size, Min/Max Key, etc.).</li></ol><p>See? It is a super neat and lightweight workload, involving only file linking and metadata creation; a sketch of the qualification check follows below. Statistically, <strong>it just takes milliseconds to compact huge amounts of time series data but consumes nearly zero memory</strong>.</p>
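<p>Here is an illustrative version of the check in step 2 (hypothetical metadata struct, not the actual Doris code):</p><pre><code class="language-cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Minimal rowset metadata for this sketch.
struct RowsetMetaSketch {
    std::string min_key, max_key;
    int64_t     num_rows = 0, file_size = 0;
};

// Input rowsets qualify for ordered data compaction only if their key
// ranges do not overlap: each rowset's Min Key must not be smaller than
// the previous rowset's Max Key.
bool QualifiesForOrderedCompaction(const std::vector&lt;RowsetMetaSketch&gt;&amp; in) {
    for (size_t i = 1; i &lt; in.size(); ++i)
        if (in[i].min_key &lt; in[i - 1].max_key) return false;  // overlap
    return true;
}

// If the check passes, the "merge" is just hard-linking the input files
// into the new rowset (e.g. via link(2)) and writing merged metadata --
// no data is read or copied, which is why it takes only milliseconds.
</code></pre>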
<p>So far, these are the strategic and algorithmic optimizations for compaction implemented in <a href="https://github.com/apache/doris/issues/19231" target="_blank" rel="noopener noreferrer">Apache Doris 2.0.0</a>, a unified analytic database. Apart from these, we, as developers of the open source project, have fine-tuned it from an engineering perspective.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="engineering-optimizations">Engineering Optimizations<a href="#engineering-optimizations" class="hash-link" aria-label="Direct link to Engineering Optimizations" title="Direct link to Engineering Optimizations"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="zero-copy">Zero-Copy<a href="#zero-copy" class="hash-link" aria-label="Direct link to Zero-Copy" title="Direct link to Zero-Copy"></a></h3><p>In the backend nodes of Apache Doris, data goes through a few layers: Tablet -&gt; Rowset -&gt; Segment -&gt; Column -&gt; Page. The compaction process involves data transfers that consume a lot of CPU resources, so we designed zero-copy compaction logic, realized by a data structure named BlockView. This brings another 5% increase in compaction efficiency.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="load-on-demand">Load-on-Demand<a href="#load-on-demand" class="hash-link" aria-label="Direct link to Load-on-Demand" title="Direct link to Load-on-Demand"></a></h3><p>In most cases, the rowsets are not 100% orderless, so we can take advantage of such partial orderliness. For a group of ordered rowsets, Apache Doris only loads the first one and then starts merging. As the merging goes on, it gradually loads the rowset files it needs. This is how it decreases memory usage. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="idle-schedule"><strong>Idle Schedule</strong><a href="#idle-schedule" class="hash-link" aria-label="Direct link to Idle Schedule" title="Direct link to Idle Schedule"></a></h3><p>According to our experience, base compaction tasks are often resource-intensive and time-consuming, so they can easily stand in the way of data queries. Doris 2.0.0 enables Idle Schedule, deprioritizing base compaction tasks with huge data volumes, long execution times, and low compaction rates. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="parameter-optimizations">Parameter Optimizations<a href="#parameter-optimizations" class="hash-link" aria-label="Direct link to Parameter Optimizations" title="Direct link to Parameter Optimizations"></a></h2><p>Every data engineer has at some point been plagued by complicated parameters and configurations. To protect our users from this nightmare, we have provided a streamlined set of parameters with the best-performing default configurations for the general environment.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion"></a></h2><p>This is how we keep our "storekeepers" working efficiently and cost-effectively. If you wonder how these strategies and optimizations work in real practice, we tested Apache Doris with ClickBench: it reaches a <strong>compaction speed of 300,000 rows/s</strong>; in high-concurrency scenarios, it maintains <strong>a stable compaction score of around 50</strong>. Also, we are planning to implement auto-tuning and increase observability for the compaction mechanism. 
If you are interested in the <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a> project and what we do, this is a group of visionary and passionate <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">developers</a> that you can talk to.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.4]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.4</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.4"/>
<updated>2023-06-05T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.2.4 is now available, with several enhancements and bug fixes based on 1.2.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior Changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior Changed的直接链接" title="Behavior Changed的直接链接"></a></h2><ul><li><p>For <code>DateV2</code>/<code>DatetimeV2</code> and <code>DecimalV3</code> type, in the results of <code>DESCRIBLE</code> and <code>SHOW CREATE TABLE</code> statements, they will no longer be displayed as <code>DateV2</code>/<code>DatetimeV2</code> or <code>DecimalV3</code>, but directly displayed as <code>Date</code>/<code>Datetime</code> or <code>Decimal</code>.</p><ul><li>This change is for compatibility with some BI tools. If you want to see the actual type of the column, you can check it with the <code>DESCRIBE ALL</code> statement.</li></ul></li><li><p>When querying tables in the <code>information_schema</code> database, the meta information(database, table, column, etc.) in the external catalog is no longer returned by default.</p><ul><li>This change avoids the problem that the <code>information_schema</code> database cannot be queried due to the connection problem of some external catalog, so as to solve the problem of using some BI tools with Doris. It can be controlled by the FE configuration <code>infodb_support_ext_catalog</code>, and the default value is <code>false</code>, that is, the meta information of external catalog will not be returned.</li></ul></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="jdbc-catalog">JDBC Catalog<a href="#jdbc-catalog" class="hash-link" aria-label="JDBC Catalog的直接链接" title="JDBC Catalog的直接链接"></a></h3><ul><li>Supports connecting to Trino/Presto via JDBC Catalog</li></ul><p>​ Refer to: <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc#trino" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc#trino</a></p><ul><li>JDBC Catalog connects to Clickhouse data source and supports Array type mapping</li></ul><p>​ Refer to: <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc#clickhouse" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc#clickhouse</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="spark-load">Spark Load<a href="#spark-load" class="hash-link" aria-label="Spark Load的直接链接" title="Spark Load的直接链接"></a></h3><ul><li>Spark Load supports Resource Manager HA related configuration</li></ul><p>​ Refer to: <a href="https://github.com/apache/doris/pull/15000" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/15000</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><ul><li><p>Fixed several connectivity issues with Hive Catalog.</p></li><li><p>Fixed ClassNotFound issues with Hudi Catalog.</p></li><li><p>Optimize the connection pool of JDBC Catalog to avoid too many connections.</p></li><li><p>Fix the problem that OOM will occur when importing data from another Doris cluster through JDBC Catalog.</p></li><li><p>Fixed serveral queries and imports planning issues.</p></li><li><p>Fixed several issues with Unique Key Merge-On-Write data model.</p></li><li><p>Fix several BDBJE issues and solve the problem of abnormal FE 
metadata in some cases.</p></li><li><p>Fix the problem that the <code>CREATE VIEW</code> statement does not support Table Valued Function.</p></li><li><p>Fixed several memory statistics issues.</p></li><li><p>Fixed several issues reading Parquet/ORC format.</p></li><li><p>Fixed several issues with DecimalV3.</p></li><li><p>Fixed several issues with SHOW QUERY/LOAD PROFILE.</p></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[A/B Testing was a handful, until we found the replacement for Druid]]></title>
<id>https://doris.apache.org/zh-CN/blog/AB-Testing-was-a-Handful-Until-we-Found-the-Replacement-for-Druid</id>
<link href="https://doris.apache.org/zh-CN/blog/AB-Testing-was-a-Handful-Until-we-Found-the-Replacement-for-Druid"/>
<updated>2023-06-01T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The recipe for successful A/B testing is quick computation, no duplication, and no data loss. For that, we used Apache Flink and Apache Doris to build our data platform.]]></summary>
<content type="html"><![CDATA[<p>Unlike normal reporting, A/B testing collects data of a different combination of dimensions every time. It is also a complicated kind of analysis of immense data. In our case, we have a real-time data volume of millions of OPS (Operations Per Second), with each operation involving around 20 data tags and over a dozen dimensions.</p><p>For effective A/B testing, as data engineers, we must ensure quick computation as well as high data integrity (which means no duplication and no data loss). I'm sure I'm not the only one to say this: it is hard!</p><p>Let me show you our long-term struggle with our previous Druid-based data platform.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="platform-architecture-10">Platform Architecture 1.0<a href="#platform-architecture-10" class="hash-link" aria-label="Platform Architecture 1.0的直接链接" title="Platform Architecture 1.0的直接链接"></a></h2><p><strong>Components</strong>: Apache Storm + Apache Druid + MySQL</p><p>This was our real-time datawarehouse, where Apache Storm was the real-time data processing engine and Apache Druid pre-aggregated the data. However, Druid did not support certain paging and join queries, so we wrote data from Druid to MySQL regularly, making MySQL the "materialized view" of Druid. But that was only a duct tape solution as it couldn't support our ever enlarging real-time data size. So data timeliness was unattainable.</p><p><img loading="lazy" alt="Apache-Storm-Apache-Druid-MySQL" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_1-8cb2f7a87f8ce60f9da14e0ec0ea7bb5.png" width="1709" height="960" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="platform-architecture-20">Platform Architecture 2.0<a href="#platform-architecture-20" class="hash-link" aria-label="Platform Architecture 2.0的直接链接" title="Platform Architecture 2.0的直接链接"></a></h2><p><strong>Components</strong>: Apache Flink + Apache Druid + TiDB</p><p>This time, we replaced Storm with Flink, and MySQL with TiDB. Flink was more powerful in terms of semantics and features, while TiDB, with its distributed capability, was more maintainable than MySQL. But architecture 2.0 was nowhere near our goal of end-to-end data consistency, either, because when processing huge data, enabling TiDB transactions largely slowed down data writing. Plus, Druid itself did not support standard SQL, so there were some learning costs and frictions in usage.</p><p><img loading="lazy" alt="Apache-Flink-Apache-Druid-TiDB" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_2-d32b762837d3788bdc43f0370fbf8199.png" width="1592" height="1083" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="platform-architecture-30">Platform Architecture 3.0<a href="#platform-architecture-30" class="hash-link" aria-label="Platform Architecture 3.0的直接链接" title="Platform Architecture 3.0的直接链接"></a></h2><p><strong>Components</strong>: Apache Flink + <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a></p><p>We replaced Apache Druid with Apache Doris as the OLAP engine, which could also serve as a unified data serving gateway. So in Architecture 3.0, we only need to maintain one set of query logic. 
And we layered our real-time data warehouse to increase the reusability of real-time data.</p><p><img loading="lazy" alt="Apache-Flink-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_3-c04ebf18268d873153f0365681d2a5d0.png" width="1340" height="1101" class="img_ev3q"></p><p>Turns out the combination of Flink and Doris was the answer. We can exploit their features to realize quick computation and data consistency. Keep reading and see how we make it happen.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="quick-computation">Quick Computation<a href="#quick-computation" class="hash-link" aria-label="Direct link to Quick Computation" title="Direct link to Quick Computation"></a></h2><p>As one piece of operation data can be attached to 20 tags, in A/B testing we compare two groups of data centering on only one tag each time. At first, we thought about splitting one piece of operation data (with 20 tags) into 20 pieces of data with only one tag each upon ingestion, and then importing them into Doris for analysis, but that could cause a data explosion and thus huge pressure on our clusters. </p><p>Then we tried moving part of that workload to the computation engine. So we tried and "exploded" the data in Flink, but soon regretted it, because when we aggregated the data using the global hash windows in Flink jobs, the network and CPU usage also "exploded".</p><p>Our third shot was to aggregate data locally in Flink right after we split it. As shown below, we create a window in the memory of one operator for local aggregation; then we further aggregate it using the global hash windows. Since two operators chained together run in one thread, transferring data between operators consumes far fewer network resources. <strong>The two-step aggregation method, combined with the</strong> <strong><a href="https://doris.apache.org/docs/dev/data-table/data-model" target="_blank" rel="noopener noreferrer">Aggregate model</a></strong> <strong>of Apache Doris, can keep the data explosion within a manageable range.</strong></p><p><img loading="lazy" alt="Apache-Flink-Apache-Doris-2" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_4-b4cad8ba4f8625718a23e7297885c40d.png" width="1642" height="624" class="img_ev3q"></p><p>For convenience in A/B testing, we make the test tag ID the first sorted field in Apache Doris, so we can quickly locate the target data using sorted indexes. To further minimize data processing in queries, we create materialized views with the frequently used dimensions. With constant modification and updates, the materialized views are applicable to 80% of our queries.</p><p>To sum up, with the application of sorted indexes and materialized views, we reduce our query response time to merely seconds in A/B testing.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-integrity-guarantee">Data Integrity Guarantee<a href="#data-integrity-guarantee" class="hash-link" aria-label="Direct link to Data Integrity Guarantee" title="Direct link to Data Integrity Guarantee"></a></h2><p>Imagine that your algorithm designers poured sweat and tears into improving the business, only to find their solution could not be validated by A/B testing due to data loss. 
This is an unbearable situation, and we make every effort to avoid it.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="develop-a-sink-to-doris-component">Develop a Sink-to-Doris Component<a href="#develop-a-sink-to-doris-component" class="hash-link" aria-label="Direct link to Develop a Sink-to-Doris Component" title="Direct link to Develop a Sink-to-Doris Component"></a></h3><p>To ensure end-to-end data integrity, we developed a Sink-to-Doris component. It is built on our own Flink Stream API scaffolding and realized by the idempotent writing of Apache Doris and the two-stage commit mechanism of Apache Flink. On top of it, we have a data protection mechanism against anomalies. </p><p>It is the result of our long-term evolution. We used to ensure data consistency by implementing "one write for one tag ID". Then we realized we could make good use of the transactions in Apache Doris and the two-stage commit of Apache Flink. </p><p><img loading="lazy" alt="idempotent-writing-two-stage-commit" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_5-b5f8490ad14a1b485d4472b3db36e9d6.png" width="3380" height="3334" class="img_ev3q"></p><p>As shown above, this is how two-stage commit works to guarantee data consistency:</p><ol><li>Write data into local files;</li><li>Stage One: pre-commit data to Apache Doris and save the Doris transaction ID into the checkpoint state;</li><li>If the checkpoint fails, manually abandon the transaction; if the checkpoint succeeds, commit the transaction in Stage Two;</li><li>If the commit fails after multiple retries, the transaction ID and the relevant data are saved in HDFS, and we can restore the data via Broker Load.</li></ol><p>We make it possible to split a single checkpoint into multiple transactions, so that we can prevent one Stream Load from taking more time than a Flink checkpoint in the event of large data volumes.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="application-display">Application Display<a href="#application-display" class="hash-link" aria-label="Direct link to Application Display" title="Direct link to Application Display"></a></h3><p>This is how we implement Sink-to-Doris. The component encapsulates API calls and topology assembly; with simple configuration, we can write data into Apache Doris via Stream Load. </p><p><img loading="lazy" alt="Sink-to-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_6-9d94599760bc55e52be086ec6d44cc69.png" width="3289" height="1077" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="cluster-monitoring">Cluster Monitoring<a href="#cluster-monitoring" class="hash-link" aria-label="Direct link to Cluster Monitoring" title="Direct link to Cluster Monitoring"></a></h3><p>For cluster and host monitoring, we adopted the metrics templates provided by the Apache Doris community. For data monitoring, in addition to the template metrics, we added Stream Load request numbers and loading rates.</p><p><img loading="lazy" alt="stream-load-cluster-monitoring" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_7-a8f9f0c95e96e136b287be46bdbc4add.png" width="2001" height="832" class="img_ev3q"></p><p>Other metrics of concern include data writing speed and task processing time. 
In the case of anomalies, we will receive notifications in the form of phone calls, messages, and emails.</p><p><img loading="lazy" alt="cluster-monitoring" src="https://cdnd.selectdb.com/zh-CN/assets/images/360_8-e02d4bf0c8cfab543e5693216fee6357.png" width="1280" height="888" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="key-takeaways">Key Takeaways<a href="#key-takeaways" class="hash-link" aria-label="Direct link to Key Takeaways" title="Direct link to Key Takeaways"></a></h2><p>The recipe for successful A/B testing is quick computation and high data integrity. For this purpose, we implement a two-step aggregation method in Apache Flink and utilize the Aggregate model, materialized views, and sorted indexes of Apache Doris. Then we develop a Sink-to-Doris component, which is realized by the idempotent writing of Apache Doris and the two-stage commit mechanism of Apache Flink.</p>]]></content>
<author>
<name>Heyu Dou, Xinxin Wang</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Building a log analytics solution 10 times more cost-effective than Elasticsearch]]></title>
<id>https://doris.apache.org/zh-CN/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch</id>
<link href="https://doris.apache.org/zh-CN/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch"/>
<updated>2023-05-26T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Apache Doris has introduced inverted indexes in version 2.0.0 and further optimized it to realize two times faster log query performance than Elasticsearch with 1/5 of the storage space it uses.]]></summary>
<content type="html"><![CDATA[<p>Logs often take up the majority of a company's data assets. Examples of logs include business logs (such as user activity logs), and Operation &amp; Maintenance logs of servers, databases, and network or IoT devices.</p><p>Logs are the guardian angel of business. On the one hand, they provide system risk alerts and help engineers in troubleshooting. On the other hand, if you zoom them out by time range, you might identify some helpful trends and patterns, not to mention that business logs are the cornerstone of user insights.</p><p>However, logs can be a handful, because:</p><ul><li><strong>They flow in like crazy.</strong> Every system event or click from user generates a log. A company often produces tens of billions of new logs per day.</li><li><strong>They are bulky.</strong> Logs are supposed to stay. They might not be useful until they are. So a company can accumulate up to PBs of log data, many of which are seldom visited but take up huge storage space. </li><li><strong>They must be quick to load and find.</strong> Locating the target log for troubleshooting is literally like looking for a needle in a haystack. People long for real-time log writing and real-time responses to log queries. </li></ul><p>Now you can see a clear picture of what an ideal log processing system is like. It should support:</p><ul><li><strong>High-throughput real-time data ingestion:</strong> It should be able to write blogs in bulk, and make them visible immediately.</li><li><strong>Low-cost storage:</strong> It should be able to store substantial amounts of logs without costing too many resources.</li><li><strong>Real-time text search:</strong> It should be capable of quick text search.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="common-solutions-elasticsearch--grafana-loki">Common Solutions: Elasticsearch &amp; Grafana Loki<a href="#common-solutions-elasticsearch--grafana-loki" class="hash-link" aria-label="Common Solutions: Elasticsearch &amp; Grafana Loki的直接链接" title="Common Solutions: Elasticsearch &amp; Grafana Loki的直接链接"></a></h2><p>There exist two common log processing solutions within the industry, exemplified by Elasticsearch and Grafana Loki, respectively. </p><ul><li><strong>Inverted index (Elasticsearch)</strong>: It is well-embraced due to its support for full-text search and high performance. The downside is the low throughput in real-time writing and the huge resource consumption in index creation.</li><li><strong>Lightweight index / no index (Grafana Loki)</strong>: It is the opposite of inverted index because it boasts high real-time write throughput and low storage cost but delivers slow queries.</li></ul><p><img loading="lazy" alt="Elasticsearch-and-Grafana-Loki" src="https://cdnd.selectdb.com/zh-CN/assets/images/Inverted_1-976368c1a98899c128fdb268e00261c5.png" width="1412" height="748" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduction-to-inverted-index">Introduction to Inverted Index<a href="#introduction-to-inverted-index" class="hash-link" aria-label="Introduction to Inverted Index的直接链接" title="Introduction to Inverted Index的直接链接"></a></h2><p>A prominent strength of Elasticsearch in log processing is quick keyword search among a sea of logs. This is enabled by inverted indexes.</p><p>Inverted indexing was originally used to retrieve words or phrases in texts. 
<p>In the inverted indexing of Elasticsearch, quick retrieval comes at the cost of writing speed, writing throughput, and storage space. Why? Firstly, tokenization, dictionary sorting, and inverted index creation are all CPU- and memory-intensive operations. Secondly, Elasticsearch has to store the original data, the inverted index, and an extra copy of data stored in columns for query acceleration. That's triple redundancy. </p><p>But without an inverted index, Grafana Loki, for example, is hurting user experience with its slow queries, which is the biggest pain point for engineers in log analysis.</p><p>Simply put, Elasticsearch and Grafana Loki represent different tradeoffs between high writing throughput, low storage cost, and fast query performance. What if I told you there is a way to have them all? We have introduced inverted indexes in <a href="https://github.com/apache/doris/issues/19231" target="_blank" rel="noopener noreferrer">Apache Doris 2.0.0</a> and further optimized them to realize <strong>two times faster log query performance than Elasticsearch with 1/5 of the storage space it uses. Both factors combined, it is a 10 times better solution.</strong> </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="inverted-index-in-apache-doris">Inverted Index in Apache Doris<a href="#inverted-index-in-apache-doris" class="hash-link" aria-label="Direct link to Inverted Index in Apache Doris" title="Direct link to Inverted Index in Apache Doris"></a></h2><p>Generally, there are two ways to implement indexes: an <strong>external indexing system</strong> or <strong>built-in indexes</strong>.</p><p><strong>External indexing system:</strong> You connect an external indexing system to your database. In data ingestion, data is imported to both systems. After the indexing system creates indexes, it deletes the original data within itself. When data users input a query, the indexing system provides the IDs of the relevant data, and then the database looks up the target data based on the IDs. </p><p>Building an external indexing system is easier and less intrusive to the database, but it comes with some annoying flaws:</p><ul><li>The need to write data into two systems can result in data inconsistency and storage redundancy.</li><li>Interaction between the database and the indexing system brings overheads, so when the target data is huge, the query across the two systems can be slow.</li><li>It is exhausting to maintain two systems.</li></ul><p>In <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a>, we opt for the other way. Built-in inverted indexes are more difficult to make, but once they are done, they are faster, more user-friendly, and trouble-free to maintain.</p><p>In Apache Doris, data is arranged in the following format. 
<p>In the inverted indexing of Elasticsearch, quick retrieval comes at the cost of writing speed, write throughput, and storage space. Why? Firstly, tokenization, dictionary sorting, and inverted index creation are all CPU- and memory-intensive operations. Secondly, Elasticsearch has to store the original data, the inverted index, and an extra copy of the data stored in columns for query acceleration. That's triple redundancy.</p><p>But without inverted indexes, Grafana Loki, for example, hurts user experience with its slow queries, which is the biggest pain point for engineers in log analysis.</p><p>Simply put, Elasticsearch and Grafana Loki represent different tradeoffs between high write throughput, low storage cost, and fast query performance. What if I told you there is a way to have them all? We have introduced inverted indexes in <a href="https://github.com/apache/doris/issues/19231" target="_blank" rel="noopener noreferrer">Apache Doris 2.0.0</a> and further optimized them to realize <strong>two times faster log query performance than Elasticsearch with 1/5 of the storage space it uses. Both factors combined, it is a 10 times more cost-effective solution.</strong></p><h2 id="inverted-index-in-apache-doris">Inverted Index in Apache Doris</h2><p>Generally, there are two ways to implement indexes: an <strong>external indexing system</strong> or <strong>built-in indexes</strong>.</p><p><strong>External indexing system:</strong> You connect an external indexing system to your database. In data ingestion, data is imported into both systems. After the indexing system creates indexes, it deletes the original data within itself. When data users input a query, the indexing system provides the IDs of the relevant data, and then the database looks up the target data based on those IDs.</p><p>Building an external indexing system is easier and less intrusive to the database, but it comes with some annoying flaws:</p><ul><li>The need to write data into two systems can result in data inconsistency and storage redundancy.</li><li>Interaction between the database and the indexing system brings overheads, so when the target data is huge, the query across the two systems can be slow.</li><li>It is exhausting to maintain two systems.</li></ul><p>In <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a>, we opt for the other way. Built-in inverted indexes are more difficult to build, but once they are done, they are faster, more user-friendly, and trouble-free to maintain.</p><p>In Apache Doris, data is arranged in the following format. Indexes are stored in the Index Region:</p><p><img alt="index-region-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/Inverted_3-10be120b343f5721d2c3adb666ab16b2.png"></p><p>We implement inverted indexes in a non-intrusive manner:</p><ol><li><strong>Data ingestion &amp; compaction</strong>: As a segment file is written into Doris, an inverted index file is written, too. The index file path is determined by the segment ID and the index ID. Rows in segments correspond to the docs in indexes, and so do the RowIDs and the DocIDs.</li><li><strong>Query</strong>: If the <code>where</code> clause includes a column with an inverted index, the system will look up the index file, return a DocID list, and convert the DocID list into a RowID Bitmap. Under the RowID filtering mechanism of Apache Doris, only the target rows will be read. This is how queries are accelerated.</li></ol><p><img alt="non-intrusive-inverted-index" src="https://cdnd.selectdb.com/zh-CN/assets/images/Inverted_4-9de8707a87ec78e61f52142b043b512b.png"></p><p>Such a non-intrusive method separates the index file from the data files, so you can make any changes to the inverted indexes without worrying about affecting the data files themselves or other indexes.</p><h2 id="optimizations-for-inverted-index">Optimizations for Inverted Index</h2><h3 id="general-optimizations">General Optimizations</h3><p><strong>C++ Implementation and Vectorization</strong></p><p>Different from Elasticsearch, which uses Java, Apache Doris implements C++ in its storage modules, query execution engine, and inverted indexes. Compared to Java, C++ provides better performance, allows easier vectorization, and produces no JVM GC overheads. We have vectorized every step of inverted indexing in Apache Doris, such as tokenization, index creation, and queries. To provide you with a perspective, <strong>in inverted indexing, Apache Doris writes data at a speed of 20MB/s per core, which is four times that of Elasticsearch (5MB/s).</strong></p><p><strong>Columnar Storage &amp; Compression</strong></p><p>Apache Lucene lays the foundation for inverted indexes in Elasticsearch. As Lucene itself is built to support file storage, it stores data in a row-oriented format.</p><p>In Apache Doris, the inverted indexes of different columns are isolated from each other, and the inverted index files adopt columnar storage to facilitate vectorization and data compression.</p><p>By utilizing Zstandard compression, Apache Doris realizes a compression ratio ranging from <strong>5:1</strong> to <strong>10:1</strong>, faster compression speeds, and 50% less space usage than GZIP compression.</p><p><strong>BKD Trees for Numeric / Datetime Columns</strong></p><p>Apache Doris implements BKD trees for numeric and datetime columns. This not only increases the performance of range queries, but is also a more space-saving method than converting those columns to fixed-length strings.
Other benefits of it include:</p><ol><li><strong>Efficient range queries</strong>: It is able to quickly locate the target data range in numeric and datetime columns.</li><li><strong>Less storage space</strong>: It aggregates and compresses adjacent data blocks to reduce storage costs.</li><li><strong>Support for multi-dimensional data</strong>: BKD trees are scalable and adaptive to multi-dimensional data types, such as GEO points and ranges.</li></ol><p>In addition to BKD trees, we have further optimized the queries on numeric and datetime columns.</p><ol><li><strong>Optimization for low-cardinality scenarios</strong>: We have fine-tuned the compression algorithm for low-cardinality scenarios, so decompressing and de-serializing large amounts of inverted lists consumes less CPU resources.</li><li><strong>Pre-fetching</strong>: For high-hit-rate scenarios, we adopt pre-fetching. If the hit rate exceeds a certain threshold, Doris will skip the indexing process and start data filtering directly.</li></ol><h3 id="tailored-optimizations-to-olap">Tailored Optimizations to OLAP</h3><p>Log analysis is a simple kind of query with no need for advanced features (e.g. relevance scoring in Apache Lucene). The bread-and-butter capability of a log processing tool is quick queries at low storage cost. Therefore, in Apache Doris, we have streamlined the inverted index structure to meet the needs of an OLAP database.</p><ul><li>In data ingestion, we prevent multiple threads from writing data into the same index, and thus avoid overheads brought by lock contention.</li><li>We discard forward index files and Norm files to clear storage space and reduce I/O overheads.</li><li>We simplify the computation logic of relevance scoring and ranking to further reduce overheads and increase performance.</li></ul><p>In light of the fact that logs are partitioned by time range and historical logs are visited less frequently, we plan to provide more granular and flexible index management in future versions of Apache Doris (see the sketch after this list):</p><ul><li><strong>Create inverted index for a specified data partition</strong>: create an index for logs of the past seven days, etc.</li><li><strong>Delete inverted index for a specified data partition</strong>: delete the index for logs from over one month ago, etc. (so as to clear out index space).</li></ul>
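<p>For reference, today's index management works at the table level. A minimal sketch, assuming a log table named <code>logs</code> with a text column <code>msg</code> (the partition-level variants above are planned, not yet available):</p><pre><code class="language-sql">-- add an inverted index to an existing table
ALTER TABLE logs ADD INDEX idx_msg (`msg`) USING INVERTED PROPERTIES("parser" = "english");

-- drop the index when it is no longer needed
DROP INDEX idx_msg ON logs;</code></pre>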
<h2 id="benchmarking">Benchmarking</h2><p>We tested Apache Doris on publicly available datasets against Elasticsearch and ClickHouse.</p><p>For a fair comparison, we ensured uniform testing conditions, including the benchmarking tool, dataset, and hardware.</p><h3 id="apache-doris-vs-elasticsearch">Apache Doris VS Elasticsearch</h3><p><strong>Benchmarking tool</strong>: ES Rally, the official testing tool for Elasticsearch</p><p><strong>Dataset</strong>: 1998 World Cup HTTP Server Logs (self-contained dataset in ES Rally)</p><p><strong>Data Size (Before Compression)</strong>: 32G, 247 million rows, 134 bytes per row (on average)</p><p><strong>Query</strong>: 11 queries including keyword search, range query, aggregation, and ranking; each query was serially executed 100 times.</p><p><strong>Environment</strong>: 3 × 16C 64G cloud virtual machines</p><ul><li><p><strong>Results of Apache Doris</strong>:</p><ul><li>Writing Speed: 550 MB/s, <strong>4.2 times that of Elasticsearch</strong></li><li>Compression Ratio: 10:1</li><li>Storage Usage: <strong>20% that of Elasticsearch</strong></li><li>Response Time: <strong>43% that of Elasticsearch</strong></li></ul></li></ul><p><img alt="Apache-Doris-VS-Elasticsearch" src="https://cdnd.selectdb.com/zh-CN/assets/images/Inverted_5-d5600afe9c83f8ade57180eaa1104e8e.png"></p><h3 id="apache-doris-vs-clickhouse">Apache Doris VS ClickHouse</h3><p>As ClickHouse launched inverted index as an experimental feature in v23.1, we tested Apache Doris with the same dataset and SQL as described in the ClickHouse <a href="https://clickhouse.com/blog/clickhouse-search-with-inverted-indices" target="_blank" rel="noopener noreferrer">blog</a>, and compared the performance of the two under the same testing resources, cases, and tools.</p><p><strong>Data</strong>: 6.7G, 28.73 million rows, the Hacker News dataset, Parquet format</p><p><strong>Query</strong>: 3 keyword searches, counting the number of occurrences of the keyword "ClickHouse", of "OLAP" OR "OLTP", and of "avx" AND "sve".</p><p><strong>Environment</strong>: 1 × 16C 64G cloud virtual machine</p><p><strong>Result</strong>: Apache Doris was <strong>4.7 times, 12 times, and 18.5 times</strong> faster than ClickHouse in the three queries, respectively.</p><p><img alt="Apache-Doris-VS-ClickHouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/Inverted_6-a3e009f32ae1d0a25a40f257c04b8878.png"></p>
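<p>In Doris, such keyword searches are expressed with the <code>MATCH_ANY</code> / <code>MATCH_ALL</code> predicates (see the Usage section below). A minimal sketch, assuming a table named <code>hackernews</code> with an inverted index on its <code>comment</code> column:</p><pre><code class="language-sql">-- single keyword
SELECT count() FROM hackernews WHERE comment MATCH_ANY 'ClickHouse';
-- "OLAP" OR "OLTP"
SELECT count() FROM hackernews WHERE comment MATCH_ANY 'OLAP OLTP';
-- "avx" AND "sve"
SELECT count() FROM hackernews WHERE comment MATCH_ALL 'avx sve';</code></pre>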
<h2 id="usage--example">Usage &amp; Example</h2><p><strong>Dataset</strong>: one million comment records from Hacker News</p><p><strong>Step 1</strong>: Specify an inverted index for the data table upon table creation.</p><p><strong>Parameters</strong>:</p><ul><li>INDEX idx_comment (<code>comment</code>): create an index named "idx_comment" for the "comment" column</li><li>USING INVERTED: specify inverted index as the index type</li><li>PROPERTIES("parser" = "english"): set the tokenization language to English</li></ul><pre><code class="language-sql">CREATE TABLE hackernews_1m
(
    `id` BIGINT,
    `deleted` TINYINT,
    `type` String,
    `author` String,
    `timestamp` DateTimeV2,
    `comment` String,
    `dead` TINYINT,
    `parent` BIGINT,
    `poll` BIGINT,
    `children` Array&lt;BIGINT&gt;,
    `url` String,
    `score` INT,
    `title` String,
    `parts` Array&lt;INT&gt;,
    `descendants` INT,
    INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES ("replication_num" = "1");</code></pre>
d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Note: You can add index to an existing table via <code>ADD INDEX idx_comment ON hackernews_1m(</code>comment<code>) USING INVERTED PROPERTIES("parser" = "english") </code>. Different from that of smart index and secondary index, the creation of inverted index only involves the reading of the comment column, so it can be much faster.</p><p><strong>Step 2</strong>: Retrieve the words"OLAP" and "OLTP" in the comment column with <code>MATCH_ALL</code>. The response time here was 1/10 of that in hard matching with <code>like</code>. (The performance gap widens as data volume increases.)</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| count() |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 15 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.13 sec)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">mysql&gt; SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| count() |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">| 15 |</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">+---------+</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">1 row in set (0.01 sec)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>For more feature introduction and usage guide, see documentation: <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index/" target="_blank" rel="noopener noreferrer">Inverted Index</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" 
id="wrap-up">Wrap-up<a href="#wrap-up" class="hash-link" aria-label="Wrap-up的直接链接" title="Wrap-up的直接链接"></a></h2><p>In a word, what contributes to Apache Doris' 10-time higher cost-effectiveness than Elasticsearch is its OLAP-tailored optimizations for inverted indexing, supported by the columnar storage engine, massively parallel processing framework, vectorized query engine, and cost-based optimizer of Apache Doris. </p><p>As proud as we are about our own inverted indexing solution, we understand that self-published benchmarks can be controversial, so we are open to <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">feedback</a> from any third-party users and see how <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a> works in real-world cases.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Building a data warehouse for traditional industry]]></title>
<id>https://doris.apache.org/zh-CN/blog/Building-a-Data-Warehouse-for-Traditional-Industry</id>
<link href="https://doris.apache.org/zh-CN/blog/Building-a-Data-Warehouse-for-Traditional-Industry"/>
<updated>2023-05-12T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The best component for you is the one that suits you most. In Midland Realty, we don't have too much data to process but want a data platform easy to use and maintain.]]></summary>
<content type="html"><![CDATA[<p>By Herman Seah, Data Warehouse Planner &amp; Data Analyst at Midland Realty</p><p>This is a part of the digital transformation of a real estate giant. For the sake of confidentiality, I'm not going to reveal any business data, but you'll get a detailed view of our data warehouse and our optimization strategies.</p><p>Now let's get started.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="architecture">Architecture<a href="#architecture" class="hash-link" aria-label="Architecture的直接链接" title="Architecture的直接链接"></a></h2><p>Logically, our data architecture can be divided into four parts.</p><p><img loading="lazy" alt="data-processing-architecture" src="https://cdnd.selectdb.com/zh-CN/assets/images/Midland_1-13321d195f728638c4903bdd51e60ef0.png" width="1280" height="616" class="img_ev3q"></p><ul><li><strong>Data integration</strong>: This is supported by Flink CDC, DataX, and the Multi-Catalog feature of Apache Doris.</li><li><strong>Data management</strong>: We use Apache Dolphinscheduler for script lifecycle management, privileges in multi-tenancy management, and data quality monitoring.</li><li><strong>Alerting</strong>: We use Grafana, Prometheus, and Loki to monitor component resources and logs.</li><li><strong>Data services</strong>: This is where BI tools step in for user interaction, such as data queries and analysis.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-tables">1. <strong>Tables</strong><a href="#1-tables" class="hash-link" aria-label="1-tables的直接链接" title="1-tables的直接链接"></a></h3><p>We create our dimension tables and fact tables centering each operating entity in business, including customers, houses, etc. If there are a series of activities involving the same operating entity, they should be recorded by one field. (This is a lesson learned from our previous chaotic data management system.)</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-layers">2. <strong>Layers</strong><a href="#2-layers" class="hash-link" aria-label="2-layers的直接链接" title="2-layers的直接链接"></a></h3><p>Our data warehouse is divided into five conceptual layers. We use Apache Doris and Apache DolphinScheduler to schedule the DAG scripts between these layers.</p><p><img loading="lazy" alt="ODS-DWD-DWS-ADS-DIM" src="https://cdnd.selectdb.com/zh-CN/assets/images/Midland_2-4d94af927a13961e91486cef3512b47f.png" width="1280" height="729" class="img_ev3q"></p><p>Every day, the layers go through an overall update besides incremental updates in case of changes in historical status fields or incomplete data synchronization of ODS tables.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-incremental-update-strategies">3. <strong>Incremental Update Strategies</strong><a href="#3-incremental-update-strategies" class="hash-link" aria-label="3-incremental-update-strategies的直接链接" title="3-incremental-update-strategies的直接链接"></a></h3><p>(1) Set <code>where &gt;= "activity time -1 day or -1 hour"</code> instead of <code>where &gt;= "activity time</code></p><p>The reason for doing so is to prevent data drift caused by the time gap of scheduling scripts. 
<p>(2) Fetch the largest primary key ID of the table before every script execution, store that ID in an auxiliary table, and set <code>where &gt;= "ID in auxiliary table"</code></p><p>This is to avoid data duplication. Data duplication might happen if you use the Unique Key model of Apache Doris and designate a set of primary keys, because if there are any changes to the primary keys in the source table, the changes will be recorded and the relevant data will be loaded. This method can fix that, but it is only applicable when the source tables have auto-increment primary keys.</p><p>(3) Partition the tables</p><p>As for time-based auto-increment data such as log tables, there might be fewer changes in historical data and status, but the data volume is large, so overall updates and snapshot creation put huge computing pressure on the system. Hence, it is better to partition such tables, so for each incremental update, we only need to replace one partition. (You might need to watch out for data drift, too.)</p><h3 id="4-overall-update-strategies">4. <strong>Overall Update Strategies</strong></h3><p>(1) Truncate Table</p><p>Clear out the table and then ingest all data from the source table into it. This is applicable for small tables and scenarios with no user activity in the wee hours.</p><p>(2) <code>ALTER TABLE tbl1 REPLACE WITH TABLE tbl2</code></p><p>This is an atomic operation, and it is advisable for large tables. Every time before executing a script, we create a temporary table with the same schema, load all data into it, and replace the original table with it, as sketched below.</p>
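<p>A minimal sketch of the atomic-overwrite flow (table names are illustrative):</p><pre><code class="language-sql">-- temporary table with the same schema as the target
CREATE TABLE tbl1_tmp LIKE tbl1;

-- load the full snapshot into the temporary table
INSERT INTO tbl1_tmp SELECT * FROM source_tbl;

-- atomically swap the loaded table in
ALTER TABLE tbl1 REPLACE WITH TABLE tbl1_tmp;</code></pre>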
<h2 id="application">Application</h2><ul><li><strong>ETL job</strong>: every minute</li><li><strong>Configuration for first-time deployment</strong>: 8 nodes, 2 frontends, 8 backends, hybrid deployment</li><li><strong>Node configuration</strong>: 32C, 60GB RAM, 2TB SSD</li></ul><p>This is our configuration for TBs of legacy data and GBs of incremental data. You can use it as a reference and scale your cluster on this basis. Deployment of Apache Doris is simple. You don't need other components.</p><ol><li>To integrate offline data and log data, we use DataX, which supports the CSV format and readers for many relational databases, and Apache Doris provides a DataX-Doris-Writer.</li></ol><p><img alt="DataX-Doris-Writer" src="https://cdnd.selectdb.com/zh-CN/assets/images/Midland_3-d394cef81ce173d944a379f14824f5e6.png"></p><ol start="2"><li>We use Flink CDC to synchronize data from source tables. Then we aggregate the real-time metrics utilizing the Materialized View or the Aggregate Model of Apache Doris. Since we only have to process part of the metrics in a real-time manner and we don't want to generate too many database connections, we use one Flink job to maintain multiple CDC source tables. This is realized by the multi-source merging and full database sync features of Dinky, or you can implement a Flink DataStream multi-source merging task yourself. It is noteworthy that Flink CDC and Apache Doris support Schema Change.</li></ol><pre><code class="language-sql">EXECUTE CDCSOURCE demo_doris WITH (
  'connector' = 'mysql-cdc',
  'hostname' = '127.0.0.1',
  'port' = '3306',
  'username' = 'root',
  'password' = '123456',
  'checkpoint' = '10000',
  'scan.startup.mode' = 'initial',
  'parallelism' = '1',
  'table-name' = 'ods.ods_*,ods.ods_*',
  'sink.connector' = 'doris',
  'sink.fenodes' = '127.0.0.1:8030',
  'sink.username' = 'root',
  'sink.password' = '123456',
  'sink.doris.batch.size' = '1000',
  'sink.sink.max-retries' = '1',
  'sink.sink.batch.interval' = '60000',
  'sink.sink.db' = 'test',
  'sink.sink.properties.format' = 'json',
  'sink.sink.properties.read_json_by_line' = 'true',
  'sink.table.identifier' = '${schemaName}.${tableName}',
  'sink.sink.label-prefix' = '${schemaName}_${tableName}_1'
);</code></pre>
class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol start="3"><li>We use SQL scripts or "Shell + SQL" scripts, and we perform script lifecycle management. At the ODS layer, we write a general DataX job file and pass parameters for each source table ingestion, instead of writing a DataX job for each source table. In this way, we make things much easier to maintain. We manage the ETL scripts of Apache Doris on DolphinScheduler, where we also conduct version control. In case of any errors in the production environment, we can always rollback.</li></ol><p><img loading="lazy" alt="SQL-script" src="https://cdnd.selectdb.com/zh-CN/assets/images/Midland_4-f50219b88be08e1bdf3a7b31c21ae258.png" width="1280" height="625" class="img_ev3q"></p><ol start="4"><li>After ingesting data with ETL scripts, we create a page in our reporting tool. We assign different privileges to different accounts using SQL, including the privilege of modifying rows, fields, and global dictionary. Apache Doris supports privilege control over accounts, which works the same as that in MySQL. </li></ol><p><img loading="lazy" alt="privilege-control-over-accounts" src="https://cdnd.selectdb.com/zh-CN/assets/images/Midland_5-7b83ea92344d586f4de8cd363b7c6357.png" width="1280" height="516" class="img_ev3q"></p><p>We also use Apache Doris data backup for disaster recovery, Apache Doris audit logs to monitor SQL execution efficiency, Grafana+Loki for cluster metric alerts, and Supervisor to monitor the daemon processes of node components.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="optimization">Optimization<a href="#optimization" class="hash-link" aria-label="Optimization的直接链接" title="Optimization的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-data-ingestion">1. Data Ingestion<a href="#1-data-ingestion" class="hash-link" aria-label="1. Data Ingestion的直接链接" title="1. Data Ingestion的直接链接"></a></h3><p>We use DataX to Stream Load offline data. It allows us to adjust the size of each batch. The Stream Load method returns results synchronously, which meets the needs of our architecture. If we execute asynchronous data import using DolphinScheduler, the system might assume that the script has been executed, and that can cause a messup. If you use a different method, we recommend that you execute <code>show load</code> in the shell script, and check the regex filtering status to see if the ingestion succeeds.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-data-model">2. Data Model<a href="#2-data-model" class="hash-link" aria-label="2. Data Model的直接链接" title="2. Data Model的直接链接"></a></h3><p>We adopt the Unique Key model of Apache Doris for most of our tables. The Unique Key model ensures idempotence of data scripts and effectively avoids upstream data duplication. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-reading-external-data">3. Reading External Data<a href="#3-reading-external-data" class="hash-link" aria-label="3. Reading External Data的直接链接" title="3. Reading External Data的直接链接"></a></h3><p>We use the Multi-Catalog feature of Apache Doris to connect to external data sources. 
<h3 id="2-data-model">2. Data Model</h3><p>We adopt the Unique Key model of Apache Doris for most of our tables. The Unique Key model ensures idempotence of data scripts and effectively avoids upstream data duplication.</p><h3 id="3-reading-external-data">3. Reading External Data</h3><p>We use the Multi-Catalog feature of Apache Doris to connect to external data sources. It allows us to create mappings of external data at the Catalog level.</p><h3 id="4-query-optimization">4. Query Optimization</h3><p>We suggest that you put the most frequently used fields of non-character types (such as int fields used in where clauses) in the first 36 bytes of the table's sort columns, so these fields can be filtered within milliseconds in point queries.</p><h3 id="5-data-dictionary">5. Data Dictionary</h3><p>For us, it is important to create a data dictionary because it largely reduces personnel communication costs, which can be a headache when you have a big team. We use the <code>information_schema</code> in Apache Doris to generate a data dictionary. With it, we can quickly grasp the whole picture of the tables and fields and thus increase development efficiency.</p><h2 id="performance">Performance</h2><p><strong>Offline data ingestion time</strong>: Within minutes</p><p><strong>Query latency</strong>: For tables containing over 100 million rows, Apache Doris responds to ad-hoc queries within one second, and to complicated queries within five seconds.</p><p><strong>Resource consumption</strong>: It only takes a small number of servers to build this data warehouse. The 70% compression ratio of Apache Doris saves us lots of storage resources.</p><h2 id="experience-and-conclusion"><strong>Experience and Conclusion</strong></h2><p>Actually, before we evolved into our current data architecture, we tried Hive, Spark, and Hadoop to build an offline data warehouse. It turned out that Hadoop was overkill for a traditional company like us, since we didn't have too much data to process. It is important to find the component that suits you most.</p><p><img alt="old-offline-data-warehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/Midland_6-52e4498a6ab21c3075077b71435e2d28.png"></p><p>(Our old offline data warehouse)</p><p>On the other hand, to smoothen our big data transition, we need to make our data platform as simple as possible in terms of usage and maintenance. That's why we landed on Apache Doris. It is compatible with the MySQL protocol and provides a rich collection of functions, so we don't have to develop our own UDFs. Also, it is composed of only two types of processes, frontends and backends, so it is easy to scale and track.</p><p>Find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>.</p>]]></content>
<author>
<name>Herman Seah</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Zipping up the lambda architecture for 40% faster performance]]></title>
<id>https://doris.apache.org/zh-CN/blog/Zipping-up-the-Lambda-Architecture-for-40-Percent-Faster-Performance</id>
<link href="https://doris.apache.org/zh-CN/blog/Zipping-up-the-Lambda-Architecture-for-40-Percent-Faster-Performance"/>
<updated>2023-05-05T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Instead of pooling real-time and offline data after they are fully ready for queries, Douyu engineers use Apache Doris to share part of the pre-query computation burden.]]></summary>
<content type="html"><![CDATA[<p>Author: Tongyang Han, Senior Data Engineer at Douyu</p><p>The Lambda architecture has been common practice in big data processing. The concept is to separate stream (real time data) and batch (offline data) processing, and that's exactly what we did. These two types of data of ours were processed in two isolated tubes before they were pooled together and ready for searches and queries.</p><p><img loading="lazy" alt="Lambda-architecture" src="https://cdnd.selectdb.com/zh-CN/assets/images/Douyu_1-cfd4fa7607d4bf15315307b50436d676.png" width="1276" height="613" class="img_ev3q"></p><p>Then we run into a few problems:</p><ol><li><strong>Isolation of real-time and offline data warehouses</strong><ol><li>I know this is kind of the essence of Lambda architecture, but that means we could not reuse real-time data since it was not layered as offline data, so further customized development was required.</li></ol></li><li><strong>Complex Pipeline from Data Sources to Data Application</strong><ol><li>Data had to go through multi-step processing before it reached our data users. As our architecture involved too many components, navigating and maintaining these tech stacks was a lot of work.</li></ol></li><li><strong>Lack of management of real-time data sources</strong><ol><li>In extreme cases, this worked like a data silo and we had no way to find out whether the ingested data was duplicated or reusable.</li></ol></li></ol><p>So we decided to "zip up" the Lambda architecture a little bit. By "zipping up", I mean to introduce an OLAP engine that is capable of processing, storing, and analyzing data, so real-time data and offline data converge a little earlier than they used to. It is not a revolution of Lambda, but a minor change in the choice of components, which made our real-time data processing 40% faster.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="zipping-up-lambda-architecture"><strong>Zipping up Lambda Architecture</strong><a href="#zipping-up-lambda-architecture" class="hash-link" aria-label="zipping-up-lambda-architecture的直接链接" title="zipping-up-lambda-architecture的直接链接"></a></h2><p>I am going to elaborate on how this is done using our data tagging process as an example.</p><p>Previously, our offline tags were produced by the data warehouse, put into a flat table, and then written in <strong>HBase</strong>, while real-time tags were produced by <strong>Flink</strong>, and put into <strong>HBase</strong> directly. Then <strong>Spark</strong> would work as the computing engine.</p><p><img loading="lazy" alt="HBase-Redis-Spark" src="https://cdnd.selectdb.com/zh-CN/assets/images/Douyu_2-9cd11673aa896382f99ca957435efd84.png" width="1280" height="602" class="img_ev3q"></p><p>The problem with this stemmed from the low computation efficiency of <strong>Flink</strong> and <strong>Spark</strong>. </p><ul><li><strong>Real-time tag production</strong>: When computing real-time tags that involve data within a long time range, Flink did not deliver stable performance and consumed more resources than expected. 
<p>Instead of using Spark for queries, we parse the query rules into SQL for execution in Apache Doris. For pattern matching, we use Redis to cache the hot data from Apache Doris, so the system can respond to such queries much faster.</p><p><img alt="Real-time-and-offline-data-processing-in-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/Douyu_4-afd928fc30baf4ec825e80ab3638e984.png"></p><h2 id="computational-pipeline-of-wide-time-range-real-time-tags"><strong>Computational Pipeline of Wide-Time-Range Real-Time Tags</strong></h2><p>In some cases, the computation of wide-time-range real-time tags entails the aggregation of historical (offline) data with real-time data. The following figure shows our old computational pipeline for these tags.</p><p><img alt="offline-data-processing-link" src="https://cdnd.selectdb.com/zh-CN/assets/images/Douyu_5-104e16d5c9830069f513dc4c25665bcf.png"></p><p>As you can see, it required multiple tasks to finish computing one real-time tag. Also, in complicated aggregations that involve a collection of aggregation operations, any improper resource allocation could lead to back pressure or a waste of resources. This adds to the difficulty of task scheduling. The maintenance and stability guarantee of such a long pipeline could be an issue, too.</p><p>To improve on that, we decided to move such aggregation workloads to Apache Doris.</p><p><img alt="real-time-data-processing-link" src="https://cdnd.selectdb.com/zh-CN/assets/images/Douyu_6-4243729274c033573acca9a2c621bf45.png"></p><p>We have around 400 million customer tags in our system, and each customer is attached with over 300 tags. We divide customers into more than 10,000 groups, and we have to update 5,000 of them on a daily basis. The above improvement has sped up the computation of our wide-time-range real-time queries by <strong>40%</strong>.</p><h2 id="overwrite">Overwrite</h2><p>To atomically replace data tables and partitions in Apache Doris, we customized the <a href="https://github.com/apache/doris-spark-connector" target="_blank" rel="noopener noreferrer">Doris-Spark-Connector</a> and added an "Overwrite" mode to the Connector.</p><p>When a Spark job is submitted, Apache Doris will call an interface to fetch information about the data tables and partitions.</p><ul><li>If it is a non-partitioned table, we create a temporary table for the target table, ingest data into it, and then perform an atomic replacement. If the data ingestion fails, we clear the temporary table;</li><li>If it is a dynamically partitioned table, we create a temporary partition for the target partition, ingest data into it, and then perform an atomic replacement. If the data ingestion fails, we clear the temporary partition;</li><li>If it is a non-dynamically partitioned table, we need to extend the Doris-Spark-Connector parameter configuration first. Then we create a temporary partition and take the steps above. At the SQL level, the partition swap looks like the sketch that follows.</li></ul>
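<p>A simplified sketch of the underlying statements (table and partition names are illustrative), assuming a table range-partitioned by day:</p><pre><code class="language-sql">-- create a temporary partition covering the same range as the target partition
ALTER TABLE tag_flat_tbl ADD TEMPORARY PARTITION tp_20230505
VALUES [("2023-05-05"), ("2023-05-06"));

-- ...ingest the new data into the temporary partition...

-- atomically replace the formal partition with the temporary one
ALTER TABLE tag_flat_tbl REPLACE PARTITION (p_20230505)
WITH TEMPORARY PARTITION (tp_20230505);</code></pre>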
<h2 id="conclusion">Conclusion</h2><p>One prominent advantage of the Lambda architecture is the stability it provides. However, in our practice, the processing of real-time data and offline data sometimes intertwines. For example, the computation of certain real-time tags requires historical (offline) data. Such interaction becomes a root cause of instability. Thus, instead of pooling real-time and offline data after they are fully ready for queries, we use an OLAP engine to share part of the pre-query computation burden and make things faster, simpler, and more cost-effective.</p>]]></content>
<author>
<name>Tongyang Han</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Step-by-step guide to building a high-performing risk data mart]]></title>
<id>https://doris.apache.org/zh-CN/blog/Step-by-step-Guide-to-Building-a-High-Performing-Risk-Data-Mart</id>
<link href="https://doris.apache.org/zh-CN/blog/Step-by-step-Guide-to-Building-a-High-Performing-Risk-Data-Mart"/>
<updated>2023-04-20T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The key step is to leverage the Multi Catalog feature of Apache Doris to unify the heterogenous data sources. This removed a lot of our performance bottlenecks.]]></summary>
<content type="html"><![CDATA[<p>Pursuing data-driven management at a consumer financing company, we aim to serve four needs in our data platform development: monitoring and alerting, query and analysis, dashboarding, and data modeling. For these purposes, we built our data processing architecture based on Greenplum and CDH. The most essential part of it is the risk data mart. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="risk-data-mart--apache-hive">Risk Data Mart: Apache Hive<a href="#risk-data-mart--apache-hive" class="hash-link" aria-label="Risk Data Mart: Apache Hive的直接链接" title="Risk Data Mart: Apache Hive的直接链接"></a></h2><p>I will walk you through how the risk data mart works following the data flow: </p><ol><li>Our <strong>business data</strong> is imported into <strong>Greenplum</strong> for real-time analysis to generate BI reports. Part of this data also goes into Apache Hive for queries and modeling analysis. </li><li>Our <strong>risk control variables</strong> are updated into <strong>Elasticsearch</strong> in real time via message queues, while Elasticsearch ingests data into Hive for analysis, too.</li><li>The <strong>risk management decision data</strong> is passed from <strong>MongoDB</strong> to Hive for risk control analysis and modeling.</li></ol><p>So these are the three data sources of our risk data mart.</p><p><img loading="lazy" alt="risk-data-mart" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_1-7e8b0a7061d967673ece1d403f03edd3.png" width="826" height="486" class="img_ev3q"></p><p>This whole architecture is built with CDH 6.0. The workflows in it can be divided into real-time data streaming and offline risk analysis.</p><ul><li><strong>Real-time data streaming</strong>: Real-time data from Apache Kafka will be cleaned by Apache Flink, and then written into Elasticsearch. Elasticsearch will aggregate part of the data it receives and send it for reference in risk management. </li><li><strong>Offline risk analysis</strong>: Based on the CDH solution and utilizing Sqoop, we ingest data from Greenplum in an offline manner. Then we put this data together with the third-party data from MongoDB. Then, after data cleaning, we pour all this data into Hive for daily batch processing and data queries.</li></ul><p>To give a brief overview, these are the components that support the four features of our data processing platform:</p><p><img loading="lazy" alt="features-of-a-data-processing-platform" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_2-1880ff586d295ecd43f0731f01124965.png" width="1002" height="606" class="img_ev3q"></p><p>As you see, Apache Hive is central to this architecture. But in practice, it takes minutes for Apache Hive to execute analysis, so our next step is to increase query speed.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="what-are-slowing-down-our-queries">What are Slowing Down Our Queries?<a href="#what-are-slowing-down-our-queries" class="hash-link" aria-label="What are Slowing Down Our Queries?的直接链接" title="What are Slowing Down Our Queries?的直接链接"></a></h3><ol><li><strong>Huge data volume in external tables</strong></li></ol><p>Our Hive-based data mart is now carrying more than 300 terabytes of data. That's about 20,000 tables and 5 million fields. To put them all in external tables is maintenance-intensive. 
Plus, data ingestion can be a big headache.</p><ol start="2"><li><strong>Big flat tables</strong></li></ol><p>Due to the complexity of the rule engine in risk management, our company invests a lot in the derivation of variables. In some dimensions, we have thousands of variables or even more. As a result, a few of the frequently used flat tables in Hive have over 3,000 fields. So you can imagine how time-consuming these queries can be.</p><ol start="3"><li><strong>Unstable interface</strong></li></ol><p>Results produced by daily offline batch processing are regularly sent to our Elasticsearch clusters. (The data volume in these updates is huge, and the interface calls can expire.) This process might cause high I/O and introduce garbage collection jitter, which further leads to unstable interface services.</p><p>In addition, since our risk control analysts and modeling engineers are using Hive with Spark, the expanding data architecture is also dragging down query performance.</p><h2 id="a-unified-query-gateway">A Unified Query Gateway</h2><p>We wanted a unified gateway to manage our heterogeneous data sources. That's why we introduced Apache Doris.</p><p><img alt="unified-query-gateway" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_3-89141f14a59c83d413d14f31fcf386f4.png"></p><p>But doesn't that make things even more complicated? Actually, no.</p><p>We can connect various data sources to Apache Doris and simply conduct queries on it. This is made possible by the <strong>Multi-Catalog</strong> feature of Apache Doris: it can interface with various data sources, including data lakes like Apache Hive, Apache Iceberg, and Apache Hudi, and databases like MySQL, Elasticsearch, and Greenplum. That happens to cover our toolkit.</p><p>We create an Elasticsearch Catalog and a Hive Catalog in Apache Doris. These catalogs map to the external data in Elasticsearch and Hive, so we can conduct federated queries across these data sources using Apache Doris as a unified gateway, as sketched below. Also, we use the <a href="https://github.com/apache/doris-spark-connector" target="_blank" rel="noopener noreferrer">Spark-Doris-Connector</a> to allow data communication between Spark and Doris. So basically, we replace Apache Hive with Apache Doris as the central hub of our data architecture.</p><p><img alt="Apache-Doris-as-center-of-data-architecture" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_4-e6af4e754989aed3aef02a357e7607ad.png"></p>
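<p>A minimal sketch of creating such catalogs (the metastore and Elasticsearch addresses are illustrative; the exact properties depend on your cluster):</p><pre><code class="language-sql">-- map the whole Hive metastore as a catalog
CREATE CATALOG hive PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://127.0.0.1:9083"
);

-- map an Elasticsearch cluster as a catalog
CREATE CATALOG es PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200"
);

-- then query the mapped data like an ordinary database
SELECT * FROM hive.ods.demo_tbl LIMIT 10;</code></pre>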
<p>How does that affect our data processing efficiency?</p><ul><li><strong>Monitoring &amp; Alerting</strong>: This is about real-time data querying. We access our real-time data in the Elasticsearch clusters using the Elasticsearch Catalog in Apache Doris. Then we perform queries directly in Apache Doris. It is able to return results within seconds, as opposed to the minute-level response time when we used Hive.</li><li><strong>Query &amp; Analysis</strong>: As I said, we have 20,000 tables in Hive, so it wouldn't make sense to map all of them to external tables one by one. That would mean a hell of maintenance. Instead, we utilize the Multi-Catalog feature of Apache Doris 1.2. It enables data mapping at the catalog level, so we can simply create one Hive Catalog in Doris and start querying. This separates query operations from the daily batch processing workload in Hive, so there is less resource conflict.</li><li><strong>Dashboarding</strong>: We use Tableau and Doris to provide dashboard services. This reduces the query response time to seconds and milliseconds, compared with the several minutes back in the "Tableau + Hive" days.</li><li><strong>Modeling</strong>: We use Spark and Doris for aggregation modeling. The Spark-Doris-Connector allows mutual synchronization of data, so data from Doris can also be used in modeling for more accurate analysis.</li></ul><h3 id="cluster-monitoring-in-production-environment"><strong>Cluster Monitoring in Production Environment</strong></h3><p>We tested this new architecture in our production environment, where we built two clusters.</p><p><strong>Configuration</strong>:</p><p>Production cluster: 4 frontends + 8 backends, m5d.16xlarge</p><p>Backup cluster: 4 frontends + 4 backends, m5d.16xlarge</p><p>This is the monitoring board:</p><p><img alt="cluster-monitoring-board" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_5-8a88d55e3ac69ac6be859a9c367b0c76.png"></p><p>As is shown, the queries are fast. We expected that it would take at least 10 nodes, but in real cases, we mainly conduct queries via Catalogs, so we can handle this with a relatively small cluster size. The compatibility is good, too. It doesn't disrupt the rest of our existing system.</p><h2 id="guide-to-faster-data-integration">Guide to Faster Data Integration</h2><p>To accelerate the regular data ingestion from Hive to Apache Doris 1.2.2, we have a solution that goes as follows:</p><p><img alt="faster-data-integration" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_6-946a2cf22287a5c16c7fc03d2a3e2c18.png"></p><p><strong>Main components:</strong></p><ul><li>DolphinScheduler 3.1.4</li><li>SeaTunnel 2.1.3</li></ul><p>With our current hardware configuration, we use the Shell script mode of DolphinScheduler and call the SeaTunnel script on a regular basis.
This is the configuration file of the data synchronization tasks:</p><pre><code>env {
  spark.app.name = "hive2doris-template"
  spark.executor.instances = 10
  spark.executor.cores = 5
  spark.executor.memory = "20g"
}
spark {
  spark.sql.catalogImplementation = "hive"
}
source {
  hive {
    pre_sql = "select * from ods.demo_tbl where dt='2023-03-09'"
    result_table_name = "ods_demo_tbl"
  }
}

transform {
}

sink {
  doris {
    fenodes = "192.168.0.10:8030,192.168.0.11:8030,192.168.0.12:8030,192.168.0.13:8030"
    user = root
    password = "XXX"
    database = ods
    table = ods_demo_tbl
    batch_size = 500000
    max_retries = 1
    interval = 10000
    doris.column_separator = "\t"
  }
}</code></pre>
style="color:#F8F8F2"><span class="token plain"> doris.column_separator = "\t"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> }</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>This solution consumes less resources and memory but brings higher performance in queries and data ingestion.</p><ol><li><strong>Less storage costs</strong></li></ol><p><strong>Before</strong>: The original table in Hive had 500 fields. It was divided into partitions by day, with 150 million pieces of data per partition. It takes <strong>810G</strong> to store in HDFS.</p><p><strong>After</strong>: For data synchronization, we call Spark on YARN using SeaTunnel. It can be finished within 40 minutes, and the ingested data only takes up <strong>270G</strong> of storage space.</p><ol><li><strong>Less memory usage &amp; higher performance in queries</strong></li></ol><p><strong>Before</strong>: For a GROUP BY query on the foregoing table in Hive, it occupied 720 Cores and 1.44T in YARN, and took a response time of <strong>162 seconds</strong>. </p><p><strong>After</strong>: We perform an aggregate query using Hive Catalog in Doris, <code>set exec_mem_limit=16G</code>, and receive the result after <strong>58.531 seconds</strong>. 
We also try putting the table in Doris and conducting the same query in Doris itself, which takes only <strong>0.828 seconds</strong>.</p><p>The corresponding statements are as follows:</p><ul><li>Query in Hive, response time: 162 seconds</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select count(*),product_no FROM ods.demo_tbl where dt='2023-03-09'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by product_no;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li>Query in Doris using Hive Catalog, response time: 58.531 seconds</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">set exec_mem_limit=16G;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select count(*),product_no FROM hive.ods.demo_tbl where dt='2023-03-09'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by product_no;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li>Query in Doris directly, response time: 0.828 seconds</li></ul><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select count(*),product_no FROM ods.demo_tbl where dt='2023-03-09'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by product_no;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path 
fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol><li><strong>Faster data ingestion</strong></li></ol><p><strong>Before</strong>: The original table in Hive had 40 fields. It was divided into partitions by day, with 1.1 billion pieces of data per partition. It takes <strong>806G</strong> to store in HDFS.</p><p><strong>After</strong>: For data synchronization, we call Spark on YARN using SeaTunnel. It can be finished within 11 minutes (100 million pieces per minute ), and the ingested data only takes up <strong>378G</strong> of storage space.</p><p><img loading="lazy" alt="faster-data-ingestion" src="https://cdnd.selectdb.com/zh-CN/assets/images/RDM_7-aabcb97d311b9da69a1d8722339b633a.png" width="1280" height="463" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="summary">Summary<a href="#summary" class="hash-link" aria-label="Summary的直接链接" title="Summary的直接链接"></a></h2><p>The key step to building a high-performing risk data mart is to leverage the Multi Catalog feature of Apache Doris to unify the heterogenous data sources. This not only increases our query speed but also solves a lot of the problems coming with our previous data architecture.</p><ol><li>Deploying Apache Doris allows us to decouple daily batch processing workloads with ad-hoc queries, so they don't have to compete for resources. This reduces the query response time from minutes to seconds.</li><li>We used to build our data ingestion interface based on Elasticsearch clusters, which could lead to garbage collection jitter when transferring large batches of offline data. When we stored the interface service dataset on Doris, no jitter was found during data writing and we were able to transfer 10 million rows within 10 minutes.</li><li>Apache Doris has been optimizing itself in many scenarios including flat tables. As far as we know, compared with ClickHouse, Apache Doris 1.2 is twice as fast in SSB-Flat-table benchmark and dozens of times faster in TPC-H benchmark.</li><li>In terms of cluster scaling and updating, we used to suffer from a big window of restoration time after configuration revision. But Doris supports hot swap and easy scaling out, so we can reboot nodes within a few seconds and minimize interruption to users caused by cluster scaling.</li></ol><p>(One last piece of advice for you: If you encounter any problems with deploying Apache Doris, don't hesitate to contact the Doris community for help, they and a bunch of SelectDB engineers will be more than happy to make your adaption journey quick and easy.)</p>]]></content>
<author>
<name>Jacob Chow</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[How We increased database query concurrency by 20 times]]></title>
<id>https://doris.apache.org/zh-CN/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times</id>
<link href="https://doris.apache.org/zh-CN/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times"/>
<updated>2023-04-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[In the upcoming Apache Doris 2.0, we have optimized it for high-concurrency point queries. Long story short, it can achieve over 30,000 QPS for a single node.]]></summary>
<content type="html"><![CDATA[<p>A unified analytic database is the holy grail for data engineers, but what does it look like specifically? It should evolve with the needs of data users.</p><p>Vertically, companies now have an ever enlarging pool of data and expect a higher level of concurrency in data processing. Horizontally, they require a wider range of data analytics services. Besides traditional OLAP scenarios such as statistical reporting and ad-hoc queries, they are also leveraging data analysis in recommender systems, risk control, customer tagging and profiling, and IoT.</p><p>Among all these data services, point queries are the most frequent operations conducted by data users. Point query means to retrieve one or several rows from the database based on the Key. A point query only returns a small piece of data, such as the details of a shopping order, a transaction, a consumer profile, a product description, logistics status, and so on. Sounds easy, right? But the tricky part is, <strong>a database often needs to handle tens of thousands of point queries at a time and respond to all of them in milliseconds</strong>.</p><p>Most current OLAP databases are built with a columnar storage engine to process huge data volumes. They take pride in their high throughput, but often underperform in high-concurrency scenarios. As a complement, many data engineers invite Key-Value stores like Apache HBase for point queries, and Redis as a cache layer to ease the burden. The downside is redundant storage and high maintenance costs.</p><p>Since Apache Doris was born, we have been striving to make it a unified database for data queries of all sizes, including ad-hoc queries and point queries. Till now, we have already taken down the monster of high-throughput OLAP scenarios. In the upcoming Apache Doris 2.0, we have optimized it for high-concurrency point queries. Long story short, it can achieve over 30,000 QPS for a single node. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="five-ways-to-accelerate-high-concurrency-queries">Five Ways to Accelerate High-Concurrency Queries<a href="#five-ways-to-accelerate-high-concurrency-queries" class="hash-link" aria-label="Five Ways to Accelerate High-Concurrency Queries的直接链接" title="Five Ways to Accelerate High-Concurrency Queries的直接链接"></a></h2><p>High-concurrency queries are thorny because you need to handle high loads with limited system resources. That means you have to reduce the CPU, memory and I/O overheads of a single SQL as much as possible. The key is to minimize the scanning of underlying data and follow-up computing. </p><p>Apache Doris uses five methods to achieve higher QPS.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="partioning-and-bucketing">Partioning and Bucketing<a href="#partioning-and-bucketing" class="hash-link" aria-label="Partioning and Bucketing的直接链接" title="Partioning and Bucketing的直接链接"></a></h3><p>Apache Doris shards data into a two-tiered structure: Partition and Bucket. You can use time information as the Partition Key. As for bucketing, you distribute the data into various nodes after data hashing. A wise bucketing plan can largely increase concurrency and throughput in data reading. 
</p><p>This is an example:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select * from user_table where id = 5122 and create_date = '2022-01-01'</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>In this case, the user has set 10 buckets. <code>create_date</code> is the Partition Key and <code>id</code> is the Bucket Key. After dividing the data into partitions and buckets, the system only needs to scan one bucket in one partition before it can locate the needed data. This is a huge time saver.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="index">Index<a href="#index" class="hash-link" aria-label="Index的直接链接" title="Index的直接链接"></a></h3><p>Apache Doris uses various data indexes to speed up data reading and filtering, including smart indexes and secondary indexes. Smart indexes are auto-generated by Doris upon data ingestion, which requires no action from the user's side. </p><p>There are two types of smart indexes:</p><ul><li><strong>Sorted Index</strong>: Apache Doris stores data in an orderly way. It creates a sorted index for every 1024 rows of data. The Key in the index is the value of the sorted column in the first row of the current 1024 rows. If the query involves the sorted column, the system will locate the first row of the relevant 1024-row group and start scanning there.</li><li><strong>ZoneMap Index</strong>: These are indexes on the Segment and Page level. The maximum and minimum values of each column within a Page will be recorded, as are those within a Segment. Hence, in equivalence queries and range queries, the system can narrow down the filter range with the help of the MinMax indexes.</li></ul><p>Secondary indexes are created by users. These include Bloom Filter indexes, Bitmap indexes, <a href="https://doris.apache.org/docs/dev/data-table/index/inverted-index/" target="_blank" rel="noopener noreferrer">Inverted indexes</a>, and <a href="https://doris.apache.org/docs/dev/data-table/index/ngram-bloomfilter-index/" target="_blank" rel="noopener noreferrer">NGram Bloom Filter indexes</a>. 
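</p><p>As a hedged sketch (the index name and target column are illustrative), this is roughly how a user might create such secondary indexes:</p><pre><code class="language-SQL">-- Illustrative: a user-created Bitmap index on an existing table...
CREATE INDEX idx_city ON user_table (city) USING BITMAP;
-- ...and a Bloom Filter index, declared as a table property.
ALTER TABLE user_table SET ("bloom_filter_columns" = "city");
</code></pre><p>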
(If you are interested, I will go into details about them in future articles.)</p><p>Example:</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select * from user_table where id &gt; 10 and id &lt; 1024</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Suppose the user has designated <code>id</code> as the Key during table creation. In that case, the data will be sorted by <code>id</code> in the Memtable and on the disks. So any queries involving <code>id</code> as a filter condition will be executed much faster with the aid of sorted indexes. Specifically, the data in storage will be put into multiple ranges based on <code>id</code>, and the system will implement binary search to locate the exact range according to the sorted indexes. But that could still be a large range since the sorted indexes are sparse. You can further narrow it down based on ZoneMap indexes, Bloom Filter indexes, and Bitmap indexes. </p><p>This is another way to reduce data scanning and improve the overall concurrency of the system.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="materialized-view">Materialized View<a href="#materialized-view" class="hash-link" aria-label="Materialized View的直接链接" title="Materialized View的直接链接"></a></h3><p>The idea of materialized view is to trade space for time: You execute pre-computation with pre-defined SQL statements, and persist the results in a table that is visible to users but occupies some storage space. In this way, Apache Doris can respond much faster to queries for aggregated data, breakdown data, and queries that involve the matching of sorted indexes, once it hits a materialized view. 
This is a good way to lessen computation, improve query performance, and reduce resource consumption.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">// For an aggregation query, the system reads the pre-aggregated columns in the materialized view.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">create materialized view store_amt as select store_id, sum(sale_amt) from sales_records group by store_id;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT store_id, sum(sale_amt) FROM sales_records GROUP BY store_id;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// For a query where k3 matches the sorted column in the materialized view, the system directly performs the query on the materialized view. </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE MATERIALIZED VIEW mv_1 as SELECT k3, k2, k1 FROM tableA ORDER BY k3;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select k1, k2, k3 from tableA where k3=3;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="runtime-filter">Runtime Filter<a href="#runtime-filter" class="hash-link" aria-label="Runtime Filter的直接链接" title="Runtime Filter的直接链接"></a></h3><p>Apart from filtering data by indexes, Apache Doris has a dynamic filtering mechanism: Runtime Filter. </p><p>In multi-table Join queries, the left table is usually called ProbeTable while the right one is called BuildTable, with the former much bigger than the latter. In query execution, firstly, the system reads the right table and creates a HashTable (Build) in the memory. Then, it starts reading the left table row by row, during which it also compares data between the left table and the HashTable and returns the matched data (Probe). </p><p>So what's new about that in Apache Doris? During the creation of HashTable, Apache Doris generates a filter for the columns. It can be a Min/Max filter or an IN filter. Then it pushes down the filter to the left table, which can use the filter to screen out data and thus reduces the amount of data that the Probe node has to transfer and compare. </p><p>This is how the Runtime Filter works. 
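</p><p>To make this concrete, here is a hypothetical join (the table and column names are made up for illustration):</p><pre><code class="language-SQL">-- 'orders' plays the large ProbeTable (left), 'dim_product' the small BuildTable (right).
SELECT o.order_id, p.product_name
FROM orders o
JOIN dim_product p ON o.product_id = p.id
WHERE p.category = 'books';
-- While building the HashTable on dim_product, Doris can derive an IN or
-- Min/Max filter on p.id and push it down to the scan of orders, so rows
-- whose product_id cannot possibly match are skipped before the Probe phase.
</code></pre><p>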
In most Join queries, the Runtime Filter can be automatically pushed down to the most underlying scan nodes or to the distributed Shuffle Join. In other words, Runtime Filter is able to reduce data reading and shorten response time for most Join queries.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="top-n-optimization">TOP-N Optimization<a href="#top-n-optimization" class="hash-link" aria-label="TOP-N Optimization的直接链接" title="TOP-N Optimization的直接链接"></a></h3><p>TOP-N query is a frequent scenario in data analysis. For example, users want to fetch the most recent 100 orders, or the 5 highest/lowest priced products. The performance of such queries determines the quality of real-time analysis. For them, Apache Doris implements TOP-N optimization. Here is how it goes:</p><ol><li>Apache Doris reads the sorted fields and query fields from the Scanner layer, reserves only the TOP-N pieces of data by means of Heapsort, updates the real-time TOP-N results as it continues reading, and dynamically pushes them down to the Scanner. </li><li>Combining the received TOP-N range and the indexes, the Scanner can skip a large proportion of irrelevant files and data chunks and only read a small number of rows.</li><li>Queries on flat tables usually mean the need to scan massive data, but TOP-N queries only retrieve a small amount of data. The strategy here is to divide the data reading process into two stages. In stage one, the system sorts the data based on a few columns (sorted column, or condition column) and locates the TOP-N rows. In stage two, it fetches the TOP-N rows of data after data sorting, and then it retrieves the target data according to the row numbers. </li></ol><p>To sum up, Apache Doris prunes the data that needs to be read and sorted, and thus substantially reduces consumption of I/O, CPU, and memory resources.</p><p>In addition to the foregoing five methods, Apache Doris also improves concurrency through SQL Cache, Partition Cache, and a variety of Join optimization techniques.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="how-we-bring-concurrency-to-the-next-level">How We Bring Concurrency to the Next Level<a href="#how-we-bring-concurrency-to-the-next-level" class="hash-link" aria-label="How We Bring Concurrency to the Next Level的直接链接" title="How We Bring Concurrency to the Next Level的直接链接"></a></h2><p>By adopting the above methods, Apache Doris was able to achieve thousands of QPS per node. However, in scenarios requiring tens of thousands of QPS, it was still bottlenecked by several issues:</p><ul><li>With Doris' columnar storage engine, it was inconvenient to read rows. In flat table models, columnar storage could result in much larger I/O usage.</li><li>The execution engine and query optimizer of OLAP databases were sometimes too complicated for simple queries (point queries, etc.). Such queries needed to be processed with a shorter pipeline, which should be considered in query planning.</li><li>The FE modules of Doris, implemented in Java, were responsible for interfacing with SQL requests and parsing query plans. These processes could produce high CPU overheads in high-concurrency scenarios.</li></ul><p>We optimized Apache Doris to solve these problems. 
(<a href="https://github.com/apache/doris/pull/15491" target="_blank" rel="noopener noreferrer">Pull Request on Github</a>)</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="row-storage-format">Row Storage Format<a href="#row-storage-format" class="hash-link" aria-label="Row Storage Format的直接链接" title="Row Storage Format的直接链接"></a></h3><p>As we know, row storage is much more efficient when the user only queries for a single row of data. So we introduced row storage format in Apache Doris 2.0. Users can enable row storage by specifying the following property in the table creation statement.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">"store_row_column" = "true"</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>We chose JSONB as the encoding format for row storage for three reasons:</p><ul><li><strong>Flexible schema change</strong>: If a user has added or deleted a field, or modified the type of a field, these changes must be updated in row storage in real time. So we choose to adopt the JSONB format and encode columns into JSONB fields. This makes changes in fields very easy.</li><li><strong>High performance</strong>: Accessing rows in row-oriented storage is much faster than doing that in columnar storage, and it requires much less disk access in high-concurrency scenarios. Also, in some cases, you can map the column ID to the corresponding JSONB value so you can quickly access a certain column.</li><li><strong>Less storage space</strong>: JSONB is a compacted binary format. It consumes less space on the disk and is more cost-effective.</li></ul><p>In the storage engine, row storage will be stored as a hidden column (DORIS_ROW_STORE_COL). During Memtable Flush, the columns will be encoded into JSONB and cached into this hidden column. In data reading, the system uses the Column ID to locate the column, finds the target row based on the row number, and then deserializes the relevant columns.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="short-circuit">Short-Circuit<a href="#short-circuit" class="hash-link" aria-label="Short-Circuit的直接链接" title="Short-Circuit的直接链接"></a></h3><p>Normally, an SQL statement is executed in three steps:</p><ol><li>SQL Parser parses the statement to generate an abstract syntax tree (AST).</li><li>The Query Optimizer produces an executable plan.</li><li>Execute the plan and return the results.</li></ol><p>For complex queries on massive data, it is better to follow the plan created by the Query Optimizer. However, for high-concurrency point queries requiring low latency, that plan is not only unnecessary but also brings extra overheads. That's why we implement a short-circuit plan for point queries. 
</p><p><img loading="lazy" alt="short-circuit-plan" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-concurrency_1-90fb6261659bb3cce5ad8076fa9dc65d.png" width="1606" height="778" class="img_ev3q"></p><p>Once the FE receives a point query request, a short-circuit plan will be produced. It is a lightweight plan that involves no equivalent transformation, logic optimization or physical optimization. Instead, it conducts some basic analysis on the AST, creates a fixed plan accordingly, and finds ways to reduce overhead of the optimizer.</p><p>For a simple point query involving primary keys, such as <code>select * from tbl where pk1 = 123 and pk2 = 456</code>, since it only involves one single Tablet, it is better to use a lightweight RPC interface for interaction with the Storage Engine. This avoids the creation of a complicated Fragment Plan and eliminates the performance overhead brought by the scheduling under the MPP query framework.</p><p>Details of the RPC interface are as follows:</p><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">message PTabletKeyLookupRequest {</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> required int64 tablet_id = 1;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> repeated KeyTuple key_tuples = 2;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> optional Descriptor desc_tbl = 4;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> optional ExprList output_expr = 5;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">message PTabletKeyLookupResponse {</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> required PStatus status = 1;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> optional bytes row_batch = 5;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> optional bool empty_batch = 6;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">}</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">rpc tablet_fetch_data(PTabletKeyLookupRequest) returns (PTabletKeyLookupResponse);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><code>tablet_id</code> is calculated based on the primary key column, while 
<code>key_tuples</code> is the string format of the primary key. In this example, <code>key_tuples</code> is similar to ['123', '456']. When BE receives the request, <code>key_tuples</code> will be encoded into primary key storage format. Then, it will locate the corresponding row number of the Key in the Segment File with the help of the primary key index, and check if that row exists in <code>delete bitmap</code>. If it does, the row number will be returned; if not, the system returns NotFound. The returned row number will be used for a point query on <code>__DORIS_ROW_STORE_COL__</code>. That means we only need to locate one row in that column, fetch the original value of the JSONB format, and deserialize it.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="prepared-statement">Prepared Statement<a href="#prepared-statement" class="hash-link" aria-label="Prepared Statement的直接链接" title="Prepared Statement的直接链接"></a></h3><p>In high-concurrency queries, part of the CPU overhead comes from SQL analysis and parsing in FE. To reduce such overhead, in FE, we provide prepared statements that are fully compatible with MySQL protocol. With prepared statements, we can achieve a fourfold performance increase for primary key point queries.</p><p><img loading="lazy" alt="prepared-statement-map" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-concurrency_2-24db356001a7e9c6276202b59ce09d03.png" width="1280" height="775" class="img_ev3q"></p><p>The idea of prepared statements is to cache precomputed SQL and expressions in HashMap in memory, so they can be directly used in queries when applicable.</p><p>Prepared statements adopt MySQL binary protocol for transmission. The protocol is implemented in the mysql_row_buffer.[h|cpp] file, and uses MySQL binary encoding. Under this protocol, the client (for example, JDBC Client) sends a pre-compiled statement to FE via <code>PREPARE</code> MySQL Command. Next, FE will parse and analyze the statement and cache it in the HashMap as shown in the figure above. After that, the client, using <code>EXECUTE</code> MySQL Command, will replace the placeholder, encode it into binary format, and send it to FE. Then, FE will perform deserialization to obtain the value of the placeholder, and generate query conditions.</p><p><img loading="lazy" alt="prepared-statement-execution" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-concurrency_3-78f37932ab22095589cd35e1829eba6b.png" width="1134" height="904" class="img_ev3q"></p><p>Apart from caching prepared statements in FE, we also cache reusable structures in BE. These structures include pre-allocated computation blocks, query descriptors, and output expressions. Serializing and deserializing these structures often cause a CPU hotspot, so it makes more sense to cache them. The prepared statement for each query comes with a UUID named CacheID. 
So when BE executes the point query, it will find the corresponding class based on the CacheID, and then reuse the structure in computation.</p><p>The following example demonstrates how to use a prepared statement in JDBC:</p><ol><li>Set a JDBC URL and enable prepared statement at the server end.</li></ol><div class="language-Bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Bash codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">url = jdbc:mysql://127.0.0.1:9030/ycsb?useServerPrepStmts=true</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ol start="2"><li>Use a prepared statement.</li></ol><div class="language-Java codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-Java codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">// Use `?` as the placeholder and reuse readStatement; setInt takes (parameterIndex, value).</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PreparedStatement readStatement = conn.prepareStatement("select * from tbl_point_query where key = ?");</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">...</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">readStatement.setInt(1, 1234);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ResultSet resultSet = readStatement.executeQuery();</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">...</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">readStatement.setInt(1, 1235);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">resultSet = readStatement.executeQuery();</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">...</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="row-storage-cache">Row Storage Cache<a href="#row-storage-cache" class="hash-link" aria-label="Row Storage Cache的直接链接" title="Row Storage 
Cache的直接链接"></a></h3><p>Apache Doris has a Page Cache feature, where each page caches the data of one column. </p><p><img loading="lazy" alt="page-cache" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-concurrency_4-3a2f6a6b077605a579c5b00560e0cffb.png" width="568" height="540" class="img_ev3q"></p><p>As mentioned above, we have introduced row storage in Doris. The problem with this is, one row of data consists of multiple columns, so in the case of big queries, the cached data might be erased. Thus, we also introduced row cache to increase row cache hit rate.</p><p>Row cache reuses the LRU Cache mechanism in Apache Doris. When the caching starts, the system will initialize a threshold value. If that threshold is hit, the old cached rows will be phased out. For a primary key query statement, the performance gap between cache hit and cache miss can be huge (we are talking about dozens of times less disk I/O and memory access here). So the introduction of row cache can remarkably enhance point query performance.</p><p><img loading="lazy" alt="row-cache" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-concurrency_5-91a7b2adc421fb31c47d89f638adecf6.png" width="468" height="500" class="img_ev3q"></p><p>To enable row cache, you can specify the following configuration in BE:</p><div class="language-JSON codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-JSON codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">disable_storage_row_cache=false // This specifies whether to enable row cache; it is set to false by default.</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">row_cache_mem_limit=20% // This specifies the percentage of row cache in the memory; it is set to 20% by default.</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="benchmark-performance">Benchmark Performance<a href="#benchmark-performance" class="hash-link" aria-label="Benchmark Performance的直接链接" title="Benchmark Performance的直接链接"></a></h2><p>We tested Apache Doris with YCSB (Yahoo! 
Cloud Serving Benchmark) to see how all these optimizations work.</p><p><strong>Configurations and data size:</strong></p><ul><li>Machines: a single 16 Core 64G cloud server with 4×1T hard drives</li><li>Cluster size: 1 Frontend + 2 Backends</li><li>Data volume: 100 million rows of data, with each row taking 1KB to store; preheated</li><li>Table schema and query statement:</li></ul><div class="language-JavaScript codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-JavaScript codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">// Table creation statement:</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE `usertable` (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `YCSB_KEY` varchar(255) NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD0` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD1` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD2` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD3` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD4` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD5` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD6` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD7` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD8` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD9` text NULL</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) ENGINE=OLAP</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">UNIQUE KEY(`YCSB_KEY`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">COMMENT 'OLAP'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH(`YCSB_KEY`) BUCKETS 16</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"replication_allocation" = "tag.location.default: 1",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"in_memory" = "false",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"persistent" = "false",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"storage_format" = "V2",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"enable_unique_key_merge_on_write" = "true",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"light_schema_change" = 
"true",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"store_row_column" = "true",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"disable_auto_compaction" = "false"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">// Query statement:</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * from usertable WHERE YCSB_KEY = ?</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>We run the test with the optimizations (row storage, short-circuit, and prepared statement) enabled, and then did it again with all of them disabled. Here are the results:</p><p><img loading="lazy" alt="performance-before-and-after-concurrency-optimization" src="https://cdnd.selectdb.com/zh-CN/assets/images/high-concurrency_6-6c3bbd4d2c6e6a54db40297c2154a9da.png" width="1280" height="653" class="img_ev3q"></p><p>With optimizations enabled, <strong>the average query latency decreased by a whopping 96%, the 99th percentile latency was only 1/28 of that without optimizations, and it has achieved a query concurrency of over 30,000 QPS.</strong> This is a huge leap in performance and an over 20-time increase in concurrency.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="best-practice">Best Practice<a href="#best-practice" class="hash-link" aria-label="Best Practice的直接链接" title="Best Practice的直接链接"></a></h2><p>It should be noted that these optimizations for point queries are implemented in the Unique Key model of Apache Doris, and you should enable Merge-on-Write and Light Schema Change for this model.</p><p>This is a table creation statement example for point queries:</p><div class="language-undefined codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-undefined codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE TABLE `usertable` (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `USER_KEY` BIGINT NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD0` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD1` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> `FIELD2` text NULL,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 
`FIELD3` text NULL</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) ENGINE=OLAP</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">UNIQUE KEY(`USER_KEY`)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">COMMENT 'OLAP'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">DISTRIBUTED BY HASH(`USER_KEY`) BUCKETS 16</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"enable_unique_key_merge_on_write" = "true",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"light_schema_change" = "true",</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">"store_row_column" = "true"</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">); </span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><strong>Note:</strong></p><ul><li>Enable <code>light_schema_change</code> to support JSONB row storage for encoding ColumnID</li><li>Enable <code>store_row_column</code> to use the row storage format</li></ul><p>For a primary key-based point query like the one below, after table creation, you can use row storage and short-circuit execution to improve performance to a great extent.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select * from usertable where USER_KEY = xxx;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>To further unleash performance, you can apply prepared statements. If you have enough memory space, you can also enable row cache in the BE configuration.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="conclusion">Conclusion<a href="#conclusion" class="hash-link" aria-label="Conclusion的直接链接" title="Conclusion的直接链接"></a></h2><p>In high-concurrency scenarios, Apache Doris realizes over 30,000 QPS per node after optimizations including row storage, short-circuit, prepared statement, and row cache. 
Also, Apache Doris is easily scaled out since it is built on an MPP architecture, and you can scale it up by upgrading the hardware and machine configuration. This is how Apache Doris achieves both high throughput and high concurrency. It allows you to handle a wide range of data analytic workloads on one single platform and enjoy quick data analytics across scenarios. Thanks to the great efforts of the Apache Doris community and a group of excellent SelectDB engineers, Apache Doris 2.0 is about to be released.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.3]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.3</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.3"/>
<updated>2023-03-19T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.2.3 is now available, with several enhancements and bug fixes based on 1.2.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="jdbc-catalog">JDBC Catalog<a href="#jdbc-catalog" class="hash-link" aria-label="JDBC Catalog的直接链接" title="JDBC Catalog的直接链接"></a></h3><ul><li>Support connecting to Doris clusters through JDBC Catalog.</li></ul><p>Currently, Jdbc Catalog only support to use 5.x version of JDBC jar package to connect another Doris database. If you use 8.x version of JDBC jar package, the data type of column may not be matched.</p><p>Reference: <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc/#doris" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc/#doris</a></p><ul><li><p>Support to synchronize only the specified database through the <code>only_specified_database</code> attribute.</p></li><li><p>Support synchronizing table names in the form of lowercase through <code>lower_case_table_names</code> to solve the problem of case sensitivity of table names.</p></li></ul><p>Reference: <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/lakehouse/multi-catalog/jdbc</a></p><ul><li>Optimize the read performance of JDBC Catalog.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="elasticsearch-catalog">Elasticsearch Catalog<a href="#elasticsearch-catalog" class="hash-link" aria-label="Elasticsearch Catalog的直接链接" title="Elasticsearch Catalog的直接链接"></a></h3><ul><li><p>Support Array type mapping.</p></li><li><p>Support whether to push down the like expression through the <code>like_push_down</code> attribute to control the CPU overhead of the ES cluster.</p></li></ul><p>Reference: <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/es" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/lakehouse/multi-catalog/es</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="hive-catalog">Hive Catalog<a href="#hive-catalog" class="hash-link" aria-label="Hive Catalog的直接链接" title="Hive Catalog的直接链接"></a></h3><ul><li><p>Support Hive table default partition <code>_HIVE_DEFAULT_PARTITION_</code>.</p></li><li><p>Hive Metastore metadata automatic synchronization supports notification event in compressed format.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="dynamic-partition-improvement">Dynamic Partition Improvement<a href="#dynamic-partition-improvement" class="hash-link" aria-label="Dynamic Partition Improvement的直接链接" title="Dynamic Partition Improvement的直接链接"></a></h3><ul><li>Dynamic partition supports specifying the <code>storage_medium</code> parameter to control the storage medium of the newly added partition.</li></ul><p>Reference: <a href="https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimize-bes-threading-model">Optimize BE's Threading Model<a href="#optimize-bes-threading-model" class="hash-link" aria-label="Optimize BE's Threading Model的直接链接" title="Optimize BE's Threading Model的直接链接"></a></h3><ul><li>Optimize BE's threading model to avoid stability problems caused by frequent thread creation and destroy.</li></ul><h2 
class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><ul><li><p>Fixed issues with Merge-On-Write Unique Key tables.</p></li><li><p>Fixed compaction related issues.</p></li><li><p>Fixed some delete statement issues causing data errors.</p></li><li><p>Fixed several query execution errors.</p></li><li><p>Fixed the problem of using JDBC catalog to cause BE crash on some operating system.</p></li><li><p>Fixed Multi-Catalog issues.</p></li><li><p>Fixed memory statistics and optimization issues.</p></li><li><p>Fixed decimalV3 and date/datetimev2 related issues.</p></li><li><p>Fixed load transaction stability issues.</p></li><li><p>Fixed light-weight schema change issues.</p></li><li><p>Fixed the issue of using <code>datetime</code> type for batch partition creation.</p></li><li><p>Fixed the problem that a large number of failed broker loads would cause the FE memory usage to be too high.</p></li><li><p>Fixed the problem that stream load cannot be canceled after dropping the table.</p></li><li><p>Fixed querying <code>information_schema</code> timeout in some cases.</p></li><li><p>Fixed the problem of BE crash caused by concurrent data export using <code>select outfile</code>.</p></li><li><p>Fixed transactional insert operation memory leak.</p></li><li><p>Fixed several query/load profile issues, and supports direct download of profiles through FE web ui.</p></li><li><p>Fixed the problem that the BE tablet GC thread caused the IO util to be too high.</p></li><li><p>Fixed the problem that the commit offset is inaccurate in Kafka routine load.</p></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Building the next-generation data lakehouse: 10X performance]]></title>
<id>https://doris.apache.org/zh-CN/blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance</id>
<link href="https://doris.apache.org/zh-CN/blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance"/>
<updated>2023-03-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article explains how to connect various data sources quickly and ensure high query performance.]]></summary>
<content type="html"><![CDATA[<p>A data warehouse was defined by Bill Inmon as "a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions" over 30 years ago. However, the initial data warehouses unable to store massive heterogeneous data, hence the creation of data lakes. In modern times, data lakehouse emerges as a new paradigm. It is an open data management architecture featured by strong data analytics and governance capabilities, high flexibility, and open storage.</p><p>If I could only use one word to describe the next-gen data lakehouse, it would be <strong>unification:</strong></p><ul><li><strong>Unified data storage</strong> to avoid the trouble and risks brought by redundant storage and cross-system ETL.</li><li><strong>Unified governance</strong> of both data and metadata with support for ACID, Schema Evolution, and Snapshot.</li><li><strong>Unified data application</strong> that supports data access via a single interface for multiple engines and workloads.</li></ul><p>Let's look into the architecture of a data lakehouse. We will find that it is not only supported by table formats such as Apache Iceberg, Apache Hudi, and Delta Lake, but more importantly, it is powered by a high-performance query engine to extract value from data.</p><p>Users are looking for a query engine that allows quick and smooth access to the most popular data sources. What they don't want is for their data to be locked in a certain database and rendered unavailable for other engines or to spend extra time and computing costs on data transfer and format conversion.</p><p>To turn these visions into reality, a data query engine needs to figure out the following questions:</p><ul><li>How to access more data sources and acquire metadata more easily?</li><li>How to improve query performance on data coming from various sources?</li><li>How to enable more flexible resource scheduling and workload management?</li></ul><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a> provides a possible answer to these questions. It is a real-time OLAP database that aspires to build itself into a unified data analysis gateway. This means it needs to be easily connected to various RDBMS, data warehouses, and data lake engines (such as Hive, Iceberg, Hudi, Delta Lake, and Flink Table Store) and allow for quick data writing from and queries on these heterogeneous data sources. The rest of this article is an in-depth explanation of Apache Doris' techniques in the above three aspects: metadata acquisition, query performance optimization, and resource scheduling.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="metadata-acquisition-and-data-access">Metadata Acquisition and Data Access<a href="#metadata-acquisition-and-data-access" class="hash-link" aria-label="Metadata Acquisition and Data Access的直接链接" title="Metadata Acquisition and Data Access的直接链接"></a></h2><p>Apache Doris 1.2.2 supports a wide variety of data lake formats and data access from various external data sources. 
Besides, via the Table Value Function, users can analyze files in object storage or HDFS directly.</p><p><img loading="lazy" alt="data-sources-supported-in-data-lakehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_1-3be8649464d93663ca4b474a8ce1d669.png" width="1598" height="882" class="img_ev3q"></p><p>To support multiple data sources, Apache Doris puts efforts into metadata acquisition and data access.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="metadata-acquisition">Metadata Acquisition<a href="#metadata-acquisition" class="hash-link" aria-label="Metadata Acquisition的直接链接" title="Metadata Acquisition的直接链接"></a></h3><p>Metadata consists of information about the databases, tables, partitions, indexes, and files from the data source. Thus, metadata from various data sources comes in different formats and patterns, adding to the difficulty of metadata connection. An ideal metadata acquisition service should include the following:</p><ol><li>A <strong>metadata structure</strong> that can accommodate heterogeneous metadata.</li><li>An <strong>extensible metadata connection framework</strong> that enables quick and low-cost data connection.</li><li>Reliable and <strong>efficient</strong> <strong>metadata access</strong> that supports real-time metadata capture.</li><li><strong>Custom authentication</strong> services to interface with external privilege management systems and thus reduce migration costs. </li></ol><h4 class="anchor anchorWithStickyNavbar_LWe7" id="metadata-structure">Metadata Structure<a href="#metadata-structure" class="hash-link" aria-label="Metadata Structure的直接链接" title="Metadata Structure的直接链接"></a></h4><p>Older versions of Doris supported a two-tiered metadata structure: database and table. As a result, users needed to create mappings for external databases and tables one by one, which was heavy work. Thus, Apache Doris 1.2.0 introduced the Multi-Catalog functionality. With this, you can map to external data at the catalog level, which means:</p><ol><li>You can map to the whole external data source and ingest all metadata from it.</li><li>You can manage the properties of the specified data source at the catalog level, such as connection, privileges, and data ingestion details, and easily handle multiple data sources.</li></ol><p><img loading="lazy" alt="metadata-structure" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_2-c1381100fe5007d6e3ee17ef81245112.png" width="1280" height="650" class="img_ev3q"></p><p>Data in Doris falls into two types of catalogs:</p><ol><li>Internal Catalog: Existing Doris databases and tables all belong to the Internal Catalog.</li><li>External Catalog: This is used to interface with external data sources. For example, HMS External Catalog can be connected to a cluster managed by Hive Metastore, and Iceberg External Catalog can be connected to an Iceberg cluster.</li></ol><p>You can use the <code>SWITCH</code> statement to switch catalogs. You can also conduct federated queries using fully qualified names. 
For example:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT * FROM hive.db1.tbl1 a JOIN iceberg.db2.tbl2 b</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ON a.k1 = b.k1;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>See more details <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">here</a>.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="extensible-metadata-connection-framework"><strong>Extensible Metadata Connection Framework</strong><a href="#extensible-metadata-connection-framework" class="hash-link" aria-label="extensible-metadata-connection-framework的直接链接" title="extensible-metadata-connection-framework的直接链接"></a></h4><p>The introduction of the catalog level also enables users to add new data sources simply by using the <code>CREATE CATALOG</code> statement:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">CREATE CATALOG hive PROPERTIES (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'type'='hms',</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 'hive.metastore.uris' = 'thrift://172.21.0.1:7004'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">);</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>In data lake scenarios, Apache Doris currently supports the following metadata services:</p><ul><li>Hive Metastore-compatible metadata services</li><li>Alibaba Cloud Data Lake Formation</li><li>AWS Glue</li></ul><p>This also paves the way for developers who want to connect to more data sources via External Catalog. 
All they need is to implement the access interface.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="efficient-metadata-access"><strong>Efficient Metadata Access</strong><a href="#efficient-metadata-access" class="hash-link" aria-label="efficient-metadata-access的直接链接" title="efficient-metadata-access的直接链接"></a></h4><p>Access to external data sources is often hindered by network conditions and data resources. This requires extra effort from a data query engine to guarantee reliability, stability, and real-timeliness in metadata access.</p><p><img loading="lazy" alt="metadata-access-Hive-MetaStore" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_3-9539a868abd98e4fa92881ef8c32b38a.png" width="1280" height="472" class="img_ev3q"></p><p>Doris enables high efficiency in metadata access by <strong>Meta Cache</strong>, which includes Schema Cache, Partition Cache, and File Cache. This means that Doris can respond to metadata queries on thousands of tables in milliseconds. In addition, Doris supports manual refresh of metadata at the Catalog/Database/Table level. Meanwhile, it enables auto synchronization of metadata in Hive Metastore by monitoring Hive Metastore events, so any changes can be updated within seconds.</p>
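<p>For instance, a manual metadata refresh can be issued at any of the three levels (a hedged sketch; <code>hive</code> is the example catalog created earlier):</p><pre><code>REFRESH CATALOG hive;        -- re-sync the whole catalog
REFRESH DATABASE hive.db1;   -- one database
REFRESH TABLE hive.db1.tbl1; -- one table</code></pre>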
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="custom-authorization"><strong>Custom Authorization</strong><a href="#custom-authorization" class="hash-link" aria-label="custom-authorization的直接链接" title="custom-authorization的直接链接"></a></h4><p>External data sources usually come with their own privilege management services. Many companies use one single tool (such as Apache Ranger) to provide authorization for their multiple data systems. Doris supports a custom authorization plugin, which can be connected to the user's own privilege management system via the Doris Access Controller interface. As a user, you only need to specify the authorization plugin for a newly created catalog, and then you can readily perform authorization, audit, and data encryption on external data in Doris.</p><p><img loading="lazy" alt="custom-authorization" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_4-1f3438ed8d88966a5352a85b1b479057.png" width="1280" height="568" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="data-access">Data Access<a href="#data-access" class="hash-link" aria-label="Data Access的直接链接" title="Data Access的直接链接"></a></h3><p>Doris supports data access to external storage systems, including HDFS and S3-compatible object storage:</p><p><img loading="lazy" alt="access-to-external-storage-systems" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_5-3407cba9da2ea30688926b5e6ade10f5.png" width="1490" height="610" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="query-performance-optimization">Query Performance Optimization<a href="#query-performance-optimization" class="hash-link" aria-label="Query Performance Optimization的直接链接" title="Query Performance Optimization的直接链接"></a></h2><p>After clearing the way for external data access, the next step for a query engine would be to accelerate data queries. In the case of Apache Doris, efforts are made in data reading, the execution engine, and the optimizer.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="data-reading"><strong>Data Reading</strong><a href="#data-reading" class="hash-link" aria-label="data-reading的直接链接" title="data-reading的直接链接"></a></h3><p>Reading data on remote storage systems is often bottlenecked by access latency, concurrency, and I/O bandwidth, so reducing the reading frequency is a better choice.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="native-file-format-reader"><strong>Native File Format Reader</strong><a href="#native-file-format-reader" class="hash-link" aria-label="native-file-format-reader的直接链接" title="native-file-format-reader的直接链接"></a></h4><p>Improving data reading efficiency entails optimizing the reading of Parquet files and ORC files, which are the most commonly seen data files. Doris has refactored its File Reader, which is fine-tuned for each data format. Take the Native Parquet Reader as an example:</p><ul><li><strong>Reduce format conversion</strong>: It can directly convert files to the Doris storage format or to a format of higher performance using dictionary encoding. </li><li><strong>Smart indexing of finer granularity</strong>: It supports Page Index for Parquet files, so it can utilize Page-level smart indexing to filter Pages. </li><li><strong>Predicate pushdown and late materialization</strong>: It reads the columns with filters first and then reads the other columns of the filtered rows. This remarkably reduces file read volume since it avoids reading irrelevant data.</li><li><strong>Lower read frequency</strong>: Building on the high throughput and low concurrency of remote storage, it combines multiple data reads into one in order to improve overall data reading efficiency.</li></ul><h4 class="anchor anchorWithStickyNavbar_LWe7" id="file-cache">File Cache<a href="#file-cache" class="hash-link" aria-label="File Cache的直接链接" title="File Cache的直接链接"></a></h4><p>Doris caches files from remote storage on local high-performance disks as a way to reduce overhead and increase performance in data reading. In addition, it has developed two new features that make queries on remote files as quick as those on local files:</p><ol><li><strong>Block cache</strong>: Doris supports the block cache of remote files and can automatically adjust the block size from 4KB to 4MB based on the read request. The block cache method reduces read/write amplification and read latency in cold caches.</li><li><strong>Consistent hashing for caching</strong>: Doris applies consistent hashing to manage cache locations and schedule data scanning. By doing so, it prevents cache failures brought about by the onlining and offlining of nodes. It can also increase cache hit rate and query service stability.</li></ol><p><img loading="lazy" alt="file-cache" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_6-ab3642f54deb4b5f57694e0814a891a7.png" width="1080" height="638" class="img_ev3q"></p>
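<p>As a side note, the file cache described above is switched on per session or globally through a session variable (a hedged sketch based on Doris 1.2 behavior; the cache path and size are configured separately on the BE side in <code>be.conf</code>):</p><pre><code>SET GLOBAL enable_file_cache = true;</code></pre>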
<h4 class="anchor anchorWithStickyNavbar_LWe7" id="execution-engine">Execution Engine<a href="#execution-engine" class="hash-link" aria-label="Execution Engine的直接链接" title="Execution Engine的直接链接"></a></h4><p>Developers surely don't want to rebuild all the general features for every new data source. Instead, they hope to reuse the vectorized execution engine and all operators in Doris in the data lakehouse scenario. Thus, Doris has refactored the scan nodes:</p><ul><li><strong>Layer the logic</strong>: All data queries in Doris, including those on internal tables, use the same operators, such as Join, Sort, and Agg. The only difference between queries on internal and external data lies in data access. In Doris, anything above the scan nodes follows the same query logic, while below the scan nodes, the implementation classes will take care of access to different data sources.</li><li><strong>Use a general framework for scan operators</strong>: Even for the scan nodes, different data sources have a lot in common, such as task splitting logic, scheduling of sub-tasks and I/O, predicate pushdown, and Runtime Filter. Therefore, Doris uses interfaces to handle them. Then, it implements a unified scheduling logic for all sub-tasks. The scheduler is in charge of all scanning tasks in the node. With global information of the node in hand, the scheduler is able to do fine-grained management. Such a general framework makes it easy to connect a new data source to Doris, which will only take a week of work for one developer.</li></ul><p><img loading="lazy" alt="execution-engine" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_7-8e726de5b739f4532ff1c8a45055911e.png" width="830" height="844" class="img_ev3q"></p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="query-optimizer">Query Optimizer<a href="#query-optimizer" class="hash-link" aria-label="Query Optimizer的直接链接" title="Query Optimizer的直接链接"></a></h4><p>Doris supports a range of statistical information from various data sources, including Hive Metastore, Iceberg Metafile, and Hudi MetaTable. It has also refined its cost model inference based on the characteristics of different data sources to enhance its query planning capability. </p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="performance">Performance<a href="#performance" class="hash-link" aria-label="Performance的直接链接" title="Performance的直接链接"></a></h4><p>We tested Doris and Presto/Trino on HDFS in flat table scenarios (ClickBench) and multi-table scenarios (TPC-H). Here are the results:</p><p><img loading="lazy" alt="Apache-Doris-VS-Trino-Presto-ClickBench" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_8-3fc3c0c72557dccb1dde18e02b88d155.png" width="1925" height="345" class="img_ev3q"></p><p><img loading="lazy" alt="Apache-Doris-VS-Trino-Presto-TPCH" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_9-07cdca2b93201425f7d2f7647c13c163.png" width="1688" height="421" class="img_ev3q"></p><p>As is shown, with the same computing resources and on the same dataset, Apache Doris takes much less time to respond to SQL queries in both scenarios, delivering 3~10 times higher performance than Presto/Trino.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="workload-management-and-elastic-computing">Workload Management and Elastic Computing<a href="#workload-management-and-elastic-computing" class="hash-link" aria-label="Workload Management and Elastic Computing的直接链接" title="Workload Management and Elastic Computing的直接链接"></a></h2><p>Querying external data sources requires no internal storage of Doris. This makes elastic stateless computing nodes possible. 
Apache Doris 2.0 is going to implement Elastic Compute Node, which is dedicated to supporting query workloads on external data sources.</p><p><img loading="lazy" alt="stateless-compute-nodes" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_10-96454cb74de1a358c1b94af7e9673f24.png" width="1960" height="884" class="img_ev3q"></p><p>Stateless computing nodes are open for quick scaling, so users can easily cope with query workloads during peaks and valleys and strike a balance between performance and cost. In addition, Doris has optimized itself for Kubernetes cluster management and node scheduling. Now Master nodes can automatically manage the onlining and offlining of Elastic Compute Nodes, so users can govern their cluster workloads in cloud-native and hybrid cloud scenarios without difficulty.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="use-case">Use Case<a href="#use-case" class="hash-link" aria-label="Use Case的直接链接" title="Use Case的直接链接"></a></h2><p>Apache Doris has been adopted by a financial institution for risk management. The user has high demands for data timeliness, so their data mart built on Greenplum and CDH, which could only process data from one day earlier, was no longer a great fit. In 2022, they incorporated Apache Doris in their data production and application pipeline, which allowed them to perform federated queries across Elasticsearch, Greenplum, and Hive. A few highlights from the user's feedback include:</p><ul><li>Doris allows them to create one Hive Catalog that maps to tens of thousands of external Hive tables and conduct fast queries on them.</li><li>Doris makes it possible to perform real-time federated queries using Elasticsearch Catalog and achieve a response time of mere milliseconds.</li><li>Doris enables the decoupling of daily batch processing and statistical analysis, bringing less resource consumption and higher system stability.</li></ul><p><img loading="lazy" alt="use-case-of-data-lakehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/Lakehouse_11-486b5f1ff26d1d7f20c70d4c4841e6ee.png" width="1510" height="882" class="img_ev3q"></p><h1>Future Plans</h1><p>Apache Doris is going to support a wider range of data sources, improve its data reading and write-back functionality, and optimize its resource isolation and scheduling.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="more-data-sources">More Data Sources<a href="#more-data-sources" class="hash-link" aria-label="More Data Sources的直接链接" title="More Data Sources的直接链接"></a></h2><p>We are working closely with various open source communities to expand and improve Doris' features in data lake analytics. We plan to provide:</p><ul><li>Support for Incremental Query of Hudi Merge-on-Read tables;</li><li>Lower query latency utilizing the indexing of Iceberg/Hudi in combination with the query optimizer;</li><li>Support for more data lake formats such as Delta Lake and Flink Table Store.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-integration">Data Integration<a href="#data-integration" class="hash-link" aria-label="Data Integration的直接链接" title="Data Integration的直接链接"></a></h2><p> <strong>Data reading:</strong></p><p>Apache Doris is going to:</p><ul><li>Support CDC and Incremental Materialized Views for data lakes in order to provide users with near real-time data views;</li><li>Support a Git-Like data access mode and enable easier and safer data management via the multi-version and Branch mechanisms. 
</li></ul><p><strong>Data Write-Back:</strong></p><p>We are going to enhance Apache Doris' data analysis gateway. In the future, users will be able to use Doris as a unified data management portal that is in charge of the write-back of processed data, the export of data, and the generation of a unified data view.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="resource-isolation--scheduling">Resource Isolation &amp; Scheduling<a href="#resource-isolation--scheduling" class="hash-link" aria-label="Resource Isolation &amp; Scheduling的直接链接" title="Resource Isolation &amp; Scheduling的直接链接"></a></h2><p>Apache Doris is undertaking a wider variety of workloads as it interfaces with more and more data sources. For example, it needs to provide low-latency online services while batch processing T-1 data in Hive. To make this work, resource isolation within the same cluster is critical, which is where efforts will be made.</p><p>Meanwhile, we will continue optimizing the scheduling logic of elastic computing nodes in various scenarios and develop intra-node resource isolation at a finer granularity, such as CPU, I/O, and memory. </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="join-us">Join us<a href="#join-us" class="hash-link" aria-label="Join us的直接链接" title="Join us的直接链接"></a></h2><p>Contact <a href="mailto:dev@apache.doris.org" target="_blank" rel="noopener noreferrer">dev@apache.doris.org</a> to join the Lakehouse SIG (Special Interest Group) in the Apache Doris community and talk to developers from all walks of life.</p><p><strong># Links:</strong></p><p><strong>Apache Doris:</strong></p><p><a href="http://doris.apache.org" target="_blank" rel="noopener noreferrer">http://doris.apache.org</a></p><p><strong>Apache Doris Github:</strong></p><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><p>Find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a>.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Tencent data engineer: why we went from ClickHouse to Apache Doris?]]></title>
<id>https://doris.apache.org/zh-CN/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris</id>
<link href="https://doris.apache.org/zh-CN/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris"/>
<updated>2023-03-07T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Evolution of the data processing architecture of Tencent Music Entertainment towards better performance and simpler maintenance.]]></summary>
<content type="html"><![CDATA[<p><img loading="lazy" alt="Tencent-use-case-of-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME-7ebdc46ff19cf90eaf92e280c1b1f0e4.png" width="900" height="383" class="img_ev3q"></p><p>This article is co-written by me and my colleague Kai Dai. We are both data platform engineers at <a href="https://www.tencentmusic.com/en-us/" target="_blank" rel="noopener noreferrer">Tencent Music</a> (NYSE: TME), a music streaming service provider with a whopping 800 million monthly active users. To drop the number here is not to brag but to give a hint of the sea of data that my poor coworkers and I have to deal with everyday.</p><h1>What We Use ClickHouse For?</h1><p>The music library of Tencent Music contains data of all forms and types: recorded music, live music, audios, videos, etc. As data platform engineers, our job is to distill information from the data, based on which our teammates can make better decisions to support our users and musical partners.</p><p>Specifically, we do all-round analysis of the songs, lyrics, melodies, albums, and artists, turn all this information into data assets, and pass them to our internal data users for inventory counting, user profiling, metrics analysis, and group targeting.</p><p><img loading="lazy" alt="data-pipeline" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_1-73b51a1362dc4f6f1cadbee5d51aaa05.png" width="1280" height="693" class="img_ev3q"></p><p>We stored and processed most of our data in Tencent Data Warehouse (TDW), an offline data platform where we put the data into various tag and metric systems and then created flat tables centering each object (songs, artists, etc.).</p><p>Then we imported the flat tables into ClickHouse for analysis and Elasticsearch for data searching and group targeting.</p><p>After that, our data analysts used the data under the tags and metrics they needed to form datasets for different usage scenarios, during which they could create their own tags and metrics.</p><p>The data processing pipeline looked like this:</p><p><img loading="lazy" alt="data-warehouse-architecture-1.0" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_2-edb671e5b547ca431f4eaa61b59fd2fb.png" width="1280" height="743" class="img_ev3q"></p><h1>The Problems with ClickHouse</h1><p>When working with the above pipeline, we encountered a few difficulties:</p><ol><li><strong>Partial Update</strong>: Partial update of columns was not supported. Therefore, any latency from any one of the data sources could delay the creation of flat tables, and thus undermine data timeliness.</li><li><strong>High storage cost</strong>: Data under different tags and metrics was updated at different frequencies. As much as ClickHouse excelled in dealing with flat tables, it was a huge waste of storage resources to just pour all data into a flat table and partition it by day, not to mention the maintenance cost coming with it.</li><li><strong>High maintenance cost</strong>: Architecturally speaking, ClickHouse was characterized by the strong coupling of storage nodes and compute nodes. Its components were heavily interdependent, adding to the risks of cluster instability. Plus, for federated queries across ClickHouse and Elasticsearch, we had to take care of a huge amount of connection issues. 
That was just tedious.</li></ol><h1>Transition to Apache Doris</h1><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a>, a real-time analytical database, boasts a few features that are exactly what we needed to solve our problems:</p><ol><li><strong>Partial update</strong>: Doris supports a wide variety of data models, among which the Aggregate Model supports real-time partial update of columns (see the sketch after this list). Building on this, we can directly ingest raw data into Doris and create flat tables there. The ingestion goes like this: Firstly, we use Spark to load data into Kafka; then, any incremental data will be updated to Doris and Elasticsearch via Flink. Meanwhile, Flink will pre-aggregate the data so as to relieve the burden on Doris and Elasticsearch.</li><li><strong>Storage cost</strong>: Doris supports multi-table join queries and federated queries across Hive, Iceberg, Hudi, MySQL, and Elasticsearch. This allows us to split the large flat tables into smaller ones and partition them by update frequency. The benefits of doing so include a lighter storage burden and higher query throughput.</li><li><strong>Maintenance cost</strong>: Doris has a simple architecture and is compatible with the MySQL protocol. Deploying Doris only involves two processes (FE and BE) with no dependency on other systems, making it easy to operate and maintain. Also, Doris supports querying external ES data tables. It can easily interface with the metadata in ES and automatically map the table schema from ES, so we can conduct queries on Elasticsearch data via Doris without grappling with complex connections.</li></ol>
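<p>Here is a hedged sketch of what such an Aggregate-Model flat table can look like (table and column names are made up): the <code>REPLACE_IF_NOT_NULL</code> aggregation lets each upstream source update only the columns it owns, leaving the others untouched.</p><pre><code>CREATE TABLE user_profile (
    id       BIGINT,
    tag_a    VARCHAR(64) REPLACE_IF_NOT_NULL NULL,
    metric_b BIGINT      REPLACE_IF_NOT_NULL NULL
)
AGGREGATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES ("replication_num" = "3");</code></pre>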
<p>What’s more, Doris supports multiple data ingestion methods, including batch import from remote storage such as HDFS and S3, data reads from MySQL binlog and Kafka, and real-time data synchronization or batch import from MySQL, Oracle, and PostgreSQL. It ensures service availability and data reliability through a consistency protocol and is capable of auto debugging. This is great news for our operators and maintainers.</p><p>Statistically speaking, these features have cut our storage cost by 42% and development cost by 40%.</p><p>During our usage of Doris, we have received lots of support from the open source Apache Doris community and timely help from the SelectDB team, which is now running a commercial version of Apache Doris.</p><p><img loading="lazy" alt="data-warehouse-architecture-2.0" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_3-877f2cc02538dcf78f20d08c679df9f3.png" width="1280" height="734" class="img_ev3q"></p><h1>Further Improvement to Serve Our Needs</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="introduce-a-semantic-layer">Introduce a Semantic Layer<a href="#introduce-a-semantic-layer" class="hash-link" aria-label="Introduce a Semantic Layer的直接链接" title="Introduce a Semantic Layer的直接链接"></a></h2><p>Speaking of the datasets, on the bright side, our data analysts are given the liberty of redefining and combining the tags and metrics at their convenience. But on the dark side, the high heterogeneity of the tag and metric systems leads to more difficulty in their usage and management.</p><p>Our solution is to introduce a semantic layer in our data processing pipeline. The semantic layer is where all the technical terms are translated into more comprehensible concepts for our internal data users. In other words, we are turning the tags and metrics into first-class citizens for data definition and management.</p><p><img loading="lazy" alt="data-warehouse-architecture-3.0" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_4-f78029c9a317de442e0e00aac053140a.png" width="1280" height="743" class="img_ev3q"></p><p><strong>Why would this help?</strong></p><p>For data analysts, all tags and metrics will be created and shared at the semantic layer, so there will be less confusion and higher efficiency.</p><p>For data users, they no longer need to create their own datasets or figure out which one is applicable for each scenario but can simply conduct queries on their specified tagset and metricset.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-the-semantic-layer">Upgrade the Semantic Layer<a href="#upgrade-the-semantic-layer" class="hash-link" aria-label="Upgrade the Semantic Layer的直接链接" title="Upgrade the Semantic Layer的直接链接"></a></h2><p>Explicitly defining the tags and metrics at the semantic layer was not enough. In order to build a standardized data processing system, our next goal was to ensure consistent definition of tags and metrics throughout the whole data processing pipeline.</p><p>To this end, we made the semantic layer the heart of our data management system:</p><p><img loading="lazy" alt="data-warehouse-architecture-4.0" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_5-69933329bfdc217369664b15c2ec4766.png" width="1280" height="714" class="img_ev3q"></p><p><strong>How does it work?</strong></p><p>All computing logic in TDW will be defined at the semantic layer in the form of a single tag or metric.</p><p>The semantic layer receives logic queries from the application side, selects an engine accordingly, and generates SQL. Then it sends the SQL command to TDW for execution. Meanwhile, it might also send configuration and data ingestion tasks to Doris and decide which metrics and tags should be accelerated.</p><p>In this way, we have made the tags and metrics more manageable. A fly in the ointment is that since each tag and metric is individually defined, we are struggling with automating the generation of a valid SQL statement for the queries. If you have any idea about this, you are more than welcome to talk to us.</p><h1>Give Full Play to Apache Doris</h1><p>As you can see, Apache Doris has played a pivotal role in our solution. Optimizing the usage of Doris can largely improve our overall data processing efficiency. 
So in this part, we are going to share with you what we do with Doris to accelerate data ingestion and queries and reduce costs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-we-want">What We Want?<a href="#what-we-want" class="hash-link" aria-label="What We Want?的直接链接" title="What We Want?的直接链接"></a></h2><p><img loading="lazy" alt="goals-of-a-data-analytic-solution" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_6-d0c8cb8b9a7501650f26ae3018b58b14.png" width="1280" height="444" class="img_ev3q"></p><p>Currently, we have 800+ tags and 1300+ metrics derived from the 80+ source tables in TDW.</p><p>When importing data from TDW to Doris, we hope to achieve:</p><ul><li><strong>Real-time availability:</strong> In addition to the traditional T+1 offline data ingestion, we require real-time tagging.</li><li><strong>Partial update</strong>: Each source table generates data through its own ETL task at various paces and involves only part of the tags and metrics, so we require support for partial update of columns.</li><li><strong>High performance</strong>: We need a response time of only a few seconds in group targeting, analysis, and reporting scenarios.</li><li><strong>Low costs</strong>: We hope to reduce costs as much as possible.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="what-we-do">What We Do?<a href="#what-we-do" class="hash-link" aria-label="What We Do?的直接链接" title="What We Do?的直接链接"></a></h2><ol><li><strong>Generate Flat Tables in Flink Instead of TDW</strong></li></ol><p><img loading="lazy" alt="generate-flat-tables-in-Flink" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_7-6ec720f226a737d5cf91c74a386319b4.png" width="1280" height="567" class="img_ev3q"></p><p>Generating flat tables in TDW has a few downsides:</p><ul><li><strong>High storage cost</strong>: TDW has to maintain an extra flat table apart from the discrete 80+ source tables. That’s huge redundancy.</li><li><strong>Low real-timeliness</strong>: Any delay in the source tables would be amplified and hold up the whole data link.</li><li><strong>High development cost</strong>: To achieve real-timeliness would require extra development efforts and resources.</li></ul><p>On the contrary, generating flat tables in Doris is much easier and less expensive. The process is as follows (sketched in Flink SQL after this list):</p><ul><li>Use Spark to import new data into Kafka in an offline manner.</li><li>Use Flink to consume Kafka data.</li><li>Create a flat table via the primary key ID.</li><li>Import the flat table into Doris.</li></ul>
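<p>A hedged Flink SQL sketch of this Kafka-to-Doris job (connector options follow the Flink-Doris-Connector docs; hosts, names, and the <code>kafka_source</code> table are hypothetical):</p><pre><code>CREATE TABLE doris_sink (
    id       BIGINT,
    tag_a    STRING,
    metric_b BIGINT
) WITH (
    'connector' = 'doris',
    'fenodes' = 'fe_host:8030',
    'table.identifier' = 'db.flat_table',
    'username' = 'root',
    'password' = ''
);

-- pre-aggregate rows sharing the same id before they reach Doris
INSERT INTO doris_sink
SELECT id, LAST_VALUE(tag_a), LAST_VALUE(metric_b)
FROM kafka_source
GROUP BY id;</code></pre>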
<p>As is shown below, Flink has aggregated the five lines of data, of which “ID”=1, into one line in Doris, reducing the data writing pressure on Doris.</p><p><img loading="lazy" alt="flat-tables-in-Flink-2" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_8-c5c8e4d117fb6c1157c42f6ab14829e0.png" width="1280" height="622" class="img_ev3q"></p><p>This can largely reduce storage costs since TDW no longer has to maintain two copies of data and Kafka only needs to store the new data pending ingestion. What’s more, we can add whatever ETL logic we want into Flink and reuse lots of development logic for offline and real-time data ingestion.</p><p><strong>2. Name the Columns Smartly</strong></p><p>As we mentioned, the Aggregate Model of Doris allows partial update of columns. Here we provide a simple introduction to the other data models in Doris for your reference:</p><p><strong>Unique Model</strong>: This is applicable for scenarios requiring primary key uniqueness. It only keeps the latest data of the same primary key ID. (As far as we know, the Apache Doris community is planning to include partial update of columns in the Unique Model, too.)</p><p><strong>Duplicate Model</strong>: This model stores all original data exactly as it is without any pre-aggregation or deduplication.</p><p>After determining the data model, we had to think about how to name the columns. Using the tags or metrics as column names was not an option because:</p><p>I. Our internal data users might need to rename the metrics or tags, but Doris 1.1.3 does not support modification of column names.</p><p>II. Tags might be taken online and offline frequently. If that involves the adding and dropping of columns, it will be not only time-consuming but also detrimental to query performance.</p><p>Instead, we do the following:</p><ul><li><strong>For flexible renaming of tags and metrics</strong>, we use MySQL tables to store the metadata (name, globally unique ID, status, etc.; see the sketch below). Any change to the names will only happen in the metadata but will not affect the table schema in Doris. For example, if a <code>song_name</code> is given an ID of 4, it will be stored with the column name of a4 in Doris. Then if the <code>song_name</code> is involved in a query, it will be converted to a4 in SQL.</li><li><strong>For the onlining and offlining of tags</strong>, we sort out the tags based on how frequently they are being used. The least used ones will be given an offline mark in their metadata. No new data will be put under the offline tags, but the existing data under those tags will still be available.</li><li><strong>For real-time availability of newly added tags and metrics</strong>, we prebuild a few ID columns in Doris tables based on the mapping of name IDs. These reserved ID columns will be allocated to the newly added tags and metrics. Thus, we can avoid table schema changes and the consequent overheads. Our experience shows that only 10 minutes after the tags and metrics are added, the data under them can be available.</li></ul><p>Notably, the recently released Doris 1.2.0 supports Light Schema Change, which means that to add or remove columns, you only need to modify the metadata in FE. Also, you can rename the columns in data tables as long as you have enabled Light Schema Change for the tables. This is a big trouble saver for us.</p>
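<p>For concreteness, here is a hypothetical sketch of the MySQL metadata table behind the renaming scheme above (so a tag named <code>song_name</code> with ID 4 lives in Doris as column a4, and renaming it touches only this table, never the Doris schema):</p><pre><code>CREATE TABLE tag_metadata (
    id     BIGINT PRIMARY KEY, -- globally unique ID; maps to Doris column a4 when id = 4
    name   VARCHAR(128),       -- user-facing tag/metric name, e.g. 'song_name'
    status TINYINT             -- online/offline mark used for the tag lifecycle
);</code></pre>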
<p><strong>3. Optimize Data Writing</strong></p><p>Here are a few practices that have reduced our daily offline data ingestion time by 75% and our CUMU compaction score from 600+ to 100.</p><ul><li>Flink pre-aggregation: as is mentioned above.</li><li>Auto-sizing of writing batch: To reduce Flink resource usage, we enable the data in one Kafka Topic to be written into various Doris tables and realize the automatic alteration of batch size based on the data amount.</li><li>Optimization of Doris data writing: fine-tune the sizes of tablets and buckets as well as the compaction parameters for each scenario:</li></ul><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">max_XXXX_compaction_thread</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">max_cumulative_compaction_num_singleton_deltas</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li>Optimization of the BE commit logic: conduct regular caching of BE lists, commit them to the BE nodes batch by batch, and use finer load balancing granularity.</li></ul><p><img loading="lazy" alt="stable-compaction-score" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_9-f599364617a05d42a19e5430e500d6f7.png" width="1280" height="511" class="img_ev3q"></p><p><strong>4. Use Doris-on-ES in Queries</strong></p><p>About 60% of our data queries involve group targeting. Group targeting means finding our target data by using a set of tags as filters. It poses a few requirements for our data processing architecture:</p><ul><li>Group targeting related to APP users can involve very complicated logic. That means the system must support hundreds of tags as filters simultaneously.</li><li>Most group targeting scenarios only require the latest tag data. However, metric queries need to support historical data.</li><li>Data users might need to perform further aggregated analysis of metric data after group targeting.</li><li>Data users might also need to perform detailed queries on tags and metrics after group targeting.</li></ul><p>After consideration, we decided to adopt Doris-on-ES. Doris is where we store the metric data for each scenario as a partition table, while Elasticsearch stores all tag data. The Doris-on-ES solution combines the distributed query planning capability of Doris and the full-text search capability of Elasticsearch. 
The query pattern is as follows:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT tag, agg(metric) </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> FROM Doris </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> WHERE id in (select id from Es where tagFilter)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> GROUP BY tag</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>As is shown, the ID data located in Elasticsearch will be used in the sub-query in Doris for metric analysis.</p><p>In practice, we find that the query response time is related to the size of the target group. If the target group contains over one million objects, the query will take up to 60 seconds. If it is even larger, a timeout error might occur.</p><p>After investigation, we identified our two biggest time wasters:</p><p>I. When Doris BE pulls data from Elasticsearch (1024 rows at a time by default), for a target group of over one million objects, the network I/O overhead can be huge.</p><p>II. After the data pulling, Doris BE needs to conduct Join operations with local metric tables via SHUFFLE/BROADCAST, which can cost a lot.</p><p><img loading="lazy" alt="Doris-on-Elasticsearch" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_10-b177da2c3e9ab23ad3fb8e1784012442.png" width="1280" height="883" class="img_ev3q"></p><p>Thus, we make the following optimizations:</p><ul><li>Add a query session variable <code>es_optimize</code> that specifies whether to enable optimization.</li><li>When writing data into ES, add a BK column to store the bucket number after the primary key ID is hashed. The algorithm is the same as the bucketing algorithm in Doris (CRC32).</li><li>Use Doris BE to generate a Bucket Join execution plan, dispatch the bucket number to BE ScanNode, and push it down to ES.</li><li>Use ES to compress the queried data; combine multiple data fetches into one to reduce network I/O overhead.</li><li>Make sure that Doris BE only pulls the data of buckets related to the local metric tables and conducts local Join operations directly to avoid data shuffling between Doris BEs.</li></ul><p><img loading="lazy" alt="Doris-on-Elasticsearch-2" src="https://cdnd.selectdb.com/zh-CN/assets/images/TME_11-5ac5f455cdcab0a0b8b1207d61b24afb.png" width="1280" height="924" class="img_ev3q"></p><p>As a result, we reduced the query response time for large group targeting from 60 seconds to a surprising 3.7 seconds.</p><p>Community information shows that Doris is going to support inverted indexing starting from version 2.0.0, which is soon to be released. 
With this new version, we will be able to conduct full-text searches on text types, equivalence or range filtering of texts, numbers, and datetime, and conveniently combine AND, OR, NOT logic in filtering, since the inverted indexing supports array types. This new feature of Doris is expected to deliver 3~5 times better performance than Elasticsearch on the same task.</p><p><strong>5. Refine the Management of Data</strong></p><p>Doris’ capability of cold and hot data separation provides the foundation of our cost reduction strategies in data processing.</p><ul><li>Based on the TTL mechanism of Doris, we only store data of the current year in Doris and put the historical data before that in TDW for lower storage cost.</li><li>We vary the number of copies for different data partitions. For example, we set three copies for data from the recent three months, which is used frequently, one copy for data older than six months, and two copies for data in between.</li><li>Doris supports turning hot data into cold data, so we only store data of the past seven days on SSD and transfer data older than that to HDD for less expensive storage.</li></ul><h1>Conclusion</h1><p>Thank you for scrolling all the way down here and finishing this long read. We’ve shared the cheers and tears of our transition from ClickHouse to Doris, lessons learned, and a few practices that might be of some value to you. We really appreciate the help from the Apache Doris community, but we might still be chasing them around for a while as we attempt to realize auto-identification of cold and hot data, pre-computation of frequently used tags/metrics, simplification of code logic using Materialized Views, and so on and so forth.</p><p><strong># Links</strong></p><p><strong>Apache Doris</strong>:</p><p><a href="http://doris.apache.org" target="_blank" rel="noopener noreferrer">http://doris.apache.org</a></p><p><strong>Apache Doris Github</strong>:</p><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><p>Find Apache Doris developers on <a href="https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2kl08hzc0-SPJe4VWmL_qzrFd2u2XYQA" target="_blank" rel="noopener noreferrer">Slack</a></p>]]></content>
<author>
<name>Jun Zhang &amp; Kai Dai</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Best practice in Duyansoft, improving query speed to make the most out of your data]]></title>
<id>https://doris.apache.org/zh-CN/blog/Improving-Query-Speed-to-Make-the-Most-out-of-Your-Data</id>
<link href="https://doris.apache.org/zh-CN/blog/Improving-Query-Speed-to-Make-the-Most-out-of-Your-Data"/>
<updated>2023-02-27T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This is about how Duyansoft improved its overall data processing efficiency by optimizing the choice and usage of data warehouses.]]></summary>
<content type="html"><![CDATA[<blockquote><p>Author: Junfei Liu, Senior Architect of Duyansoft</p></blockquote><p><img loading="lazy" alt="Duyansoft-use-case-of-Apache-Doris" src="https://cdnd.selectdb.com/zh-CN/assets/images/Duyansoft-338cbc4c47491d4110145175cfa2d0ba.png" width="900" height="383" class="img_ev3q"></p><p>The world is getting more and more value out of data, as exemplified by the currently much-talked-about ChatGPT, which I believe is a robotic data analyst. However, in today’s era, what’s more important than the data itself is the ability to locate your wanted information among all the overflowing data quickly. So in this article, I will talk about how I improved overall data processing efficiency by optimizing the choice and usage of data warehouses.</p><h1>Too Much Data on My Plate</h1><p>The choice of data warehouses was never high on my worry list until 2021. I have been working as a data engineer for a Fintech SaaS provider since its incorporation in 2014. In the company’s infancy, we didn’t have too much data to juggle. We only needed a simple tool for OLTP and business reporting, and the traditional databases would cut the mustard.</p><p><img loading="lazy" alt="data-processing-pipeline-Duyansoft" src="https://cdnd.selectdb.com/zh-CN/assets/images/Duyan_1-be681a0c4e3b94cdca6f6476698be732.png" width="1466" height="590" class="img_ev3q"></p><p>But as the company grew, the data we received became overwhelmingly large in volume and increasingly diversified in sources. Every day, we had tons of user accounts logging in and sending myriads of requests. It was like collecting water from a thousand taps to put out a million scattered pieces of fire in a building, except that you must bring the exact amount of water needed for each fire spot. Also, we got more and more emails from our colleagues asking if we could make data analysis easier for them. That’s when the company assembled a big data team to tackle the beast.</p><p>The first thing we did was to revolutionize our data processing architecture. We used DataHub to collect all our transactional or log data and ingest it into an offline data warehouse for data processing (analyzing, computing. etc.). Then the results would be exported to MySQL and then forwarded to QuickBI to display the reports visually. We also replaced MongoDB with a real-time data warehouse for business queries.</p><p><img loading="lazy" alt="Data-ingestion-ETL-ELT-application" src="https://cdnd.selectdb.com/zh-CN/assets/images/Duyan_2-e87f8780d1f9b74df15d81a94718b378.png" width="1564" height="704" class="img_ev3q"></p><p>This new architecture worked, but there remained a few pebbles in our shoes:</p><ul><li><strong>We wanted faster responses.</strong> MySQL could be slow in aggregating large tables, but our product guys requested a query response time of fewer than five seconds. So first, we tried to optimize MySQL. Then we also tried to skip MySQL and directly connect the offline data warehouse with QuickBI, hoping that the combination of query acceleration capability of the former and caching of the latter would do the magic. Still, that five-second goal seemed to be unreachable. There was a time when I believed the only perfect solution was for the product team to hire people with more patience.</li><li><strong>We wanted less pain in maintaining dimension tables.</strong> The offline data warehouse conducted data synchronization every five minutes, making it not applicable for frequent data updates or deletions scenarios. 
If we needed to maintain dimension tables in it, we would have to filter and deduplicate the data regularly to ensure data consistency. Out of our trouble-averse instinct, we chose not to do so.</li><li><strong>We wanted support for point queries of high concurrency.</strong> The real-time database that we previously used required up to 500ms to respond to highly concurrent point queries in both columnar storage and row storage, even after optimization. That was not good enough.</li></ul><h1>Hit It Where It Hurts Most</h1><p>In March 2022, we started our hunt for a better data warehouse. To our disappointment, there was no one-size-fits-all solution. Most of the tools we looked into were only good at one or a few of the tasks, but if we gathered the best performer for each usage scenario, that would add up to a heavy and messy toolkit, which went against our instincts.</p><p>So we decided to solve our biggest headache first: slow response, as it was hurting both the experience of our users and our internal work efficiency.</p><p>To begin with, we tried to move the largest tables from MySQL to <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">Apache Doris</a>, a real-time analytical database that supports the MySQL protocol. That reduced the query execution time by a factor of eight. Then we went on to use Doris to accommodate more data.</p><p>As of now, we are using two Doris clusters: one to handle point queries (high QPS) from our users and the other for internal ad-hoc queries and reporting. As a result, users have reported a smoother experience, and we can provide more features that used to be bottlenecked by slow query execution. Moving our dimension tables to Doris also brought fewer data errors and higher development efficiency.</p><p><img loading="lazy" alt="Data-ingestion-ETL-ELT-Doris-application" src="https://cdnd.selectdb.com/zh-CN/assets/images/Duyan_3-0abe8037381914932a2d763843a2ed34.png" width="1356" height="864" class="img_ev3q"></p><p>Both the FE and BE processes of Doris can be scaled out, so tens of PBs of data stored in hundreds of devices can be put into one single cluster. In addition, the two types of processes implement a consistency protocol to ensure service availability and data reliability. This removes dependency on Hadoop and thus saves us the cost of deploying Hadoop clusters.</p><h1>Tips</h1><p>Here are a few of our practices to share with you:</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-model"><strong>Data Model:</strong><a href="#data-model" class="hash-link" aria-label="data-model的直接链接" title="data-model的直接链接"></a></h2><p>Out of the three Doris data models, we find that the Unique Model and the Aggregate Model suit our needs most. For example, we use the Unique Model to ensure data consistency while ingesting dimension tables and original tables, and the Aggregate Model to import report data.</p>
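<p>As a hedged illustration (table and column names are made up), a Unique-Model dimension table keeps only the latest row for each key, which is exactly the consistency behavior we rely on:</p><pre><code>CREATE TABLE dim_customer (
    customer_id BIGINT,
    name        VARCHAR(128),
    updated_at  DATETIME
)
UNIQUE KEY (customer_id)
DISTRIBUTED BY HASH(customer_id) BUCKETS 8;</code></pre>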
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-ingestion"><strong>Data Ingestion:</strong></h2><p>For real-time data ingestion, we use the Flink-Doris-Connector: after our business data, the MySQL binlogs, is written into Kafka, it is parsed by Flink and then loaded into Doris in real time.</p><p>For offline data ingestion, we use DataX: this mainly involves the computed report data in our offline data warehouse.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-management"><strong>Data Management:</strong></h2><p>We back up our cluster data to a remote storage system via Broker. When needed, we can restore the data from the backups to any Doris cluster via the restore command.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="monitoring-and-alerting"><strong>Monitoring and Alerting:</strong></h2><p>In addition to the various monitoring metrics of Doris, we deployed an audit log plugin to keep a closer eye on slow SQL queries from certain users for optimization.</p><p>Slow SQL queries:</p><p><img loading="lazy" alt="slow-SQL-queries-monitoring" src="https://cdnd.selectdb.com/zh-CN/assets/images/Duyan_4-4c0a444296e8e0489a68597c56a23c51.png" width="1080" height="437" class="img_ev3q"></p><p>Some of our often-used monitoring metrics:</p><p><img loading="lazy" alt="monitoring-metrics" src="https://cdnd.selectdb.com/zh-CN/assets/images/Duyan_5-fab738ac780df0ae000a0a7238093e35.png" width="1080" height="451" class="img_ev3q"></p><p><strong>Tradeoff Between Resource Usage and Real-Time Availability:</strong></p><p>It turned out that using the Flink-Doris-Connector for data ingestion can result in high cluster resource usage, so we increased the interval between data writes from 3s to 10 or 20s, compromising a little bit on the real-time availability of data in exchange for much lower resource usage.</p>
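<p>For reference, below is a simplified sketch of such a Flink SQL sink. The connection values are placeholders, and the batch/flush option names vary across Flink-Doris-Connector versions, so check the connector documentation for your version:</p><pre><code>-- Flink SQL: sink the binlog stream parsed from Kafka into Doris.
CREATE TABLE doris_sink (
    order_id     BIGINT,
    order_status INT,
    update_time  TIMESTAMP(3)
) WITH (
    'connector' = 'doris',
    'fenodes' = 'fe_host:8030',          -- placeholder FE address
    'table.identifier' = 'db.orders',    -- placeholder target table
    'username' = 'flink_user',
    'password' = '******',
    'sink.batch.interval' = '10s'        -- widened from 3s to cut resource usage
);

INSERT INTO doris_sink
SELECT order_id, order_status, update_time
FROM kafka_binlog_source;  -- assumed upstream Kafka source table</code></pre>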
<h1>Communication with Developers</h1><p>We have been in close contact with the open source Doris community all the way from our investigation to our adoption of the data warehouse, and we’ve provided a few suggestions to the developers:</p><ul><li>Enable the Flink-Doris-Connector to support simultaneous writing of multiple tables in a single sink.</li><li>Enable Materialized Views to support JOINs of multiple tables.</li><li>Optimize the underlying compaction of data and reduce resource usage as much as possible.</li><li>Provide optimization suggestions for slow SQL and warnings for abnormal table creation behaviors.</li></ul><p>If the perfect data warehouse is not there to be found, I think providing feedback for the second best is a way to help make one. We are also looking into its commercialized version, SelectDB, to see if more custom-tailored advanced features can grease the wheels.</p><h1>Conclusion</h1><p>As we set out to find a single data warehouse that could serve all our needs, we ended up finding something less than perfect but good enough to improve our query speed by a wide margin, and we discovered some surprising features of it along the way. So if you are wavering between different choices, you may bet on the one with the thing you want most badly, and taking care of the rest wouldn’t be so hard.</p><p><strong>Try</strong> <a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer"><strong>Apache Doris</strong></a> <strong>out!</strong></p><p>It is an open source real-time analytical database based on MPP architecture. It supports both high-concurrency point queries and high-throughput complex analysis.</p>]]></content>
<author>
<name>Junfei Liu</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.2]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.2</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.2"/>
<updated>2023-02-15T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.2.2 is now available, with several enhancements and bug fixes based on 1.2.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-features">New Features<a href="#new-features" class="hash-link" aria-label="New Features的直接链接" title="New Features的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="lakehouse">Lakehouse<a href="#lakehouse" class="hash-link" aria-label="Lakehouse的直接链接" title="Lakehouse的直接链接"></a></h3><ul><li><p>Support automatic synchronization of Hive metastore.</p></li><li><p>Support reading the Iceberg Snapshot, and viewing the Snapshot history.</p></li><li><p>JDBC Catalog supports PostgreSQL, Clickhouse, Oracle, SQLServer</p></li><li><p>JDBC Catalog supports Insert operation</p></li></ul><p>Reference: <a href="https://doris.apache.org/docs/dev/lakehouse/multi-catalog/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/lakehouse/multi-catalog/</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="auto-bucket">Auto Bucket<a href="#auto-bucket" class="hash-link" aria-label="Auto Bucket的直接链接" title="Auto Bucket的直接链接"></a></h3><p> Set and scale the number of buckets for different partitions to keep the number of tablet in a relatively appropriate range.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-functions">New Functions<a href="#new-functions" class="hash-link" aria-label="New Functions的直接链接" title="New Functions的直接链接"></a></h3><p>Add the new function <code>width_bucket</code>. </p><p>Reference: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/width-bucket/#description" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/width-bucket/#description</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changes">Behavior Changes<a href="#behavior-changes" class="hash-link" aria-label="Behavior Changes的直接链接" title="Behavior Changes的直接链接"></a></h2><ul><li>Disable BE's page cache by default: <code>disable_storage_page_cache=true</code></li></ul><p>Turn off this configuration to optimize memory usage and reduce the risk of memory OOM.
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changes">Behavior Changes</h2><ul><li>Disable BE's page cache by default: <code>disable_storage_page_cache=true</code></li></ul><p>The page cache is now disabled by default to optimize memory usage and reduce the risk of memory OOM, but this may increase the query latency of some small queries. If you are sensitive to query latency, or have high-concurrency, small-query scenarios, you can configure <em>disable_storage_page_cache=false</em> to enable the page cache again.</p>
<ul><li>Add the new session variable <code>group_by_and_having_use_alias_first</code>, used to control whether the GROUP BY and HAVING clauses use aliases.</li></ul><p>Reference: <a href="https://doris.apache.org/docs/dev/advanced/variables" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/advanced/variables</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements</h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="compaction">Compaction</h3><ul><li><p>Support <code>Vertical Compaction</code>, to optimize the compaction overhead and efficiency of wide tables.</p></li><li><p>Support <code>Segment Compaction</code>, to fix the -238 and -235 issues with high-frequency imports.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="lakehouse-1">Lakehouse</h3><ul><li><p>Hive Catalog is compatible with Hive versions 1/2/3.</p></li><li><p>Hive Catalog can access JuiceFS-based HDFS with Broker.</p></li><li><p>Iceberg Catalog supports the Hive Metastore and Rest Catalog types.</p></li><li><p>ES Catalog supports _id column mapping.</p></li><li><p>Optimize Iceberg V2 read performance with a large number of delete rows.</p></li><li><p>Support reading Iceberg tables after Schema Evolution.</p></li><li><p>Parquet Reader handles column name case correctly.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="other">Other</h3><ul><li><p>Support accessing Hadoop KMS-encrypted HDFS.</p></li><li><p>Support canceling an Export task in progress.</p></li><li><p>Optimize the performance of <code>explode_split</code> by 1x.</p></li><li><p>Optimize the read performance of nullable columns by 3x.</p></li><li><p>Optimize some problems of Memtracker, improve memory management accuracy, and optimize memory allocation.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes</h2><ul><li><p>Fixed a memory leak when loading data with the Doris Flink Connector.</p></li><li><p>Fixed a possible thread scheduling problem of BE and reduced the <code>Fragment sent timeout</code> errors caused by BE thread exhaustion.</p></li><li><p>Fixed various correctness and precision issues of the column types datetimev2/decimalv3.</p></li><li><p>Fixed a data correctness issue with the Unique Key Merge-on-Read table.</p></li><li><p>Fixed various known issues with the Light Schema Change feature.</p></li><li><p>Fixed various data correctness issues of the bitmap type Runtime Filter.</p></li><li><p>Fixed the poor reading performance of the csv reader introduced in version 1.2.1.</p></li><li><p>Fixed a BE OOM caused by the Spark Load data download phase.</p></li><li><p>Fixed possible metadata compatibility issues when upgrading from version 1.1 to version 1.2.</p></li>
<li><p>Fixed the metadata problem when creating a JDBC Catalog with Resource.</p></li><li><p>Fixed high CPU usage caused by the load operation.</p></li><li><p>Fixed an FE OOM caused by a large number of failed Broker Load jobs.</p></li><li><p>Fixed precision loss when loading floating-point types.</p></li><li><p>Fixed a memory leak when using 2PC stream load.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="other-1">Other</h2><p>Add metrics to view the total rowset and segment numbers on BE:</p><ul><li>doris_be_all_rowsets_num and doris_be_all_segments_num</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks</h2><p>Thanks to ALL who contributed to this release!</p><p>@adonis0147</p><p>@AshinGau</p><p>@BePPPower</p><p>@BiteTheDDDDt</p><p>@ByteYue</p><p>@caiconghui</p><p>@cambyzju</p><p>@chenlinzhong</p><p>@DarvenDuan</p><p>@dataroaring</p><p>@Doris-Extras</p><p>@dutyu</p><p>@englefly</p><p>@freemandealer</p><p>@Gabriel39</p><p>@HappenLee</p><p>@Henry2SS</p><p>@htyoung</p><p>@isHuangXin</p><p>@JackDrogon</p><p>@jacktengg</p><p>@Jibing-Li</p><p>@kaka11chen</p><p>@Kikyou1997</p><p>@Lchangliang</p><p>@LemonLiTree</p><p>@liaoxin01</p><p>@liqing-coder</p><p>@luozenglin</p><p>@morningman</p><p>@morrySnow</p><p>@mrhhsg</p><p>@nextdreamblue</p><p>@qidaye</p><p>@qzsee</p><p>@spaces-X</p><p>@stalary</p><p>@starocean999</p><p>@weizuo93</p><p>@wsjz</p><p>@xiaokang</p><p>@xinyiZzz</p><p>@xy720</p><p>@yangzhg</p><p>@yiguolei</p><p>@yixiutt</p><p>@Yukang-Lian</p><p>@Yulei-Yang</p><p>@zclllyybb</p><p>@zddr</p><p>@zhangstar333</p><p>@zhannngchen</p><p>@zy-kkk</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[ClickHouse & Kudu to Doris: 10X concurrency increased, 70% latency down]]></title>
<id>https://doris.apache.org/zh-CN/blog/linkedcare</id>
<link href="https://doris.apache.org/zh-CN/blog/linkedcare"/>
<updated>2023-01-28T00:00:00.000Z</updated>
<summary type="html"><![CDATA[The value-added report provided by Linkedcare to customers was initially provided by ClickHouse, which was later replaced by Apache Doris]]></summary>
<content type="html"><![CDATA[<p><img loading="lazy" alt="kv" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-c9c4b972a14903911ba1674b76f5edca.png" width="900" height="383" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="author">Author:<a href="#author" class="hash-link" aria-label="Author:的直接链接" title="Author:的直接链接"></a></h2><p>YiYang, Senior Big Data Developer, Linkedcare</p><h1>About Linkedcare</h1><p>Linkedcare is a leading SaaS software company in the health technology industry, focusing on the medical dental and cosmetic plastic surgery. In 2021, it was selected as one of the top 150 digital healthcare companies in the world by CB Insights. Linkedcare has served thousands of plastic surgery institutions in Los Angeles, Taiwan, and Hong Kong. Linkedcare also provides integrated management system services for dental clinics, covering electronic medical records, customer relationship management, intelligent marketing, B2B trading platform, insurance payment, BI tools, etc.</p><h1>Doris' Evolution in Linkedcare</h1><p>Let me briefly introduce Doris's development in Linkedcare first. In general, the application of Doris in Linkedcare can be divided into two stages:</p><ol><li>The value-added report provided by Linkedcare to customers was initially provided by ClickHouse, which was later replaced by Apache Doris;</li><li>Due to the continuous improvement of real-time data analysis requirements, T+1's data reporting gradually cannot meet business needs. Linkedcare needs a data warehouse that can handle real-time processing, and Doris has been introduced into the company's data warehouse since then. With the support of the Apache Doris community and the SelectDB professional technical team, our business data has been gradually migrated from Kudu to Doris.</li></ol><p><img loading="lazy" alt="1" src="https://cdnd.selectdb.com/zh-CN/assets/images/1-39a723280720a07dc2ed0a7de5c99c9b.png" width="1696" height="866" class="img_ev3q"></p><h1>Data Service Architecture: From ClickHouse to Doris</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-service-architecture-requirements">Data Service Architecture Requirements<a href="#data-service-architecture-requirements" class="hash-link" aria-label="Data Service Architecture Requirements的直接链接" title="Data Service Architecture Requirements的直接链接"></a></h2><ul><li>Support complex queries: When customers do self-service on the dashboard, a complex SQL query statement will be generated to directly query the database, and the complexity of the statement is unknown, which adds a lot of pressure on the database and affects query performance.</li><li>High concurrency and low latency: At least 100 concurrent queries can be supported, and query results can be return within 1 second;</li><li>Real-time data update: The report data comes from the SaaS system. When the customer modifies the historical data in the system, the report data must be changed accordingly to ensure consistentency, which requires real-time processing.</li><li>Low cost and easy deployment: There are a lot of private cloud customers in our SaaS business. 
In order to reduce labor costs, the business requires that deployment, operation, and maintenance of the architecture be simple enough.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="early-problems-found-clickhouse-shuts-down-when-high-concurrency-occurs">Early Problems Found: ClickHouse Shuts Down When High Concurrency Occurs</h2><p>The previous project chose ClickHouse to provide data query services, but serious concurrency problems occurred during use:
10 concurrent queries could cause ClickHouse to shut down and leave us unable to serve customers normally, which was the direct reason for us to replace ClickHouse.</p><p>In addition, there were several other severe problems:</p><ol><li>The cost of ClickHouse services on the cloud is very high, and ClickHouse depends heavily on other components: the frequent interaction between ClickHouse and Zookeeper during data ingestion puts great pressure on stability.</li><li>How to migrate the data seamlessly, without affecting the normal use of customers, was another problem.</li></ol><h2 class="anchor anchorWithStickyNavbar_LWe7" id="selection-between-doris-clickhouse-and-kudu">Selection between Doris, ClickHouse and Kudu</h2><p>To deal with the existing problems and meet the business requirements, we decided to research Doris (0.14), ClickHouse, and Kudu respectively.</p><p><img loading="lazy" alt="2" src="https://cdnd.selectdb.com/zh-CN/assets/images/2-bd04a72816c9ff95512e08d3f6e8e05f.png" width="1600" height="454" class="img_ev3q"></p><p>As shown in the table above, we made a deep comparison of these 3 databases, and we can see that Doris performs excellently in many aspects:</p><ul><li>High concurrency: Doris can handle concurrency of 1,000 queries and more, so it easily solves the problem of the 10 concurrent queries that shut ClickHouse down.</li><li>Query performance: Doris can achieve millisecond-level query response. In single-table queries, Doris and ClickHouse are almost equivalent in query performance, but in multi-table queries, Doris is far better than ClickHouse, and Doris can make sure that the QPS won't drop when concurrency is high.</li><li>Data updates: Doris' data models can meet our needs for data updates and ensure the consistency of system data and business data, which will be described in detail below.</li><li>Ease of use: Doris has a flat architecture, simple and fast deployment, complete data ingestion functions, and good scale-out capability. At the same time, Doris can automatically perform replica balancing internally, and the operation and maintenance cost is extremely low. However, ClickHouse and Kudu rely heavily on external components and require a lot of preparatory work before use, which calls for a professional team to handle a large number of daily operation and maintenance tasks.</li><li>Standard SQL: Doris is compatible with the MySQL protocol and uses standard SQL. It is easy for developers to get started and does not require additional learning costs.</li><li>Distributed JOINs: Doris supports distributed JOINs, while ClickHouse has limitations in JOIN queries and functions as well as poor maintainability.</li><li>Active community: the Apache Doris open source community is highly active. At the same time, SelectDB provides a professional, full-time team for technical support of the Doris community. If you encounter problems, you can directly contact the community and find a solution in time.</li></ul><p>From the above research, we found that Doris has excellent capabilities in all aspects and is very much in line with our needs.
Therefore, we adopted Doris instead of ClickHouse, which solved the problems of poor concurrency and ClickHouse shutting down.</p><h1>Data Warehouse Architecture: From Kudu+Impala to Doris</h1><p>In the process of using Doris for data reports, we gradually discovered many more of its advantages, so we decided to introduce Doris into the company's data warehouse as well.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-warehouse-architecture-requirements">Data Warehouse Architecture Requirements</h2><ul><li>When a customer modifies historical data in the system, the report data should change accordingly. At the same time, there should be a feature that helps customers change the value of a single column;</li><li>When Flink frequently extracts the full amount of data from the business database and writes it into the data warehouse, the version compaction must keep up with the rate of new version generation so that versions do not accumulate;</li><li>Through resource isolation and other functions, Doris should reduce the possibility of resource preemption, improve resource utilization, and make full use of the resources on the core computing nodes;</li><li>Due to the company's limited memory resources, overloaded tasks must be completed without enlarging the clusters.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="early-problems-found-kuduimpala-underperforms">Early Problems Found: Kudu+Impala Underperforms</h2><p>The company's early data warehouse architecture used Kudu and Impala for computing and storage, but we found the following problems during use:</p><ol><li>When the number of concurrent queries (QPS) is large, the response time of a simple query on Kudu+Impala is always more than a few seconds, which cannot reach the millisecond level required by the business. The long waiting time gives customers a bad user experience.</li><li>The Kudu+Impala engine cannot perform incremental aggregation of fact data, and can barely support real-time data analysis.</li><li>Kudu relies on a large number of primary key lookups when ingesting data.
The batch processing efficiency is low, and Kudu consumes a lot of CPU, which is not friendly to resource utilization.</li></ol><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-data-warehouse-architecture-design-based-on-doris">New Data Warehouse Architecture Design Based on Doris</h2><p><img loading="lazy" alt="3" src="https://cdnd.selectdb.com/zh-CN/assets/images/3-e7990ac868e7345d5fda0512b0ec6b8c.png" width="1280" height="690" class="img_ev3q"></p><p>As shown in the figure above, Apache Doris is used in the new architecture and is responsible for data warehouse storage and computing; real-time data and ODS data are now ingested from Kafka by Flink; we use Duckula as our stream computing platform; and we introduced DolphinScheduler for task scheduling.</p><h1>Benefits of the new architecture based on Apache Doris:</h1><ul><li>The new data warehouse architecture based on Doris no longer depends on Hadoop-related components, and the operation and maintenance cost is low;</li><li>Higher performance: Doris uses fewer server resources but provides stronger data processing capabilities;</li><li>Doris supports high concurrency and can directly support WebApp query services;</li><li>Doris supports access to external tables, which enables easy data publishing and data ingestion;</li><li>Doris supports dynamic scaling out and automatic data balancing;</li><li>Doris supports multiple federated queries, including Hive, ES, MySQL, etc.;</li><li>Doris' Aggregate Model supports updating a single column, as sketched below;</li><li>By adjusting BE parameters and cluster size, the problem of version accumulation can be effectively solved;</li><li>Through the Resource Tag and Query Block functions, cluster resource isolation can be realized, resource preemption can be reduced, and query performance can be improved.</li></ul>
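<p>To illustrate the single-column update mentioned above, here is a minimal sketch with made-up table and column names. With the <code>REPLACE_IF_NOT_NULL</code> aggregation, a load that carries NULL for a column leaves the stored value untouched, so one column can be updated by sending NULL for all the others:</p><pre><code>CREATE TABLE patient_profile (
    patient_id BIGINT,
    phone      VARCHAR(32)  REPLACE_IF_NOT_NULL NULL,
    address    VARCHAR(128) REPLACE_IF_NOT_NULL NULL
)
AGGREGATE KEY(patient_id)
DISTRIBUTED BY HASH(patient_id) BUCKETS 8;

-- Update only the phone column: the NULL address is ignored,
-- so the previously stored address stays as it is.
INSERT INTO patient_profile VALUES (1001, '138-0000-0000', NULL);</code></pre>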
<p>Thanks to the excellent capabilities of the new architecture, our cluster has been reduced from 18 nodes of 16 cores and 128G to 12 nodes of 16 cores and 128G, saving up to 33% of resources compared to before. Moreover, the computing performance has been greatly improved: Doris can complete in only 1 hour an ETL task that took 3 hours on Kudu. In addition, under frequent updates, Kudu's internal data fragmentation files cannot be automatically merged, so its performance becomes worse and worse and requires regular rebuilds, while the compaction function of Doris effectively solves this problem.</p><h1>Highly Recommended</h1><p>The cost of using Doris is very low: a data warehouse based on Apache Doris can be easily deployed with only 3 low-end servers or even desktops. For enterprises with limited budgets that do not want to be left behind by the market, it is highly recommended to try Apache Doris.</p><p>Doris is also a mature analytical database with MPP architecture. At the same time, its community is very active and easy to communicate with. SelectDB, the commercial company behind Doris, has set up a full-time technical team for the community, and any question can be answered within 1 hour. In the last year, the community has been continuously promoted by SelectDB and has introduced a series of industry-leading new features. In addition, the community seriously considers user habits when iterating, which brings a lot of convenience.</p><p>I really appreciate the full support from the Doris community and the SelectDB team, and I sincerely recommend that developers and enterprises start with Apache Doris today.</p><h1>Apache Doris</h1><p>Apache Doris is a real-time analytical database based on MPP architecture, known for its high performance and ease of use. It supports both high-concurrency point queries and high-throughput complex analysis. (<a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a>)</p><h1>Links</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="github">GitHub:</h2><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="apache-doris-website">Apache Doris Website:</h2><p><a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">https://doris.apache.org</a></p>]]></content>
<author>
<name>Yi Yang</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[A glimpse of the next-generation analytical database]]></title>
<id>https://doris.apache.org/zh-CN/blog/summit</id>
<link href="https://doris.apache.org/zh-CN/blog/summit"/>
<updated>2023-01-19T00:00:00.000Z</updated>
<summary type="html"><![CDATA[My name is Mingyu Chen and I am the PMC Chair of the Apache Doris.In this lecture, you will go through the development of Doris in 2022 and look into the new trends that Doris is exploring in 2023.]]></summary>
<content type="html"><![CDATA[<h1>Self-Intro</h1><p>Hello everyone, welcome to the Doris Summit 2022, the first summit of Apache Doris since it was open-sourced. In this lecture, you will go through the development of Doris in 2022 and look into the new trends that Doris is exploring in 2023. My name is Mingyu Chen and I am the PMC Chair of the Apache Doris. I have been developing for Doris since 2014, and witnessed its whole process from open-source to graduation from Apache. My sharing will cover the following aspects. Let's get started.</p><p>As the beginning, I will briefly introduce what Doris is and why we should choose Doris in case you are new to Apache Doris. In 2022, Doris has became one of the most active open-sourced big data analysis engine projects in the world while the Doris community became one of the most active open-source communities in China, which you may get interested in. Moreover, the cutting-edge features, such as vectorized execution engine, cloud-native and efficient semi-structured data analysis, real-time processing and Lakehouse will be the focus of my lecture today. Also, it is important to prioritize tasks at the beginning of the year, so I will go through our job list with you shortly.</p><h1>About Doris</h1><p>Briefly speaking, Apache Doris is an easy-to-use, high-performance and unified analytical database. As shown in this enterprise data flow chart, you may have a clear vision of where Apache Doris stands. Data from various upstream data sources, such as transactional databases, log systems, event tracking, etc., as well as data from ETL components, such as Flink, Spark and Hive is ingested into Doris through data processing and integration tools. </p><p><img loading="lazy" alt="flow" src="https://cdnd.selectdb.com/zh-CN/assets/images/flow-6f2b3b515642bdf16bbdba8ecf2749cd.png" width="1988" height="1102" class="img_ev3q"></p><p>As a fully-complete database system, Doris can provide various direct query functions including report analysis, multi-dimensional analysis, log analysis, user portrait and lakehouse, etc. Thanks to Doris' MPP SQL distributed query engine. Doris can also be used to query external data sources from Hive, Iceberg, Hudi, Elasticsearch and various transactional database systems connected through JDBC, without data import and maintaining the schema of other data sources. There are several core features that can help users solve practical problems, which are as follows:</p><ul><li>NO.1 is the ease of use. It supports ANSI SQL syntax, including single table aggregation, sorting, filtering and multi table join, sub query, etc. It also supports complex SQL syntax such as window function and grouping sets. At the same time, users can expand system functions through UDF, UDAF. In addition, Apache Doris is also compatible with MySQL protocol, which allows users access Doris through various BI tools. </li><li>NO.2 is high performance. Doris is equipped with an efficient column storage engine, which not only reduces the amount of data scanning, but also implements an ultra-high data compression ratio. At the same time, Doris also uses various index technology to speed up data reading and filtering. Using the partition and bucket pruning function, Doris can support ultra-high concurrency of online service business, and a single node can support up to thousands of QPS. Further, Apache Doris combines the vectorized execution engine to give full play to the modern CPU parallel computing power. Doris supports materialized view. 
In terms of the optimizer, Doris uses a combination of CBO and RBO, with the RBO supporting constant folding, subquery rewriting, predicate pushdown, etc.</li><li>NO.3 is the unified data warehouse. Thanks to its well-designed architecture, Doris can easily handle both low-latency, high-concurrency scenarios and high-throughput scenarios.</li><li>NO.4 is federated query analysis. With the help of Doris's complete distributed query engine, Doris can access data lakes such as Hive, Iceberg and Hudi, as well as run high-speed queries against external data sources such as Elasticsearch and MySQL.</li><li>NO.5 is the rich ecosystem. Doris provides rich data ingestion methods, supports fast loading of data from local files, Hadoop, Flink, Spark, Kafka, SeaTunnel and other systems, and can also directly access data in MySQL, PostgreSQL, Oracle, S3, Hive, Iceberg, Elasticsearch and other systems without data replication. At the same time, the data stored in Doris can also be read by Spark and Flink, and can be output to upstream data applications for display and analysis.</li></ul>
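<p>As a quick, hypothetical illustration of that SQL surface (the table and columns below are made up), both window functions and grouping sets run as plain SQL:</p><pre><code>-- Rank products inside each region by revenue.
SELECT region, product,
       SUM(revenue) AS total,
       RANK() OVER (PARTITION BY region ORDER BY SUM(revenue) DESC) AS rk
FROM sales
GROUP BY region, product;

-- Aggregate at several granularities in one pass.
SELECT region, product, SUM(revenue) AS total
FROM sales
GROUP BY GROUPING SETS ((region, product), (region), ());</code></pre>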
<p>Next, we will review the remarkable achievements the Doris community made in 2022.</p><h1>How should we look back on 2022?</h1><p>In 2022, the world witnessed unprecedented changes, and countless magical moments happened in reality. Thankfully, the power of technology and open source has navigated us to the right path. And 2022 was absolutely a fruitful year for Apache Doris. Let's review the development of Apache Doris in the past year from several angles:</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="important-indicators-of-the-community">Important Indicators of the Community</h2><p><img loading="lazy" alt="community" src="https://cdnd.selectdb.com/zh-CN/assets/images/community-cac05d1c3c2c9671816abed6c46b608e.png" width="1994" height="1106" class="img_ev3q"></p><p>In the past year:</p><ul><li>The number of cumulative community contributors has increased from 200 to nearly 420, a year-on-year increase of more than 100%, and it is still rising.</li><li>The number of monthly active contributors has doubled from 50 to 100.</li><li>The number of GitHub Stars has increased from 3.6k to 6.8k, and the project has been on the daily/weekly/monthly GitHub Trending lists many times.</li><li>The number of all commits increased from 3.7k to 7.6k. The amount of newly submitted code in the past year exceeded the total of all previous years.</li></ul><p><img loading="lazy" alt="community 2" src="/images/summit/en/community%202.png" class="img_ev3q"></p><p>From these numbers, we can see that Apache Doris grew explosively in 2022: the data indicators of all dimensions grew by nearly 100%. This great effort has also made Apache Doris one of the most active open-source communities in the big data and database world. As shown in the GitHub contribution trend above, users and developers have made tremendous contributions to the community. It is memorable that in June 2022, Apache Doris graduated from the Apache incubator and became a Top-Level Project, which is the biggest milestone since it was open-sourced.</p><p><img loading="lazy" alt="top level" src="/images/summit/en/top%20level.png" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="open-source-user-scale">Open Source User Scale</h2><p>Thanks to the voluntary technical support from the developers of SelectDB, a commercial company funding Apache Doris, user connection and communication became smoother in 2022, and we were able to interact with users more directly and listen to their real voices.
Last year, Apache Doris was applied in dozens of industries, such as the Internet, fintech, telecommunications, education, automobiles, manufacturing, logistics, energy, and government affairs, and especially in the Internet industry, which is known for massive data. 80% of the top 50 Chinese Internet companies have been using Apache Doris for a long time to solve data analysis problems in their own business, including Baidu, Meituan, Xiaomi, Tencent, JD.com, ByteDance, NetEase, Sina, 360 Total Security, MiHoYo, ZHIHU.COM, etc.</p><p><img loading="lazy" alt="logo wall" src="/images/summit/en/logo%20wall.png" class="img_ev3q"></p><p>Globally, Apache Doris has served thousands of enterprise users, and this number is still growing rapidly. Most enterprise users are glad to contact the community and participate in community building through various means. Moreover, many of the enterprise users participated in Doris Summit, giving lectures on their own practical experience based on real business.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="releases">Releases</h2><p>In the early versions, ease of use was frequently emphasized. The versions released in 2022 mainly focused on performance, stability and ease of use: a comprehensive evolution.</p><ul><li>In April, the community released Apache Doris V1.0.0, whose major version number changed from 0 to 1 (V0.15 to V1.0.0) for the first time since it was open-sourced. In version 1.0, the extraordinary vectorized execution engine was first released, marking the beginning of Apache Doris' era of ultra-high-speed data analysis.</li><li>In version 1.1, released in June, we further improved and optimized the vectorized engine and set it as the default. Simultaneously, the community prepared monthly LTS (Long-Term Support) releases for version 1.1 to quickly fix bugs and optimize functions, aiming to ensure the higher stability required by the growing community of users.</li><li>Launched in early December, version 1.2 not only introduces many important functions, such as Merge-on-Write for the Unique Key model, Multi-Catalog, Java UDF, the Array type and the JSONB type, but also improves query performance by nearly ten times. These features make Apache Doris adaptable to even more data analysis scenarios.</li><li>In version 1.2, stability and quality assurance were strongly stressed. On the one hand, using automated testing tools such as SQL Smith and test cases from various well-known open source projects, we built test sets of millions of cases; on the other hand, the community access pipeline and a complete regression testing framework ensure the quality of merged code.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="evolution-of-core-features">Evolution of Core Features</h2><p>In 2022, the community's research and development was mainly focused on four aspects: high performance, real-time processing, semi-structured data support and lakehouse.</p><p><img loading="lazy" alt="2022" src="https://cdnd.selectdb.com/zh-CN/assets/images/2022-6e5870d078188886bd8841f5b7d5a213.png" width="1984" height="1094" class="img_ev3q"></p><ul><li>Query performance improvement. From version 1.0 to version 1.2, Apache Doris has made remarkable achievements in performance.
In the single-table benchmark, Apache Doris won 3rd place on the ClickBench database performance list launched by ClickHouse. In multi-table association, thanks to the vectorized execution engine and various query optimizations, Apache Doris was 10 times faster on the SSB and TPC-H standard test data sets than version 0.15, released at the end of 2021, which makes Apache Doris one of the best-performing databases in the world!</li></ul><p><img loading="lazy" alt="performance" src="https://cdnd.selectdb.com/zh-CN/assets/images/performance-a6fb3c1bf21f90b1b47e354cb6c8cba7.png" width="1986" height="1102" class="img_ev3q"></p><ul><li>Real-time processing optimization. In version 1.2, we implemented the Merge-on-Write data update method on the original Unique Key model, which improves query performance by 5-10 times under high-frequency updates and keeps latency low on updatable data in real-time analytics. In addition, the lightweight Schema Change makes adding and dropping columns easier, so users no longer need to convert historical data. Tools such as Flink CDC can be used to instantly synchronize DML and DDL operations from transactional databases, making data synchronization smoother and unified.</li></ul><p><img loading="lazy" alt="realtime" src="https://cdnd.selectdb.com/zh-CN/assets/images/realtime-c64e2770a88f60046d6acab06dde6395.png" width="1986" height="1106" class="img_ev3q"></p><ul><li>Semi-structured data analysis. At present, Apache Doris supports the Array and JSONB types. The Array type can not only store complex data structures but also support user behavior analysis through Array functions. JSONB is a binary JSON storage type, which not only has 4 times faster access performance than text JSON but also lower memory consumption. Various log data structures in JSON format can be ingested efficiently as JSONB.</li></ul><p><img loading="lazy" alt="semi" src="https://cdnd.selectdb.com/zh-CN/assets/images/semi-837be10c6126d78041e7523fa4d5c1a6.png" width="1986" height="1106" class="img_ev3q"></p>
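<p>A short sketch of the two types in action (the table and field names are made up, and function availability depends on your exact version):</p><pre><code>CREATE TABLE user_events (
    user_id BIGINT,
    tags    ARRAY&lt;STRING&gt;,
    payload JSONB
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8;

-- Filter on an array element and pull one field out of the JSONB payload.
SELECT user_id,
       jsonb_extract_string(payload, '$.device') AS device
FROM user_events
WHERE array_contains(tags, 'paid');</code></pre>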
<ul><li>Lakehouse. In version 1.2.0, through multiple performance optimizations for external data sources, such as a native format reader, late materialization, asynchronous IO, data prefetching, the high-performance execution engine and the query optimizer, Apache Doris can easily access external data sources such as Hive, Iceberg and Hudi, and the access speed is 3-5 times faster than Trino/Presto and 10-100 times faster than Hive.</li></ul><p><img loading="lazy" alt="lakehouse" src="https://cdnd.selectdb.com/zh-CN/assets/images/lakehouse-22ce36b17d89df8c92000e7d19dd8115.png" width="1982" height="1096" class="img_ev3q"></p><h1>2023 RoadMap</h1><p>In 2023, the Apache Doris community will dive deep into new feature development, as you can see from the 2023 RoadMap and the specific plan for next year below:</p><p><img loading="lazy" alt="roadmap" src="https://cdnd.selectdb.com/zh-CN/assets/images/roadmap-acc9c114672fd18e24dbb95de8e5412a.png" width="1986" height="1108" class="img_ev3q"></p><p>In 2023, we will start iterating the Apache Doris 2.x versions on a quarterly basis. At the same time, for each two-digit version, bug fixes and upgrades will be done on a monthly basis. From a functional point of view, the follow-up research and development will focus on the following main directions:</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="high-performance">High Performance</h2><p>High performance is a goal that Apache Doris constantly pursues. Doris' excellent performance on public test datasets such as ClickBench and TPC-H has proved that it is industry-leading. In the future, we will further enhance performance, including:</p><ul><li>More complex SQL: the new query optimizer will be available in the first quarter of 2023. It combines the RBO and CBO strategies, supports complex queries more efficiently, and can fully execute all 99 SQLs of TPC-DS.</li><li>Higher-concurrency point queries: high concurrency has always been what Apache Doris is good at, and in 2023 we will further strengthen this capability through a series of features such as Short-Circuit Plan, Prepare Statement and Query Cache, to support ultra-high concurrency of 10,000 QPS on a single node, with scaling out for even higher concurrency.</li><li>More flexible multi-table materialized views: in previous versions, Apache Doris accelerated the analysis of fixed-dimensional data through strengthened single-table materialized views. The new multi-table materialized view will decouple the lifecycle of the base tables and the MV table, so Doris can easily handle multi-table JOINs and the pre-calculated acceleration of more complex SQL queries, with asynchronous refresh and flexible incremental calculation methods. This feature will be available in the first quarter of 2023.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="cost-effective">Cost-effective</h2><p>Cost efficiency is the key for enterprises to win market competition, which is true for databases as well. In the past, Apache Doris helped users greatly save computing and storage resources with many ease-of-use designs. In the future, we will introduce a series of cloud-native capabilities to further reduce costs without affecting business efficiency, including:</p><ul><li>Lower storage costs: we will explore the combination of object storage systems and file systems on the cloud to help users further reduce storage costs, including better separation of hot and cold data and migration of cold data to cheaper object storage or file systems. By combining technologies such as a single remote replica, a cold data cache, and hot &amp; cold data conversion, we can ensure that query efficiency is not affected while storage costs are cut. This feature will be released in the first quarter of 2023.</li><li>More elastic computing resources: we plan to separate storage and compute and adopt Elastic Compute Nodes for computing. Since they store no data, Elastic Compute Nodes have faster elastic scaling capabilities, which makes it convenient for users to quickly scale out during peak business periods and further improves analysis efficiency for massive data computing such as lakehouse analysis. This function will be released shortly.</li></ul>
</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="hybrid-workload">Hybrid Workload<a href="#hybrid-workload" class="hash-link" aria-label="Hybrid Workload的直接链接" title="Hybrid Workload的直接链接"></a></h2><p>Lots of users nowadays are building a unified analysis platform within the enterprise based on Apache Doris. On the one hand, Apache Doris is required to execute larger-scale data processing and analysis. On the other hand, Apache Doris is also required to deal with more analytical load challenges, such as real-time reports and Ad-hoc to ELT/ETL, log retrieval and more unified analysis. In order to better adapt to these cases, new features are about to be released in 2023, which include:</p><ul><li>Pipeline execution engine: Compared with the traditional volcano model, the Pipeline model does not need to set the concurrency manually, but instead, it can do parallel computing between different pipelines, making full usage of CPUs and is more flexible in execution scheduling, which improves the overall performance under mixed load cases.</li><li>Workload Manager: It is also urgent to improve resource isolation and division capabilities. Based on the Pipeline execution engine, we will launch features such as flexible load management, resource queues, and isolation in shared services to balance query performance and stability in various mixed load cases.</li><li>Lightweight fault tolerance: It can not only take advantage of the high efficiency of MPP structure but also tolerate errors to better adapt to the challenges of users in ETL/ELT.</li><li>Function compatibility and UDF in multiple languages: At the same time, we will be more compatible with Hive/Trino/Spark function and support multiple UDF in the future to help users process data more flexibly. And data migration to Apache Doris will be easier than before.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="multi-model-data-analysis">Multi-model Data Analysis<a href="#multi-model-data-analysis" class="hash-link" aria-label="Multi-model Data Analysis的直接链接" title="Multi-model Data Analysis的直接链接"></a></h2><p>In the past, Apache Doris was quite good at structured data analysis. As the demand for semi-structured and unstructured data analysis increased, we added Array and JSONB types from version 1.2 to support these data types naturally. In the future release, we will continue providing more cost-effective and better-performance solutions for log analysis cases, including:</p><ul><li>Richer complex data types: In addition to Array/JSONB types, we will increase support for Map/Struct types in the first quarter of 2023, including efficient writing, storage, analysis functions to better perform multi-model data analysis. In the future, more data types will be supported, such as IP and GEO geographic information, and more time series data.</li><li>More efficient text analysis algorithms: For text data, we will introduce text analysis algorithms, including adaptive Like, high-performance substring matching, high-performance regular matching, predicate pushdown of Like statements, Ngram Bloomfilter, etc. The full-text search is based on the inverted index and it provides higher performance and is more cost-effective in analysis compared with that of Elasticsearch in the log analysis. These features will come out in early 2023.
<li>Dynamic Schema table: in other databases, the schema is relatively static, and DDL needs to be executed manually when the schema changes. In recent cases, table structures change all the time, so we plan to launch the Dynamic Table, which automatically adapts the schema to the data being written, without manual DDL execution. This feature will be released in the first quarter of 2023.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="lakehouse">Lakehouse</h2><p>With the development of data lake technology, analysis performance has become the biggest constraint on mining value from data. Building analysis services on top of data lakes with an easy-to-use, high-performance query engine has become a new trend. In the last year, through many performance optimizations for the data lake, the high-performance execution engine and the query optimizer, Apache Doris has become extremely fast and easy to use on the data lake, with performance 3-5 times higher than that of Presto/Trino. In 2023, we will continue to go deeper, including:</p><ul><li>Easier data access: in version 1.2, we released Multi-Catalog, which supports automatic metadata mapping and synchronization of multiple heterogeneous data sources and is used for accessing data lakes. Delta Lake, Iceberg and Hudi will be better supported.</li><li>More complete data lake capabilities: we will provide incremental updates and queries of data on the data lake. Analysis results can be written back to the data lake, and data from external tables can be ingested into internal tables. At the same time, Doris will also support reading and deleting multi-version Snapshots, as well as materialized views.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-and-storage-engine-optimization">Real-time and storage engine optimization</h2><p>The value of data decreases over time, so real-time performance is very important to users. The Merge-on-Write data update in version 1.2 allows Apache Doris to be fast in both real-time updates and queries. In 2023, we will upgrade the storage engine with the following:</p><ul><li>More stable data writing: through a series of compaction operations and batch processing optimizations, resource costs can be saved, and through a new memory management framework, the stability of the writing process will be improved.</li><li>A more mature data-updating mechanism: in the past, column updates were implemented through REPLACE_IF_NOT_NULL on the Aggregate model. In the future, we will add support for partial column updates on the Unique Key model, as well as data updates such as Delete, Update, and Merge.</li><li>A unified data model: currently, the three data models of Apache Doris are widely used in various cases. In the future, we will try to unify the existing data models to provide a better user experience.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ease-of-use-and-stability">Ease of use and stability</h2><p>In addition to improving functions, simplicity, ease of use and stability are also goals that Apache Doris has been pursuing.
In 2023, we will dive deeper into the following:</p><ul><li>Simplified table creation: currently, Apache Doris already supports time functions in table partitioning. In the future, we will further simplify bucket settings to help users build models easily.</li><li>Security: at present, a permission management mechanism based on the RBAC model has been launched, which makes user permissions more secure and reliable. Functions such as ID federation, row- and column-level permissions, and data masking will be further improved in the future.</li><li>Observability: the Profile is an important means of locating query performance problems. In the future, we will strengthen the monitoring of Profiles and provide visualized Profile tools to help users locate problems faster.</li><li>Better BI compatibility and data migration solutions: currently, various BI tools can connect to Apache Doris through the MySQL protocol, and we will further adapt mainstream BI software in the future to ensure a better query experience. With the rise of emerging data integration and migration tools such as DBT and Airbyte, more and more users synchronize data to Apache Doris in this way, so we will provide better support for these users in the future.</li></ul><h1>How to join the community</h1><p>Last but not least, we hope that more developers can participate in the community to jointly create a powerful database. There are 3 ways to participate in the community. First of all, you can subscribe to our developer mailing list at <a href="mailto:dev@doris.apache.org" target="_blank" rel="noopener noreferrer">dev@doris.apache.org</a>, which is the way recommended by the Apache Way as well. You can send any related topics that you want to discuss with the community there. Secondly, you can reach out to us virtually at the developers' biweekly meeting, held on Wednesdays at 8pm (UTC+8); the topics cover new features, disclosure and development progress and more. Thirdly, the DSIP. DSIP is short for Doris Improvement Proposal. All important Doris feature designs are recorded in this document, and both users and developers can follow the detailed design and development of important functions on this wiki.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="links">Links:</h2><p>Apache Doris Repository</p><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><p>Apache Doris Website</p><p><a href="https://doris.apache.org" target="_blank" rel="noopener noreferrer">https://doris.apache.org</a></p>]]></content>
<author>
<name>Mingyu Chen</name>
</author>
<category label="Top News" term="Top News"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.1]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.1</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.1"/>
<updated>2023-01-04T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.2.1 is now available, with several enhancements and bug fixes based on 1.2.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="supports-new-type-decimalv3">Supports new type DecimalV3<a href="#supports-new-type-decimalv3" class="hash-link" aria-label="Supports new type DecimalV3的直接链接" title="Supports new type DecimalV3的直接链接"></a></h3><p>DecimalV3, which supports higher precision and better performance, has the following advantages over past versions.</p><ul><li><p>Larger representable range, the range of values are significantly expanded, and the valid number range <!-- -->[1,38]<!-- -->.</p></li><li><p>Higher performance, adaptive adjustment of the storage space occupied according to different precision.</p></li><li><p>More complete precision derivation support, for different expressions, different precision derivation rules are applied to the accuracy of the result.</p></li></ul><p><a href="https://doris.apache.org/docs/2.0/sql-manual/sql-reference/Data-Types/DECIMAL" target="_blank" rel="noopener noreferrer">DecimalV3</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-iceberg-v2">Support Iceberg V2<a href="#support-iceberg-v2" class="hash-link" aria-label="Support Iceberg V2的直接链接" title="Support Iceberg V2的直接链接"></a></h3><p>Support Iceberg V2 (only Position Delete is supported, Equality Delete will be supported in subsequent versions).</p><p>Tables in Iceberg V2 format can be accessed through the Multi-Catalog feature.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-or-condition-to-in">Support OR condition to IN<a href="#support-or-condition-to-in" class="hash-link" aria-label="Support OR condition to IN的直接链接" title="Support OR condition to IN的直接链接"></a></h3><p>Support converting OR condition to IN condition, which can improve the execution efficiency in some scenarios.<a href="https://github.com/apache/doris/pull/15437" target="_blank" rel="noopener noreferrer">#15437</a> <a href="https://github.com/apache/doris/pull/12872" target="_blank" rel="noopener noreferrer">#12872</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimize-the-import-and-query-performance-of-jsonb-type">Optimize the import and query performance of JSONB type<a href="#optimize-the-import-and-query-performance-of-jsonb-type" class="hash-link" aria-label="Optimize the import and query performance of JSONB type的直接链接" title="Optimize the import and query performance of JSONB type的直接链接"></a></h3><p>Optimize the import and query performance of JSONB type. 
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="stream-load-supports-quoted-csv-data">Stream load supports quoted csv data<a href="#stream-load-supports-quoted-csv-data" class="hash-link" aria-label="Stream load supports quoted csv data的直接链接" title="Stream load supports quoted csv data的直接链接"></a></h3><p>See the <code>trim_double_quotes</code> option in the Stream Load documentation: <a href="https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="broker-supports-tencent-cloud-chdfs-and-baidu-cloud-bos-afs">Broker supports Tencent Cloud CHDFS and Baidu Cloud BOS, AFS<a href="#broker-supports-tencent-cloud-chdfs-and-baidu-cloud-bos-afs" class="hash-link" aria-label="Broker supports Tencent Cloud CHDFS and Baidu Cloud BOS, AFS的直接链接" title="Broker supports Tencent Cloud CHDFS and Baidu Cloud BOS, AFS的直接链接"></a></h3><p>Data on CHDFS, BOS, and AFS can be accessed through Broker. <a href="https://github.com/apache/doris/pull/15297" target="_blank" rel="noopener noreferrer">#15297</a> <a href="https://github.com/apache/doris/pull/15448" target="_blank" rel="noopener noreferrer">#15448</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-function">New function<a href="#new-function" class="hash-link" aria-label="New function的直接链接" title="New function的直接链接"></a></h3><p>Added the function <code>substring_index</code>. <a href="https://github.com/apache/doris/pull/15373" target="_blank" rel="noopener noreferrer">#15373</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><ul><li><p>Fixed an issue where, in some cases, user permission information could be lost after upgrading from version 1.1 to version 1.2. <a href="https://github.com/apache/doris/pull/15144" target="_blank" rel="noopener noreferrer">#15144</a></p></li><li><p>Fixed wrong partition values when using the datev2/datetimev2 types for partitioning. <a href="https://github.com/apache/doris/pull/15094" target="_blank" rel="noopener noreferrer">#15094</a></p></li><li><p>Bug fixes for a large number of released features. For a complete list see: <a href="https://github.com/apache/doris/pulls?q=is%3Apr+label%3Adev%2F1.2.1-merged+is%3Aclosed" target="_blank" rel="noopener noreferrer">PR List</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-notices">Upgrade Notices<a href="#upgrade-notices" class="hash-link" aria-label="Upgrade Notices的直接链接" title="Upgrade Notices的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="known-issues">Known Issues<a href="#known-issues" class="hash-link" aria-label="Known Issues的直接链接" title="Known Issues的直接链接"></a></h3><ul><li>Do not use JDK11 as the runtime JDK for BE; it will cause BE to crash.</li><li>CSV reading performance has regressed in this version, which will affect CSV import and read efficiency.
We will fix it as soon as possible in the next patch release.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changed">Behavior Changed<a href="#behavior-changed" class="hash-link" aria-label="Behavior Changed的直接链接" title="Behavior Changed的直接链接"></a></h3><ul><li><p>The default value of the BE configuration item <code>high_priority_flush_thread_num_per_store</code> is changed from 1 to 6 to improve the write efficiency of Routine Load. (<a href="https://github.com/apache/doris/pull/14775" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/pull/14775</a>)</p></li><li><p>The default value of the FE configuration item <code>enable_new_load_scan_node</code> is changed to true. Import tasks will be performed using the new File Scan Node, with no impact on users. <a href="https://github.com/apache/doris/pull/14808" target="_blank" rel="noopener noreferrer">#14808</a></p></li><li><p>Removed the FE configuration item <code>enable_multi_catalog</code>. The Multi-Catalog function is enabled by default.</p></li><li><p>The vectorized execution engine is force-enabled by default. <a href="https://github.com/apache/doris/pull/15213" target="_blank" rel="noopener noreferrer">#15213</a></p></li></ul><p>The session variable <code>enable_vectorized_engine</code> no longer takes effect; the vectorized engine is enabled by default. To make the variable take effect again, set the FE configuration item <code>disable_enable_vectorized_engine</code> to false and restart FE.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks<a href="#big-thanks" class="hash-link" aria-label="Big Thanks的直接链接" title="Big Thanks的直接链接"></a></h2><p>Thanks to ALL who contributed to this release!</p><p>@adonis0147</p><p>@AshinGau</p><p>@BePPPower</p><p>@BiteTheDDDDt</p><p>@ByteYue</p><p>@caiconghui</p><p>@cambyzju</p><p>@chenlinzhong</p><p>@dataroaring</p><p>@Doris-Extras</p><p>@dutyu</p><p>@eldenmoon</p><p>@englefly</p><p>@freemandealer</p><p>@Gabriel39</p><p>@HappenLee</p><p>@Henry2SS</p><p>@hf200012</p><p>@jacktengg</p><p>@Jibing-Li</p><p>@Kikyou1997</p><p>@liaoxin01</p><p>@luozenglin</p><p>@morningman</p><p>@morrySnow</p><p>@mrhhsg</p><p>@nextdreamblue</p><p>@qidaye</p><p>@spaces-X</p><p>@starocean999</p><p>@wangshuo128</p><p>@weizuo93</p><p>@wsjz</p><p>@xiaokang</p><p>@xinyiZzz</p><p>@xutaoustc</p><p>@yangzhg</p><p>@yiguolei</p><p>@yixiutt</p><p>@Yulei-Yang</p><p>@yuxuan-luo</p><p>@zenoyang</p><p>@zhangstar333</p><p>@zhannngchen</p><p>@zhengshengjun</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[The efficiency of the data warehouse greatly improved in LY Digital]]></title>
<id>https://doris.apache.org/zh-CN/blog/LY</id>
<link href="https://doris.apache.org/zh-CN/blog/LY"/>
<updated>2022-12-19T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Established in 2015, LY Digital is a financial service platform for the tourism industry under LY.Com. In 2020, LY Digital introduced Apache Doris to build a data warehouse because of its rich data import methods, excellent parallel computing capabilities, and low maintenance costs. This article describes the evolution of the data warehouse in LY Digital and why we switched to Apache Doris.]]></summary>
<content type="html"><![CDATA[<blockquote><p>Guide: Established in 2015, LY Digital is a financial service platform for the tourism industry under LY.Com. In 2020, LY Digital introduced Apache Doris to build a data warehouse because of its rich data import methods, excellent parallel computing capabilities, and low maintenance costs. This article describes the evolution of the data warehouse in LY Digital and why we switched to Apache Doris. I hope you like it.</p></blockquote><blockquote><p>Author: Xing Wang, Lead Developer of LY Digital</p></blockquote><p><img loading="lazy" alt="kv" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-fb77e142257a98bea6656a33a626b310.png" width="900" height="383" class="img_ev3q"></p><h1>1. Background</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="11-about-ly-digital">1.1 About LY Digital<a href="#11-about-ly-digital" class="hash-link" aria-label="1.1 About LY Digital的直接链接" title="1.1 About LY Digital的直接链接"></a></h2><p>LY Digital is a tourism financial service platform under LY.Com. Formally established in 2015, LY Digital takes "Digital technology empowers the tourism industry" as its vision.
At present, LY Digital's business covers financial services, consumer finance, financial technology, and digital technology. So far, more than 10 million users across 76 cities have enjoyed our services.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="12-requirements-for-data-warehouse">1.2 Requirements for Data Warehouse<a href="#12-requirements-for-data-warehouse" class="hash-link" aria-label="1.2 Requirements for Data Warehouse的直接链接" title="1.2 Requirements for Data Warehouse的直接链接"></a></h2><ul><li>Dashboard: Needs dashboards for T+1 business reporting, etc.</li><li>Early Warning System: Needs risk control, capital anomaly monitoring, traffic monitoring, etc.</li><li>Business Analysis: Needs timely query analysis and ad-hoc data retrieval, etc.</li><li>Finance: Needs liquidation and payment reconciliation.</li></ul><h1>2. Previous Data Warehouse</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="21-architecture">2.1 Architecture<a href="#21-architecture" class="hash-link" aria-label="2.1 Architecture的直接链接" title="2.1 Architecture的直接链接"></a></h2><p><img loading="lazy" alt="page_1" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_1-42732f62f592f158a33670ae04987e75.png" width="1152" height="679" class="img_ev3q"></p><p>Our previous data warehouse adopted the combination of StreamSets and Apache Kudu, which was very popular in the past few years. In this architecture, Binlog is ingested into Apache Kudu in real time after passing through StreamSets, and is finally queried through Apache Impala and visualization tools.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="212-downside">2.1.2 Downside<a href="#212-downside" class="hash-link" aria-label="2.1.2 Downside的直接链接" title="2.1.2 Downside的直接链接"></a></h3><ul><li>The previous data warehouse has a complex structure consisting of many interacting components, which requires huge operation and maintenance costs.</li><li>Apache Kudu's performance on wide-table joins is not good.</li><li>SLA is not fully guaranteed because tenant isolation is not provided.</li><li>Although StreamSets is equipped with early warning capabilities, its job recovery capabilities are poor. When multiple tasks are configured, the JVM consumes a lot of resources, resulting in slow recovery.</li></ul><h1>3. New Data Warehouse</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="31-research-of-popular-data-warehouses">3.1 Research of Popular Data Warehouses<a href="#31-research-of-popular-data-warehouses" class="hash-link" aria-label="3.1 Research of Popular Data Warehouses的直接链接" title="3.1 Research of Popular Data Warehouses的直接链接"></a></h2><p>Due to these shortcomings, we had to give up the previous data warehouse. In 2020, we conducted in-depth research on the popular data warehouses in the market.</p><p>During the research, we focused on comparing ClickHouse and Apache Doris. ClickHouse has high CPU utilization, so it performs well in single-table queries, but it does not perform well in multi-table joins and high-QPS scenarios. On the other hand, Doris not only supports thousands of QPS per node; thanks to partitioning, it can also support high-concurrency queries at the level of 10,000 QPS. Moreover, horizontal scaling of ClickHouse is complex and cannot be done automatically at present.
Doris supports online dynamic scaling and can be expanded horizontally as the business grows.</p><p>In the research, Apache Doris stood out. Doris's high-concurrency query capability is very attractive, and its dynamic scaling capability suits our flexible advertising business. So we chose Apache Doris.</p><p><img loading="lazy" alt="page_2" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_2-414a885ce6917a5bfddb76d64d882ea4.png" width="1145" height="676" class="img_ev3q"></p><p>After introducing Apache Doris, we upgraded the entire data warehouse:</p><ul><li>We collect MySQL Binlog through Canal and ingest it into Kafka. Because Apache Doris is highly compatible with Kafka, we can easily use Routine Load to load and import data (see the sketch after this list).</li><li>We have made minor adjustments to batch processing. For data stored in Hive, Apache Doris can ingest it through Broker Load. In this way, batch data can be directly ingested into Doris.</li></ul>
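<p>For illustration, here is a minimal sketch of such a Routine Load job; the database, table, column, and Kafka names are hypothetical, not our production configuration:</p><pre><code>-- Continuously consume a Kafka topic (fed by Canal) into a Doris table.
CREATE ROUTINE LOAD finance_db.orders_binlog_job ON dwd_orders
COLUMNS TERMINATED BY ",",
COLUMNS(order_id, user_id, amount, update_time)
PROPERTIES
(
    "desired_concurrent_number" = "3",
    "max_batch_interval" = "20"
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "orders_binlog",
    "property.group.id" = "doris_routine_load"
);
</code></pre>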
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="32-why-we-choose-doris">3.2 Why We Choose Doris<a href="#32-why-we-choose-doris" class="hash-link" aria-label="3.2 Why We Choose Doris的直接链接" title="3.2 Why We Choose Doris的直接链接"></a></h2><p><img loading="lazy" alt="page_3" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_3-ec6524eea65a399078e60bff590cb3ab.png" width="1137" height="676" class="img_ev3q"></p><p>The overall performance of Apache Doris is impressive:</p><ul><li>Data access: It provides rich data import methods and can support access to many types of data sources;</li><li>Data connection: Doris supports JDBC and ODBC connections and can easily connect with BI tools. In addition, Doris uses the MySQL protocol for communication, so users can directly access Doris through various client tools;</li><li>SQL syntax: Doris adopts the MySQL protocol, is highly compatible with MySQL syntax, supports standard SQL, and has a low learning cost for developers;</li><li>MPP parallel computing: Doris provides excellent parallel computing capabilities and has obvious advantages in complex joins and wide-table joins;</li><li>Complete documentation: Doris's official documentation is thorough, which is friendly for new users.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="33--architecture-of-real-time-processing">3.3 Architecture of Real-time Processing<a href="#33--architecture-of-real-time-processing" class="hash-link" aria-label="3.3 Architecture of Real-time Processing的直接链接" title="3.3 Architecture of Real-time Processing的直接链接"></a></h2><p><img loading="lazy" alt="page_4" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_4-b6f04242c2a85d92cfd1814319127b20.png" width="1132" height="668" class="img_ev3q"></p><ul><li>Data source: In real-time processing, data sources come from business branches such as industrial finance, consumer finance, and risk control. They are all collected through Canal and API.</li><li>Data collection: After data collection through Canal-Admin, Canal sends the data to the Kafka message queue. After that, the data is ingested into Doris through Routine Load.</li><li>Inside Doris: The Doris cluster constitutes a three-level data warehouse, namely: the DWD layer with the Unique model, the DWS layer with the Aggregation model, and the ADS application layer.</li><li>Data application: The data is applied in three aspects: real-time dashboards, data timeliness analysis, and data services.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="34-new-features">3.4 New Features<a href="#34-new-features" class="hash-link" aria-label="3.4 New Features的直接链接" title="3.4 New Features的直接链接"></a></h2><p>The data import method is simple; we adopt 3 different import methods according to the scenario:</p><ul><li>Routine Load: When we submit a Routine Load task, a process within Doris consumes Kafka in real time, continuously reading data from Kafka and ingesting it into Doris.</li><li>Broker Load: Offline data such as dim-tables and historical data are ingested into Doris in an orderly manner.</li><li>Insert Into: Used for batch processing tasks; Insert Into is responsible for processing data in the DWD layer.</li></ul><p>Doris's data models improve our development efficiency:</p><ul><li>The Unique model is used when accessing the DWD layer, which can effectively prevent repeated consumption of data.</li><li>Doris supports 4 aggregation types, such as Sum, Replace, Min, and Max. This eliminates a large amount of SQL code, as we no longer have to write Sum, Min, Max, and similar logic manually.</li></ul><p>Doris queries are efficient:</p><ul><li>It supports materialized views and the Rollup materialized index. The bottom layer of the materialized view is similar to the concept of a Cube and precomputation: as a way of exchanging space for time, special tables are generated at the bottom layer, and in a query, the materialized view maps to those tables and responds quickly.</li></ul><h1>4. Benefits of the New Data Warehouse</h1><ul><li>Data access: In the previous architecture, Kudu tables had to be created manually during imports through StreamSets. Lacking tooling, the entire process of creating tables and tasks took 20-30 minutes. Nowadays, fast data access can be realized through the platform. The access process for each table has been shortened from 20-30 minutes to 3-5 minutes, which is to say the efficiency has been improved by 5-6 times.</li><li>Data development: After adopting Doris, we can directly use the data models, such as Unique and Aggregation. The Duplicate model supports logs well, greatly speeding up ETL development.</li><li>Query analysis: The bottom layer of Doris has functions such as materialized views and the Rollup materialized index. Moreover, Doris has made many optimizations for wide-table associations, such as Runtime Filter and other join optimizations. Compared with Doris, Apache Kudu requires more complex optimization to be used well.</li><li>Data reports: It took 1-2 minutes to complete rendering when we queried with Kudu before, but Doris responds in seconds or even milliseconds.</li><li>Easy maintenance: Doris is not as complex as Hadoop. In March, our IDC was relocated, and all 12 Doris virtual machines were migrated within three days. The overall operation is relatively simple. Apart from physically moving the machines, FE scaling only requires simple commands such as Add and Drop (see the sketch after this list), which do not take long to run.</li></ul>
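<p>A minimal sketch of what such scaling commands look like; the host names below are illustrative and the ports are common defaults, not our actual topology:</p><pre><code>-- Scale out: register a new BE node by its heartbeat port.
ALTER SYSTEM ADD BACKEND "be_host_13:9050";

-- Scale in gracefully: migrate data away before removing the node.
ALTER SYSTEM DECOMMISSION BACKEND "be_host_01:9050";

-- FE nodes are added and dropped similarly, by edit-log port.
ALTER SYSTEM ADD FOLLOWER "fe_host_02:9010";
ALTER SYSTEM DROP FOLLOWER "fe_host_02:9010";
</code></pre>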
<h1>5. Look Ahead</h1><ul><li>Realize data access based on Flink CDC: At present, we have not introduced Flink CDC; instead, data goes through Canal to Kafka. Development could be even more efficient with Flink CDC, but it still requires writing a certain amount of code, which is not friendly for data analysts to use directly. We hope that data analysts only need to write simple SQL or operate directly, so we plan to introduce Flink CDC in the future.</li><li>Keep up with the latest release: The latest version, Apache Doris V1.2.0, has made great achievements in vectorization, multi-catalog, and light schema change. We will keep up with the community to upgrade the cluster and make full use of the new features.</li><li>Strengthen the construction of related systems: Our current metric system management, such as report metadata and business metadata, still needs to be improved. Although we have data quality monitoring functions, automation still needs to be strengthened and improved.</li></ul>]]></content>
<author>
<name>Xing Wang</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.1.5]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.1.5</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.1.5"/>
<updated>2022-12-19T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, the Apache Doris team has made about 36 bug fixes and performance improvements in version 1.1.5 compared to the previous version.]]></summary>
<content type="html"><![CDATA[<p>In this release, the Doris team has made about 36 bug fixes and performance improvements since 1.1.4. This is a bugfix release on the 1.1 branch, and all users are encouraged to upgrade to it.</p><h1>Behavior Changes</h1><p>When an alias is the same as the original column name, such as <code>select year(birthday) as birthday</code>, and is used in a GROUP BY, ORDER BY, or HAVING clause, Doris's behavior used to differ from MySQL's. In this release, we make it follow MySQL's behavior: GROUP BY and HAVING use the original column first, while ORDER BY uses the alias first. This may be a little confusing, so a simple piece of advice: better not to use an alias that is the same as the original column name.</p><h1>Features</h1><p>Add support for <code>murmur_hash3_64</code>. <a href="https://github.com/apache/doris/pull/14636" target="_blank" rel="noopener noreferrer">#14636</a></p><h1>Improvements</h1><p>Add a timezone cache for <code>convert_tz</code> to improve performance. <a href="https://github.com/apache/doris/pull/14616" target="_blank" rel="noopener noreferrer">#14616</a></p><p>Sort results by table name in SHOW statements. <a href="https://github.com/apache/doris/pull/14492" target="_blank" rel="noopener noreferrer">#14492</a></p><h1>Bug Fix</h1><p>Fix a coredump when there is an <code>if</code> constant expression in the SELECT clause. <a href="https://github.com/apache/doris/pull/14858" target="_blank" rel="noopener noreferrer">#14858</a></p><p>Fix a crash in <code>ColumnVector::insert_date_column</code>. <a href="https://github.com/apache/doris/pull/14839" target="_blank" rel="noopener noreferrer">#14839</a></p><p>Update the default value of <code>high_priority_flush_thread_num_per_store</code> to 6, which improves load performance. <a href="https://github.com/apache/doris/pull/14775" target="_blank" rel="noopener noreferrer">#14775</a></p><p>Fix a quick compaction core dump. <a href="https://github.com/apache/doris/pull/14731" target="_blank" rel="noopener noreferrer">#14731</a></p><p>Fix an IndexOutOfBounds error thrown by Spark Load when the partition column is not a duplicate key. <a href="https://github.com/apache/doris/pull/14661" target="_blank" rel="noopener noreferrer">#14661</a></p><p>Fix a memory leak in VCollectorIterator. <a href="https://github.com/apache/doris/pull/14549" target="_blank" rel="noopener noreferrer">#14549</a></p><p>Fix CREATE TABLE LIKE for tables with a sequence column. <a href="https://github.com/apache/doris/pull/14511" target="_blank" rel="noopener noreferrer">#14511</a></p><p>Use the average rowset size instead of total_bytes to calculate batch size, since the latter costs a lot of CPU. <a href="https://github.com/apache/doris/pull/14273" target="_blank" rel="noopener noreferrer">#14273</a></p><p>Fix a right outer join core dump with conjuncts. <a href="https://github.com/apache/doris/pull/14821" target="_blank" rel="noopener noreferrer">#14821</a></p><p>Optimize the tcmalloc GC policy. <a href="https://github.com/apache/doris/pull/14777" target="_blank" rel="noopener noreferrer">#14777</a> <a href="https://github.com/apache/doris/pull/14738" target="_blank" rel="noopener noreferrer">#14738</a> <a href="https://github.com/apache/doris/pull/14374" target="_blank" rel="noopener noreferrer">#14374</a></p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Best practice in Kwai: Apache Doris on Elasticsearch]]></title>
<id>https://doris.apache.org/zh-CN/blog/BestPractice_Kwai</id>
<link href="https://doris.apache.org/zh-CN/blog/BestPractice_Kwai"/>
<updated>2022-12-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article mainly focuses on the practice of Apache Doris on Elasticsearch (DOE) in Kwai's business. Kwai's commercial report engine provides advertisers with real-time query services for multi-dimensional analysis reports, and it also provides such query services to internal users. The engine is committed to dealing with high-performance, high-concurrency, and high-stability query problems in multi-dimensional analysis report cases. After adopting Doris, querying became simple: we only need to synchronize the fact table and dim-table on a daily basis and join them while querying. By replacing Druid and ClickHouse with Doris, Doris basically covers all the scenarios where we used Druid. In this way, Kwai's commercial report engine greatly improves its aggregation and analysis capabilities over massive data. During our use of Apache Doris, we also found some unexpected benefits: for example, the Routine Load and Broker Load import methods are relatively simple, which improves the query speed; the storage footprint is greatly reduced; and Doris supports the MySQL protocol, which makes it much easier for data analysts to fetch data and make charts.]]></summary>
<content type="html"><![CDATA[<blockquote><p>Author: Xiang He, Head Developer of Big Data, Commercialization Team of Kwai</p></blockquote><p><img loading="lazy" alt="kv" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-846e4e39fd88e1e34d2474b23690d9b2.png" width="900" height="383" class="img_ev3q"></p><h1>1 About Kwai</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="11-kwai">1.1 Kwai<a href="#11-kwai" class="hash-link" aria-label="1.1 Kwai的直接链接" title="1.1 Kwai的直接链接"></a></h2><p>Kwai (HKG:1024) is a social network for short videos and trends. Discover funny short videos, contribute to the virtual community with recordings and videos of your life, take on daily challenges, or like the best memes and videos. Share your life with short videos and choose from dozens of magical effects and filters for them.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="12-kwais-commercial-report-engine">1.2 Kwai's Commercial Report Engine<a href="#12-kwais-commercial-report-engine" class="hash-link" aria-label="1.2 Kwai's Commercial Report Engine的直接链接" title="1.2 Kwai's Commercial Report Engine的直接链接"></a></h2><p>Kwai's commercial report engine provides advertisers with real-time query services for multi-dimensional analysis reports, and it also provides such query services to internal users. The engine is committed to dealing with high-performance, high-concurrency, and high-stability query problems in multi-dimensional analysis report cases.</p><h1>2 Previous Architecture</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="21-background">2.1 Background<a href="#21-background" class="hash-link" aria-label="2.1 Background的直接链接" title="2.1 Background的直接链接"></a></h2><p>Traditional OLAP engines deal with multi-dimensional analysis in a pre-modeled way, building a data cube (Cube) to perform operations such as drill-down, roll-up, slice and dice, and pivot. Modern OLAP analysis introduces the idea of a relational model, representing data in two-dimensional relational tables. In the modeling process, there are usually two modeling methods: one is to ingest the data of multiple tables into one wide table through joins; the other is to use a star schema, dividing the data into a fact table and dim-tables and then joining them when querying.
Both options have pros and cons:</p><p>Wide table:</p><p>This takes the idea of exchanging space for time. Dimension values are filled in via the dim-table's primary key (a unique ID), and dimension data is stored redundantly. The advantage is that queries are convenient, with no need to join additional dim-tables. The disadvantage is that if dimension data changes, the entire table needs to be refreshed, which is bad for high-frequency updates.</p><p>Star Schema:</p><p>Dimension data is completely separated from fact data. Dimension data is often stored in a dedicated engine (such as MySQL, Elasticsearch, etc.). When querying, dimension data is associated via the primary key. The advantage is that changes in dimension data do not affect fact data, which supports high-frequency update operations. The disadvantage is that the query logic is relatively more complex, and multi-table joins may lead to performance loss.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="22-requirement-for-an-olap-engine">2.2 Requirement for an OLAP Engine<a href="#22-requirement-for-an-olap-engine" class="hash-link" aria-label="2.2 Requirement for an OLAP Engine的直接链接" title="2.2 Requirement for an OLAP Engine的直接链接"></a></h2><p>In Kwai's business, the commercial report engine supports real-time queries of advertising effectiveness for advertisers. When building the report engine, we expected to meet the following requirements:</p><ul><li>Massive data: the raw data of a single table grows by ten billion rows every day</li><li>High query QPS: thousand-level QPS on average</li><li>High stability requirements: an SLA level of 99.9999%</li></ul><p>Most importantly, due to frequent changes in dimension data, dim-tables need to support update operations at up to thousand-level QPS and further support requirements such as fuzzy matching and word segmentation retrieval.
Based on the above requirements, we chose the star schema and built a report engine architecture on Apache Druid and Elasticsearch.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="23-previous-architecture-based-on-apache-druid">2.3 Previous Architecture: Based on Apache Druid<a href="#23-previous-architecture-based-on-apache-druid" class="hash-link" aria-label="2.3 Previous Architecture: Based on Apache Druid的直接链接" title="2.3 Previous Architecture: Based on Apache Druid的直接链接"></a></h2><p>We chose the combination of Elasticsearch and Apache Druid. For data import, data is pre-aggregated at the minute level via Flink and at the hour level via Kafka. For data queries, the application initiates a query request through the REFront API, and REQuery then queries the dim-table engines (Elasticsearch and MySQL) and the fact data engine respectively.</p><p>Druid is a time-series query engine that supports real-time data ingestion and is used to store and query large amounts of fact data. We adopted Elasticsearch based on these considerations:</p><ul><li>High update frequency: QPS is around 1000</li><li>Support for word segmentation and fuzzy search, which suits Kwai</li><li>Support for large dim-table data volumes, which can be served directly without the database and table sharding a MySQL database would require</li><li>Support for data synchronization monitoring, with check and recovery services as well</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="24-engine-of-the-reports">2.4 Engine of the Reports<a href="#24-engine-of-the-reports" class="hash-link" aria-label="2.4 Engine of the Reports的直接链接" title="2.4 Engine of the Reports的直接链接"></a></h2><p>The report engine can be divided into two layers: REFront and REQuery. REMeta is an independent metadata management module. The report engine implements MemJoin inside REQuery, which supports associative queries between fact data in Druid and dimension data in Elasticsearch. It also provides virtual cube queries for upper-layer business, avoiding exposing complex cross-engine management and query logic.</p><p><img loading="lazy" alt="page_1" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_1-9e4af3275a17b4c1c893caa7c6f7290b.png" width="709" height="698" class="img_ev3q"></p><h1>3 New Architecture Based on Apache Doris</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="31-problems-remained">3.1 Problems Remained<a href="#31-problems-remained" class="hash-link" aria-label="3.1 Problems Remained的直接链接" title="3.1 Problems Remained的直接链接"></a></h2><p>First, we came across a problem when we built the report engine. MemJoin runs on a single machine with serial execution. When the amount of data pulled from Elasticsearch exceeds 100,000 rows in a single query, the response time is close to 10s, and the user experience is poor. Moreover, using a single node to execute large-scale joins consumes a lot of memory, causing Full GC.</p><p>Second, Druid's Lookup Join function is imperfect, which is a big problem, and it cannot fully meet our business needs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="32-database-research">3.2 Database Research<a href="#32-database-research" class="hash-link" aria-label="3.2 Database Research的直接链接" title="3.2 Database Research的直接链接"></a></h2><p>So we conducted a survey of popular OLAP databases in the industry, the most representative of which are Apache Doris and ClickHouse. We found that Apache Doris is more capable of joins between large and wide tables.
ClickHouse can support broadcast memory-based joins, but the performance is not good for joins between large, wide tables with large data volumes. Both Doris and ClickHouse support detailed data storage, but ClickHouse's concurrency capability is low. On the contrary, Doris supports high-concurrency, low-latency query services, and a single machine supports up to thousands of QPS; when concurrency increases, horizontal expansion of FE and BE is supported. Furthermore, ClickHouse's data import does not support transactions, so it cannot realize Exactly-Once semantics, and its support for standard SQL is limited. In contrast, Doris provides transactional, atomic data import, and Doris itself can ensure that messages in Kafka are neither lost nor consumed twice; that is, Exactly-Once semantics is supported. ClickHouse has a high learning cost, high operation and maintenance costs, and weak distributed capabilities; the fact that it requires more customization and deeper technical strength is another problem. Doris is different: there are only two core components, FE and BE, and few external dependencies. We also found that because Doris is closer to the MySQL protocol, it is more convenient than ClickHouse, and the cost of migration is not large. In terms of horizontal expansion, Doris's scaling out and in can also achieve self-balancing, which is much better than ClickHouse.</p><p>From this point of view, Doris can better improve Join performance and is much better in other aspects such as migration cost, horizontal expansion, and concurrency. However, Elasticsearch has inherent advantages in high-frequency updates.</p><p>It would be an ideal solution to deal with high-frequency updates and Join performance at the same time by building the engine on Doris on Elasticsearch.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="33-good-choice-doris-on-elasticsearch">3.3 Good Choice: Doris on Elasticsearch<a href="#33-good-choice-doris-on-elasticsearch" class="hash-link" aria-label="3.3 Good Choice: Doris on Elasticsearch的直接链接" title="3.3 Good Choice: Doris on Elasticsearch的直接链接"></a></h2><p>What is the query performance of Doris on Elasticsearch?</p><p>First of all, Apache Doris is a real-time analytical database based on an MPP architecture, with strong performance and strong horizontal expansion capability. Doris on Elasticsearch takes advantage of this capability and does a lot of query optimization.
Secondly, after integrating Elasticsearch, we also made a lot of query optimizations:</p><ul><li>Shard-level concurrency</li><li>Automatic adaptation of row and column scanning, with priority given to column scanning</li><li>Sequential reads with early termination</li><li>Two-phase queries become one-phase queries</li><li>Broadcast Join, which is especially friendly to small batches of data</li></ul><p><img loading="lazy" alt="page_2" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_2-a916fe2ffe5eeae0b166d30cfe8d8e42.png" width="890" height="1032" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="34-doris-on-elasticsearch">3.4 Doris on Elasticsearch<a href="#34-doris-on-elasticsearch" class="hash-link" aria-label="3.4 Doris on Elasticsearch的直接链接" title="3.4 Doris on Elasticsearch的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="341-data-link-upgrade">3.4.1 Data Link Upgrade<a href="#341-data-link-upgrade" class="hash-link" aria-label="3.4.1 Data Link Upgrade的直接链接" title="3.4.1 Data Link Upgrade的直接链接"></a></h3><p>The upgrade of the data link is relatively simple. In the first step, we build a new OLAP table in Doris and configure the materialized view. Second, a Routine Load is initiated on the Kafka topic of the existing fact data to ingest real-time data. The third step is to ingest offline data from Hive through Broker Load. The last step is to create an Elasticsearch external table through Doris.</p><p><img loading="lazy" alt="page_3" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_3-2f23fe1184980f690da326e4446fd7f7.png" width="1313" height="1265" class="img_ev3q"></p>
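<p>For reference, the last step looks roughly like the following sketch; the hosts, index, and columns are hypothetical placeholders rather than our production settings:</p><pre><code>-- Map an existing Elasticsearch index as a Doris external dim-table.
CREATE EXTERNAL TABLE dim_advertiser (
    advertiser_id BIGINT,
    advertiser_name VARCHAR(256)
)
ENGINE = ELASTICSEARCH
PROPERTIES (
    "hosts" = "http://es_host_1:9200,http://es_host_2:9200",
    "index" = "dim_advertiser",
    "type" = "_doc",
    "user" = "",
    "password" = ""
);
</code></pre><p>Once mapped, the dim-table can be joined with Doris fact tables in ordinary SQL.</p>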
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="342-upgrades-of-the-report-engine">3.4.2 Upgrades of the Report Engine<a href="#342-upgrades-of-the-report-engine" class="hash-link" aria-label="3.4.2 Upgrades of the Report Engine的直接链接" title="3.4.2 Upgrades of the Report Engine的直接链接"></a></h3><p><img loading="lazy" alt="page_4" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_4-f9c9b95ac997f1d8f09fb5fe182c368f.png" width="1274" height="895" class="img_ev3q"></p><p>Note: The MySQL dim-table associated above is part of future planning. Currently, Elasticsearch is mainly used as the dim-table engine.</p><p>Report Engine Adaptation:</p><ul><li>Generate a virtual cube table based on Doris's star schema</li><li>Adapt cube table query analysis with intelligent push-down</li><li>Gray release</li></ul><h1>4 Online Performance</h1><h2 class="anchor anchorWithStickyNavbar_LWe7" id="41-fact-table-query-performance-comparison">4.1 Fact Table Query Performance Comparison<a href="#41-fact-table-query-performance-comparison" class="hash-link" aria-label="4.1 Fact Table Query Performance Comparison的直接链接" title="4.1 Fact Table Query Performance Comparison的直接链接"></a></h2><p>Druid</p><p><img loading="lazy" alt="page_5" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_5-8e598f4abd11de7482c1a9dcc0747641.png" width="935" height="276" class="img_ev3q"></p><p>Doris</p><p><img loading="lazy" alt="page_6" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_6-7747547b14b4dbce6b2ee99fde03ab16.png" width="959" height="291" class="img_ev3q"></p><p>99th percentile of response time: Druid 270 ms vs. Doris 150 ms, a 45% reduction.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="42-comparison-of-cube-table-query-performance-in-join">4.2 Comparison of Cube Table Query Performance in Join<a href="#42-comparison-of-cube-table-query-performance-in-join" class="hash-link" aria-label="4.2 Comparison of Cube Table Query Performance in Join的直接链接" title="4.2 Comparison of Cube Table Query Performance in Join的直接链接"></a></h2><p>Druid</p><p><img loading="lazy" alt="page_7" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_7-46c2a88aabf031ee764884d78837880f.png" width="987" height="316" class="img_ev3q"></p><p>Doris</p><p><img loading="lazy" alt="page_8" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_8-cc75cc3a5ced01182cac415175d4048a.png" width="950" height="291" class="img_ev3q"></p><p>99th percentile of response time: Druid 660 ms vs. Doris 440 ms, a 33% reduction.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="43-benefits">4.3 Benefits<a href="#43-benefits" class="hash-link" aria-label="4.3 Benefits的直接链接" title="4.3 Benefits的直接链接"></a></h2><ul><li>The overall 99th-percentile latency is reduced by about 35%</li><li>Resource savings of about 50%</li><li>The complex MemJoin logic is removed from the report engine and realized through DOE (for large queries where dim-table results exceed 100,000 rows, performance improves more than 10 times, from 10s to 1s)</li><li>Richer query semantics (MemJoin was relatively simple and did not support complex queries)</li></ul><h1>5 Summary and Plans</h1><p>In Kwai's commercial business, join queries between dimension data and fact data are very common. After adopting Doris, querying became simple. We only need to synchronize the fact table and dim-table on a daily basis and join them while querying. By replacing Druid and ClickHouse with Doris, Doris basically covers all the scenarios where we used Druid. In this way, Kwai's commercial report engine greatly improves its aggregation and analysis capabilities over massive data. During our use of Apache Doris, we also found some unexpected benefits: for example, the Routine Load and Broker Load import methods are relatively simple, which improves the query speed; the storage footprint is greatly reduced; and Doris supports the MySQL protocol, which makes it much easier for data analysts to fetch data and make charts.</p><p>Although Doris on Elasticsearch has fully met our requirements, Elasticsearch external tables still require manual creation. However, Apache Doris recently released the latest version, V1.2.0. The new version adds Multi-Catalog, which provides the ability to seamlessly access external sources such as Hive, Elasticsearch, Hudi, and Iceberg. Users can connect to external sources through the CREATE CATALOG command, and Doris will automatically map the database and table information of the external source. In this way, we will no longer need to manually create Elasticsearch external tables to complete the mapping, which greatly saves development time and cost and improves R&amp;D efficiency. The power of other new functions such as vectorization and Light Schema Change also gives us new expectations for Apache Doris. Bless Apache Doris!</p><h1>Contact Us</h1><p>Apache Doris Website:<a href="http://doris.apache.org" target="_blank" rel="noopener noreferrer">http://doris.apache.org</a></p><p>Github:<a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><p>Dev Email:<a href="mailto:dev@doris.apache.org" target="_blank" rel="noopener noreferrer">dev@doris.apache.org</a></p>]]></content>
<author>
<name>Xiang He</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Practice and optimization of Apache Doris in Xiaomi]]></title>
<id>https://doris.apache.org/zh-CN/blog/xiaomi_vector</id>
<link href="https://doris.apache.org/zh-CN/blog/xiaomi_vector"/>
<updated>2022-12-08T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Xiaomi Group introduced Apache Doris in 2019. At present, Apache Doris is widely used in dozens of business departments within Xiaomi, and a data ecosystem has been formed around it. This article is transcribed from an online meetup speech of the Doris community, aiming to share the practice of Apache Doris in Xiaomi.]]></summary>
<content type="html"><![CDATA[<blockquote><p>Guide: Xiaomi Group introduced Apache Doris in 2019. At present, Apache Doris is widely used in dozens of business departments within Xiaomi, and a data ecosystem has been formed around it. This article is transcribed from an online meetup speech of the Doris community, aiming to share the practice of Apache Doris in Xiaomi.</p></blockquote><blockquote><p>Author: ZuoWei, OLAP Engineer, Xiaomi</p></blockquote><p><img loading="lazy" alt="kv" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-b27d71e34981d9850785329cea2cb610.png" width="900" height="383" class="img_ev3q"></p><h1>About Xiaomi</h1><p><a href="https://www.mi.com/global" target="_blank" rel="noopener noreferrer">Xiaomi Corporation</a> (“Xiaomi” or the “Group”; HKG:1810) is a consumer electronics and smart manufacturing company with smartphones and smart hardware connected by an Internet of Things (IoT) platform. In 2021, Xiaomi's total revenue amounted to RMB328.3 billion (approx. USD47.2 billion), an increase of 33.5% year-over-year; adjusted net profit was RMB22.0 billion (approx. USD3.2 billion), an increase of 69.5% year-over-year.</p><p>Due to the growing need for data analysis, Xiaomi Group introduced Apache Doris in 2019. As one of the earliest users of Apache Doris, Xiaomi Group has been deeply involved in the open-source community. After three years of development, Apache Doris has been widely used in dozens of business departments within Xiaomi, such as Advertising, New Retail, Growth Analysis, Dashboards, UserPortraits, <a href="https://airstar.com/home" target="_blank" rel="noopener noreferrer">AISTAR</a>, and <a href="https://www.xiaomiyoupin.com" target="_blank" rel="noopener noreferrer">Xiaomi Youpin</a>. Within Xiaomi, a data ecosystem has been built around Apache Doris.</p><p><img loading="lazy" alt="page_1" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_1-93afbd2f90769776af3083bc49fbf8dd.jpg" width="1135" height="661" class="img_ev3q"></p><p>At present, Apache Doris already has dozens of clusters in Xiaomi, with an overall scale of hundreds of virtual machines. Among them, the largest single cluster reaches nearly 100 nodes, with dozens of real-time data synchronization tasks. The largest daily increment of a single table rockets to 12 billion rows, supporting PB-level storage, and a single cluster can support more than 20,000 multi-dimensional analysis queries per day.</p><h1>Architecture Evolution</h1><p>The original intention of Xiaomi in introducing Apache Doris was to solve the problems encountered in user behavior analysis. With the development of Xiaomi's Internet business, the demand for growth analysis based on user behavior data became stronger and stronger. If each business branch built its own growth analysis system, it would be not only costly but also inefficient. Therefore, a product that frees business personnel from complex underlying technical details and lets them focus on their own work would greatly improve efficiency.
Therefore, Xiaomi Big Data and the cloud platform jointly developed a growth analysis system called Growing Analytics (GA for short), which aims to provide a flexible, multi-dimensional, real-time query and analysis platform that manages data access and query solutions in a unified way and helps business branches refine their operations.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="previous-architecture">Previous Architecture<a href="#previous-architecture" class="hash-link" aria-label="Previous Architecture的直接链接" title="Previous Architecture的直接链接"></a></h2><p>The growth analysis platform project was established in mid-2018. At that time, based on considerations of development time and cost, Xiaomi reused various existing big data basic components (HDFS, Kudu, SparkSQL, etc.) to build a growth analysis query system based on a Lambda architecture. The architecture of the first version of the GA system is shown in the figure below and includes the following aspects:</p><ul><li>Data Source: The data source is the front-end event tracking data and user behavior data.</li><li>Data Access: The event tracking data is uniformly cleaned and ingested into Xiaomi's internal self-developed message queue, and the data is imported into Kudu through Spark Streaming.</li><li>Storage: Hot and cold data are separated in the storage layer. Hot data is stored in Kudu, and cold data is stored in HDFS. Partitioning is carried out in the storage layer with day as the partition unit, and part of the data is cooled down and moved to HDFS every night.</li><li>Compute and Query: In the query layer, SparkSQL performs federated queries over the data on Kudu and HDFS, and the query results are finally displayed on the front-end page.</li></ul><p><img loading="lazy" alt="page_2" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_2-db57a1a2eadb0f1c787f440a26358339.jpg" width="1159" height="683" class="img_ev3q"></p><p>At that time, the first version of the growth analysis platform helped us solve a series of problems in user operations, but there were also two problems:</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="problem-no1-scattered-components">Problem No.1: Scattered components<a href="#problem-no1-scattered-components" class="hash-link" aria-label="Problem No.1: Scattered components的直接链接" title="Problem No.1: Scattered components的直接链接"></a></h3><p>Since the historical architecture is based on the combination of SparkSQL + Kudu + HDFS, the many dependent components lead to high operation and maintenance costs. The original design was for each component to use the resources of a public cluster, but in practice we found that during query execution, query performance is easily affected by other jobs in the public cluster, and query jitter is prone to occur, especially when reading data from the public HDFS cluster, which is sometimes slow.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="problem-no2-high-resource-consumption">Problem No.2: High resource consumption<a href="#problem-no2-high-resource-consumption" class="hash-link" aria-label="Problem No.2: High resource consumption的直接链接" title="Problem No.2: High resource consumption的直接链接"></a></h3><p>When querying through SparkSQL, the latency is relatively high. SparkSQL is a query engine designed for batch processing; during the shuffle between stages, data still needs to be spilled to disk, so the latency of completing a SQL query is relatively high.
To ensure that SQL queries were not affected by resource contention, we maintained query performance by adding machines. However, in practice we found that there was limited room for performance improvement; this solution could not make full use of machine resources to achieve efficient queries, resulting in a certain waste of resources.</p><p>In response to the above two problems, our goal was to find an MPP database that integrates computing and storage to replace our current storage and computing components. After technical selection, we finally decided to use Apache Doris to replace the older generation of our historical architecture.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-choice">New Choice<a href="#new-choice" class="hash-link" aria-label="New Choice的直接链接" title="New Choice的直接链接"></a></h2><p>Popular MPP-based query engines such as Impala and Presto can efficiently support SQL queries, but they still need to rely on Kudu, HDFS, Hive Metastore, and other storage systems, which increases operation and maintenance costs. At the same time, due to the separation of storage and compute, the query engine cannot easily detect data changes in the storage layer, which limits fine-grained query optimization: if you cache at the SQL layer, you cannot guarantee that the query results are up to date.</p><p>Apache Doris is a top-level project of the Apache Foundation. It is mainly positioned as a high-performance, real-time analytical database, used mainly for reporting and multi-dimensional analysis. It integrates technologies from Google Mesa and Cloudera Impala. We conducted in-depth performance tests on Doris and communicated with the community many times, and we finally determined to replace the previous computing and storage components with Doris.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="new-architecture-based-on-apache-doris">New Architecture Based on Apache Doris<a href="#new-architecture-based-on-apache-doris" class="hash-link" aria-label="New Architecture Based on Apache Doris的直接链接" title="New Architecture Based on Apache Doris的直接链接"></a></h2><p>The new architecture obtains event tracking data from the data source, and the data is then ingested into Apache Doris. Query results can be directly displayed in the applications. In this way, Doris truly unifies computing, storage, and resource management.</p><p><img loading="lazy" alt="page_3" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_3-30c8cb46f4d289fa768e9a364779bc69.jpg" width="1149" height="674" class="img_ev3q"></p><p>We chose Doris because:</p><ul><li>Doris has excellent query performance and can meet our business needs.</li><li>Doris supports standard SQL, and the learning cost is low.</li><li>Doris does not depend on other external components and is easy to operate and maintain.</li><li>The Apache Doris community is very active and friendly, crowded with contributors.
This makes future version upgrades easier and maintenance more convenient.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="query-performance-comparision-between-apache-doris--spark-sql">Query Performance Comparison between Apache Doris &amp; Spark SQL<a href="#query-performance-comparision-between-apache-doris--spark-sql" class="hash-link" aria-label="Query Performance Comparision between Apache Doris &amp; Spark SQL的直接链接" title="Query Performance Comparision between Apache Doris &amp; Spark SQL的直接链接"></a></h2><p>Note: The comparison is based on Apache Doris V0.13</p><p><img loading="lazy" alt="page_4" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_4-3e71f2a8753e49f5a73bea4bb628fbbf.jpg" width="1242" height="1000" class="img_ev3q"></p><p>We selected a business model with an average daily data volume of about 1 billion rows and conducted performance tests on Doris in different scenarios, including 6 event analysis scenarios, 3 retention analysis scenarios, and 3 funnel analysis scenarios. After comparing it with the previous architecture (SparkSQL + Kudu + HDFS), we found:</p><ul><li>In the event analysis scenarios, the average query time was reduced by 85%.</li><li>In the retention analysis and funnel analysis scenarios, the average query time was reduced by 50%.</li></ul><h1>Real Practice</h1><p>Below we introduce our experience with data import, data query, and A/B testing in the business application of Apache Doris.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-import">Data Import<a href="#data-import" class="hash-link" aria-label="Data Import的直接链接" title="Data Import的直接链接"></a></h2><p><img loading="lazy" alt="page_5" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_5-010f8edce4b736817d68815f31e52fd7.jpg" width="1130" height="667" class="img_ev3q"></p><p>Xiaomi writes data into Doris mainly through Stream Load and Broker Load, with a small amount of data written by Insert. Data is generally ingested into a message queue first and divided into real-time and offline data.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-write-real-time-data-into-apache-doris">How to write real-time data into Apache Doris:<a href="#how-to-write-real-time-data-into-apache-doris" class="hash-link" aria-label="How to write real-time data into Apache Doris:的直接链接" title="How to write real-time data into Apache Doris:的直接链接"></a></h3><p>Part of the real-time data is processed by Flink and ingested into Doris through the Flink-Doris-Connector provided by Apache Doris. The rest of the data is ingested through Spark Streaming. Under the hood, both writing approaches rely on the Stream Load provided by Apache Doris.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="how-to-write-offline-data-into-apache-doris">How to write offline data into Apache Doris:<a href="#how-to-write-offline-data-into-apache-doris" class="hash-link" aria-label="How to write offline data into Apache Doris:的直接链接" title="How to write offline data into Apache Doris:的直接链接"></a></h3><p>Offline data is first landed in Hive and then ingested into Doris through Xiaomi's data import tool. Users can directly submit Broker Load tasks to Xiaomi's data import tool and import data into Doris (see the sketch below), or import data through Spark SQL, which relies on the Spark-Doris-Connector provided by Apache Doris; the Spark-Doris-Connector is essentially a wrapper around Stream Load.</p>
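<p>As a rough illustration of such a Broker Load task (the label, paths, broker name, and columns below are hypothetical, not Xiaomi's actual setup):</p><pre><code>-- Load one day of Hive-produced files on HDFS into a Doris table.
LOAD LABEL example_db.events_20221208
(
    DATA INFILE("hdfs://namenode_host:8020/user/hive/warehouse/events/dt=2022-12-08/*")
    INTO TABLE events
    COLUMNS TERMINATED BY "\t"
    (user_id, event_name, event_time)
)
WITH BROKER "hdfs_broker"
(
    "username" = "hdfs_user",
    "password" = ""
)
PROPERTIES
(
    "timeout" = "3600"
);
</code></pre>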
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="data-qurey">Data Query<a href="#data-qurey" class="hash-link" aria-label="Data Qurey的直接链接" title="Data Qurey的直接链接"></a></h2><p><img loading="lazy" alt="page_6" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_6-14cf1592d25e4b6e4cc275e06c2e6673.jpg" width="1120" height="638" class="img_ev3q"></p><p>Users can query once data import is done. Inside Xiaomi, queries go through our data platform: users can perform visual queries on Doris through Xiaomi's data platform and conduct user behavior analysis and user portrait analysis. To help our teams conduct event analysis, retention analysis, funnel analysis, path analysis, and other behavioral analysis, we have added corresponding UDFs (User Defined Functions) and UDAFs (User Defined Aggregate Functions) to Doris.</p><p>In the upcoming version 1.2, Apache Doris adds metadata synchronization for external sources such as Hive/Hudi/Iceberg through the Multi-Catalog feature. External table query performance is improved, and the ability to access external tables greatly increases ease of use. In the future, we will consider querying Hive and Iceberg data directly through Doris, building a data lake architecture.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ab-test">A/B Test<a href="#ab-test" class="hash-link" aria-label="A/B Test的直接链接" title="A/B Test的直接链接"></a></h2><p>In real business, an A/B test is a method of comparing two versions of a strategy against each other to determine which one performs better. An A/B test is essentially an experiment where two or more variants of a page are shown to users at random, followed by statistical analysis. It is a popular approach for determining which variation performs better for a given conversion goal. Xiaomi's A/B test platform is an operations tool that conducts A/B tests with experimental grouping, traffic splitting, and scientific evaluation to assist decision making. Xiaomi's A/B test platform has several query applications: user deduplication, indicator summation, covariance calculation, etc. The query types involve COUNT(DISTINCT), Bitmap, LIKE, and so on.</p><p>Apache Doris also provides services to Xiaomi's A/B test platform. Every day, Xiaomi's A/B test platform needs to process a tremendous amount of data with billions of queries. That's why Xiaomi's A/B test platform is eager to improve query performance.</p><p>Apache Doris V1.1 was released just in time and fully supports vectorization in processing and storage. Compared with the non-vectorized version, query performance is significantly improved, so it was time to update Xiaomi's Doris cluster to the latest version. That's why we first launched the latest vectorized version of Doris on Xiaomi's A/B test platform.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="test-before-launch">Test before Launch<a href="#test-before-launch" class="hash-link" aria-label="Test before Launch的直接链接" title="Test before Launch的直接链接"></a></h2><p>Note: The following tests are based on Apache Doris V1.1.2</p><p>We built a test cluster for Apache Doris V1.1.2, the same size as Xiaomi's online Apache Doris V0.13 cluster, to test before the vectorized version went online.
<p>In the upcoming version 1.2, Apache Doris adds the ability to synchronize metadata from external sources such as Hive/Hudi/Iceberg through the Multi Catalog feature. External table query performance has been improved, and the ability to access external tables greatly increases ease of use. In the future, we will consider querying Hive and Iceberg data directly through Doris, building a data lake architecture.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="ab-test">A/B Test<a href="#ab-test" class="hash-link" aria-label="A/B Test的直接链接" title="A/B Test的直接链接"></a></h2><p>In real business, the A/B test is a method of comparing two versions of a strategy against each other to determine which one performs better. An A/B test is essentially an experiment where two or more variants of a page are shown to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal. Xiaomi's A/B test platform is an operational tool that conducts A/B tests with experimental grouping, traffic splitting, and scientific evaluation to assist decision making. Xiaomi's A/B test platform has several query applications: user deduplication, indicator summation, covariance calculation, etc. The query types involve Count(distinct), Bitmap, Like, etc.</p><p>Apache Doris also provides services to Xiaomi's A/B test platform. Every day, Xiaomi's A/B test platform needs to process a tremendous amount of data with billions of queries, which is why it is eager to improve query performance. </p><p>Apache Doris V1.1 was released just in time, with full vectorization support in query processing and storage. Compared with the non-vectorized version, query performance is significantly improved. It was time to update Xiaomi's Doris cluster to the latest version, so we first launched the vectorized version of Doris on Xiaomi's A/B test platform.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="test-before-launch">Test before Launch<a href="#test-before-launch" class="hash-link" aria-label="Test before Launch的直接链接" title="Test before Launch的直接链接"></a></h2><p>Note: The following tests are based on Apache Doris V1.1.2</p><p>We built a test cluster for Apache Doris V1.1.2, as big as Xiaomi's online Apache Doris V0.13 cluster, to test before the vectorized version went online. The test covers two aspects: a single-SQL parallel query test and a batch-SQL concurrent query test.</p><p>The configurations of the two clusters are exactly the same. The specific configuration information is as follows:</p><ul><li>Scale: 3 FEs + 89 virtual machines</li><li>CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz 16 cores 32 threads × 2</li><li>Memory: 256GB</li><li>Disk: 7.3TB × 12 HDD</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="single-sql-parrellel-query-test">Single SQL Parallel Query Test<a href="#single-sql-parrellel-query-test" class="hash-link" aria-label="Single SQL Parallel Query Test的直接链接" title="Single SQL Parallel Query Test的直接链接"></a></h3><p>We chose 7 classic queries from the Xiaomi A/B test. For each query, we limited the time range to 1 day, 7 days, and 20 days, where the daily partition holds about 3.1 billion rows (about 2 TB of data). The test results are shown in the figures:</p><p><img loading="lazy" alt="page_7" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_7-b41817232fb711c583332d813de7f684.jpg" width="750" height="450" class="img_ev3q"></p><p><img loading="lazy" alt="page_8" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_8-c8e10196ce6917449e8372205333f12c.jpg" width="750" height="450" class="img_ev3q"></p><p><img loading="lazy" alt="page_9" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_9-cfbcd21a8b00a3b50508251b78ebd163.jpg" width="750" height="450" class="img_ev3q"></p><p>Apache Doris V1.1.2 delivers at least a 3-5x performance improvement over Xiaomi's online Doris V0.13, which is remarkable.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="optimization">Optimization<a href="#optimization" class="hash-link" aria-label="Optimization的直接链接" title="Optimization的直接链接"></a></h2><p>Note: The following tests are based on Apache Doris V1.1.2</p><p>Based on Xiaomi's A/B test business data, we tuned Apache Doris V1.1.2 and conducted concurrent query tests on the tuned Doris V1.1.2 and Xiaomi's online Doris V0.13. The test results are as follows.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimization-in-test-1">Optimization in Test 1<a href="#optimization-in-test-1" class="hash-link" aria-label="Optimization in Test 1的直接链接" title="Optimization in Test 1的直接链接"></a></h3><p>We chose user deduplication, indicator summation, and covariance calculation queries (3,245 SQL statements in total) from the A/B test to run concurrent query tests on the two versions. The single-day partition of the table holds about 3.1 billion rows (about 2 TB of data), and the queries are based on the latest week's data. The test results are shown in the figures:</p><p><img loading="lazy" alt="page_10" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_10-98057ca75a1689b6c6eb9932cdd5e841.jpg" width="1080" height="338" class="img_ev3q"></p><p>Compared with Apache Doris V0.13, the overall average latency of Doris V1.1.2 is reduced by about 48%, and the P95 latency is reduced by about 49%. In this test, the query performance of Doris V1.1.2 nearly doubled compared to Doris V0.13.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimization-in-test-2">Optimization in Test 2<a href="#optimization-in-test-2" class="hash-link" aria-label="Optimization in Test 2的直接链接" title="Optimization in Test 2的直接链接"></a></h3>
<p>We chose 7 A/B test reports to test the two versions. Each A/B test report corresponds to two modules in the Xiaomi A/B test platform, and each module represents thousands of SQL queries. Each report submits query tasks to the clusters where the two versions reside at the same concurrency. The test results are shown in the figure:</p><p><img loading="lazy" alt="page_11" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_11-bbf60c474aaea1a007b5b413d6bad77a.jpg" width="750" height="450" class="img_ev3q"></p><p>Compared with Doris V0.13, Doris V1.1.2 reduces the overall average latency by around 52%. In this test, the query performance of Doris V1.1.2 more than doubled that of Doris V0.13. </p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="optimization-in-test-3">Optimization in Test 3<a href="#optimization-in-test-3" class="hash-link" aria-label="Optimization in Test 3的直接链接" title="Optimization in Test 3的直接链接"></a></h3><p>To verify the performance of the tuned Apache Doris V1.1.2 in other cases, we chose Xiaomi user behavior analysis to conduct concurrent query performance tests of Doris V1.1.2 and Doris V0.13, using behavior analysis queries for 4 days: October 24, 25, 26 and 27, 2022. The test results are shown in the figures:</p><p><img loading="lazy" alt="page_12" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_12-58242671fba5bbf25225b4d9d9f6d87c.jpg" width="1080" height="338" class="img_ev3q"></p><p>Compared with Doris V0.13, the overall average latency of Doris V1.1.2 is reduced by about 77%, and the P95 latency is reduced by about 83%. In this test, the query performance of Doris V1.1.2 is 4-6 times that of Doris V0.13.</p><h1>Conclusion</h1><p>Since we adopted it in 2019, Apache Doris has served dozens of businesses and sub-brands within Xiaomi, with dozens of clusters and hundreds of nodes. It completes more than 10,000 online analysis queries from users every day and is responsible for most of the online analysis workloads in Xiaomi.</p><p>After performance testing and tuning, Apache Doris V1.1.2 has met the launch requirements of the Xiaomi A/B test platform and performs well in query performance and stability. In some cases it even exceeded our expectations, such as the overall average latency being reduced by about 77% with our tuned version.</p><p>Meanwhile, some of the functions mentioned above have been released in Apache Doris V1.0 or V1.1, and some PRs have been merged into the community master branch and should be released soon. Recently, community activity has been greatly enhanced. We are glad to see that Apache Doris has become more and more mature and is stepping toward an integrated data lake. We truly believe that in the future, more data analysis will be explored and realized within Apache Doris.</p><h1>Contact Us</h1><p>Apache Doris Website:<a href="http://doris.apache.org" target="_blank" rel="noopener noreferrer">http://doris.apache.org</a></p><p>Github Homepage:<a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><p>Email to DEV:<a href="mailto:dev@doris.apache.org" target="_blank" rel="noopener noreferrer">dev@doris.apache.org</a></p>]]></content>
<author>
<name>ZuoWei</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.2.0]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.2.0</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.2.0"/>
<updated>2022-12-07T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, we are pleased to announce that we have officially released Apache Doris 1.2.0 on December 7, 2022]]></summary>
<content type="html"><![CDATA[<p>Dear Community, after months of polishing, we are pleased to announce the release of Apache Doris 1.2.0 on December 07, 2022! </p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="highlights">Highlights<a href="#highlights" class="hash-link" aria-label="Highlights的直接链接" title="Highlights的直接链接"></a></h2><ol><li><p>Full Vectorizied-Engine support, greatly improved performance</p><p>In the standard ssb-100-flat benchmark, the performance of 1.2 is 2 times faster than that of 1.1; in complex TPCH 100 benchmark, the performance of 1.2 is 3 times faster than that of 1.1.</p></li><li><p>Merge-on-Write Unique Key</p><p>Support Merge-On-Write on Unique Key Model. This mode marks the data that needs to be deleted or updated when the data is written, thereby avoiding the overhead of Merge-On-Read when querying, and greatly improving the reading efficiency on the updateable data model.</p></li><li><p>Multi Catalog</p><p>The multi-catalog feature provides Doris with the ability to quickly access external data sources for access. Users can connect to external data sources through the <code>CREATE CATALOG</code> command. Doris will automatically map the library and table information of external data sources. After that, users can access the data in these external data sources just like accessing ordinary tables. It avoids the complicated operation that the user needs to manually establish external mapping for each table.</p><p>Currently this feature supports the following data sources:</p><ol><li>Hive Metastore: You can access data tables including Hive, Iceberg, and Hudi. It can also be connected to data sources compatible with Hive Metastore, such as Alibaba Cloud's DataLake Formation. Supports data access on both HDFS and object storage.</li><li>Elasticsearch: Access ES data sources.</li><li>JDBC: Access MySQL through the JDBC protocol.</li></ol><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/multi-catalog" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/multi-catalog</a>)</p><blockquote><p>Note: The corresponding permission level will also be changed automatically, see the "Upgrade Notes" section for details.</p></blockquote></li><li><p>Light table structure changes</p></li></ol><p>In the new version, it is no longer necessary to change the data file synchronously for the operation of adding and subtracting columns to the data table, and only need to update the metadata in FE, thus realizing the millisecond-level Schema Change operation. Through this function, the DDL synchronization capability of upstream CDC data can be realized. For example, users can use Flink CDC to realize DML and DDL synchronization from upstream database to Doris.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE</a></p><p>When creating a table, set <code>"light_schema_change"="true"</code> in properties.</p><ol start="5"><li><p>JDBC facade</p><p>Users can connect to external data sources through JDBC. 
<ol start="5"><li><p>JDBC external tables</p><p>Users can connect to external data sources through JDBC. Currently supported:</p><ul><li>MySQL</li><li>PostgreSQL</li><li>Oracle</li><li>SQL Server</li><li>ClickHouse</li></ul><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/jdbc-of-doris/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/ecosystem/external-table/jdbc-of-doris/</a></p><blockquote><p>Note: The ODBC feature will be removed in a later version; please switch to JDBC.</p></blockquote></li><li><p>Java UDF</p><p>Supports writing UDFs/UDAFs in Java, making it convenient for users to use custom functions from the Java ecosystem. At the same time, through technologies such as off-heap memory and zero copy, the efficiency of cross-language data access has been greatly improved.</p><p>Document: <a href="https://doris.apache.org/zh-CN/docs/dev/ecosystem/udf/java-user-defined-function" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/ecosystem/udf/java-user-defined-function</a></p><p>Example: <a href="https://github.com/apache/doris/tree/master/samples/doris-demo" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/tree/master/samples/doris-demo</a></p></li><li><p>Remote UDF</p><p>Supports accessing remote user-defined function services through RPC, completely eliminating language restrictions for users writing UDFs. Users can use any programming language to implement custom functions to complete complex data analysis work.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/ecosystem/udf/remote-user-defined-function" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/ecosystem/udf/remote-user-defined-function</a></p><p>Example: <a href="https://github.com/apache/doris/tree/master/samples/doris-demo" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/tree/master/samples/doris-demo</a></p></li><li><p>More data types support</p><ul><li><p>Array type</p><p>Array types are supported, including nested array types. In some scenarios such as user portraits and tags, the Array type can better adapt to business scenarios. In the new version we have also implemented a large number of array-related functions to better support the type in actual scenarios (see the sketch below).</p></li></ul><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Types/ARRAY" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Types/ARRAY</a></p><p>Related functions: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/array-functions/array_max" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/array-functions/array_max</a></p>
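<p>As a hedged illustration of the ARRAY type and its functions (the table, columns, and tag values below are hypothetical):</p><pre><code class="language-sql">CREATE TABLE example_db.user_tags
(
    user_id BIGINT,
    tags ARRAY&lt;VARCHAR(32)&gt;    -- array column for user tags
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES ("replication_num" = "3");

-- Array functions such as array_contains can filter on the column:
SELECT user_id
FROM example_db.user_tags
WHERE array_contains(tags, 'new_device');
</code></pre>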
<ul><li><p>JSONB type</p><p>Support for the binary JSON data type JSONB. This type provides a more compact JSON encoding format, as well as data access directly on the encoded format. Compared with JSON data stored as strings, it can save storage space and improve parsing performance several times over.</p></li></ul><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Types/JSONB" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Types/JSONB</a></p><p>Related functions: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/json-functions/jsonb_parse" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/json-functions/jsonb_parse</a></p><ul><li><p>Date V2</p><p>Scope of impact:</p><ol><li>Users need to specify datev2 and datetimev2 when creating a table; the date and datetime columns of existing tables are not affected.</li><li>When datev2 and datetimev2 are used in calculations with the original date and datetime types (for example, in an equi-join), the original type is cast to the new type for the calculation.</li><li>Examples are given in the documentation.</li></ol><p>Documentation: <a href="https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Types/DATEV2" target="_blank" rel="noopener noreferrer">https://doris.apache.org/docs/dev/sql-manual/sql-reference/Data-Types/DATEV2</a></p></li></ul></li></ol><h2 class="anchor anchorWithStickyNavbar_LWe7" id="more">More<a href="#more" class="hash-link" aria-label="More的直接链接" title="More的直接链接"></a></h2><ol><li><p>A new memory management framework</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/maint-monitor/memory-management/memory-tracker" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/maint-monitor/memory-management/memory-tracker</a></p></li><li><p>Table Valued Function</p><p>Doris implements a set of Table Valued Functions (TVF).
TVF can be regarded as an ordinary table, which can appear in all places where "table" can appear in SQL.</p><p>For example, we can use S3 TVF to implement data import on object storage:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">insert into tbl select * from s3("s3://bucket/file.*", "ak" = "xx", "sk" = "xxx") where c1 &gt; 2;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Or directly query data files on HDFS:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">insert into tbl select * from hdfs("hdfs://bucket/file.*") where c1 &gt; 2;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>TVF can help users make full use of the rich expressiveness of SQL and flexibly process various data.</p><p>Documentation:</p><p><a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/s3" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/s3</a></p><p><a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/hdfs" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/table-functions/hdfs</a></p></li><li><p>A more convenient way to create partitions</p><p>Support for creating multiple partitions within a time range via the <code>FROM TO</code> command.</p></li><li><p>Column renaming</p><p>For tables with Light Schema Change enabled, column renaming is supported.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-RENAME" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-TABLE-RENAME</a></p></li><li><p>Richer permission management</p><ul><li><p>Support row-level 
permissions</p><p>Row-level permissions can be created with the <code>CREATE ROW POLICY</code> command.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-POLICY" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-POLICY</a></p></li><li><p>Support specifying password strength, expiration time, etc.</p></li><li><p>Support for locking accounts after multiple failed logins.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Account-Management-Statements/ALTER-USER" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Account-Management-Statements/ALTER-USER</a></p></li></ul></li><li><p>Import</p><ul><li><p>CSV import supports CSV files with headers.</p><p>Search for <code>csv_with_names</code> in the documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD/</a></p></li><li><p>Stream Load adds <code>hidden_columns</code>, which can explicitly specify the delete flag column and sequence column.</p><p>Search for <code>hidden_columns</code> in the documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD</a></p></li><li><p>Spark Load supports Parquet and ORC file import.</p></li><li><p>Support for cleaning up completed import labels</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CLEAN-LABEL" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CLEAN-LABEL</a></p></li><li><p>Support batch cancellation of import jobs by status</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CANCEL-LOAD" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CANCEL-LOAD</a></p></li><li><p>Added support for Alibaba Cloud OSS, Tencent Cloud COS/CHDFS and Huawei Cloud OBS in Broker Load.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/advanced/broker" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/advanced/broker</a></p></li><li><p>Support access to HDFS through hive-site.xml file configuration.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/config/config-dir" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/config/config-dir</a></p></li></ul></li><li><p>Support viewing the contents of the catalog recycle bin through the <code>SHOW CATALOG RECYCLE BIN</code> statement.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Show-Statements/SHOW-CATALOG-RECYCLE-BIN" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Show-Statements/SHOW-CATALOG-RECYCLE-BIN</a></p></li><li><p>Support <code>SELECT * EXCEPT</code> syntax (see the sketch after this list).</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/data-table/basic-usage" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/data-table/basic-usage</a></p></li><li><p>OUTFILE supports export in ORC format and supports multi-byte delimiters.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE</a></p></li><li><p>Support configuring the number of Query Profiles that can be saved.</p><p>Search the documentation for the FE configuration item: max_query_profile_num</p></li><li><p>The DELETE statement supports IN predicate conditions and supports partition pruning.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/DELETE" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/DELETE</a></p></li><li><p>The default value of a time column can be set to <code>CURRENT_TIMESTAMP</code></p><p>Search for "CURRENT_TIMESTAMP" in the documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE</a></p></li><li><p>Add two system tables: backends, rowsets</p><p>Documentation:</p><p><a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/backends" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/backends</a></p><p><a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/rowsets" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/system-table/rowsets</a></p></li><li><p>Backup and restore</p><ul><li><p>The Restore job supports the <code>reserve_replica</code> parameter, so that the number of replicas of the restored table is the same as that of the backup.</p></li><li><p>The Restore job supports the <code>reserve_dynamic_partition_enable</code> parameter, so that the restored table keeps dynamic partitioning enabled.</p></li></ul><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Backup-and-Restore/RESTORE" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Backup-and-Restore/RESTORE</a></p><ul><li>Support backup and restore operations through the built-in libhdfs, no longer relying on a broker.</li></ul><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Backup-and-Restore/CREATE-REPOSITORY" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Backup-and-Restore/CREATE-REPOSITORY</a></p></li><li><p>Support data balancing between multiple disks on the same machine</p><p>Documentation:</p>
href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-REBALANCE-DISK" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-REBALANCE-DISK</a></p><p><a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-CANCEL-REBALANCE-DISK" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-CANCEL-REBALANCE-DISK</a></p></li><li><p>Routine Load supports subscribing to Kerberos-authenticated Kafka services.</p><p>Search for kerberos in the documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/data-operate/import/import-way/routine-load-manual" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/data-operate/import/import-way/routine-load-manual</a></p></li><li><p>New built-in-function</p><p>Added the following built-in functions:</p><ul><li><code>cbrt</code></li><li><code>sequence_match/sequence_count</code></li><li><code>mask/mask_first_n/mask_last_n</code></li><li><code>elt</code></li><li><code>any/any_value</code></li><li><code>group_bitmap_xor</code></li><li><code>ntile</code></li><li><code>nvl</code></li><li><code>uuid</code></li><li><code>initcap</code></li><li><code>regexp_replace_one/regexp_extract_all</code></li><li><code>multi_search_all_positions/multi_match_any</code></li><li><code>domain/domain_without_www/protocol</code></li><li><code>running_difference</code></li><li><code>bitmap_hash64</code></li><li><code>murmur_hash3_64</code></li><li><code>to_monday</code></li><li><code>not_null_or_empty</code></li><li><code>window_funnel</code></li><li><code>group_bit_and/group_bit_or/group_bit_xor</code></li><li><code>outer combine</code></li><li>and all array functions</li></ul></li></ol><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-notices">Upgrade Notices<a href="#upgrade-notices" class="hash-link" aria-label="Upgrade Notices的直接链接" title="Upgrade Notices的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="known-issues">Known Issues<a href="#known-issues" class="hash-link" aria-label="Known Issues的直接链接" title="Known Issues的直接链接"></a></h3><ul><li>Use JDK11 will cause BE crash, please use JDK8 instead.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changes">Behavior Changes<a href="#behavior-changes" class="hash-link" aria-label="Behavior Changes的直接链接" title="Behavior Changes的直接链接"></a></h3><ul><li><p>Permission level changes</p><p>Because the catalog level is introduced, the corresponding user permission level will also be changed automatically. The rules are as follows:</p><ul><li>GlobalPrivs and ResourcePrivs remain unchanged</li><li>Added CatalogPrivs level.</li><li>The original DatabasePrivs level is added with the internal prefix (indicating the db in the internal catalog)</li><li>Add the internal prefix to the original TablePrivs level (representing tbl in the internal catalog)</li></ul></li><li><p>In GroupBy and Having clauses, match on column names in preference to aliases. (#14408)</p></li><li><p>Creating columns starting with <code>mv_</code> is no longer supported. 
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-notices">Upgrade Notices<a href="#upgrade-notices" class="hash-link" aria-label="Upgrade Notices的直接链接" title="Upgrade Notices的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="known-issues">Known Issues<a href="#known-issues" class="hash-link" aria-label="Known Issues的直接链接" title="Known Issues的直接链接"></a></h3><ul><li>Using JDK11 will cause the BE to crash; please use JDK8 instead.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="behavior-changes">Behavior Changes<a href="#behavior-changes" class="hash-link" aria-label="Behavior Changes的直接链接" title="Behavior Changes的直接链接"></a></h3><ul><li><p>Permission level changes</p><p>Because the catalog level is introduced, the corresponding user permission levels will also be changed automatically. The rules are as follows:</p><ul><li>GlobalPrivs and ResourcePrivs remain unchanged</li><li>Added the CatalogPrivs level.</li><li>The internal prefix is added to the original DatabasePrivs level (indicating the db in the internal catalog)</li><li>The internal prefix is added to the original TablePrivs level (representing the tbl in the internal catalog)</li></ul></li><li><p>In GroupBy and Having clauses, column names are matched in preference to aliases. (#14408)</p></li><li><p>Creating columns starting with <code>mv_</code> is no longer supported. <code>mv_</code> is a reserved keyword in materialized views (#14361)</p></li><li><p>Removed the default limit of 65535 rows added by the order by statement, and added the session variable <code>default_order_by_limit</code> to configure this limit. (#12478)</p></li><li><p>In the table generated by "Create Table As Select", all string columns use the string type uniformly and no longer distinguish varchar/char/string (#14382)</p></li><li><p>In the audit log, removed the word <code>default_cluster</code> before the db and user names. (#13499) (#11408)</p></li><li><p>Added the sql digest field to the audit log (#8919)</p></li><li><p>The order by logic of the union clause changes: in the new version, the order by clause is executed after the union is executed, unless explicitly grouped by parentheses. (#9745)</p></li><li><p>During the decommission operation, tablets in the recycle bin will be ignored to ensure that the decommission can be completed. (#14028)</p></li><li><p>The returned result of Decimal will be displayed according to the precision declared in the original column, or according to the precision specified in the cast function. (#13437)</p></li><li><p>Changed the column name length limit from 64 to 256 (#14671)</p></li><li><p>Changes to FE configuration items</p><ul><li><p>The <code>enable_vectorized_load</code> parameter is enabled by default. (#11833)</p></li><li><p>Increased the <code>create_table_timeout</code> value. The default timeout for table creation operations is increased. (#13520)</p></li><li><p>Modified the <code>stream_load_default_timeout_second</code> default value to 3 days.</p></li><li><p>Modified the default value of <code>alter_table_timeout_second</code> to one month.</p></li><li><p>Added the parameter <code>max_replica_count_when_schema_change</code> to limit the number of replicas involved in an alter job; the default is 100000. (#12850)</p></li><li><p>Added <code>disable_iceberg_hudi_table</code>. Iceberg and Hudi external tables are disabled by default; the Multi Catalog feature is recommended instead. (#13932)</p></li></ul></li><li><p>Changes to BE configuration items</p><ul><li><p>Removed the <code>disable_stream_load_2pc</code> parameter. Stream Load with 2PC can be used directly. (#13520)</p></li><li><p>Modified <code>tablet_rowset_stale_sweep_time_sec</code> from 1800 seconds to 300 seconds.</p></li><li><p>Redesigned the configuration item names related to compaction (#13495)</p></li><li><p>Revisited the parameters related to memory optimization (#13781)</p></li></ul></li><li><p>Session variable changes</p><ul><li><p>Modified the variable <code>enable_insert_strict</code> to true by default. This means some insert operations that previously executed but inserted illegal values will no longer execute. (#11866)</p></li><li><p>Modified the variable <code>enable_local_exchange</code> to default to true (#13292)</p></li><li><p>Data transmission defaults to lz4 compression, controlled by the variable <code>fragment_transmission_compression_codec</code> (#11955)</p></li><li><p>Added the <code>skip_storage_engine_merge</code> variable for debugging unique or agg model data (#11952)</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/advanced/variables" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/advanced/variables</a></p></li></ul></li><li><p>The BE startup script will check whether the value of <code>/proc/sys/vm/max_map_count</code> is greater than 2 million; otherwise, the startup fails.
(#11052)</p></li><li><p>Removed the mini load interface (#10520)</p></li><li><p>FE Metadata Version</p><p>FE Meta Version changed from 107 to 114 and cannot be rolled back after upgrading.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="during-upgrade">During Upgrade<a href="#during-upgrade" class="hash-link" aria-label="During Upgrade的直接链接" title="During Upgrade的直接链接"></a></h3><ol><li><p>Upgrade preparation</p><ul><li><p>Need to replace: the lib and bin directories (the start/stop scripts have been modified)</p></li><li><p>BE also needs to configure JAVA_HOME, as it now supports JDBC Table and Java UDF.</p></li><li><p>The default JVM Xmx parameter in fe.conf is changed to 8GB.</p></li></ul></li><li><p>Possible errors during the upgrade process</p><ul><li><p>The repeat function cannot be used and an error is reported: <code>vectorized repeat function cannot be executed</code>; you can turn off the vectorized execution engine before upgrading. (#13868)</p></li><li><p>Schema change fails with the error: <code>desc_tbl is not set. Maybe the FE version is not equal to the BE</code> (#13822)</p></li><li><p>Vectorized hash join cannot be used and an error is reported: <code>vectorized hash join cannot be executed</code>. You can turn off the vectorized execution engine before upgrading. (#13753)</p></li></ul><p>The above errors will disappear after a full upgrade.</p></li></ol><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-impact">Performance Impact<a href="#performance-impact" class="hash-link" aria-label="Performance Impact的直接链接" title="Performance Impact的直接链接"></a></h3><ul><li><p>By default, JeMalloc is used as the memory allocator of the new version of BE, replacing TcMalloc (#13367)</p></li><li><p>The batch size in the tablet sink is modified to be at least 8K. (#13912)</p></li><li><p>Disabled the chunk allocator by default (#13285)</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="api-changes">API Changes<a href="#api-changes" class="hash-link" aria-label="API Changes的直接链接" title="API Changes的直接链接"></a></h3><ul><li><p>BE's HTTP API error return information changed from <code>{"status": "Fail", "msg": "xxx"}</code> to the more specific <code>{"status": "Not found", "msg": "Tablet not found. tablet_id=1202"}</code> (#9771)</p></li><li><p>In <code>SHOW CREATE TABLE</code>, the content of comment is changed from double quotes to single quotes (#10327)</p></li><li><p>Support for ordinary users to obtain query profiles through an HTTP command. (#14016)
Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/http-actions/fe/manager/query-profile-action" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/http-actions/fe/manager/query-profile-action</a></p></li><li><p>Optimized the way to specify the sequence column; the column name can now be specified directly. (#13872)
Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/data-operate/update-delete/sequence-column-manual" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/data-operate/update-delete/sequence-column-manual</a></p></li><li><p>Added the space usage of remote storage to the results returned by <code>show backends</code> and <code>show tablets</code> (#11450)</p></li><li><p>Removed Num-Based Compaction related code (#13409)</p></li><li><p>Refactored BE's error code mechanism; some returned error messages will change (#8855)
</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="other">Other<a href="#other" class="hash-link" aria-label="Other的直接链接" title="Other的直接链接"></a></h3><ul><li><p>Support for an official Docker image.</p></li><li><p>Support compiling Doris on macOS (x86/M1) and Ubuntu 22.04.</p>
<p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/install/source-install/compilation-mac/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/install/source-install/compilation-mac/</a></p></li><li><p>Support for image file verification.</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/maint-monitor/metadata-operation/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/maint-monitor/metadata-operation/</a></p></li><li><p>Script related</p><ul><li><p>The stop scripts of FE and BE support exiting FE and BE via the <code>--grace</code> parameter (using the kill -15 signal instead of kill -9)</p></li><li><p>The FE start script supports checking the current FE version via --version (#11563)</p></li></ul></li><li><p>Support for getting the data and the related table creation statement of a tablet through the <code>ADMIN COPY TABLET</code> command, for local problem debugging (#12176)</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-COPY-TABLET" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Database-Administration-Statements/ADMIN-COPY-TABLET</a></p></li><li><p>Support for obtaining the table creation statements related to a SQL statement through the HTTP API, for local problem reproduction (#11979)</p><p>Documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/admin-manual/http-actions/fe/query-schema-action" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/admin-manual/http-actions/fe/query-schema-action</a></p></li><li><p>Support for disabling compaction on a table when creating it, for testing (#11743)</p><p>Search for "disable_auto_compaction" in the documentation: <a href="https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="big-thanks">Big Thanks<a href="#big-thanks" class="hash-link" aria-label="Big Thanks的直接链接" title="Big Thanks的直接链接"></a></h2><p>Thanks to ALL who contributed to this release!
(alphabetically)</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">@924060929</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@a19920714liou</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@adonis0147</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Aiden-Dong</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@aiwenmo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@AshinGau</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@b19mud</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@BePPPower</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@BiteTheDDDDt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@bridgeDream</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ByteYue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@caiconghui</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@CalvinKirs</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@cambyzju</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@caoliang-web</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@carlvinhust2012</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@catpineapple</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ccoffline</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@chenlinzhong</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@chovy-3012</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@coderjiang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@cxzl25</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dataalive</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dataroaring</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dependabot[bot]</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dinggege1024</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@DongLiang-0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Doris-Extras</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@eldenmoon</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@EmmyMiao87</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@englefly</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">@FreeOnePlus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Gabriel39</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@gaodayue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@geniusjoe</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@gj-zhang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@gnehil</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@GoGoWen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@HappenLee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@hello-stephen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Henry2SS</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@hf200012</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@huyuanfeng2018</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jacktengg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jackwener</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jeffreys-cat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Jibing-Li</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@JNSimba</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Kikyou1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Lchangliang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@LemonLiTree</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lexoning</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liaoxin01</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lide-reed</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@link3280</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liutang123</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liuyaolin</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@LOVEGISER</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lsy3993</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luozenglin</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luzhijing</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@madongz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morningman</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morningman-cmy</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morrySnow</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@mrhhsg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">@Myasuka</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@myfjdthink</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@nextdreamblue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@pan3793</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@pangzhili</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@pengxiangyu</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@platoneko</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@qidaye</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@qzsee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SaintBacchus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SeekingYang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@smallhibiscus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@sohardforaname</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@song7788q</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@spaces-X</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ssusieee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@stalary</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@starocean999</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SWJTU-ZhangLei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@TaoZex</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@timelxy</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Wahno</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangbo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangshuo128</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangyf0555</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@weizhengte</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@weizuo93</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wsjz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wunan1210</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xhmz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xiaokang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xiaokangguo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xinyiZzz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xy720</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yangzhg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain">@Yankee24</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yeyudefeng</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yiguolei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yinzhijian</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yixiutt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yuanyuan8983</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zbtzbtzbt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zenoyang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhangboya1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhangstar333</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhannngchen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ZHbamboo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhengshiJ</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhenhb</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhqu1148980644</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zuochunwei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zy-kkk</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[JD.com's exploration and practice with Apache Doris in real time OLAP]]></title>
<id>https://doris.apache.org/zh-CN/blog/JD_OLAP</id>
<link href="https://doris.apache.org/zh-CN/blog/JD_OLAP"/>
<updated>2022-12-02T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article discusses the exploration and practice of the search engine team in JD.com using Apache Flink and Apache Doris in real-time data analysis. The popularity of stream computing is increasing day by day: More papers are published on Google Dataflow; Apache Flink has become the one of the most popular engine in the world; There is wide application of real-time analytical databases more than ever before, such as Apache Doris; Stream computing engines are really flourishing. However, no engine is perfect enough to solve every problem. It is important to find a suitable OLAP engine for the business. We hope that JD.com's practice in real-time OLAP and stream computing may give you some inspiration.]]></summary>
<content type="html"><![CDATA[<p><img loading="lazy" alt="kv" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-e94fd46c1522a3383d161daec2249d18.png" width="900" height="383" class="img_ev3q"></p><blockquote><p>Guide:
This article discusses the exploration and practice of the search engine team at JD.com using Apache Flink and Apache Doris in real-time data analysis. The popularity of stream computing is increasing day by day: more papers are published on Google Dataflow; Apache Flink has become one of the most popular engines in the world; real-time analytical databases such as Apache Doris are more widely applied than ever before. Stream computing engines are really flourishing. However, no engine is perfect enough to solve every problem, and it is important to find a suitable OLAP engine for the business. We hope that JD.com's practice in real-time OLAP and stream computing may give you some inspiration.</p></blockquote><blockquote><p>Author: Li Zhe, data engineer at JD.com, focused on offline data, stream computing and application development.</p></blockquote><h2 class="anchor anchorWithStickyNavbar_LWe7" id="about-jdcom">About JD.com<a href="#about-jdcom" class="hash-link" aria-label="About JD.com的直接链接" title="About JD.com的直接链接"></a></h2><p>JD.com (NASDAQ: JD), a leading e-commerce company in China, had net revenue of RMB 951.6 billion in 2021. JD Group owns JD Retail, JD Global, JD Technology, JD Logistics, JD Cloud, etc. JD Group was officially listed on the NASDAQ Stock Exchange in May 2014.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="jd-search-boxs-requirement-real-time-data-analysis">JD Search Box's Requirement: Real-time Data Analysis<a href="#jd-search-boxs-requirement-real-time-data-analysis" class="hash-link" aria-label="JD Search Box's Requirement: Real-time Data Analysis的直接链接" title="JD Search Box's Requirement: Real-time Data Analysis的直接链接"></a></h2><p>The JD search box, as the entrance of the e-commerce platform, provides a link between merchants and users. Users express their needs through the search box. In order to better understand user intentions and quickly improve the conversion rate, multiple A/B tests covering multiple products run online at the same time, and categories, organizations, and brands all need to be monitored online for better conversion. At present, the JD search box's demand for real-time data mainly includes three parts:</p><ol><li>The overall data of the JD search box.</li><li>Real-time monitoring of A/B tests.</li><li>A top list of hot search words to reflect changes in public opinion, since trending words reflect what users care about.</li></ol><p>The analysis mentioned above needs to refine the data to the SKU level. At the same time, we also undertake the task of building a real-time data platform to show our business analysts different real-time stream computing data.</p><p>Although different business analysts care about different data granularities, time frequencies, and dimensions, we hope to establish a unified real-time OLAP data warehouse and provide a set of safe, reliable and flexible real-time data services.</p><p>At present, newly generated exposure logs reach hundreds of millions of records every day. The volume increases by 10 times if the logs are stored at the SKU level, and grows to billions of records if broken down by A/B test. Multi-dimensional aggregation queries require second-level response times. </p><p>Such an amount of data also brings huge challenges to the team: 2 billion rows are created daily; up to 60 million rows need to be imported per minute; data latency should be limited to 1 minute; MDX queries need to be executed within 3 seconds; and QPS has reached above 20.
The new OLAP database must also be reliable and highly stable, and able to respond to priority-0 emergencies.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-evolution-of-the-real-time-architecture">The Evolution of the Real-time Architecture<a href="#the-evolution-of-the-real-time-architecture" class="hash-link" aria-label="The Evolution of the Real-time Architecture的直接链接" title="The Evolution of the Real-time Architecture的直接链接"></a></h2><p>Our previous architecture was based on Apache Storm for point-to-point data processing. This approach quickly met the needs of real-time reports during the early stage of rapid business growth. However, as the business continued to develop, disadvantages gradually appeared: poor flexibility, poor data consistency, low development efficiency and rising resource costs.</p><p><img loading="lazy" alt="page_2" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_2-bc63d65e9c203504cbc7900319d0211c.png" width="1684" height="801" class="img_ev3q"></p><p>In order to solve the problems of the previous architecture, we first upgraded the architecture and replaced Apache Storm with Apache Flink to achieve high throughput. At the same time, based on the characteristics of the search data, the real-time data is processed hierarchically: a PV data flow, a SKU data flow and an A/B test data flow are created, with the goal of building the upper real-time OLAP layer on top of these real-time flows.</p><p>When selecting the OLAP database, the following points need to be considered:</p><ol><li>The data latency is at minute level and the query response time is at second level</li><li>Supports standard SQL, which reduces the cost of use</li><li>Supports JOIN to facilitate adding dimensions</li><li>Traffic data can be deduplicated approximately, but order data must be exactly deduplicated</li><li>High throughput with tens of millions of records per minute and tens of billions of new records every day</li><li>Query concurrency needs to be high because the front end may query it directly</li></ol><p>Among the OLAP engines that support real-time import, we made an in-depth comparison of Apache Druid, Elasticsearch, ClickHouse and Apache Doris:</p><p><img loading="lazy" alt="page_3" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_3-578754e222201a65b0601326dc8b298b.png" width="2667" height="778" class="img_ev3q"></p><p>We found that both Doris and ClickHouse could meet our needs, but the concurrency of ClickHouse is too low for us, which is a potential risk. Moreover, ClickHouse's data import has no transaction support and cannot achieve exactly-once semantics, and its SQL support is incomplete.</p><p>Finally, we chose Apache Doris as our real-time OLAP database. For user behavior log data, we use Aggregate Key tables; for e-commerce order data, we use Unique Key tables (see the sketch below). Moreover, we split the previous tasks and reuse the logic we tried before, so when Flink is processing, new topic flows and real-time flows of different granularities are generated in the DWD layer. The new architecture is as follows:</p><p><img loading="lazy" alt="page_4" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_4-1f5e1ab38f22766b4ac14b73ee164d59.png" width="3004" height="1571" class="img_ev3q"></p>
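<p>As a hedged sketch of the two table models just mentioned: an Aggregate Key table for behavior logs and a Unique Key table for orders. All table and column names are hypothetical, not JD.com's actual schemas:</p><pre><code class="language-sql">-- Behavior logs: metric columns are pre-aggregated on the key columns at import.
CREATE TABLE search_pv_agg
(
    dt DATE,
    query_word VARCHAR(256),
    pv BIGINT SUM DEFAULT "0",     -- summed on import
    uv_hll HLL HLL_UNION           -- approximate UV via HyperLogLog
)
AGGREGATE KEY(dt, query_word)
PARTITION BY RANGE(dt)
(
    PARTITION p20221027 VALUES LESS THAN ("2022-10-28")
)
DISTRIBUTED BY HASH(query_word) BUCKETS 32;

-- Orders: rows sharing the same key are merged, keeping the latest row.
CREATE TABLE orders
(
    order_id BIGINT,
    status VARCHAR(32),
    amount DECIMAL(18, 2)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 32;
</code></pre>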
<p>Another advantage of connecting Kafka directly to Doris at the detail layer is that it naturally supports data backtracking: when real-time data arrives out of order, the "late" data can be recalculated and previous results updated, because delayed data can be written to the table whenever it arrives. The final solution is as follows:</p><p><img loading="lazy" alt="page_5" src="https://cdnd.selectdb.com/zh-CN/assets/images/page_5-e8fecc91db2d8fcc3495fb45a0e8e8c2.png" width="1116" height="705" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="optimization-during-the-promotion">Optimization during the Promotion<a href="#optimization-during-the-promotion" class="hash-link" aria-label="Optimization during the Promotion的直接链接" title="Optimization during the Promotion的直接链接"></a></h2><p>As mentioned above, we have established Aggregate Key tables of different granularities in Doris, including PV, SKU and A/B test granularity. Here we take the exposure A/B test model, which has the largest amount of daily production data, as an example to explain how we support queries over tens of billions of records per day during the big promotion period.</p><p>Strategies we used:</p><ul><li>Monitoring: 10/30/60-minute A/B test windows with indicators such as exposure PV, exposure UV, exposed SKU count, click PV, click UV and CTR.</li><li>Data modeling: build Aggregate Key tables on real-time exposure data, and compute UV and PV approximately with HyperLogLog.</li></ul><p>Cluster setup:</p><ul><li>30+ virtual machines with NVMe SSD storage</li><li>40+ partitions for A/B test exposure data</li><li>Tens of billions of new records created every day</li><li>2 Rollups</li></ul><p>Overall benefits:</p><ul><li>The bucketing column quickly locates the right tablet during queries</li><li>600 million records imported in 10 minutes</li><li>The 2 Rollups keep IO relatively low and meet query requirements</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="look-ahead">Look Ahead<a href="#look-ahead" class="hash-link" aria-label="Look Ahead的直接链接" title="Look Ahead的直接链接"></a></h2><p>JD search box introduced Apache Doris in May 2020, with 30+ BEs and 10+ Routine Load tasks running online at the same time. Replacing Flink's window computing with Doris not only improves development efficiency and adapts to dimension changes, but also reduces computing resources. Apache Doris also provides unified interface services, ensuring data consistency and security.
We are also pushing to upgrade JD search box's OLAP platform to the latest version. After upgrading, we plan to use the bitmap functions to support exact deduplication of UV and other indicators. In addition, we plan to use appropriate Flink windows to develop real-time stream computing at the aggregation layer, increasing the richness and completeness of the data.</p>]]></content>
<author>
<name>Li Zhe</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris helped Netease create a refined operation DMP system]]></title>
<id>https://doris.apache.org/zh-CN/blog/Netease</id>
<link href="https://doris.apache.org/zh-CN/blog/Netease"/>
<updated>2022-11-30T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Better data analysis enables users to get better experience. Currently, the normal analysis method is to build a user tags system to accurately generate user portraits and improve user experience. The topic we shared today is the practice of Netease DMP tags system.]]></summary>
<content type="html"><![CDATA[<h1>Apache Doris Helped Netease Create a Refined Operation DMP System</h1><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-a63c2e8908df91d10704f971aa636fa6.png" width="900" height="383" class="img_ev3q"></p><blockquote><p>Guide: Refined operation is a trend of the future Internet, which requires excellent data analysis. In this article, you will get knowledge of: the construction of Netease Lifease's DMP system and the application of Apache Doris.</p></blockquote><blockquote><p>Author | Xiaodong Liu, Lead Developer, Netease</p></blockquote><p>Better data analysis enables users to get better experience. Currently, the normal analysis method is to build a user tags system to accurately generate user portraits and improve user experience. The topic we shared today is the practice of Netease DMP tags system.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="about-netease-and-lifease">About Netease and Lifease<a href="#about-netease-and-lifease" class="hash-link" aria-label="About Netease and Lifease的直接链接" title="About Netease and Lifease的直接链接"></a></h2><p>NetEase (NASDAQ: NTES) is a leading Internet technology company in China, providing users with free emails, gaming, search engine services, news and entertainment, sports, e-commerce and other services.</p><p>Lifease is Netease's self-operated home furnishing e-commerce brand. Its products cover 8 categories in total: home life, apparel, food and beverages, personal care and cleaning, baby products, outdoor sport, digital home appliances, and Lifease's Special. In Q1 of 2022, Lifease launches "Pro " membership and other multiple memberships for different users. The number of Pro members has increased by 65% ​​compared with the previous year.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="about-the-dmp-system">About the DMP System<a href="#about-the-dmp-system" class="hash-link" aria-label="About the DMP System的直接链接" title="About the DMP System的直接链接"></a></h2><p>DMP system plays an important role in Lifease's data analysis.
The data sources of DMP mainly include:</p><ul><li>Business logs of apps, H5 pages, PCs and other terminals</li><li>Basic data constructed within NetEase Group</li><li>Data from products sold on third-party platforms such as JD.com, Alibaba and Bytedance</li></ul>
<p>Through data collection and cleaning, the above data is ingested into data assets. On top of this data, DMP has built functions such as tag creation, grouping and portrait analysis, supporting business scenarios including intelligent product matching, user engagement and user insight. In general, the DMP system concentrates on building a data-centric tagging system and portrait system to assist the business.</p><p>You can get basic knowledge of the DMP system from the concepts below:</p><ul><li>Tagging: the ability to uniquely identify individual users across different browsers, devices and user sessions by capturing data available in your application, such as age, address, preferences and other variables. </li><li>Targeting: selecting a target audience by age, gender, income, location, interests or a myriad of other factors.</li><li>User portrait analysis: developing the profiles, actions and attributes of a targeted audience. For instance, checking the behavior paths and consumption patterns of users whose portrait is "City: Hangzhou, Gender: Female" in the Lifease app.</li></ul><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/1__core_capability-188f05fadbac0c4dfa3574a4e140cb8b.png" width="1153" height="642" class="img_ev3q"></p><p>Lifease's tagging system mainly provides two core capabilities: </p><ol><li>Tag query: the ability to query a specified tag of a specific entity, often used to display basic information. </li><li>Audience targeting: for both real-time and offline targets. Targeting results are mainly used as follows:</li></ol><ul><li>Grouping criteria: telling whether a user belongs to one or more specified groups, as in advertising and contact marketing scenarios. </li><li>Result set pull: extracting specified data to business systems for customized development.</li><li>Portrait analysis: analyzing the behavioral and consumption patterns of specific groups of people for more refined operations.</li></ul><p>The overall business process is as follows:</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/2__business_process-ca10e9f507ff8157caa521d0c44d7fc4.png" width="1223" height="662" class="img_ev3q"></p><ul><li>First, define the rules for tags and grouping;</li><li>After the DSL is defined, the task can be submitted to Spark for processing;</li><li>After processing is done, the results are stored in Hive and Doris;</li><li>Data in Hive or Doris can then be queried and used according to actual business needs.</li></ul><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/3__dmp_architecture-82a3358b3eb8794fcff543415248505e.png" width="1197" height="706" class="img_ev3q"></p><p>The DMP platform is divided into four modules: the processing &amp; storage layer, the scheduling layer, the service layer, and metadata management.
All tag meta-information is stored in the metadata tables. The scheduling layer schedules tasks for the entire business process: data is processed and aggregated into basic tags, and the basic tags and source tables are turned into data that can be queried through SQL. The scheduling layer dispatches tasks to Spark for processing and then stores the results in both Hive and Doris. The service layer consists of four parts: tag service, entity grouping service, basic tag data service and portrait analysis service.</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/4__tag_lifecycle-ec086d95f04379a7f9a10993c0089e63.png" width="1124" height="648" class="img_ev3q"></p><p>The lifecycle of a tag consists of 5 phases:</p><ul><li>Tag requirements: the operations team raises demands, and the product managers evaluate their rationality and urgency.</li><li>Scheduled production: developers first sort out the data link from ODS to DWD to the DM layer, then build models on the data while monitoring the production process.</li><li>Audience targeting: after the tags are produced, group the audience by those tags.</li><li>Precision marketing: carry out precision marketing strategies on the grouped audiences.</li><li>Effect evaluation: in the end, tag usage rates and effects are evaluated for future optimization.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="production-of-tags">Production of Tags<a href="#production-of-tags" class="hash-link" aria-label="Production of Tags的直接链接" title="Production of Tags的直接链接"></a></h2><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/5__production_of_tags-a53f5f1d2e03dc74f8d0e69092e4bd02.png" width="1145" height="675" class="img_ev3q"></p><p>Tag data layering:</p><ul><li>The bottom layer is the ODS layer, including user login logs, event tracking records, transaction data and the Binlog data of various databases</li><li>Data processed from the ODS layer, such as the user login table, user activity table and order information table, lands in the DWD detail layer</li><li>DWD layer data is aggregated into the DM layer, and all tags are implemented based on DM layer data.</li></ul>
<p>At present, we have fully automated data output from the original databases to the ODS layer, achieved partial automation from the ODS layer to the DWD layer, and automated only a small number of operations from DWD to DM, which will be our focus in the future.</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/6__type_of__tags-91b30c2315a91d57aa96017a4ec716eb.png" width="1154" height="677" class="img_ev3q"></p><p>Tags are divided by timeliness into offline tags, quasi-real-time tags and real-time tags; by data scale into aggregation tags and detail tags; and in other dimensions into account attribute tags, consumption behavior tags, activity tags, user preference tags, asset information tags, etc. </p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/7__tags_settings-8a8c1c99a4afbc7f78ceb4659da2c184.png" width="1163" height="672" class="img_ev3q"></p><p>It is inconvenient to use DM layer data directly, because the basic data is relatively primitive and lacks abstraction. By combining basic tags with AND, OR and NOT, business tags are formed for further use, which lowers the cost of understanding for operations staff and makes the tags easier to use.</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/8__target_audience-cfe11c32b47db0639303f640a3452d98.png" width="1161" height="696" class="img_ev3q"></p><p>After the tags are merged, they need to be applied to specific business scenarios, such as grouping. The configuration is shown on the left side of the figure above, which supports offline crowd packages and real-time behaviors (configured separately). After configuration, the DSL rules shown on the right side of the figure are generated; they are expressed in JSON format, which is friendly to the front end, and can also be converted into query statements for the database engine.</p>
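<p>As an illustration only (the table and tag names below are hypothetical, not Lifease's real schema), a DSL rule that combines basic tags with AND, OR and NOT could compile into a Doris query like this:</p><pre><code class="language-sql">-- DSL (conceptually): gender = female
--   AND (order_cnt_90d &gt;= 3 OR gmv_90d &gt;= 1000) AND NOT blacklisted
-- Generated grouping query against a DM-layer tag table:
SELECT user_id
FROM dm_user_basic_tags
WHERE gender = 'female'
  AND (order_cnt_90d &gt;= 3 OR gmv_90d &gt;= 1000)
  AND NOT is_blacklisted;
</code></pre><p>The result set can then be stored as an offline crowd package or checked per user for real-time grouping.</p>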
<p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/9__target_audience-mapping-1b00b571d178577b4f0c4f2c8a5b1acf.png" width="1120" height="649" class="img_ev3q"></p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/10__automation-fe72dc6c87f37fdd94f217a9174706bd.png" width="1114" height="649" class="img_ev3q"></p><p>Tag production is partially automated, while grouping is automated to a higher degree: group refreshes run on a daily schedule; advanced processing such as intersection, union and difference between groups is supported; and data cleaning removes expired and invalid data in time.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="tags-storage">Tags Storage<a href="#tags-storage" class="hash-link" aria-label="Tags Storage的直接链接" title="Tags Storage的直接链接"></a></h2><p>Lifease's DMP tagging system needs to carry relatively large client-side traffic and has relatively high real-time requirements. Our storage requirements include:</p><ul><li>High-performance queries to handle large-scale client-side traffic</li><li>SQL support to facilitate data analysis scenarios</li><li>A data update mechanism</li><li>Capacity for large amounts of data</li><li>Extension functions to handle custom data structures</li><li>Close integration with the big data ecosystem</li></ul><p>In the big data field, different engines suit different scenarios. Using the popular engines in the chart below, we have optimized our storage architecture twice.</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/11__comparision-dd0d69a571e362dcca7711561a30db7c.png" width="1133" height="660" class="img_ev3q"></p><p>Our architecture V1.0 is shown below:</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/12__architecture_v1_0-59dffe2265ac0754860a4bc796c090fa.png" width="1175" height="695" class="img_ev3q"></p><p>Most offline data is stored in Hive, and a small part in HBase (mainly for querying basic tags). Part of the real-time data is stored in HBase for basic tag queries, and the rest is double-written into Kudu and Elasticsearch for real-time grouping and data queries. Offline data is processed by Impala and cached in Redis.
The disadvantages:</p><ul><li>Too many storage engines.</li><li>Double writing carries hidden data quality risks: one write may succeed while the other fails, resulting in data inconsistency.</li><li>The project is complex and hard to maintain.</li></ul>
<p>To reduce the number of engines and storage systems in use, we improved the architecture and implemented version 2.0:</p><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/13__architecture_v2_0-1f1c2b508793cf146a606b3a453e01a5.png" width="1148" height="677" class="img_ev3q"></p><p>In storage architecture V2.0, Apache Doris is adopted. Offline data is still mainly stored in Hive, while basic tags and real-time data are imported into Doris. Federated queries over Hive and Doris are performed through Spark, with results stored in Redis. After this improvement, we have a storage layer that manages both offline and real-time data. We currently use Apache Doris 1.0, which keeps query latency within 20 ms at the 99th percentile and within 50 ms at the 99.9th percentile. The architecture is now much simpler, which greatly reduces operation and maintenance costs.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="advantages-of-apache-doris-in-practice">Advantages of Apache Doris in Practice<a href="#advantages-of-apache-doris-in-practice" class="hash-link" aria-label="Advantages of Apache Doris in Practice的直接链接" title="Advantages of Apache Doris in Practice的直接链接"></a></h2><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/14__advantages_in_practice-3fc1c9893383a6635c8c9612e3ef0a15.png" width="1128" height="658" class="img_ev3q"></p><p>Lifease has adopted Apache Doris for point queries, batch queries, path analysis and grouping. The advantages are as follows:</p><ul><li>Key queries and federated queries over a small number of tables exceed 10,000 QPS, with RT99 &lt; 50 ms.</li><li>Horizontal scalability is strong and maintenance costs are low.</li><li>Offline and real-time data are unified, reducing the complexity of the tag model.</li></ul><p>The downside is that importing a large number of small batches takes up more resources, but this has been optimized in Doris 1.1: data compaction is greatly enhanced and new data is aggregated quickly, avoiding the -235 error caused by too many data versions in a tablet as well as the resulting drop in query efficiency.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="future-plan">Future Plan<a href="#future-plan" class="hash-link" aria-label="Future Plan的直接链接" title="Future Plan的直接链接"></a></h2><p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/15__future_plan-199b125bad243e0dcd93f00b9f4395fe.png" width="1117" height="652" class="img_ev3q"></p><p>Hive and Spark workloads are gradually migrating to Apache Doris.
We will also optimize the tagging system:</p><ul><li>Establish a rich and accurate tag evaluation system</li><li>Improve tag quality and output speed</li><li>Improve tag coverage</li></ul>
<p>And we will pursue more precise operations:</p><ul><li>Build rich user analysis models</li><li>Improve the user insight model evaluation system based on frequency of use and user value</li><li>Establish general portrait analysis capabilities to assist intelligent decision-making in operations</li></ul>]]></content>
<author>
<name>Xiaodong Liu</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[The application of Apache Doris in NIO]]></title>
<id>https://doris.apache.org/zh-CN/blog/NIO</id>
<link href="https://doris.apache.org/zh-CN/blog/NIO"/>
<updated>2022-11-28T00:00:00.000Z</updated>
<summary type="html"><![CDATA[NIO Inc. (NYSE: NIO)is a leading company in the premium smart electric vehicle market. Founded in November 2014, NIO designs, develops, jointly manufactures and sells premium smart electric vehicles, driving innovations in autonomous driving, digital technologies, electric powertrains and batteries. Recently, NIO planned to enter the U.S. market alongside other western markets by the end of 2025. The company has already established a U.S. headquarters in San Jose, California, where they started hiring people..]]></summary>
<content type="html"><![CDATA[<h1>The Application of Apache Doris in NIO</h1><p><img loading="lazy" alt="NIO" src="https://cdnd.selectdb.com/zh-CN/assets/images/NIO_kv-7601d71a49c7ecd7fb42f03de600ae6c.png" width="900" height="383" class="img_ev3q"></p><blockquote><p>Guide: The topic of this sharing is the application of Apache Doris in NIO, which mainly includes the following topics:</p><ol><li>Introduction about NIO</li><li>The Development of OLAP in NIO</li><li>Apache Doris-the Unified OLAP Data warehouse</li><li>Best Practice of Apache Doris on CDP Architecture</li><li>Summery and Benefits</li></ol></blockquote><p>Author:Huaidong Tang, Data Team Leader, NIO INC</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="about-nio">About NIO<a href="#about-nio" class="hash-link" aria-label="About NIO的直接链接" title="About NIO的直接链接"></a></h2><p>NIO Inc. (NYSE: NIO)is a leading company in the premium smart electric vehicle market. Founded in November 2014, NIO designs, develops, jointly manufactures and sells premium smart electric vehicles, driving innovations in autonomous driving, digital technologies, electric powertrains and batteries.</p><p>Recently, NIO planned to enter the U.S. market alongside other western markets by the end of 2025. The company has already established a U.S. headquarters in San Jose, California, where they started hiring people.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="the-architecture-evolution-of-olap-in-nio">The Architecture Evolution of OLAP in NIO<a href="#the-architecture-evolution-of-olap-in-nio" class="hash-link" aria-label="The Architecture Evolution of OLAP in NIO的直接链接" title="The Architecture Evolution of OLAP in NIO的直接链接"></a></h2><p>The architectural evolution of OLAP in NIO took several steps for years.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-introduced-apache-druid">1. Introduced Apache Druid<a href="#1-introduced-apache-druid" class="hash-link" aria-label="1. Introduced Apache Druid的直接链接" title="1. Introduced Apache Druid的直接链接"></a></h3><p>At that time, there were not so many OLAP storage and query engines to choose from. The more common ones were Apache Druid and Apache Kylin. There are 2 reasons why we didn't choose Kylin.</p><ul><li><p>The most suitable and optimal storage at the bottom of Kylin is HBase and adding it would increase the cost of operation and maintenance.</p></li><li><p>Kylin's precalculation involves various dimensions and indicators. Too many dimensions and indicators would cause great pressure on storage.</p></li></ul><p>We prefer Druid because we used to be users and are familiar with it. Apache Druid has obvious advantages. It supports real-time and offline data import, columnar storage, high concurrency, and high query efficiency. But it has downsides as well:</p><ul><li><p>Standard protocols such as JDBC are not used</p></li><li><p>The capability of JOIN is weak</p></li><li><p>Significant performance downhill when performing dedeplication</p></li><li><p>High in operation and maintenance costs, different components have separate installation methods and different dependencies; Data import needs extra integration with Hadoop and the dependencies of JAR packages</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-introduced-tidb">2. Introduced TiDB<a href="#2-introduced-tidb" class="hash-link" aria-label="2. Introduced TiDB的直接链接" title="2. 
Introduced TiDB的直接链接"></a></h3><p><strong>TiDB is a mature datawarehouse focused on OLTP+OLAP, which also has distinctive advantages and disadvantages:</strong></p><p>Advantage:</p><ul><li><p>OLTP database, can be updated friendly</p></li><li><p>Supports detailed and aggregated query, which can handle dashboard statistical reports or query of detailed data at the same time</p></li><li><p>Supports standard SQL, which has low cost of use</p></li><li><p>Low operation and maintenance cost</p></li></ul><p>Disadvantages:</p><ul><li><p>It is not an independent OLAP. TiFlash relies on OLTP and will increase storage. Its OLAP ability is insufficient</p></li><li><p>The overall performance should be measured separately by each scene</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-introduced-apache-doris">3. Introduced Apache Doris<a href="#3-introduced-apache-doris" class="hash-link" aria-label="3. Introduced Apache Doris的直接链接" title="3. Introduced Apache Doris的直接链接"></a></h3><p>Since 2021, we have officially introduced Apache Doris. In the process of selection, we are most concerned about various factors such as product performance, SQL protocol, system compatibility, learning and operation and maintenance costs. After deep research and detailed comparison of the following systems, we came to the following conclusions:</p><p><strong>Apache Doris, whose advantages fully meet our demands:</strong></p><ul><li><p>Supports high concurrent query (what we concerned most)</p></li><li><p>Supports both real-time and offline data</p></li><li><p>Supports detailed and aggregated query</p></li><li><p>UNIQ model can be updated</p></li><li><p>The ability of Materialized View can greatly speed up query efficiency</p></li><li><p>Fully compatible with the MySQL protocol and the cost of development is relatively low</p></li><li><p>The performance fully meets our requirements</p></li><li><p>Lower operation and maintenance costs</p></li></ul><p><strong>Moreover, there is another competitor, Clickhouse. Its stand-alone performance is extremely strong, but its disadvantages are hard to accept:</strong></p><ul><li><p>In some cases, its multi-table JOIN is weak</p></li><li><p>Relatively low in concurrency</p></li><li><p>High operation and maintenance costs</p></li></ul><p>With multiple good performances, Apache Doris outstands Druid and TiDB. Meanwhile Clickhouse did not fit well in our business, which lead us to Apache Doris.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="apache-doris-the-unified-olap-datawarehouse">Apache Doris-the Unified OLAP Datawarehouse<a href="#apache-doris-the-unified-olap-datawarehouse" class="hash-link" aria-label="Apache Doris-the Unified OLAP Datawarehouse的直接链接" title="Apache Doris-the Unified OLAP Datawarehouse的直接链接"></a></h2><p><img loading="lazy" alt="NIO" src="https://cdnd.selectdb.com/zh-CN/assets/images/olap-96ad3bb86cebd92a200a0581f0418d3c.png" width="1018" height="669" class="img_ev3q"></p><p>This diagram basically describes our OLAP Architecuture, including data source, data import, data processing, data warehouse, data service and application.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-data-source">1. Data Source<a href="#1-data-source" class="hash-link" aria-label="1. Data Source的直接链接" title="1. Data Source的直接链接"></a></h3><p>In NIO, the data source not only refers to database, but also event tracking data, device data, vehicle data, etc. The data will be ingested into the big data platform. 
</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-data-import">2. Data Import<a href="#2-data-import" class="hash-link" aria-label="2. Data Import的直接链接" title="2. Data Import的直接链接"></a></h3><p>For business data, you can trigger CDC and convert it into a data stream, store it in Kafka, and then perform stream processing. Some data that can only be passed in batches will directly enter our distributed storage.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="3-data-processing">3. Data Processing<a href="#3-data-processing" class="hash-link" aria-label="3. Data Processing的直接链接" title="3. Data Processing的直接链接"></a></h3><p>We took the Lambda architecture rather than stream-batch integration.</p><p>Our own business determines that our Lambda architecture should be divided into two paths: offline and real-time:</p><ul><li><p>Some data is streamed.</p></li><li><p>Some data can be stored in the data stream, and some historical data will not be stored in Kafka.</p></li><li><p>Some data requires high precision in some circumstances. In order to ensure the accuracy of the data, an offline pipeline will recalculate and refresh the entire data.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="4-data-warehouse">4. Data Warehouse<a href="#4-data-warehouse" class="hash-link" aria-label="4. Data Warehouse的直接链接" title="4. Data Warehouse的直接链接"></a></h3><p>From data processing to the data warehouse, we did not adopt Flink or Spark Doris Connector. We use Routine Load to connect Apache Doris and Flink, and Broker Load to connect Doris and Spark. The data generated in batches by Spark will be backed up to Hive for further use in other scenarios. In this way, each calculation is used for multiple scenarios at the same time, which greatly improves the efficiency. It also works for Flink.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="5-data-service">5. Data Service<a href="#5-data-service" class="hash-link" aria-label="5. Data Service的直接链接" title="5. Data Service的直接链接"></a></h3><p>What behind Doris is One Service. By registering the data source or flexible configuration, the API with flow and authority control is automatically generated, which greatly improves flexibility. And with the k8s serverless solution, the entire service is much more flexible.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="6-application">6. Application<a href="#6-application" class="hash-link" aria-label="6. Application的直接链接" title="6. Application的直接链接"></a></h3><p>In the application layer, we mainly deploy some reporting applications and other services.</p><p>We mainly have two types of scenarios:</p><ul><li><p><strong>User-oriented</strong> , which is similar to the Internet, contains a data dashboard and data indicators.</p></li><li><p><strong>Car-oriented</strong> , car data enters Doris in this way. After certain aggregation, the volume of Doris data is about billions. But the overall performance can still meet our requirements.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="best-practice-of-apache-doris-on-cdp-architecture">Best Practice of Apache Doris on CDP Architecture<a href="#best-practice-of-apache-doris-on-cdp-architecture" class="hash-link" aria-label="Best Practice of Apache Doris on CDP Architecture的直接链接" title="Best Practice of Apache Doris on CDP Architecture的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1-cdp-architecture">1. CDP Architecture<a href="#1-cdp-architecture" class="hash-link" aria-label="1. CDP Architecture的直接链接" title="1. 
CDP Architecture的直接链接"></a></h3><p><img loading="lazy" alt="NIO" src="https://cdnd.selectdb.com/zh-CN/assets/images/cdp-3d65926e741a2837759b07514e914bbf.png" width="1471" height="422" class="img_ev3q"></p><p>Next, let me introduce Doris' practice on the operating platform. This is what happens in our real business. Nowadays, Internet companies will make their own CDP, which includes several modules:</p><ul><li><p><strong>Tags</strong> , which is the most basic part.</p></li><li><p><strong>Target</strong> , based on tags, select people according to some certain logic.</p></li><li><p><strong>Insight</strong> , aiming at a group of people, clarify the distribution and characteristics of the group.</p></li><li><p><strong>Touch</strong> , use methods such as text messages, phone calls, voices, APP notifications, IM, etc. to reach users, and cooperate with flow control.</p></li><li><p><strong>Effect analysis,</strong> to improve the integrity of the operation platform, with action, effect and feedback.</p></li></ul><p>Doris plays the most important role here, including: tags storage, groups storage, and effect analysis.</p><p>Tags are divided into basic tags and basic data of user behavior. We can flexibly customize other tags based on those facts. From the perspective of time effectiveness, tags are also divided into real-time tags and offline tags.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="2-considerations-for-cdp-storage-selection">2. Considerations for CDP Storage Selection<a href="#2-considerations-for-cdp-storage-selection" class="hash-link" aria-label="2. Considerations for CDP Storage Selection的直接链接" title="2. Considerations for CDP Storage Selection的直接链接"></a></h3><p>We took five dimensions into account when we select CDP storage.</p><p><strong>(1) Unification of Offline and Real-time</strong></p><p>As mentioned earlier, there are offline tags and real-time tags. Currently we are close to quasi-real-time. For some data, quasi-real-time is good enough to meet our needs. A large number of tags are still offline tags. The methods used are Doris's Routine Load and Broker Load.</p><table><thead><tr><th><strong>Scenes</strong></th><th><strong>Requirements</strong></th><th><strong>Apache Doris's Function</strong></th></tr></thead><tbody><tr><td>Real-time tags</td><td>Real-time data updates</td><td>Routine Load</td></tr><tr><td>Offline tags</td><td>Highly efficient batch import</td><td>Broker Load</td></tr><tr><td>Unification of offline and real-time</td><td>Unification of offline and real-time data storage</td><td>Routine Load and Broker Load update different columns of the same table</td></tr></tbody></table><p>In addition, on the same table, the update frequency of different columns is also different. For example, we need to update the user's identity in real time because the user's identity changes all the time. T+1's update does not meet our needs. Some tags are offline, such as the user's gender, age and other basic tags, T+1 update is sufficient to meet our standards. The maintenance cost caused by putting the tags of basic users on the same table is very low. When customizing tags later, the number of tables will be greatly reduced, which benefits the overall performance.</p><p><strong>(2) Efficient Targets</strong></p><p>When users tags are done, is time to target right group of people. The target is to filter out all the people who meet the conditions according to different combinations of tags. 
<p><strong>(2) Efficient Targeting</strong></p><p>Once user tags are ready, it is time to target the right group of people, filtering out everyone who meets the conditions under different combinations of tags. Such queries combine tag conditions in many ways, and there was an obvious improvement when Apache Doris was upgraded to the vectorized engine.</p><table><thead><tr><th><strong>Scenario</strong></th><th><strong>Requirements</strong></th><th><strong>Apache Doris's Function</strong></th></tr></thead><tbody><tr><td>Complex condition targeting</td><td>Highly efficient combination of tags</td><td>SIMD optimization</td></tr></tbody></table><p><strong>(3) Efficient Aggregation</strong></p><p>The user insight and effect analysis statistics mentioned above require statistical analysis over the data, which is more than fetching tags by user ID. The amount of data read and the query efficiency strongly affect tag distribution, group distribution and effect analysis statistics. Apache Doris helps a lot here:</p><ul><li><p>Data partitioning: we partition the data by time, so analysis and statistics read far less data, which greatly speeds up queries.</p></li><li><p>Node-level aggregation: each node aggregates first, and the results are then collected for unified aggregation.</p></li><li><p>Vectorization: the vectorized execution engine brings a significant performance improvement.</p></li></ul><table><thead><tr><th><strong>Scenario</strong></th><th><strong>Requirements</strong></th><th><strong>Apache Doris's Function</strong></th></tr></thead><tbody><tr><td>Distribution of tag values</td><td>The distribution values of all tags need to be updated every day, requiring fast and efficient statistics</td><td>Data partitioning reduces data transfer and calculation</td></tr><tr><td>Distribution of groups</td><td>Same as above</td><td>Unified storage and computation; each node aggregates first</td></tr><tr><td>Statistics for effect analysis</td><td>Same as above</td><td>SIMD speed-up</td></tr></tbody></table><p><strong>(4) Multi-table Association</strong></p><p>Our CDP may differ from common CDPs in the industry: in many of them, tags are precomputed and no custom tags are offered, whereas ours leaves users the flexibility to customize tags themselves. Since the underlying data is scattered across different tables, creating a custom tag requires joining them.</p><p>A very important reason we chose Doris is its multi-table JOIN capability. Performance tests showed that Apache Doris meets our requirements, and because tags are dynamic, this gives users powerful capabilities.</p><table><thead><tr><th><strong>Scenario</strong></th><th><strong>Requirements</strong></th><th><strong>Apache Doris's Function</strong></th></tr></thead><tbody><tr><td>Distribution characteristics of a population</td><td>The distribution of statistical groups under a certain characteristic</td><td>Table association</td></tr><tr><td>Single tag</td><td>Display tags</td><td></td></tr></tbody></table><p><strong>(5) Query Federation</strong></p><p>Whether a user is successfully reached is recorded in TiDB. An operations notification may only affect user experience, but when a transaction is involved, such as gift cards or coupons, task execution must happen without repetition; TiDB suits this OLTP scenario better.</p><p>For effect analysis, however, it is necessary to understand to what extent the operation plan was executed, whether the goal was achieved and how the results are distributed. This requires combining task execution with group selection for analysis, which calls for federated queries between Doris and TiDB.</p>
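<p>Since TiDB speaks the MySQL protocol, a hedged sketch of this federation (all names, addresses and credentials below are placeholders) could use Doris's MySQL external table:</p><pre><code class="language-sql">-- Sketch: expose a TiDB table to Doris via the MySQL external table engine.
CREATE EXTERNAL TABLE task_execution_tidb (
    task_id    BIGINT,
    user_id    BIGINT,
    reached_at DATETIME,
    status     VARCHAR(16)
)
ENGINE = mysql
PROPERTIES (
    "host" = "tidb-host",
    "port" = "4000",
    "user" = "reader",
    "password" = "******",
    "database" = "ops",
    "table" = "task_execution"
);

-- Federated effect analysis: join execution records with a Doris group table.
SELECT g.group_id, COUNT(DISTINCT t.user_id) AS reached_users
FROM audience_group g JOIN task_execution_tidb t ON g.user_id = t.user_id
GROUP BY g.group_id;
</code></pre>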
<p>We expected the tag data to be small enough to keep in Elasticsearch, but that assumption later proved wrong.</p><table><thead><tr><th><strong>Scenario</strong></th><th><strong>Requirements</strong></th><th><strong>Apache Doris's Function</strong></th></tr></thead><tbody><tr><td>Effect analysis associated with execution details</td><td>Doris queries joined with TiDB</td><td>Query association with other databases</td></tr><tr><td>Group tags associated with behavior aggregation</td><td>Doris queries joined with Elasticsearch</td><td></td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="summery-and-benefits">Summary and Benefits<a href="#summery-and-benefits" class="hash-link" aria-label="Summary and Benefits的直接链接" title="Summary and Benefits的直接链接"></a></h2><ol><li><p><strong>Bitmap</strong>. Our data volume is not big enough to test its full efficiency, but once volume reaches a certain level, bitmaps may bring a good performance improvement. For example, when calculating UV, bitmap aggregation is worth considering if the full ID set exceeds 50 million.</p></li><li><p><strong>Performance is good</strong> when an Elasticsearch single-table query is joined with Doris.</p></li><li><p><strong>Update columns in batches where possible</strong>. To reduce the number of tables and improve JOIN performance, tables should be as streamlined and as aggregated as possible. However, fields of the same type may have different update frequencies: some need daily updates, others hourly. Updating a single column on its own is therefore an important requirement, and Apache Doris's solution is REPLACE_IF_NOT_NULL. Note that it cannot replace an existing non-null value with null; instead, you can replace nulls with meaningful default values, such as "unknown".</p></li><li><p><strong>Online services</strong>. Apache Doris serves online and offline scenarios at the same time, which requires strong resource isolation.</p></li></ol>]]></content>
<author>
<name>Huaidong Tang</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[How does Apache Doris help AISPEECH build a data warehouse in AI chatbots scenario]]></title>
<id>https://doris.apache.org/zh-CN/blog/Use-Apache-Doris-with-AI-chatbots</id>
<link href="https://doris.apache.org/zh-CN/blog/Use-Apache-Doris-with-AI-chatbots"/>
<updated>2022-11-24T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Guide: In 2019, AISPEACH built a real-time and offline datawarehouse based on Apache Doris. Reling on its flexible query model, extremely low maintenance costs, high development efficiency, and excellent query performance, Apache Doris has been used in many business scenarios such as real-time business operations, AI chatbots analysis. It meets various data analysis needs such as device portrait/user label, real-time operation, data dashboard, self-service BI and financial reconciliation. And now I will share our experience through this article.]]></summary>
<content type="html"><![CDATA[<h1>How Does Apache Doris Help AISPEECH Build a Data warehouse in AI Chatbots Scenario</h1><p><img loading="lazy" alt="kv" src="https://cdnd.selectdb.com/zh-CN/assets/images/kv-7d5af44f82188444fd1c6ac613c1d7eb.png" width="900" height="383" class="img_ev3q"></p><blockquote><p>Guide: In 2019, AISPEACH built a real-time and offline datawarehouse based on Apache Doris. Reling on its flexible query model, extremely low maintenance costs, high development efficiency, and excellent query performance, Apache Doris has been used in many business scenarios such as real-time business operations, AI chatbots analysis. It meets various data analysis needs such as device portrait/user label, real-time operation, data dashboard, self-service BI and financial reconciliation. And now I will share our experience through this article.</p></blockquote><p>Author|Zhao Wei, Head Developer of AISPEACH's Big Data Departpment</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="backgounds">Backgounds<a href="#backgounds" class="hash-link" aria-label="Backgounds的直接链接" title="Backgounds的直接链接"></a></h2><p>AISPEACH is a professional conversational artificial intelligence company in China. It has full-link intelligent voice and language technology. It is committed to becoming a platform-based enterprise for full-link intelligent voice and language interaction. Recently it has developed a new generation of human-computer interaction platform DUI and artificial intelligence chip TH1520, providing natural language interaction solutions for partners in many industry scenarios such as Internet of Vehicles, IoT, government affairs and fintech.</p><p>Aspire introduced Apache Doris for the first time in 2019 and built a real-time and offline data warehouse based on Apache Doris. Compared with the previous architecture, Apache Doris has many advantages such as flexible query model, extremely low maintenance cost, high development efficiency and excellent query performance. Multiple business scenarios have been applied to meet various data analysis needs such as device portraits/user tags, real-time operation of business scenarios, data analysis dashboards, self-service BI, and financial reconciliation.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="architecture-evolution">Architecture Evolution<a href="#architecture-evolution" class="hash-link" aria-label="Architecture Evolution的直接链接" title="Architecture Evolution的直接链接"></a></h2><p>Offline data analysis in the early business was our main requirement. Recently, with the continuous development of business, the requirements for real-time data analysis in business scenarios have become higher and higher. The early datawarehouse architecture failed to meet our requirements. 
To meet the higher requirements of business scenarios for query performance, response time and concurrency, Apache Doris was officially introduced in 2019 to build an integrated real-time and offline data warehouse architecture.</p><p>In the following, I will introduce the evolution of AISPEECH's data warehouse architecture and share the reasons why we chose Apache Doris for the new architecture.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="early-data-warehouse-architecture">Early Data Warehouse Architecture<a href="#early-data-warehouse-architecture" class="hash-link" aria-label="Early Data Warehouse Architecture的直接链接" title="Early Data Warehouse Architecture的直接链接"></a></h3><p>As shown in the architecture diagram below, the offline data warehouse is based on Hive + Kylin, while the real-time data warehouse is based on Spark + MySQL.</p><p><img loading="lazy" alt="data_wharehouse_architecture_v1_0_git" src="https://cdnd.selectdb.com/zh-CN/assets/images/data_wharehouse_architecture_v1_0_git-006b22817872b04ad8f909e54e8c1411.png" width="1953" height="1106" class="img_ev3q"></p><p>There are three main types of data sources in our business: business databases such as MySQL, application system logs such as K8s container service logs, and automotive T-Box logs. Data sources are first written to Kafka via methods such as the MQTT/HTTP protocols, business database Binlog and Filebeat log collection. After Kafka, the data splits into real-time and offline links. The real-time link is shorter: data buffered in Kafka is processed by Spark and put into MySQL for further analysis, which basically met the early analysis requirements. On the offline link, after data cleaning and processing by Spark, an offline data warehouse is built in Hive and Apache Kylin builds Cubes on top. Before building a Cube, the data model must be designed in advance, including association tables, dimension tables, index fields and aggregation functions. Cubes are built through the scheduling system and finally stored in HBase.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="pain-points-of-early-architecture">Pain Points of Early Architecture:<a href="#pain-points-of-early-architecture" class="hash-link" aria-label="Pain Points of Early Architecture:的直接链接" title="Pain Points of Early Architecture:的直接链接"></a></h4><ol><li><p><strong>Many dependent components.</strong> Kylin 2.x and 3.x strongly depend on Hadoop and HBase. The large number of components leads to low development efficiency, hidden stability risks in the architecture and high maintenance costs.</p></li><li><p><strong>Kylin's build process is complicated and build tasks often fail.</strong> Each build involves widening tables, deduplicating columns, generating dictionaries, building Cubes and so on. With 1,000-2,000 or more tasks per day, at least 10 of them would fail, forcing us to spend a lot of time writing automated operation and maintenance scripts.</p></li><li><p><strong>Dimension/dictionary expansion is severe.</strong> Dimension expansion arises when a business scenario needs many analysis conditions and fields: if many fields enter the analysis model without pruning, the Cube's dimensions blow up and build times grow.
Dictionary inflation means that global exact deduplication in some scenarios takes a long time and makes the dictionaries grow larger and larger, so build times keep increasing and data analysis performance keeps declining.</p></li><li><p><strong>The data analysis model is fixed and inflexible.</strong> In practice, changing a calculated field or business scenario means backtracking some or even all of the data.</p></li><li><p><strong>Detail queries are not supported.</strong> The early data warehouse architecture could not serve detailed data queries. Kylin's official solution is to hand detail queries to Presto, which introduces yet another system and increases development costs.</p></li></ol><h3 class="anchor anchorWithStickyNavbar_LWe7" id="architecture-selection">Architecture Selection<a href="#architecture-selection" class="hash-link" aria-label="Architecture Selection的直接链接" title="Architecture Selection的直接链接"></a></h3><p>To solve the problems above, we began exploring other data warehouse solutions and researched the most widely used OLAP engines on the market, including Apache Doris and ClickHouse.</p><p>As the original creator, SelectDB provides commercial support and services for Apache Doris, and now offers global users a fully managed database option for deployment.</p><p>Compared with ClickHouse's heavy maintenance burden, multitude of table types and weak support for associated queries, Apache Doris performed better, and combined with our OLAP analysis scenarios, we finally decided to introduce Apache Doris.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-data-warehouse-architecture">New Data Warehouse Architecture<a href="#new-data-warehouse-architecture" class="hash-link" aria-label="New Data Warehouse Architecture的直接链接" title="New Data Warehouse Architecture的直接链接"></a></h3><p><img loading="lazy" alt="data_wharehouse_architecture_v2_0_git" src="https://cdnd.selectdb.com/zh-CN/assets/images/data_wharehouse_architecture_v2_0_git-825df043f0abf0fda4a92b8dc5d10956.png" width="1993" height="1144" class="img_ev3q"></p><p>As shown in the figure above, we built a new real-time + offline data warehouse architecture based on Apache Doris. Unlike in the previous architecture, real-time and offline data are processed separately and then written to Apache Doris for analysis.</p><p>For historical reasons, data migration is difficult, so the offline part remains basically consistent with the previous data warehouse architecture, although it would be entirely possible to build the offline data warehouse directly on Apache Doris.</p><p>As before, offline data is cleaned and processed by Spark and organized into a data warehouse in Hive; the data in Hive is then written to Apache Doris through Broker Load. It is worth noting that Broker Load is very fast: importing the daily 100-200 GB of data into Apache Doris takes only 10-20 minutes.</p><p>For the real-time data flow, the new architecture uses the Doris-Spark-Connector to consume data from Kafka and write it to Apache Doris after simple processing.</p>
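<p>For reference, a daily Hive-to-Doris import of this kind can be written as a Broker Load job. The sketch below is illustrative only, with placeholder paths, broker and table names rather than our production configuration:</p><pre><code class="language-sql">-- Sketch: bulk-load one day of Parquet files from HDFS into Doris.
LOAD LABEL example_db.dws_device_metrics_20221124
(
    DATA INFILE("hdfs://nameservice1/warehouse/dws.db/dws_device_metrics/dt=2022-11-24/*")
    INTO TABLE dws_device_metrics
    FORMAT AS "parquet"
    (device_id, dt, metric_name, metric_value)
)
WITH BROKER "hdfs_broker"
(
    "username" = "hdfs",
    "password" = ""
)
PROPERTIES ("timeout" = "3600");
</code></pre>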
<p>As shown in the architecture diagram, real-time and offline data are analyzed and processed in Apache Doris, which meets the business requirements of data applications for both real-time and offline analysis.</p><h4 class="anchor anchorWithStickyNavbar_LWe7" id="benefits-of-the-new-architecture">Benefits of the New Architecture:<a href="#benefits-of-the-new-architecture" class="hash-link" aria-label="Benefits of the New Architecture:的直接链接" title="Benefits of the New Architecture:的直接链接"></a></h4><ol><li><p><strong>Simple operations, low maintenance costs, and no dependence on Hadoop ecosystem components.</strong> Apache Doris is simple to deploy, with only two process types, FE and BE, both of which can be scaled out; a single cluster supports hundreds of machines and tens of PB of storage. The two processes use consistency protocols to ensure high service availability and high data reliability. This highly integrated design greatly reduces the operation and maintenance cost of a distributed system: in three years of using Doris, we have spent very little time on operations, far less than with the previous Kylin-based architecture.</p></li><li><p><strong>Development and troubleshooting are much easier.</strong> The unified real-time and offline data warehouse based on Doris supports real-time data services, interactive data analysis and offline data processing, which greatly reduces the difficulty of troubleshooting.</p></li><li><p><strong>Apache Doris supports runtime JOIN queries.</strong> Similar to MySQL table joins, this is friendly to scenarios where the data analysis model changes frequently and solves the inflexibility of the early fixed data model.</p></li><li><p><strong>Apache Doris supports JOIN, aggregation and detail queries at the same time,</strong> solving the problem that data details could not be queried in the previous architecture.</p></li><li><p><strong>Apache Doris supports multiple query acceleration methods,</strong> such as rollup indexes and materialized views, and secondary indexes can be implemented through rollup indexes to speed up queries, which greatly improves response time.</p></li><li><p><strong>Apache Doris supports multiple types of query federation,</strong> covering data lakes such as Hive, Iceberg and Hudi as well as databases such as MySQL and Elasticsearch.</p></li></ol><h2 class="anchor anchorWithStickyNavbar_LWe7" id="applications">Applications<a href="#applications" class="hash-link" aria-label="Applications的直接链接" title="Applications的直接链接"></a></h2><p>Apache Doris was first applied in AISPEECH's real-time business and AI chatbot analysis scenarios.
This section introduces the requirements and applications of these two scenarios.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-business">Real-time Business<a href="#real-time-business" class="hash-link" aria-label="Real-time Business的直接链接" title="Real-time Business的直接链接"></a></h3><p><img loading="lazy" alt="real-time_operation_git" src="https://cdnd.selectdb.com/zh-CN/assets/images/real-time_operation_git-87d6e8ede096ba1551cb290941741126.png" width="1977" height="1226" class="img_ev3q"></p><p>As shown in the figure above, the technical architecture of the real-time operations business is basically the same as the new data warehouse architecture described above:</p><ul><li><p>Data sources: consistent with the new architecture diagram, including business data in MySQL, application event tracking data, and device and terminal logs.</p></li><li><p>Data import: Broker Load for offline data, Doris-Spark-Connector for real-time data.</p></li><li><p>Data storage and development: almost the entire real-time data warehouse is built on Apache Doris, while some offline data is handled by Airflow DAG batch tasks.</p></li><li><p>Data applications: the top layer holds business analysis requirements, including large-screen displays, real-time dashboards for data operations, user portraits, BI tools, etc.</p></li></ul><p><strong>In the real-time operations business, there are two main requirements for data analysis:</strong></p><ul><li><p>The volume of real-time imported data is large, so query efficiency requirements are high.</p></li><li><p>A team of 20+ people runs this scenario and opens the data operations dashboards at the same time, so real-time write performance and query concurrency requirements are relatively high.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="ai-chatbots-analysis">AI Chatbots Analysis<a href="#ai-chatbots-analysis" class="hash-link" aria-label="AI Chatbots Analysis的直接链接" title="AI Chatbots Analysis的直接链接"></a></h3><p>The second application of Apache Doris in AISPEECH is AI chatbot analysis.</p><p><img loading="lazy" alt="ai_chatbots_git" src="https://cdnd.selectdb.com/zh-CN/assets/images/ai_chatbots_git-f094d1221b56b522cb93ba3bc766e659.png" width="1953" height="1118" class="img_ev3q"></p><p>As shown in the figure above, unlike normal BI cases, our users only need to type a description of their data analysis needs. Based on our company's NLP capabilities, the AI chatbot BI converts natural language into SQL, similar to NL2SQL technology. Note that the natural language analysis used here is customized: compared with open-source NL2SQL, the hit rate is higher and the analysis more precise. The generated SQL is then sent to Apache Doris as a query to obtain the analysis result, so users can view detailed data in any case at any time just by typing.</p>
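<p>As a purely hypothetical illustration (the schema below is invented for this example), a typed request like "daily active devices by product line over the last 7 days" might be translated into a Doris query such as:</p><pre><code class="language-sql">-- The kind of SQL the chatbot BI could generate and send to Doris.
SELECT dt, product_line, COUNT(DISTINCT device_id) AS daily_active_devices
FROM dwd_device_activity
WHERE dt &gt;= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY dt, product_line
ORDER BY dt, product_line;
</code></pre>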
<p><strong>Compared with pre-computed OLAP engines such as Apache Kylin and Apache Druid, Apache Doris fits this scenario better for the following reasons:</strong></p><ul><li><p>Queries are flexible and the model is not fixed, so customization is supported.</p></li><li><p>It supports table joins, aggregation, and detail queries at the same time.</p></li><li><p>Query response is fast.</p></li></ul><p>Therefore, we successfully implemented AI Chatbots analysis with Apache Doris, and feedback on the application within our company has been very positive.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="experience">Experience<a href="#experience" class="hash-link" aria-label="Experience的直接链接" title="Experience的直接链接"></a></h2><p>Based on the above two scenarios, we have accumulated some experience and insights, which I will share with you now.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="datawarehouse-table-design">Data Warehouse Table Design:<a href="#datawarehouse-table-design" class="hash-link" aria-label="Data Warehouse Table Design:的直接链接" title="Data Warehouse Table Design:的直接链接"></a></h3><ol><li><p>For tables with roughly tens of millions of rows (for reference; the threshold depends on cluster size), it is better to use the Duplicate table model, which supports aggregation and detail queries at the same time, without requiring an additional detail table.</p></li><li><p>When the data volume is relatively large, we suggest using the Aggregate table model, building rollup indexes on it, using materialized views to optimize queries, and optimizing the aggregation fields.</p></li><li><p>When the data volume is large and many tables need to be joined, ETL can be used to write the data into a wide table that is then imported into Doris, combined with the Aggregate model to optimize aggregation. Alternatively, follow the official Doris JOIN optimization guide: https://doris.apache.org/en-US/docs/dev/advanced/join-optimization/doris-join-optimization</p></li></ol><h3 class="anchor anchorWithStickyNavbar_LWe7" id="storage">Storage:<a href="#storage" class="hash-link" aria-label="Storage:的直接链接" title="Storage:的直接链接"></a></h3><p>We use SSD and HDD to separate hot and warm data storage: data from the past year is stored on SSD, and data older than one year is stored on HDD. Apache Doris supports setting a cooldown time for partitions; our current solution is to set up automatic migration, which moves historical data from SSD to HDD and ensures that the data within one year stays on SSD.</p>
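<p>A minimal sketch of this hot/warm separation follows (the table, partition, and dates are illustrative assumptions, not our production DDL): when <code>storage_medium</code> is set to SSD, a <code>storage_cooldown_time</code> can be specified so that Doris migrates the data to HDD after that point in time:</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block"><pre class="prism-code language-sql"><code>-- Hypothetical table: new data lands on SSD and cools down to HDD
-- after the configured time.
CREATE TABLE user_events (
    dt DATE NOT NULL,
    user_id BIGINT NOT NULL,
    event_type VARCHAR(64)
) DUPLICATE KEY(dt, user_id)
PARTITION BY RANGE(dt) (
    PARTITION p202201 VALUES LESS THAN ("2022-02-01")
)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES (
    "replication_num" = "3",
    "storage_medium" = "SSD",
    "storage_cooldown_time" = "2023-01-01 00:00:00"
);</code></pre></div>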
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade">Upgrade<a href="#upgrade" class="hash-link" aria-label="Upgrade的直接链接" title="Upgrade的直接链接"></a></h3><p>Make sure to back up the metadata before upgrading. You can also start a new cluster, back up the data files to a remote storage system such as S3 or HDFS through Broker, and then import the data of the previous cluster into the new cluster through backup and recovery.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-comparison">Performance Comparison<a href="#performance-comparison" class="hash-link" aria-label="Performance Comparison的直接链接" title="Performance Comparison的直接链接"></a></h3><p>AISPEACH started using Apache Doris from version 0.12. This year we completed the upgrade from version 0.15 to the latest version 1.1 and conducted performance tests based on real business data.</p><p><img loading="lazy" alt="doris_1_1_performance_test_git" src="https://cdnd.selectdb.com/zh-CN/assets/images/doris_1_1_performance_test_git-ad375d6872f12ab1e3cca76d30caa1f6.png" width="1961" height="1126" class="img_ev3q"></p><p>As the test report shows, among the 13 SQL queries tested, the performance difference of the first 3 queries after the upgrade is not obvious: these 3 scenarios mainly use simple aggregation functions and do not put much pressure on Apache Doris, so version 0.15 already meets the demand. From Q4 onward, the SQL is more complex, with GROUP BY on multiple fields, aggregation functions, and complex functions, and the performance improvement after upgrading is obvious: average query performance improved by 2-3 times. We highly recommend that you upgrade to the latest version of Apache Doris.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="summary-and-benefits">Summary and Benefits<a href="#summary-and-benefits" class="hash-link" aria-label="Summary and Benefits的直接链接" title="Summary and Benefits的直接链接"></a></h2><ol><li><p>Apache Doris supports building a unified offline and real-time data warehouse. One ETL script can serve both the real-time and the offline data warehouse, which greatly improves efficiency, reduces storage costs, and avoids problems such as inconsistency between offline and real-time metrics.</p></li><li><p>Apache Doris 1.1.x fully supports vectorization, which improves query performance by 2-3 times compared with the previous version. In our tests, the query performance of Apache Doris 1.1.x on wide tables is on par with that of ClickHouse.</p></li><li><p>Apache Doris is powerful and does not depend on other components. Compared with Apache Kylin, Apache Druid, and ClickHouse, Apache Doris does not need a second component to fill technical gaps: it supports aggregation, detail queries, and join queries in one system. Currently, more than 90% of AISPEACH's analysis workloads have been migrated to Apache Doris. As a result, developers operate and maintain fewer components, which greatly reduces the cost of operation and maintenance.</p></li><li><p>It is extremely easy to use, supporting the MySQL protocol and standard SQL, which greatly reduces the learning cost for users.</p></li></ol><p><em>Special thanks to SelectDB, the company behind Apache Doris, for helping us work with the community and for providing sufficient technical support.</em></p>]]></content>
<author>
<name>Zhao Wei</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 1.2 star-schema-benchmark performance test report]]></title>
<id>https://doris.apache.org/zh-CN/blog/ssb</id>
<link href="https://doris.apache.org/zh-CN/blog/ssb"/>
<updated>2022-11-22T00:00:00.000Z</updated>
<summary type="html"><![CDATA[On the SSB flat wide table, the overall performance of Apache Doris 1.2.0-rc01 has been improved by nearly 4 times compared with Apache Doris 1.1.3, and nearly 10 times compared with Apache Doris 0.15.0 RC04. On the SQL test with standard SSB, the overall performance of Apache Doris 1.2.0-rc01 has been improved by nearly 2 times compared with Apache Doris 1.1.3, and nearly 31 times compared with Apache Doris 0.15.0 RC04.]]></summary>
<content type="html"><![CDATA[<h1>Star Schema Benchmark</h1><p><a href="https://www.cs.umb.edu/~poneil/StarSchemaB.PDF" target="_blank" rel="noopener noreferrer">Star Schema Benchmark(SSB)</a> is a lightweight performance test set in the data warehouse scenario. SSB provides a simplified star schema data based on <a href="http://www.tpc.org/tpch/" target="_blank" rel="noopener noreferrer">TPC-H</a>, which is mainly used to test the performance of multi-table JOIN query under star schema. In addition, the industry usually flattens SSB into a wide table model (Referred as: SSB flat) to test the performance of the query engine, refer to <a href="https://clickhouse.com/docs/zh/getting-started" target="_blank" rel="noopener noreferrer">Clickhouse</a>.</p><p>This document mainly introduces the performance of Doris on the SSB 100G test set.</p><blockquote><p>Note 1: The standard test set including SSB usually has a large gap with the actual business scenario, and some tests will perform parameter tuning for the test set. Therefore, the test results of the standard test set can only reflect the performance of the database in a specific scenario. It is recommended that users use actual business data for further testing.</p><p>Note 2: The operations involved in this document are all performed in the Ubuntu Server 20.04 environment, and CentOS 7 as well.</p></blockquote><p>With 13 queries on the SSB standard test data set, we conducted a comparison test based on Apache Doris 1.2.0-rc01, Apache Doris 1.1.3 and Apache Doris 0.15.0 RC04 versions.</p><p>On the SSB flat wide table, the overall performance of Apache Doris 1.2.0-rc01 has been improved by nearly 4 times compared with Apache Doris 1.1.3, and nearly 10 times compared with Apache Doris 0.15.0 RC04.</p><p>On the SQL test with standard SSB, the overall performance of Apache Doris 1.2.0-rc01 has been improved by nearly 2 times compared with Apache Doris 1.1.3, and nearly 31 times compared with Apache Doris 0.15.0 RC04.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="1-hardware-environment">1. Hardware Environment<a href="#1-hardware-environment" class="hash-link" aria-label="1. Hardware Environment的直接链接" title="1. Hardware Environment的直接链接"></a></h2><table><thead><tr><th>Number of machines</th><th>4 Tencent Cloud Hosts (1 FE, 3 BEs)</th></tr></thead><tbody><tr><td>CPU</td><td>AMD EPYC™ Milan (2.55GHz/3.5GHz) 16 Cores</td></tr><tr><td>Memory</td><td>64G</td></tr><tr><td>Network Bandwidth</td><td>7Gbps</td></tr><tr><td>Disk</td><td>High-performance Cloud Disk</td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="2-software-environment">2. Software Environment<a href="#2-software-environment" class="hash-link" aria-label="2. Software Environment的直接链接" title="2. Software Environment的直接链接"></a></h2><ul><li>Doris deployed 3BEs and 1FE;</li><li>Kernel version: Linux version 5.4.0-96-generic (buildd@lgw01-amd64-051)</li><li>OS version: Ubuntu Server 20.04 LTS 64-bit</li><li>Doris software versions: Apache Doris 1.2.0-rc01, Apache Doris 1.1.3 and Apache Doris 0.15.0 RC04</li><li>JDK: openjdk version "11.0.14" 2022-01-18</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="3-test-data-volume">3. Test Data Volume<a href="#3-test-data-volume" class="hash-link" aria-label="3. Test Data Volume的直接链接" title="3. 
Test Data Volume的直接链接"></a></h2><table><thead><tr><th align="left">SSB Table Name</th><th align="left">Rows</th><th align="left">Annotation</th></tr></thead><tbody><tr><td align="left">lineorder</td><td align="left">600,037,902</td><td align="left">Commodity Order Details</td></tr><tr><td align="left">customer</td><td align="left">3,000,000</td><td align="left">Customer Information</td></tr><tr><td align="left">part</td><td align="left">1,400,000</td><td align="left">Parts Information</td></tr><tr><td align="left">supplier</td><td align="left">200,000</td><td align="left">Supplier Information</td></tr><tr><td align="left">date</td><td align="left">2,556</td><td align="left">Date</td></tr><tr><td align="left">lineorder_flat</td><td align="left">600,037,902</td><td align="left">Wide Table after Data Flattening</td></tr></tbody></table><h2 class="anchor anchorWithStickyNavbar_LWe7" id="4-test-results">4. Test Results<a href="#4-test-results" class="hash-link" aria-label="4. Test Results的直接链接" title="4. Test Results的直接链接"></a></h2><p>We used Apache Doris 1.2.0-rc01, Apache Doris 1.1.3, and Apache Doris 0.15.0 RC04 for comparative testing on the SSB flat wide table. The test results are as follows:</p><table><thead><tr><th>Query</th><th>Apache Doris 1.2.0-rc01 (ms)</th><th>Apache Doris 1.1.3 (ms)</th><th>Doris 0.15.0 RC04 (ms)</th></tr></thead><tbody><tr><td>Q1.1</td><td>20</td><td>90</td><td>250</td></tr><tr><td>Q1.2</td><td>10</td><td>10</td><td>30</td></tr><tr><td>Q1.3</td><td>30</td><td>70</td><td>120</td></tr><tr><td>Q2.1</td><td>90</td><td>360</td><td>900</td></tr><tr><td>Q2.2</td><td>90</td><td>340</td><td>1,020</td></tr><tr><td>Q2.3</td><td>60</td><td>260</td><td>770</td></tr><tr><td>Q3.1</td><td>160</td><td>550</td><td>1,710</td></tr><tr><td>Q3.2</td><td>80</td><td>290</td><td>670</td></tr><tr><td>Q3.3</td><td>90</td><td>240</td><td>550</td></tr><tr><td>Q3.4</td><td>20</td><td>20</td><td>30</td></tr><tr><td>Q4.1</td><td>140</td><td>480</td><td>1,250</td></tr><tr><td>Q4.2</td><td>50</td><td>240</td><td>400</td></tr><tr><td>Q4.3</td><td>30</td><td>200</td><td>330</td></tr><tr><td>Total</td><td>880</td><td>3,150</td><td>8,030</td></tr></tbody></table><p><img loading="lazy" alt="ssb_v11_v015_compare" src="https://cdnd.selectdb.com/zh-CN/assets/images/ssb_flat-a8cfebbc53e6f2db116876e3d53e19c7.png" width="1522" height="674" class="img_ev3q"></p><p><strong>Interpretation of Results</strong></p><ul><li>The data set corresponding to the test results is scale 100, about 600 million rows.</li><li>The test environment matches a common user configuration: 4 cloud servers with 16 cores, 64G memory, and SSDs, deployed as 1 FE and 3 BEs.</li><li>We chose a configuration common among users to reduce the cost of selection and evaluation; the entire test process does not consume all of these hardware resources.</li></ul>
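<p>For reference, the flat-table variant of Q1.1 is shown below in its commonly used form (reproduced here for illustration; the exact queries used in the test are in the ssb-tools scripts referenced in the next section):</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block"><pre class="prism-code language-sql"><code>-- SSB flat Q1.1: revenue from discounted 1993 orders,
-- scanning only the lineorder_flat wide table.
SELECT SUM(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat
WHERE LO_ORDERDATE &gt;= '1993-01-01'
  AND LO_ORDERDATE &lt;= '1993-12-31'
  AND LO_DISCOUNT BETWEEN 1 AND 3
  AND LO_QUANTITY &lt; 25;</code></pre></div>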
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="5-standard-ssb-test-results">5. Standard SSB Test Results<a href="#5-standard-ssb-test-results" class="hash-link" aria-label="5. Standard SSB Test Results的直接链接" title="5. Standard SSB Test Results的直接链接"></a></h2><p>Here we used Apache Doris 1.2.0-rc01, Apache Doris 1.1.3, and Apache Doris 0.15.0 RC04 for comparative testing on standard SSB, with Query Time (ms) as the main performance indicator. The test results are as follows:</p><table><thead><tr><th>Query</th><th>Apache Doris 1.2.0-rc01 (ms)</th><th>Apache Doris 1.1.3 (ms)</th><th>Doris 0.15.0 RC04 (ms)</th></tr></thead><tbody><tr><td>Q1.1</td><td>40</td><td>18</td><td>350</td></tr><tr><td>Q1.2</td><td>30</td><td>100</td><td>80</td></tr><tr><td>Q1.3</td><td>20</td><td>70</td><td>80</td></tr><tr><td>Q2.1</td><td>350</td><td>940</td><td>20,680</td></tr><tr><td>Q2.2</td><td>320</td><td>750</td><td>18,250</td></tr><tr><td>Q2.3</td><td>300</td><td>720</td><td>14,760</td></tr><tr><td>Q3.1</td><td>650</td><td>2,150</td><td>22,190</td></tr><tr><td>Q3.2</td><td>260</td><td>510</td><td>8,360</td></tr><tr><td>Q3.3</td><td>220</td><td>450</td><td>6,200</td></tr><tr><td>Q3.4</td><td>60</td><td>70</td><td>160</td></tr><tr><td>Q4.1</td><td>840</td><td>1,480</td><td>24,320</td></tr><tr><td>Q4.2</td><td>460</td><td>560</td><td>6,310</td></tr><tr><td>Q4.3</td><td>610</td><td>660</td><td>10,170</td></tr><tr><td>Total</td><td>4,160</td><td>8,478</td><td>131,910</td></tr></tbody></table><p><img loading="lazy" alt="ssb_12_11_015" src="https://cdnd.selectdb.com/zh-CN/assets/images/ssb-6f7fc8825356019f61622f6fcb9fa1d0.png" width="1354" height="728" class="img_ev3q"></p><p><strong>Interpretation of Results</strong></p><ul><li>The data set corresponding to the test results is scale 100, about 600 million rows.</li><li>The test environment matches a common user configuration: 4 cloud servers with 16 cores, 64G memory, and SSDs, deployed as 1 FE and 3 BEs.</li><li>We chose a configuration common among users to reduce the cost of selection and evaluation; the entire test process does not consume all of these hardware resources.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="6-environment-preparation">6. Environment Preparation<a href="#6-environment-preparation" class="hash-link" aria-label="6. Environment Preparation的直接链接" title="6. Environment Preparation的直接链接"></a></h2><p>Please first refer to the official deployment documentation to install and deploy Apache Doris and obtain a working Doris cluster (with at least 1 FE and 1 BE; 1 FE and 3 BEs are recommended).</p><p>The scripts mentioned in the following sections are stored in the Apache Doris codebase: <a href="https://github.com/apache/doris/tree/master/tools/ssb-tools" target="_blank" rel="noopener noreferrer">ssb-tools</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="7-data-preparation">7. Data Preparation<a href="#7-data-preparation" class="hash-link" aria-label="7. Data Preparation的直接链接" title="7. 
Data Preparation的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="71-download-and-install-the-ssb-data-generation-tool">7.1 Download and Install the SSB Data Generation Tool.<a href="#71-download-and-install-the-ssb-data-generation-tool" class="hash-link" aria-label="7.1 Download and Install the SSB Data Generation Tool.的直接链接" title="7.1 Download and Install the SSB Data Generation Tool.的直接链接"></a></h3><p>Execute the following script to download and compile the <a href="https://github.com/electrum/ssb-dbgen.git" target="_blank" rel="noopener noreferrer">ssb-dbgen</a> tool.</p><div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">sh</span><span class="token plain"> build-ssb-dbgen.sh</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>After successful installation, the <code>dbgen</code> binary will be generated under the <code>ssb-dbgen/</code> directory.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="72-generate-ssb-test-set">7.2 Generate SSB Test Set<a href="#72-generate-ssb-test-set" class="hash-link" aria-label="7.2 Generate SSB Test Set的直接链接" title="7.2 Generate SSB Test Set的直接链接"></a></h3><p>Execute the following script to generate the SSB dataset:</p><div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">sh</span><span class="token plain"> gen-ssb-data.sh -s </span><span class="token number">100</span><span class="token plain"> -c </span><span class="token number">100</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><blockquote><p>Note 1: Check the script help via <code>sh gen-ssb-data.sh -h</code>.</p><p>Note 2: The data will be generated under the <code>ssb-data/</code> directory with the suffix <code>.tbl</code>. 
The total file size is about 60GB and may need a few minutes to an hour to generate.</p><p>Note 3: <code>-s 100</code> indicates that the test set size factor is 100, <code>-c 100</code> indicates that 100 concurrent threads generate the data of the lineorder table. The <code>-c</code> parameter also determines the number of files in the final lineorder table. The larger the parameter, the larger the number of files and the smaller each file.</p></blockquote><p>With the <code>-s 100</code> parameter, the resulting dataset size is:</p><table><thead><tr><th>Table</th><th>Rows</th><th>Size</th><th>File Number</th></tr></thead><tbody><tr><td>lineorder</td><td>600,037,902</td><td>60GB</td><td>100</td></tr><tr><td>customer</td><td>3,000,000</td><td>277M</td><td>1</td></tr><tr><td>part</td><td>1,400,000</td><td>116M</td><td>1</td></tr><tr><td>supplier</td><td>200,000</td><td>17M</td><td>1</td></tr><tr><td>date</td><td>2,556</td><td>228K</td><td>1</td></tr></tbody></table><h3 class="anchor anchorWithStickyNavbar_LWe7" id="73-create-table">7.3 Create Table<a href="#73-create-table" class="hash-link" aria-label="7.3 Create Table的直接链接" title="7.3 Create Table的直接链接"></a></h3><h4 class="anchor anchorWithStickyNavbar_LWe7" id="731-prepare-the-doris-clusterconf-file">7.3.1 Prepare the <code>doris-cluster.conf</code> File.<a href="#731-prepare-the-doris-clusterconf-file" class="hash-link" aria-label="731-prepare-the-doris-clusterconf-file的直接链接" title="731-prepare-the-doris-clusterconf-file的直接链接"></a></h4><p>Before import the script, you need to write the FE’s ip port and other information in the <code>doris-cluster.conf</code> file.</p><p>The file location is at the same level as <code>load-ssb-dimension-data.sh</code>.</p><p>The content of the file includes FE's ip, HTTP port, user name, password and the DB name of the data to be imported:</p><div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">FE_HOST</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">"xxx"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">FE_HTTP_PORT</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">"8030"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">FE_QUERY_PORT</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">"9030"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable environment constant" style="color:rgb(189, 147, 249);font-style:italic">USER</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">"root"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">PASSWORD</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">'xxx'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">DB</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">"ssb"</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h4 class="anchor anchorWithStickyNavbar_LWe7" id="732-execute-the-following-script-to-generate-and-create-the-ssb-table">7.3.2 Execute the Following Script to Generate and Create the SSB Table:<a href="#732-execute-the-following-script-to-generate-and-create-the-ssb-table" class="hash-link" aria-label="7.3.2 Execute the Following Script to Generate and Create the SSB Table:的直接链接" title="7.3.2 Execute the Following Script to Generate and Create the SSB Table:的直接链接"></a></h4><div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">sh</span><span class="token plain"> create-ssb-tables.sh</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p>Or copy the table creation statements in <a 
href="https://github.com/apache/incubator-doris/tree/master/tools/ssb-tools/ddl/create-ssb-tables.sql" target="_blank" rel="noopener noreferrer">create-ssb-tables.sql</a> and <a href="https://github.com/apache/incubator-doris/tree/master/tools/ssb-tools/ddl/create-ssb-flat-table.sql" target="_blank" rel="noopener noreferrer"> create-ssb-flat-table.sql</a> and then execute them in the MySQL client.</p><p>The following is the <code>lineorder_flat</code> table build statement. Create the <code>lineorder_flat</code> table in the above <code>create-ssb-flat-table.sh</code> script, and perform the default number of buckets (48 buckets). You can delete this table and adjust the number of buckets according to your cluster scale node configuration, so as to obtain a better test result.</p><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">CREATE</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">TABLE</span><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">lineorder_flat</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_ORDERDATE</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">date</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_ORDERKEY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token 
plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_LINENUMBER</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">tinyint</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">4</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_CUSTKEY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_PARTKEY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_SUPPKEY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_ORDERPRIORITY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_SHIPPRIORITY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">tinyint</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">4</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_QUANTITY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span 
class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">tinyint</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">4</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_EXTENDEDPRICE</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_ORDTOTALPRICE</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_DISCOUNT</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">tinyint</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">4</span><span class="token 
punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_REVENUE</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_SUPPLYCOST</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">int</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">11</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_TAX</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">tinyint</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">4</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" 
style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_COMMITDATE</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">date</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">LO_SHIPMODE</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">C_NAME</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 
242)">`</span><span class="token identifier">C_ADDRESS</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">C_CITY</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">C_NATION</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">varchar</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token number">100</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">NOT</span><span class="token plain"> </span><span class="token boolean">NULL</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">COMMENT</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">""</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token identifier">C_REGION</span><span class="token identifier punctuation" style="color:rgb(248, 248, 242)">`</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 
varchar(100) NOT NULL COMMENT "",
 `C_PHONE` varchar(100) NOT NULL COMMENT "",
 `C_MKTSEGMENT` varchar(100) NOT NULL COMMENT "",
 `S_NAME` varchar(100) NOT NULL COMMENT "",
 `S_ADDRESS` varchar(100) NOT NULL COMMENT "",
 `S_CITY` varchar(100) NOT NULL COMMENT "",
 `S_NATION` varchar(100) NOT NULL COMMENT "",
 `S_REGION` varchar(100) NOT NULL COMMENT "",
 `S_PHONE` varchar(100) NOT NULL COMMENT "",
 `P_NAME` varchar(100) NOT NULL COMMENT "",
 `P_MFGR` varchar(100) NOT NULL COMMENT "",
 `P_CATEGORY` varchar(100) NOT NULL COMMENT "",
 `P_BRAND` varchar(100) NOT NULL COMMENT "",
 `P_COLOR` varchar(100) NOT NULL COMMENT "",
 `P_TYPE` varchar(100) NOT NULL COMMENT "",
 `P_SIZE` tinyint(4) NOT NULL COMMENT "",
 `P_CONTAINER` varchar(100) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`LO_ORDERDATE`, `LO_ORDERKEY`)
COMMENT "OLAP"
PARTITION BY RANGE(`LO_ORDERDATE`)
(PARTITION p1 VALUES [('0000-01-01'), ('1993-01-01')),
PARTITION p2 VALUES [('1993-01-01'), ('1994-01-01')),
PARTITION p3 VALUES [('1994-01-01'), ('1995-01-01')),
PARTITION p4 VALUES [('1995-01-01'), ('1996-01-01')),
PARTITION p5 VALUES [('1996-01-01'), ('1997-01-01')),
PARTITION p6 VALUES [('1997-01-01'), ('1998-01-01')),
PARTITION p7 VALUES [('1998-01-01'), ('1999-01-01')))
DISTRIBUTED BY HASH(`LO_ORDERKEY`) BUCKETS 48
PROPERTIES (
"replication_num" = "1",
"colocate_with" = "groupxx1",
"in_memory" = "false",
"storage_format" = "DEFAULT"
);
</code></pre>
style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="74-import-data">7.4 Import data<a href="#74-import-data" class="hash-link" aria-label="7.4 Import data的直接链接" title="7.4 Import data的直接链接"></a></h3><p>We use the following command to complete all data import of SSB test set and SSB FLAT wide table data synthesis and then import into the table.</p><div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">sh</span><span class="token plain"> bin/load-ssb-data.sh -c </span><span class="token number">10</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><p><code>-c 5</code> means start 10 concurrent threads to import (5 by default). In the case of a single BE node, the lineorder data generated by <code>sh gen-ssb-data.sh -s 100 -c 100</code> will also generate the data of the ssb-flat table in the end. If more threads are enabled, the import speed can be accelerated. But it will cost extra memory.</p><blockquote><p>Notes.</p><ol><li><p>To get faster import speed, you can add <code>flush_thread_num_per_store=5</code> in be.conf and then restart BE. This configuration indicates the number of disk writing threads for each data directory, 2 by default. Larger data can improve write data throughput, but may increase IO Util. (Reference value: 1 mechanical disk, with 2 by default, the IO Util during the import process is about 12%. When it is set to 5, the IO Util is about 26%. If it is an SSD disk, it is almost 0%) .</p></li><li><p>The flat table data is imported by 'INSERT INTO ... SELECT ... 
<h3 id="75-checking-imported-data">7.5 Checking imported data</h3>
<pre><code class="language-sql">select count(*) from part;
select count(*) from customer;
select count(*) from supplier;
select count(*) from date;
select count(*) from lineorder;
select count(*) from lineorder_flat;
</code></pre>
<p>The returned counts should match the number of rows generated:</p>
<table><thead><tr><th>Table</th><th>Rows</th><th>Origin Size</th><th>Compacted Size (1 Replica)</th></tr></thead><tbody><tr><td>lineorder_flat</td><td>600,037,902</td><td></td><td>59.709 GB</td></tr><tr><td>lineorder</td><td>600,037,902</td><td>60 GB</td><td>14.514 GB</td></tr><tr><td>customer</td><td>3,000,000</td><td>277 MB</td><td>138.247 MB</td></tr><tr><td>part</td><td>1,400,000</td><td>116 MB</td><td>12.759 MB</td></tr><tr><td>supplier</td><td>200,000</td><td>17 MB</td><td>9.143 MB</td></tr><tr><td>date</td><td>2,556</td><td>228 KB</td><td>34.276 KB</td></tr></tbody></table>
SQL的直接链接"></a></h4><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token comment" style="color:rgb(98, 114, 164)">--Q1.1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_EXTENDEDPRICE </span><span class="token operator">*</span><span class="token plain"> LO_DISCOUNT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19930101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19931231</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_DISCOUNT </span><span class="token operator">BETWEEN</span><span class="token plain"> </span><span class="token number">1</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token number">3</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_QUANTITY </span><span class="token operator">&lt;</span><span class="token plain"> </span><span class="token number">25</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q1.2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_EXTENDEDPRICE </span><span class="token operator">*</span><span class="token plain"> LO_DISCOUNT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 
249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19940101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19940131</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_DISCOUNT </span><span class="token operator">BETWEEN</span><span class="token plain"> </span><span class="token number">4</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token number">6</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_QUANTITY </span><span class="token operator">BETWEEN</span><span class="token plain"> </span><span class="token number">26</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token number">35</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q1.3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_EXTENDEDPRICE </span><span class="token operator">*</span><span class="token plain"> LO_DISCOUNT</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> weekofyear</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">=</span><span class="token plain"> </span><span 
class="token number">6</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19940101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19941231</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_DISCOUNT </span><span class="token operator">BETWEEN</span><span class="token plain"> </span><span class="token number">5</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token number">7</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_QUANTITY </span><span class="token operator">BETWEEN</span><span class="token plain"> </span><span class="token number">26</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> </span><span class="token number">35</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q2.1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> P_CATEGORY </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#12'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_REGION 
</span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'AMERICA'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q2.2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> P_BRAND </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#2221'</span><span 
class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> P_BRAND </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#2228'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'ASIA'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q2.3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> 
lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> P_BRAND </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#2239'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'EUROPE'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q3.1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> C_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 
242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> C_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'ASIA'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'ASIA'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19920101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19971231</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> C_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> revenue </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q3.2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 
147, 249);font-style:italic">SELECT</span><span class="token plain"> C_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> C_NATION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED STATES'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_NATION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED STATES'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19920101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19971231</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> C_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span 
class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> revenue </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q3.3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> C_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> C_CITY </span><span class="token operator">IN</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI1'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI5'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span 
class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_CITY </span><span class="token operator">IN</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI1'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI5'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19920101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19971231</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> C_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> revenue </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q3.4</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> C_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token 
operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> C_CITY </span><span class="token operator">IN</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI1'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI5'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_CITY </span><span class="token operator">IN</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI1'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED KI5'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19971201</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19971231</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> C_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span 
class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> revenue </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">DESC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q4.1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> C_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE </span><span class="token operator">-</span><span class="token plain"> LO_SUPPLYCOST</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> profit</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> C_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'AMERICA'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_REGION </span><span class="token operator">=</span><span 
class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'AMERICA'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> P_MFGR </span><span class="token operator">IN</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#1'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#2'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> C_NATION</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> C_NATION </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q4.2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain">S_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_CATEGORY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> 
</span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE </span><span class="token operator">-</span><span class="token plain"> LO_SUPPLYCOST</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> profit</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> C_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'AMERICA'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> S_REGION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'AMERICA'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19970101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19981231</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> P_MFGR </span><span class="token operator">IN</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#1'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#2'</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_NATION</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_CATEGORY</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 
249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_NATION </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_CATEGORY </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)">--Q4.3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">SELECT</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_ORDERDATE </span><span class="token operator">DIV</span><span class="token plain"> </span><span class="token number">10000</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">SUM</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token plain">LO_REVENUE </span><span class="token operator">-</span><span class="token plain"> LO_SUPPLYCOST</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">AS</span><span class="token plain"> profit</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">FROM</span><span class="token plain"> lineorder_flat</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">WHERE</span><span class="token plain"> S_NATION </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'UNITED STATES'</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> LO_ORDERDATE </span><span class="token operator">&gt;=</span><span class="token plain"> </span><span class="token number">19970101</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> 
LO_ORDERDATE </span><span class="token operator">&lt;=</span><span class="token plain"> </span><span class="token number">19981231</span><span class="token plain"> </span><span class="token operator">AND</span><span class="token plain"> P_CATEGORY </span><span class="token operator">=</span><span class="token plain"> </span><span class="token string" style="color:rgb(255, 121, 198)">'MFGR#14'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">BY</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">YEAR</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> S_CITY </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">,</span><span class="token plain"> P_BRAND </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">ASC</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h4 class="anchor anchorWithStickyNavbar_LWe7" id="762-ssb-standard-test-for-sql">7.6.2 SSB Standard Test for SQL<a href="#762-ssb-standard-test-for-sql" class="hash-link" aria-label="7.6.2 SSB Standard Test for SQL的直接链接" title="7.6.2 SSB Standard Test for SQL的直接链接"></a></h4><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q1.1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE</span><br></span><span class="token-line" 
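<p>A quick note on the flat-table queries above (an illustrative aside, not part of the original benchmark script): <code>LO_ORDERDATE</code> is stored as an integer of the form YYYYMMDD, so integer division by 10000 strips the month and day and leaves the year:</p><pre><code>-- LO_ORDERDATE is an integer such as 19971231 (December 31, 1997);
-- DIV is integer division, so dividing by 10000 keeps only the year digits:
SELECT (19971231 DIV 10000) AS YEAR;  -- returns 1997
</code></pre>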
style="color:#F8F8F2"><span class="token plain">FROM lineorder, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year = 1993</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_discount BETWEEN 1 AND 3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_quantity &lt; 25;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q1.2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM lineorder, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_yearmonth = 'Jan1994'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_discount BETWEEN 4 AND 6</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_quantity BETWEEN 26 AND 35;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q1.3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_extendedprice * lo_discount) AS REVENUE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM lineorder, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_weeknuminyear = 6</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year = 1994</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_discount BETWEEN 5 AND 7</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_quantity BETWEEN 26 AND 35;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q2.1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT SUM(lo_revenue), d_year, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM lineorder, dates, part, supplier</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_partkey = 
p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND p_category = 'MFGR#12'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_region = 'AMERICA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY d_year, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY p_brand;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q2.2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT SUM(lo_revenue), d_year, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM lineorder, dates, part, supplier</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_partkey = p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_region = 'ASIA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY d_year, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year, p_brand;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q2.3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT SUM(lo_revenue), d_year, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM lineorder, dates, part, supplier</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_partkey = p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND p_brand = 'MFGR#2239'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_region = 'EUROPE'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY d_year, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year, p_brand;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">--Q3.1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue) AS REVENUE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM customer, lineorder, supplier, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND c_region = 'ASIA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_region = 'ASIA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year &gt;= 1992</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year &lt;= 1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY c_nation, s_nation, d_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year ASC, REVENUE DESC;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q3.2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue) AS REVENUE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM customer, lineorder, supplier, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND c_nation = 'UNITED STATES'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_nation = 'UNITED STATES'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year &gt;= 1992</span><br></span><span 
class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year &lt;= 1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY c_city, s_city, d_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year ASC, REVENUE DESC;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q3.3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue) AS REVENUE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM customer, lineorder, supplier, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_city = 'UNITED KI1'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR c_city = 'UNITED KI5'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_city = 'UNITED KI1'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR s_city = 'UNITED KI5'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year &gt;= 1992</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_year &lt;= 1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY c_city, s_city, d_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year ASC, REVENUE DESC;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q3.4</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue) AS REVENUE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM customer, lineorder, supplier, dates</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_city = 'UNITED KI1'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR c_city = 'UNITED KI5'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_city = 'UNITED KI1'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR s_city = 'UNITED KI5'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND d_yearmonth = 'Dec1997'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY c_city, s_city, d_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year ASC, REVENUE DESC;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q4.1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue - lo_supplycost) AS PROFIT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM dates, customer, supplier, part, lineorder</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_partkey = p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND c_region = 
'AMERICA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_region = 'AMERICA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_mfgr = 'MFGR#1'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR p_mfgr = 'MFGR#2'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY d_year, c_nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year, c_nation;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q4.2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */ </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_category,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue - lo_supplycost) AS PROFIT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM dates, customer, supplier, part, lineorder</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_partkey = p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND c_region = 'AMERICA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_region = 'AMERICA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year = 1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR d_year = 1998</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_mfgr = 'MFGR#1'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR p_mfgr = 'MFGR#2'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY d_year, s_nation, 
p_category</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year, s_nation, p_category;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q4.3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_city,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_brand,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> SUM(lo_revenue - lo_supplycost) AS PROFIT</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">FROM dates, customer, supplier, part, lineorder</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">WHERE</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lo_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_partkey = p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND lo_orderdate = d_datekey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND s_nation = 'UNITED STATES'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> d_year = 1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> OR d_year = 1998</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> AND p_category = 'MFGR#14'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">GROUP BY d_year, s_city, p_brand</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">ORDER BY d_year, s_city, p_brand;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris 1.2 TPC-H performance test report]]></title>
<id>https://doris.apache.org/zh-CN/blog/tpch</id>
<link href="https://doris.apache.org/zh-CN/blog/tpch"/>
<updated>2022-11-22T00:00:00.000Z</updated>
<summary type="html"><![CDATA[On 22 queries on the TPC-H standard test data set, we conducted a comparison test based on Apache Doris 1.2.0-rc01, Apache Doris 1.1.3 and Apache Doris 0.15.0 RC04 versions. Compared with Apache Doris 1.1.3, the overall performance of Apache Doris 1.2.0-rc01 has been improved by nearly 3 times, and by nearly 11 times compared with Apache Doris 0.15.0 RC04.]]></summary>
<content type="html"><![CDATA[<h1>TPC-H Benchmark</h1><p>TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications, and the queried data has broad industry-wide relevance. The benchmark models a decision support system that examines large volumes of data, executes highly complex queries, and answers critical business questions. The performance metric reported by TPC-H is the TPC-H Composite Query-per-Hour rating (QphH@Size), which reflects multiple aspects of the system's query processing capability: the database size against which the queries are executed, the query processing power when queries are submitted by a single stream, and the query throughput when queries are submitted by many concurrent users.</p>
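<p>For reference, the TPC-H specification defines this composite metric as the geometric mean of the Power and Throughput metrics at the chosen scale factor:</p><pre><code class="language-text">QphH@Size = sqrt( Power@Size x Throughput@Size )
</code></pre>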
<p>This document mainly introduces the performance of Doris on the TPC-H 100G test set.</p><blockquote><p>Note 1: Standard benchmark suites such as TPC-H are usually far removed from real business scenarios, and some tests tune parameters specifically for the benchmark. Results on a standard benchmark therefore only reflect database performance in that specific scenario. We recommend that users run further tests with their own business data.</p><p>Note 2: All operations in this document were tested on CentOS 7.x.</p></blockquote><p>We ran the 22 queries of the TPC-H standard test set against Apache Doris 1.2.0-rc01, Apache Doris 1.1.3, and Apache Doris 0.15.0 RC04. Overall, Apache Doris 1.2.0-rc01 performed nearly 3 times faster than Apache Doris 1.1.3 and nearly 11 times faster than Apache Doris 0.15.0 RC04.</p><h2>1. Hardware Environment</h2><table><thead><tr><th>Hardware</th><th>Configuration</th></tr></thead><tbody><tr><td>Number of machines</td><td>4 Tencent Cloud virtual machines (1 FE, 3 BEs)</td></tr><tr><td>CPU</td><td>Intel Xeon (Cascade Lake) Platinum 8269CY, 16 cores (2.5 GHz / 3.2 GHz)</td></tr><tr><td>Memory</td><td>64 GB</td></tr><tr><td>Network</td><td>5 Gbps</td></tr><tr><td>Disk</td><td>ESSD cloud disk</td></tr></tbody></table><h2>2. Software Environment</h2><ul><li>Doris deployed with 3 BEs and 1 FE</li><li>Kernel version: Linux version 5.4.0-96-generic (buildd@lgw01-amd64-051)</li><li>OS version: CentOS 7.8</li><li>Doris software versions: Apache Doris 1.2.0-rc01, Apache Doris 1.1.3, Apache Doris 0.15.0 RC04</li><li>JDK: openjdk version "11.0.14" 2022-01-18</li></ul><h2>3. Test Data Volume</h2><p>The simulated TPC-H 100G data generated for the test was imported into Apache Doris 1.2.0-rc01, Apache Doris 1.1.3, and Apache Doris 0.15.0 RC04 respectively. The following table describes each table and its data volume.</p><table><thead><tr><th align="left">TPC-H Table Name</th><th align="left">Rows</th><th>Size after Import</th><th align="left">Annotation</th></tr></thead><tbody><tr><td align="left">REGION</td><td align="left">5</td><td>400 KB</td><td align="left">Region</td></tr><tr><td align="left">NATION</td><td align="left">25</td><td>7.714 KB</td><td align="left">Nation</td></tr><tr><td align="left">SUPPLIER</td><td align="left">1,000,000</td><td>85.528 MB</td><td align="left">Supplier</td></tr><tr><td align="left">PART</td><td align="left">20,000,000</td><td>752.330 MB</td><td align="left">Parts</td></tr><tr><td align="left">PARTSUPP</td><td align="left">80,000,000</td><td>4.375 GB</td><td align="left">Parts supply</td></tr><tr><td align="left">CUSTOMER</td><td align="left">15,000,000</td><td>1.317 GB</td><td align="left">Customer</td></tr><tr><td align="left">ORDERS</td><td align="left">150,000,000</td><td>6.301 GB</td><td align="left">Orders</td></tr><tr><td align="left">LINEITEM</td><td align="left">600,037,902</td><td>20.882 GB</td><td align="left">Order details</td></tr></tbody></table><h2>4. Test SQL</h2><p>The 22 TPC-H test query statements: <a href="https://github.com/apache/incubator-doris/tree/master/tools/tpch-tools/queries" target="_blank" rel="noopener noreferrer">TPCH-Query-SQL</a></p><p><strong>Notice:</strong></p><p>The following four parameters in the above SQL do not exist in Apache Doris 0.15.0 RC04. Remove them from the hints before executing:</p><pre><code class="language-text">enable_vectorized_engine=true
batch_size=4096
disable_join_reorder=false
enable_projection=true
</code></pre>
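<p>For example (a sketch based on the Q1 statement shown in Section 7.6.2), the hint would be trimmed as follows when running against Apache Doris 0.15.0 RC04:</p><pre><code class="language-sql">-- Hint as written for Apache Doris 1.1.3 / 1.2.0-rc01 (Q1):
select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=false) */ ...

-- The same hint with the four unsupported variables removed for 0.15.0 RC04:
select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_cost_based_join_reorder=false) */ ...
</code></pre>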
<h2>5. Test Results</h2><p>We compared Apache Doris 1.2.0-rc01, Apache Doris 1.1.3, and Apache Doris 0.15.0 RC04, using query time in seconds as the main performance indicator. The test results are as follows:</p><table><thead><tr><th>Query</th><th>Apache Doris 1.2.0-rc01 (s)</th><th>Apache Doris 1.1.3 (s)</th><th>Apache Doris 0.15.0 RC04 (s)</th></tr></thead><tbody><tr><td>Q1</td><td>2.12</td><td>3.75</td><td>28.63</td></tr><tr><td>Q2</td><td>0.20</td><td>4.22</td><td>7.88</td></tr><tr><td>Q3</td><td>0.62</td><td>2.64</td><td>9.39</td></tr><tr><td>Q4</td><td>0.61</td><td>1.5</td><td>9.3</td></tr><tr><td>Q5</td><td>1.05</td><td>2.15</td><td>4.11</td></tr><tr><td>Q6</td><td>0.08</td><td>0.19</td><td>0.43</td></tr><tr><td>Q7</td><td>0.58</td><td>1.04</td><td>1.61</td></tr><tr><td>Q8</td><td>0.72</td><td>1.75</td><td>50.35</td></tr><tr><td>Q9</td><td>3.61</td><td>7.94</td><td>16.34</td></tr><tr><td>Q10</td><td>1.26</td><td>1.41</td><td>5.21</td></tr><tr><td>Q11</td><td>0.15</td><td>0.35</td><td>1.72</td></tr><tr><td>Q12</td><td>0.21</td><td>0.57</td><td>5.39</td></tr><tr><td>Q13</td><td>2.62</td><td>8.15</td><td>20.88</td></tr><tr><td>Q14</td><td>0.16</td><td>0.3</td><td>-</td></tr><tr><td>Q15</td><td>0.30</td><td>0.66</td><td>1.86</td></tr><tr><td>Q16</td><td>0.38</td><td>0.79</td><td>1.32</td></tr><tr><td>Q17</td><td>0.65</td><td>1.51</td><td>26.67</td></tr><tr><td>Q18</td><td>2.28</td><td>3.364</td><td>11.77</td></tr><tr><td>Q19</td><td>0.20</td><td>0.829</td><td>1.71</td></tr><tr><td>Q20</td><td>0.21</td><td>2.77</td><td>5.2</td></tr><tr><td>Q21</td><td>1.17</td><td>4.47</td><td>10.34</td></tr><tr><td>Q22</td><td>0.46</td><td>0.9</td><td>3.22</td></tr><tr><td><strong>Total</strong></td><td><strong>19.64</strong></td><td><strong>51.253</strong></td><td><strong>223.33</strong></td></tr></tbody></table><p><img loading="lazy" alt="TPC-H query time comparison across Apache Doris versions" src="https://cdnd.selectdb.com/zh-CN/assets/images/tpch-2048da37571ef8b1d4b0a49c3fba44ca.png" width="1526" height="726" class="img_ev3q"></p><ul><li><strong>Result notes</strong><ul><li>The results correspond to scale factor 100, i.e. about 600 million rows in LINEITEM.</li><li>The test environment matches a common user configuration: 4 cloud servers, 16 cores and 64 GB RAM each with SSD storage, deployed as 1 FE and 3 BEs.</li><li>We chose a common user configuration to reduce the cost of selection and evaluation; the full test does not actually consume that much hardware resource.</li><li>Apache Doris 0.15.0 RC04 failed to execute Q14 in the TPC-H test and could not complete the query, so its cell is marked "-".</li></ul></li></ul><h2>6. Environmental Preparation</h2><p>Please refer to the <a href="https://doris.apache.org/docs/install/cluster-deployment/standard-deployment/" target="_blank" rel="noopener noreferrer">official document</a> to install and deploy Doris and obtain a normally running Doris cluster (at least 1 FE and 1 BE; 1 FE and 3 BEs recommended).</p>
<h2>7. Data Preparation</h2><h3>7.1 Download and Install the TPC-H Data Generation Tool</h3><p>Execute the following script to download and compile the <a href="https://github.com/apache/incubator-doris/tree/master/tools/tpch-tools" target="_blank" rel="noopener noreferrer">tpch-tools</a> tool:</p><pre><code class="language-shell">sh build-tpch-dbgen.sh
</code></pre><p>After successful installation, the <code>dbgen</code> binary will be generated under the <code>TPC-H_Tools_v3.0.0/</code> directory.</p>
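<p>If you prefer to drive the generator by hand rather than through the wrapper script below, a minimal sketch using the standard TPC-H <code>dbgen</code> flags looks like this (the chunking flags are optional; paths are assumptions based on the directory above):</p><pre><code class="language-shell">cd TPC-H_Tools_v3.0.0/
# Generate the full scale-factor-100 (~100 GB) data set as .tbl files:
./dbgen -s 100
# Or generate in parallel chunks, e.g. the 1st of 10 chunks:
./dbgen -s 100 -C 10 -S 1
</code></pre>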
<h3>7.2 Generate the TPC-H Test Set</h3><p>Execute the following script to generate the TPC-H dataset:</p><pre><code class="language-shell">sh gen-tpch-data.sh
</code></pre><blockquote><p>Note 1: Check the script help via <code>sh gen-tpch-data.sh -h</code>.</p><p>Note 2: The data will be generated under the <code>tpch-data/</code> directory with the suffix <code>.tbl</code>. The total file size is about 100 GB, and generation may take from a few minutes to an hour.</p><p>Note 3: A standard 100G test data set is generated by default.</p></blockquote><h3>7.3 Create Tables</h3><h4>7.3.1 Prepare the <code>doris-cluster.conf</code> File</h4><p>Before running the import script, write the FE's IP, port, and other connection information into the <code>doris-cluster.conf</code> file.</p><p>The file is located at the same level as <code>load-tpch-data.sh</code>.</p><p>The file contains the FE's IP, HTTP port, user name, password, and the name of the database into which the data will be imported:</p><pre><code class="language-shell"># Any FE host
export FE_HOST='127.0.0.1'
# http_port in fe.conf
export FE_HTTP_PORT=8030
# query_port in fe.conf
export FE_QUERY_PORT=9030
# Doris username
export USER='root'
# Doris password
export PASSWORD=''
# The database where the TPC-H tables are located
export DB='tpch1'
</code></pre>
class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable environment constant" style="color:rgb(189, 147, 249);font-style:italic">USER</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">'root'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># Doris password</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">PASSWORD</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">''</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token comment" style="color:rgb(98, 114, 164)"># The database where TPC-H tables located</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token builtin class-name" style="color:rgb(189, 147, 249)">export</span><span class="token plain"> </span><span class="token assign-left variable" style="color:rgb(189, 147, 249);font-style:italic">DB</span><span class="token operator">=</span><span class="token string" style="color:rgb(255, 121, 198)">'tpch1'</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h4 class="anchor anchorWithStickyNavbar_LWe7" id="execute-the-following-script-to-generate-and-create-tpc-h-table">Execute the Following Script to Generate and Create TPC-H Table<a href="#execute-the-following-script-to-generate-and-create-tpc-h-table" class="hash-link" aria-label="Execute the Following Script to Generate and Create TPC-H Table的直接链接" title="Execute the Following Script to Generate and Create TPC-H Table的直接链接"></a></h4><div class="language-shell codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-shell codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token function" style="color:rgb(80, 250, 123)">sh</span><span class="token plain"> create-tpch-tables.sh</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 
<h3>7.4 Import Data</h3><p>Perform the data import with the following command:</p><pre><code class="language-shell">sh ./load-tpch-data.sh
</code></pre>
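<p>Under the hood the script loads each <code>.tbl</code> file via Doris Stream Load. A hand-rolled equivalent for a single file would look roughly like the sketch below; the <code>columns</code> header with a throwaway <code>temp</code> column is one way to absorb the trailing <code>|</code> that dbgen emits, and the exact headers used by the script may differ:</p><pre><code class="language-shell"># Sketch: stream-load one .tbl file using the settings from doris-cluster.conf
source doris-cluster.conf
curl --location-trusted -u ${USER}:${PASSWORD} \
    -H "column_separator:|" \
    -H "columns: r_regionkey, r_name, r_comment, temp" \
    -T tpch-data/region.tbl \
    http://${FE_HOST}:${FE_HTTP_PORT}/api/${DB}/region/_stream_load
</code></pre>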
248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> orders</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> partsupp</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> part</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> customer</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> supplier</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 
123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> nation</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> region</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"></span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">select</span><span class="token plain"> </span><span class="token function" style="color:rgb(80, 250, 123)">count</span><span class="token punctuation" style="color:rgb(248, 248, 242)">(</span><span class="token operator">*</span><span class="token punctuation" style="color:rgb(248, 248, 242)">)</span><span class="token plain"> </span><span class="token keyword" style="color:rgb(189, 147, 249);font-style:italic">from</span><span class="token plain"> revenue0</span><span class="token punctuation" style="color:rgb(248, 248, 242)">;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="76-query-test">7.6 Query Test<a href="#76-query-test" class="hash-link" aria-label="7.6 Query Test的直接链接" title="7.6 Query Test的直接链接"></a></h3><h4 class="anchor anchorWithStickyNavbar_LWe7" id="761-executing-query-scripts">7.6.1 Executing Query Scripts<a href="#761-executing-query-scripts" class="hash-link" aria-label="7.6.1 Executing Query Scripts的直接链接" title="7.6.1 Executing Query Scripts的直接链接"></a></h4><p>Execute the above test SQL or execute the following command</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">./run-tpch-queries.sh</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span 
class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><blockquote><p>Notice:</p><ol><li><p>At present, the query optimizer and statistics functions of Doris are not so perfect, so we rewrite some queries in TPC-H to adapt to the execution framework of Doris, but it does not affect the correctness of the results</p></li><li><p>Doris' new query optimizer will be released in future versions</p></li><li><p>Set <code>set mem_exec_limit=8G</code> before executing the query</p></li></ol></blockquote><h4 class="anchor anchorWithStickyNavbar_LWe7" id="762-single-sql-execution">7.6.2 Single SQL Execution<a href="#762-single-sql-execution" class="hash-link" aria-label="7.6.2 Single SQL Execution的直接链接" title="7.6.2 Single SQL Execution的直接链接"></a></h4><p>The following is the SQL statement used in the test, you can also get the latest SQL from the code base.</p><div class="language-SQL codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-SQL codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=false) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_returnflag,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_linestatus,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_quantity) as sum_qty,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice) as sum_base_price,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> avg(l_quantity) as avg_qty,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> avg(l_extendedprice) as avg_price,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> avg(l_discount) as avg_disc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count(*) as count_order</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"> l_shipdate &lt;= date '1998-12-01' - interval '90' day</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_returnflag,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_linestatus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_returnflag,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_linestatus;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=1, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_acctbal,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_mfgr,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_address,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_phone,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_comment</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partsupp join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ps_partkey as a_partkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> min(ps_supplycost) as a_min</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partsupp,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> region</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = ps_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and 
s_suppkey = ps_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and s_nationkey = n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and n_regionkey = r_regionkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and r_name = 'EUROPE'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_size = 15</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_type like '%BRASS'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> group by a_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) A on ps_partkey = a_partkey and ps_supplycost=a_min ,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> region</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = ps_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and s_suppkey = ps_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_size = 15</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_type like '%BRASS'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and s_nationkey = n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and n_regionkey = r_regionkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and r_name = 'EUROPE'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_acctbal desc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">limit 100;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=false, enable_projection=true, runtime_filter_wait_time_ms=10000) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 
l_orderkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice * (1 - l_discount)) as revenue,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderdate,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_shippriority</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select l_orderkey, l_extendedprice, l_discount, o_orderdate, o_shippriority, o_custkey from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem join orders</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where l_orderkey = o_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate &lt; date '1995-03-15'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate &gt; date '1995-03-15'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t1 join customer c </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on c.c_custkey = t1.o_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where c_mktsegment = 'BUILDING'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_orderkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderdate,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_shippriority</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> revenue desc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">limit 10;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q4</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderpriority,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count(*) as order_count</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 
*</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where l_commitdate &lt; l_receiptdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> right semi join orders</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on t1.l_orderkey = o_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderdate &gt;= date '1993-07-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate &lt; date '1993-07-01' + interval '3' month</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderpriority</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderpriority;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q5</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice * (1 - l_discount)) as revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> customer,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> region</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey = o_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_orderkey = o_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_suppkey = s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and c_nationkey = s_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"> and s_nationkey = n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and n_regionkey = r_regionkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and r_name = 'ASIA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate &gt;= date '1994-01-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate &lt; date '1994-01-01' + interval '1' year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> revenue desc;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q6</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=1, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice * l_discount) as revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_shipdate &gt;= date '1994-01-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate &lt; date '1994-01-01' + interval '1' year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_discount between .06 - 0.01 and .06 + 0.01</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_quantity &lt; 24;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q7</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=458589934592, parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supp_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cust_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(volume) as revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n1.n_name as supp_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n2.n_name as cust_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> extract(year from l_shipdate) as l_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_extendedprice * (1 - l_discount) as volume</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> customer,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation n1,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation n2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_suppkey = l_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderkey = l_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and c_custkey = o_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and s_nationkey = n1.n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and c_nationkey = n2.n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate between date '1995-01-01' and date '1996-12-31'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) as shipping</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supp_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cust_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supp_nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cust_nation,</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> l_year;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q8</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(case</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> when nation = 'BRAZIL' then volume</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> else 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> end) / sum(volume) as mkt_share</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> extract(year from o_orderdate) as o_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_extendedprice * (1 - l_discount) as volume,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n2.n_name as nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> customer,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation n1,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation n2,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> region</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and s_suppkey = l_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_orderkey = o_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_custkey = c_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and c_nationkey = n1.n_nationkey</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> and n1.n_regionkey = r_regionkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and r_name = 'AMERICA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and s_nationkey = n2.n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate between date '1995-01-01' and date '1996-12-31'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_type = 'ECONOMY ANODIZED STEEL'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) as all_nations</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_year;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q9</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select/*+SET_VAR(exec_mem_limit=37179869184, parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=true, enable_remove_no_conjuncts_runtime_filter_policy=true, runtime_filter_wait_time_ms=100000) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(amount) as sum_profit</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name as nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> extract(year from o_orderdate) as o_year,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem join orders on o_orderkey = l_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> join[shuffle] part on p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> join[shuffle] partsupp on ps_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> join[shuffle] supplier on s_suppkey = l_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> join[broadcast] nation on s_nationkey = n_nationkey</span><br></span><span 
class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ps_suppkey = l_suppkey and </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_name like '%green%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) as profit</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_year desc;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(t1.l_extendedprice * (1 - t1.l_discount)) as revenue,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_acctbal,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_address,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_phone,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_comment</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> customer,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select o_custkey,l_extendedprice,l_discount from lineitem, orders</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where l_orderkey = o_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate &gt;= date '1993-10-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderdate &lt; date '1993-10-01' + interval '3' month</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_returnflag = 'R'</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"> ) t1,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey = t1.o_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and c_nationkey = n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_acctbal,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_phone,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> n_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_address,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_comment</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> revenue desc</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">limit 20;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q11</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ps_partkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(ps_supplycost * ps_availqty) as value</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partsupp,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from supplier, nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where s_nationkey = n_nationkey and n_name = 'GERMANY'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) B</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ps_suppkey = B.s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ps_partkey having</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> sum(ps_supplycost * ps_availqty) &gt; (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(ps_supplycost * ps_availqty) * 0.000002</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partsupp,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (select s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from supplier, nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where s_nationkey = n_nationkey and n_name = 'GERMANY'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) A</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ps_suppkey = A.s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> value desc;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q12</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_shipmode,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(case</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> when o_orderpriority = '1-URGENT'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> or o_orderpriority = '2-HIGH'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> then 1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> else 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> end) as high_line_count,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(case</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> when o_orderpriority &lt;&gt; '1-URGENT'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_orderpriority &lt;&gt; '2-HIGH'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> then 1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> else 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> end) 
as low_line_count</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> o_orderkey = l_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipmode in ('MAIL', 'SHIP')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_commitdate &lt; l_receiptdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate &lt; l_commitdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_receiptdate &gt;= date '1994-01-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_receiptdate &lt; date '1994-01-01' + interval '1' year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_shipmode</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_shipmode;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q13</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=45899345920, parallel_fragment_exec_instance_num=16, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_count,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count(*) as custdist</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count(o_orderkey) as c_count</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders right outer join customer on</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey = o_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and o_comment not like '%special%requests%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> ) as c_orders</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_count</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> custdist desc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_count desc;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q14</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=true, enable_projection=true, runtime_filter_mode=OFF) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 100.00 * sum(case</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> when p_type like 'PROMO%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> then l_extendedprice * (1 - l_discount)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> else 0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_partkey = p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate &gt;= date '1995-09-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate &lt; date '1995-09-01' + interval '1' month;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q15</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_suppkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_address,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain"> s_phone,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> total_revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> revenue0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_suppkey = supplier_no</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and total_revenue = (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> max(total_revenue)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> revenue0</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_suppkey;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q16</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=8, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_brand,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_type,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_size,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count(distinct ps_suppkey) as supplier_cnt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> partsupp,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = ps_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_brand &lt;&gt; 'Brand#45'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_type not like 'MEDIUM POLISHED%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_size in (49, 14, 23, 45, 19, 3, 36, 9)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and ps_suppkey not in (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> s_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> s_comment like '%Customer%Complaints%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_brand,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_type,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_size</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier_cnt desc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_brand,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_type,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_size;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q17</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=1, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice) / 7.0 as avg_yearly</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem join [broadcast]</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part p1 on p1.p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p1.p_brand = 'Brand#23'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p1.p_container = 'MED BOX'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_quantity &lt; (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> 0.2 * avg(l_quantity)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem join [broadcast]</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part p2 on p2.p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_partkey = p1.p_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p2.p_brand = 'Brand#23'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p2.p_container = 'MED BOX'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> );</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q18</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=45899345920, parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_orderkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_orderdate,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_totalprice,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(t3.l_quantity)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">customer join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select * from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select * from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders left semi join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> l_orderkey having sum(l_quantity) &gt; 300</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on o_orderkey = t1.l_orderkey</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> ) t2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on t2.o_orderkey = l_orderkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) t3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">on c_custkey = t3.o_custkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_name,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_custkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_orderkey,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_orderdate,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_totalprice</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_totalprice desc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.o_orderdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">limit 100;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q19</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=false, enable_cost_based_join_reorder=false, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(l_extendedprice* (1 - l_discount)) as revenue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> part</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_brand = 'Brand#12'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_quantity &gt;= 1 and l_quantity &lt;= 1 + 10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_size between 1 and 5</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipmode in ('AIR', 'AIR REG')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token 
plain"> and l_shipinstruct = 'DELIVER IN PERSON'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> or</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_brand = 'Brand#23'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_quantity &gt;= 10 and l_quantity &lt;= 10 + 10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_size between 1 and 10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipmode in ('AIR', 'AIR REG')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipinstruct = 'DELIVER IN PERSON'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> )</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> or</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> p_partkey = l_partkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_brand = 'Brand#34'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_quantity &gt;= 20 and l_quantity &lt;= 20 + 10</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and p_size between 1 and 15</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipmode in ('AIR', 'AIR REG')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipinstruct = 'DELIVER IN PERSON'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> );</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q20</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=true, enable_projection=true, runtime_bloom_filter_size=551943) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">s_name, s_address from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">supplier left semi join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">(</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select * from</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select l_partkey,l_suppkey, 0.5 * sum(l_quantity) as l_q</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from lineitem</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where l_shipdate &gt;= date '1994-01-01'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and l_shipdate &lt; date '1994-01-01' + interval '1' year</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> group by l_partkey,l_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t2 join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select ps_partkey, ps_suppkey, ps_availqty</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from partsupp left semi join part</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on ps_partkey = p_partkey and p_name like 'forest%'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on t2.l_partkey = t1.ps_partkey and t2.l_suppkey = t1.ps_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and t1.ps_availqty &gt; t2.l_q</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">) t3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">on s_suppkey = t3.ps_suppkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">join nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">where s_nationkey = n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and n_name = 'CANADA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by s_name;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q21</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, disable_join_reorder=true, enable_cost_based_join_reorder=true, enable_projection=true) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">s_name, count(*) as numwait</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem l2 right semi join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select * from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> lineitem l3 right anti join</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select * from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders join lineitem l1 on l1.l_orderkey = o_orderkey and o_orderstatus = 'F'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> join</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select * from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> supplier join nation</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where s_nationkey = n_nationkey</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> and n_name = 'SAUDI ARABIA'</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t1</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where t1.s_suppkey = l1.l_suppkey and l1.l_receiptdate &gt; l1.l_commitdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t2</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on l3.l_orderkey = t2.l_orderkey and l3.l_suppkey &lt;&gt; t2.l_suppkey and l3.l_receiptdate &gt; l3.l_commitdate</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) t3</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> on l2.l_orderkey = t3.l_orderkey and l2.l_suppkey &lt;&gt; t3.l_suppkey </span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.s_name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> numwait desc,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> t3.s_name</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">limit 100;</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">--Q22</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">with tmp as (select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> avg(c_acctbal) as av</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> customer</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_acctbal &gt; 0.00</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain"> and substring(c_phone, 1, 2) in</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('13', '31', '23', '29', '30', '18', '17'))</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">select /*+SET_VAR(exec_mem_limit=8589934592, parallel_fragment_exec_instance_num=4,runtime_bloom_filter_size=4194304) */</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cntrycode,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> count(*) as numcust,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> sum(c_acctbal) as totacctbal</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> (</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> select</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> substring(c_phone, 1, 2) as cntrycode,</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> c_acctbal</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> from</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> orders right anti join customer c on o_custkey = c.c_custkey join tmp on c.c_acctbal &gt; tmp.av</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> where</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> substring(c_phone, 1, 2) in</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ('13', '31', '23', '29', '30', '18', '17')</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> ) as custsale</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">group by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cntrycode</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">order by</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain"> cntrycode;</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.1.4]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.1.4</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.1.4"/>
<updated>2022-11-11T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris team has fixed about 60 issues or performance improvements in version 1.1.4 compared to previous verisons]]></summary>
<content type="html"><![CDATA[<p>In this release, Doris Team has fixed about 60 issues or performance improvement since 1.1.3. This release is a bugfix release on 1.1 and all users are encouraged to upgrade to this release.</p><h1>Features</h1><ul><li><p>Support obs broker load for Huawei Cloud. <a href="https://github.com/apache/doris/pull/13523" target="_blank" rel="noopener noreferrer">#13523</a></p></li><li><p>SparkLoad support parquet and orc file.<a href="https://github.com/apache/doris/pull/13438" target="_blank" rel="noopener noreferrer">#13438</a></p></li></ul><h1>Improvements</h1><ul><li>Do not acquire mutex in metric hook since it will affect query performance during heavy load.<a href="https://github.com/apache/doris/pull/10941" target="_blank" rel="noopener noreferrer">#10941</a></li></ul><h1>BugFix</h1><ul><li><p>The where condition does not take effect when spark load loads the file. <a href="https://github.com/apache/doris/pull/13804" target="_blank" rel="noopener noreferrer">#13804</a></p></li><li><p>If function return error result when there is nullable column in vectorized mode. <a href="https://github.com/apache/doris/pull/13779" target="_blank" rel="noopener noreferrer">#13779</a></p></li><li><p>Fix incorrect result when using anti join with other join predicates. <a href="https://github.com/apache/doris/pull/13743" target="_blank" rel="noopener noreferrer">#13743</a></p></li><li><p>BE crash when call function concat(ifnull). <a href="https://github.com/apache/doris/pull/13693" target="_blank" rel="noopener noreferrer">#13693</a></p></li><li><p>Fix planner bug when there is a function in group by clause. <a href="https://github.com/apache/doris/pull/13613" target="_blank" rel="noopener noreferrer">#13613</a></p></li><li><p>Table name and column name is not recognized correctly in lateral view clause. <a href="https://github.com/apache/doris/pull/13600" target="_blank" rel="noopener noreferrer">#13600</a></p></li><li><p>Unknown column when use MV and table alias. <a href="https://github.com/apache/doris/pull/13605" target="_blank" rel="noopener noreferrer">#13605</a></p></li><li><p>JSONReader release memory of both value and parse allocator. <a href="https://github.com/apache/doris/pull/13513" target="_blank" rel="noopener noreferrer">#13513</a></p></li><li><p>Fix allow create mv using to_bitmap() on negative value columns when enable_vectorized_alter_table is true. <a href="https://github.com/apache/doris/pull/13448" target="_blank" rel="noopener noreferrer">#13448</a></p></li><li><p>Microsecond in function from_date_format_str is lost. <a href="https://github.com/apache/doris/pull/13446" target="_blank" rel="noopener noreferrer">#13446</a></p></li><li><p>Sort exprs nullability property may not be right after subsitute using child's smap info. <a href="https://github.com/apache/doris/pull/13328" target="_blank" rel="noopener noreferrer">#13328</a></p></li><li><p>Fix core dump on case when have 1000 condition. <a href="https://github.com/apache/doris/pull/13315" target="_blank" rel="noopener noreferrer">#13315</a></p></li><li><p>Fix bug that last line of data lost for stream load. <a href="https://github.com/apache/doris/pull/13066" target="_blank" rel="noopener noreferrer">#13066</a></p></li><li><p>Restore table or partition with the same replication num as before the backup. <a href="https://github.com/apache/doris/pull/11942" target="_blank" rel="noopener noreferrer">#11942</a></p></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.1.3]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.1.3</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.1.3"/>
<updated>2022-10-17T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris team has fixed more than 80 issues or performance improvements in version 1.1.3 compared to previous verisons]]></summary>
<content type="html"><![CDATA[<p>In this release, Doris Team has fixed more than 80 issues or performance improvement since 1.1.2. This release is a bugfix release on 1.1 and all users are encouraged to upgrade to this release.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="features">Features<a href="#features" class="hash-link" aria-label="Features的直接链接" title="Features的直接链接"></a></h2><ul><li><p>Support escape identifiers for sqlserver and postgresql in ODBC table.</p></li><li><p>Could use Parquet as output file format.</p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><ul><li><p>Optimize flush policy to avoid small segments. <a href="https://github.com/apache/doris/pull/12706" target="_blank" rel="noopener noreferrer">#12706</a> <a href="https://github.com/apache/doris/pull/12716" target="_blank" rel="noopener noreferrer">#12716</a></p></li><li><p>Refactor runtime filter to reduce the prepare time. <a href="https://github.com/apache/doris/pull/13127" target="_blank" rel="noopener noreferrer">#13127</a></p></li><li><p>Lots of memory control related issues during query or load process. <a href="https://github.com/apache/doris/pull/12682" target="_blank" rel="noopener noreferrer">#12682</a> <a href="https://github.com/apache/doris/pull/12688" target="_blank" rel="noopener noreferrer">#12688</a> <a href="https://github.com/apache/doris/pull/12708" target="_blank" rel="noopener noreferrer">#12708</a> <a href="https://github.com/apache/doris/pull/12776" target="_blank" rel="noopener noreferrer">#12776</a> <a href="https://github.com/apache/doris/pull/12782" target="_blank" rel="noopener noreferrer">#12782</a> <a href="https://github.com/apache/doris/pull/12791" target="_blank" rel="noopener noreferrer">#12791</a> <a href="https://github.com/apache/doris/pull/12794" target="_blank" rel="noopener noreferrer">#12794</a> <a href="https://github.com/apache/doris/pull/12820" target="_blank" rel="noopener noreferrer">#12820</a> <a href="https://github.com/apache/doris/pull/12932" target="_blank" rel="noopener noreferrer">#12932</a> <a href="https://github.com/apache/doris/pull/12954" target="_blank" rel="noopener noreferrer">#12954</a> <a href="https://github.com/apache/doris/pull/12951" target="_blank" rel="noopener noreferrer">#12951</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><ul><li><p>Core dump on compaction with largeint. <a href="https://github.com/apache/doris/pull/10094" target="_blank" rel="noopener noreferrer">#10094</a></p></li><li><p>Grouping sets cause be core or return wrong results. <a href="https://github.com/apache/doris/pull/12313" target="_blank" rel="noopener noreferrer">#12313</a></p></li><li><p>PREAGGREGATION flag in orthogonal_bitmap_union_count operator is wrong. <a href="https://github.com/apache/doris/pull/12581" target="_blank" rel="noopener noreferrer">#12581</a></p></li><li><p>Level1Iterator should release iterators in heap and it may cause memory leak. <a href="https://github.com/apache/doris/pull/12592" target="_blank" rel="noopener noreferrer">#12592</a></p></li><li><p>Fix decommission failure with 2 BEs and existing colocation table. 
<a href="https://github.com/apache/doris/pull/12644" target="_blank" rel="noopener noreferrer">#12644</a></p></li><li><p>BE may core dump because of stack-buffer-overflow when TBrokerOpenReaderResponse too large. <a href="https://github.com/apache/doris/pull/12658" target="_blank" rel="noopener noreferrer">#12658</a></p></li><li><p>BE may OOM during load when error code -238 occurs. <a href="https://github.com/apache/doris/pull/12666" target="_blank" rel="noopener noreferrer">#12666</a></p></li><li><p>Fix wrong child expression of lead function. <a href="https://github.com/apache/doris/pull/12587" target="_blank" rel="noopener noreferrer">#12587</a></p></li><li><p>Fix intersect query failed in row storage code. <a href="https://github.com/apache/doris/pull/12712" target="_blank" rel="noopener noreferrer">#12712</a></p></li><li><p>Fix wrong result produced by curdate()/current_date() function. <a href="https://github.com/apache/doris/pull/12720" target="_blank" rel="noopener noreferrer">#12720</a></p></li><li><p>Fix lateral view explode_split with temp table bug. <a href="https://github.com/apache/doris/pull/13643" target="_blank" rel="noopener noreferrer">#13643</a></p></li><li><p>Bucket shuffle join plan is wrong in two same table. <a href="https://github.com/apache/doris/pull/12930" target="_blank" rel="noopener noreferrer">#12930</a></p></li><li><p>Fix bug that tablet version may be wrong when doing alter and load. <a href="https://github.com/apache/doris/pull/13070" target="_blank" rel="noopener noreferrer">#13070</a></p></li><li><p>BE core when load data using broker with md5sum()/sm3sum(). <a href="https://github.com/apache/doris/pull/13009" target="_blank" rel="noopener noreferrer">#13009</a></p></li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-notes">Upgrade Notes<a href="#upgrade-notes" class="hash-link" aria-label="Upgrade Notes的直接链接" title="Upgrade Notes的直接链接"></a></h2><p>PageCache and ChunkAllocator are disabled by default to reduce memory usage and can be re-enabled by modifying the configuration items <code>disable_storage_page_cache</code> and <code>chunk_reserved_bytes_limit</code>.</p><p>Storage Page Cache and Chunk Allocator cache user data chunks and memory preallocation, respectively.</p><p>These two functions take up a certain percentage of memory and are not freed. This part of memory cannot be flexibly allocated, which may lead to insufficient memory for other tasks in some scenarios, affecting system stability and availability. Therefore, we disabled these two features by default in version 1.1.3.</p><p>However, in some latency-sensitive reporting scenarios, turning off this feature may lead to increased query latency. 
If you are worried about the impact on your business after upgrading, you can add the following parameters to be.conf to keep the same behavior as previous versions.</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">disable_storage_page_cache=false</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">chunk_reserved_bytes_limit=10%</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><ul><li><code>disable_storage_page_cache</code>: Whether to disable the Storage Page Cache. In version 1.1.2 and earlier, the default is false, i.e., the cache is on; since version 1.1.3, the default is true, i.e., the cache is off.</li><li><code>chunk_reserved_bytes_limit</code>: The memory size reserved by the Chunk Allocator. In version 1.1.2 and earlier, the default is 10% of the overall memory; since version 1.1.3, the default is 209715200 (200 MB).</li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.1.1]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.1.1</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.1.1"/>
<updated>2022-09-13T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris 1.1.1 is now available, with several enhancements and bug fixes based on 1.1.0,enabling smoother user experience.]]></summary>
<content type="html"><![CDATA[<h2 class="anchor anchorWithStickyNavbar_LWe7" id="features">Features<a href="#features" class="hash-link" aria-label="Features的直接链接" title="Features的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-odbc-sink-in-vectorized-engine">Support ODBC Sink in Vectorized Engine.<a href="#support-odbc-sink-in-vectorized-engine" class="hash-link" aria-label="Support ODBC Sink in Vectorized Engine.的直接链接" title="Support ODBC Sink in Vectorized Engine.的直接链接"></a></h3><p>This feature is enabled in non-vectorized engine but it is missed in vectorized engine in 1.1. So that we add back this feature in 1.1.1.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="simple-memtracker-for-vectorized-engine">Simple Memtracker for Vectorized Engine.<a href="#simple-memtracker-for-vectorized-engine" class="hash-link" aria-label="Simple Memtracker for Vectorized Engine.的直接链接" title="Simple Memtracker for Vectorized Engine.的直接链接"></a></h3><p>There is no memtracker in BE for vectorized engine in 1.1, so that the memory is out of control and cause OOM. In 1.1.1, a simple memtracker is added to BE and could control the memory and cancel the query when memory exceeded.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="cache-decompressed-data-in-page-cache">Cache decompressed data in page cache.<a href="#cache-decompressed-data-in-page-cache" class="hash-link" aria-label="Cache decompressed data in page cache.的直接链接" title="Cache decompressed data in page cache.的直接链接"></a></h3><p>Some data is compressed using bitshuffle and it costs a lot of time to decompress it during query. 
In 1.1.1, Doris caches the decompressed bitshuffle-encoded data in the page cache to accelerate queries, and we found it could reduce latency by 30% for some queries in SSB-Flat.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="fix-the-problem-that-could-not-do-rolling-upgrade-from-10serious">Fix the problem that rolling upgrade from 1.0 could not be done. (Serious)<a href="#fix-the-problem-that-could-not-do-rolling-upgrade-from-10serious" class="hash-link" aria-label="Fix the problem that could not do rolling upgrade from 1.0.(Serious)的直接链接" title="Fix the problem that could not do rolling upgrade from 1.0.(Serious)的直接链接"></a></h3><p>This issue was introduced in version 1.1 and may cause a BE core dump when the BE is upgraded but the FE is not.</p><p>If you encounter this problem, you can try to fix it with <a href="https://github.com/apache/doris/pull/10833" target="_blank" rel="noopener noreferrer">#10833</a>.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="fix-the-problem-that-some-query-not-fall-back-to-non-vectorized-engine-and-be-will-core">Fix the problem that some queries do not fall back to the non-vectorized engine, causing BE core dumps.<a href="#fix-the-problem-that-some-query-not-fall-back-to-non-vectorized-engine-and-be-will-core" class="hash-link" aria-label="Fix the problem that some query not fall back to non-vectorized engine, and BE will core.的直接链接" title="Fix the problem that some query not fall back to non-vectorized engine, and BE will core.的直接链接"></a></h3><p>Currently, the vectorized engine cannot handle all SQL queries, so some (like left outer join) run on the non-vectorized engine instead. Some of these cases were not covered in 1.1, which could cause the BE to crash.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="compaction-not-work-correctly-and-cause--235-error">Compaction does not work correctly and causes -235 errors.<a href="#compaction-not-work-correctly-and-cause--235-error" class="hash-link" aria-label="Compaction not work correctly and cause -235 Error.的直接链接" title="Compaction not work correctly and cause -235 Error.的直接链接"></a></h3><p>When a rowset has multiple segments in unique-key compaction, segment rows are merged in generic_iterator but merged_rows is not increased. Compaction then fails in check_correctness, leaving the tablet with too many versions, which leads to the -235 load error.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="some-segment-fault-cases-during-query">Some segmentation fault cases during queries.<a href="#some-segment-fault-cases-during-query" class="hash-link" aria-label="Some segment fault cases during query.的直接链接" title="Some segment fault cases during query.的直接链接"></a></h3><p><a href="https://github.com/apache/doris/pull/10961" target="_blank" rel="noopener noreferrer">#10961</a>
<a href="https://github.com/apache/doris/pull/10954" target="_blank" rel="noopener noreferrer">#10954</a>
<a href="https://github.com/apache/doris/pull/10962" target="_blank" rel="noopener noreferrer">#10962</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="thanks">Thanks<a href="#thanks" class="hash-link" aria-label="Thanks的直接链接" title="Thanks的直接链接"></a></h2><p>Thanks to everyone who has contributed to this release:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jacktengg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@mrhhsg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xinyiZzz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yixiutt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@starocean999</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morrySnow</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morningman</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@HappenLee</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.1.2]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.1.2</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.1.2"/>
<updated>2022-09-13T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, Apache Doris team has fixed more than 170 issues or performance improvements in version 1.1.2 compared to previous verisons]]></summary>
<content type="html"><![CDATA[<p>In this release, Doris Team has fixed more than 170 issues or performance improvement since 1.1.1. This release is a bugfix release on 1.1 and all users are encouraged to upgrade to this release.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="features">Features<a href="#features" class="hash-link" aria-label="Features的直接链接" title="Features的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-memtracker">New MemTracker<a href="#new-memtracker" class="hash-link" aria-label="New MemTracker的直接链接" title="New MemTracker的直接链接"></a></h3><p>Introduced new MemTracker for both vectorized engine and non-vectorized engine which is more accurate.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="add-api-for-showing-current-queries-and-kill-query">Add API for showing current queries and kill query<a href="#add-api-for-showing-current-queries-and-kill-query" class="hash-link" aria-label="Add API for showing current queries and kill query的直接链接" title="Add API for showing current queries and kill query的直接链接"></a></h3><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-readwrite-emoji-of-utf16-via-odbc-table">Support read/write emoji of UTF16 via ODBC Table<a href="#support-readwrite-emoji-of-utf16-via-odbc-table" class="hash-link" aria-label="Support read/write emoji of UTF16 via ODBC Table的直接链接" title="Support read/write emoji of UTF16 via ODBC Table的直接链接"></a></h3><h2 class="anchor anchorWithStickyNavbar_LWe7" id="improvements">Improvements<a href="#improvements" class="hash-link" aria-label="Improvements的直接链接" title="Improvements的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="data-lake-related-improvements">Data Lake related improvements<a href="#data-lake-related-improvements" class="hash-link" aria-label="Data Lake related improvements的直接链接" title="Data Lake related improvements的直接链接"></a></h3><ul><li><p>Improved HDFS ORC File scan performance about 300%. <a href="https://github.com/apache/doris/pull/11501" target="_blank" rel="noopener noreferrer">#11501</a></p></li><li><p>Support HDFS HA mode when query Iceberg table.</p></li><li><p>Support query Hive data created by <a href="https://tez.apache.org/" target="_blank" rel="noopener noreferrer">Apache Tez</a></p></li><li><p>Add Ali OSS as Hive external support.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="add-support-for-string-and-text-type-in-spark-load">Add support for string and text type in Spark Load<a href="#add-support-for-string-and-text-type-in-spark-load" class="hash-link" aria-label="Add support for string and text type in Spark Load的直接链接" title="Add support for string and text type in Spark Load的直接链接"></a></h3><h3 class="anchor anchorWithStickyNavbar_LWe7" id="add-reuse-block-in-non-vectorized-engine-and-have-50-performance-improvement-in-some-cases-11392">Add reuse block in non-vectorized engine and have 50% performance improvement in some cases. 
<a href="https://github.com/apache/doris/pull/11392" target="_blank" rel="noopener noreferrer">#11392</a><a href="#add-reuse-block-in-non-vectorized-engine-and-have-50-performance-improvement-in-some-cases-11392" class="hash-link" aria-label="add-reuse-block-in-non-vectorized-engine-and-have-50-performance-improvement-in-some-cases-11392的直接链接" title="add-reuse-block-in-non-vectorized-engine-and-have-50-performance-improvement-in-some-cases-11392的直接链接"></a></h3><h3 class="anchor anchorWithStickyNavbar_LWe7" id="improve-like-or-regex-performance">Improve like or regex performance<a href="#improve-like-or-regex-performance" class="hash-link" aria-label="Improve like or regex performance的直接链接" title="Improve like or regex performance的直接链接"></a></h3><h3 class="anchor anchorWithStickyNavbar_LWe7" id="disable-tcmallocs-aggressive_memory_decommit">Disable tcmalloc's aggressive_memory_decommit<a href="#disable-tcmallocs-aggressive_memory_decommit" class="hash-link" aria-label="Disable tcmalloc's aggressive_memory_decommit的直接链接" title="Disable tcmalloc's aggressive_memory_decommit的直接链接"></a></h3><p>It will have 40% performance gains in load or query.</p><p>Currently it is a config, you can change it by set config <code>tc_enable_aggressive_memory_decommit</code>.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="bug-fixes">Bug Fixes<a href="#bug-fixes" class="hash-link" aria-label="Bug Fixes的直接链接" title="Bug Fixes的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="some-issues-about-fe-that-will-cause-fe-failure-or-data-corrupt">Some issues about FE that will cause FE failure or data corrupt.<a href="#some-issues-about-fe-that-will-cause-fe-failure-or-data-corrupt" class="hash-link" aria-label="Some issues about FE that will cause FE failure or data corrupt.的直接链接" title="Some issues about FE that will cause FE failure or data corrupt.的直接链接"></a></h3><ul><li><p>Add reserved disk config to avoid too many reserved BDB-JE files.<strong>(Serious)</strong> In an HA environment, BDB JE will retains as many reserved files. The BDB-je log doesn't delete until approaching a disk limit.</p></li><li><p>Fix fatal bug in BDB-JE which will cause FE replica could not start correctly or data corrupted.<strong> (Serious)</strong></p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="fe-will-hang-on-waitfor_rpc-during-query-and-be-will-hang-in-high-concurrent-scenarios">Fe will hang on waitFor_rpc during query and BE will hang in high concurrent scenarios.<a href="#fe-will-hang-on-waitfor_rpc-during-query-and-be-will-hang-in-high-concurrent-scenarios" class="hash-link" aria-label="Fe will hang on waitFor_rpc during query and BE will hang in high concurrent scenarios.的直接链接" title="Fe will hang on waitFor_rpc during query and BE will hang in high concurrent scenarios.的直接链接"></a></h3><p><a href="https://github.com/apache/doris/pull/12459" target="_blank" rel="noopener noreferrer">#12459</a> <a href="https://github.com/apache/doris/pull/12458" target="_blank" rel="noopener noreferrer">#12458</a> <a href="https://github.com/apache/doris/pull/12392" target="_blank" rel="noopener noreferrer">#12392</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="a-fatal-issue-in-vectorized-storage-engine-which-will-cause-wrong-result-serious">A fatal issue in vectorized storage engine which will cause wrong result. 
<strong>(Serious)</strong><a href="#a-fatal-issue-in-vectorized-storage-engine-which-will-cause-wrong-result-serious" class="hash-link" aria-label="a-fatal-issue-in-vectorized-storage-engine-which-will-cause-wrong-result-serious的直接链接" title="a-fatal-issue-in-vectorized-storage-engine-which-will-cause-wrong-result-serious的直接链接"></a></h3><p><a href="https://github.com/apache/doris/pull/11754" target="_blank" rel="noopener noreferrer">#11754</a> <a href="https://github.com/apache/doris/pull/11694" target="_blank" rel="noopener noreferrer">#11694</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="lots-of-planner-related-issues-that-will-cause-be-core-or-in-abnormal-state">Lots of planner-related issues that will cause BE core dumps or abnormal states.<a href="#lots-of-planner-related-issues-that-will-cause-be-core-or-in-abnormal-state" class="hash-link" aria-label="Lots of planner related issues that will cause BE core or in abnormal state.的直接链接" title="Lots of planner related issues that will cause BE core or in abnormal state.的直接链接"></a></h3><p><a href="https://github.com/apache/doris/pull/12080" target="_blank" rel="noopener noreferrer">#12080</a> <a href="https://github.com/apache/doris/pull/12075" target="_blank" rel="noopener noreferrer">#12075</a> <a href="https://github.com/apache/doris/pull/12040" target="_blank" rel="noopener noreferrer">#12040</a> <a href="https://github.com/apache/doris/pull/12003" target="_blank" rel="noopener noreferrer">#12003</a> <a href="https://github.com/apache/doris/pull/12007" target="_blank" rel="noopener noreferrer">#12007</a> <a href="https://github.com/apache/doris/pull/11971" target="_blank" rel="noopener noreferrer">#11971</a> <a href="https://github.com/apache/doris/pull/11933" target="_blank" rel="noopener noreferrer">#11933</a> <a href="https://github.com/apache/doris/pull/11861" target="_blank" rel="noopener noreferrer">#11861</a> <a href="https://github.com/apache/doris/pull/11859" target="_blank" rel="noopener noreferrer">#11859</a> <a href="https://github.com/apache/doris/pull/11855" target="_blank" rel="noopener noreferrer">#11855</a> <a href="https://github.com/apache/doris/pull/11837" target="_blank" rel="noopener noreferrer">#11837</a> <a href="https://github.com/apache/doris/pull/11834" target="_blank" rel="noopener noreferrer">#11834</a> <a href="https://github.com/apache/doris/pull/11821" target="_blank" rel="noopener noreferrer">#11821</a> <a href="https://github.com/apache/doris/pull/11782" target="_blank" rel="noopener noreferrer">#11782</a> <a href="https://github.com/apache/doris/pull/11723" target="_blank" rel="noopener noreferrer">#11723</a> <a href="https://github.com/apache/doris/pull/11569" target="_blank" rel="noopener noreferrer">#11569</a></p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Doris stream load principle analysis]]></title>
<id>https://doris.apache.org/zh-CN/blog/principle-of-Doris-Stream-Load</id>
<link href="https://doris.apache.org/zh-CN/blog/principle-of-Doris-Stream-Load"/>
<updated>2022-09-08T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Stream Load, one of the most commonly used data import methods for Doris users, is a synchronous import method. It allows users to import data into Doris in batch through HTTP access and returns the results of data import.]]></summary>
<content type="html"><![CDATA[<p><strong>Lead:</strong></p><p>Stream Load, one of the most commonly used data import methods for Doris users, is a synchronous import method. It allows users to import data into Doris in batch through HTTP access and returns the results of data import. The user can not only directly judge whether the data import is successful through the return body of the HTTP request, but also query the results of historical tasks by executing query SQL on the client.</p><h1><strong>Introduction to Stream Load</strong></h1><p>The Doris import (Load) function is to import the user's original data into the Doris table. And Doris realizes a unified streaming import framework at the bottom. On this basis, Doris provides a very rich import mode to adapt to different data sources and data import requirements. Stream Load is one of the most commonly used data import methods for Doris users. It is a synchronous import method that allows users to import data in CSV format or JSON format into Doris in batch through HTTP access and return the results of data import. User can directly judge whether the data import is successful through the return body of the HTTP request, and can query the results of historical tasks by executing query SQL on the client. In addition, Doris also provides the operation audit function for Stream Load, which can audit the historical Stream Load task information through the audit log. The implementation principle of Stream Load will be deeply analyzed from the aspects of execution process, transaction management, implementation of import plan, data writing and operation audit of Stream Load.</p><h1>1 Implementation Process</h1><p>The user submits the HTTP request of Stream Load to the FE, and the FE will forward the data import request to a BE node through HTTP Redirect, which will be the Coordinator of this Stream Load task. In this process, the FE node receiving the request only provides forwarding service. The BE node as the Coordinator is actually responsible for the entire import job, such as sending transaction requests to the Master FE, obtaining import execution plans from the FE, receiving real-time data, distributing data to other Executor BE nodes, and returning results to the user after data import. The user can also submit the HTTP request of Stream Load directly to a specified BE node, and the node will act as the Coordinator of this Stream Load task. During the Stream Load process, the Executor BE node is responsible for writing data to the storage layer.</p><p>In the Coordinator BE, all HTTP requests, including Stream Load requests, are processed through a thread pool. A Stream Load task is uniquely identified by the imported Label. The principle block diagram of Stream Load is shown in Figure 1.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_1_en-b2fe685555585338cf6207b8d24a878e.png" width="1080" height="1044" class="img_ev3q"></p><p>The complete execution process of Stream Load is shown in Figure 2:</p><p>(1)The user submits the HTTP request of Stream Load to the FE (the user can also directly submit the HTTP request of Stream Load to the Coordinator BE).</p><p>(2)FE, after receiving the Stream Load request submitted by the user, will perform HTTP Header parsing (including the library, table, Label and other information imported by parsing data), and then perform user authentication. 
If the HTTP header is successfully parsed and user authentication passes, the FE forwards the Stream Load HTTP request to a BE node, which becomes the Coordinator of this Stream Load; otherwise, the FE directly returns the Stream Load failure message to the user.</p><p>(3) After receiving the Stream Load HTTP request, the Coordinator BE first parses the HTTP header and verifies the data, including the file format of the data, the size of the data body, the HTTP timeout, and user authentication. If this verification fails, the Stream Load failure message is returned directly to the user.</p><p>(4) After the HTTP header verification passes, the Coordinator BE sends a Begin Transaction request to the FE through Thrift RPC.</p><p>(5) After receiving the Begin Transaction request from the Coordinator BE, the FE starts a transaction and returns the Transaction ID to the Coordinator BE.</p><p>(6) After the Coordinator BE receives the Begin Transaction success message, it sends a request for the import plan to the FE through Thrift RPC.</p><p>(7) After receiving this request, the FE generates the import plan for the Stream Load task and returns it to the Coordinator BE.</p><p>(8) After receiving the import plan, the Coordinator BE starts executing it, receiving the real-time data over HTTP and distributing it to the other Executor BEs through BRPC.</p><p>(9) After receiving the real-time data distributed by the Coordinator BE, each Executor BE writes the data to the storage layer.</p><p>(10) After the Executor BEs complete the data writing, the Coordinator BE sends a Commit Transaction request to the FE through Thrift RPC.</p><p>(11) After receiving the Commit Transaction request, the FE commits the transaction, sends the Publish Version task to the Executor BEs, and waits for them to execute it.</p><p>(12) Each Executor BE asynchronously executes Publish Version to turn the Rowset generated by the data import into a visible data version.</p><p>(13) After Publish Version completes normally or times out, the FE returns the results of Commit Transaction and Publish Version to the Coordinator BE.</p><p>(14) The Coordinator BE returns the final result of the Stream Load to the user.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_2_en-c2ea39e56fb64fa30ef649c281ee5e67.png" width="1068" height="1461" class="img_ev3q"></p><h1>2 Transaction Management</h1><p>Doris ensures the atomicity of data import through transactions. One Stream Load task corresponds to one transaction. The FE is responsible for Stream Load's transaction management and receives the Thrift RPC transaction requests sent by the Coordinator BE through the FrontendService. The transaction request types are Begin Transaction, Commit Transaction, and Rollback Transaction. The transaction states of Doris are PREPARE, COMMITTED, VISIBLE, and ABORTED. The state flow of a Stream Load transaction is shown in Figure 3.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_3_en-afe100ea9995f8032cf312bb75825028.png" width="1080" height="165" class="img_ev3q"></p>
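<p>As an aside, the state of a given import transaction can be inspected from the client with the SHOW TRANSACTION statement (a hypothetical session; the database name and transaction ID are placeholders):</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">-- Hypothetical session: inspect the state of one import transaction.
-- 4005 stands in for the Transaction ID returned to the client.
SHOW TRANSACTION FROM example_db WHERE ID = 4005;
-- The TransactionStatus column shows PREPARE, COMMITTED, VISIBLE,
-- or ABORTED, matching the state flow described above.</code></pre></div></div>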
<p>The Coordinator BE sends a Begin Transaction request to the FE before the data import. The FE checks whether the label requested by the Begin Transaction already exists. If the label does not exist in the system, the FE opens a new transaction for the label, assigns it a Transaction ID, sets the transaction status to PREPARE, and then returns the Transaction ID and the Begin Transaction success message to the Coordinator BE. Otherwise, the transaction may be a repeated data import: the FE returns the Begin Transaction failure message to the Coordinator BE, and the Stream Load task exits.</p><p>After the data has been written on all Executor BE nodes, the Coordinator BE sends a Commit Transaction request to the FE. After receiving it, the FE executes the Commit Transaction and Publish Version operations. First, the FE checks, for each Tablet, whether the number of replicas successfully written exceeds half of the Tablet's total replicas. If so (a majority succeeded), the Commit Transaction succeeds and the transaction status is set to COMMITTED; otherwise, the Commit Transaction failure message is returned to the Coordinator BE. The COMMITTED status indicates that the data has been written successfully but is not yet visible; the Publish Version task still needs to be executed, and after this point the transaction cannot be rolled back.</p><p>The FE has a separate thread that executes Publish Version for successfully committed transactions. When executing Publish Version, the FE sends Publish Version requests through Thrift RPC to all Executor BE nodes involved in the transaction. The Publish Version task runs asynchronously on each Executor BE node and turns the Rowset generated by the data import into a visible data version. When all Publish Version tasks on the Executor BEs succeed, the FE sets the transaction status to VISIBLE and returns the Commit Transaction and Publish Version success messages to the Coordinator BE. When some Publish Version tasks fail, the FE repeatedly reissues the Publish Version request to the Executor BE nodes until the previously failed tasks succeed. If the transaction status has not been set to VISIBLE after a certain timeout period, the FE returns to the Coordinator BE the message that the Commit Transaction succeeded but Publish Version timed out (note that the data was still written successfully, but remains invisible, and the user needs to wait for the transaction status to finally become VISIBLE).</p><p>When obtaining the import plan from the FE fails, executing the data import fails, or the Commit Transaction fails, the Coordinator BE sends a Rollback Transaction request to the FE to roll back the transaction. After receiving it, the FE sets the transaction status to ABORTED and sends a Clear Transaction request to the Executor BEs through Thrift RPC. The Clear Transaction task executes asynchronously on the BE nodes, marking the Rowsets generated by the data import as unavailable. These Rowsets will be deleted from the BE later. 
Transactions in COMMITTED status (where the Commit Transaction succeeded but Publish Version timed out) cannot be rolled back.</p><h1>3 Execution of the Import Plan</h1><p>In the Doris BE, all execution plans are managed by FragmentMgr, and the execution of each import plan is managed by a PlanFragmentExecutor. After the BE obtains the import execution plan from the FE, it submits the plan to FragmentMgr's thread pool for execution. The import execution plan of Stream Load has only one Fragment, consisting of one BrokerScanNode and one OlapTableSink. The BrokerScanNode is responsible for reading the streaming data in real time and converting data rows from CSV or JSON format into Doris's Tuple format. The OlapTableSink is responsible for sending the real-time data to the corresponding Executor BE nodes. The Executor BE node for each data row is determined by the BE on which the row's Tablet is stored; the Partition and Tablet where a data row is stored can be determined from the row's PartitionKey and DistributionKey, and the BE node on which each Tablet and its replicas are stored was determined when the Table or Partition was created.</p><p>After the import execution plan is submitted to FragmentMgr's thread pool, the Stream Load thread receives the real-time data transmitted over HTTP in chunks and writes it to the StreamLoadPipe. The BrokerScanNode reads the real-time data in batches from the StreamLoadPipe, and the OlapTableSink sends the batches read by the BrokerScanNode to the Executor BEs through BRPC for data writing. After all the real-time data has been written to the StreamLoadPipe, the Stream Load thread waits for the import plan to finish.</p><p>The PlanFragmentExecutor executes a specific import plan in three stages: Prepare, Open, and Close. In the Prepare stage, the import execution plan from the FE is analyzed; in the Open stage, the BrokerScanNode and OlapTableSink are opened, where the BrokerScanNode reads one Batch of real-time data at a time and the OlapTableSink calls BRPC to send each Batch to the other Executor BE nodes; in the Close stage, the executor waits for the data import to end and closes the BrokerScanNode and OlapTableSink. The import execution plan of Stream Load is shown in Figure 4.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_4_en-6bf14a31ea5acff82e83e5745a3603aa.png" width="1080" height="888" class="img_ev3q"></p><p>The OlapTableSink is responsible for the data distribution of the Stream Load task. Tables in Doris may have Rollups or Materialized Views; the Table and each of its Rollups and Materialized Views are called an Index. During data distribution, an IndexChannel maintains the data distribution channel of one Index. The Tablets under an Index may have multiple replicas distributed on different BE nodes. Under an IndexChannel, a NodeChannel maintains the data distribution channel of one Executor BE node. 
Therefore, the OlapTableSink contains multiple IndexChannels, and each IndexChannel contains multiple NodeChannels, as shown in Figure 5.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_5_en-f040fee94a651a88a2a3ef68de235532.png" width="1080" height="471" class="img_ev3q"></p><p>When the OlapTableSink distributes data, it reads the data Batch obtained by the BrokerScanNode row by row and adds each data row to the IndexChannel of every Index. The Partition and Tablet of a data row can be determined from its PartitionKey and DistributionKey, and the corresponding Tablet of the row in the other Indexes can then be calculated from the order of the Tablets in the Partition. Each Tablet may have multiple replicas distributed on different BE nodes, so within an IndexChannel each data row is added to the NodeChannel corresponding to every replica of its Tablet. Each NodeChannel has a send queue; when the new data rows in a NodeChannel accumulate to a certain size, they are added to the send queue as a data Batch. A fixed thread in the OlapTableSink polls each NodeChannel under each IndexChannel in turn and calls BRPC to send a data Batch from the send queue to the corresponding Executor BE. The data distribution process of the Stream Load task is shown in Figure 6.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_6_en-20cfadfbbb14b377e4a0debd6ef0bb1b.png" width="1080" height="850" class="img_ev3q"></p><h1>4 <strong>Data Write</strong></h1><p>After receiving a data Batch sent by the Coordinator BE, the BRPC server of the Executor BE submits the data writing task to a thread pool for asynchronous execution. In the Doris BE, data is written to the storage layer in a hierarchical manner. Each Stream Load task corresponds to one LoadChannel on each Executor BE. A LoadChannel maintains the data writing channel of a Stream Load task and is responsible for that task's data writing on the current Executor BE node; it writes the task's data on the current BE node to the storage layer in batches until the Stream Load task is completed. Each LoadChannel is uniquely identified by the load ID, and all LoadChannels on a BE node are managed by the LoadChannelMgr. The Table corresponding to a Stream Load task may have multiple Indexes, and each Index corresponds to a TabletsChannel uniquely identified by the Index ID, so there are multiple TabletsChannels under each LoadChannel. A TabletsChannel maintains the data writing channel of one Index and manages the data writing of all the Tablets under that Index. The TabletsChannel reads the data Batch row by row and writes the rows to the corresponding Tablets through DeltaWriters. A DeltaWriter maintains the data writing channel of one Tablet, uniquely identified by the Tablet ID; it receives the data import of a single Tablet and writes the data into the MemTable corresponding to that Tablet. When the MemTable is full, its data is flushed to disk and Segment files are generated. The MemTable uses a SkipList data structure to temporarily store data in memory, and the SkipList sorts the data rows according to the Key of the schema. In addition, if the data model is Aggregate or Unique, the MemTable aggregates data rows with the same Key.
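</p><p>To make the Key-based aggregation concrete, here is a minimal sketch, assuming a hypothetical Aggregate-model table (all names are placeholders):</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">-- A minimal sketch of Key-based aggregation (hypothetical table).
CREATE TABLE example_db.site_pv
(
    siteid INT,
    pv BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY(siteid)
DISTRIBUTED BY HASH(siteid) BUCKETS 10;
-- With this schema, rows with the same Key that land in the same
-- MemTable are merged: (1, 10) and (1, 5) become a single row (1, 15).</code></pre></div></div><p>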
The data write channel of the Stream Load task is shown in Figure 7.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_7_en-757ccfec5d94537f5e85cc8026cc0d4a.png" width="1080" height="656" class="img_ev3q"></p><p>The Flush operation of the MemTable is performed asynchronously by the MemtableFlushExecutor. After a MemTable Flush task is submitted to the thread pool, a new MemTable is created to receive the subsequent data writes of the current Tablet. When the MemtableFlushExecutor performs the Flush, the RowsetWriter reads out all the data in the MemTable and writes out multiple Segment files through the SegmentWriter; each Segment file is no larger than 256 MB. For a given Tablet, each Stream Load task generates a new Rowset, and the generated Rowset can contain multiple Segment files. The data writing process of the Stream Load task is shown in Figure 8.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_8_en-11db5419d6ebd287d1abcb254fd174f0.png" width="1073" height="1280" class="img_ev3q"></p><p>The TxnManager on the Executor BE node is responsible for transaction management of Tablet-level data import. When the DeltaWriter is initialized, PrepareTransaction is executed to register the data write transaction of the corresponding Tablet in the current Stream Load task with the TxnManager. When writing to the Tablet is completed and the DeltaWriter is closed, Commit Transaction is executed to add the new Rowset generated by the data import to the TxnManager. Note that this TxnManager is only responsible for the transactions on a single BE, while the FE is responsible for overall import transaction management.</p><p>After the data import is completed, when the Executor BE executes the Publish Version task issued by the FE, it executes Publish Transaction to turn the new Rowset generated by the data import into a visible version and deletes the data writing task of the corresponding Tablet in the current Stream Load task from the TxnManager. This marks the end of the Tablet's data writing transaction in the current Stream Load task.</p><h1>5 <strong>Stream Load Operation Audit</strong></h1><p>Doris adds an operation audit function to Stream Load. After each Stream Load task is completed and the results are returned to the user, the Coordinator BE persistently stores the detailed information of the task in local RocksDB. The Master FE periodically pulls the information of completed Stream Load tasks from each BE node of the cluster through Thrift RPC, one batch of Stream Load operation records from one BE node at a time, and writes the pulled task information into the audit log (fe.audit.log). Each Stream Load task record stored on the BE has an expiration time (TTL), and expired records are deleted when RocksDB executes compaction. The user can audit historical Stream Load task information through the FE audit log.</p><p>When the FE writes the pulled Stream Load task information into the audit log, it also keeps a copy in memory. To prevent memory expansion, only a fixed number of task records are kept in memory; as subsequent pulls continue, the earlier Stream Load task records are gradually evicted from FE memory. 
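The user can query the latest Stream Load task information by executing the SHOW STREAM LOAD command at the client.</p><p>For example, a hypothetical client session could list recent tasks as follows (example_db is a placeholder, and in some versions recording Stream Load tasks must first be enabled in the BE configuration):</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">-- Hypothetical session: list the 10 most recent Stream Load tasks.
SHOW STREAM LOAD FROM example_db LIMIT 10;</code></pre></div></div>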
<h1><strong>Summary</strong></h1><p>This article analyzed the implementation of Stream Load in depth, covering its execution process, transaction management, import plan execution, data writing, and operation audit. Stream Load is one of the most commonly used data import methods for Doris users: a synchronous method that allows users to import data into Doris in batches over HTTP and returns the import results. Users can not only judge whether the import succeeded directly from the body of the HTTP response, but can also query the results of historical tasks by executing SQL on the client. In addition, Doris provides an operation audit function for Stream Load, so historical Stream Load task information can be audited through the audit log.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Doris analysis: Doris SQL principle analysis]]></title>
<id>https://doris.apache.org/zh-CN/blog/principle-of-Doris-SQL-parsing</id>
<link href="https://doris.apache.org/zh-CN/blog/principle-of-Doris-SQL-parsing"/>
<updated>2022-08-25T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This article mainly introduces the principle of Doris SQL parsing.Since there are many types of SQL, this article focuses on the analysis of query SQL. Doris's SQL analysis will be explained deeply in the algorithm principle and code implementation.]]></summary>
<content type="html"><![CDATA[<p><strong>Lead:</strong>
This article mainly introduces the principle of Doris SQL parsing.</p><p>It focuses on generating the single-machine logical plan, the distributed logical plan, and the physical execution plan, which correspond to the Analyze, SinglePlan, DistributedPlan, and Schedule parts of the code implementation.</p><p>First, the AST is preprocessed by Analyze; then it is optimized by SinglePlan to generate a single-machine query plan; next, DistributedPlan splits the single-machine query plan into a distributed query plan; in the end, Schedule decides which machines the query plan is sent to and how it is executed.</p><p>Since there are many types of SQL, this article focuses on the analysis of query SQL. Doris's SQL parsing will be explained in depth, from algorithm principles to code implementation.</p><h1>1. Introduction to Doris</h1><p>Doris is an interactive SQL database based on an MPP architecture, mainly used for near-real-time reporting and multi-dimensional analysis. The Doris architecture is straightforward, with only two types of processes.</p><ul><li><p>Frontend (FE): mainly responsible for handling user requests, query parsing and planning, metadata storage and management, and node-management-related work.</p></li><li><p>Backend (BE): mainly responsible for data storage and query plan execution.</p></li></ul><p>In Doris's storage engine, data is horizontally divided into data shards (Tablets, also called data buckets). Each Tablet contains several rows of data. Logically, Tablets belong to Partitions: a Tablet belongs to exactly one Partition, and a Partition contains several Tablets. The Tablet is the smallest physical storage unit for operations such as data movement and copying.</p><h1>2. SQL Parsing in Apache Doris</h1><p>SQL parsing in this article refers to <strong>the process of generating a complete physical execution plan from an SQL statement through a series of transformations</strong>.</p><p>This process includes the following four steps: lexical analysis, syntax analysis, generating a logical plan, and generating a physical plan.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_1_en-0c09f140305ed3879a5bdd86428f0f1c.png" width="1080" height="446" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="21-lexical-analysis">2.1 Lexical analysis<a href="#21-lexical-analysis" class="hash-link" aria-label="2.1 Lexical analysis的直接链接" title="2.1 Lexical analysis的直接链接"></a></h2><p>Lexical analysis identifies the SQL string as a sequence of tokens, in preparation for syntax analysis.</p><div class="language-undefined codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-undefined codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">select ...... from ...... where ....... group by ..... 
order by ......</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">SQL tokens can be divided into the following categories:</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">○ Keywords (select, from, where)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">○ operator (+, -, &gt;=)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">○ Open/close flag ((, CASE)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">○ placeholder (?)</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">○ Comments</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">○ space</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">......</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="22-syntax-analysis">2.2 Syntax analysis<a href="#22-syntax-analysis" class="hash-link" aria-label="2.2 Syntax analysis的直接链接" title="2.2 Syntax analysis的直接链接"></a></h2><p>Syntax analysis converts the tokens generated by lexical analysis into an abstract syntax tree based on the grammar rules, as shown in Figure 2.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_2_en-3288af3435350e506b2d1f6314172e64.png" width="1080" height="473" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="23-logical-plan">2.3 Logical plan<a href="#23-logical-plan" class="hash-link" aria-label="2.3 Logical plan的直接链接" title="2.3 Logical plan的直接链接"></a></h2><p>The logical plan converts the abstract syntax tree into an algebraic relation, which is an operator tree in which each node represents an operation on data. The entire tree represents how the data is computed and how it flows, as shown in Figure 3.</p><p> <img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_3_en-7a6ac1b525922fce20195f2224d176ad.png" width="573" height="893" class="img_ev3q"></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="24-physical-plan">2.4 Physical plan<a href="#24-physical-plan" class="hash-link" aria-label="2.4 Physical plan的直接链接" title="2.4 Physical plan的直接链接"></a></h2><p>The physical plan determines which computing operations are performed on which machines. It is generated based on the logical plan, the distribution of machines, and the distribution of data.</p><p>Doris's SQL parsing also follows these steps, but it is refined and optimized according to the characteristics of the Doris architecture and its data storage method to maximize the computing power of the machines.</p><h1>3. 
Design goals</h1><p>The design goals of the Doris SQL parsing architecture are:</p><ol><li><p>Maximize computational parallelism</p></li><li><p>Minimize network transfer of data</p></li><li><p>Minimize the amount of data that needs to be scanned</p></li></ol><h1>4. Architecture</h1><p>Doris SQL parsing includes five steps: lexical analysis, syntax analysis, generation of a stand-alone logical plan, generation of a distributed logical plan, and generation of a physical execution plan.</p><p>In terms of code implementation, these correspond to the following five steps: Parse, Analyze, SinglePlan, DistributedPlan, and Schedule, as shown in Figure 4.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_4_en-cd2aa449d728cd42554bbf7ddbdbaad6.png" width="1080" height="1682" class="img_ev3q"></p><p>The Parse phase will not be discussed in this article. Analyze does some pre-processing of the AST; SinglePlan optimizes the AST into a stand-alone query plan; DistributedPlan splits the stand-alone query plan into a distributed query plan; and the Schedule phase determines which machines the query plan is sent to for execution.</p><p><strong>Since there are many types of SQL, this article focuses on the analysis of query SQL.</strong></p><p>Figure 5 shows the parsing implementation of a simple query SQL in Doris.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_5_en-cd3c8dd60e28999551acce60541797d2.png" width="1080" height="1344" class="img_ev3q"></p><h1>5. Parse Phase</h1><p>In the Parse stage, JFlex is used for lexical analysis, Java CUP Parser is used for syntax analysis, and an AST (Abstract Syntax Tree) is finally generated. These are existing and mature technologies and will not be introduced in detail here.</p><p>The AST has a tree-like structure that represents a piece of SQL. Different types of queries (select, insert, show, set, alter table, create table, etc.) therefore generate different data structures after Parse (SelectStmt, InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, CreateTableStmt, etc.). They all inherit from StatementBase, and each performs specific processing according to its own grammar rules. For example, for select-type SQL, the SelectStmt structure is generated after Parse.</p><p>The SelectStmt structure contains SelectList, FromClause, WhereClause, GroupByClause, SortInfo, and other structures. These structures contain more basic data structures; for example, WhereClause contains BetweenPredicate, BinaryPredicate, CompoundPredicate, InPredicate, and so on.</p><p>All structures in the AST are composed of the basic expression structure, Expr, in various combinations, as shown in Figure 6.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_6_en-39088e65b97c95938d6cf9c1aba359e8.png" width="1080" height="718" class="img_ev3q"></p><h1>6. Analyze Phase</h1><p>Analyze performs pre-processing and semantic analysis on the AST generated in the Parse phase, preparing for the generation of the stand-alone logical plan.</p><p>The abstract class StatementBase represents the abstract syntax tree. This class contains the most crucial member function, analyze(), which performs the work needed in the Analyze phase.</p><p>Different types of queries (select, insert, show, set, alter table, create table, etc.) 
will generate different data structures through the Parse stage (SelectStmt, InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, CreateTableStmt, etc.). These data structures inherit from StatementBase and perform a specific analysis on a specific type of SQL by implementing the analyze() function.</p><p>For example, a select-type query is converted into analyze() calls on the sub-statements of the select SQL: SelectList, FromClause, GroupByClause, HavingClause, WhereClause, SortInfo, etc. These sub-statements then call analyze() on their own sub-structures, so the various scenarios of the various SQL types are analyzed through layer-by-layer iteration. For example, WhereClause further explores the BetweenPredicate, BinaryPredicate, CompoundPredicate, InPredicate, etc. that it contains.</p><p><strong>For query-type SQL, Analyze performs several important steps:</strong></p><ul><li><p><strong>Metadata identification and parsing</strong>: Identify and parse metadata such as the Cluster, Database, Table, and Column involved in the SQL, and determine which columns, tables, databases, and clusters need to be calculated.</p></li><li><p><strong>SQL correctness check</strong>: e.g., a window function cannot use DISTINCT, the projection columns must not be ambiguous, the where clause cannot contain grouping operations, etc.</p></li><li><p><strong>Simple SQL rewrites</strong>: for example, expanding select * to select all columns, and converting count distinct to the bitmap or hll function.</p></li><li><p><strong>Function correctness check</strong>: Check whether the functions used in the SQL are consistent with the system-defined functions, including parameter types, number of parameters, etc.</p></li><li><p><strong>Aliasing for Tables and Columns.</strong></p></li><li><p><strong>Type checking and conversion</strong>: For example, when the types on both sides of a binary expression are inconsistent, one of them needs to be converted (with BIGINT and DECIMAL, the BIGINT type needs to be cast to DECIMAL).</p></li></ul><p>After the AST is analyzed, a rewrite operation is performed on it again to simplify it or convert it into a unified form for processing. The present rewrite algorithm is rule-based: it rewrites the AST with each rule from bottom to top, following the tree structure of the AST. If the AST changes after rewriting, analysis and rewriting start again, until there is no further change to the AST.</p><p>Examples include simplification of constant expressions (1 + 1 + 1 is rewritten as 3, 1 &gt; 2 is rewritten as False) and converting some statements into a unified form for processing, such as rewriting where in and where exists as semi joins, and where not in and where not exists as anti joins.</p><h1>7. Generate stand-alone logical Plan phase</h1><p>At this stage, an algebraic relation, also known as the operator tree, is generated from the AST. Each node on the tree is an operator, representing an operation.</p><p>As shown in Figure 7, ScanNode represents the scan and read operations on a table. HashJoinNode represents a join operation: a hash table of the smaller table is constructed in memory, and the larger table is traversed to look up matching values of the join key. Project is the projection operation, which represents the columns that need to be output at the end. 
<h1>7. Generate Stand-alone Logical Plan Phase</h1><p>At this stage, relational algebra is generated from the AST; the result is also known as the operator tree. Each node on the tree is an operator representing one operation.</p><p>As shown in Figure 7, ScanNode represents the scan and read operations on a table. HashJoinNode represents the join operation: a hash table of the small table is constructed in memory, and the large table is traversed to look up matching values of the join key. Project represents the projection operation, i.e., the columns that are finally output; in Figure 7, only the citycode column is output.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_7_en-3b659a292f7c875ca9651197305c47ab.png" width="1080" height="543" class="img_ev3q"></p><p>Without optimization, the generated relational algebra is very expensive to send to storage and execute.</p><p>For the query:</p><pre><code class="language-sql">select a.siteid, a.pv from table1 a join table2 b on a.siteid = b.siteid where a.citycode=122216 and b.username="test" order by a.pv limit 10</code></pre><p>As shown in Figure 8, with unoptimized relational algebra, all columns have to be read out and carried through a series of calculations; only at the end are the siteid and pv columns selected and output. A large amount of useless column data wastes computing resources.</p><p>When Doris generates the relational algebra, many optimizations are applied: the projection columns and query conditions are pushed down into the scan operation as much as possible.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_8_en-021b337867f379cc036dfbe34f5fe9f8.png" width="500" height="1110" class="img_ev3q"></p><p><strong>Specifically, this phase mainly performs the following tasks:</strong></p><ul><li><p><strong>Slot materialization</strong>: determine the columns that need to be scanned and computed for each expression; for example, the expressions in aggregate functions and in the GROUP BY clause of an aggregation node need to be materialized.</p></li><li><p><strong>Projection pushdown</strong>: the BE scans only the columns that must be read.</p></li><li><p><strong>Predicate pushdown</strong>: push the filter conditions down to the Scan node as far as possible while preserving semantic correctness.</p></li><li><p><strong>Partition and bucket pruning</strong>: based on the filter conditions, determine which partitions and which buckets of tablets need to be scanned.</p></li><li><p><strong>Join reorder</strong>: for inner joins, Doris adjusts the order of the tables according to their row counts, putting the large table in front.</p></li><li><p><strong>Sort + Limit optimized to TopN</strong>: order by ... limit statements are converted into TopN operation nodes, which is convenient for unified processing.</p></li><li><p><strong>Materialized view selection</strong>: the best materialized view is selected according to the columns required by the query, the columns used for filtering, sorting, and joins, the number of rows, the number of columns, and other factors.</p></li></ul><p>Figure 9 shows an example of these optimizations. In Doris, optimization is carried out while the relational algebra is being generated: each operator is optimized as soon as it is generated.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_9_en-cceafd6d3dd41c1765b4dbbf3ce047e1.png" width="1080" height="290" class="img_ev3q"></p>
<h1>8. Generate Distributed Plan Phase</h1><p>After the single-machine PlanNode tree is generated, it needs to be split into a distributed PlanFragment tree according to the distributed environment (a PlanFragment represents an independent execution unit). Because a table's data is distributed across multiple hosts, some computations can be parallelized.</p><p>The primary purpose of this step is to maximize parallelism and data locality. The main strategy is to split out the nodes that can be executed in parallel into separate PlanFragments: ExchangeNodes replace the split-out nodes to receive data, and a DataSinkNode is added to each split-out node to transmit the computed data to the corresponding ExchangeNode for further processing.</p><p>This step works recursively: it traverses the entire PlanNode tree from bottom to top, creates a PlanFragment for each leaf node of the tree, and whenever a parent node is encountered, considers splitting out the child nodes that can be executed in parallel.</p><p>For query operations, the join operation is the most common.</p><p><strong>Doris currently supports four join algorithms:</strong> broadcast join, hash partition join, colocate join, and bucket shuffle join.</p><p><strong>Broadcast join</strong>: send the small table to every machine where the large table is located and perform the hash join there. When the amount of data scanned from one table is small, the cost of a broadcast join is calculated, compared with the cost of a hash partition join, and the method with the smaller cost is selected.</p><p><strong>Hash partition join</strong>: when the data scanned from both tables is large, a hash partition join is generally used. It traverses all the data of both tables, computes the hash value of the join key, takes it modulo the number of machines in the cluster, and sends each row to the selected machine for the hash join operation.</p><p><strong>Colocate join</strong>: if the two tables were created with the same specified data distribution, the colocate join algorithm is used when the join key of the two tables matches the bucket key. Since the data distribution of the two tables is identical, the hash join is effectively a local operation: it involves no data transmission between nodes, which significantly improves query performance.</p><p><strong>Bucket shuffle join</strong>: when the join key is a bucketing key and only one partition is involved, the bucket shuffle join algorithm is preferred. Since bucketing itself represents a way of dividing data, the right table only needs to be hashed modulo the number of buckets of the left table, so just one copy of the right table's data needs to be transmitted over the network, which greatly reduces network data transmission, as shown in Figure 10.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_10_en-e99cc952e6ef7e1500565bffbd73da18.png" width="878" height="938" class="img_ev3q"></p>
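<p>For illustration, here is a hedged sketch of declaring two tables with the same data distribution so that colocate join applies (the schemas are hypothetical; the "colocate_with" property is from the Doris documentation):</p><pre><code class="language-sql">-- Both tables share a colocation group, the same bucket key type,
-- and the same bucket count, so joins on siteid stay local on each BE:
CREATE TABLE t1 (siteid INT, pv BIGINT)
DUPLICATE KEY(siteid)
DISTRIBUTED BY HASH(siteid) BUCKETS 8
PROPERTIES ("colocate_with" = "demo_group");

CREATE TABLE t2 (siteid INT, username VARCHAR(32))
DUPLICATE KEY(siteid)
DISTRIBUTED BY HASH(siteid) BUCKETS 8
PROPERTIES ("colocate_with" = "demo_group");

-- A join on the bucket key can then be executed as a colocate join:
SELECT a.pv FROM t1 a JOIN t2 b ON a.siteid = b.siteid;
</code></pre>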
<p>Figure 11 shows the core process of creating a distributed logical plan from a single-machine logical plan that contains a HashJoinNode.</p><ul><li><p>For PlanNodes, PlanFragments are created bottom-up.</p></li><li><p>For a ScanNode, a PlanFragment is created directly, with the ScanNode as its RootPlanNode.</p></li><li><p>For a HashJoinNode, the broadcastCost is calculated first, which provides a reference for choosing between broadcast join and hash partition join.</p></li><li><p>The join algorithm is then chosen according to the conditions below.</p></li><li><p>If colocate join is used, the join is entirely local, so no splitting is required: the left child of the HashJoinNode is the RootPlanNode of leftFragment and the right child is the RootPlanNode of rightFragment; the HashJoinNode shares a PlanFragment with leftFragment, and rightFragment is deleted.</p></li><li><p>If bucket shuffle join is used, the right table's data needs to be sent to the left table. An ExchangeNode is created first; the left child of the HashJoinNode is the RootPlanNode of leftFragment and the right child is this ExchangeNode; the HashJoinNode shares a PlanFragment with leftFragment, and the destination of rightFragment's data is set to this ExchangeNode.</p></li><li><p>If broadcast join is used, the right table's data also needs to be sent to the left table, and the construction is the same: an ExchangeNode is created first; the left child of the HashJoinNode is the RootPlanNode of leftFragment and the right child is this ExchangeNode; the HashJoinNode shares a PlanFragment with leftFragment, and the destination of rightFragment's data is set to this ExchangeNode.</p></li><li><p>If hash partition join is used, the data of both the left and right tables must be split, so both child nodes are split out and a left ExchangeNode and a right ExchangeNode are created. The HashJoinNode takes the two ExchangeNodes as its left and right children, and a separate PlanFragment is created with the HashJoinNode as its RootPlanNode. Finally, the data-sending destinations of leftFragment and rightFragment are set to the left and right ExchangeNodes respectively.</p></li></ul><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_11_en-df9efe7a2e23bc2caa676a52414ed916.png" width="1080" height="975" class="img_ev3q"></p><p>Figure 12 is an example of the PlanFragment tree obtained after converting a join of two tables; three PlanFragments are generated, and the final output data passes through the ResultSinkNode.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_12_en-fc24ac9080f5429b9e7a871a34192f97.png" width="1080" height="1079" class="img_ev3q"></p><h1>9. Schedule Phase</h1><p>This step creates a distributed physical plan based on the distributed logical plan. It solves the following questions:</p><ul><li><p>Which BE executes which PlanFragment</p></li><li><p>Which replica to choose for each tablet in the query</p></li><li><p>How to perform multi-instance concurrency</p></li></ul>
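<p>Instance concurrency, discussed below, can also be influenced per session. A hedged sketch (the variable name is from Doris documentation of this era; newer versions may expose different knobs):</p><pre><code class="language-sql">-- Ask the planner to run up to 3 fragment instances per BE:
SET parallel_fragment_exec_instance_num = 3;

-- Inspect the current value:
SHOW VARIABLES LIKE 'parallel_fragment_exec_instance_num';
</code></pre>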
<p><strong>Figure 13 shows the core process of creating the distributed physical plan:</strong></p><p><strong>a. Prepare phase</strong>: create a FragmentExecParams structure for each PlanFragment to represent all the parameters required for its execution; if a PlanFragment contains a DataSinkNode, find the destination PlanFragment of the data transmission and set the input of that PlanFragment's FragmentExecParams to this PlanFragment's FragmentExecParams.</p><p><strong>b. computeScanRangeAssignment phase</strong>: different processing is performed for different types of joins.</p><ul><li><p>computeScanRangeAssignmentByColocate: handles colocate joins. Since the bucket data distribution of the two joined tables is the same, the join operates bucket by bucket, so this step determines which host is selected for each bucket, keeping the buckets allocated to each host as even as possible.</p></li><li><p>computeScanRangeAssignmentByBucket: handles bucket shuffle joins, which likewise operate bucket by bucket, so this step also determines which host is selected for each bucket, again keeping the allocation across hosts as even as possible.</p></li><li><p>computeScanRangeAssignmentByScheduler: handles the other join types. It determines which replica of each tablet a ScanNode reads. A ScanNode reads multiple tablets, and each tablet has multiple replicas. To spread the scan operation across machines as much as possible, improve concurrent performance, and reduce I/O pressure, Doris uses a round-robin algorithm to distribute tablet scans over multiple machines. For example, if 100 tablets need to be scanned, each tablet has three replicas, and ten machines are available, each machine ends up scanning ten tablets.</p></li></ul><p><strong>c. computeFragmentExecParams phase</strong>: this stage determines which BEs a PlanFragment is sent to for execution and how instance concurrency is handled. After the scan address of each tablet is determined, FragmentExecParams generates instances with the address as the dimension: if a FragmentExecParams contains multiple addresses, multiple FInstanceExecParam instances are generated. If concurrency is configured, the execution instance of an address is further split into multiple FInstanceExecParams. There is some special handling for bucket shuffle join and colocate join, but the basic logic is the same. After an FInstanceExecParam is created, it is assigned a unique ID to facilitate tracking. If a FragmentExecParams contains an ExchangeNode, the number of senders is counted so that it knows how many senders' data it needs to accept. Finally, FragmentExecParams determines the destinations and fills in the destination addresses.</p><p><strong>d. Create result receiver phase</strong>: the result receiver is where the final data is output after the query completes.</p><p><strong>e. toThrift phase</strong>: create RPC requests based on the FInstanceExecParams of all PlanFragments and send them to the BE side for execution. With this, a complete SQL parsing process is finished.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_13_en-11d11e8bdcacdc813f16f698e3c7cb6d.png" width="1080" height="846" class="img_ev3q"></p><p>Figure 14 is a simple example. The PlanFragment in the figure contains a ScanNode that scans three tablets; each tablet has two replicas, and the cluster is assumed to have two hosts.</p><p>The computeScanRangeAssignment phase determines that replicas 1, 3, 5, 8, 10, and 12 need to be scanned, where replicas 1, 3, and 5 are located on host1 and replicas 8, 10, and 12 are located on host2.</p><p>If the global concurrency is set to 1, two FInstanceExecParam instances are created and sent to host1 and host2 for execution. If the global concurrency is set to 3, three FInstanceExecParam instances are created on host1 and three on host2. Each instance scans one replica, which is equivalent to initiating six RPC requests.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/Figure_14_en-584e7935ee2ef6eb13e0cd4dada6ac8d.png" width="1080" height="545" class="img_ev3q"></p><h1>10. Summary</h1><p>This article first briefly introduced Doris and the general process of SQL parsing: lexical analysis, syntax analysis, generation of logical plans, and generation of physical plans. It then presented the overall architecture of Doris SQL parsing. Finally, the five phases--Parse, Analyze, SinglePlan, DistributedPlan, and Schedule--were explained in detail, with an in-depth look at the algorithm principles and code implementation.</p><p>Doris follows the standard methodology of SQL parsing, but in line with its underlying storage architecture and distributed design, it applies many optimizations during SQL parsing to maximize parallelism and minimize network transmission, taking a large burden off the SQL execution layer.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[How Flink's real-time writes to Apache Doris ensure both high throughput and low latency]]></title>
<id>https://doris.apache.org/zh-CN/blog/Flink-realtime-write</id>
<link href="https://doris.apache.org/zh-CN/blog/Flink-realtime-write"/>
<updated>2022-07-29T00:00:00.000Z</updated>
<summary type="html"><![CDATA[With the increasing demand for real-time analysis, the timeliness of data is becoming more and more important to the refined operation of enterprises. With the massive data, real-time data warehouse plays an irreplaceable role in effectively digging out valuable information, quickly obtaining data feedback, helping companies make faster decisions and better product iterations.]]></summary>
<content type="html"><![CDATA[<p>With the increasing demand for real-time analysis, the timeliness of data is becoming more and more important to the refined operation of enterprises. With massive data, a real-time data warehouse plays an irreplaceable role in effectively mining valuable information, obtaining data feedback quickly, and helping companies make faster decisions and better product iterations.</p><p>In this situation, Apache Doris stands out as a real-time MPP analytic database that is high-performance, easy to use, and supports various data import methods. Combined with Apache Flink, users can quickly import unstructured data from Kafka and CDC (Change Data Capture) data from upstream databases like MySQL. Apache Doris also provides sub-second analytic query capabilities, which effectively serve several real-time scenarios: multi-dimensional analysis, dashboards, data serving, etc.</p><h1>Challenge</h1><p>Usually, there are many challenges in ensuring high end-to-end concurrency and low latency for real-time data warehouses, such as:</p><ul><li><p>How to ensure end-to-end second-level data sync?</p></li><li><p>How to ensure data visibility quickly?</p></li><li><p>How to solve the problem of many small-file writes under high concurrency?</p></li><li><p>How to ensure end-to-end Exactly-Once semantics?</p></li></ul><p>Facing the challenges above, we conducted in-depth research on the business scenarios of users who build real-time data warehouses with Flink and Doris. After grasping users' pain points, we made targeted optimizations in Doris version 1.1, greatly improving user experience and stability. The resource consumption of Doris has also been greatly optimized.</p><h1>Optimization</h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="streamming-write">Streaming Write</h3><p>The initial practice of the Flink Doris Connector was to cache received data into an in-memory batch and write in saved batches, using parameters such as <code>batch.size</code> and <code>batch.interval</code> to control the timing of Stream Load writes.</p><p>It usually runs stably when these parameters are reasonable. When they are unreasonable, however, frequent Stream Loads and untimely compaction cause too-many-versions errors (-235). On the other hand, when there is a lot of data, setting <code>batch.size</code> too large in order to reduce the Stream Load frequency may cause OOM.</p><p><strong>To solve this problem, we introduced streaming write:</strong></p><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/otliigutb8p9l1y6qyp6.png" alt="Image description" class="img_ev3q"></p><ul><li><p>After the Flink task starts, a Stream Load HTTP request is initiated asynchronously.</p></li><li><p>When data is received, it is continuously transmitted to Doris through HTTP chunked transfer encoding.</p></li><li><p>The HTTP request ends at the Checkpoint, completing the Stream Load write, and the next Stream Load request is initiated asynchronously at the same time.</p></li><li><p>Data continues to be received, and the follow-up process is the same as above.</p></li></ul><p>Since the chunked mechanism is used to transmit data, the memory pressure of batching is avoided. And since the timing of writing is bound to the Checkpoint, the timing of Stream Load becomes controllable, which provides a basis for the Exactly-Once semantics below.</p>
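<p>A minimal Flink SQL sketch of declaring such a streaming sink with the Flink Doris Connector (connection values are placeholders; option names follow the connector documentation):</p><pre><code class="language-sql">CREATE TABLE doris_sink (
  siteid INT,
  pv BIGINT
) WITH (
  'connector' = 'doris',
  -- placeholder FE address and credentials:
  'fenodes' = 'fe_host:8030',
  'table.identifier' = 'example_db.example_table',
  'username' = 'root',
  'password' = '',
  -- label prefix used for the Stream Load requests:
  'sink.label-prefix' = 'doris_demo',
  -- two-phase commit, the basis of the Exactly-Once semantics below:
  'sink.enable-2pc' = 'true'
);
</code></pre>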
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="exactly-once">Exactly-Once</h3><p>Exactly-Once means that data is neither reprocessed nor lost, even if a machine or application fails. Flink has long supported end-to-end Exactly-Once scenarios, mainly through a two-phase commit protocol that realizes Exactly-Once semantics for the Sink operator.</p><p>On the basis of Flink's two-phase commit, and with the help of the two-phase commit of Stream Load in Doris 1.0, the Flink Doris Connector implements Exactly-Once semantics. The specific principles are as follows:</p><ul><li>When the Flink task starts, it initiates a Stream Load precommit request. A transaction is opened first, and data is continuously sent to Doris through the HTTP chunked mechanism.</li></ul><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ole5tqi91jibzdg9vqep.png" alt="Image description" class="img_ev3q"></p><ul><li>When data writing ends at the Checkpoint, the HTTP request is completed and the transaction status is set to preCommitted. The data has been written to the BE but is invisible to users at this point.</li></ul><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jiieu1eff6smunkr85s5.png" alt="Image description" class="img_ev3q"></p><ul><li>A commit request is initiated after the Checkpoint, and the transaction status is set to Committed. After the request, the data becomes visible to users.</li></ul><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eaona8eslljmkpaa9324.png" alt="Image description" class="img_ev3q"></p><ul><li>If the Flink application ends unexpectedly and restarts from a Checkpoint, and the last transaction was in the preCommitted state, a rollback request is initiated and the transaction status is set to Aborted.</li></ul><p>Based on the above, the Flink Doris Connector can be used to land data in real time without loss or duplication.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="second--level-data-synchronization">Second-Level Data Synchronization</h3><p>End-to-end second-level data sync and real-time data visibility under high-concurrency writes require Doris to have the following capabilities:</p><ul><li><strong>Transaction Processing Capability</strong></li></ul><p>Flink's real-time writes interact with Doris in the form of Stream Load 2pc, which requires Doris to provide the corresponding transaction processing capability, guaranteeing the basic ACID properties and supporting Flink's second-level data sync in high-concurrency scenarios.</p><ul><li><strong>Rapid Aggregation Capability of Data Versions</strong></li></ul><p>Each import into Doris generates a data version. In high-concurrency write scenarios, an inevitable consequence is a large number of data versions, each containing only a small amount of data. The continuous high-concurrency writing of small files severely tests Doris's data-merging performance and timeliness, and in turn affects query performance. Doris greatly enhanced its data compaction capability in version 1.1, so that new data can be aggregated quickly, avoiding the -235 errors and query-efficiency problems caused by too many versions of sharded data.</p>
<p>First of all, Doris 1.1 introduced QuickCompaction, which actively triggers compaction when the number of data versions increases. At the same time, by improving the ability to scan tablet metadata, tablets that need compaction can be discovered quickly and compaction triggered. Through active triggering plus passive scanning, the timeliness problem of data merging is completely solved.</p><p>For high-frequency cumulative compaction of small files, scheduling and isolation of compaction tasks is implemented to prevent the heavyweight base compaction from affecting the merging of new data.</p><p>Finally, the strategy for merging small files is optimized with a gradient merging method: the files participating in each merge belong to the same data magnitude, which prevents versions with large size differences from being merged together. Merging proceeds level by level, reducing the number of times a single file participates in merging and greatly saving the CPU consumption of the system.</p><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ez5qdcpgwjw60g9aacqd.png" alt="Image description" class="img_ev3q"></p><p>Doris version 1.1 made targeted optimizations for scenarios such as high-concurrency import, second-level data sync, and real-time data visibility, which greatly increases the ease of use and stability of the Flink-plus-Doris pipeline and saves overall cluster resources.</p><h1>Effect</h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="general-flink-high-concurrency-scenarios">General Flink High Concurrency Scenarios</h3><p>In a typical scenario from our survey, Flink synchronizes unstructured data from upstream Kafka, and after ETL the data is written to Doris in real time by the Flink Doris Connector.</p><p>The customer's requirements here are extremely strict: the upstream maintains a high frequency of 100,000 records per second, and data needs to be synchronized end to end within 5 seconds to achieve second-level visibility. Flink is configured with a parallelism of 20, and the Checkpoint interval is 5 s. The performance of Doris version 1.1 is excellent.</p><p>Specifically, this is reflected in the following aspects:</p><ul><li><strong>Compaction Real-Time</strong></li></ul><p>Data can be merged quickly, the number of data versions per tablet stays below 50, and the compaction score is stable. Compared with the previous -235 problems in high-concurrency import scenarios, compaction efficiency is improved by more than 10 times.</p><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d6enyv1zj68o7myjypnl.png" alt="Image description" class="img_ev3q"></p>
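<p>One way to observe this from a MySQL client is to look at the per-tablet version count, which should stay low when compaction keeps up (the table name below is hypothetical):</p><pre><code class="language-sql">-- Each row describes a tablet; the VersionCount column shows how many
-- data versions are waiting to be merged (kept below ~50 in this scenario):
SHOW TABLETS FROM example_db.example_table;
</code></pre>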
<ul><li><strong>CPU Resource Consumption</strong></li></ul><p>Doris version 1.1 optimized the compaction strategy for small files; in high-concurrency import scenarios, CPU resource consumption is reduced by 25%.</p><ul><li><strong>Stable QPS Query Latency</strong></li></ul><p>By reducing CPU usage and the number of data versions, the overall orderliness of the data is improved and SQL query latency is reduced.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="second-level-data-synchronization-scenario-extreme-high-pressure">Second-Level Data Synchronization Scenario (Extreme High Pressure)</h3><p>In a client-side Stream Load stress test against a single table with a single tablet and a concurrency limit of 30, with data visible in real time within 1 s, the compaction score before and after the optimization compares as below:</p><p><img loading="lazy" src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r01hn8hv6arzbdclknis.png" alt="Image description" class="img_ev3q"></p><h1>Recommendations</h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="real-time-data-visualization-scenario">Real-Time Data Visualization Scenario</h3><p>For scenarios with strict latency requirements, such as second-level data synchronization, a single import is usually small, so it is recommended to reduce <code>cumulative_size_based_promotion_min_size_mbytes</code>. Its default value is 64 MB; setting it to 8 MB manually can greatly improve compaction timeliness.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="high-concurrency-scenario">High Concurrency Scenario</h3><p>For high-concurrency write scenarios, the frequency of Stream Load can be reduced by increasing the Checkpoint interval. For example, setting the Checkpoint interval to 5-10 s not only increases the throughput of the Flink job but also reduces the generation of small files and avoids extra compaction pressure.</p><p>In addition, for scenarios with lower data freshness requirements, such as minute-level data sync, the Checkpoint interval can be increased further, e.g. to 5-10 minutes; the Flink Doris Connector still guarantees data integrity through the two-phase commit and Checkpoint mechanisms.</p>
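<p>A short sketch of the two knobs mentioned above (the first is a BE configuration entry in be.conf; the second uses Flink SQL client syntax):</p><pre><code class="language-sql">-- In be.conf, lower the promotion threshold for second-level visibility
-- (default 64 MB, as noted above):
--   cumulative_size_based_promotion_min_size_mbytes = 8

-- In the Flink SQL client, enlarge the checkpoint interval to batch
-- Stream Load writes:
SET 'execution.checkpointing.interval' = '10s';
</code></pre>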
<h1>Future planning</h1><ul><li><strong>Real-time Schema Change</strong></li></ul><p>When data is ingested in real time through Flink CDC and the upstream business table performs a schema change, the schema currently has to be modified manually in both Doris and the Flink job, and only after the task is restarted can data with the new schema be synchronized. This requires human intervention and places a considerable operational burden on users. In subsequent versions, real-time schema change will be supported for CDC scenarios, so that upstream schema changes are synchronized downstream in real time, comprehensively improving the efficiency of schema changes.</p><ul><li><strong>Doris Multi-table Writing</strong></li></ul><p>At present, the Doris Sink operator only supports synchronizing a single table, so for a whole database the stream still has to be split manually at the Flink level and written to multiple Doris Sinks, which adds work for developers. In subsequent versions, we will support synchronizing multiple tables with a single Doris Sink, which greatly simplifies the user's operation.</p><ul><li><strong>Adaptive Compaction Parameter Tuning</strong></li></ul><p>At present, the compaction strategy has many parameters that work well in most general scenarios but are not efficient in some special scenarios. We will continue to optimize in subsequent versions, performing adaptive compaction tuning for different scenarios to keep improving data-merging efficiency and timeliness.</p><ul><li><strong>Single-Replica Compaction</strong></li></ul><p>The current compaction strategy is executed on each BE separately. In subsequent versions, we will implement single-replica compaction, realizing compaction tasks by cloning snapshots. This reduces the cluster's compaction workload by about two-thirds and lowers system load, leaving more system resources to the user side.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Tech Sharing" term="Tech Sharing"/>
</entry>
<entry>
<title type="html"><![CDATA[Best practice of Apache Doris in JD]]></title>
<id>https://doris.apache.org/zh-CN/blog/jd</id>
<link href="https://doris.apache.org/zh-CN/blog/jd"/>
<updated>2022-07-20T00:00:00.000Z</updated>
<summary type="html"><![CDATA[This paper mainly discusses how to use Doris for business exploration and practice in the multi-dimensional analysis of real-time and offline data in the large real-time screen of JD customer service in the scenarios of manual consultation, customer event list, after-sales service list, etc.]]></summary>
<content type="html"><![CDATA[<h1><strong>Introduction:</strong></h1><p>Apache Doris is an open-source MPP analytical database that not only returns query results within sub-second response times, effectively supporting real-time data analysis, but also supports huge datasets of more than 10 PB. Compared with other popular OLAP database systems, the distributed architecture of Apache Doris is very simple. It supports elastic scaling and is easy to operate and maintain, saving a lot of labor and time costs. Its community in China is very active, and many companies use it at scale, such as Meituan and Xiaomi.</p><p>This paper mainly discusses how Doris is used for business exploration and practice in the multi-dimensional analysis of real-time and offline data for the real-time dashboards of JD Customer Service, in scenarios such as manual consultation, customer event lists, and after-sales service lists.</p><p>In recent years, with the explosive growth of data volume and the demand for online analysis of massive data, traditional relational databases such as MySQL and Oracle have hit bottlenecks under large data volumes, while databases such as Hive and Kylin lack timeliness. So real-time analytic databases such as Apache Doris, Apache Druid, and ClickHouse began to appear, not only coping with second-level queries over massive data but also meeting real-time and quasi-real-time analysis needs. Offline and real-time computing engines are in full bloom, but for different scenarios and different problems, no single engine is a panacea. We hope this article gives you some inspiration on the application and practice of offline and real-time analytics in JD's customer service business, and we welcome your communication and valuable suggestions.</p><h1><strong>JD Customer Service Business Form</strong></h1><p>As the entrance to the group's services, JD Customer Service provides efficient and reliable protection for users and merchants. JD Customer Service is responsible for solving users' problems in a timely manner and providing them with detailed, easy-to-understand instructions and explanations. To better understand users' feedback and the status of products, a series of indicators such as the number of inquiries, pick-up rates, and complaints must be monitored in real time, and problems discovered promptly through period-over-period and year-over-year comparisons, in order to better adapt to users' shopping patterns, improve service quality and efficiency, and thus enhance JD's brand influence.</p><h1><strong>Easy OLAP Design</strong></h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01-easyolap-doris-data-import-links"><strong>01 EasyOLAP Doris Data Import Links</strong></h3><p>EasyOLAP Doris data sources are mainly real-time Kafka streams and offline HDFS files. The import of real-time data relies on Routine Load, while offline data is mainly imported using Broker Load and Stream Load.</p>
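<p>As an illustration (the database, table, topic, and broker below are placeholders), a Kafka Routine Load job is declared roughly like this:</p><pre><code class="language-sql">CREATE ROUTINE LOAD example_db.kafka_job ON example_table
COLUMNS TERMINATED BY ","
PROPERTIES (
  "desired_concurrent_number" = "3"
)
FROM KAFKA (
  "kafka_broker_list" = "broker1:9092",
  "kafka_topic" = "consult_events"
);
</code></pre>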
<p><img loading="lazy" alt="1280X1280" src="https://cdnd.selectdb.com/zh-CN/assets/images/jd03-00bd471f0fab2d98798f5e3148b35fce.png" width="1080" height="604" class="img_ev3q"></p><p>EasyOLAP Doris Data Import Links</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02-easyolap-doris-full-link-monitor"><strong>02 EasyOLAP Doris Full-Link Monitoring</strong></h3><p>The EasyOLAP Doris project currently uses the Prometheus + Grafana stack for monitoring. node_exporter collects machine-level metrics, and Doris automatically exposes FE and BE service-level metrics in Prometheus format. In addition, an OLAP Exporter service is deployed to collect Routine Load metrics, aiming to detect problems in real-time data import as early as possible and guarantee the timeliness of real-time data.</p><p><img loading="lazy" alt="EasyOLAP Doris monitoring link" src="https://cdnd.selectdb.com/zh-CN/assets/images/jd04-8770adfb04ffe977f931d9eaff4cb534.png" width="1080" height="594" class="img_ev3q"></p><p>EasyOLAP Doris monitoring link</p><p><img loading="lazy" alt="640" src="https://cdnd.selectdb.com/zh-CN/assets/images/jd01-47257e8bb0b14785f854db959cdfd931.png" width="871" height="600" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="03-easyolap-doris-primary-secondary-dual-stream-design"><strong>03 EasyOLAP Doris Primary-Secondary Dual Stream Design</strong></h3><p>EasyOLAP Doris adopts dual writes to the primary and secondary clusters in order to guarantee the stability of Level 0 services during promotion periods.</p><p><img loading="lazy" alt="03 EasyOLAP Doris Primary-Secondary Dual Stream Design" src="https://cdnd.selectdb.com/zh-CN/assets/images/jd02-a6a4279c0c33a25862e89b56e7c986a7.png" width="1080" height="669" class="img_ev3q"></p><p>EasyOLAP Doris Primary-Secondary Dual Stream Design</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="04-easyolap-doris-dynamic-partition-management"><strong>04 EasyOLAP Doris Dynamic Partition Management</strong></h3><p>After analyzing the requirements, the JD OLAP team customized Doris in several areas, one of which is dynamic partition management. Although the community version already had dynamic partitioning, it could not retain partitions for specified time ranges. Given the characteristics of JD Group's business, we retain historical data for specified periods, such as the 618 and 11.11 promotions, so that those partitions are not deleted by dynamic partitioning. The dynamic partition management feature controls the amount of data stored in the cluster, and it is easy for the business side to use, with no need to manage partition information manually or with additional code.</p>
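<p>For reference, a sketch of community Doris dynamic partitioning (property names follow the Doris documentation; the table is hypothetical, and JD's extension for retaining specified historical partitions is a custom change not shown here):</p><pre><code class="language-sql">CREATE TABLE consult_log (
  dt DATE,
  session_id BIGINT
)
DUPLICATE KEY(dt)
PARTITION BY RANGE(dt) ()
DISTRIBUTED BY HASH(session_id) BUCKETS 8
PROPERTIES (
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  -- keep 30 days of history, pre-create 3 days ahead:
  "dynamic_partition.start" = "-30",
  "dynamic_partition.end" = "3",
  "dynamic_partition.prefix" = "p"
);
</code></pre>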
<h1><strong>Doris Caching Mechanism</strong></h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01-demand-scenarios"><strong>01 Demand Scenarios</strong></h3><p>Committed to continuously improving user experience, JD Customer Service's data analysis pursues extreme timeliness. The offline analysis scenario is write-less-read-more: data is written once and read frequently many times. In the real-time analysis scenario, part of the data sits in historical partitions that are no longer updated, while part sits in partitions that are still being updated. In most analysis applications, the following scenarios exist:</p><ul><li><p>High concurrency: Doris supports high concurrency well, but excessively high QPS causes cluster jitter, and a single node cannot carry very high QPS.</p></li><li><p>Complex queries: the JD Customer Service real-time operation platform needs to display complex, multi-dimensional indicators according to business scenarios; the rich indicator display corresponds to many different queries whose data comes from multiple tables. Although individual queries respond at the millisecond level, the overall response time may be at the second level.</p></li><li><p>Repeated queries: without an anti-refresh mechanism, repeatedly refreshing a page due to delays or slips of the hand submits a large number of duplicate queries.</p></li></ul><p>For the above scenarios, there is an application-layer solution: put query results into Redis and refresh the cache periodically or let users refresh it manually. But it has some problems:</p><ul><li><p>Data inconsistency: the cache cannot respond immediately to data updates, so users may receive results containing old data.</p></li><li><p>Low hit rate: if the data is highly real-time and the cache is frequently invalidated, the hit rate is low and the load on the system cannot be relieved.</p></li><li><p>Additional cost: introducing external components increases system complexity and adds extra cost.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02-introduction-to-caching-mechanism"><strong>02 Introduction to Caching Mechanism</strong></h3><p>EasyOLAP Doris has three different types of cache, namely Result Cache, SQL Cache, and Partition Cache, each suited to different scenarios. All three types can be switched on and off with MySQL client commands.</p>
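<p>In community Doris, the SQL Cache and Partition Cache switches are session variables, sketched below; the Result Cache described here is part of JD's customized build, so its switch is assumed to follow the same pattern and is not shown:</p><pre><code class="language-sql">-- Enable caching for the current session (use SET GLOBAL to apply
-- cluster-wide); variable names follow the Doris documentation of this era:
SET enable_sql_cache = true;
SET enable_partition_cache = true;
</code></pre>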
<p>These three caching mechanisms can coexist and can be turned on at the same time. When a query arrives, the query parser first determines whether the Result Cache is enabled; if it is, it checks whether a cached result exists for the query, and if a valid entry is found, the cached value is returned to the client directly. The cache is placed in the memory of each FE node for fast reading.</p><p>SQL Cache stores and retrieves cached results according to the signature of the SQL, the partition IDs of the queried tables, and the latest version numbers of the partitions; these three together serve as the cache key. If any one of them changes, for example the SQL statement changes or the partition version number changes after a data update, the cache is not hit. For multi-table joins, a partition update in any one of the tables also prevents a cache hit. SQL Cache is best suited to T+1 update scenarios.</p><p>Partition Cache is a more fine-grained caching mechanism. It splits a query along partitions into a read-only part and an updatable part: results over read-only partitions are cached, results over updatable partitions are not, and the result sets of the split subqueries are then merged. Therefore, for a query over N days of data where only the most recent D days are updated, and daily queries differ only in their date range, Partition Cache means that only the D updatable partitions need to be queried while the rest comes from the cache, which effectively reduces cluster load and shortens query response time.</p><p>When a query enters Doris, the system first processes the statement and uses it as the key; before executing the statement, the query analyzer automatically selects the most suitable caching mechanism so that, in the best case, caching shortens the response time. It then checks whether the query result exists in the cache: if it does, the cached data is returned to the client; if not, the query runs normally and the result is stored in the cache as the value, with the query statement as the key. SQL Cache suits T+1 scenarios and works well when partition updates are infrequent and SQL statements are repetitive. Partition Cache is the most fine-grained cache: when a statement queries data for a time range, it is split into multiple subqueries, which shortens query time and saves cluster resources when new data is written to only one or a few partitions.</p><p>To better observe the effectiveness of caching, cache metrics have been added to Doris's service metrics and are monitored visually through the Prometheus and Grafana monitoring systems. The metrics include the number of hits for each cache type, the hit rate for each cache type, the memory size of the cache, and others.</p>
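<p>A sketch of a query pattern that fits Partition Cache (a hypothetical table partitioned by dt and grouped by the partition column): only the recent, still-updating partitions are actually computed, while the older partitions' sub-results come from the cache.</p><pre><code class="language-sql">-- Daily consult counts over a 20-day window; only the most recently
-- updated partitions are re-queried, the rest are served from the cache:
SELECT dt, COUNT(*) AS consults
FROM consult_detail
WHERE dt BETWEEN '2020-11-01' AND '2020-11-20'
GROUP BY dt;
</code></pre>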
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="03-caching-mechanism-effect"><strong>03 Caching Mechanism Effect</strong></h3><p>For the JD Customer Service Doris main cluster, some services reached 100% CPU usage during the 11.11 period without caching on; with Result Cache on, CPU usage stayed between 30% and 40%. The caching mechanism ensures that the business gets query results quickly and protects cluster resources well under high-concurrency scenarios.</p><h1><strong>Doris' optimization during the 11.11 sale, 2020</strong></h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="01-import-task-optimization"><strong>01 Import Task Optimization</strong></h3><p>The import of real-time data has always been a challenge, and ensuring both data freshness and import stability matters most. To observe real-time import more intuitively, the JD OLAP team independently developed OLAP Exporter to collect metrics related to real-time import, such as import speed, import backlog, and suspended tasks. Import speed and backlog indicate the status of a real-time import task; if a backlog trend is found, the independently developed sampling tool can be used to sample and analyze the task. Real-time tasks have three main thresholds that control task submission: the maximum processing interval per batch, the maximum number of rows per batch, and the maximum amount of data per batch; a task is submitted as soon as any one of these thresholds is reached. By adding logs, we found that the task queue in the FE was rather busy, so we mainly enlarged the maximum number of rows per batch and the maximum amount of data per batch, and then tuned the maximum processing interval per batch so that, per business requirements, data latency stays within twice that interval. Analyzing tasks with the sampling tool ensures not only data freshness but also import stability. In addition, we set up alarms to detect anomalies such as backlogs and suspensions of real-time import tasks in a timely manner.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="02-monitoring-metrics-optimization"><strong>02 Monitoring Metrics Optimization</strong></h3><p>The monitoring metrics fall into two main sections: machine-level metrics and business-level metrics. On a full monitoring panel, detailed metrics bring comprehensive data but also make it harder to find the important metrics. So, to get a better view of the important metrics of all clusters, a separate panel was created: the 11.11 Important Metrics Summary Panel. The board contains metrics such as BE CPU usage, real-time task consumption backlog rows, TP99, QPS, and so on.
The number of metrics is small, but the status of all clusters can be observed, eliminating the trouble of frequently switching panels.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="03-peripheral-tools-support"><strong>03 Peripheral Tools Support</strong></h3><p>In addition to the sampling tool and OLAP Exporter mentioned above, the JD OLAP team has also developed a series of maintenance tools for Doris.</p><ol><li>Import sampling tool: it not only samples the data being imported in real time, but also supports adjusting the parameters of a real-time import task, or generating creation statements (including the latest consumption offsets and other information) for task migration and similar operations while the real-time import task is paused.</li></ol><ol start="2"><li>Big query tool: big queries not only cause jitter in cluster BE CPU usage but also lengthen the response time of other queries. Before the big query tool existed, if cluster CPU jitter was found, the audit logs on all FEs had to be checked and statistics compiled manually, which was time-consuming and not intuitive. The big query tool solves this problem: when monitoring shows cluster jitter, you can enter the cluster name and time point to get the total number of queries of each service at that time, the number of queries taking more than 5, 10, or 20 seconds, the number of queries with a huge scan volume, and so on, which makes it convenient to analyze big queries from different dimensions. Details of the big queries are also saved in an intermediate file, from which the big queries of different businesses can be obtained directly. The whole process takes only tens of seconds to a minute to locate an ongoing big query and get the corresponding query statements, greatly saving time and operation and maintenance costs.</li></ol><ol start="3"><li>Downgrade and recovery tools: to ensure the stability of Level 0 services during the 11.11 promotion, when cluster pressure exceeds the safety level, other non-Level 0 services must be downgraded and then restored to their pre-downgrade settings with one click after the peak period. Downgrading mainly involves reducing a service's maximum number of connections, suspending non-Level 0 real-time import tasks, and so on. This greatly increases ease of operation and improves efficiency.</li></ol><ol start="4"><li>Cluster inspection tool: during the 11.11 period, cluster health inspection is extremely important. Routine inspections include consistency checks between the primary and secondary clusters of dual-stream services: to ensure that the business can quickly switch to the other cluster when one cluster has problems, the library tables on both clusters must be consistent and their data volumes must not differ too much. Inspections also check whether the replica count of library tables is 3, whether there are unhealthy tablets in the cluster, and machine-level metrics such as disk utilization and memory.</li></ol>
<h1><strong>Summary &amp; Outlook</strong></h1><p>JD Customer Service introduced Doris in early 2020, currently runs one standalone cluster and one shared cluster, and is an experienced user of JD OLAP.</p><p>In business use we have also encountered problems related to task scheduling, import task configuration, and queries, which keep driving the JD OLAP team to understand Doris more deeply. We plan to promote the use of materialized views to further improve query efficiency; use Bitmap to support accurate deduplication for UV and similar metrics; use audit logs to make it easier to track large and slow queries; and solve the scheduling problems of real-time import tasks to make them more efficient and stable. In addition, we plan to optimize table design, create high-quality Rollups or materialized views to improve application smoothness, and bring more businesses onto the OLAP platform to increase its impact.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Best practice of Apache Doris in Meituan]]></title>
<id>https://doris.apache.org/zh-CN/blog/meituan</id>
<link href="https://doris.apache.org/zh-CN/blog/meituan"/>
<updated>2022-07-20T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Introduction: This paper mainly introduces a general method and practice of real-time data warehouse construction. The real-time data warehouse aims at end-to-end low latency, SQL standardization, rapid response to changes, and data unification. In practice, the best practice we summarize is: a common real-time production platform + a common interactive real-time analysis engine cooperate with each other to meet real-time and quasi-real-time business scenarios. The two have a reasonable division of labor and complement each other to form an easy-to-develop, easy-to-maintain, and most efficient assembly line, taking into account development efficiency and production costs, and satisfying diverse business needs with a better input-output ratio.]]></summary>
<content type="html"><![CDATA[<h1>Best Practice of Apache Doris in Meituan</h1><p>Introduction: This paper mainly introduces a general method and practice of real-time data warehouse construction. The real-time data warehouse aims at end-to-end low latency, SQL standardization, rapid response to change, and unified data. In practice, the best practice we have summarized is a common real-time production platform plus a common interactive real-time analysis engine, cooperating to serve real-time and quasi-real-time business scenarios. With a reasonable division of labor, the two complement each other to form an easy-to-develop, easy-to-maintain, and highly efficient pipeline that balances development efficiency against production cost and satisfies diverse business needs with a better input-output ratio.</p><h1>Real-time scenarios</h1><p>There are many scenarios in which real-time data is used in Meituan, mainly including the following:</p><ul><li>Operational level: real-time business changes, real-time marketing effects, daily business status, and daily real-time business trend analysis, etc.</li><li>Production level: whether the real-time systems are reliable and stable, real-time monitoring of system health, etc.</li><li>C-end users: for example, search recommendation ranking requires real-time understanding of users' intentions, behaviors, and characteristics, so as to recommend the content they care about most.</li><li>Risk control: heavily used in food delivery and financial technology; real-time risk identification, anti-fraud, abnormal transactions, etc. are all scenarios that apply large amounts of real-time data.</li></ul><h1>Real-time technology and architecture</h1><h3 class="anchor anchorWithStickyNavbar_LWe7" id="1real-time-computing-technology-selection">1. Real-time computing technology selection</h3><p>There are many open-source real-time technologies at present, among which Storm, Spark Streaming, and Flink are common. The specific choice depends on each company's business situation.</p><p>Meituan Waimai (takeaway) relies on the overall construction of Meituan's basic data system. In terms of technical maturity, it used Storm a few years ago, which was then irreplaceable in performance stability, reliability, and scalability. As Flink has matured, it has surpassed Storm in technical performance and framework design. In terms of trends, just as Spark replaced MapReduce, Storm will gradually be replaced by Flink. Of course, there is a process of migrating from Storm to Flink.
We still have some legacy jobs on Storm and are steadily migrating them.</p><p>A comparison between Storm and Flink can be found in the table above.</p><h3 id="2real-time-architecture">2. Real-time Architecture</h3><h4 id="-lambda-architecture">① Lambda Architecture</h4><p>The Lambda architecture is a classic. In the past there were few real-time scenarios and most processing was offline; when real-time scenarios are added on, the technical ecosystems differ because offline and real-time data have different timeliness. The Lambda architecture therefore attaches a separate real-time production link, integrates the two at the application level, and keeps the two production paths independent of each other. This is also a natural approach for business applications.</p><p>Dual-path production brings problems: processing logic, development, and operations are all doubled, and resources split into two separate chains. The Kappa architecture evolved out of these problems.</p><h4 id="-kappa-architecture">② Kappa Architecture</h4><p>The Kappa architecture is simpler in design: production is unified, and a single set of logic produces both offline and real-time results. However, it has considerable limitations in practice. There are few cases in the industry where the Kappa architecture is used directly in production, and those scenarios are relatively simple. We run into the same problems, and we have some thoughts of our own, discussed later.</p><h1>Business Pain Points</h1><p>In the food delivery business, we also encountered some problems.</p><p>In the early stage of the business, requirements were generally fulfilled case by case as they came in. The business demands high timeliness, so there was no opportunity to build a reusable middle layer; business logic was embedded directly into each job. This is a simple and effective approach, and such a development mode is common in the early stage of a business.</p><p>As shown in the figure above, after the data source is ingested, the data goes through cleaning, dimension expansion, and business logic processing in Storm or Flink, and is then output directly to the business. Taking this pipeline apart, every job repeatedly reads the same data sources and repeats operations such as cleaning, filtering, and dimension expansion; the only difference is the business logic in the code. With few businesses this model is acceptable, but as the business volume grows, whoever develops a job ends up operating and maintaining it, the maintenance workload grows, and jobs cannot be managed in a unified way.
Moreover, every team applies for its own resources, so resource costs expand rapidly and resources cannot be pooled and used efficiently. It is therefore necessary to think about how to construct real-time data across the whole set of data sources.</p><h1>Data Features and Application Scenarios</h1><p>So how do we build a real-time data warehouse?</p><p>First, break the task down: what data do we have, what scenarios do we serve, and what common features do those scenarios share? For the food delivery business there are two categories of data: log data and business data.</p><ul><li><p>Log data: characterized by large volume, semi-structured records, and deep nesting. Log data has an important property: once a log stream is formed, it does not change. Logs from the whole platform are collected through tracking events and then gathered and distributed uniformly, like a tree with very large roots. The whole process of pushing data to front-end applications resembles a tree branching from the root (a decomposition from 1 to n). If every business searched for data from the root, the path would look shortest, but the load would be so heavy that retrieval efficiency would be low. Log data is generally used for production monitoring and user behavior analysis, where timeliness requirements are high: the time window is usually 5 or 10 minutes, or even up to the current moment. Typical applications are real-time dashboards and real-time features, e.g., a user's click should be perceived immediately.</p></li><li><p>Business data: mainly transactional data from business systems, which are usually self-contained and distribute data downstream in the form of Binlog. Business systems are transactional and mostly use normal-form modeling, so the data is structured and the main entities are clear. However, because there are many tables, multi-table joins are required to express the complete business, so this is an integration process from n to 1.</p></li></ul><p>Real-time processing of business data faces several difficulties:</p><ul><li><p>Business diversity: business processes keep changing from start to finish, e.g., order -&gt; payment -&gt; delivery. The business database is updated in place, so Binlog produces many change records. Analysis, however, mostly cares about the final state, which leads to the data retraction problem: an order placed at 10:00 and canceled at 13:00 should be subtracted from the 10:00 figures (a minimal sketch of how an OLAP engine can absorb such a retraction follows this list).</p></li><li><p>Business integration: business analysis usually cannot be expressed from a single subject; many tables must be joined to obtain the desired information. Aligning the confluence of several streams in real time often requires a large cache and is complicated.</p></li><li><p>Analysis is batch, while processing is streaming: no analysis can be formed from a single record, so the object of analysis must be a batch, even though records are processed one by one.</p></li></ul>
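<p>To make the retraction problem concrete, here is a minimal sketch of one common way an OLAP engine such as Doris can absorb a cancellation: write a compensating negative record into an Aggregate-model table, so the 10:00 bucket is corrected in place. This example is illustrative only; the table and column names are hypothetical, not Meituan's production schema.</p><pre><code>-- Hypothetical Aggregate-model table: order metrics summed per time bucket.
CREATE TABLE dw.order_stats (
    dt          DATE          NOT NULL,
    time_bucket DATETIME      NOT NULL,   -- e.g. 2021-06-01 10:00:00
    shop_id     BIGINT        NOT NULL,
    order_cnt   BIGINT        SUM DEFAULT "0",
    order_amt   DECIMAL(27,9) SUM DEFAULT "0"
) AGGREGATE KEY (dt, time_bucket, shop_id)
DISTRIBUTED BY HASH(shop_id) BUCKETS 10;

-- 10:00 - the order is placed.
INSERT INTO dw.order_stats VALUES ('2021-06-01', '2021-06-01 10:00:00', 1001, 1, 25.0);

-- 13:00 - the cancellation arrives; retract it from the 10:00 bucket
-- by writing a compensating negative record under the same key.
INSERT INTO dw.order_stats VALUES ('2021-06-01', '2021-06-01 10:00:00', 1001, -1, -25.0);
</code></pre><p>The same effect can also be achieved with the Unique model by overwriting the order's latest state, as discussed in the real-time OLAP solution section later in this article.</p>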
<p>Log-type and business-type scenarios generally coexist and are intertwined. Neither the Lambda nor the Kappa architecture alone solves every problem, so it is more meaningful to choose the architecture and practice per scenario.</p><h1>Architecture Design of the Real-time Data Warehouse</h1><h3 id="1real-time-architecture-exploration-of-stream-batch-combination">1. Real-time Architecture: Exploring a Stream-Batch Combination</h3><p>Based on the problems above, our own thinking is to handle different business scenarios through a combination of streaming and batch.</p><p>As shown in the figure above, data is collected from logs into a message queue and then goes through the ETL of the data stream; the construction of the basic data stream is unified. On top of that, real-time stream computing serves log-based real-time features and real-time dashboard applications, while real-time OLAP batch processing serves Binlog-based business analysis.</p><p>What are the pain points of analyzing business data with stream processing? For normal-form business data, both Storm and Flink need large external state to align the business across data streams, which costs a lot of computing resources. Because external state is limited, a window-limiting strategy must be applied, which may end up discarding some data. The computed results are generally stored in Redis to support queries, and KV storage has many limitations in analytical query scenarios.</p><p>How can real-time OLAP be achieved? Is there a real-time computing engine with its own storage, so that as real-time data arrives it can compute flexibly and freely within a certain range, carry a certain volume of data, and serve analytical queries at the same time? With the development of technology, today's MPP engines are evolving rapidly and their performance is improving just as fast, so a new possibility opens up in this scenario, namely the Doris engine we use here.</p><p>This idea has already been practiced in the industry and has become an important direction of exploration, for example Alibaba's real-time OLAP solution based on ADB.</p><h3 id="2architecture-design-of-real-time-data-warehouse">2. Architecture Design of the Real-time Data Warehouse</h3><p>Looking at the real-time data warehouse architecture as a whole, the first things to consider are how to manage all the real-time data, how to integrate resources effectively, and how to construct the data.</p><p>Methodologically, real-time and offline warehouses are very similar. In its early stage, the offline warehouse was also built case by case; governance became the question once the data reached a certain scale. We all know that layering is a very effective way of governing data.
So, for managing the real-time data warehouse, the first consideration is likewise hierarchical processing logic, as follows:</p><ul><li><p>Data sources: at this level, offline and real-time sources are identical, mainly divided into log data and business data. Log data includes user logs, DB logs, and server logs.</p></li><li><p>Real-time detail layer: at the detail level, unified construction solves the problem of repeated building. Using the offline warehouse's modeling approach, a unified basic detail layer is built and managed by subject. The purpose of the detail layer is to provide directly usable data downstream, so basic processing such as cleaning, filtering, and dimension expansion is done uniformly here.</p></li><li><p>Aggregation layer: the summary layer can compute results directly with concise Flink or Storm operators, forming summary indicators. All indicators are processed at this layer and managed and built under unified specifications, forming reusable summary results.</p></li></ul><p>In short, when building the whole real-time data warehouse, first layer the data construction, build the framework, and set the specifications: to what extent each layer is processed and how each layer is used. Defined specifications enable standardized production. Because timeliness must be guaranteed, do not design too many layers. For scenarios with high real-time requirements you can basically follow the left side of the figure above; for batch-processing requirements, import from the real-time detail layer into the real-time OLAP engine and perform fast retraction computation using the OLAP engine's own compute and query capabilities, as shown in the data flow on the right side of the figure above.</p><h1>Real-time Platform Construction</h1><p>Once the architecture is settled, the next consideration is how to build the platform. Construction of the real-time platform attaches entirely to the management of the real-time data warehouse.</p><p>First, abstract the functions into components so that production can be standardized and systematic guarantees can be built on top. Basic-layer functions such as cleaning, filtering, confluence, dimension expansion, conversion, encryption, and screening can all be abstracted, and the base layer builds directly usable result streams in this componentized way. How to meet diverse needs and stay compatible with users are problems we must figure out, and redundant processing may occur. In storage terms, real-time data keeps no history and does not consume much storage, so this redundancy is acceptable: production efficiency is improved through redundancy, an application of trading space for time.</p><p>Through base-layer processing, all data is deposited into the IDL (detail) layer and simultaneously written to the base layer of the OLAP engine; the real-time summary layer is then computed on Storm, Flink, or Doris, producing multi-dimensional summary indicators that form a unified summary layer for unified storage and distribution.</p>
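<p>As an illustration of what a summary-layer indicator can look like once the detail layer already sits in the OLAP engine, here is a minimal sketch computing order counts and revenue per 5-minute window. The table and column names are hypothetical, not from the original article.</p><pre><code>-- Hypothetical summary-layer indicator over a detail-layer table dwd.order_detail:
-- orders and revenue per city per 5-minute window for the current day.
SELECT
    city_id,
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(order_time) / 300) * 300) AS window_start,
    COUNT(*)       AS order_cnt,
    SUM(order_amt) AS order_amt
FROM dwd.order_detail
WHERE dt = CURDATE()
GROUP BY city_id, FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(order_time) / 300) * 300);
</code></pre>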
<p>When these functions are available, system capabilities such as metadata management, indicator management, data security, SLA, and data quality can be built up gradually.</p><h3 id="1real-time-base-layer-functions">1. Real-time base layer functions</h3><p>Building the real-time base layer has to solve several problems.</p><p>The first is repeated reading of a stream. A Binlog arrives as a per-database package, while a user may need only one of its tables; if everyone consumes the raw stream, everyone must access the whole package. The solution is to deconstruct the stream by business, restore it into basic data streams, shape it into a normal-form structure according to business needs, and integrate it with subject-oriented construction following data warehouse modeling methods.</p><p>Second, components need to be encapsulated, e.g., base-layer cleaning, filtering, and dimension expansion, so that users can express logic with very simple expressions. The transformation part is more flexible, e.g., converting one value into another; for such custom logic we also provide custom components that accept Java or Python scripts for data processing.</p><h3 id="2real-time-feature-production-capabilities">2. Real-time feature production capabilities</h3><p>Feature production can be expressed logically in SQL; the underlying logic is adapted and passed through transparently to the computing engine, shielding users from any dependence on it. Offline is similar: nowadays large companies rarely develop such logic in code except in special cases, so it can basically be expressed in SQL.</p><p>At the functional level, the idea of indicator management is built in: atomic indicators, derived indicators, standard calculation calibers, dimension selection, window settings, and other operations can all be configured, so production logic can be parsed and packaged uniformly.</p><p>Another problem: many SQL jobs are written against the same source, and every submission spawns another data stream, wasting resources. Our solution is to produce dynamic metrics on a shared data stream, so metrics can be added dynamically without stopping the service.</p><p>So, while building the real-time platform, engineers should keep asking how to use resources more effectively and which links can be run more economically.</p><h3 id="3sla-construction">3. SLA construction</h3><p>SLA mainly solves two problems: end-to-end SLA and job-level SLA. We adopt a tracking-point-plus-reporting approach. Because the real-time streams are large, the tracking points should be as lightweight as possible: do not embed too much, just enough to express the business information. Each job reports its output to the SLA monitoring platform through a unified interface, and the end-to-end SLA can then be computed from the per-job reports.</p><p>In real-time production the pipeline is very long and not every link can be controlled, but each job can control its own efficiency, so job-level SLA is also essential.</p><h3 id="4real-time-olap-solution">4. Real-time OLAP solution</h3><p>Problems:</p><ul><li><p>Binlog business restoration is complex: the business changes frequently and must be reconstructed as of particular points in time, which requires sorting and buffering and consumes a lot of memory and CPU.</p></li><li><p>Binlog business association is complex: expressing business logic through stream-to-stream joins is very difficult in stream computing.</p></li></ul><p>Solution:</p><p>Solve the problem with an OLAP engine that has its own computing power: there is no need to map the logic onto a data stream; only real-time, stable ingestion needs to be solved.</p><p>We use Doris as the high-performance OLAP engine here. Because derived computations are needed between business data and intermediate results, Doris can quickly restore the business state using its Unique or Aggregate data model, aggregate at the summary layer while restoring the business, and is designed for reuse. The application layer can be a physical table or a logical view.</p><p>This mode focuses on business rollback computation: when the business state changes, a value at some point in history must be changed. Doing this with stream computing is very expensive; the OLAP mode solves it well.</p>
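<p>A minimal sketch of the restoration idea, assuming a hypothetical order stream: with a Unique-model table keyed on the order ID, each Binlog change simply overwrites the row, so the table always holds the latest state of every order and rollback-style analysis becomes an ordinary query. The schema is illustrative, not the article's actual design.</p><pre><code>-- Hypothetical Unique-model table: each Binlog change upserts the row for its order,
-- so the table converges to the latest state of every order.
CREATE TABLE dw.orders (
    order_id     BIGINT      NOT NULL,
    shop_id      BIGINT,
    order_status VARCHAR(32),        -- e.g. CREATED / PAID / CANCELED
    order_amt    DECIMAL(27,9),
    update_time  DATETIME
) UNIQUE KEY (order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 16;

-- Analysis over the end state is then an ordinary aggregation:
SELECT shop_id, SUM(order_amt) AS paid_amt
FROM dw.orders
WHERE order_status = 'PAID'
GROUP BY shop_id;
</code></pre>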
<h1>Real-time Use Cases</h1><p>Finally, a case to illustrate. Suppose merchants want to offer discounts based on the number of orders a user has placed historically, so a merchant needs to see how many orders each user has placed: historical T+1 data plus today's real-time data. This is a typical Lambda scenario. In Doris you can design a partitioned table with a historical partition and a today partition: the historical partition is produced offline, today's indicators are computed in real time and written into the today partition, and a query simply aggregates across both. A minimal sketch of this table follows.</p>
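<p>A minimal sketch, with hypothetical table, column, and partition names, of the partition layout described above:</p><pre><code>-- Hypothetical Lambda-style table: a historical range partition (backfilled offline, T+1)
-- plus a today partition that real-time jobs write into.
CREATE TABLE dw.user_shop_order_cnt (
    dt        DATE   NOT NULL,
    shop_id   BIGINT NOT NULL,
    user_id   BIGINT NOT NULL,
    order_cnt BIGINT SUM DEFAULT "0"
) AGGREGATE KEY (dt, shop_id, user_id)
PARTITION BY RANGE (dt) (
    PARTITION p_history VALUES LESS THAN ('2021-06-01'),   -- produced offline
    PARTITION p_today   VALUES LESS THAN ('2021-06-02')    -- written in real time
)
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

-- Querying merges history and today with a simple aggregation:
SELECT user_id, SUM(order_cnt) AS total_orders
FROM dw.user_shop_order_cnt
WHERE shop_id = 1001
GROUP BY user_id;
</code></pre>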
<p>This scenario looks simple, but the difficulty is that many simple problems become complicated once the number of merchants grows. In the future we will therefore use more business input to accumulate more business scenarios, abstract them into unified production plans and functions, and support diversified business needs with minimal real-time computing resources; that is what remains to be achieved.</p><p>That's all for today, thank you.</p><h3 id="about-the-author">About the author</h3><p>Zhu Liang: more than 5 years of data warehouse experience in traditional industries and 6 years of Internet data warehouse experience. His technical areas cover offline and real-time data warehouse governance, systematic capability building, OLAP systems and engines, and big data technologies in general, with a focus on OLAP and the frontier of real-time technology. His business areas include ad hoc query, operations analysis, strategy reporting products, user profiling, audience recommendation, and experiment evaluation.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Best practice of Apache Doris in Xiaomi Group]]></title>
<id>https://doris.apache.org/zh-CN/blog/xiaomi</id>
<link href="https://doris.apache.org/zh-CN/blog/xiaomi"/>
<updated>2022-07-20T00:00:00.000Z</updated>
<summary type="html"><![CDATA[In order to improve the query performance of the Xiaomi growth analysis platform and reduce the operation and maintenance costs, Xiaomi Group introduced Apache Doris in September 2019. In the past two and a half years, **Apache Doris has been widely used in Xiaomi Group,** **such as business growth analytic platform, realtime dashboards for all business groups, finance analysis, user profile analysis, advertising reports, A/B testing platform and so on.** This article will share the best practice of Apache Doris in Xiaomi Group.]]></summary>
<content type="html"><![CDATA[<h1>Background</h1><p>In order to improve the query performance of the Xiaomi growth analysis platform and reduce the operation and maintenance costs, Xiaomi Group introduced Apache Doris in September 2019. In the past two and a half years, <strong>Apache Doris has been widely used in Xiaomi Group,</strong> <strong>such as business growth analytic platform, realtime dashboards for all business groups, finance analysis, user profile analysis, advertising reports, A/B testing platform and so on.</strong> This article will share the best practice of Apache Doris in Xiaomi Group. </p><h1>Business Practice</h1><p>The typical business practices of Apache Doris in Xiaomi are as follows:</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="01-user-access">01 User Access<a href="#01-user-access" class="hash-link" aria-label="01 User Access的直接链接" title="01 User Access的直接链接"></a></h2><p>Data Factory is a one-stop data development platform developed by Xiaomi for data developers and data analysts. This platform supports data sources such as Doris, Hive, Kudu, Iceberg, ES, Talso, TiDB, MySQL, etc. It also supports computing engines such as Flink, Spark, Presto,etc.</p><p>Inside Xiaomi, users need to access the Doris service through the data factory. Users need to register in the data factory and complete the approval for building the database. The Doris operation and maintenance classmates will connect according to the descriptions of the business scenarios and data usage expectations submitted by users in the data factory. After completing the access approval, users can use the Doris service to perform operations such as visual table creation and data import in the data factory.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="02-data-import">02 Data import<a href="#02-data-import" class="hash-link" aria-label="02 Data import的直接链接" title="02 Data import的直接链接"></a></h2><p>In Xiaomi's business, the two most common ways to import data into Doris are Stream Load and Broker Load. User data will be divided into real-time data and offline data, and users' real-time and offline data will generally be written to Talos first (Talos is a distributed, high-throughput message queue developed by Xiaomi). The offline data from Talos will be sink to HDFS, and then imported to Doris through the data factory. Users can directly submit Broker Load tasks in the data factory to import large batches of data on HDFS into Doris, In addition, you can run the SparkSQL command in the data factory to query data from Hive, Import the data found in SparkSQL into Doris through Spark-doris-Connector, and encapsulate Stream Load at the bottom layer of Spark-doris-Connector. Real-time data from Talos is generally imported into Doris in two ways. One is to first perform ETL on the data through Flink, and then import small batches of data to Doris through.Flink- Doris-connector encapsulates the Stream Load at the bottom layer. Another way is to import small batches of data into Doris through Stream Load encapsulated by Spark Streaming at regular intervals.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="03-data-query">03 Data Query<a href="#03-data-query" class="hash-link" aria-label="03 Data Query的直接链接" title="03 Data Query的直接链接"></a></h2><p>Doris users of Xiaomi generally analyze and query Doris and display the results through the ShuJing platform.ShuJing is a general-purpose BI analysis tool developed by Xiaomi. 
<h2 id="03-data-query">03 Data Query</h2><p>Xiaomi's Doris users generally analyze and query Doris and visualize the results through the ShuJing platform, a general-purpose BI analysis tool developed by Xiaomi. Through ShuJing, users can query and visualize Doris and perform user behavior analysis (we added corresponding UDFs and UDAFs to Doris to support event analysis, retention analysis, funnel analysis, path analysis, and other behavior analysis needs) as well as user profile analysis.</p><h2 id="04-compaction-tuning">04 Compaction Tuning</h2><p>In Doris, each data import generates a data version under the affected data shards (Tablets) of the storage layer, and the Compaction mechanism asynchronously merges the smaller versions produced by imports (for details of the Compaction mechanism, refer to the earlier article "Doris Compaction Mechanism Analysis").</p><p>Xiaomi has many high-frequency, high-concurrency, near-real-time import scenarios that generate large numbers of small versions in a short time. If Compaction does not merge these versions promptly, versions accumulate: too many small versions increase the pressure on metadata on one hand and hurt query performance on the other. In Xiaomi's usage, many tables use the Unique and Aggregate data models, whose query performance depends heavily on timely Compaction. In one of our business scenarios, delayed version merging reduced query performance by tens of times and affected online services. Compaction itself consumes CPU, memory, and disk I/O; too much Compaction occupies too many machine resources, affects query performance, and may cause OOM.</p><p><strong>To address these Compaction problems, we first start from the business side and guide users as follows:</strong></p><ul><li><p>Set reasonable partitions and buckets for tables to avoid generating too many data shards.</p></li><li><p>Standardize data import: reduce import frequency and increase the amount of data per import to reduce Compaction pressure.</p></li><li><p>Avoid overusing delete operations. A delete operation generates a delete version under the affected data shards, and a Cumulative Compaction task is truncated when it meets a delete version: it can only merge the versions after the Cumulative Point and before the delete version, then moves the Cumulative Point past the delete version and leaves the delete version to a later Base Compaction task. Overusing delete therefore creates many delete versions under a Tablet and slows down Cumulative Compaction's version merging. A delete does not actually remove data from disk; it records the delete condition in the delete version, and deleted rows are filtered out at query time by Merge-On-Read. Only after a Base Compaction task merges the delete version can the deleted data be cleaned from disk, together with the stale rowsets, as expired data. 
If you need to delete an entire partition, use the truncate-partition operation instead of a delete (see the sketch after this list).</p></li></ul>
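<p>As a small illustration of that last point, with hypothetical table and partition names:</p><pre><code>-- Prefer truncating the partition: no delete version is created.
TRUNCATE TABLE example_db.user_events PARTITION (p20220701);

-- Avoid this for whole-partition cleanup: it creates a delete version that
-- blocks Cumulative Compaction until Base Compaction merges it.
DELETE FROM example_db.user_events PARTITION p20220701 WHERE dt = '2022-07-01';
</code></pre>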
<p><strong>Second, we tuned Compaction from the operations side:</strong></p><ul><li><p>Configure different Compaction parameters (strategy, thread count, etc.) for different clusters according to their business scenarios.</p></li><li><p>Appropriately lower the priority of Base Compaction tasks and raise the priority of Cumulative Compaction tasks: Base Compaction takes long to execute and has a serious write-amplification problem, while Cumulative Compaction executes fast and can quickly merge large numbers of small versions.</p></li><li><p>Version-backlog alerting and dynamic adjustment of Compaction parameters. When the Compaction producer creates Compaction tasks it updates a corresponding metric that records the largest Compaction Score on each BE node; the trend of this metric can be watched in Grafana to judge whether versions are piling up, and we added an alert on version backlog. To make parameter adjustment easier, we changed the code to support adjusting the Compaction strategy and the number of Compaction threads dynamically at runtime, so the process no longer needs a restart when Compaction parameters change.</p></li><li><p>Support manually triggering Compaction for a specified table, or for the data shards under a specified partition, and raising their Compaction priority.</p></li></ul><h1>Monitoring and Alarm Management</h1><h2 id="01-monitoring-system">01 Monitoring System</h2><p>Prometheus periodically pulls metrics from Doris's FE and BE and displays them in Grafana dashboards. Service metadata based on the QingZhou warehouse is automatically registered in ZooKeeper, and Prometheus regularly pulls the latest cluster metadata from ZooKeeper so that the Grafana dashboards update dynamically. (The QingZhou data warehouse is built by the QingZhou platform from the operating data of all of Xiaomi's big data services; it consists of 2 base tables and 30+ dimension tables covering whole-process data such as resources, server CMDB, cost, and process status of running big data components.) In Grafana we also added statistics and display boards for common troubleshooting data, such as the Doris big-query list, real-time write volume, and the number of data import transactions, so that alarms can be cross-referenced. When a cluster becomes abnormal, Doris operations engineers can locate the cause of the failure in the shortest possible time.</p><h2 id="02--falcon">02 Falcon</h2><p>Falcon is a monitoring and alarm system widely used inside Xiaomi. Because Doris provides a fairly complete metrics interface that works well with Prometheus and Grafana, we use only Falcon's alarm function for the Doris service. For faults of different severity we define three alarm levels, P0, P1, and P2:</p><ul><li><p>P2 (low): single-node failure. When a single node's metrics or process status is abnormal, a P2 alarm is generally issued. The alarm is sent to the alarm group as Xiaomi Office messages. (Xiaomi Office is a privately deployed version of ByteDance's Feishu inside Xiaomi, with similar functionality.)</p></li><li><p>P1 (higher): short-lived anomalies, such as increased query latency or abnormal writes lasting less than 3 minutes, trigger a P1 alarm. The alarm is sent to the alarm group as Xiaomi Office messages, and an on-call engineer must respond and give feedback.</p></li><li><p>P0 (high): anomalies such as increased query latency or abnormal writes lasting more than 3 minutes trigger a P0 alarm, sent both as Xiaomi Office messages and as phone calls. An on-call engineer must respond within 1 minute and coordinate resources for failure recovery and the postmortem.</p></li></ul><h2 id="03--cloud-doris">03 Cloud-Doris</h2><p>Cloud-Doris is a data collection component developed by Xiaomi for the internal Doris service. 
Its main capabilities are detecting the availability of the Doris service and collecting the cluster metrics of internal concern. For example, Cloud-Doris periodically simulates user reads and writes against Doris to detect service availability, and any availability anomaly is alerted through Falcon. It also collects users' read and write volumes to generate user bills, and collects information such as table-level data volume, unhealthy replicas, and oversized Tablets, alerting on anomalies through Falcon.</p><h2 id="04-qingzhou-inspection">04 QingZhou Inspection</h2><p>For chronic hidden risks such as capacity, user growth, and resource allocation, we use the unified QingZhou big data service inspection platform for inspection and reporting. An inspection generally consists of two parts: service-specific inspections and basic-indicator inspections. Service-specific inspections cover indicators unique to each big data service; for Doris they mainly include quota, the number of shard replicas, the number of columns in a single table, the number of table partitions, and so on. Inspection catches the chronic risks that are hard to alert on in advance, which supports failure-free major holidays.</p><h1>Failure Recovery</h1><p>When an online cluster fails, the first principle is to restore service quickly. If the cause is clear, handle it accordingly and restore service; if the cause is unclear, keep snapshots and try restarting the processes as soon as possible to restore service.</p><h2 id="01-access-failures-handling">01 Access Failure Handling</h2><p>Doris uses Xiaomi LVS as its access layer, which is similar to an open-source or public-cloud LB service and provides layer-4/layer-7 traffic load scheduling. With Doris bound to suitable ports, a single abnormal FE node is generally kicked out automatically and service recovers without users noticing, while an alarm is raised for the abnormal node. For FE faults that cannot be fixed quickly, we first set the faulty node's weight to 0 or remove it from LVS, to prevent unpredictable problems caused by abnormal process detection.</p><h2 id="02-node-failure-handling">02 Node Failure Handling</h2><p>For an FE node failure whose cause cannot be located quickly, it is generally necessary to keep a thread snapshot and a memory snapshot and then restart the process. Save a thread snapshot of the FE with:</p><pre><code>jstack PID &gt;&gt; snapshot_file.jstack</code></pre><p>Save a memory snapshot of the FE with:</p><pre><code>jmap -dump:live,format=b,file=snapshot_file.heap PID</code></pre><p>During a version upgrade, or in some unexpected scenarios, the image of an FE node may contain abnormal metadata, and that abnormal metadata may be synchronized to the other FEs, leaving all FEs unable to work. Once a faulty image is discovered, the fastest recovery option is to use Recovery mode to stop FE elections and replace the faulty image with a backup image. Of course, keeping image backups at all times is not easy. Since this failure typically appears during cluster upgrades, we recommend adding simple local image backup logic to the cluster upgrade procedure, ensuring that a copy of the latest image data is retained before each upgrade starts the FE process. For a BE node failure: if the process crashes, a core file is generated and minos automatically restarts the process; if a task is stuck, keep a thread snapshot with the following command and then restart the process:</p><pre><code>pstack PID &gt;&gt; snapshot_file.pstack</code></pre><h1>Concluding Remarks</h1><p>Since Xiaomi Group first adopted the open-source Apache Doris in September 2019, it has come into wide use at Xiaomi: it now serves dozens of Xiaomi businesses across dozens of clusters and hundreds of nodes, and a data ecosystem centered on Apache Doris has formed within Xiaomi. To improve operational efficiency, Xiaomi has also built a complete automated management and operations system around Doris. As the number of services grows, Doris has exposed some problems: for example, earlier versions lacked a good resource isolation mechanism, so services could affect each other, and system monitoring needs further improvement. With the rapid development of the community, more and more contributors are taking part: the vectorized engine has been reworked, the query optimizer overhaul is in full swing, and Apache Doris is steadily maturing.</p>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Best Practice" term="Best Practice"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris announced the official release of version 1.1.0]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.1.0</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.1.0"/>
<updated>2022-07-14T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community, we are pleased to announce that we have officially released Apache Doris 1.1 on July 14, 2022! This is the first release version after Apache Doris graduated from the Apache incubator and became an Apache Top-Level Project.]]></summary>
<content type="html"><![CDATA[<p>Dear community, we are pleased to announce that we have officially released Apache Doris 1.1 on July 14, 2022! This is the first release version after Apache Doris graduated from the Apache incubator and became an Apache Top-Level Project.</p><p>In version 1.1, we realized the full vectorization of the computing layer and storage layer, and officially enabled the vectorized execution engine as a stable function. All queries are executed by the vectorized execution engine by default, and the performance is 3-5 times higher than the previous version. It increases the ability to access the external tables of Apache Iceberg and supports federated query of data in Doris and Iceberg, and expands the analysis capabilities of Apache Doris on the data lake; on the basis of the original LZ4, the ZSTD compression algorithm is added , further improves the data compression rate; fixed many performance and stability problems in previous versions, greatly improving system stability. Downloading and using is recommended.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="upgrade-notes">Upgrade Notes<a href="#upgrade-notes" class="hash-link" aria-label="Upgrade Notes的直接链接" title="Upgrade Notes的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="the-vectorized-execution-engine-is-enabled-by-default">The vectorized execution engine is enabled by default<a href="#the-vectorized-execution-engine-is-enabled-by-default" class="hash-link" aria-label="The vectorized execution engine is enabled by default的直接链接" title="The vectorized execution engine is enabled by default的直接链接"></a></h3><p>In version 1.0, we introduced the vectorized execution engine as an experimental feature and Users need to manually enable it when executing queries by configuring the session variables through <code>set batch_size = 4096</code> and <code>set enable_vectorized_engine = true</code> .</p><p>In version 1.1, we officially fully enabled the vectorized execution engine as a stable function. The session variable <code>enable_vectorized_engine</code> is set to true by default. All queries are executed by default through the vectorized execution engine.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="be-binary-file-renaming">BE Binary File Renaming<a href="#be-binary-file-renaming" class="hash-link" aria-label="BE Binary File Renaming的直接链接" title="BE Binary File Renaming的直接链接"></a></h3><p>BE binary file has been renamed from palo_be to doris_be . Please pay attention to modifying the relevant scripts if you used to rely on process names for cluster management and other operations.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="segment-storage-format-upgrade">Segment storage format upgrade<a href="#segment-storage-format-upgrade" class="hash-link" aria-label="Segment storage format upgrade的直接链接" title="Segment storage format upgrade的直接链接"></a></h3><p>The storage format of earlier versions of Apache Doris was Segment V1. In version 0.12, we had implemented Segment V2 as a new storage format, which introduced Bitmap indexes, memory tables, page cache, dictionary compression, delayed materialization and many other features. 
Starting from version 0.13, the default storage format for newly created tables is Segment V2, while compatibility with the Segment V1 format is maintained.</p><p>To keep the code structure maintainable and reduce the extra learning and development costs of redundant historical code, we have decided to drop support for the Segment V1 storage format from the next version. This code is expected to be removed in Apache Doris 1.2, so all users still on the Segment V1 storage format must complete the data format conversion in version 1.1. Please refer to the following link for the operation manual:</p><p><a href="https://doris.apache.org/zh-CN/docs/1.0/administrator-guide/segment-v2-usage" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/1.0/administrator-guide/segment-v2-usage</a></p><h3 id="normal-upgrade">Normal Upgrade</h3><p>For a normal upgrade, you can perform a rolling upgrade according to the cluster upgrade documentation on the official website.</p><p><a href="https://doris.apache.org/zh-CN/docs/admin-manual/cluster-management/upgrade" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/admin-manual/cluster-management/upgrade</a></p><h2 id="features">Features</h2><h3 id="support-random-distribution-of-data-experimental">Support random distribution of data [experimental]</h3><p>In some scenarios (such as log data analysis), users may not find a suitable bucket key to avoid data skew, so the system needs to provide an additional distribution method.</p><p>Therefore, when creating a table you can set <code>DISTRIBUTED BY RANDOM BUCKETS number</code> to use random distribution. On import, data is written randomly to a single tablet at a time, which reduces data fanout during loading, reduces resource overhead, and improves system stability.</p>
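<p>A minimal sketch of a log table using the new random distribution. The table definition is illustrative only; the ZSTD compression property shown here is also new in 1.1 and is described below:</p><pre><code>-- Hypothetical log table: no natural bucket key, so distribute randomly.
CREATE TABLE example_db.access_log (
    log_time DATETIME NOT NULL,
    level    VARCHAR(16),
    message  STRING
) DUPLICATE KEY (log_time)
DISTRIBUTED BY RANDOM BUCKETS 16
PROPERTIES (
    "compression" = "zstd"   -- ZSTD compression, also added in 1.1 (see below)
);
</code></pre>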
<h3 id="support-for-creating-iceberg-external-tablesexperimental">Support for creating Iceberg external tables [experimental]</h3><p>Iceberg external tables give Apache Doris direct access to data stored in Iceberg. Through them, federated queries can be run over data in local storage and in Iceberg, which saves tedious data loading work, simplifies the system architecture for data analysis, and enables more complex analysis.</p><p>In version 1.1, Apache Doris supports creating Iceberg external tables and querying their data, and supports automatically synchronizing the schemas of all tables in an Iceberg database through the REFRESH command.</p><h3 id="added-zstd-compression-algorithm">Added ZSTD compression algorithm</h3><p>Previously, the data compression method in Apache Doris was specified uniformly by the system, defaulting to LZ4. For scenarios sensitive to data storage costs, this could not meet the desired compression ratio.</p><p>In version 1.1, users can set <code>"compression"="zstd"</code> in the table properties to choose ZSTD compression when creating a table. On a 25 GB, 110-million-line text log test dataset, the compression ratio improved up to nearly 10x, 53% better than the original, and the speed of reading data from disk and decompressing it increased by 30%.</p><h2 id="improvements">Improvements</h2><h3 id="more-comprehensive-vectorization-support">More comprehensive vectorization support</h3><p>In version 1.1 we fully vectorized the compute layer and the storage layer, including:</p><ul><li>vectorization of all built-in functions;</li><li>vectorization in the storage layer, with dictionary optimization for low-cardinality string columns;</li><li>numerous performance and stability fixes to the vectorized engine.</li></ul><p>We compared the performance of Apache Doris 1.1 and 0.15 on the SSB and TPC-H standard benchmark datasets:</p><p>On all 13 SQL queries of the SSB dataset, version 1.1 beats version 0.15, with overall performance improved by about 3x, resolving the performance regressions seen in some scenarios in version 1.0.</p><p>On all 22 SQL queries of the TPC-H dataset, version 1.1 beats version 0.15, with overall performance improved by about 4.5x and some scenarios improved by more than 10x.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-1.1.0-SSB-6067d10e7f8b966be8da2b64950622fb.png" width="1280" height="554" class="img_ev3q"></p><p align="center">SSB Benchmark</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/release-note-1.1.0-TPC-H-7d6975b410de89a004c7f058226a02da.png" width="1280" height="596" class="img_ev3q"></p><p align="center">TPC-H Benchmark</p><p><strong>Performance test reports</strong></p><p><a href="https://doris.apache.org/zh-CN/docs/benchmark/ssb" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/benchmark/ssb</a></p>
href="https://doris.apache.org/zh-CN/docs/benchmark/tpch" target="_blank" rel="noopener noreferrer">https://doris.apache.org/zh-CN/docs/benchmark/tpch</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="compaction-logic-optimization-and-real-time-guarantee">Compaction logic optimization and real-time guarantee<a href="#compaction-logic-optimization-and-real-time-guarantee" class="hash-link" aria-label="Compaction logic optimization and real-time guarantee的直接链接" title="Compaction logic optimization and real-time guarantee的直接链接"></a></h3><p>In Apache Doris, each commit will generate a data version. In high concurrent write scenarios, -235 errors are prone to occur due to too many data versions and untimely compaction, and query performance will also decrease accordingly.</p><p>In version 1.1, we introduced QuickCompaction, which will actively trigger compaction when the data version increases. At the same time, by improving the ability to scan fragment metadata, it can quickly find fragments with too many data versions and trigger compaction. Through active triggering and passive scanning, the real-time problem of data merging is completely solved.</p><p>At the same time, for high-frequency small file cumulative compaction, the scheduling and isolation of compaction tasks is implemented to prevent the heavyweight base compaction from affecting the merging of new data.</p><p>Finally, for the merging of small files, the strategy of merging small files is optimized, and the method of gradient merging is adopted. Each time the files participating in the merging belong to the same data magnitude, it prevents versions with large differences in size from merging, and gradually merges hierarchically. , reducing the number of times a single file participates in merging, which can greatly save the CPU consumption of the system.</p><p>When the data upstream maintains a write frequency of 10w per second (20 concurrent write tasks, 5000 rows per job, and checkpoint interval of 1s), version 1.1 behaves as follows:</p><ul><li><p>Quick data consolidation: Tablet version remains below 50 and compaction score is stable. Compared with the -235 problem that frequently occurred during high concurrent writing in the previous version, the compaction merge efficiency has been improved by more than 10 times.</p></li><li><p>Significantly reduced CPU resource consumption: The strategy has been optimized for small file Compaction. In the above scenario of high concurrent writing, CPU resource consumption is reduced by 25%;</p></li><li><p>Stable query time consumption: The overall orderliness of data is improved, and the fluctuation of query time consumption is greatly reduced. The query time consumption during high concurrent writing is the same as that of only querying, and the query performance is improved by 3-4 times compared with the previous version.</p></li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="read-efficiency-optimization-for-parquet-and-orc-files">Read efficiency optimization for Parquet and ORC files<a href="#read-efficiency-optimization-for-parquet-and-orc-files" class="hash-link" aria-label="Read efficiency optimization for Parquet and ORC files的直接链接" title="Read efficiency optimization for Parquet and ORC files的直接链接"></a></h3><p>By adjusting arrow parameters, arrow's multi-threaded read capability is used to speed up Arrow's reading of each row_group, and it is modified to SPSC model to reduce the cost of waiting for the network through prefetching. 
After this optimization, Parquet file import performance improved by 4 to 5 times.</p><h3 id="safer-metadata-checkpoint">Safer metadata checkpoint</h3><p>By double-checking the image files generated after a metadata checkpoint and retaining historical image files, the problem of metadata corruption caused by faulty image files is solved.</p><h2 id="bugfix">Bugfix</h2><h3 id="fix-the-problem-that-the-data-cannot-be-queried-due-to-the-missing-data-versionserious">Fix data being unqueryable due to missing data versions (Serious)</h3><p>This issue was introduced in version 1.0 and could cause data versions to be lost across multiple replicas.</p><h3 id="fix-the-problem-that-the-resource-isolation-is-invalid-for-the-resource-usage-limit-of-loading-tasks-moderate">Fix resource isolation being ineffective for the resource usage limits of load tasks (Moderate)</h3><p>In 1.1, broker load and routine load use only Backends carrying the specified resource tags to perform the load.</p><h3 id="use-http-brpc-to-transfer-network-data-packets-over-2gb-moderate">Use HTTP BRPC to transfer network data packets over 2 GB (Moderate)</h3><p>In the previous version, when data transmitted between Backends through BRPC exceeded 2 GB,
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="use-http-brpc-to-transfer-network-data-packets-over-2gb-moderate">Use HTTP BRPC to transfer network data packets over 2GB (Moderate)<a href="#use-http-brpc-to-transfer-network-data-packets-over-2gb-moderate" class="hash-link" aria-label="Use HTTP BRPC to transfer network data packets over 2GB (Moderate)的直接链接" title="Use HTTP BRPC to transfer network data packets over 2GB (Moderate)的直接链接"></a></h3><p>In previous versions, when the data transmitted between Backends through BRPC exceeded 2GB, transmission errors could occur; such large packets are now transferred via HTTP BRPC.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="others">Others<a href="#others" class="hash-link" aria-label="Others的直接链接" title="Others的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="disabling-mini-load">Disabling Mini Load<a href="#disabling-mini-load" class="hash-link" aria-label="Disabling Mini Load的直接链接" title="Disabling Mini Load的直接链接"></a></h3><p>The <code>/_load</code> interface is disabled by default; please use the <code>/_stream_load</code> interface instead. You can still re-enable it by setting the FE configuration item <code>disable_mini_load</code> to false.</p>
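<p>A minimal sketch of flipping that switch back, assuming your build allows changing this item at runtime; otherwise set it in fe.conf and restart the FE:</p><pre><code class="language-sql">-- Not recommended: the Mini Load interface is removed entirely in 1.2.
ADMIN SET FRONTEND CONFIG ("disable_mini_load" = "false");
</code></pre>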
<p>The Mini Load interface will be removed completely in version 1.2.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="completely-disable-the-segmentv1-storage-format">Completely disable the SegmentV1 storage format<a href="#completely-disable-the-segmentv1-storage-format" class="hash-link" aria-label="Completely disable the SegmentV1 storage format的直接链接" title="Completely disable the SegmentV1 storage format的直接链接"></a></h3><p>Creating new data in the SegmentV1 format is no longer allowed; existing data can still be accessed normally.
You can use the <code>ADMIN SHOW TABLET STORAGE FORMAT</code> statement to check whether data in SegmentV1 format still exists in the cluster, and convert it to SegmentV2 with the storage format conversion command.</p>
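<p>A minimal sketch of the check-and-convert flow; the table name is a placeholder, and the conversion shown uses the storage-format schema-change property:</p><pre><code class="language-sql">-- 1. Check whether any tablet in the cluster still stores SegmentV1 data.
ADMIN SHOW TABLET STORAGE FORMAT;

-- 2. Convert a table that still holds V1 segments to SegmentV2.
ALTER TABLE example_db.example_tbl SET ("storage_format" = "v2");
</code></pre>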
<p>Access to SegmentV1 data will no longer be supported in version 1.2.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="limit-the-maximum-length-of-string-type">Limit the maximum length of String type<a href="#limit-the-maximum-length-of-string-type" class="hash-link" aria-label="Limit the maximum length of String type的直接链接" title="Limit the maximum length of String type的直接链接"></a></h3><p>In previous versions, the String type allowed a maximum length of 2GB.
In version 1.1, the maximum length of the String type is limited to 1MB, and longer strings can no longer be written.
At the same time, using the String type as a partitioning or bucketing column of a table is no longer supported.</p><p>String data that has already been written can still be accessed normally.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="fix-fastjson-related-vulnerabilities">Fix fastjson related vulnerabilities<a href="#fix-fastjson-related-vulnerabilities" class="hash-link" aria-label="Fix fastjson related vulnerabilities的直接链接" title="Fix fastjson related vulnerabilities的直接链接"></a></h3><p>Updated the bundled Canal version to fix a fastjson security vulnerability.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="added-admin-diagnose-tablet-command">Added <code>ADMIN DIAGNOSE TABLET</code> command<a href="#added-admin-diagnose-tablet-command" class="hash-link" aria-label="added-admin-diagnose-tablet-command的直接链接" title="added-admin-diagnose-tablet-command的直接链接"></a></h3><p>This command is used to quickly diagnose problems with a specified tablet.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="download-to-use">Download to Use<a href="#download-to-use" class="hash-link" aria-label="Download to Use的直接链接" title="Download to Use的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="download-link">Download Link<a href="#download-link" class="hash-link" aria-label="Download Link的直接链接" title="Download Link的直接链接"></a></h3><p><a href="https://doris.apache.org/download" target="_blank" rel="noopener noreferrer">https://doris.apache.org/download</a></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="feedback">Feedback<a href="#feedback" class="hash-link" aria-label="Feedback的直接链接" title="Feedback的直接链接"></a></h3><p>If you encounter any problems during use, please feel free to contact us anytime through the GitHub discussion forum or the dev mailing list.</p><p>GitHub Forum: <a href="https://github.com/apache/doris/discussions" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris/discussions</a></p><p>Mailing list: <a href="mailto:dev@doris.apache.org">dev@doris.apache.org</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="thanks">Thanks<a href="#thanks" class="hash-link" aria-label="Thanks的直接链接" title="Thanks的直接链接"></a></h2><p>Thanks to everyone who has contributed to this release:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@adonis0147</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@airborne12</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@amosbird</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@aopangzi</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line"
style="color:#F8F8F2"><span class="token plain">@arthuryangcs</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@awakeljw</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@BePPPower</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@BiteTheDDDDt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@bridgeDream</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@caiconghui</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@cambyzju</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ccoffline</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@chenlinzhong</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@daikon12</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@DarvenDuan</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dataalive</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dataroaring</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@deardeng</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Doris-Extras</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@emerkfu</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">@EmmyMiao87</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@englefly</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Gabriel39</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@GoGoWen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@gtchaos</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@HappenLee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@hello-stephen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Henry2SS</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@hewei-nju</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@hf200012</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jacktengg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jackwener</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Jibing-Li</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@JNSimba</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@kangshisen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Kikyou1997</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">@kylinmac</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Lchangliang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@leo65535</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liaoxin01</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liutang123</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lovingfeel</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luozenglin</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luwei16</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luzhijing</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@mklzl</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morningman</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@morrySnow</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@nextdreamblue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Nivane</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@pengxiangyu</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@qidaye</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">@qzsee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SaintBacchus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SleepyBear96</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@smallhibiscus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@spaces-X</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@stalary</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@starocean999</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@steadyBoy</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SWJTU-ZhangLei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Tanya-W</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@tarepanda1024</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@tianhui5</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Userwhite</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangbo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangyf0555</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@weizuo93</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">@whutpencil</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wsjz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wunan1210</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xiaokang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xinyiZzz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xlwh</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xy720</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yangzhg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Yankee24</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yiguolei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yinzhijian</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yixiutt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zbtzbtzbt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zenoyang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhangstar333</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhangyifan27</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain">@zhannngchen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhengshengjun</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhengshiJ</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zingdle</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zuochunwei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zy-kkk</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Announcing open source realtime analytical database Apache Doris as a top-level project]]></title>
<id>https://doris.apache.org/zh-CN/blog/Annoucing</id>
<link href="https://doris.apache.org/zh-CN/blog/Annoucing"/>
<updated>2022-06-16T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Apache Doris is a modern, high-performance and real-time analytical database based on MPP. It is well known for its high-performance and easy-to-use. It can return query results under massive data within only sub-seconds. It can support not only high concurrent point query scenarios, but also complex analysis scenarios with high throughput. Based on this, Apache Doris can be well applied in many business fields, such as multi-dimensional reporting, user portrait, ad-hoc query, real-time dashboard and so on.]]></summary>
<content type="html"><![CDATA[<p>Apache Doris is a modern, high-performance and real-time analytical database based on MPP. It is well known for its high-performance and easy-to-use. It can return query results under massive data within only sub-seconds. It can support not only high concurrent point query scenarios, but also complex analysis scenarios with high throughput. Based on this, Apache Doris can be well applied in many business fields, such as multi-dimensional reporting, user portrait, ad-hoc query, real-time dashboard and so on.</p><p>Apache Doris was first born in the Palo Project within Baidu's advertising report business and officially opened source in 2017. It was donated by Baidu to Apache foundation for incubation in July 2018, and then incubated and operated by members of the podling project management committee (PPMC) under the guidance of Apache incubator mentors.</p><p>We are very proud that Doris graduated from Apache incubator successfully. It is an important milestone. In the whole incubating period, with the guidance of Apache Way and the help of incubator mentors, we learned how to develop our project and community in Apache Way, and have achieved great growth in this process.</p><p>At present, Apache Doris community has gathered more than 300 contributors from nearly 100 enterprises in different industries, and the number of active contributors per month is close to 100. During the incubation period, Apache Doris released a total of 8 major versions and completed many major functions, including storage engine upgrade, vectorization execution engine and so on, and released 1.0 version. It is the strength of these open source contributors that makes Apache Doris achieve today's results.</p><p>At the same time, Apache Doris now has a wide range of users in China and even around the world. Up to now, Apache Doris has been applied in the production environment of more than 500 enterprises around the world. Among the top 50 Internet companies in China by market value or valuation, more than 80% are long-term users of Apache Doris, including Baidu, Meituan, Xiaomi, JD, ByteDance, Tencent, Kwai, Netease, Sina, 360 and other well-known companies. It also has rich applications in some traditional industries, such as finance, energy, manufacturing, telecommunications and other fields.</p><p>You can quickly build a simple, easy-to-use and powerful data analysis platform based on Apache Doris, which is very easy to start, and the learning cost is very low. In addition, the distributed architecture of Apache Doris is very simple, which can greatly reduce the workload of system operation and maintenance. This is also the key factor for more and more users to choose Apache Doris.</p><p>As a mature analytical database project, Apache Doris has the following advantages:</p><ul><li><p>Excellent performance: it is equipped with an efficient column storage engine, which not only reduces the amount of data scanning, but also implements an ultra-high data compression ratio. At the same time, Doris also provides a rich index structure to speed up data reading and filtering. Using the partition and bucket pruning function, Doris can support ultra-high concurrency of online service business, and a single node can support up to thousands of QPS. 
Furthermore, Apache Doris uses a vectorized execution engine to fully exploit the parallel computing power of modern CPUs, supplemented by intelligent materialized-view technology that accelerates pre-aggregation, and its query optimizer performs both rule-based and cost-based optimization. Together, these allow Doris to reach top query performance.</p></li><li><p>Easy to use: it supports ANSI SQL syntax, including single-table aggregation, sorting, and filtering, multi-table joins and subqueries, as well as complex SQL such as window functions and grouping sets. Users can extend the system through UDFs, UDAFs, and other user-defined functions. Apache Doris is also compatible with the MySQL protocol, so users can access Doris through various client tools and connect seamlessly to BI tools.</p></li><li><p>Streamlined architecture: the system has only two modules, frontend (FE) and backend (BE). FE nodes handle user request access, query plan analysis, metadata storage, and cluster management, while BE nodes handle data storage and query plan execution. It is a complete distributed database management system: users can run an Apache Doris cluster without installing any third-party management or control components, and deployment and upgrades are very simple. Either module can scale horizontally, and a cluster can grow to hundreds of nodes, storing more than 10PB of data.</p></li><li><p>Scalability and reliability: it stores multiple replicas of data, and the cluster is self-healing. Its distributed management framework automatically manages the distribution, repair, and balancing of data replicas; when replicas are damaged, the system detects and repairs them automatically. Adding a node takes only one SQL command, after which data replicas are rebalanced across nodes automatically, without manual intervention. Whether scaling out, scaling in, recovering from a single-node failure, or upgrading, the system keeps running and continues to provide stable, reliable online services.</p></li><li><p>Rich ecosystem: it provides many data synchronization methods, supports fast loading of data from localhost, Hadoop, Flink, Spark, Kafka, SeaTunnel, and other systems, and can directly access data in MySQL, PostgreSQL, Oracle, S3, Hive, Iceberg, Elasticsearch, and other systems without data replication. Data stored in Doris can also be read by Spark and Flink and output to upper-layer data applications for display and analysis.</p></li></ul><p>Graduation is not the ultimate goal; it is the starting point of a new journey. Our goal in launching Doris has always been to give more people better data analysis tools and solve their data analysis problems. Becoming an Apache top-level project is not only an affirmation of the hard work of all contributors to the Apache Doris community, but also means that we have established a strong, prosperous, and sustainable open source community under the guidance of the Apache Way. In the future, we will continue to operate the community the Apache Way.
We believe that more excellent open source contributors will join the community, and that the community will grow further with the help of all contributors.</p><p>Apache Doris will take on more challenging and meaningful work in the future, including a new query optimizer, lakehouse integration, and architecture evolution for cloud infrastructure. More open source technology enthusiasts are welcome to join the Apache Doris community and grow together with us.</p><p>Once again, we sincerely thank all contributors who participated in building the Apache Doris community and all users who use Apache Doris and keep offering suggestions for improvement. We also thank our incubator mentors, IPMC members, and friends in the various open source project communities who have continuously encouraged, supported, and helped us along the way.</p><p><strong>Apache Doris GitHub:</strong></p><p><a href="https://github.com/apache/doris" target="_blank" rel="noopener noreferrer">https://github.com/apache/doris</a></p><p><strong>Apache Doris website:</strong></p><p><a href="http://doris.apache.org" target="_blank" rel="noopener noreferrer">http://doris.apache.org</a></p><p><strong>Please contact us via:</strong></p><p><a href="mailto:dev@doris.apache.org">dev@doris.apache.org</a></p><p><strong>See how to subscribe:</strong></p><p><a href="https://doris.apache.org/community/subscribe-mail-list/" target="_blank" rel="noopener noreferrer">https://doris.apache.org/community/subscribe-mail-list</a></p>]]></content>
<author>
<name>morningman</name>
</author>
<category label="Top News" term="Top News"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris(Incubating) announced 1.0.0 release]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-1.0.0</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-1.0.0"/>
<updated>2022-04-18T00:00:00.000Z</updated>
<summary type="html"><![CDATA[Dear community friends, we are happy to announce that Apache Doris (incubating) has officially released the 1.0 Release version on April 18, 2022!]]></summary>
<content type="html"><![CDATA[<h1>Apache Doris(Incubating) 1.0.0 Release</h1><p>Dear community friends, after several months, we are happy to announce that Apache Doris (incubating) has officially released the 1.0 Release version on April 18, 2022! <strong>This is the first 1-bit version of Apache Doris since it was incubated by the Apache Foundation, and it is also the version with the largest refactoring of the core code of Apache Doris so far**</strong>! <strong>With </strong>114 Contributors<strong> committing </strong>over 660 optimizations and fixes** for Apache Doris, thank you to everyone who makes Apache Doris even better!</p><p>In version 1.0, we introduced important functions such as vectorized execution engine, Hive external table, Lateral View syntax and Table Function table function, Z-Order data index, Apache SeaTunnel plug-in, etc., and added support for synchronous update and deletion of data in Flink CDC. Support, optimize many problems in the process of data import and query, and comprehensively enhance the query performance, ease of use, stability and other special effects of Apache Doris. Welcome to download and use! Click "<strong>Read the original text</strong>" at the end of the article to go directly to the download address.</p><p>Every day that has not been published, there are countless contributors behind it, who dare not stop for half a minute. Here we would like to especially thank the small partners from SIG (Special Interest Group) such as <strong>vectorized execution engine, query optimizer, and visual operation and maintenance platform</strong>. Since the establishment of the Apache Doris Community SIG group in August 2021, data from more than ten companies including Baidu, Meituan, Xiaomi, JD, Shuhai, ByteDance, Tencent, NetEase, Alibaba, PingCAP, Nebula Graph, etc. Ten contributors<strong> joined the SIG as the first members, and for the first time completed the development of such major functions as the vectorized execution engine, query optimizer, and Doris Manager visual monitoring operation and maintenance platform in the form of open source collaboration of special groups. </strong>During more than half a year, conducting technical research and sharing dozens of times, holding nearly 100 remote meetings, accumulatively submitting hundreds of Commits, involving more than 100,000 lines of code**, it is precisely because of their contributions , only the 1.0 version came out, let us once again express our most sincere thanks for their hard work!</p><p>At the same time, the number of Apache Doris contributors has exceeded 300, the number of monthly active contributors has exceeded 60, and the average weekly number of Commits submitted in recent weeks has also exceeded 80. The scale and activity of developers gathered by the community There has been a huge improvement. We are very much looking forward to having more small partners participate in the community contribution, and work with us to build Apache Doris into the world's top analytical database. We also hope that all small partners can reap valuable growth with us. 
If you would like to participate in the community, please contact us via the developer mailing list <a href="mailto:dev@doris.apache.org" target="_blank" rel="noopener noreferrer">dev@doris.apache.org</a>.</p><p>If you have any questions during use, you are welcome to contact us through GitHub Discussions or the dev mailing list, and we look forward to your participation in community discussion and construction.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="important-update">Important update<a href="#important-update" class="hash-link" aria-label="Important update的直接链接" title="Important update的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="vectorized-execution-engine-experimental">Vectorized Execution Engine <!-- -->[Experimental]<a href="#vectorized-execution-engine-experimental" class="hash-link" aria-label="vectorized-execution-engine-experimental的直接链接" title="vectorized-execution-engine-experimental的直接链接"></a></h3><p>In the past, the SQL execution engine of Apache Doris was designed around a row-based memory format and the traditional volcano model. This incurred unnecessary overhead when executing SQL operators and functions, limited the efficiency of the execution engine, and did not fit the architecture of modern CPUs. The goal of the vectorized execution engine is to replace the row-based SQL execution engine, fully release the computing power of modern CPUs, and break through the performance limits of the SQL execution engine.</p><p>Based on the characteristics of modern CPUs and of volcano-model execution, the vectorized execution engine redesigns the SQL execution engine for the columnar storage system:</p><ul><li>Reorganized the in-memory data structures, replacing Tuple with Column, which improves cache affinity, branch prediction, and prefetch friendliness during computation.</li><li>Type checks are performed per batch rather than per row, so the virtual-function cost of type checking is amortized across the whole batch.</li><li>With batch-level type checks, virtual function calls are eliminated, giving the compiler opportunities for function inlining and SIMD optimization.</li></ul><p>This greatly improves CPU efficiency when executing SQL and thus query performance.</p><p>In Apache Doris version 1.0, enabling the vectorized execution engine with <code>set batch_size = 4096</code> and <code>set enable_vectorized_engine = true</code> can significantly improve query performance in most cases.</p>
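<p>A minimal sketch of enabling it for the current session, using exactly the two variables mentioned above:</p><pre><code class="language-sql">-- Turn on the vectorized execution engine (experimental in 1.0).
SET enable_vectorized_engine = true;
SET batch_size = 4096;
</code></pre>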
<p>On the SSB and OnTime standard test datasets, overall performance improved by 3x in multi-table join scenarios and by 2.6x in wide-table query scenarios, respectively.</p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/1.0.0-1-e7888e124fefa8bd38215dd9d4be4794.png" width="1080" height="697" class="img_ev3q"></p><p><img loading="lazy" src="https://cdnd.selectdb.com/zh-CN/assets/images/1.0.0-2-d9e8be01f5ff99dd6e15fc33af4518fc.png" width="1080" height="819" class="img_ev3q"></p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="lateral-view-grammar-experimental">Lateral View Grammar <!-- -->[Experimental]<a href="#lateral-view-grammar-experimental" class="hash-link" aria-label="lateral-view-grammar-experimental的直接链接" title="lateral-view-grammar-experimental的直接链接"></a></h3><p>With the Lateral View syntax, table functions such as explode_bitmap, explode_split, and explode_json_array can expand a bitmap, String, or JSON array from one column into multiple rows, so that the expanded data can be processed further (filtered, joined, and so on).</p>
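<p>For example, <code>explode_split</code> turns a delimiter-separated string column into one row per element. A minimal sketch with placeholder table and column names:</p><pre><code class="language-sql">-- v1 holds values like 'a,b,c'; after expansion each element of the
-- split string becomes its own row in column e1.
SELECT k1, e1
FROM example_tbl
LATERAL VIEW explode_split(v1, ',') tmp1 AS e1;
</code></pre>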
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="hive-external-table-experimental">Hive External Table <!-- -->[Experimental]<a href="#hive-external-table-experimental" class="hash-link" aria-label="hive-external-table-experimental的直接链接" title="hive-external-table-experimental的直接链接"></a></h3><p>Hive external tables give users the ability to access Hive tables directly from Doris. External tables avoid tedious data import work and let users apply Doris's own OLAP capabilities to data analysis problems on Hive tables. The current version supports connecting Hive data sources to Doris and running federated queries across data in Doris and Hive for more complex analysis.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-z-order-data-sorting-format">Support Z-Order data sorting format<a href="#support-z-order-data-sorting-format" class="hash-link" aria-label="Support Z-Order data sorting format的直接链接" title="Support Z-Order data sorting format的直接链接"></a></h3><p>Apache Doris stores data sorted by the prefix columns, so a query whose conditions include the prefix columns can locate data quickly in the sorted data; if the query conditions are on non-prefix columns, the sort order cannot be used to speed up the search. Z-Order indexing solves this problem: in version 1.0 we added the Z-Order data sorting format, which filters effectively in dashboard-style multi-column queries and accelerates filtering on non-prefix-column conditions.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="support-for-apache-seatunnel-incubating-plugin">Support for Apache SeaTunnel (Incubating) plugin<a href="#support-for-apache-seatunnel-incubating-plugin" class="hash-link" aria-label="Support for Apache SeaTunnel (Incubating) plugin的直接链接" title="Support for Apache SeaTunnel (Incubating) plugin的直接链接"></a></h3><p>Apache SeaTunnel is a high-performance distributed data integration framework built on Apache Spark and Apache Flink. In version 1.0 of Apache Doris we added the SeaTunnel plugin, so users can use Apache SeaTunnel for synchronization and ETL across multiple data sources.</p><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-function">New Function<a href="#new-function" class="hash-link" aria-label="New Function的直接链接" title="New Function的直接链接"></a></h3><p>More bitmap functions are supported; see the function manual for details:</p><ul><li>bitmap_max</li><li>bitmap_and_not</li><li>bitmap_and_not_count</li><li>bitmap_has_all</li><li>bitmap_and_count</li><li>bitmap_or_count</li><li>bitmap_xor_count</li><li>bitmap_subset_limit</li><li>sub_bitmap</li></ul><p>Support for the Chinese national cryptographic algorithms SM3/SM4.</p><blockquote><p><strong>Note</strong>: The functions marked <!-- -->[Experimental]<!-- --> above are experimental. We will continue to optimize and improve them in subsequent versions. If you have any questions or comments during use, please feel free to contact us.</p></blockquote><h3 class="anchor anchorWithStickyNavbar_LWe7" id="important-optimization">Important Optimization<a href="#important-optimization" class="hash-link" aria-label="Important Optimization的直接链接" title="Important Optimization的直接链接"></a></h3><h3 class="anchor anchorWithStickyNavbar_LWe7" id="features-optimization">Features Optimization<a href="#features-optimization" class="hash-link" aria-label="Features Optimization的直接链接" title="Features Optimization的直接链接"></a></h3><ul><li>Reduced the number of segment files generated by large-batch imports to reduce compaction pressure.</li><li>Data is transferred through BRPC's attachment function to reduce serialization and deserialization overhead during queries.</li><li>Support returning binary HLL/BITMAP data directly for external business analysis.</li><li>Reduced the probability of OVERCROWDED and NOT_CONNECTED errors in BRPC, enhancing system stability.</li><li>Enhanced the fault tolerance of data import.</li><li>Support synchronous update and deletion of data through Flink CDC.</li><li>Support adaptive Runtime Filter.</li><li>Significantly reduced the memory footprint of INSERT INTO operations.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="usability-improvements">Usability Improvements<a href="#usability-improvements" class="hash-link" aria-label="Usability Improvements的直接链接" title="Usability Improvements的直接链接"></a></h3><ul><li>Routine Load can display the current offset lag and other status information.</li><li>Added peak query memory usage statistics to the FE audit log.</li><li>Added missing version information to Compaction URL results to facilitate troubleshooting.</li><li>Support marking a BE as non-queryable or non-loadable to quickly screen out problem nodes.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="important-bug-fixes">Important Bug Fixes<a href="#important-bug-fixes" class="hash-link" aria-label="Important Bug Fixes的直接链接" title="Important Bug Fixes的直接链接"></a></h3><ul><li>Fixed several query errors.</li><li>Fixed some scheduling logic issues in Broker Load.</li><li>Fixed a problem where metadata could not be loaded due to the STREAM keyword.</li><li>Fixed Decommission not executing correctly.</li><li>Fixed a problem where a -102 error could occur during Schema Change in some cases.</li><li>Fixed a problem where using the String type could crash the BE in some cases.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7"
id="other">Other<a href="#other" class="hash-link" aria-label="Other的直接链接" title="Other的直接链接"></a></h3><ul><li>Added Minidump function; easy to locate when problems occur</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="changelog">Changelog<a href="#changelog" class="hash-link" aria-label="Changelog的直接链接" title="Changelog的直接链接"></a></h2><p>For detailed Release Note, please check the link:</p><p><a href="https://github.com/apache/incubator-doris/issues/8549" target="_blank" rel="noopener noreferrer">https://github.com/apache/incubator-doris/issues/8549</a></p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="thanks">Thanks<a href="#thanks" class="hash-link" aria-label="Thanks的直接链接" title="Thanks的直接链接"></a></h2><p>The release of Apache Doris(incubating) 1.0 Release version is inseparable from the support of all community users. I would like to express my gratitude to all community contributors who participated in version design, development, testing and discussion. They are:</p><div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#F8F8F2;--prism-background-color:#282A36"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#F8F8F2"><span class="token plain">@924060929</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@adonis0147</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Aiden-Dong</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@aihai</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@airborne12</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Alibaba-HZY</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@amosbird</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@arthuryangcs</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@awakeljw</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@bingzxy</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@BiteTheDDDDt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@blackstar-baba</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@caiconghui</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@CalvinKirs</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@cambyzju</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@caoliang-web</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ccoffline</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@chaplinthink</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@chovy-3012</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ChPi</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@DarvenDuan</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dataalive</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">@dataroaring</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dh-cloud</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dohongdayi</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@dongweizhao</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@drgnchan</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@e0c9</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@EmmyMiao87</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@englefly</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@eyesmoons</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@freemandealer</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Gabriel39</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@gaodayue</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@GoGoWen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Gongruixiao</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@gwdgithubnom</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@HappenLee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Henry2SS</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@hf200012</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@htyoung</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jacktengg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@jackwener</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@JNSimba</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Keysluomo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@kezhenxu94</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@killxdcj</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lihuigang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@littleeleventhwolf</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liutang123</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@liuzhuang2017</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lonre</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@lovingfeel</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luozenglin</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@luzhijing</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@MeiontheTop</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@mh-boy</span><br></span><span class="token-line" 
style="color:#F8F8F2"><span class="token plain">@morningman</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@mrhhsg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Myasuka</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@nimuyuhan</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@obobj</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@pengxiangyu</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@qidaye</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@qzsee</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@renzhimin7</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Royce33</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@SleepyBear96</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@smallhibiscus</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@sodamnsure</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@spaces-X</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@sparklezzz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@stalary</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@steadyBoy</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@tarepanda1024</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@THUMarkLau</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@tianhui5</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@tinkerrrr</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ucasfl</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@Userwhite</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@vinson0526</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangbo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangshuo128</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wangyf0555</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@weajun</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@weizuo93</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@whutpencil</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@WindyGao</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@wunan1210</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xiaokang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xiaokangguo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xiedeyantu</span><br></span><span class="token-line" style="color:#F8F8F2"><span 
class="token plain">@xinghuayu007</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xingtanzjr</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xinyiZzz</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xtr1993</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xu20160924</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xuliuzhe</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xuzifu666</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@xy720</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yangzhg</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yiguolei</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yinzhijian</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@yjant</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zbtzbtzbt</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zenoyang</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zh0122</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhangstar333</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhannngchen</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhengshengjun</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zhengshiJ</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ZhikaiZuo</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@ztgoto</span><br></span><span class="token-line" style="color:#F8F8F2"><span class="token plain">@zuochunwei</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="复制代码到剪贴板" title="复制" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
<entry>
<title type="html"><![CDATA[Apache Doris(Incubating) annoucned 0.15.0 release]]></title>
<id>https://doris.apache.org/zh-CN/blog/release-note-0.15.0</id>
<link href="https://doris.apache.org/zh-CN/blog/release-note-0.15.0"/>
<updated>2021-11-29T00:00:00.000Z</updated>
<summary type="html"><![CDATA[“Dear Community]]></summary>
<content type="html"><![CDATA[<h1>Apache Doris(Incubating) 0.15.0 Release</h1><p>Dear Community, After months of polishing, we are pleased to announce the release of Apache Doris(Incubating) on November 29, 2021! Nearly 700 optimizations and fixes have been submitted by 99 contributors to Apache Doris, and we'd like to express our sincere gratitude to all of them!</p><p>In the 0.15.0 Release, we have added many new features to optimize Apache Doris's query performance, ease of use, and stability: a new resource division and isolation feature that allows users to divide BE nodes in a cluster into resource groups by means of resource tags, enabling unified management of online and offline services and resource isolation; the addition of Runtime Filter and Join Reorder functions have been added to significantly improve the query efficiency of multi-table Join scenarios, with a 2-10 times performance improvement under the Star Schema Benchmark test data set; new import method Binlog Load enables Doris to incrementally synchronize the CDC of data update operations in MySQL; support for String column type The new import method, Binlog Load, allows Doris to incrementally synchronize the CDC of MySQL for data update operations; supports String column type with a maximum length of 2GB; supports List partitioning to create partitions by enumerating values; supports Update statements on the Unique Key model; Spark-Doris-Connector supports data writing to Doris ... ...and many more important features, welcome to download and use.</p><p>We welcome you to contact us via GitHub Discussion or the Dev email group if you have any questions during use, and we look forward to your participation in community discussions and building.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="high-lights">High Lights<a href="#high-lights" class="hash-link" aria-label="High Lights的直接链接" title="High Lights的直接链接"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="resource-segregation-and-isolation">Resource Segregation and Isolation<a href="#resource-segregation-and-isolation" class="hash-link" aria-label="Resource Segregation and Isolation的直接链接" title="Resource Segregation and Isolation的直接链接"></a></h3><p>You can divide BE nodes in a Doris cluster into resource groups by using resource tags, allowing you to manage online and offline operations and isolate resources at the node level.
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="performance-optimization">Performance Optimization<a href="#performance-optimization" class="hash-link" aria-label="Direct link to Performance Optimization" title="Direct link to Performance Optimization"></a></h3><ul><li><p>The Runtime Filter feature can significantly improve query efficiency in most join scenarios: it uses the join key conditions of the right table to filter the data scanned from the left table at runtime. For example, it brings a 2-10x performance improvement on the Star Schema Benchmark (a streamlined test set derived from TPC-H).</p></li><li><p>The Join Reorder feature automatically adjusts the order of joins in SQL using a cost model to achieve optimal join efficiency. It can be enabled via the session variable <code>set enable_cost_based_join_reorder=true</code> (see the sketch below).</p></li></ul>
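<p>A minimal sketch of turning both optimizations on for the current session. <code>enable_cost_based_join_reorder</code> is quoted above; <code>runtime_filter_mode</code> is assumed from the Doris session variables and may differ across versions:</p><pre><code class="language-sql">-- Generate runtime filters and apply them across the whole plan
SET runtime_filter_mode = 'GLOBAL';

-- Enable cost-based join reordering
SET enable_cost_based_join_reorder = true;
</code></pre>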
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-features">New features<a href="#new-features" class="hash-link" aria-label="Direct link to New features" title="Direct link to New features"></a></h3><ul><li>Support synchronizing MySQL binlog data via Canal Server (Binlog Load).</li><li>Support the String column type, with a maximum length of 2GB.</li><li>Support List partitioning: you can create partitions for enumerated values (see the sketch after this list).</li><li>Support transactional Insert statements: you can import data in bulk via <code>begin; insert ...; insert ...; commit;</code>.</li><li>Support the Update statement on the Unique Key model: you can execute <code>UPDATE ... SET ... WHERE ...</code> statements on Unique Key model tables.</li><li>Support a SQL blocklist: you can block the execution of certain SQL statements by regular expression or hash value matching.</li><li>Support LDAP login authentication.</li></ul>
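<p>A minimal sketch pulling three of these features together: List partitioning, a transactional bulk insert, and an Update on the Unique Key model. The table, column, and partition names are made up for illustration, and the single-replica property is only for a small test environment:</p><pre><code class="language-sql">-- List-partitioned table on the Unique Key model (hypothetical schema)
CREATE TABLE orders (
    region   VARCHAR(16) NOT NULL,
    order_id BIGINT NOT NULL,
    amount   DECIMAL(10, 2)
)
UNIQUE KEY(region, order_id)
PARTITION BY LIST(region) (
    PARTITION p_east VALUES IN ("bj", "sh"),
    PARTITION p_west VALUES IN ("cd", "xa")
)
DISTRIBUTED BY HASH(order_id) BUCKETS 8
PROPERTIES ("replication_num" = "1");

-- Transactional bulk insert: begin; insert ...; commit;
BEGIN;
INSERT INTO orders VALUES ("bj", 1001, 50.00);
INSERT INTO orders VALUES ("sh", 1002, 75.50);
COMMIT;

-- Update on the Unique Key model
UPDATE orders SET amount = 99.90 WHERE region = "bj" AND order_id = 1001;
</code></pre>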
<h3 class="anchor anchorWithStickyNavbar_LWe7" id="extended-features">Extended Features<a href="#extended-features" class="hash-link" aria-label="Direct link to Extended Features" title="Direct link to Extended Features"></a></h3><ul><li>Support the Flink-Doris-Connector.</li><li>Support the DataX doriswriter plugin.</li><li>The Spark-Doris-Connector supports writing data to Doris.</li></ul><h2 class="anchor anchorWithStickyNavbar_LWe7" id="feature-optimization">Feature Optimization<a href="#feature-optimization" class="hash-link" aria-label="Direct link to Feature Optimization" title="Direct link to Feature Optimization"></a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="query">Query<a href="#query" class="hash-link" aria-label="Direct link to Query" title="Direct link to Query"></a></h3><ul><li>Support computing all constant expressions in the SQL query planning phase using the BE's function computation capability.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="import">Import<a href="#import" class="hash-link" aria-label="Direct link to Import" title="Direct link to Import"></a></h3><ul><li>Support specifying multi-byte or invisible row separators when importing text format files.</li><li>Support importing compressed files via Stream Load.</li><li>Stream Load supports importing JSON data in multi-line format.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="export">Export<a href="#export" class="hash-link" aria-label="Direct link to Export" title="Direct link to Export"></a></h3><ul><li>The Export function supports specifying a where filter, exporting files with multi-byte row separators, and exporting to local files.</li><li>The Export function supports exporting only specified columns.</li><li>Support exporting a result set to local disk via the outfile statement, and writing a marker file after the export completes (see the sketch at the end of the feature lists).</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="ease-of-use">Ease of use<a href="#ease-of-use" class="hash-link" aria-label="Direct link to Ease of use" title="Direct link to Ease of use"></a></h3><ul><li>The dynamic partitioning function supports creating and keeping a specified number of historical partitions, and supports automatic hot/cold data migration settings.</li><li>Support displaying queries, import schedules, and Profiles as a visual tree structure at the command line.</li><li>Support recording and viewing Stream Load operation logs.</li><li>When consuming Kafka data via Routine Load, you can specify the point in time to start consumption from.</li><li>Support exporting Routine Load creation statements via the show create routine load function.</li><li>Support starting and stopping all Routine Load jobs at once via the pause/resume all routine load commands.</li><li>Support modifying the Broker List and Topic of a Routine Load job via the alter routine load statement.</li><li>Support the create table as select function.</li><li>Support modifying column comments and table comments via the alter table command.</li><li>show tablet status now includes the table creation time and the data update time.</li><li>Support the show data skew command to check the data volume distribution of a table and troubleshoot data skew problems.</li><li>Support the show/clean trash commands to check the disk usage of the BE file recycle bin and clear it proactively.</li><li>Support the show view statement to show which views reference a table.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="new-functions">New functions<a href="#new-functions" class="hash-link" aria-label="Direct link to New functions" title="Direct link to New functions"></a></h3><ul><li><code>bitmap_min</code>, <code>bit_length</code></li><li><code>yearweek</code>, <code>week</code>, <code>makedate</code></li><li><code>percentile</code>, an exact percentile function</li><li><code>json_array</code>, <code>json_object</code>, <code>json_quote</code></li><li>Support specifying custom keys for the <code>AES_ENCRYPT</code> and <code>AES_DECRYPT</code> functions.</li><li>Support creating function aliases that combine multiple functions via <code>create alias function</code>.</li></ul><h3 class="anchor anchorWithStickyNavbar_LWe7" id="other">Other<a href="#other" class="hash-link" aria-label="Direct link to Other" title="Direct link to Other"></a></h3><ul><li>Support accessing Elasticsearch external tables over the SSL connection protocol.</li><li>Support specifying the number of hot partitions in the dynamic partition property; hot partitions will be stored on SSD disks.</li><li>Support importing JSON format data via Broker Load.</li><li>Support accessing HDFS directly through the libhdfs3 library for data import and export, without the Broker process.</li><li>The select into outfile function supports exporting in Parquet format and parallel export.</li><li>ODBC external tables support SQL Server.</li></ul>
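<p>As an illustration of the outfile export features above, here is a minimal sketch. It reuses the hypothetical <code>orders</code> table from the earlier sketch; exporting to a local path may additionally need to be enabled in the FE configuration, and the supported properties vary by version:</p><pre><code class="language-sql">-- Export a query result set to local disk as CSV and
-- write a success marker file once the export finishes
SELECT region, SUM(amount) AS total_amount
FROM orders
GROUP BY region
INTO OUTFILE "file:///tmp/orders_export_"
FORMAT AS CSV
PROPERTIES (
    "column_separator" = ",",
    "success_file_name" = "SUCCESS"
);
</code></pre>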
<h2 class="anchor anchorWithStickyNavbar_LWe7" id="致谢">Acknowledgements<a href="#致谢" class="hash-link" aria-label="Direct link to Acknowledgements" title="Direct link to Acknowledgements"></a></h2><p>The release of Apache Doris (Incubating) 0.15.0 is made possible by the support of all community users. We would like to thank all the community contributors who participated in the design, development, testing, and discussion of the release, namely:</p><ul><li><a href="https://github.com/924060929" target="_blank" rel="noopener noreferrer">@924060929</a></li><li><a href="https://github.com/acelyc111" target="_blank" rel="noopener noreferrer">@acelyc111</a></li><li><a href="https://github.com/Aimiyoo" target="_blank" rel="noopener noreferrer">@Aimiyoo</a></li><li><a href="https://github.com/amosbird" target="_blank" rel="noopener noreferrer">@amosbird</a></li><li><a href="https://github.com/arthur-zhang" target="_blank" rel="noopener noreferrer">@arthur-zhang</a></li><li><a href="https://github.com/azurenake" target="_blank" rel="noopener noreferrer">@azurenake</a></li><li><a href="https://github.com/BiteTheDDDDt" target="_blank" rel="noopener noreferrer">@BiteTheDDDDt</a></li><li><a href="https://github.com/caiconghui" target="_blank" rel="noopener noreferrer">@caiconghui</a></li><li><a href="https://github.com/caneGuy" target="_blank" rel="noopener noreferrer">@caneGuy</a></li><li><a href="https://github.com/caoliang-web" target="_blank" rel="noopener noreferrer">@caoliang-web</a></li><li><a href="https://github.com/ccoffline" target="_blank" rel="noopener noreferrer">@ccoffline</a></li><li><a href="https://github.com/chaplinthink" target="_blank" rel="noopener noreferrer">@chaplinthink</a></li><li><a href="https://github.com/chovy-3012" target="_blank" rel="noopener noreferrer">@chovy-3012</a></li><li><a href="https://github.com/ChPi" target="_blank" rel="noopener noreferrer">@ChPi</a></li><li><a href="https://github.com/copperybean" target="_blank" rel="noopener noreferrer">@copperybean</a></li><li><a href="https://github.com/crazyleeyang" target="_blank" rel="noopener noreferrer">@crazyleeyang</a></li><li><a href="https://github.com/dh-cloud" target="_blank" rel="noopener noreferrer">@dh-cloud</a></li><li><a href="https://github.com/DinoZhang" target="_blank" rel="noopener noreferrer">@DinoZhang</a></li><li><a href="https://github.com/dixingxing0" target="_blank" rel="noopener noreferrer">@dixingxing0</a></li><li><a href="https://github.com/dohongdayi" target="_blank" rel="noopener noreferrer">@dohongdayi</a></li><li><a href="https://github.com/e0c9" target="_blank" rel="noopener noreferrer">@e0c9</a></li><li><a href="https://github.com/EmmyMiao87" target="_blank" rel="noopener noreferrer">@EmmyMiao87</a></li><li><a href="https://github.com/eyesmoons" target="_blank" rel="noopener noreferrer">@eyesmoons</a></li><li><a href="https://github.com/francisoliverlee" target="_blank" rel="noopener noreferrer">@francisoliverlee</a></li><li><a href="https://github.com/Gabriel39" target="_blank" rel="noopener noreferrer">@Gabriel39</a></li><li><a href="https://github.com/gaodayue" target="_blank" rel="noopener noreferrer">@gaodayue</a></li><li><a href="https://github.com/GoGoWen" target="_blank" rel="noopener noreferrer">@GoGoWen</a></li><li><a href="https://github.com/HappenLee" target="_blank" rel="noopener noreferrer">@HappenLee</a></li><li><a href="https://github.com/harveyyue" target="_blank" rel="noopener noreferrer">@harveyyue</a></li><li><a href="https://github.com/Henry2SS" target="_blank" rel="noopener noreferrer">@Henry2SS</a></li><li><a href="https://github.com/hf200012" target="_blank" rel="noopener noreferrer">@hf200012</a></li><li><a href="https://github.com/huangmengbin" target="_blank" rel="noopener noreferrer">@huangmengbin</a></li><li><a href="https://github.com/huozhanfeng"
target="_blank" rel="noopener noreferrer">@huozhanfeng</a></li><li><a href="https://github.com/huzk8" target="_blank" rel="noopener noreferrer">@huzk8</a></li><li><a href="https://github.com/hxianshun" target="_blank" rel="noopener noreferrer">@hxianshun</a></li><li><a href="https://github.com/ikaruga4600" target="_blank" rel="noopener noreferrer">@ikaruga4600</a></li><li><a href="https://github.com/JameyWoo" target="_blank" rel="noopener noreferrer">@JameyWoo</a></li><li><a href="https://github.com/Jennifer88huang" target="_blank" rel="noopener noreferrer">@Jennifer88huang</a></li><li><a href="https://github.com/JinLiOnline" target="_blank" rel="noopener noreferrer">@JinLiOnline</a></li><li><a href="https://github.com/jinyuanlu" target="_blank" rel="noopener noreferrer">@jinyuanlu</a></li><li><a href="https://github.com/JNSimba" target="_blank" rel="noopener noreferrer">@JNSimba</a></li><li><a href="https://github.com/killxdcj" target="_blank" rel="noopener noreferrer">@killxdcj</a></li><li><a href="https://github.com/kuncle" target="_blank" rel="noopener noreferrer">@kuncle</a></li><li><a href="https://github.com/liutang123" target="_blank" rel="noopener noreferrer">@liutang123</a></li><li><a href="https://github.com/luozenglin" target="_blank" rel="noopener noreferrer">@luozenglin</a></li><li><a href="https://github.com/luzhijing" target="_blank" rel="noopener noreferrer">@luzhijing</a></li><li><a href="https://github.com/MarsXDM" target="_blank" rel="noopener noreferrer">@MarsXDM</a></li><li><a href="https://github.com/mh-boy" target="_blank" rel="noopener noreferrer">@mh-boy</a></li><li><a href="https://github.com/mk8310" target="_blank" rel="noopener noreferrer">@mk8310</a></li><li><a href="https://github.com/morningman" target="_blank" rel="noopener noreferrer">@morningman</a></li><li><a href="https://github.com/Myasuka" target="_blank" rel="noopener noreferrer">@Myasuka</a></li><li><a href="https://github.com/nimuyuhan" target="_blank" rel="noopener noreferrer">@nimuyuhan</a></li><li><a href="https://github.com/pan3793" target="_blank" rel="noopener noreferrer">@pan3793</a></li><li><a href="https://github.com/PatrickNicholas" target="_blank" rel="noopener noreferrer">@PatrickNicholas</a></li><li><a href="https://github.com/pengxiangyu" target="_blank" rel="noopener noreferrer">@pengxiangyu</a></li><li><a href="https://github.com/pierre94" target="_blank" rel="noopener noreferrer">@pierre94</a></li><li><a href="https://github.com/qidaye" target="_blank" rel="noopener noreferrer">@qidaye</a></li><li><a href="https://github.com/qzsee" target="_blank" rel="noopener noreferrer">@qzsee</a></li><li><a href="https://github.com/shiyi23" target="_blank" rel="noopener noreferrer">@shiyi23</a></li><li><a href="https://github.com/smallhibiscus" target="_blank" rel="noopener noreferrer">@smallhibiscus</a></li><li><a href="https://github.com/songenjie" target="_blank" rel="noopener noreferrer">@songenjie</a></li><li><a href="https://github.com/spaces-X" target="_blank" rel="noopener noreferrer">@spaces-X</a></li><li><a href="https://github.com/stalary" target="_blank" rel="noopener noreferrer">@stalary</a></li><li><a href="https://github.com/stdpain" target="_blank" rel="noopener noreferrer">@stdpain</a></li><li><a href="https://github.com/Stephen-Robin" target="_blank" rel="noopener noreferrer">@Stephen-Robin</a></li><li><a href="https://github.com/Sunt-ing" target="_blank" rel="noopener noreferrer">@Sunt-ing</a></li><li><a href="https://github.com/Taaang" target="_blank" rel="noopener 
noreferrer">@Taaang</a></li><li><a href="https://github.com/tarepanda1024" target="_blank" rel="noopener noreferrer">@tarepanda1024</a></li><li><a href="https://github.com/tianhui5" target="_blank" rel="noopener noreferrer">@tianhui5</a></li><li><a href="https://github.com/tinkerrrr" target="_blank" rel="noopener noreferrer">@tinkerrrr</a></li><li><a href="https://github.com/TobKed" target="_blank" rel="noopener noreferrer">@TobKed</a></li><li><a href="https://github.com/ucasfl" target="_blank" rel="noopener noreferrer">@ucasfl</a></li><li><a href="https://github.com/Userwhite" target="_blank" rel="noopener noreferrer">@Userwhite</a></li><li><a href="https://github.com/vinson0526" target="_blank" rel="noopener noreferrer">@vinson0526</a></li><li><a href="https://github.com/wangbo" target="_blank" rel="noopener noreferrer">@wangbo</a></li><li><a href="https://github.com/wangliansong" target="_blank" rel="noopener noreferrer">@wangliansong</a></li><li><a href="https://github.com/wangshuo128" target="_blank" rel="noopener noreferrer">@wangshuo128</a></li><li><a href="https://github.com/weajun" target="_blank" rel="noopener noreferrer">@weajun</a></li><li><a href="https://github.com/weihongkai2008" target="_blank" rel="noopener noreferrer">@weihongkai2008</a></li><li><a href="https://github.com/weizuo93" target="_blank" rel="noopener noreferrer">@weizuo93</a></li><li><a href="https://github.com/WindyGao" target="_blank" rel="noopener noreferrer">@WindyGao</a></li><li><a href="https://github.com/wunan1210" target="_blank" rel="noopener noreferrer">@wunan1210</a></li><li><a href="https://github.com/wuyunfeng" target="_blank" rel="noopener noreferrer">@wuyunfeng</a></li><li><a href="https://github.com/xhmz" target="_blank" rel="noopener noreferrer">@xhmz</a></li><li><a href="https://github.com/xiaokangguo" target="_blank" rel="noopener noreferrer">@xiaokangguo</a></li><li><a href="https://github.com/xiaoxiaopan118" target="_blank" rel="noopener noreferrer">@xiaoxiaopan118</a></li><li><a href="https://github.com/xinghuayu007" target="_blank" rel="noopener noreferrer">@xinghuayu007</a></li><li><a href="https://github.com/xinyiZzz" target="_blank" rel="noopener noreferrer">@xinyiZzz</a></li><li><a href="https://github.com/xuliuzhe" target="_blank" rel="noopener noreferrer">@xuliuzhe</a></li><li><a href="https://github.com/xxiao2018" target="_blank" rel="noopener noreferrer">@xxiao2018</a></li><li><a href="https://github.com/xy720" target="_blank" rel="noopener noreferrer">@xy720</a></li><li><a href="https://github.com/yangzhg" target="_blank" rel="noopener noreferrer">@yangzhg</a></li><li><a href="https://github.com/yx91490" target="_blank" rel="noopener noreferrer">@yx91490</a></li><li><a href="https://github.com/zbtzbtzbt" target="_blank" rel="noopener noreferrer">@zbtzbtzbt</a></li><li><a href="https://github.com/zenoyang" target="_blank" rel="noopener noreferrer">@zenoyang</a></li><li><a href="https://github.com/zh0122" target="_blank" rel="noopener noreferrer">@zh0122</a></li><li><a href="https://github.com/zhangboya1" target="_blank" rel="noopener noreferrer">@zhangboya1</a></li><li><a href="https://github.com/zhangstar333" target="_blank" rel="noopener noreferrer">@zhangstar333</a></li><li><a href="https://github.com/zuochunwei" target="_blank" rel="noopener noreferrer">@zuochunwei</a></li></ul>]]></content>
<author>
<name>Apache Doris</name>
</author>
<category label="Release Notes" term="Release Notes"/>
</entry>
</feed>