{ ‘title’: ‘Building a log analytics solution 10 times more cost-effective than Elasticsearch’, ‘description’: “Apache Doris has introduced inverted indexes in version 2.0.0 and further optimized it to realize two times faster log query performance than Elasticsearch with 1/5 of the storage space it uses.”, ‘date’: ‘2023-05-26’, ‘author’: ‘Apache Doris’, ‘tags’: [‘Tech Sharing’], “image”: ‘/images/building-a-log-analytics-solution-10-times-more-cost-effective-than-es.jpg’ }
Logs often take up the majority of a company's data assets. Examples of logs include business logs (such as user activity logs), and Operation & Maintenance logs of servers, databases, and network or IoT devices.
Logs are the guardian angel of business. On the one hand, they provide system risk alerts and help engineers in troubleshooting. On the other hand, if you zoom them out by time range, you might identify some helpful trends and patterns, not to mention that business logs are the cornerstone of user insights.
However, logs can be a handful, because:
Now you can see a clear picture of what an ideal log processing system is like. It should support:
There exist two common log processing solutions within the industry, exemplified by Elasticsearch and Grafana Loki, respectively.
A prominent strength of Elasticsearch in log processing is quick keyword search among a sea of logs. This is enabled by inverted indexes.
Inverted indexing was originally used to retrieve words or phrases in texts. The figure below illustrates how it works:
Upon data writing, the system tokenizes texts into terms, and stores these terms in a posting list which maps terms to the ID of the row where they exist. In text queries, the database finds the corresponding row ID of the keyword (term) in the posting list, and fetches the target row based on the row ID. By doing so, the system won't have to traverse the whole dataset and thus improves query speeds by orders of magnitudes.
In inverted indexing of Elasticsearch, quick retrieval comes at the cost of writing speed, writing throughput, and storage space. Why? Firstly, tokenization, dictionary sorting, and inverted index creation are all CPU- and memory-intensive operations. Secondly, Elasticssearch has to store the original data, the inverted index, and an extra copy of data stored in columns for query acceleration. That's triple redundancy.
But without inverted index, Grafana Loki, for example, is hurting user experience with its slow queries, which is the biggest pain point for engineers in log analysis.
Simply put, Elasticsearch and Grafana Loki represent different tradeoffs between high writing throughput, low storage cost, and fast query performance. What if I tell you there is a way to have them all? We have introduced inverted indexes in Apache Doris 2.0.0 and further optimized it to realize two times faster log query performance than Elasticsearch with 1/5 of the storage space it uses. Both factors combined, it is a 10 times better solution.
Generally, there are two ways to implement indexes: external indexing system or built-in indexes.
External indexing system: You connect an external indexing system to your database. In data ingestion, data is imported to both systems. After the indexing system creates indexes, it deletes the original data within itself. When data users input a query, the indexing system provides the IDs of the relevant data, and then the database looks up the target data based on the IDs.
Building an external indexing system is easier and less intrusive to the database, but it comes with some annoying flaws:
In Apache Doris, we opt for the other way. Built-in inverted indexes are more difficult to make, but once it is done, it is faster, more user-friendly, and trouble-free to maintain.
In Apache Doris, data is arranged in the following format. Indexes are stored in the Index Region:
We implement inverted indexes in a non-intrusive manner:
where clause includes a column with inverted index, the system will look up in the index file, return a DocID list, and convert the DocID list into a RowID Bitmap. Under the RowID filtering mechanism of Apache Doris, only the target rows will be read. This is how queries are accelerated.Such non-intrusive method separates the index file from the data files, so you can make any changes to the inverted indexes without worrying about affecting the data files themselves or other indexes.
C++ Implementation and Vectorization
Different from Elasticsearch, which uses Java, Apache Doris implements C++ in its storage modules, query execution engine, and inverted indexes. Compared to Java, C++ provides better performance, allows easier vectorization, and produces no JVM GC overheads. We have vectorized every step of inverted indexing in Apache Doris, such as tokenization, index creation, and queries. To provide you with a perspective, in inverted indexing, Apache Doris writes data at a speed of 20MB/s per core, which is four times that of Elasticsearch (5MB/s).
Columnar Storage & Compression
Apache Lucene lays the foundation for inverted indexes in Elasticsearch. As Lucene itself is built to support file storage, it stores data in a row-oriented format.
In Apache Doris, inverted indexes for different columns are isolated from each other, and the inverted index files adopt columnar storage to facilitate vectorization and data compression.
By utilizing Zstandard compression, Apache Doris realizes a compression ratio ranging from 5:1 to 10:1, faster compression speeds, and 50% less space usage than GZIP compression.
BKD Trees for Numeric / Datetime Columns
Apache Doris implements BKD trees for numeric and datetime columns. This not only increases performance of range queries, but is a more space-saving method than converting those columns to fixed-length strings. Other benefits of it include:
In addition to BKD trees, we have further optimized the queries on numeric and datetime columns.
Log analysis is a simple kind of query with no need for advanced features (e.g. relevance scoring in Apache Lucene). The bread and butter capability of a log processing tool is quick queries and low storage cost. Therefore, in Apache Doris, we have streamlined the inverted index structure to meet the needs of an OLAP database.
In light of the fact that logs are partitioned by time range and historical logs are visited less frequently. We plan to provide more granular and flexible index management in future versions of Apache Doris:
We tested Apache Doris on publicly available datasets against Elasticsearch and ClickHouse.
For a fair comparison, we ensure uniformity of testing conditions, including benchmarking tool, dataset, and hardware.
Benchmarking tool: ES Rally, the official testing tool for Elasticsearch
Dataset: 1998 World Cup HTTP Server Logs (self-contained dataset in ES Rally)
Data Size (Before Compression): 32G, 247 million rows, 134 bytes per row (on average)
Query: 11 queries including keyword search, range query, aggregation, and ranking; Each query is serially executed 100 times.
Environment: 3 × 16C 64G cloud virtual machines
Results of Apache Doris:
As ClickHouse launched inverted index as an experimental feature in v23.1, we tested Apache Doris with the same dataset and SQL as described in the ClickHouse blog, and compared performance of the two under the same testing resource, case, and tool.
Data: 6.7G, 28.73 million rows, the Hacker News dataset, Parquet format
Query: 3 keyword searches, counting the number of occurrence of the keywords “ClickHouse”, “OLAP” OR “OLTP”, and “avx” AND “sve”.
Environment: 1 × 16C 64G cloud virtual machine
Result: Apache Doris was 4.7 times, 12 times, 18.5 times faster than ClickHouse in the three queries, respectively.
Dataset: one million comment records from Hacker News
Step 1: Specify inverted index to the data table upon table creation.
Parameters:
comment): create an index named “idx_comment” comment for the “comment” columnCREATE TABLE hackernews_1m ( `id` BIGINT, `deleted` TINYINT, `type` String, `author` String, `timestamp` DateTimeV2, `comment` String, `dead` TINYINT, `parent` BIGINT, `poll` BIGINT, `children` Array<BIGINT>, `url` String, `score` INT, `title` String, `parts` Array<INT>, `descendants` INT, INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment' ) DUPLICATE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 10 PROPERTIES ("replication_num" = "1");
Note: You can add index to an existing table via ADD INDEX idx_comment ON hackernews_1m(comment) USING INVERTED PROPERTIES("parser" = "english") . Different from that of smart index and secondary index, the creation of inverted index only involves the reading of the comment column, so it can be much faster.
Step 2: Retrieve the words“OLAP” and “OLTP” in the comment column with MATCH_ALL. The response time here was 1/10 of that in hard matching with like. (The performance gap widens as data volume increases.)
mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%'; +---------+ | count() | +---------+ | 15 | +---------+ 1 row in set (0.13 sec) mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP'; +---------+ | count() | +---------+ | 15 | +---------+ 1 row in set (0.01 sec)
For more feature introduction and usage guide, see documentation: Inverted Index
In a word, what contributes to Apache Doris' 10-time higher cost-effectiveness than Elasticsearch is its OLAP-tailored optimizations for inverted indexing, supported by the columnar storage engine, massively parallel processing framework, vectorized query engine, and cost-based optimizer of Apache Doris.
As proud as we are about our own inverted indexing solution, we understand that self-published benchmarks can be controversial, so we are open to feedback from any third-party users and see how Apache Doris works in real-world cases.