{ "title": "Log analysis: Elasticsearch vs Apache Doris", "description": "As a major part of a company's data assets, logs bring value to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine where you can extract information that points to business growth.", "date": "2023-09-28", "author": "Apache Doris", "tags": ["Tech Sharing"], "image": "/images/es-vs-doris.png" }
As a major part of a company's data assets, logs bring value to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine where you can extract information that points to business growth.
Logs are sequential records of events in a computer system. If you think about how logs are generated and used, you will know what an ideal log analysis system should look like:
A popular log processing solution within the data industry is the ELK stack: Elasticsearch, Logstash, and Kibana. The pipeline can be split into five modules:
The ELK stack has outstanding real-time processing capabilities, but frictions exist.
The Index Mapping in Elasticsearch defines the table schema, including the field names, data types, and whether to enable index creation.
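For illustration, a hypothetical Index Mapping for HTTP logs might look like the following (the field names mirror the example table used later in this post; the exact mapping is an assumption, not from a real system):

```python
import json

# A hypothetical Elasticsearch Index Mapping for HTTP logs.
# It fixes each field's name, data type, and whether it is indexed.
mapping = {
    "mappings": {
        "properties": {
            "ts": {"type": "date"},
            "clientip": {"type": "ip"},
            "request": {"type": "text"},                   # full-text indexed
            "status": {"type": "integer"},
            "size": {"type": "integer", "index": False},   # stored, not indexed
        }
    }
}

print(json.dumps(mapping, indent=2))
```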
Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds fields to the Mapping according to the input JSON data. This provides some sort of schema-free support, but it's not enough because:
Elasticsearch has its own Domain Specific Language (DSL), which differs greatly from the tech stack most data engineers and analysts are familiar with, so there is a steep learning curve. Moreover, Elasticsearch has a relatively closed ecosystem, so there can be strong resistance to integration with BI tools. Most importantly, Elasticsearch only supports single-table analysis and lags behind modern OLAP demands for multi-table joins, sub-queries, and views.
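To make the learning-curve point concrete, here is a rough sketch of two queries with the same intent: a match query in Elasticsearch's query DSL versus a SQL statement of the kind shown later in this post (table and field names are illustrative):

```python
import json

# Elasticsearch query DSL: find the latest 10 log entries whose
# "request" field matches any of the terms "error" or "404".
es_query = {
    "query": {"match": {"request": "error 404"}},
    "sort": [{"ts": "desc"}],
    "size": 10,
}

# Roughly the same intent expressed in SQL (Apache Doris syntax,
# using an inverted index on the "request" column).
sql_query = (
    "SELECT * FROM log_table "
    "WHERE request MATCH_ANY 'error 404' "
    "ORDER BY ts DESC LIMIT 10;"
)

print(json.dumps(es_query))
print(sql_query)
```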
Elasticsearch users have been complaining about the computation and storage costs. The root reason lies in the way Elasticsearch works.
As data and cluster size grows, maintaining stability can be another issue:
During data writing peaks: clusters are prone to overload when large volumes of data arrive at once.
During queries: Since all queries are processed in the memory, big queries can easily lead to JVM OOM.
Slow recovery: after a cluster failure, Elasticsearch must reload its indexes, which is resource-intensive and can take many minutes. That challenges service availability guarantees.
Reflecting on the strengths and limitations of the Elasticsearch-based solution, the Apache Doris developers have optimized Apache Doris for log processing.
Benchmark tests with ES Rally, the official testing tool for Elasticsearch, showed that Apache Doris was around 5 times as fast as Elasticsearch in data writing, 2.3 times as fast in queries, and it consumed only 1/5 of the storage space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a writing speed of 550 MB/s and a compression ratio of 10:1.
The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:
Moreover, Apache Doris provides better schema-free support and a more user-friendly analytic engine.
Firstly, we worked on the data types. We optimized string search and regular expression matching for "text" through vectorization, bringing a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by 4 times. We also added new data types for complicated data: Array and Map. They can structuralize concatenated strings to allow for a higher compression ratio and faster queries.
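As a rough illustration of what "structuralizing concatenated strings" means (a Python sketch of the idea, not Doris internals): a key=value payload can be split into a map, whose repeated keys compress much better than raw strings and can be filtered without string scanning:

```python
# Illustrative sketch (not Doris internals): turn a concatenated
# key=value string into a map, the shape a Map column would store.
def structuralize(payload: str) -> dict:
    pairs = (item.split("=", 1) for item in payload.split("&") if "=" in item)
    return {k: v for k, v in pairs}

record = structuralize("method=GET&path=/faq&status=200")
print(record)  # {'method': 'GET', 'path': '/faq', 'status': '200'}
```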
Secondly, Apache Doris supports schema evolution. This means you can adjust the schema as your business changes. You can add or delete fields and indexes, and change the data types for fields.
Apache Doris provides Light Schema Change capabilities, so you can add or delete fields within milliseconds:
-- Add a column. Result will be returned in milliseconds.
ALTER TABLE lineitem ADD COLUMN l_new_column INT;
You can also add indexes only for your target fields, so you avoid the overhead of unnecessary index creation. After you add an index, the system will by default generate the index for all incremental data, and you can specify which historical data partitions need the index.
-- Add inverted index. Doris will generate inverted index for all new data afterward.
ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;

-- Build index for the specified historical data partitions.
BUILD INDEX index_name ON table_name PARTITIONS(partition_name1, partition_name2);
The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table join, sub-query, UDF, logic views, and materialized views to serve their own needs.
With MySQL compatibility, Apache Doris can be integrated with most GUI and BI tools in the big data ecosystem, so users can realize more complex and diversified data analysis.
A gaming company has transitioned from the ELK stack to the Apache Doris solution. Their Doris-based log system used 1/6 of the storage space that they previously needed.
A cybersecurity company built their log analysis system on the inverted index of Apache Doris. It supports a data writing speed of 300,000 rows per second with 1/5 of the server resources they formerly used.
Now let's go through the three steps of building a log analysis system with Apache Doris.
Before you start, download Apache Doris 2.0 or newer versions from the website and deploy clusters.
This is an example of table creation.
Explanations for the configurations:
CREATE DATABASE log_db;
USE log_db;

CREATE RESOURCE "log_s3"
PROPERTIES
(
    "type" = "s3",
    "s3.endpoint" = "your_endpoint_url",
    "s3.region" = "your_region",
    "s3.bucket" = "your_bucket",
    "s3.root.path" = "your_path",
    "s3.access_key" = "your_ak",
    "s3.secret_key" = "your_sk"
);

CREATE STORAGE POLICY log_policy_1day
PROPERTIES(
    "storage_resource" = "log_s3",
    "cooldown_ttl" = "86400"
);

CREATE TABLE log_table
(
    `ts` DATETIMEV2,
    `clientip` VARCHAR(20),
    `request` TEXT,
    `status` INT,
    `size` INT,
    INDEX idx_size (`size`) USING INVERTED,
    INDEX idx_status (`status`) USING INVERTED,
    INDEX idx_clientip (`clientip`) USING INVERTED,
    INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
)
ENGINE = OLAP
DUPLICATE KEY(`ts`)
PARTITION BY RANGE(`ts`) ()
DISTRIBUTED BY RANDOM BUCKETS AUTO
PROPERTIES (
    "replication_num" = "1",
    "storage_policy" = "log_policy_1day",
    "deprecated_dynamic_schema" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "7",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "AUTO",
    "dynamic_partition.replication_num" = "1"
);
Apache Doris supports various ingestion methods. For real-time logs, we recommend the following three methods:
Ingest from Kafka
For JSON logs that are written into Kafka message queues, create a Routine Load so Doris will pull data from Kafka. The following is an example; the property.* configurations are optional:
-- Prepare the Kafka cluster and topic ("log_topic")
-- Create Routine Load, load data from Kafka log_topic to "log_table"
CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
COLUMNS(ts, clientip, request, status, size)
PROPERTIES (
    "max_batch_interval" = "10",
    "max_batch_rows" = "1000000",
    "max_batch_size" = "109715200",
    "strict_mode" = "false",
    "format" = "json"
)
FROM KAFKA (
    "kafka_broker_list" = "host:port",
    "kafka_topic" = "log_topic",
    "property.group.id" = "your_group_id",
    "property.security.protocol" = "SASL_PLAINTEXT",
    "property.sasl.mechanism" = "GSSAPI",
    "property.sasl.kerberos.service.name" = "kafka",
    "property.sasl.kerberos.keytab" = "/path/to/xxx.keytab",
    "property.sasl.kerberos.principal" = "xxx@yyy.com"
);
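The JSON messages in Kafka should carry fields matching the COLUMNS list above. A minimal, stdlib-only sketch of building one such record (producing it to Kafka is left to your producer library of choice):

```python
import json
from datetime import datetime, timezone

# Build one JSON log line matching the log_table columns
# (ts, clientip, request, status, size).
def make_log_record(clientip: str, request: str, status: int, size: int) -> str:
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        "clientip": clientip,
        "request": request,
        "status": status,
        "size": size,
    }
    return json.dumps(record)

line = make_log_record("8.8.8.8", "GET /faq HTTP/1.0", 200, 1024)
print(line)
```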
You can check how the Routine Load runs via the SHOW ROUTINE LOAD command.
Ingest via Logstash
Configure HTTP Output for Logstash, and then data will be sent to Doris via HTTP Stream Load.
Adjust the batching settings in logstash.yml to improve data writing performance:

pipeline.batch.size: 100000
pipeline.batch.delay: 10000

In testlog.conf, set the URL to the Stream Load address in Doris. Authentication uses HTTP basic auth; the token is computed with echo -n 'username:password' | base64. Adding load_to_single_tablet to the headers can reduce the number of small files in data ingestion.

output {
    http {
        follow_redirects => true
        keepalive => false
        http_method => "put"
        url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
        headers => [
            "format", "json",
            "strip_outer_array", "true",
            "load_to_single_tablet", "true",
            "Authorization", "Basic cm9vdDo=",
            "Expect", "100-continue"
        ]
        format => "json_batch"
    }
}
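The Authorization value Basic cm9vdDo= is simply the base64 encoding of username:password (here, root with an empty password). You can compute it with the echo command mentioned above or, as a quick check, in Python:

```python
import base64

# Compute the HTTP basic auth token for a Doris user.
def basic_auth_token(username: str, password: str) -> str:
    raw = f"{username}:{password}".encode()
    return "Basic " + base64.b64encode(raw).decode()

print(basic_auth_token("root", ""))  # Basic cm9vdDo=
```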
Ingest via self-defined program
This is an example of ingesting data to Doris via HTTP Stream Load.
Notes:

- Use basic auth for HTTP authorization. It is computed with echo -n 'username:password' | base64.
- http header "format:json": the data type is specified as JSON
- http header "read_json_by_line:true": each line is a JSON record
- http header "load_to_single_tablet:true": write to one tablet each time

curl \
    --location-trusted \
    -u username:password \
    -H "format:json" \
    -H "read_json_by_line:true" \
    -H "load_to_single_tablet:true" \
    -T logfile.json \
    http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
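In a self-defined program, the curl call above becomes an HTTP PUT with the same headers. Below is a stdlib-only sketch that builds such a request (host, port, and credentials are placeholders; sending is left to the caller, and note that curl's --location-trusted re-sends credentials on the FE-to-BE redirect, which your HTTP client must also handle):

```python
import base64
import urllib.request

# Build (but do not send) a Stream Load request equivalent to the curl call.
def build_stream_load_request(fe_host: str, fe_http_port: int,
                              db: str, table: str,
                              username: str, password: str,
                              payload: bytes) -> urllib.request.Request:
    url = f"http://{fe_host}:{fe_http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "format": "json",
        "read_json_by_line": "true",
        "load_to_single_tablet": "true",
        "Expect": "100-continue",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="PUT")

# Placeholder host/port; substitute your FE address.
req = build_stream_load_request("fe_host", 8030, "log_db", "log_table",
                                "root", "", b'{"status": 200}')
print(req.full_url)
```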
Apache Doris supports standard SQL, so you can connect to Doris via MySQL client or JDBC and then execute SQL queries.
mysql -h fe_host -P fe_mysql_port -u root -Dlog_db
A few common queries in log analysis:
SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;
SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;
SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;
SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;
If you are looking for an efficient log analytic solution, Apache Doris is friendly to anyone equipped with SQL knowledge; if you find friction with the ELK stack, try Apache Doris: it provides better schema-free support, enables faster data writing and queries, and brings a much lighter storage burden.
But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to support more complicated data types in the inverted index, and add BKD index support to make Apache Doris a fit for geo data analysis. We also plan to expand our capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any user feedback and development advice.