{ “title”: “TPC-H Benchmark”, “language”: “en” }

TPC-H benchmark

TPC-H is a Decision Support Benchmark consisting of a set of business-oriented ad hoc queries and concurrent data modifications. The data that queries and populates the database has broad industry relevance. This benchmark demonstrates a decision support system that examines large amounts of data, executes highly complex queries, and answers critical business questions. The performance metric reported by TPC-H is called the TPC-H Hourly Compound Query Performance Metric (QphH@Size) and reflects multiple aspects of the system's ability to process queries. These aspects include the database size chosen when executing the query, the query processing power when the query is submitted by a single stream, and the query throughput when the query is submitted by multiple concurrent users.

This document mainly introduces the performance of Doris on the TPC-H test set.

Note 1: Standard test sets including TPC-H are usually far from actual business scenarios, and some tests will perform parameter tuning for the test set. Therefore, the test results of the standard test set can only reflect the performance of the database in specific scenarios. Users are advised to conduct further testing with actual business data.

Note 2: The operations covered in this document are tested on CentOS 7.x.

On the 22 queries on the TPC-H standard test dataset, we conducted a comparison test based on Doris version 1.1 and Doris 0.15.0 RC04 version, and the overall performance was improved by 3-4 times. In individual scenarios, a ten-fold improvement has been achieved.

image-20220614114351241

1. Hardware Environment

HardwareConfiguration Instructions
number of machines4 Alibaba Cloud hosts (1 FE, 3 BE)
CPUIntel Xeon(Cascade Lake) Platinum 8269CY 16C (2.5 GHz/3.2 GHz)
Memory64G
Network5Gbps
DiskESSD cloud hard disk

2. Software Environment

  • Doris deploys 3BE 1FE;
  • Kernel version: Linux version 5.4.0-96-generic (buildd@lgw01-amd64-051)
  • OS version: CentOS 7.8
  • Doris software version: Apache Doris 1.1, Apache Doris 0.15.0 RC04
  • JDK: openjdk version “11.0.14” 2022-01-18

3. Test Data Volume

The entire test simulation generates 100G of data and is imported into Doris 0.15.0 RC04 and Doris 1.1 versions for testing. The following is the relevant description of the table and the amount of data.

TPC-H Table NameRowsdata sizeremark
REGION5400KB
NATION257.714 KB
SUPPLIER100 million85.528 MB
PART20 million752.330 MB
PARTSUPP80 million4.375 GB
CUSTOMER15 million1.317 GB
ORDERS1.5 billion6.301 GB
LINEITEM6 billion20.882 GB

4. Test SQL

TPCH 22 test query statements : TPCH-Query-SQL

Notice:

The following four parameters in the above SQL do not exist in 0.15.0 RC04. When executed in 0.15.0 RC04, remove them:

1. enable_vectorized_engine=true,
2. batch_size=4096,
3. disable_join_reorder=false
4. enable_projection=true

5. Test Result

Here we use the upcoming Doris-1.1 version and Doris-0.15.0 RC04 version for comparative testing. The test results are as follows:

QueryDoris-1.1(s)0.15.0 RC04(s)
Q13.7528.63
Q24.227.88
Q32.649.39
Q41.59.3
Q52.154.11
Q60.190.43
Q71.041.61
Q81.7550.35
Q97.9416.34
Q101.415.21
Q110.351.72
Q120.575.39
Q138.1520.88
Q140.3
Q150.661.86
Q160.791.32
Q171.5126.67
Q183.36411.77
Q190.8291.71
Q202.775.2
Q214.4710.34
Q220.93.22
total51.253223.33
  • Result description
    • The data set corresponding to the test results is scale 100, about 600 million.
    • The test environment is configured to be commonly used by users, including 4 cloud servers, 16-core 64G SSD, and 1 FE and 3 BE deployment.
    • Use common user configuration tests to reduce user selection and evaluation costs, but will not consume so many hardware resources during the entire test process.
    • The test results are averaged over 3 executions. And the data has been fully compacted (if the data is tested immediately after the data is imported, the query delay may be higher than the test result, and the speed of compaction is being continuously optimized, and will be significantly reduced in the future).
    • 0.15 RC04 Q14 execution failed in TPC-H test, unable to complete query.

6. Environmental Preparation

Please refer to the official document to install and deploy Doris to obtain a normal running Doris cluster (at least 1 FE 1 BE, 1 FE 3 BE is recommended).

7. Data Preparation

7.1 Download and install the TPC-H data generation tool

Execute the following script to download and compile the tpch-tools tool.

sh build-tpch-dbgen.sh

After successful installation, the dbgen binary will be generated in the TPC-H_Tools_v3.0.0/ directory.

7.2 Generate TPC-H test set

Execute the following script to generate the TPC-H dataset:

sh gen-tpch-data.sh

Note 1: View script help via sh gen-tpch-data.sh -h.

Note 2: The data will be generated in the tpch-data/ directory with the suffix .tbl. The total file size is about 100GB. The generation time may vary from a few minutes to an hour.

Note 3: The standard test data set of 100G is generated by default

7.3 Create Table

7.3.1 Prepare the doris-cluster.conf file

Before calling the import script, you need to write the FE's ip port and other information in the doris-cluster.conf file.

File location and load-tpch-data.sh level.

The contents of the file include FE's ip, HTTP port, user name, password and the DB name of the data to be imported:

# Any of FE host
export FE_HOST='127.0.0.1'
# http_port in fe.conf
export FE_HTTP_PORT=8030
# query_port in fe.conf
export FE_QUERY_PORT=9030
# Doris username
export USER='root'
# Doris password
export PASSWORD=''
# The database where TPC-H tables located
export DB='tpch1'

7.3.2 Execute the following script to generate and create the TPC-H table

sh create-tpch-tables.sh

Or copy the table creation statement in create-tpch-tables.sql, Execute in Doris.

7.4 导入数据

通过下面的命令执行数据导入:

sh ./load-tpch-data.sh

7.5 Check Imported Data

Execute the following SQL statement to check the imported data volume is consistent with the above data volume.

select count(*)  from  lineitem;
select count(*)  from  orders;
select count(*)  from  partsupp;
select count(*)  from  part;
select count(*)  from  customer;
select count(*)  from  supplier;
select count(*)  from  nation;
select count(*)  from  region;
select count(*)  from  revenue0;

7.6 Query Test

Execute the above test SQL or execute the following command

./run-tpch-queries.sh

Notice:

  1. At present, the query optimizer and statistics functions of Doris are not perfect, so we rewrite some queries in TPC-H to adapt to the execution framework of Doris, but it does not affect the correctness of the results

  2. Doris' new query optimizer will be released in subsequent versions

  3. Set set mem_exec_limit=8G before executing the query