This document focuses on introducing the Running Profile which recorded runtime status of Doris in query execution. Using these statistical information, we can understand the execution of frgment to become a expert of Doris's debugging and tuning.
FE: Frontend, frontend node of Doris. Responsible for metadata management and request access.
BE: Backend, backend node of Doris. Responsible for query execution and data storage.
Fragment: FE will convert the execution of specific SQL statements into corresponding fragments and distribute them to BE for execution. BE will execute corresponding fragments and gather the result of RunningProfile to send back FE.
FE splits the query plan into fragments and distributes them to BE for task execution. BE records the statistics of Running State when executing fragment. BE print the outputs statistics of fragment execution into the log. FE can also collect these statistics recorded by each fragment and print the results on FE's web page.
Turn on the report switch on FE through MySQL command
mysql> set enable_profile=true;
After executing the corresponding SQL statement(is_report_success
in old versions), we can see the report information of the corresponding SQL statement on the FE web page like the picture below.
The latest 100 statements executed will be listed here. We can view detailed statistics of RunningProfile.
Query: Summary: Query ID: 9664061c57e84404-85ae111b8ba7e83a Start Time: 2020-05-02 10:34:57 End Time: 2020-05-02 10:35:08 Total: 10s323ms Query Type: Query Query State: EOF Doris Version: trunk User: root Default Db: default_cluster:test Sql Statement: select max(Bid_Price) from quotes group by Symbol
Here is a detailed list of query ID, execution time, execution statement
and other summary information. The next step is to print the details of each fragment collected from be.
Fragment 0: Instance 9664061c57e84404-85ae111b8ba7e83d (host=TNetworkAddress(hostname:192.168.0.1, port:9060)):(Active: 10s270ms, % non-child: 0.14%) - MemoryLimit: 2.00 GB - BytesReceived: 168.08 KB - PeakUsedReservation: 0.00 - SendersBlockedTimer: 0ns - DeserializeRowBatchTimer: 501.975us - PeakMemoryUsage: 577.04 KB - RowsProduced: 8.322K (8322) EXCHANGE_NODE (id=4):(Active: 10s256ms, % non-child: 99.35%) - ConvertRowBatchTime: 180.171us - PeakMemoryUsage: 0.00 - RowsReturned: 8.322K (8322) - MemoryUsed: 0.00 - RowsReturnedRate: 811
The fragment ID is listed here; hostname
show the be node executing the fragment; active: 10s270ms
show the total execution time of the node; non child: 0.14%
means the execution time of the execution node itself (not including the execution time of child nodes) as a percentage of the total time.
PeakMemoryUsage
indicates the peak memory usage of EXCHANGE_NODE
; RowsReturned
indicates the number of rows returned by EXCHANGE_NODE
; RowsReturnedRate
=RowsReturned
/ActiveTime
; the meaning of these three statistics in other NODE
the same.
Subsequently, the statistics of the child nodes will be printed in turn. here you can distinguish the parent-child relationship by intent.
There are many statistical information collected at BE. so we list the corresponding meanings of profile are below:
Fragment
BlockMgr
DataStreamSender
ODBC_TABLE_SINK
EXCHANGE_NODE
SORT_NODE
AGGREGATION_NODE
HASH_JOIN_NODE
CROSS_JOIN_NODE
UNION_NODE
ANALYTIC_EVAL_NODE
OLAP_SCAN_NODE
The OLAP_SCAN_NODE
is responsible for specific data scanning tasks. One OLAP_SCAN_NODE
will generate one or more OlapScanner
. Each Scanner thread is responsible for scanning part of the data.
Some or all of the predicate conditions in the query will be pushed to OLAP_SCAN_NODE
. Some of these predicate conditions will continue to be pushed down to the storage engine in order to use the storage engine's index for data filtering. The other part will be kept in OLAP_SCAN_NODE
to filter the data returned from the storage engine.
The profile of the OLAP_SCAN_NODE
node is usually used to analyze the efficiency of data scanning. It is divided into three layers: OLAP_SCAN_NODE
, OlapScanner
, and SegmentIterator
according to the calling relationship.
The profile of a typical OLAP_SCAN_NODE
is as follows. Some indicators will have different meanings depending on the storage format (V1 or V2).
OLAP_SCAN_NODE (id=0):(Active: 1.2ms,% non-child: 0.00%) - BytesRead: 265.00 B # The amount of data read from the data file. Assuming that 10 32-bit integers are read, the amount of data is 10 * 4B = 40 Bytes. This data only represents the fully expanded size of the data in memory, and does not represent the actual IO size. - NumDiskAccess: 1 # The number of disks involved in this ScanNode node. - NumScanners: 20 # The number of Scanners generated by this ScanNode. - PeakMemoryUsage: 0.00 # Peak memory usage during query, not used yet - RowsRead: 7 # The number of rows returned from the storage engine to the Scanner, excluding the number of rows filtered by the Scanner. - RowsReturned: 7 # The number of rows returned from ScanNode to the upper node. - RowsReturnedRate: 6.979K /sec # RowsReturned/ActiveTime - TabletCount: 20 # The number of Tablets involved in this ScanNode. - TotalReadThroughput: 74.70 KB/sec # BytesRead divided by the total time spent in this node (from Open to Close). For IO bounded queries, this should be very close to the total throughput of all the disks - ScannerBatchWaitTime: 426.886us # To count the time the transfer thread waits for the scaner thread to return rowbatch. - ScannerWorkerWaitTime: 17.745us # To count the time that the scanner thread waits for the available worker threads in the thread pool. OlapScanner: - BlockConvertTime: 8.941us # The time it takes to convert a vectorized Block into a RowBlock with a row structure. The vectorized Block is VectorizedRowBatch in V1 and RowBlockV2 in V2. - BlockFetchTime: 468.974us # Rowset Reader gets the time of the Block. - ReaderInitTime: 5.475ms # The time when OlapScanner initializes Reader. V1 includes the time to form MergeHeap. V2 includes the time to generate various Iterators and read the first group of blocks. - RowsDelFiltered: 0 # Including the number of rows filtered out according to the Delete information in the Tablet, and the number of rows filtered for marked deleted rows under the unique key model. - RowsPushedCondFiltered: 0 # Filter conditions based on the predicates passed down, such as the conditions passed from BuildTable to ProbeTable in Join calculation. This value is not accurate, because if the filtering effect is poor, it will no longer be filtered. - ScanTime: 39.24us # The time returned from ScanNode to the upper node. - ShowHintsTime_V1: 0ns # V2 has no meaning. Read part of the data in V1 to perform ScanRange segmentation. SegmentIterator: - BitmapIndexFilterTimer: 779ns # Use bitmap index to filter data time-consuming. - BlockLoadTime: 415.925us # SegmentReader(V1) or SegmentIterator(V2) gets the time of the block. - BlockSeekCount: 12 # The number of block seeks when reading Segment. - BlockSeekTime: 222.556us # It takes time to block seek when reading Segment. - BlocksLoad: 6 # read the number of blocks - CachedPagesNum: 30 # In V2 only, when PageCache is enabled, the number of Pages that hit the Cache. - CompressedBytesRead: 0.00 # In V1, the size of the data read from the file before decompression. In V2, the pre-compressed size of the read page that did not hit the PageCache. - DecompressorTimer: 0ns # Data decompression takes time. - IOTimer: 0ns # IO time for actually reading data from the operating system. - IndexLoadTime_V1: 0ns # Only in V1, it takes time to read Index Stream. - NumSegmentFiltered: 0 # When generating Segment Iterator, the number of Segments that are completely filtered out through column statistics and query conditions. - NumSegmentTotal: 6 # Query the number of all segments involved. - RawRowsRead: 7 # The number of raw rows read in the storage engine. See below for details. - RowsBitmapIndexFiltered: 0 # Only in V2, the number of rows filtered by the Bitmap index. - RowsBloomFilterFiltered: 0 # Only in V2, the number of rows filtered by BloomFilter index. - RowsKeyRangeFiltered: 0 # In V2 only, the number of rows filtered out by SortkeyIndex index. - RowsStatsFiltered: 0 # In V2, the number of rows filtered by the ZoneMap index, including the deletion condition. V1 also contains the number of rows filtered by BloomFilter. - RowsConditionsFiltered: 0 # Only in V2, the number of rows filtered by various column indexes. - RowsVectorPredFiltered: 0 # The number of rows filtered by the vectorized condition filtering operation. - TotalPagesNum: 30 # Only in V2, the total number of pages read. - UncompressedBytesRead: 0.00 # V1 is the decompressed size of the read data file (if the file does not need to be decompressed, the file size is directly counted). In V2, only the decompressed size of the Page that missed PageCache is counted (if the Page does not need to be decompressed, the Page size is directly counted) - VectorPredEvalTime: 0ns # Time-consuming of vectorized condition filtering operation.
The predicate push down and index usage can be inferred from the related indicators of the number of data rows in the profile. The following only describes the profile in the reading process of segment V2 format data. In segment V1 format, the meaning of these indicators is slightly different.
RowsKeyRangeFiltered
.RowsBitmapIndexFiltered
.RowsBloomFilterFiltered
. The value of RowsBloomFilterFiltered
is the difference between the total number of rows of the Segment (not the number of rows filtered by the Bitmap index) and the number of remaining rows after BloomFilter, so the data filtered by BloomFilter may overlap with the data filtered by Bitmap.RowsStatsFiltered
.RowsConditionsFiltered
is the number of rows filtered by various indexes, including the values of RowsBloomFilterFiltered
and RowsStatsFiltered
.RowsDelFiltered
. Therefore, the number of rows actually filtered by the delete condition are recorded in RowsStatsFiltered
and RowsDelFiltered
respectively.RawRowsRead
is the final number of rows to be read after the above filtering.RowsRead
is the number of rows finally returned to Scanner. RowsRead
is usually smaller than RawRowsRead
, because returning from the storage engine to the Scanner may go through a data aggregation. If the difference between RawRowsRead
and RowsRead
is large, it means that a large number of rows are aggregated, and aggregation may be time-consuming.RowsReturned
is the number of rows finally returned by ScanNode to the upper node. RowsReturned
is usually smaller than RowsRead
. Because there will be some predicate conditions on the Scanner that are not pushed down to the storage engine, filtering will be performed once. If the difference between RowsRead
and RowsReturned
is large, it means that many rows are filtered in the Scanner. This shows that many highly selective predicate conditions are not pushed to the storage engine. The filtering efficiency in Scanner is worse than that in storage engine.Through the above indicators, you can roughly analyze the number of rows processed by the storage engine and the size of the final filtered result row. Through the Rows***Filtered
group of indicators, it is also possible to analyze whether the query conditions are pushed down to the storage engine, and the filtering effects of different indexes. In addition, a simple analysis can be made through the following aspects.
OlapScanner
, such as IOTimer
, BlockFetchTime
, etc., are the accumulation of all Scanner thread indicators, so the value may be relatively large. And because the Scanner thread reads data asynchronously, these cumulative indicators can only reflect the cumulative working time of the Scanner, and do not directly represent the time consumption of the ScanNode. The time-consuming ratio of ScanNode in the entire query plan is the value recorded in the Active
field. Sometimes it appears that IOTimer
has tens of seconds, but Active
is actually only a few seconds. This situation is usually due to:IOTimer
is the accumulated time of multiple Scanners, and there are more Scanners.Active
may be only a few milliseconds. Because while the upper layer is processing data, ScanNode has performed data scanning asynchronously and prepared the data. When the upper node obtains data from ScanNode, it can obtain the prepared data, so the Active time is very short.NumScanners
represents the number of Tasks submitted by the Scanner to the thread pool. It is scheduled by the thread pool in RuntimeState
. The two parameters doris_scanner_thread_pool_thread_num
and doris_scanner_thread_pool_queue_size
control the size of the thread pool and the queue length respectively. Too many or too few threads will affect query efficiency. At the same time, some summary indicators can be divided by the number of threads to roughly estimate the time consumption of each thread.TabletCount
indicates the number of tablets to be scanned. Too many may mean a lot of random read and data merge operations.UncompressedBytesRead
indirectly reflects the amount of data read. If the value is large, it means that there may be a lot of IO operations.CachedPagesNum
and TotalPagesNum
can check the hitting status of PageCache. The higher the hit rate, the less time-consuming IO and decompression operations.Buffer pool