Provide the data the optimizer needs to estimate and compare candidate plans. For each column this includes count, ndv, null_count, min, max, data_size, and a histogram; for each table, the number of rows.
Support incremental collection and automatic collection.
No conflicts with any other function.
There may be compatibility issues if there are changes to the schema of the stats table in the future.
Class name | Function |
---|---|
AnalyzeStmt | Constructed by parsing user-input SQL. Each AnalyzeStmt corresponds to a Job, and a Job can have multiple Tasks, each responsible for collecting statistics on one column. |
AnalysisManager | Mainly responsible for managing Analyze Jobs/Tasks, including creation, execution, cancellation, and status updates, etc. |
StatisticsCache | The collected statistical information is cached here on demand. |
StatisticsCacheLoader | When StatsCalculator#computeScan fails to find the corresponding stats for a column in the cache, the load logic will be triggered, which is implemented in this class. |
AnalysisTaskExecutor | Used to execute AnalyzeJob. |
AnalysisTaskWrapper | This class encapsulates an AnalysisTask and extends FutureTask. It overrides some methods for state updates. |
AnalysisTaskScheduler | AnalysisTaskExecutor retrieves jobs from here for execution. Manually submitted jobs always have higher priority than automatically triggered ones. |
StatisticsCleaner | Responsible for cleaning up expired statistics and job information. |
StatisticsAutoAnalyzer | Mainly responsible for automatically analyzing statistics. Generates analysis job info for AnalysisManager to execute, including periodic and automatic analysis jobs. |
StatisticsRepository | Most of the related SQL is defined here. |
StatisticsUtil | Mainly consists of helper methods, such as checking the status of stats-related tables. |
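
The AnalysisTaskWrapper row describes a FutureTask subclass that overrides methods to update state. The following is a minimal sketch of that pattern, not the actual class: `TaskWrapperSketch`, `TrackedTask`, and the `TaskState` values are hypothetical names chosen for illustration.

```java
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicReference;

class TaskWrapperSketch {
    // Hypothetical task states; the real status values may differ.
    enum TaskState { PENDING, RUNNING, FINISHED, FAILED, CANCELLED }

    // Hypothetical wrapper: a FutureTask that records its own state transitions.
    static class TrackedTask extends FutureTask<Void> {
        private final AtomicReference<TaskState> state =
                new AtomicReference<>(TaskState.PENDING);

        TrackedTask(Runnable work) {
            super(work, null);
        }

        @Override
        public void run() {
            // Only move to RUNNING if the task was not cancelled beforehand.
            state.compareAndSet(TaskState.PENDING, TaskState.RUNNING);
            super.run();
        }

        // FutureTask invokes done() once the task completes, fails, or is cancelled.
        @Override
        protected void done() {
            if (isCancelled()) {
                state.set(TaskState.CANCELLED);
                return;
            }
            try {
                get(); // returns immediately because the task is already done
                state.set(TaskState.FINISHED);
            } catch (Exception e) {
                state.set(TaskState.FAILED);
            }
        }

        TaskState state() {
            return state.get();
        }
    }

    public static void main(String[] args) {
        TrackedTask ok = new TrackedTask(() -> { });
        ok.run();
        System.out.println("ok task: " + ok.state());

        TrackedTask failing = new TrackedTask(() -> {
            throw new IllegalStateException("collect failed");
        });
        failing.run();
        System.out.println("failing task: " + failing.state());
    }
}
```

Hooking into `done()` keeps the state update in one place regardless of whether the task finished normally, threw, or was cancelled.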
sequenceDiagram
    DdlExecutor->>AnalysisManager: createAnalysisJob
    AnalysisManager->>AnalysisManager: validateAndGetPartitions
    AnalysisManager->>AnalysisManager: createTaskForEachColumns
    AnalysisManager->>AnalysisManager: createTaskForMVIdx
    alt is sync task
        AnalysisManager->>AnalysisManager: syncExecute
    else is async task
        AnalysisManager->>StatisticsRepository: persist
        StatisticsRepository->>BE: write
        AnalysisManager->>AnalysisTaskScheduler: schedule
        AnalysisTaskScheduler->>AnalysisTaskExecutor: notify
        AnalysisTaskExecutor->>AnalysisTaskScheduler: getPendingTasks
        AnalysisTaskExecutor->>ThreadPoolExecutor: submit(AnalysisTaskWrapper)
        ThreadPoolExecutor->>AnalysisTaskWrapper: run
        AnalysisTaskWrapper->>BE: collect && write
        AnalysisTaskWrapper->>StatisticsCache: refresh
        AnalysisTaskWrapper->>AnalysisManager: updateTaskStatus
        alt is all task finished
            AnalysisManager->>StatisticsUtil: execUpdate mark job finished
            StatisticsUtil->>BE: update job status
        end
    end
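
In the async path, the executor pulls pending tasks from the scheduler, and manually submitted jobs always take priority over automatically triggered ones. One simple way to get that ordering is a pair of FIFO queues, sketched below. This is an assumption about how such a scheduler could be built, not the actual implementation; `SchedulerSketch`, `PendingTask`, and `Origin` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class SchedulerSketch {
    // Hypothetical marker for how a task was triggered.
    enum Origin { MANUAL, AUTO }

    // Hypothetical pending task: a name plus its origin.
    static final class PendingTask {
        final String name;
        final Origin origin;

        PendingTask(String name, Origin origin) {
            this.name = name;
            this.origin = origin;
        }
    }

    private final Queue<PendingTask> manualQueue = new ArrayDeque<>();
    private final Queue<PendingTask> autoQueue = new ArrayDeque<>();

    // Route the task to the queue that matches its origin.
    synchronized void schedule(PendingTask task) {
        if (task.origin == Origin.MANUAL) {
            manualQueue.add(task);
        } else {
            autoQueue.add(task);
        }
    }

    // Return the next task, draining manual tasks first; null when nothing is pending.
    synchronized PendingTask poll() {
        PendingTask task = manualQueue.poll();
        return task != null ? task : autoQueue.poll();
    }

    public static void main(String[] args) {
        SchedulerSketch scheduler = new SchedulerSketch();
        scheduler.schedule(new PendingTask("auto-1", Origin.AUTO));
        scheduler.schedule(new PendingTask("manual-1", Origin.MANUAL));
        scheduler.schedule(new PendingTask("auto-2", Origin.AUTO));

        List<String> order = new ArrayList<>();
        for (PendingTask t = scheduler.poll(); t != null; t = scheduler.poll()) {
            order.add(t.name);
        }
        System.out.println(order);
    }
}
```

Two queues keep FIFO order within each class of task while guaranteeing that an automatic task never runs ahead of a waiting manual one.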
sequenceDiagram
    StatsCalculator->>StatisticsCache: get
    alt is cached
        StatisticsCache->>StatsCalculator: return cached stats
    else not cached
        StatisticsCache->>StatsCalculator: return UNKNOWN stats
        StatisticsCache->>ThreadPoolExecutor: submit load task
        ThreadPoolExecutor->>AsyncTask: get
        AsyncTask->>StatisticsUtil: execStatisticQuery
        alt exception occurred:
            AsyncTask->>StatisticsCache: return UNKNOWN stats
            StatisticsCache->>StatisticsCache: cache UNKNOWN for the column
        else no exception:
            StatisticsUtil->>AsyncTask: Return results rows
            AsyncTask->>StatisticsUtil: deserializeToColumnStatistics(result rows)
            alt exception occurred:
                AsyncTask->>StatisticsCache: return UNKNOWN stats
                StatisticsCache->>StatisticsCache: cache UNKNOWN for the column
            else no exception:
                StatisticsCache->>StatisticsCache: cache normal stats
            end
        end
    end
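
The cache lookup flow can be sketched as follows: a miss returns UNKNOWN immediately and triggers a background load, and any failure during loading caches UNKNOWN for that column. This is an illustrative sketch only. `StatsCacheSketch`, `ColumnStats`, its two fields, and the loader function are hypothetical stand-ins for the real cache, statistics type, and query/deserialization steps.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;

class StatsCacheSketch {
    // Hypothetical, heavily simplified column statistics.
    static final class ColumnStats {
        static final ColumnStats UNKNOWN = new ColumnStats(-1, -1);
        final long count;
        final long ndv;

        ColumnStats(long count, long ndv) {
            this.count = count;
            this.ndv = ndv;
        }
    }

    private final Map<String, ColumnStats> cache = new ConcurrentHashMap<>();
    private final Function<String, ColumnStats> loader; // stands in for query + deserialize
    private final Executor executor;

    StatsCacheSketch(Function<String, ColumnStats> loader, Executor executor) {
        this.loader = loader;
        this.executor = executor;
    }

    // Never blocks: a miss yields UNKNOWN and schedules a load for later lookups.
    ColumnStats get(String column) {
        ColumnStats cached = cache.get(column);
        if (cached != null) {
            return cached;
        }
        load(column);
        return ColumnStats.UNKNOWN;
    }

    // Load asynchronously; on any exception, cache UNKNOWN for the column.
    CompletableFuture<Void> load(String column) {
        return CompletableFuture
                .supplyAsync(() -> loader.apply(column), executor)
                .exceptionally(e -> ColumnStats.UNKNOWN)
                .thenAccept(stats -> cache.put(column, stats));
    }

    public static void main(String[] args) {
        // Runnable::run executes the load on the calling thread, for a deterministic demo.
        StatsCacheSketch cache =
                new StatsCacheSketch(col -> new ColumnStats(1000, 42), Runnable::run);
        System.out.println(cache.get("c1") == ColumnStats.UNKNOWN); // first lookup misses
        System.out.println(cache.get("c1").ndv);                    // second lookup is cached
    }
}
```

This sketch does not deduplicate concurrent loads for the same column; a production cache would normally do so.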
The regression tests now mainly cover the following.
ANALYZE statement and its related characteristics. Because some functions are affected by other factors (such as system metadata reporting time) and may be unstable, this part is placed in p1.
For more, see statistics_p0 and statistics_p1.
p0 tests:
p1 tests:
p0 tests:
Any modification to the statistics module must keep all of the above cases passing.
20230508:
SHOW TABLE STATS statement to show table level statistics.
ANALYZE... WITH AUTO ... statement to automatically analyze statistics.