fe/fe-core/src/main/java/org/apache/doris/statistics/README.md

Requirements

Basic

Provide necessary data for the optimizer to calculate and compare various plans. This includes count, ndv, null_count, min, max, data_size, histogram for each column, as well as the number of rows in the table.
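As a rough illustration of how an optimizer might consume these values, the sketch below holds the per-column numbers and derives a textbook equality selectivity from count, ndv, and null_count. The class `ColumnStatsSketch` and its methods are hypothetical and are not the Doris `ColumnStatistic` API. The uniform-distribution formula is an assumption, not necessarily what the Doris optimizer does.

```java
/** Hypothetical holder for the per-column values listed above. */
final class ColumnStatsSketch {
    final double count;      // number of rows
    final double ndv;        // number of distinct values
    final double nullCount;  // number of NULL values
    final double min;
    final double max;
    final double dataSize;   // total size of the column data

    ColumnStatsSketch(double count, double ndv, double nullCount,
                      double min, double max, double dataSize) {
        this.count = count;
        this.ndv = ndv;
        this.nullCount = nullCount;
        this.min = min;
        this.max = max;
        this.dataSize = dataSize;
    }

    /**
     * Estimated fraction of rows matching `col = constant`, assuming the
     * non-null rows are spread evenly over the distinct values.
     */
    double equalitySelectivity() {
        double nonNull = count - nullCount;
        if (count <= 0 || ndv <= 0 || nonNull <= 0) {
            return 0.0;
        }
        return (nonNull / count) / ndv;
    }

    /** Estimated number of rows matching `col = constant`. */
    double estimatedEqualityRows() {
        return count * equalitySelectivity();
    }

    public static void main(String[] args) {
        ColumnStatsSketch stats = new ColumnStatsSketch(1000, 50, 100, 0, 99, 8000);
        System.out.println("estimated rows: " + stats.estimatedEqualityRows());
    }
}
```

A plan that filters on a column with a high ndv would get a smaller row estimate than one that filters on a low-ndv column, which is the kind of comparison these statistics make possible.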

Advanced (not finished yet)

Support incremental collection and auto collection.

Specification

Compatibility

Function compatibility

No conflicts with any other function.

Version compatibility

There may be compatibility issues if there are changes to the schema of the stats table in the future.

Implementation

Main class

| Class name | Function |
| --- | --- |
| AnalyzeStmt | Constructed by parsing user-input SQL. Each AnalyzeStmt corresponds to a Job, and a Job can have multiple Tasks, with each Task responsible for collecting statistics on one column. |
| AnalysisManager | Manages Analyze Jobs/Tasks, including creation, execution, cancellation, and status updates. |
| StatisticsCache | Caches the collected statistics on demand. |
| StatisticsCacheLoader | Implements the load logic that is triggered when StatsCalculator#computeScan fails to find the stats for a column in the cache. |
| AnalysisTaskExecutor | Used to execute AnalyzeJob. |
| AnalysisTaskWrapper | Encapsulates an AnalysisTask and extends FutureTask. It overrides some methods for state updates. |
| AnalysisTaskScheduler | AnalysisTaskExecutor retrieves jobs from here for execution. Manually submitted jobs always have higher priority than automatically triggered ones. |
| StatisticsCleaner | Cleans up expired statistics and job information. |
| StatisticsAutoAnalyzer | Automatically analyzes statistics. Generates analysis job info for AnalysisManager to execute, including periodic and automatic analysis jobs. |
| StatisticsRepository | Most of the related SQL is defined here. |
| StatisticsUtil | Mainly consists of helper methods, such as checking the status of stats-related tables. |
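To illustrate the idea of a task wrapper that extends FutureTask and overrides methods for state updates, here is a minimal sketch. `TaskWrapperSketch` and `SketchState` are hypothetical names; the real AnalysisTaskWrapper may override different methods and track different states.

```java
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical task states; the real AnalysisState values may differ. */
enum SketchState { PENDING, RUNNING, FINISHED, FAILED }

/** Hypothetical wrapper that records a state as the wrapped task progresses. */
final class TaskWrapperSketch extends FutureTask<Void> {
    private final AtomicReference<SketchState> state =
            new AtomicReference<>(SketchState.PENDING);

    TaskWrapperSketch(Runnable collectStats) {
        super(collectStats, null);
    }

    @Override
    public void run() {
        // Only move PENDING -> RUNNING, so a task cancelled earlier keeps its final state.
        state.compareAndSet(SketchState.PENDING, SketchState.RUNNING);
        super.run();
    }

    @Override
    protected void done() {
        // FutureTask calls done() once the task has completed, failed, or been cancelled.
        SketchState result;
        try {
            get(); // returns immediately because the task is already done
            result = SketchState.FINISHED;
        } catch (Exception e) {
            // ExecutionException (task threw) or CancellationException (task cancelled)
            result = SketchState.FAILED;
        }
        state.set(result);
    }

    SketchState state() {
        return state.get();
    }

    public static void main(String[] args) {
        TaskWrapperSketch task = new TaskWrapperSketch(() -> System.out.println("collecting"));
        task.run();
        System.out.println(task.state());
    }
}
```

Putting the status update in `done()` keeps it in one place for success, failure, and cancellation, so the caller that submits the wrapper to a thread pool does not need its own bookkeeping.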

Analyze execution flow

sequenceDiagram
DdlExecutor->>AnalysisManager: createAnalysisJob
AnalysisManager->>AnalysisManager: validateAndGetPartitions
AnalysisManager->>AnalysisManager: createTaskForEachColumns
AnalysisManager->>AnalysisManager: createTaskForMVIdx
alt is sync task
    AnalysisManager->>AnalysisManager: syncExecute
else is async task
    AnalysisManager->>StatisticsRepository: persist
    StatisticsRepository->>BE: write
    AnalysisManager->>AnalysisTaskScheduler: schedule
    AnalysisTaskScheduler->>AnalysisTaskExecutor: notify
    AnalysisTaskExecutor->>AnalysisTaskScheduler: getPendingTasks
    AnalysisTaskExecutor->>ThreadPoolExecutor: submit(AnalysisTaskWrapper)
    ThreadPoolExecutor->>AnalysisTaskWrapper: run
    AnalysisTaskWrapper->>BE: collect && write
    AnalysisTaskWrapper->>StatisticsCache: refresh
    AnalysisTaskWrapper->>AnalysisManager: updateTaskStatus
    alt is all task finished
        AnalysisManager->> StatisticsUtil: execUpdate mark job finished
        StatisticsUtil->> BE: update job status
    end
end
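The schedule and getPendingTasks steps in the flow above, combined with the rule that manually submitted jobs take priority over automatically triggered ones, could be sketched as follows. `SchedulerSketch` is a hypothetical class that uses plain strings as jobs. It omits the notify step and uses a non-blocking poll, which may differ from how AnalysisTaskScheduler actually works.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical scheduler: manual jobs are always handed out before automatic ones. */
final class SchedulerSketch {
    private final Queue<String> manualJobs = new ArrayDeque<>();
    private final Queue<String> autoJobs = new ArrayDeque<>();

    /** Enqueues a job in the queue that matches how it was triggered. */
    synchronized void schedule(String job, boolean manual) {
        (manual ? manualJobs : autoJobs).add(job);
    }

    /** Returns the next job to run, or null when nothing is pending. */
    synchronized String pollPendingTask() {
        if (!manualJobs.isEmpty()) {
            return manualJobs.poll();
        }
        return autoJobs.poll();
    }

    public static void main(String[] args) {
        SchedulerSketch scheduler = new SchedulerSketch();
        scheduler.schedule("auto-1", false);
        scheduler.schedule("manual-1", true);
        String job;
        while ((job = scheduler.pollPendingTask()) != null) {
            System.out.println(job);
        }
    }
}
```

Two FIFO queues keep submission order within each class of job while still letting a later manual job overtake earlier automatic ones.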

Load execution flow

sequenceDiagram
StatsCalculator->>StatisticsCache: get
alt is cached
    StatisticsCache->>StatsCalculator: return cached stats
else not cached
    StatisticsCache->>StatsCalculator: return UNKNOWN stats
    StatisticsCache->>ThreadPoolExecutor: submit load task
    ThreadPoolExecutor->>AsyncTask: get
    AsyncTask->>StatisticsUtil: execStatisticQuery
        alt exception occurred:
        AsyncTask->>StatisticsCache: return UNKNOWN stats
        StatisticsCache->> StatisticsCache: cache UNKNOWN for the column
    else no exception:
            StatisticsUtil->>AsyncTask: Return results rows
            AsyncTask->>StatisticsUtil: deserializeToColumnStatistics(result rows)
            alt exception occurred:
                AsyncTask->>StatisticsCache: return UNKNOWN stats
                StatisticsCache->> StatisticsCache: cache UNKNOWN for the column
            else no exception:
                StatisticsCache->> StatisticsCache: cache normal stats
            end
    end

end
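The load flow above can be sketched as a cache that answers a miss with UNKNOWN immediately and fills the entry in the background, caching UNKNOWN when the load fails. `StatsCacheSketch`, its string-valued stats, and the injected loader are all hypothetical; the real StatisticsCache is built differently.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;

/** Hypothetical cache: a miss returns UNKNOWN and triggers an asynchronous load. */
final class StatsCacheSketch {
    static final String UNKNOWN = "UNKNOWN";

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    private final Function<String, String> loader; // stands in for the stats query
    private final Executor executor;

    StatsCacheSketch(Function<String, String> loader, Executor executor) {
        this.loader = loader;
        this.executor = executor;
    }

    /** Never blocks on the loader when run with an asynchronous executor. */
    String get(String column) {
        String cached = cache.get(column);
        if (cached != null) {
            return cached;
        }
        // Submit at most one load per column at a time.
        if (inFlight.add(column)) {
            executor.execute(() -> load(column));
        }
        return UNKNOWN;
    }

    private void load(String column) {
        String stats;
        try {
            stats = loader.apply(column);
        } catch (RuntimeException e) {
            stats = UNKNOWN; // query or deserialization failed
        }
        cache.put(column, stats == null ? UNKNOWN : stats);
        inFlight.remove(column);
    }

    public static void main(String[] args) {
        // Runnable::run executes the load inline, which keeps this demo deterministic.
        StatsCacheSketch cache = new StatsCacheSketch(col -> "stats(" + col + ")", Runnable::run);
        System.out.println(cache.get("c1"));
        System.out.println(cache.get("c1"));
    }
}
```

Because a failed load also stores UNKNOWN, later lookups for that column hit the cache instead of retrying the query; eviction or expiry of such entries is left out of this sketch.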

Configure options

User interface

Test

The regression tests now mainly cover the following.

  • Analyze stats: verifies the ANALYZE statement and its related features. Some of these features are affected by external factors (such as system metadata reporting time) and may be unstable, so this part is placed in p1.
  • Manage stats: verifies injection, deletion, display, and other related operations on statistics.

For more, see statistics_p0 and statistics_p1.

Analyze stats

p0 tests:

  1. Universal analysis

p1 tests:

  1. Universal analysis
  2. Sampled analysis
  3. Incremental analysis
  4. Automatic analysis
  5. Periodic analysis

Manage stats

p0 tests:

  1. Alter table stats
  2. Show table stats
  3. Alter column stats
  4. Show column stats
  5. Show column histogram
  6. Drop column stats
  7. Drop expired stats

Any modification to the statistics module must keep all of the above cases passing.

Feature note

20230508:

  1. Add table-level statistics and support the SHOW TABLE STATS statement to display them.
  2. Implement automatic statistics analysis and support the ANALYZE ... WITH AUTO ... statement to trigger it.