src/backend/optimizer/README.cbdb.aqumv - cloudberry - Git at Google

 src/backend/optimizer/README.cbdb.aqumv

 Answer Query Using Materialized Views
 =====================================

 AQUMV for short, is used to compute part or all of a Query from materialized views during planning.
 It could provide massive improvements in query processing time, especially for aggregation queries over large tables[1].

 AQUMV usually uses Incremental Materialized Views(IMV) as candidates, as IMV has real time data
 when there are writable operations on related tables.

 Basic Theory
 ------------

 A materialized view(MV) could be use to compute a Query if:
 1. The view contains all rows needed by the query expression(Construct Rows).
       If MV has more rows than query wants, additional filter may be added if possible.
 2. All output expressions can be computed from the output of the view(Construct Columns).
       The output expressions could be fully or partially matched from MV's TargetList.
 3. Cost-based Equivalent Transformation.
       There may be multiple valid MV candidates, or select from MV is not better than
       select from origin table(ex: has an index and etc), let planner decide the best one.

 Vocabulary:
       origin_query: the SQL we want to query.
       view_query:  the materialized view's corresponding query, SELECT part of Create Materialized View statement.

 Construct Rows
 --------------

 If MV has all rows of query needed, it means that MV query's restrictions are looser than query's restrictions.
 For AQUMV_MVP0, we only do logistic transformation.
 All rewrites are on the Query tree, neither Equivalent Classes nor Restrictions are used.
 For a single relation:
 process view_query and origin_query's WHERE part to set:
 mv_query_quals and origin_query_quals.

 example0:
       CREATE MATERIALIZED VIEW mv0 AS SELECT * FROM t WHERE a = 1 AND b = 2;
       Query: SELECT * FROM t WHERE a = 1;

 mv_query_quals = {a = 1, b = 2}.
 origin_query_quals = {a = 1}.

 1) A MV can't be used if the difference set:  {mv_query_quals - origin_query_quals} is not empty.

 It 'typically' means that the MV has less rows than origin_query wants.
 For example0, the difference set is:
       mv_query_quals - origin_query_quals = {b = 2}.

 mv0's all rows meet requirement {a = 1 and b = 2}, but we only want rows {a = 1}.
 mv0 couldn't provide all the rows we want, we can't use it to answer the query.

 'typically' means that if there are range quals, this conclusion is not sure.
 But we couldn't handle that for now.

 2) The intersection set: {mv_query_quals ∩ origin_query_quals} should be dropped.

 If the intersection set is not empty, we choose to drop it.
 example1:
       CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM t WHERE a = 1;
       Query: SELECT * FROM t WHERE a = 1;

 {mv_query_quals ∩ origin_query_quals} = {a = 1};

 It seems everything is good and we have nothing more to do.
 Because the two quals are the same and we could rewrite the SQL to:
       Rewritten SQL: SELECT * FROM mv1 WHERE a = 1;

 As all mv1's rows meet the requirement: a = 1, it's pointless that we do filter a = 1 again at execution.

 What's worse is the unnecessary filter {a = 1} will mislead the clause-selectivity of the relation.
 For the example1, a {a = 1} will estimate less rows from relation MV, but as we are clear that all rows
 meet the requirement, and the selectivity from mv1 should be 100%.

 Another reason we dropped the intersection set is: we couldn't just keep the intersection set.
 example2:
       CREATE MATERIALIZED VIEW mv2 AS SELECT b FROM t WHERE a = 1;
       Query: SELECT b FROM t WHERE a = 1;

 {mv_query_quals ∩ origin_query_quals} = {a = 1};

 mv2 and origin_query only select column b from t with the same quals {a = 1}.
 If the intersection set is kept, we will get a wrong SQL:
       Wrong: SELECT b FROM mv2 WHERE a = 1;

 mv2 doesn't have column a, the SQL will get a syntax error.

 It's not always impossible to keep the intersection set, for
 example3:
       CREATE MATERIALIZED VIEW mv3 AS SELECT a, b FROM t WHERE a = 1;
       Query: SELECT a, b FROM t WHERE a = 1;

 We could rewrite it to:
       SELECT a, b FROM mv3 WHERE a = 1;

 There is a way to see if it's possible to rewrite that, but it isn't worth trying according to
 what we mentioned above.

 The disadvantages of dropping the intersection set of mv_query_quals and origin_query_quals is:
 We may lose some Equivalent Classes if there are equal operations like: a = 1.
 But not for other operations, ex: c > 1, because Postgres only have Equivalent Class for equal operations.
 And we haven't taken Equivalent Class into account for AQUMV_MVP0, it's reasonable to drop that.

 3) process difference set: {origin_query_quals - mv_query_quals}
 If 1) and 2) passed, the difference set on the other hand, we call it post_quals:

       post_quals = {origin_query_quals - mv_query_quals}

 The MV has more rows than query if post_quals is not empty.
 We have to add it to MV to filter the rows query want.
 example4:
       CREATE MATERIALIZED VIEW mv4 AS SELECT a, b FROM t WHERE a = 1;
       Query: SELECT a, b FROM t WHERE a = 1 and b = 2;

 We could rewrite it to:
       SELECT a, b FROM mv4 WHERE b = 2;

 All rows in MV are {a = 1} ones as the MV defination, we only need to add the extra filter {b = 2}.

 But it's not always true, if we don't have the columns that the post_quals need.
 example5:
       CREATE MATERIALIZED VIEW mv5 AS SELECT a FROM t WHERE a = 1;
       Query: SELECT a FROM t WHERE a = 1 and b = 2;

 mv5 has all rows {a = 1} and only have column 'a', but the query want additional filter {b = 2}.
 We couldn't rewrite it by just adding the {b = 2} to MV as no equivalent b in MV relation.
       Wrong: SELECT a FROM mv5 WHERE b = 2;

 The algorithm behind that is: all quals's expression could be computed from a view_query's target list.
 That's what Construct Columns does.

 Construct Columns
 -----------------

 A MV could be a candidate if the query's target list and the post_quals could be computed form
 view_query's target list and rewrite to expressions bases on MV relation's columns itself.

 example6:
       CREATE MATERIALIZED VIEW mv6 AS SELECT abs(c) as mc1, b as mc2 FROM t WHERE a = 1;
       Query: SELECT abs(c) as res1 FROM t WHERE a = 1 and b = 2;

 The post_quals is: {b = 2} while column b exists in mv6, corresponding to mc2 with alias.
 We can rewrite post_quals to {mc2 = 2}.

 The query wants a target abs(c) with an alias res1, while expression abs(c) exists in mv6,
 corresponding to column mc1 with alias.
 Then we can rewrite SQL to:

       Rewrite: SELECT mc1 as res1 FROM mv6 WHERE mc2 = 2;

 The expression abs(c) is eliminated and simplified to a column reference mc1, and the alias is kept.

 Things become complex when there are multiple expression candidates, and some ones could be
 part of others.

 example7:
       CREATE MATERIALIZED VIEW mv7 AS
             SELECT c1 AS mc1, c2 AS mc2, abs(c2) AS mc3, abs(abs(c2) - c1 - 1) AS mc4
             FROM t1 WHERE c1 > 30 AND c1 < 40;
       Query: SELECT sqrt(abs(abs(c2) - c1 - 1) + abs(c2)) AS res1 FROM t1 WHERE c1 > 30 AND c1 < 40 AND c2 > 23;

 There are many choices to construct column res1:
       sqrt(abs(abs(mc2) - mc1 - 1) + abs(mc2))  // constructed by mc1, mc2
       sqrt(abs(abs(mc2) - mc1 - 1) + mc3))      // constructed by mc1, mc2, mc3
       sqrt(abs(mc3 - mc1 - 1) + abs(mc2)))      // constructed by mc1, mc2, mc3
       sqrt(abs(mc3 - mc1 - 1) + mc3))           // constructed by mc1, mc3
       sqrt(mc4 + mc3))                          // constructed by mc3, mc4

 Obviously, the best one is sqrt(mc4 + mc3) which avoids much of expression execution for each row.

 We try to use the most-submatched expression to do a first rewrite and then next.
 It's not only optimization, but also unnecessary for some cases that a less-matched expression
 rewrite may close the door for more-matched ones, especially for post_quals rewrite.
 example8:
       CREATE MATERIALIZED VIEW mv8 AS
             SELECT c2 as mc3, c2 AS mc2, abs(c2) AS m_abs_c2
             FROM t1 WHERE c1 > 1;
       Query: SELECT c3 AS res1 FROM t1 WHERE c1 > 1 and (abs(c2) - c1 - 1) > 10;

       post_quals: {(abs(c2) - c1 - 1) > 10}

 If we choose less-matched mc2 to rewrite, an intermediate expression would be:
       {(abs(mc2) - c1 - 1) > 10}

 But mv8 don't have a corresponding column c1 to continue the work, that's bad and we will lose
 the chance to use it.

 The approach is: use a Greedy Algorithm to rewrite the target expression.

 First,  Split the MV query's expressions to pure-Var and nonpure-Var ones.
 Because pure Var expression is always the leaf of an expression tree if it needs to be rewritten.

 Sort the nonpure-Var expressions by complexity.
 We don't need an absolute order for every expression.
 All we need to guarantee is that:
 if expression A is sub part of expression B, put A after B.

 The approach applies to post_quals rewrite too.

 Expressions that have no Vars are kept to upper(ex: Const Expressions) or rewritten if there were
 corresponding expressions.


 Cost-based
 ----------

 There could be multiple candidates after equivalent transformation.
 After all things is done for a materialized view candidate, build a plan to compare with current one.
 Let the planner decide the best one.

 AQUMV_MVP
 ---------
 Support SELECT FROM a single relation both for view_query and the origin_query.
 Below are not supported now:
       Subquery
       Join
       Sublink
       Group By/Grouping Sets/Rollup/Cube (on view_query)
       Window Functions
       CTE
       Distinct (on view_query)
       Distinct On (on view_query)
       UNION/INTERSECT/EXCEPT
       FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR KEY SHARE
       Scatter By
       Refresh Materialized View
       Create AS
       Partition Tables
       Inherit Tables

 Reference:
    [1] Optimizing Queries Using Materialized Views: A Practical, Scalable Solution.
          https://courses.cs.washington.edu/courses/cse591d/01sp/opt_views.pdf
    [2] Automated Selection of Materialized Views and Indexes for SQL Databases.
          https://www.vldb.org/conf/2000/P496.pdf
	src/backend/optimizer/README.cbdb.aqumv

	Answer Query Using Materialized Views
	=====================================

	AQUMV for short, is used to compute part or all of a Query from materialized views during planning.
	It could provide massive improvements in query processing time, especially for aggregation queries over large tables[1].

	AQUMV usually uses Incremental Materialized Views(IMV) as candidates, as IMV has real time data
	when there are writable operations on related tables.

	Basic Theory
	------------

	A materialized view(MV) could be use to compute a Query if:
	1. The view contains all rows needed by the query expression(Construct Rows).
	If MV has more rows than query wants, additional filter may be added if possible.
	2. All output expressions can be computed from the output of the view(Construct Columns).
	The output expressions could be fully or partially matched from MV's TargetList.
	3. Cost-based Equivalent Transformation.
	There may be multiple valid MV candidates, or select from MV is not better than
	select from origin table(ex: has an index and etc), let planner decide the best one.

	Vocabulary:
	origin_query: the SQL we want to query.
	view_query: the materialized view's corresponding query, SELECT part of Create Materialized View statement.

	Construct Rows
	--------------

	If MV has all rows of query needed, it means that MV query's restrictions are looser than query's restrictions.
	For AQUMV_MVP0, we only do logistic transformation.
	All rewrites are on the Query tree, neither Equivalent Classes nor Restrictions are used.
	For a single relation:
	process view_query and origin_query's WHERE part to set:
	mv_query_quals and origin_query_quals.

	example0:
	CREATE MATERIALIZED VIEW mv0 AS SELECT * FROM t WHERE a = 1 AND b = 2;
	Query: SELECT * FROM t WHERE a = 1;

	mv_query_quals = {a = 1, b = 2}.
	origin_query_quals = {a = 1}.

	1) A MV can't be used if the difference set: {mv_query_quals - origin_query_quals} is not empty.

	It 'typically' means that the MV has less rows than origin_query wants.
	For example0, the difference set is:
	mv_query_quals - origin_query_quals = {b = 2}.

	mv0's all rows meet requirement {a = 1 and b = 2}, but we only want rows {a = 1}.
	mv0 couldn't provide all the rows we want, we can't use it to answer the query.

	'typically' means that if there are range quals, this conclusion is not sure.
	But we couldn't handle that for now.

	2) The intersection set: {mv_query_quals ∩ origin_query_quals} should be dropped.

	If the intersection set is not empty, we choose to drop it.
	example1:
	CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM t WHERE a = 1;
	Query: SELECT * FROM t WHERE a = 1;

	{mv_query_quals ∩ origin_query_quals} = {a = 1};

	It seems everything is good and we have nothing more to do.
	Because the two quals are the same and we could rewrite the SQL to:
	Rewritten SQL: SELECT * FROM mv1 WHERE a = 1;

	As all mv1's rows meet the requirement: a = 1, it's pointless that we do filter a = 1 again at execution.

	What's worse is the unnecessary filter {a = 1} will mislead the clause-selectivity of the relation.
	For the example1, a {a = 1} will estimate less rows from relation MV, but as we are clear that all rows
	meet the requirement, and the selectivity from mv1 should be 100%.

	Another reason we dropped the intersection set is: we couldn't just keep the intersection set.
	example2:
	CREATE MATERIALIZED VIEW mv2 AS SELECT b FROM t WHERE a = 1;
	Query: SELECT b FROM t WHERE a = 1;

	{mv_query_quals ∩ origin_query_quals} = {a = 1};

	mv2 and origin_query only select column b from t with the same quals {a = 1}.
	If the intersection set is kept, we will get a wrong SQL:
	Wrong: SELECT b FROM mv2 WHERE a = 1;

	mv2 doesn't have column a, the SQL will get a syntax error.

	It's not always impossible to keep the intersection set, for
	example3:
	CREATE MATERIALIZED VIEW mv3 AS SELECT a, b FROM t WHERE a = 1;
	Query: SELECT a, b FROM t WHERE a = 1;

	We could rewrite it to:
	SELECT a, b FROM mv3 WHERE a = 1;

	There is a way to see if it's possible to rewrite that, but it isn't worth trying according to
	what we mentioned above.

	The disadvantages of dropping the intersection set of mv_query_quals and origin_query_quals is:
	We may lose some Equivalent Classes if there are equal operations like: a = 1.
	But not for other operations, ex: c > 1, because Postgres only have Equivalent Class for equal operations.
	And we haven't taken Equivalent Class into account for AQUMV_MVP0, it's reasonable to drop that.

	3) process difference set: {origin_query_quals - mv_query_quals}
	If 1) and 2) passed, the difference set on the other hand, we call it post_quals:

	post_quals = {origin_query_quals - mv_query_quals}

	The MV has more rows than query if post_quals is not empty.
	We have to add it to MV to filter the rows query want.
	example4:
	CREATE MATERIALIZED VIEW mv4 AS SELECT a, b FROM t WHERE a = 1;
	Query: SELECT a, b FROM t WHERE a = 1 and b = 2;

	We could rewrite it to:
	SELECT a, b FROM mv4 WHERE b = 2;

	All rows in MV are {a = 1} ones as the MV defination, we only need to add the extra filter {b = 2}.

	But it's not always true, if we don't have the columns that the post_quals need.
	example5:
	CREATE MATERIALIZED VIEW mv5 AS SELECT a FROM t WHERE a = 1;
	Query: SELECT a FROM t WHERE a = 1 and b = 2;

	mv5 has all rows {a = 1} and only have column 'a', but the query want additional filter {b = 2}.
	We couldn't rewrite it by just adding the {b = 2} to MV as no equivalent b in MV relation.
	Wrong: SELECT a FROM mv5 WHERE b = 2;

	The algorithm behind that is: all quals's expression could be computed from a view_query's target list.
	That's what Construct Columns does.

	Construct Columns
	-----------------

	A MV could be a candidate if the query's target list and the post_quals could be computed form
	view_query's target list and rewrite to expressions bases on MV relation's columns itself.

	example6:
	CREATE MATERIALIZED VIEW mv6 AS SELECT abs(c) as mc1, b as mc2 FROM t WHERE a = 1;
	Query: SELECT abs(c) as res1 FROM t WHERE a = 1 and b = 2;

	The post_quals is: {b = 2} while column b exists in mv6, corresponding to mc2 with alias.
	We can rewrite post_quals to {mc2 = 2}.

	The query wants a target abs(c) with an alias res1, while expression abs(c) exists in mv6,
	corresponding to column mc1 with alias.
	Then we can rewrite SQL to:

	Rewrite: SELECT mc1 as res1 FROM mv6 WHERE mc2 = 2;

	The expression abs(c) is eliminated and simplified to a column reference mc1, and the alias is kept.

	Things become complex when there are multiple expression candidates, and some ones could be
	part of others.

	example7:
	CREATE MATERIALIZED VIEW mv7 AS
	SELECT c1 AS mc1, c2 AS mc2, abs(c2) AS mc3, abs(abs(c2) - c1 - 1) AS mc4
	FROM t1 WHERE c1 > 30 AND c1 < 40;
	Query: SELECT sqrt(abs(abs(c2) - c1 - 1) + abs(c2)) AS res1 FROM t1 WHERE c1 > 30 AND c1 < 40 AND c2 > 23;

	There are many choices to construct column res1:
	sqrt(abs(abs(mc2) - mc1 - 1) + abs(mc2)) // constructed by mc1, mc2
	sqrt(abs(abs(mc2) - mc1 - 1) + mc3)) // constructed by mc1, mc2, mc3
	sqrt(abs(mc3 - mc1 - 1) + abs(mc2))) // constructed by mc1, mc2, mc3
	sqrt(abs(mc3 - mc1 - 1) + mc3)) // constructed by mc1, mc3
	sqrt(mc4 + mc3)) // constructed by mc3, mc4

	Obviously, the best one is sqrt(mc4 + mc3) which avoids much of expression execution for each row.

	We try to use the most-submatched expression to do a first rewrite and then next.
	It's not only optimization, but also unnecessary for some cases that a less-matched expression
	rewrite may close the door for more-matched ones, especially for post_quals rewrite.
	example8:
	CREATE MATERIALIZED VIEW mv8 AS
	SELECT c2 as mc3, c2 AS mc2, abs(c2) AS m_abs_c2
	FROM t1 WHERE c1 > 1;
	Query: SELECT c3 AS res1 FROM t1 WHERE c1 > 1 and (abs(c2) - c1 - 1) > 10;

	post_quals: {(abs(c2) - c1 - 1) > 10}

	If we choose less-matched mc2 to rewrite, an intermediate expression would be:
	{(abs(mc2) - c1 - 1) > 10}

	But mv8 don't have a corresponding column c1 to continue the work, that's bad and we will lose
	the chance to use it.

	The approach is: use a Greedy Algorithm to rewrite the target expression.

	First, Split the MV query's expressions to pure-Var and nonpure-Var ones.
	Because pure Var expression is always the leaf of an expression tree if it needs to be rewritten.

	Sort the nonpure-Var expressions by complexity.
	We don't need an absolute order for every expression.
	All we need to guarantee is that:
	if expression A is sub part of expression B, put A after B.

	The approach applies to post_quals rewrite too.

	Expressions that have no Vars are kept to upper(ex: Const Expressions) or rewritten if there were
	corresponding expressions.


	Cost-based
	----------

	There could be multiple candidates after equivalent transformation.
	After all things is done for a materialized view candidate, build a plan to compare with current one.
	Let the planner decide the best one.

	AQUMV_MVP
	---------
	Support SELECT FROM a single relation both for view_query and the origin_query.
	Below are not supported now:
	Subquery
	Join
	Sublink
	Group By/Grouping Sets/Rollup/Cube (on view_query)
	Window Functions
	CTE
	Distinct (on view_query)
	Distinct On (on view_query)
	UNION/INTERSECT/EXCEPT
	FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR KEY SHARE
	Scatter By
	Refresh Materialized View
	Create AS
	Partition Tables
	Inherit Tables

	Reference:
	[1] Optimizing Queries Using Materialized Views: A Practical, Scalable Solution.
	https://courses.cs.washington.edu/courses/cse591d/01sp/opt_views.pdf
	[2] Automated Selection of Materialized Views and Indexes for SQL Databases.
	https://www.vldb.org/conf/2000/P496.pdf