| src/backend/optimizer/README |
| |
| Optimizer |
| ========= |
| |
| These directories take the Query structure returned by the parser, and |
| generate a plan used by the executor. The /plan directory generates the |
| actual output plan, the /path code generates all possible ways to join the |
| tables, and /prep handles various preprocessing steps for special cases. |
| /util is utility stuff. /geqo is the separate "genetic optimization" planner |
| --- it does a semi-random search through the join tree space, rather than |
| exhaustively considering all possible join trees. (But each join considered |
| by /geqo is given to /path to create paths for, so we consider all possible |
| implementation paths for each specific join pair even in GEQO mode.) |
| |
| |
| Paths and Join Pairs |
| -------------------- |
| |
| During the planning/optimizing process, we build "Path" trees representing |
| the different ways of doing a query. We select the cheapest Path that |
| generates the desired relation and turn it into a Plan to pass to the |
| executor. (There is pretty nearly a one-to-one correspondence between the |
| Path and Plan trees, but Path nodes omit info that won't be needed during |
| planning, and include info needed for planning that won't be needed by the |
| executor.) |
| |
| The optimizer builds a RelOptInfo structure for each base relation used in |
| the query. Base rels are either primitive tables, or subquery subselects |
| that are planned via a separate recursive invocation of the planner. A |
| RelOptInfo is also built for each join relation that is considered during |
| planning. A join rel is simply a combination of base rels. There is only |
| one join RelOptInfo for any given set of baserels --- for example, the join |
| {A B C} is represented by the same RelOptInfo no matter whether we build it |
| by joining A and B first and then adding C, or joining B and C first and |
| then adding A, etc. These different means of building the joinrel are |
| represented as Paths. For each RelOptInfo we build a list of Paths that |
| represent plausible ways to implement the scan or join of that relation. |
| Once we've considered all the plausible Paths for a rel, we select the one |
| that is cheapest according to the planner's cost estimates. The final plan |
| is derived from the cheapest Path for the RelOptInfo that includes all the |
| base rels of the query. |
| |
| Possible Paths for a primitive table relation include plain old sequential |
| scan, plus index scans for any indexes that exist on the table, plus bitmap |
| index scans using one or more indexes. Specialized RTE types, such as |
| function RTEs, may have only one possible Path. |
| |
| Joins always occur using two RelOptInfos. One is outer, the other inner. |
| Outers drive lookups of values in the inner. In a nested loop, lookups of |
| values in the inner occur by scanning the inner path once per outer tuple |
| to find each matching inner row. In a mergejoin, inner and outer rows are |
| ordered, and are accessed in order, so only one scan is required to perform |
| the entire join: both inner and outer paths are scanned in-sync. (There's |
| not a lot of difference between inner and outer in a mergejoin...) In a |
| hashjoin, the inner is scanned first and all its rows are entered in a |
| hashtable, then the outer is scanned and for each row we lookup the join |
| key in the hashtable. |
| |
| A Path for a join relation is actually a tree structure, with the topmost |
| Path node representing the last-applied join method. It has left and right |
| subpaths that represent the scan or join methods used for the two input |
| relations. |
| |
| |
| Join Tree Construction |
| ---------------------- |
| |
| The optimizer generates optimal query plans by doing a more-or-less |
| exhaustive search through the ways of executing the query. The best Path |
| tree is found by a recursive process: |
| |
| 1) Take each base relation in the query, and make a RelOptInfo structure |
| for it. Find each potentially useful way of accessing the relation, |
| including sequential and index scans, and make Paths representing those |
| ways. All the Paths made for a given relation are placed in its |
| RelOptInfo.pathlist. (Actually, we discard Paths that are obviously |
| inferior alternatives before they ever get into the pathlist --- what |
| ends up in the pathlist is the cheapest way of generating each potentially |
| useful sort ordering and parameterization of the relation.) Also create a |
| RelOptInfo.joininfo list including all the join clauses that involve this |
| relation. For example, the WHERE clause "tab1.col1 = tab2.col1" generates |
| entries in both tab1 and tab2's joininfo lists. |
| |
| If we have only a single base relation in the query, we are done. |
| Otherwise we have to figure out how to join the base relations into a |
| single join relation. |
| |
| 2) Normally, any explicit JOIN clauses are "flattened" so that we just |
| have a list of relations to join. However, FULL OUTER JOIN clauses are |
| never flattened, and other kinds of JOIN might not be either, if the |
| flattening process is stopped by join_collapse_limit or from_collapse_limit |
| restrictions. Therefore, we end up with a planning problem that contains |
| lists of relations to be joined in any order, where any individual item |
| might be a sub-list that has to be joined together before we can consider |
| joining it to its siblings. We process these sub-problems recursively, |
| bottom up. Note that the join list structure constrains the possible join |
| orders, but it doesn't constrain the join implementation method at each |
| join (nestloop, merge, hash), nor does it say which rel is considered outer |
| or inner at each join. We consider all these possibilities in building |
| Paths. We generate a Path for each feasible join method, and select the |
| cheapest Path. |
| |
| For each planning problem, therefore, we will have a list of relations |
| that are either base rels or joinrels constructed per sub-join-lists. |
| We can join these rels together in any order the planner sees fit. |
| The standard (non-GEQO) planner does this as follows: |
| |
| Consider joining each RelOptInfo to each other RelOptInfo for which there |
| is a usable joinclause, and generate a Path for each possible join method |
| for each such pair. (If we have a RelOptInfo with no join clauses, we have |
| no choice but to generate a clauseless Cartesian-product join; so we |
| consider joining that rel to each other available rel. But in the presence |
| of join clauses we will only consider joins that use available join |
| clauses. Note that join-order restrictions induced by outer joins and |
| IN/EXISTS clauses are also checked, to ensure that we find a workable join |
| order in cases where those restrictions force a clauseless join to be done.) |
| |
| If we only had two relations in the list, we are done: we just pick |
| the cheapest path for the join RelOptInfo. If we had more than two, we now |
| need to consider ways of joining join RelOptInfos to each other to make |
| join RelOptInfos that represent more than two list items. |
| |
| The join tree is constructed using a "dynamic programming" algorithm: |
| in the first pass (already described) we consider ways to create join rels |
| representing exactly two list items. The second pass considers ways |
| to make join rels that represent exactly three list items; the next pass, |
| four items, etc. The last pass considers how to make the final join |
| relation that includes all list items --- obviously there can be only one |
| join rel at this top level, whereas there can be more than one join rel |
| at lower levels. At each level we use joins that follow available join |
| clauses, if possible, just as described for the first level. |
| |
| For example: |
| |
| SELECT * |
| FROM tab1, tab2, tab3, tab4 |
| WHERE tab1.col = tab2.col AND |
| tab2.col = tab3.col AND |
| tab3.col = tab4.col |
| |
| Tables 1, 2, 3, and 4 are joined as: |
| {1 2},{2 3},{3 4} |
| {1 2 3},{2 3 4} |
| {1 2 3 4} |
| (other possibilities will be excluded for lack of join clauses) |
| |
| SELECT * |
| FROM tab1, tab2, tab3, tab4 |
| WHERE tab1.col = tab2.col AND |
| tab1.col = tab3.col AND |
| tab1.col = tab4.col |
| |
| Tables 1, 2, 3, and 4 are joined as: |
| {1 2},{1 3},{1 4} |
| {1 2 3},{1 3 4},{1 2 4} |
| {1 2 3 4} |
| |
| We consider left-handed plans (the outer rel of an upper join is a joinrel, |
| but the inner is always a single list item); right-handed plans (outer rel |
| is always a single item); and bushy plans (both inner and outer can be |
| joins themselves). For example, when building {1 2 3 4} we consider |
| joining {1 2 3} to {4} (left-handed), {4} to {1 2 3} (right-handed), and |
| {1 2} to {3 4} (bushy), among other choices. Although the jointree |
| scanning code produces these potential join combinations one at a time, |
| all the ways to produce the same set of joined base rels will share the |
| same RelOptInfo, so the paths produced from different join combinations |
| that produce equivalent joinrels will compete in add_path(). |
| |
| The dynamic-programming approach has an important property that's not |
| immediately obvious: we will finish constructing all paths for a given |
| relation before we construct any paths for relations containing that rel. |
| This means that we can reliably identify the "cheapest path" for each rel |
| before higher-level relations need to know that. Also, we can safely |
| discard a path when we find that another path for the same rel is better, |
| without worrying that maybe there is already a reference to that path in |
| some higher-level join path. Without this, memory management for paths |
| would be much more complicated. |
| |
| Once we have built the final join rel, we use either the cheapest path |
| for it or the cheapest path with the desired ordering (if that's cheaper |
| than applying a sort to the cheapest other path). |
| |
| If the query contains one-sided outer joins (LEFT or RIGHT joins), or |
| IN or EXISTS WHERE clauses that were converted to semijoins or antijoins, |
| then some of the possible join orders may be illegal. These are excluded |
| by having join_is_legal consult a side list of such "special" joins to see |
| whether a proposed join is illegal. (The same consultation allows it to |
| see which join style should be applied for a valid join, ie, JOIN_INNER, |
| JOIN_LEFT, etc.) |
| |
| |
| Valid OUTER JOIN Optimizations |
| ------------------------------ |
| |
| The planner's treatment of outer join reordering is based on the following |
| identities: |
| |
| 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) |
| = (A innerjoin C on (Pac)) leftjoin B on (Pab) |
| |
| where Pac is a predicate referencing A and C, etc (in this case, clearly |
| Pac cannot reference B, or the transformation is nonsensical). |
| |
| 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) |
| = (A leftjoin C on (Pac)) leftjoin B on (Pab) |
| |
| 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) |
| = A leftjoin (B leftjoin C on (Pbc)) on (Pab) |
| |
| Identity 3 only holds if predicate Pbc must fail for all-null B rows |
| (that is, Pbc is strict for at least one column of B). If Pbc is not |
| strict, the first form might produce some rows with nonnull C columns |
| where the second form would make those entries null. |
| |
| RIGHT JOIN is equivalent to LEFT JOIN after switching the two input |
| tables, so the same identities work for right joins. |
| |
| An example of a case that does *not* work is moving an innerjoin into or |
| out of the nullable side of an outer join: |
| |
| A leftjoin (B join C on (Pbc)) on (Pab) |
| != (A leftjoin B on (Pab)) join C on (Pbc) |
| |
| SEMI joins work a little bit differently. A semijoin can be reassociated |
| into or out of the lefthand side of another semijoin, left join, or |
| antijoin, but not into or out of the righthand side. Likewise, an inner |
| join, left join, or antijoin can be reassociated into or out of the |
| lefthand side of a semijoin, but not into or out of the righthand side. |
| |
| ANTI joins work approximately like LEFT joins, except that identity 3 |
| fails if the join to C is an antijoin (even if Pbc is strict, and in |
| both the cases where the other join is a leftjoin and where it is an |
| antijoin). So we can't reorder antijoins into or out of the RHS of a |
| leftjoin or antijoin, even if the relevant clause is strict. |
| |
| The current code does not attempt to re-order FULL JOINs at all. |
| FULL JOIN ordering is enforced by not collapsing FULL JOIN nodes when |
| translating the jointree to "joinlist" representation. Other types of |
| JOIN nodes are normally collapsed so that they participate fully in the |
| join order search. To avoid generating illegal join orders, the planner |
| creates a SpecialJoinInfo node for each non-inner join, and join_is_legal |
| checks this list to decide if a proposed join is legal. |
| |
| What we store in SpecialJoinInfo nodes are the minimum sets of Relids |
| required on each side of the join to form the outer join. Note that |
| these are minimums; there's no explicit maximum, since joining other |
| rels to the OJ's syntactic rels may be legal. Per identities 1 and 2, |
| non-FULL joins can be freely associated into the lefthand side of an |
| OJ, but in some cases they can't be associated into the righthand side. |
| So the restriction enforced by join_is_legal is that a proposed join |
| can't join a rel within or partly within an RHS boundary to one outside |
| the boundary, unless the proposed join is a LEFT join that can associate |
| into the SpecialJoinInfo's RHS using identity 3. |
| |
| The use of minimum Relid sets has some pitfalls; consider a query like |
| A leftjoin (B leftjoin (C innerjoin D) on (Pbcd)) on Pa |
| where Pa doesn't mention B/C/D at all. In this case a naive computation |
| would give the upper leftjoin's min LHS as {A} and min RHS as {C,D} (since |
| we know that the innerjoin can't associate out of the leftjoin's RHS, and |
| enforce that by including its relids in the leftjoin's min RHS). And the |
| lower leftjoin has min LHS of {B} and min RHS of {C,D}. Given such |
| information, join_is_legal would think it's okay to associate the upper |
| join into the lower join's RHS, transforming the query to |
| B leftjoin (A leftjoin (C innerjoin D) on Pa) on (Pbcd) |
| which yields totally wrong answers. We prevent that by forcing the min RHS |
| for the upper join to include B. This is perhaps overly restrictive, but |
| such cases don't arise often so it's not clear that it's worth developing a |
| more complicated system. |
| |
| |
| Pulling Up Subqueries |
| --------------------- |
| |
| As we described above, a subquery appearing in the range table is planned |
| independently and treated as a "black box" during planning of the outer |
| query. This is necessary when the subquery uses features such as |
| aggregates, GROUP, or DISTINCT. But if the subquery is just a simple |
| scan or join, treating the subquery as a black box may produce a poor plan |
| compared to considering it as part of the entire plan search space. |
| Therefore, at the start of the planning process the planner looks for |
| simple subqueries and pulls them up into the main query's jointree. |
| |
| Pulling up a subquery may result in FROM-list joins appearing below the top |
| of the join tree. Each FROM-list is planned using the dynamic-programming |
| search method described above. |
| |
| If pulling up a subquery produces a FROM-list as a direct child of another |
| FROM-list, then we can merge the two FROM-lists together. Once that's |
| done, the subquery is an absolutely integral part of the outer query and |
| will not constrain the join tree search space at all. However, that could |
| result in unpleasant growth of planning time, since the dynamic-programming |
| search has runtime exponential in the number of FROM-items considered. |
| Therefore, we don't merge FROM-lists if the result would have too many |
| FROM-items in one list. |
| |
| |
| Vars and PlaceHolderVars |
| ------------------------ |
| |
| A Var node is simply the parse-tree representation of a table column |
| reference. However, in the presence of outer joins, that concept is |
| more subtle than it might seem. We need to distinguish the values of |
| a Var "above" and "below" any outer join that could force the Var to |
| null. As an example, consider |
| |
| SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y) WHERE foo(t2.z) |
| |
| (Assume foo() is not strict, so that we can't reduce the left join to |
| a plain join.) A naive implementation might try to push the foo(t2.z) |
| call down to the scan of t2, but that is not correct because |
| (a) what foo() should actually see for a null-extended join row is NULL, |
| and (b) if foo() returns false, we should suppress the t1 row from the |
| join altogether, not emit it with a null-extended t2 row. On the other |
| hand, it *would* be correct (and desirable) to push that call down to |
| the scan level if the query were |
| |
| SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y AND foo(t2.z)) |
| |
| This motivates considering "t2.z" within the left join's ON clause |
| to be a different value from "t2.z" outside the JOIN clause. The |
| former can be identified with t2.z as seen at the relation scan level, |
| but the latter can't. |
| |
| Another example occurs in connection with EquivalenceClasses (discussed |
| below). Given |
| |
| SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y) WHERE t1.x = 42 |
| |
| we would like to use the EquivalenceClass mechanisms to derive "t2.y = 42" |
| to use as a restriction clause for the scan of t2. (That works, because t2 |
| rows having y different from 42 cannot affect the query result.) However, |
| it'd be wrong to conclude that t2.y will be equal to t1.x in every joined |
| row. Part of the solution to this problem is to deem that "t2.y" in the |
| ON clause refers to the relation-scan-level value of t2.y, but not to the |
| value that y will have in joined rows, where it might be NULL rather than |
| equal to t1.x. |
| |
| Therefore, Var nodes are decorated with "varnullingrels", which are sets |
| of the rangetable indexes of outer joins that potentially null the Var |
| at the point where it appears in the query. (Using a set, not an ordered |
| list, is fine since it doesn't matter which join forced the value to null; |
| and that avoids having to change the representation when we consider |
| different outer-join orders.) In the examples above, all occurrences of |
| t1.x would have empty varnullingrels, since the left join doesn't null t1. |
| The t2 references within the JOIN ON clauses would also have empty |
| varnullingrels. But outside the JOIN clauses, any Vars referencing t2 |
| would have varnullingrels containing the index of the JOIN's rangetable |
| entry (RTE), so that they'd be understood as potentially different from |
| the t2 values seen at scan level. Labeling t2.z in the WHERE clause with |
| the JOIN's RT index lets us recognize that that occurrence of foo(t2.z) |
| cannot be pushed down to the t2 scan level: we cannot evaluate that value |
| at the scan level, but only after the join has been done. |
| |
| For LEFT and RIGHT outer joins, only Vars coming from the nullable side |
| of the join are marked with that join's RT index. For FULL joins, Vars |
| from both inputs are marked. (Such marking doesn't let us tell which |
| side of the full join a Var came from; but that information can be found |
| elsewhere at need.) |
| |
| Notionally, a Var having nonempty varnullingrels can be thought of as |
| CASE WHEN any-of-these-outer-joins-produced-a-null-extended-row |
| THEN NULL |
| ELSE the-scan-level-value-of-the-column |
| END |
| It's only notional, because no such calculation is ever done explicitly. |
| In a finished plan, Vars occurring in scan-level plan nodes represent |
| the actual table column values, but upper-level Vars are always |
| references to outputs of lower-level plan nodes. When a join node emits |
| a null-extended row, it just returns nulls for the relevant output |
| columns rather than copying up values from its input. Because we don't |
| ever have to do this calculation explicitly, it's not necessary to |
| distinguish which side of an outer join got null-extended, which'd |
| otherwise be essential information for FULL JOIN cases. |
| |
| Outer join identity 3 (discussed above) complicates this picture |
| a bit. In the form |
| A leftjoin (B leftjoin C on (Pbc)) on (Pab) |
| all of the Vars in clauses Pbc and Pab will have empty varnullingrels, |
| but if we start with |
| (A leftjoin B on (Pab)) leftjoin C on (Pbc) |
| then the parser will have marked Pbc's B Vars with the A/B join's |
| RT index, making this form artificially different from the first. |
| For discussion's sake, let's denote this marking with a star: |
| (A leftjoin B on (Pab)) leftjoin C on (Pb*c) |
| To cope with this, once we have detected that commuting these joins |
| is legal, we generate both the Pbc and Pb*c forms of that ON clause, |
| by either removing or adding the first join's RT index in the B Vars |
| that the parser created. While generating paths for a plan step that |
| joins B and C, we include as a relevant join qual only the form that |
| is appropriate depending on whether A has already been joined to B. |
| |
| It's also worth noting that identity 3 makes "the left join's RT index" |
| itself a bit of a fuzzy concept, since the syntactic scope of each join |
| RTE will depend on which form was produced by the parser. We resolve |
| this by considering that a left join's identity is determined by its |
| minimum set of right-hand-side input relations. In both forms allowed |
| by identity 3, we can identify the first join as having minimum RHS B |
| and the second join as having minimum RHS C. |
| |
| Another thing to notice is that C Vars appearing outside the nested |
| JOIN clauses will be marked as nulled by both left joins if the |
| original parser input was in the first form of identity 3, but if the |
| parser input was in the second form, such Vars will only be marked as |
| nulled by the second join. This is not really a semantic problem: |
| such Vars will be marked the same way throughout the upper part of the |
| query, so they will all look equal() which is correct; and they will not |
| look equal() to any C Var appearing in the JOIN ON clause or below these |
| joins. However, when building Vars representing the outputs of join |
| relations, we need to ensure that their varnullingrels are set to |
| values consistent with the syntactic join order, so that they will |
| appear equal() to pre-existing Vars in the upper part of the query. |
| |
| Outer joins also complicate handling of subquery pull-up. Consider |
| |
| SELECT ..., ss.x FROM tab1 |
| LEFT JOIN (SELECT *, 42 AS x FROM tab2) ss ON ... |
| |
| We want to be able to pull up the subquery as discussed previously, |
| but we can't just replace the "ss.x" Var in the top-level SELECT list |
| with the constant 42. That'd result in always emitting 42, rather |
| than emitting NULL in null-extended join rows. |
| |
| To solve this, we introduce the concept of PlaceHolderVars. |
| A PlaceHolderVar is somewhat like a Var, in that its value originates |
| at a relation scan level and can then be forced to null by higher-level |
| outer joins; hence PlaceHolderVars carry a set of nulling rel IDs just |
| like Vars. Unlike a Var, whose original value comes from a table, |
| a PlaceHolderVar's original value is defined by a query-determined |
| expression ("42" in this example); so we represent the PlaceHolderVar |
| as a node with that expression as child. We insert a PlaceHolderVar |
| whenever subquery pullup needs to replace a subquery-referencing Var |
| that has nonempty varnullingrels with an expression that is not simply a |
| Var. (When the replacement expression is a pulled-up Var, we can just |
| add the replaced Var's varnullingrels to its set. Also, if the replaced |
| Var has empty varnullingrels, we don't need a PlaceHolderVar: there is |
| nothing that'd force the value to null, so the pulled-up expression is |
| fine to use as-is.) In a finished plan, a PlaceHolderVar becomes just |
| the contained expression at whatever plan level it's supposed to be |
| evaluated at, and then upper-level occurrences are replaced by Var |
| references to that output column of the lower plan level. That causes |
| the value to go to null when appropriate at an outer join, in the same |
| way as for normal Vars. Thus, PlaceHolderVars are never seen outside |
| the planner. |
| |
| PlaceHolderVars (PHVs) are more complicated than Vars in another way: |
| their original value might need to be calculated at a join, not a |
| base-level relation scan. This can happen when a pulled-up subquery |
| contains a join. Because of this, a PHV can create a join order |
| constraint that wouldn't otherwise exist, to ensure that it can |
| be calculated before it is used. A PHV's expression can also contain |
| LATERAL references, adding complications that are discussed below. |
| |
| |
| Relation Identification and Qual Clause Placement |
| ------------------------------------------------- |
| |
| A qual clause obtained from WHERE or JOIN/ON can be enforced at the lowest |
| scan or join level that includes all relations used in the clause. For |
| this purpose we consider that outer joins listed in varnullingrels or |
| phnullingrels are used in the clause, since we can't compute the qual's |
| result correctly until we know whether such Vars have gone to null. |
| |
| The one exception to this general rule is that a non-degenerate outer |
| JOIN/ON qual (one that references the non-nullable side of the join) |
| cannot be enforced below that join, even if it doesn't reference the |
| nullable side. Pushing it down into the non-nullable side would result |
| in rows disappearing from the join's result, rather than appearing as |
| null-extended rows. To handle that, when we identify such a qual we |
| artificially add the join's minimum input relid set to the set of |
| relations it is considered to use, forcing it to be evaluated exactly at |
| that join level. The same happens for outer-join quals that mention no |
| relations at all. |
| |
| When attaching a qual clause to a join plan node that is performing an |
| outer join, the qual clause is considered a "join clause" (that is, it is |
| applied before the join performs null-extension) if it does not reference |
| that outer join in any varnullingrels or phnullingrels set, or a "filter |
| clause" (applied after null-extension) if it does reference that outer |
| join. A qual clause that originally appeared in that outer join's JOIN/ON |
| will fall into the first category, since the parser would not have marked |
| any of its Vars as referencing the outer join. A qual clause that |
| originally came from some upper ON clause or WHERE clause will be seen as |
| referencing the outer join if it references any of the nullable side's |
| Vars, since those Vars will be so marked by the parser. But, if such a |
| qual does not reference any nullable-side Vars, it's okay to push it down |
| into the non-nullable side, so it won't get attached to the join node in |
| the first place. |
| |
| These things lead us to identify join relations within the planner |
| by the sets of base relation RT indexes plus outer join RT indexes |
| that they include. In that way, the sets of relations used by qual |
| clauses can be directly compared to join relations' relid sets to |
| see where to place the clauses. These identifying sets are unique |
| because, for any given collection of base relations, there is only |
| one valid set of outer joins to have performed along the way to |
| joining that set of base relations (although the order of applying |
| them could vary, as discussed above). |
| |
| SEMI joins do not have RT indexes, because they are artifacts made by |
| the planner rather than the parser. (We could create rangetable |
| entries for them, but there seems no need at present.) This does not |
| cause a problem for qual placement, because the nullable side of a |
| semijoin is not referenceable from above the join, so there is never a |
| need to cite it in varnullingrels or phnullingrels. It does not cause a |
| problem for join relation identification either, since whether a semijoin |
| has been completed is again implicit in the set of base relations |
| included in the join. |
| |
| As usual, outer join identity 3 complicates matters. If we start with |
| (A leftjoin B on (Pab)) leftjoin C on (Pbc) |
| then the parser will have marked any C Vars appearing above these joins |
| with the RT index of the B/C join. If we now transform to |
| A leftjoin (B leftjoin C on (Pbc)) on (Pab) |
| then it would appear that a clause using only such Vars could be pushed |
| down and applied as a filter clause (not a join clause) at the lower |
| B/C join. But *this might not give the right answer* since the clause |
| might see a non-null value for the C Var that will be replaced by null |
| once the A/B join is performed. We handle this by saying that the |
| pushed-down join hasn't completely performed the work of the B/C join |
| and hence is not entitled to include that outer join relid in its |
| relid set. When we form the A/B join, both outer joins' relids will |
| be added to its relid set, and then the upper clause will be applied |
| at the correct join level. (Note there is no problem when identity 3 |
| is applied in the other direction: if we started with the second form |
| then upper C Vars are marked with both outer join relids, so they |
| cannot drop below whichever join is applied second.) Similarly, |
| Vars representing the output of a pushed-down join do not acquire |
| nullingrel bits for that join until after the upper join is performed. |
| |
| There is one additional complication for qual clause placement, which |
| occurs when we have made multiple versions of an outer-join clause as |
| described previously (that is, we have both "Pbc" and "Pb*c" forms of |
| the same clause seen in outer join identity 3). When forming an outer |
| join we only want to apply one of the redundant versions of the clause. |
| If we are forming the B/C join without having yet computed the A/B |
| join, it's easy to reject the "Pb*c" form since its required relid |
| set includes the A/B join relid which is not in the input. However, |
| if we form B/C after A/B, then both forms of the clause are applicable |
| so far as that test can tell. We have to look more closely to notice |
| that the "Pbc" clause form refers to relation B which is no longer |
| directly accessible. While such a check could be performed using the |
| per-relation RelOptInfo.nulling_relids data, it would be annoyingly |
| expensive to do over and over as we consider different join paths. |
| To make this simple and reliable, we compute an "incompatible_relids" |
| set for each variant version (clone) of a redundant clause. A clone |
| clause should not be applied if any of the outer-join relids listed in |
| incompatible_relids has already been computed below the current join. |
| |
| |
| Optimizer Functions |
| ------------------- |
| |
| The primary entry point is planner(). |
| |
| planner() |
| set up for recursive handling of subqueries |
| -subquery_planner() |
| pull up sublinks and subqueries from rangetable, if possible |
| canonicalize qual |
| Attempt to simplify WHERE clause to the most useful form; this includes |
| flattening nested AND/ORs and detecting clauses that are duplicated in |
| different branches of an OR. |
| simplify constant expressions |
| process sublinks |
| convert Vars of outer query levels into Params |
| --grouping_planner() |
| preprocess target list for non-SELECT queries |
| handle UNION/INTERSECT/EXCEPT, GROUP BY, HAVING, aggregates, |
| ORDER BY, DISTINCT, LIMIT |
| ---query_planner() |
| make list of base relations used in query |
| split up the qual into restrictions (a=1) and joins (b=c) |
| find qual clauses that enable merge and hash joins |
| ----make_one_rel() |
| set_base_rel_pathlists() |
| find seqscan and all index paths for each base relation |
| find selectivity of columns used in joins |
| make_rel_from_joinlist() |
| hand off join subproblems to a plugin, GEQO, or standard_join_search() |
| ------standard_join_search() |
| call join_search_one_level() for each level of join tree needed |
| join_search_one_level(): |
| For each joinrel of the prior level, do make_rels_by_clause_joins() |
| if it has join clauses, or make_rels_by_clauseless_joins() if not. |
| Also generate "bushy plan" joins between joinrels of lower levels. |
| Back at standard_join_search(), generate gather paths if needed for |
| each newly constructed joinrel, then apply set_cheapest() to extract |
| the cheapest path for it. |
| Loop back if this wasn't the top join level. |
| Back at grouping_planner: |
| do grouping (GROUP BY) and aggregation |
| do window functions |
| make unique (DISTINCT) |
| do sorting (ORDER BY) |
| do limit (LIMIT/OFFSET) |
| Back at planner(): |
| convert finished Path tree into a Plan tree |
| do final cleanup after planning |
| |
| |
| Optimizer Data Structures |
| ------------------------- |
| |
| PlannerGlobal - global information for a single planner invocation |
| |
| PlannerInfo - information for planning a particular Query (we make |
| a separate PlannerInfo node for each sub-Query) |
| |
| RelOptInfo - a relation or joined relations |
| |
| RestrictInfo - WHERE clauses, like "x = 3" or "y = z" |
| (note the same structure is used for restriction and |
| join clauses) |
| |
| Path - every way to generate a RelOptInfo(sequential,index,joins) |
| A plain Path node can represent several simple plans, per its pathtype: |
| T_SeqScan - sequential scan |
| T_SampleScan - tablesample scan |
| T_FunctionScan - function-in-FROM scan |
| T_TableFuncScan - table function scan |
| T_ValuesScan - VALUES scan |
| T_CteScan - CTE (WITH) scan |
| T_NamedTuplestoreScan - ENR scan |
| T_WorkTableScan - scan worktable of a recursive CTE |
| T_Result - childless Result plan node (used for FROM-less SELECT) |
| IndexPath - index scan |
| BitmapHeapPath - top of a bitmapped index scan |
| TidPath - scan by CTID |
| TidRangePath - scan a contiguous range of CTIDs |
| SubqueryScanPath - scan a subquery-in-FROM |
| ForeignPath - scan a foreign table, foreign join or foreign upper-relation |
| CustomPath - for custom scan providers |
| AppendPath - append multiple subpaths together |
| MergeAppendPath - merge multiple subpaths, preserving their common sort order |
| GroupResultPath - childless Result plan node (used for degenerate grouping) |
| MaterialPath - a Material plan node |
| MemoizePath - a Memoize plan node for caching tuples from sub-paths |
| UniquePath - remove duplicate rows (either by hashing or sorting) |
| GatherPath - collect the results of parallel workers |
| GatherMergePath - collect parallel results, preserving their common sort order |
| ProjectionPath - a Result plan node with child (used for projection) |
| ProjectSetPath - a ProjectSet plan node applied to some sub-path |
| SortPath - a Sort plan node applied to some sub-path |
| IncrementalSortPath - an IncrementalSort plan node applied to some sub-path |
| GroupPath - a Group plan node applied to some sub-path |
| UpperUniquePath - a Unique plan node applied to some sub-path |
| AggPath - an Agg plan node applied to some sub-path |
| GroupingSetsPath - an Agg plan node used to implement GROUPING SETS |
| MinMaxAggPath - a Result plan node with subplans performing MIN/MAX |
| WindowAggPath - a WindowAgg plan node applied to some sub-path |
| SetOpPath - a SetOp plan node applied to some sub-path |
| RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths |
| LockRowsPath - a LockRows plan node applied to some sub-path |
| ModifyTablePath - a ModifyTable plan node applied to some sub-path(s) |
| LimitPath - a Limit plan node applied to some sub-path |
| NestPath - nested-loop joins |
| MergePath - merge joins |
| HashPath - hash joins |
| |
| EquivalenceClass - a data structure representing a set of values known equal |
| |
| PathKey - a data structure representing the sort ordering of a path |
| |
| The optimizer spends a good deal of its time worrying about the ordering |
| of the tuples returned by a path. The reason this is useful is that by |
| knowing the sort ordering of a path, we may be able to use that path as |
| the left or right input of a mergejoin and avoid an explicit sort step. |
| Nestloops and hash joins don't really care what the order of their inputs |
| is, but mergejoin needs suitably ordered inputs. Therefore, all paths |
| generated during the optimization process are marked with their sort order |
| (to the extent that it is known) for possible use by a higher-level merge. |
| |
| It is also possible to avoid an explicit sort step to implement a user's |
| ORDER BY clause if the final path has the right ordering already, so the |
| sort ordering is of interest even at the top level. grouping_planner() will |
| look for the cheapest path with a sort order matching the desired order, |
| then compare its cost to the cost of using the cheapest-overall path and |
| doing an explicit sort on that. |
| |
| When we are generating paths for a particular RelOptInfo, we discard a path |
| if it is more expensive than another known path that has the same or better |
| sort order. We will never discard a path that is the only known way to |
| achieve a given sort order (without an explicit sort, that is). In this |
| way, the next level up will have the maximum freedom to build mergejoins |
| without sorting, since it can pick from any of the paths retained for its |
| inputs. |
| |
| |
| EquivalenceClasses |
| ------------------ |
| |
| During the deconstruct_jointree() scan of the query's qual clauses, we |
| look for mergejoinable equality clauses A = B. When we find one, we |
| create an EquivalenceClass containing the expressions A and B to record |
| that they are equal. If we later find another equivalence clause B = C, |
| we add C to the existing EquivalenceClass for {A B}; this may require |
| merging two existing EquivalenceClasses. At the end of the scan, we have |
| sets of values that are known all transitively equal to each other. We can |
| therefore use a comparison of any pair of the values as a restriction or |
| join clause (when these values are available at the scan or join, of |
| course); furthermore, we need test only one such comparison, not all of |
| them. Therefore, equivalence clauses are removed from the standard qual |
| distribution process. Instead, when preparing a restriction or join clause |
| list, we examine each EquivalenceClass to see if it can contribute a |
| clause, and if so we select an appropriate pair of values to compare. For |
| example, if we are trying to join A's relation to C's, we can generate the |
| clause A = C, even though this appeared nowhere explicitly in the original |
| query. This may allow us to explore join paths that otherwise would have |
| been rejected as requiring Cartesian-product joins. |
| |
| Sometimes an EquivalenceClass may contain a pseudo-constant expression |
| (i.e., one not containing Vars or Aggs of the current query level, nor |
| volatile functions). In this case we do not follow the policy of |
| dynamically generating join clauses: instead, we dynamically generate |
| restriction clauses "var = const" wherever one of the variable members of |
| the class can first be computed. For example, if we have A = B and B = 42, |
| we effectively generate the restriction clauses A = 42 and B = 42, and then |
| we need not bother with explicitly testing the join clause A = B when the |
| relations are joined. In effect, all the class members can be tested at |
| relation-scan level and there's never a need for join tests. |
| |
| The precise technical interpretation of an EquivalenceClass is that it |
| asserts that at any plan node where more than one of its member values |
| can be computed, output rows in which the values are not all equal may |
| be discarded without affecting the query result. (We require all levels |
| of the plan to enforce EquivalenceClasses, hence a join need not recheck |
| equality of values that were computable by one of its children.) |
| |
| Outer joins complicate this picture quite a bit, however. While we could |
| theoretically use mergejoinable equality clauses that appear in outer-join |
| conditions as sources of EquivalenceClasses, there's a serious difficulty: |
| the resulting deductions are not valid everywhere. For example, given |
| |
| SELECT * FROM a LEFT JOIN b ON (a.x = b.y AND a.x = 42); |
| |
| we can safely derive b.y = 42 and use that in the scan of B, because B |
| rows not having b.y = 42 will not contribute to the join result. However, |
| we cannot apply a.x = 42 at the scan of A, or we will remove rows that |
| should appear in the join result. We could apply a.x = 42 as an outer join |
| condition (and then it would be unnecessary to also check a.x = b.y). |
| This is not yet implemented, however. |
| |
| A related issue is that constants appearing below an outer join are |
| less constant than they appear. Ordinarily, if we find "A = 1" and |
| "B = 1", it's okay to put A and B into the same EquivalenceClass. |
| But consider |
| |
| SELECT * FROM a |
| LEFT JOIN (SELECT * FROM b WHERE b.z = 1) b ON (a.x = b.y) |
| WHERE a.x = 1; |
| |
| It would be a serious error to conclude that a.x = b.z, so we cannot |
| form a single EquivalenceClass {a.x b.z 1}. |
| |
| This leads to considering EquivalenceClasses as applying within "join |
| domains", which are sets of relations that are inner-joined to each other. |
| (We can treat semijoins as if they were inner joins for this purpose.) |
| There is a top-level join domain, and then each outer join in the query |
| creates a new join domain comprising its nullable side. Full joins create |
| two join domains, one for each side. EquivalenceClasses generated from |
| WHERE are associated with the top-level join domain. EquivalenceClasses |
| generated from the ON clause of an outer join are associated with the |
| domain created by that outer join. EquivalenceClasses generated from the |
| ON clause of an inner or semi join are associated with the syntactically |
| most closely nested join domain. |
| |
| Having defined these domains, we can fix the not-so-constant-constants |
| problem by considering that constants only match EquivalenceClass members |
| when they come from clauses within the same join domain. In the above |
| example, this means we keep {a.x 1} and {b.z 1} as separate |
| EquivalenceClasses and don't erroneously merge them. We don't have to |
| worry about this for Vars (or expressions containing Vars), because |
| references to the "same" column from different join domains will have |
| different varnullingrels and thus won't be equal() anyway. |
| |
| In the future, the join-domain concept may allow us to treat mergejoinable |
| outer-join conditions as sources of EquivalenceClasses. The idea would be |
| that conditions derived from such classes could only be enforced at scans |
| or joins that are within the appropriate join domain. This is not |
| implemented yet, however, as the details are trickier than they appear. |
| |
| Another instructive example is: |
| |
| SELECT * |
| FROM a LEFT JOIN |
| (SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss |
| ON a.x = ss.y |
| ORDER BY ss.y; |
| |
| We can form the EquivalenceClass {b.y c.z 10} and thereby apply c.z = 10 |
| while scanning C, as well as b.y = 10 while scanning B, so that no clause |
| needs to be checked at the inner join. The left-join clause "a.x = ss.y" |
| (really "a.x = b.y") is not considered an equivalence clause, so we do |
| not insert a.x into that same EquivalenceClass; if we did, we'd falsely |
| conclude a.x = 10. In the future though we might be able to do that, |
| if we can keep from applying a.x = 10 at the scan of A, which in principle |
| we could do by noting that the EquivalenceClass only applies within the |
| {B,C} join domain. |
| |
| Also notice that ss.y in the ORDER BY is really b.y* (that is, the |
| possibly-nulled form of b.y), so we will not confuse it with the b.y member |
| of the lower EquivalenceClass. Thus, we won't mistakenly conclude that |
| that ss.y is equal to a constant, which if true would lead us to think that |
| sorting for the ORDER BY is unnecessary (see discussion of PathKeys below). |
| Instead, there will be a separate EquivalenceClass containing only b.y*, |
| which will form the basis for the PathKey describing the required sort |
| order. |
| |
| Also consider this variant: |
| |
| SELECT * |
| FROM a LEFT JOIN |
| (SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss |
| ON a.x = ss.y |
| WHERE a.x = 42; |
| |
| We still form the EquivalenceClass {b.y c.z 10}, and additionally |
| we have an EquivalenceClass {a.x 42} belonging to a different join domain. |
| We cannot use "a.x = b.y" to merge these classes. However, we can compare |
| that outer join clause to the existing EquivalenceClasses and form the |
| derived clause "b.y = 42", which we can treat as a valid equivalence |
| within the lower join domain (since no row of that domain not having |
| b.y = 42 can contribute to the outer-join result). That makes the lower |
| EquivalenceClass {42 b.y c.z 10}, resulting in the contradiction 10 = 42, |
| which lets the planner deduce that the B/C join need not be computed at |
| all: the result of that whole join domain can be forced to empty. |
| (This gets implemented as a gating Result filter, since more usually the |
| potential contradiction involves Param values rather than just Consts, and |
| thus it has to be checked at runtime. We can use the join domain to |
| determine the join level at which to place the gating condition.) |
| |
| There is an additional complication when re-ordering outer joins according |
| to identity 3. Recall that the two choices we consider for such joins are |
| |
| A leftjoin (B leftjoin C on (Pbc)) on (Pab) |
| (A leftjoin B on (Pab)) leftjoin C on (Pb*c) |
| |
| where the star denotes varnullingrels markers on B's Vars. When Pbc |
| is (or includes) a mergejoinable clause, we have something like |
| |
| A leftjoin (B leftjoin C on (b.b = c.c)) on (Pab) |
| (A leftjoin B on (Pab)) leftjoin C on (b.b* = c.c) |
| |
| We could generate an EquivalenceClause linking b.b and c.c, but if we |
| then also try to link b.b* and c.c, we end with a nonsensical conclusion |
| that b.b and b.b* are equal (at least in some parts of the plan tree). |
| In any case, the conclusions we could derive from such a thing would be |
| largely duplicative. Conditions involving b.b* can't be computed below |
| this join nest, while any conditions that can be computed would be |
| duplicative of what we'd get from the b.b/c.c combination. Therefore, |
| we choose to generate an EquivalenceClause linking b.b and c.c, but |
| "b.b* = c.c" is handled as just an ordinary clause. |
| |
| To aid in determining the sort ordering(s) that can work with a mergejoin, |
| we mark each mergejoinable clause with the EquivalenceClasses of its left |
| and right inputs. For an equivalence clause, these are of course the same |
| EquivalenceClass. For a non-equivalence mergejoinable clause (such as an |
| outer-join qualification), we generate two separate EquivalenceClasses for |
| the left and right inputs. This may result in creating single-item |
| equivalence "classes", though of course these are still subject to merging |
| if other equivalence clauses are later found to bear on the same |
| expressions. |
| |
| Another way that we may form a single-item EquivalenceClass is in creation |
| of a PathKey to represent a desired sort order (see below). This happens |
| if an ORDER BY or GROUP BY key is not mentioned in any equivalence |
| clause. We need to reason about sort orders in such queries, and our |
| representation of sort ordering is a PathKey which depends on an |
| EquivalenceClass, so we have to make an EquivalenceClass. This is a bit |
| different from the above cases because such an EquivalenceClass might |
| contain an aggregate function or volatile expression. (A clause containing |
| a volatile function will never be considered mergejoinable, even if its top |
| operator is mergejoinable, so there is no way for a volatile expression to |
| get into EquivalenceClasses otherwise. Aggregates are disallowed in WHERE |
| altogether, so will never be found in a mergejoinable clause.) This is just |
| a convenience to maintain a uniform PathKey representation: such an |
| EquivalenceClass will never be merged with any other. Note in particular |
| that a single-item EquivalenceClass {a.x} is *not* meant to imply an |
| assertion that a.x = a.x; the practical effect of this is that a.x could |
| be NULL. |
| |
| An EquivalenceClass also contains a list of btree opfamily OIDs, which |
| determines what the equalities it represents actually "mean". All the |
| equivalence clauses that contribute to an EquivalenceClass must have |
| equality operators that belong to the same set of opfamilies. (Note: most |
| of the time, a particular equality operator belongs to only one family, but |
| it's possible that it belongs to more than one. We keep track of all the |
| families to ensure that we can make use of an index belonging to any one of |
| the families for mergejoin purposes.) |
| |
| For the same sort of reason, an EquivalenceClass is also associated |
| with a particular collation, if its datatype(s) care about collation. |
| |
| An EquivalenceClass can contain "em_is_child" members, which are copies |
| of members that contain appendrel parent relation Vars, transposed to |
| contain the equivalent child-relation variables or expressions. These |
| members are *not* full-fledged members of the EquivalenceClass and do not |
| affect the class's overall properties at all. They are kept only to |
| simplify matching of child-relation expressions to EquivalenceClasses. |
| Most operations on EquivalenceClasses should ignore child members. |
| |
| |
| PathKeys |
| -------- |
| |
| The PathKeys data structure represents what is known about the sort order |
| of the tuples generated by a particular Path. A path's pathkeys field is a |
| list of PathKey nodes, where the n'th item represents the n'th sort key of |
| the result. Each PathKey contains these fields: |
| |
| * a reference to an EquivalenceClass |
| * a btree opfamily OID (must match one of those in the EC) |
| * a sort direction (ascending or descending) |
| * a nulls-first-or-last flag |
| |
| The EquivalenceClass represents the value being sorted on. Since the |
| various members of an EquivalenceClass are known equal according to the |
| opfamily, we can consider a path sorted by any one of them to be sorted by |
| any other too; this is what justifies referencing the whole |
| EquivalenceClass rather than just one member of it. |
| |
| In single/base relation RelOptInfo's, the Paths represent various ways |
| of scanning the relation and the resulting ordering of the tuples. |
| Sequential scan Paths have NIL pathkeys, indicating no known ordering. |
| Index scans have Path.pathkeys that represent the chosen index's ordering, |
| if any. A single-key index would create a single-PathKey list, while a |
| multi-column index generates a list with one element per key index column. |
| Non-key columns specified in the INCLUDE clause of covering indexes don't |
| have corresponding PathKeys in the list, because they have no influence on |
| index ordering. (Actually, since an index can be scanned either forward or |
| backward, there are two possible sort orders and two possible PathKey lists |
| it can generate.) |
| |
| Note that a bitmap scan has NIL pathkeys since we can say nothing about |
| the overall order of its result. Also, an indexscan on an unordered type |
| of index generates NIL pathkeys. However, we can always create a pathkey |
| by doing an explicit sort. The pathkeys for a Sort plan's output just |
| represent the sort key fields and the ordering operators used. |
| |
| Things get more interesting when we consider joins. Suppose we do a |
| mergejoin between A and B using the mergeclause A.X = B.Y. The output |
| of the mergejoin is sorted by X --- but it is also sorted by Y. Again, |
| this can be represented by a PathKey referencing an EquivalenceClass |
| containing both X and Y. |
| |
| With a little further thought, it becomes apparent that nestloop joins |
| can also produce sorted output. For example, if we do a nestloop join |
| between outer relation A and inner relation B, then any pathkeys relevant |
| to A are still valid for the join result: we have not altered the order of |
| the tuples from A. Even more interesting, if there was an equivalence clause |
| A.X=B.Y, and A.X was a pathkey for the outer relation A, then we can assert |
| that B.Y is a pathkey for the join result; X was ordered before and still |
| is, and the joined values of Y are equal to the joined values of X, so Y |
| must now be ordered too. This is true even though we used neither an |
| explicit sort nor a mergejoin on Y. (Note: hash joins cannot be counted |
| on to preserve the order of their outer relation, because the executor |
| might decide to "batch" the join, so we always set pathkeys to NIL for |
| a hashjoin path.) |
| |
| An outer join doesn't preserve the ordering of its nullable input |
| relation(s), because it might insert nulls at random points in the |
| ordering. We don't need to think about this explicitly in the PathKey |
| representation, because a PathKey representing a post-join variable |
| will contain varnullingrel bits, making it not equal to a PathKey |
| representing the pre-join value. |
| |
| In general, we can justify using EquivalenceClasses as the basis for |
| pathkeys because, whenever we scan a relation containing multiple |
| EquivalenceClass members or join two relations each containing |
| EquivalenceClass members, we apply restriction or join clauses derived from |
| the EquivalenceClass. This guarantees that any two values listed in the |
| EquivalenceClass are in fact equal in all tuples emitted by the scan or |
| join, and therefore that if the tuples are sorted by one of the values, |
| they can be considered sorted by any other as well. It does not matter |
| whether the test clause is used as a mergeclause, or merely enforced |
| after-the-fact as a qpqual filter. |
| |
| Note that there is no particular difficulty in labeling a path's sort |
| order with a PathKey referencing an EquivalenceClass that contains |
| variables not yet joined into the path's output. We can simply ignore |
| such entries as not being relevant (yet). This makes it possible to |
| use the same EquivalenceClasses throughout the join planning process. |
| In fact, by being careful not to generate multiple identical PathKey |
| objects, we can reduce comparison of EquivalenceClasses and PathKeys |
| to simple pointer comparison, which is a huge savings because add_path |
| has to make a large number of PathKey comparisons in deciding whether |
| competing Paths are equivalently sorted. |
| |
| Pathkeys are also useful to represent an ordering that we wish to achieve, |
| since they are easily compared to the pathkeys of a potential candidate |
| path. So, SortGroupClause lists are turned into pathkeys lists for use |
| inside the optimizer. |
| |
| An additional refinement we can make is to insist that canonical pathkey |
| lists (sort orderings) do not mention the same EquivalenceClass more than |
| once. For example, in all these cases the second sort column is redundant, |
| because it cannot distinguish values that are the same according to the |
| first sort column: |
| SELECT ... ORDER BY x, x |
| SELECT ... ORDER BY x, x DESC |
| SELECT ... WHERE x = y ORDER BY x, y |
| Although a user probably wouldn't write "ORDER BY x,x" directly, such |
| redundancies are more probable once equivalence classes have been |
| considered. Also, the system may generate redundant pathkey lists when |
| computing the sort ordering needed for a mergejoin. By eliminating the |
| redundancy, we save time and improve planning, since the planner will more |
| easily recognize equivalent orderings as being equivalent. |
| |
| Another interesting property is that if the underlying EquivalenceClass |
| contains a constant, then the pathkey is completely redundant and need not |
| be sorted by at all! Every interesting row must contain the same value, |
| so there's no need to sort. This might seem pointless because users |
| are unlikely to write "... WHERE x = 42 ORDER BY x", but it allows us to |
| recognize when particular index columns are irrelevant to the sort order: |
| if we have "... WHERE x = 42 ORDER BY y", scanning an index on (x,y) |
| produces correctly ordered data without a sort step. We used to have very |
| ugly ad-hoc code to recognize that in limited contexts, but discarding |
| constant ECs from pathkeys makes it happen cleanly and automatically. |
| |
| |
| Order of processing for EquivalenceClasses and PathKeys |
| ------------------------------------------------------- |
| |
| As alluded to above, there is a specific sequence of phases in the |
| processing of EquivalenceClasses and PathKeys during planning. During the |
| initial scanning of the query's quals (deconstruct_jointree followed by |
| reconsider_outer_join_clauses), we construct EquivalenceClasses based on |
| mergejoinable clauses found in the quals. At the end of this process, |
| we know all we can know about equivalence of different variables, so |
| subsequently there will be no further merging of EquivalenceClasses. |
| At that point it is possible to consider the EquivalenceClasses as |
| "canonical" and build canonical PathKeys that reference them. At this |
| time we construct PathKeys for the query's ORDER BY and related clauses. |
| (Any ordering expressions that do not appear elsewhere will result in |
| the creation of new EquivalenceClasses, but this cannot result in merging |
| existing classes, so canonical-ness is not lost.) |
| |
| Because all the EquivalenceClasses are known before we begin path |
| generation, we can use them as a guide to which indexes are of interest: |
| if an index's column is not mentioned in any EquivalenceClass then that |
| index's sort order cannot possibly be helpful for the query. This allows |
| short-circuiting of much of the processing of create_index_paths() for |
| irrelevant indexes. |
| |
| There are some cases where planner.c constructs additional |
| EquivalenceClasses and PathKeys after query_planner has completed. |
| In these cases, the extra ECs/PKs are needed to represent sort orders |
| that were not considered during query_planner. Such situations should be |
| minimized since it is impossible for query_planner to return a plan |
| producing such a sort order, meaning an explicit sort will always be needed. |
| Currently this happens only for queries involving multiple window functions |
| with different orderings, for which extra sorts are needed anyway. |
| |
| |
| Parameterized Paths |
| ------------------- |
| |
| The naive way to join two relations using a clause like WHERE A.X = B.Y |
| is to generate a nestloop plan like this: |
| |
| NestLoop |
| Filter: A.X = B.Y |
| -> Seq Scan on A |
| -> Seq Scan on B |
| |
| We can make this better by using a merge or hash join, but it still |
| requires scanning all of both input relations. If A is very small and B is |
| very large, but there is an index on B.Y, it can be enormously better to do |
| something like this: |
| |
| NestLoop |
| -> Seq Scan on A |
| -> Index Scan using B_Y_IDX on B |
| Index Condition: B.Y = A.X |
| |
| Here, we are expecting that for each row scanned from A, the nestloop |
| plan node will pass down the current value of A.X into the scan of B. |
| That allows the indexscan to treat A.X as a constant for any one |
| invocation, and thereby use it as an index key. This is the only plan type |
| that can avoid fetching all of B, and for small numbers of rows coming from |
| A, that will dominate every other consideration. (As A gets larger, this |
| gets less attractive, and eventually a merge or hash join will win instead. |
| So we have to cost out all the alternatives to decide what to do.) |
| |
| It can be useful for the parameter value to be passed down through |
| intermediate layers of joins, for example: |
| |
| NestLoop |
| -> Seq Scan on A |
| Hash Join |
| Join Condition: B.Y = C.W |
| -> Seq Scan on B |
| -> Index Scan using C_Z_IDX on C |
| Index Condition: C.Z = A.X |
| |
| If all joins are plain inner joins then this is usually unnecessary, |
| because it's possible to reorder the joins so that a parameter is used |
| immediately below the nestloop node that provides it. But in the |
| presence of outer joins, such join reordering may not be possible. |
| |
| Also, the bottom-level scan might require parameters from more than one |
| other relation. In principle we could join the other relations first |
| so that all the parameters are supplied from a single nestloop level. |
| But if those other relations have no join clause in common (which is |
| common in star-schema queries for instance), the planner won't consider |
| joining them directly to each other. In such a case we need to be able |
| to create a plan like |
| |
| NestLoop |
| -> Seq Scan on SmallTable1 A |
| NestLoop |
| -> Seq Scan on SmallTable2 B |
| -> Index Scan using XYIndex on LargeTable C |
| Index Condition: C.X = A.AID and C.Y = B.BID |
| |
| so we should be willing to pass down A.AID through a join even though |
| there is no join order constraint forcing the plan to look like this. |
| |
| Before version 9.2, Postgres used ad-hoc methods for planning and |
| executing nestloop queries of this kind, and those methods could not |
| handle passing parameters down through multiple join levels. |
| |
| To plan such queries, we now use a notion of a "parameterized path", |
| which is a path that makes use of a join clause to a relation that's not |
| scanned by the path. In the example two above, we would construct a |
| path representing the possibility of doing this: |
| |
| -> Index Scan using C_Z_IDX on C |
| Index Condition: C.Z = A.X |
| |
| This path will be marked as being parameterized by relation A. (Note that |
| this is only one of the possible access paths for C; we'd still have a |
| plain unparameterized seqscan, and perhaps other possibilities.) The |
| parameterization marker does not prevent joining the path to B, so one of |
| the paths generated for the joinrel {B C} will represent |
| |
| Hash Join |
| Join Condition: B.Y = C.W |
| -> Seq Scan on B |
| -> Index Scan using C_Z_IDX on C |
| Index Condition: C.Z = A.X |
| |
| This path is still marked as being parameterized by A. When we attempt to |
| join {B C} to A to form the complete join tree, such a path can only be |
| used as the inner side of a nestloop join: it will be ignored for other |
| possible join types. So we will form a join path representing the query |
| plan shown above, and it will compete in the usual way with paths built |
| from non-parameterized scans. |
| |
| While all ordinary paths for a particular relation generate the same set |
| of rows (since they must all apply the same set of restriction clauses), |
| parameterized paths typically generate fewer rows than less-parameterized |
| paths, since they have additional clauses to work with. This means we |
| must consider the number of rows generated as an additional figure of |
| merit. A path that costs more than another, but generates fewer rows, |
| must be kept since the smaller number of rows might save work at some |
| intermediate join level. (It would not save anything if joined |
| immediately to the source of the parameters.) |
| |
| To keep cost estimation rules relatively simple, we make an implementation |
| restriction that all paths for a given relation of the same parameterization |
| (i.e., the same set of outer relations supplying parameters) must have the |
| same rowcount estimate. This is justified by insisting that each such path |
| apply *all* join clauses that are available with the named outer relations. |
| Different paths might, for instance, choose different join clauses to use |
| as index clauses; but they must then apply any other join clauses available |
| from the same outer relations as filter conditions, so that the set of rows |
| returned is held constant. This restriction doesn't degrade the quality of |
| the finished plan: it amounts to saying that we should always push down |
| movable join clauses to the lowest possible evaluation level, which is a |
| good thing anyway. The restriction is useful in particular to support |
| pre-filtering of join paths in add_path_precheck. Without this rule we |
| could never reject a parameterized path in advance of computing its rowcount |
| estimate, which would greatly reduce the value of the pre-filter mechanism. |
| |
| To limit planning time, we have to avoid generating an unreasonably large |
| number of parameterized paths. We do this by only generating parameterized |
| relation scan paths for index scans, and then only for indexes for which |
| suitable join clauses are available. There are also heuristics in join |
| planning that try to limit the number of parameterized paths considered. |
| |
| In particular, there's been a deliberate policy decision to favor hash |
| joins over merge joins for parameterized join steps (those occurring below |
| a nestloop that provides parameters to the lower join's inputs). While we |
| do not ignore merge joins entirely, joinpath.c does not fully explore the |
| space of potential merge joins with parameterized inputs. Also, add_path |
| treats parameterized paths as having no pathkeys, so that they compete |
| only on cost and rowcount; they don't get preference for producing a |
| special sort order. This creates additional bias against merge joins, |
| since we might discard a path that could have been useful for performing |
| a merge without an explicit sort step. Since a parameterized path must |
| ultimately be used on the inside of a nestloop, where its sort order is |
| uninteresting, these choices do not affect any requirement for the final |
| output order of a query --- they only make it harder to use a merge join |
| at a lower level. The savings in planning work justifies that. |
| |
| Similarly, parameterized paths do not normally get preference in add_path |
| for having cheap startup cost; that's seldom of much value when on the |
| inside of a nestloop, so it seems not worth keeping extra paths solely for |
| that. An exception occurs for parameterized paths for the RHS relation of |
| a SEMI or ANTI join: in those cases, we can stop the inner scan after the |
| first match, so it's primarily startup not total cost that we care about. |
| |
| |
| LATERAL subqueries |
| ------------------ |
| |
| As of 9.3 we support SQL-standard LATERAL references from subqueries in |
| FROM (and also functions in FROM). The planner implements these by |
| generating parameterized paths for any RTE that contains lateral |
| references. In such cases, *all* paths for that relation will be |
| parameterized by at least the set of relations used in its lateral |
| references. (And in turn, join relations including such a subquery might |
| not have any unparameterized paths.) All the other comments made above for |
| parameterized paths still apply, though; in particular, each such path is |
| still expected to enforce any join clauses that can be pushed down to it, |
| so that all paths of the same parameterization have the same rowcount. |
| |
| We also allow LATERAL subqueries to be flattened (pulled up into the parent |
| query) by the optimizer, but only when this does not introduce lateral |
| references into JOIN/ON quals that would refer to relations outside the |
| lowest outer join at/above that qual. The semantics of such a qual would |
| be unclear. Note that even with this restriction, pullup of a LATERAL |
| subquery can result in creating PlaceHolderVars that contain lateral |
| references to relations outside their syntactic scope. We still evaluate |
| such PHVs at their syntactic location or lower, but the presence of such a |
| PHV in the quals or targetlist of a plan node requires that node to appear |
| on the inside of a nestloop join relative to the rel(s) supplying the |
| lateral reference. (Perhaps now that that stuff works, we could relax the |
| pullup restriction?) |
| |
| |
| Security-level constraints on qual clauses |
| ------------------------------------------ |
| |
| To support row-level security and security-barrier views efficiently, |
| we mark qual clauses (RestrictInfo nodes) with a "security_level" field. |
| The basic concept is that a qual with a lower security_level must be |
| evaluated before one with a higher security_level. This ensures that |
| "leaky" quals that might expose sensitive data are not evaluated until |
| after the security barrier quals that are supposed to filter out |
| security-sensitive rows. However, many qual conditions are "leakproof", |
| that is we trust the functions they use to not expose data. To avoid |
| unnecessarily inefficient plans, a leakproof qual is not delayed by |
| security-level considerations, even if it has a higher syntactic |
| security_level than another qual. |
| |
| In a query that contains no use of RLS or security-barrier views, all |
| quals will have security_level zero, so that none of these restrictions |
| kick in; we don't even need to check leakproofness of qual conditions. |
| |
| If there are security-barrier quals, they get security_level zero (and |
| possibly higher, if there are multiple layers of barriers). Regular quals |
| coming from the query text get a security_level one more than the highest |
| level used for barrier quals. |
| |
| When new qual clauses are generated by EquivalenceClass processing, |
| they must be assigned a security_level. This is trickier than it seems. |
| One's first instinct is that it would be safe to use the largest level |
| found among the source quals for the EquivalenceClass, but that isn't |
| safe at all, because it allows unwanted delays of security-barrier quals. |
| Consider a barrier qual "t.x = t.y" plus a query qual "t.x = constant", |
| and suppose there is another query qual "leaky_function(t.z)" that |
| we mustn't evaluate before the barrier qual has been checked. |
| We will have an EC {t.x, t.y, constant} which will lead us to replace |
| the EC quals with "t.x = constant AND t.y = constant". (We do not want |
| to give up that behavior, either, since the latter condition could allow |
| use of an index on t.y, which we would never discover from the original |
| quals.) If these generated quals are assigned the same security_level as |
| the query quals, then it's possible for the leaky_function qual to be |
| evaluated first, allowing leaky_function to see data from rows that |
| possibly don't pass the barrier condition. |
| |
| Instead, our handling of security levels with ECs works like this: |
| * Quals are not accepted as source clauses for ECs in the first place |
| unless they are leakproof or have security_level zero. |
| * EC-derived quals are assigned the minimum (not maximum) security_level |
| found among the EC's source clauses. |
| * If the maximum security_level found among the EC's source clauses is |
| above zero, then the equality operators selected for derived quals must |
| be leakproof. When no such operator can be found, the EC is treated as |
| "broken" and we fall back to emitting its source clauses without any |
| additional derived quals. |
| |
| These rules together ensure that an untrusted qual clause (one with |
| security_level above zero) cannot cause an EC to generate a leaky derived |
| clause. This makes it safe to use the minimum not maximum security_level |
| for derived clauses. The rules could result in poor plans due to not |
| being able to generate derived clauses at all, but the risk of that is |
| small in practice because most btree equality operators are leakproof. |
| Also, by making exceptions for level-zero quals, we ensure that there is |
| no plan degradation when no barrier quals are present. |
| |
| Once we have security levels assigned to all clauses, enforcement |
| of barrier-qual ordering restrictions boils down to two rules: |
| |
| * Table scan plan nodes must not select quals for early execution |
| (for example, use them as index qualifiers in an indexscan) unless |
| they are leakproof or have security_level no higher than any other |
| qual that is due to be executed at the same plan node. (Use the |
| utility function restriction_is_securely_promotable() to check |
| whether it's okay to select a qual for early execution.) |
| |
| * Normal execution of a list of quals must execute them in an order |
| that satisfies the same security rule, ie higher security_levels must |
| be evaluated later unless leakproof. (This is handled in a single place |
| by order_qual_clauses() in createplan.c.) |
| |
| order_qual_clauses() uses a heuristic to decide exactly what to do with |
| leakproof clauses. Normally it sorts clauses by security_level then cost, |
| being careful that the sort is stable so that we don't reorder clauses |
| without a clear reason. But this could result in a very expensive qual |
| being done before a cheaper one that is of higher security_level. |
| If the cheaper qual is leaky we have no choice, but if it is leakproof |
| we could put it first. We choose to sort leakproof quals as if they |
| have security_level zero, but only when their cost is less than 10X |
| cpu_operator_cost; that restriction alleviates the opposite problem of |
| doing expensive quals first just because they're leakproof. |
| |
| Additional rules will be needed to support safe handling of join quals |
| when there is a mix of security levels among join quals; for example, it |
| will be necessary to prevent leaky higher-security-level quals from being |
| evaluated at a lower join level than other quals of lower security level. |
| Currently there is no need to consider that since security-prioritized |
| quals can only be single-table restriction quals coming from RLS policies |
| or security-barrier views, and security-barrier view subqueries are never |
| flattened into the parent query. Hence enforcement of security-prioritized |
| quals only happens at the table scan level. With extra rules for safe |
| handling of security levels among join quals, it should be possible to let |
| security-barrier views be flattened into the parent query, allowing more |
| flexibility of planning while still preserving required ordering of qual |
| evaluation. But that will come later. |
| |
| |
| Post scan/join planning |
| ----------------------- |
| |
| So far we have discussed only scan/join planning, that is, implementation |
| of the FROM and WHERE clauses of a SQL query. But the planner must also |
| determine how to deal with GROUP BY, aggregation, and other higher-level |
| features of queries; and in many cases there are multiple ways to do these |
| steps and thus opportunities for optimization choices. These steps, like |
| scan/join planning, are handled by constructing Paths representing the |
| different ways to do a step, then choosing the cheapest Path. |
| |
| Since all Paths require a RelOptInfo as "parent", we create RelOptInfos |
| representing the outputs of these upper-level processing steps. These |
| RelOptInfos are mostly dummy, but their pathlist lists hold all the Paths |
| considered useful for each step. Currently, we may create these types of |
| additional RelOptInfos during upper-level planning: |
| |
| UPPERREL_SETOP result of UNION/INTERSECT/EXCEPT, if any |
| UPPERREL_PARTIAL_GROUP_AGG result of partial grouping/aggregation, if any |
| UPPERREL_GROUP_AGG result of grouping/aggregation, if any |
| UPPERREL_WINDOW result of window functions, if any |
| UPPERREL_PARTIAL_DISTINCT result of partial "SELECT DISTINCT", if any |
| UPPERREL_DISTINCT result of "SELECT DISTINCT", if any |
| UPPERREL_ORDERED result of ORDER BY, if any |
| UPPERREL_FINAL result of any remaining top-level actions |
| |
| UPPERREL_FINAL is used to represent any final processing steps, currently |
| LockRows (SELECT FOR UPDATE), LIMIT/OFFSET, and ModifyTable. There is no |
| flexibility about the order in which these steps are done, and thus no need |
| to subdivide this stage more finely. |
| |
| These "upper relations" are identified by the UPPERREL enum values shown |
| above, plus a relids set, which allows there to be more than one upperrel |
| of the same kind. We use NULL for the relids if there's no need for more |
| than one upperrel of the same kind. Currently, in fact, the relids set |
| is vestigial because it's always NULL, but that's expected to change in |
| the future. For example, in planning set operations, we might need the |
| relids to denote which subset of the leaf SELECTs has been combined in a |
| particular group of Paths that are competing with each other. |
| |
| The result of subquery_planner() is always returned as a set of Paths |
| stored in the UPPERREL_FINAL rel with NULL relids. The other types of |
| upperrels are created only if needed for the particular query. |
| |
| |
| Parallel Query and Partial Paths |
| -------------------------------- |
| |
| Parallel query involves dividing up the work that needs to be performed |
| either by an entire query or some portion of the query in such a way that |
| some of that work can be done by one or more worker processes, which are |
| called parallel workers. Parallel workers are a subtype of dynamic |
| background workers; see src/backend/access/transam/README.parallel for a |
| fuller description. The academic literature on parallel query suggests |
| that parallel execution strategies can be divided into essentially two |
| categories: pipelined parallelism, where the execution of the query is |
| divided into multiple stages and each stage is handled by a separate |
| process; and partitioning parallelism, where the data is split between |
| multiple processes and each process handles a subset of it. The |
| literature, however, suggests that gains from pipeline parallelism are |
| often very limited due to the difficulty of avoiding pipeline stalls. |
| Consequently, we do not currently attempt to generate query plans that |
| use this technique. |
| |
| Instead, we focus on partitioning parallelism, which does not require |
| that the underlying table be partitioned. It only requires that (1) |
| there is some method of dividing the data from at least one of the base |
| tables involved in the relation across multiple processes, (2) allowing |
| each process to handle its own portion of the data, and then (3) |
| collecting the results. Requirements (2) and (3) are satisfied by the |
| executor node Gather (or GatherMerge), which launches any number of worker |
| processes and executes its single child plan in all of them, and perhaps |
| in the leader also, if the children aren't generating enough data to keep |
| the leader busy. Requirement (1) is handled by the table scan node: when |
| invoked with parallel_aware = true, this node will, in effect, partition |
| the table on a block by block basis, returning a subset of the tuples from |
| the relation in each worker where that scan node is executed. |
| |
| Just as we do for non-parallel access methods, we build Paths to |
| represent access strategies that can be used in a parallel plan. These |
| are, in essence, the same strategies that are available in the |
| non-parallel plan, but there is an important difference: a path that |
| will run beneath a Gather node returns only a subset of the query |
| results in each worker, not all of them. To form a path that can |
| actually be executed, the (rather large) cost of the Gather node must be |
| accounted for. For this reason among others, paths intended to run |
| beneath a Gather node - which we call "partial" paths since they return |
| only a subset of the results in each worker - must be kept separate from |
| ordinary paths (see RelOptInfo's partial_pathlist and the function |
| add_partial_path). |
| |
| One of the keys to making parallel query effective is to run as much of |
| the query in parallel as possible. Therefore, we expect it to generally |
| be desirable to postpone the Gather stage until as near to the top of the |
| plan as possible. Expanding the range of cases in which more work can be |
| pushed below the Gather (and costing them accurately) is likely to keep us |
| busy for a long time to come. |
| |
| Partitionwise joins |
| ------------------- |
| |
| A join between two similarly partitioned tables can be broken down into joins |
| between their matching partitions if there exists an equi-join condition |
| between the partition keys of the joining tables. The equi-join between |
| partition keys implies that all join partners for a given row in one |
| partitioned table must be in the corresponding partition of the other |
| partitioned table. Because of this the join between partitioned tables to be |
| broken into joins between the matching partitions. The resultant join is |
| partitioned in the same way as the joining relations, thus allowing an N-way |
| join between similarly partitioned tables having equi-join condition between |
| their partition keys to be broken down into N-way joins between their matching |
| partitions. This technique of breaking down a join between partitioned tables |
| into joins between their partitions is called partitionwise join. We will use |
| term "partitioned relation" for either a partitioned table or a join between |
| compatibly partitioned tables. |
| |
| Even if the joining relations don't have exactly the same partition bounds, |
| partitionwise join can still be applied by using an advanced |
| partition-matching algorithm. For both the joining relations, the algorithm |
| checks whether every partition of one joining relation only matches one |
| partition of the other joining relation at most. In such a case the join |
| between the joining relations can be broken down into joins between the |
| matching partitions. The join relation can then be considered partitioned. |
| The algorithm produces the pairs of the matching partitions, plus the |
| partition bounds for the join relation, to allow partitionwise join for |
| computing the join. The algorithm is implemented in partition_bounds_merge(). |
| For an N-way join relation considered partitioned this way, not every pair of |
| joining relations can use partitionwise join. For example: |
| |
| (A leftjoin B on (Pab)) innerjoin C on (Pac) |
| |
| where A, B, and C are partitioned tables, and A has an extra partition |
| compared to B and C. When considering partitionwise join for the join {A B}, |
| the extra partition of A doesn't have a matching partition on the nullable |
| side, which is the case that the current implementation of partitionwise join |
| can't handle. So {A B} is not considered partitioned, and the pair of {A B} |
| and C considered for the 3-way join can't use partitionwise join. On the |
| other hand, the pair of {A C} and B can use partitionwise join because {A C} |
| is considered partitioned by eliminating the extra partition (see identity 1 |
| on outer join reordering). Whether an N-way join can use partitionwise join |
| is determined based on the first pair of joining relations that are both |
| partitioned and can use partitionwise join. |
| |
| The partitioning properties of a partitioned relation are stored in its |
| RelOptInfo. The information about data types of partition keys are stored in |
| PartitionSchemeData structure. The planner maintains a list of canonical |
| partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of |
| any two partitioned relations with same partitioning scheme point to the same |
| PartitionSchemeData object. This reduces memory consumed by |
| PartitionSchemeData objects and makes it easy to compare the partition schemes |
| of joining relations. |
| |
| Partitionwise aggregates/grouping |
| --------------------------------- |
| |
| If the GROUP BY clause contains all of the partition keys, all the rows |
| that belong to a given group must come from a single partition; therefore, |
| aggregation can be done completely separately for each partition. Otherwise, |
| partial aggregates can be computed for each partition, and then finalized |
| after appending the results from the individual partitions. This technique of |
| breaking down aggregation or grouping over a partitioned relation into |
| aggregation or grouping over its partitions is called partitionwise |
| aggregation. Especially when the partition keys match the GROUP BY clause, |
| this can be significantly faster than the regular method. |
| |
| Aggregate push-down |
| ------------------- |
| |
| The obvious way to evaluate aggregates is to evaluate the FROM clause of the |
| SQL query (this is what query_planner does) and use the resulting paths as the |
| input of Agg node. However, if the groups are large enough, it may be more |
| efficient to apply the partial aggregation to the output of base relation |
| scan, and finalize it when we have all relations of the query joined: |
| |
| EXPLAIN |
| SELECT a.i, avg(b.y) |
| FROM a JOIN b ON b.j = a.i |
| GROUP BY a.i; |
| |
| Finalize HashAggregate |
| Group Key: a.i |
| -> Nested Loop |
| -> Partial HashAggregate |
| Group Key: b.j |
| -> Seq Scan on b |
| -> Index Only Scan using a_pkey on a |
| Index Cond: (i = b.j) |
| |
| Thus the join above the partial aggregate node receives fewer input rows, and |
| so the number of outer-to-inner pairs of tuples to be checked can be |
| significantly lower, which can in turn lead to considerably lower join cost. |
| |
| Note that there's often no GROUP BY expression to be used for the partial |
| aggregation, so we use equivalence classes to derive grouping expression: in |
| the example above, the grouping key "b.j" was derived from "a.i". |
| |
| Also note that in this case the partial aggregate uses the "b.j" as grouping |
| column although the column does not appear in the query target list. The point |
| is that "b.j" is needed to evaluate the join condition, and there's no other |
| way for the partial aggregate to emit its values. |
| |
| Besides base relation, the aggregation can also be pushed down to join: |
| |
| EXPLAIN |
| SELECT a.i, avg(b.y + c.v) |
| FROM a JOIN b ON b.j = a.i |
| JOIN c ON c.k = a.i |
| WHERE b.j = c.k GROUP BY a.i; |
| |
| Finalize HashAggregate |
| Group Key: a.i |
| -> Hash Join |
| Hash Cond: (b.j = a.i) |
| -> Partial HashAggregate |
| Group Key: b.j |
| -> Hash Join |
| Hash Cond: (b.j = c.k) |
| -> Seq Scan on b |
| -> Hash |
| -> Seq Scan on c |
| -> Hash |
| -> Seq Scan on a |
| |
| Whether the Agg node is created out of base relation or out of join, it's |
| added to a separate RelOptInfo that we call "grouped relation". Grouped |
| relation can be joined to a non-grouped relation, which results in a grouped |
| relation too. Join of two grouped relations does not seem to be very useful |
| and is currently not supported. |
| |
| If query_planner produces a grouped relation that contains valid paths, these |
| are added to the special hash table. Further processing of grouping path will |
| try to use these partial grouped paths. |