| $PostgreSQL: pgsql/src/backend/executor/README,v 1.8 2009/01/09 15:46:10 tgl Exp $ |
| |
| The Postgres Executor |
| ===================== |
| |
| The executor processes a tree of "plan nodes". The plan tree is essentially |
| a demand-pull pipeline of tuple processing operations. Each node, when |
| called, will produce the next tuple in its output sequence, or NULL if no |
| more tuples are available. If the node is not a primitive relation-scanning |
| node, it will have child node(s) that it calls in turn to obtain input |
| tuples. |
| |
| Refinements on this basic model include: |
| |
| * Choice of scan direction (forwards or backwards). Caution: this is not |
| currently well-supported. It works for primitive scan nodes, but not very |
| well for joins, aggregates, etc. |
| |
| * Rescan command to reset a node and make it generate its output sequence |
| over again. |
| |
| * Parameters that can alter a node's results. After adjusting a parameter, |
| the rescan command must be applied to that node and all nodes above it. |
| There is a moderately intelligent scheme to avoid rescanning nodes |
| unnecessarily (for example, Sort does not rescan its input if no parameters |
| of the input have changed, since it can just reread its stored sorted data). |
| |
| The plan tree concept implements SELECT directly: it is only necessary to |
| deliver the top-level result tuples to the client, or insert them into |
| another table in the case of INSERT ... SELECT. (INSERT ... VALUES is |
| handled similarly, but the plan tree is just a Result node with no source |
| tables.) For UPDATE, the plan tree selects the tuples that need to be |
| updated (WHERE condition) and delivers a new calculated tuple value for each |
| such tuple, plus a "junk" (hidden) tuple CTID identifying the target tuple. |
| The executor's top level then uses this information to update the correct |
| tuple. DELETE is similar to UPDATE except that only a CTID need be |
| delivered by the plan tree. |
| |
| XXX a great deal more documentation needs to be written here... |
| |
| |
| Plan Trees and State Trees |
| -------------------------- |
| |
| The plan tree delivered by the planner contains a tree of Plan nodes (struct |
| types derived from struct Plan). Each Plan node may have expression trees |
| associated with it, to represent its target list, qualification conditions, |
| etc. During executor startup we build a parallel tree of identical structure |
| containing executor state nodes --- every plan and expression node type has |
| a corresponding executor state node type. Each node in the state tree has a |
| pointer to its corresponding node in the plan tree, plus executor state data |
| as needed to implement that node type. This arrangement allows the plan |
| tree to be completely read-only as far as the executor is concerned: all data |
| that is modified during execution is in the state tree. Read-only plan trees |
| make life much simpler for plan caching and reuse. |
| |
| Altogether there are four classes of nodes used in these trees: Plan nodes, |
| their corresponding PlanState nodes, Expr nodes, and their corresponding |
| ExprState nodes. (Actually, there are also List nodes, which are used as |
| "glue" in all four kinds of tree.) |
| |
| |
| Memory Management |
| ----------------- |
| |
| A "per query" memory context is created during CreateExecutorState(); |
| all storage allocated during an executor invocation is allocated in that |
| context or a child context. This allows easy reclamation of storage |
| during executor shutdown --- rather than messing with retail pfree's and |
| probable storage leaks, we just destroy the memory context. |
| |
| In particular, the plan state trees and expression state trees described |
| in the previous section are allocated in the per-query memory context. |
| |
| To avoid intra-query memory leaks, most processing while a query runs |
| is done in "per tuple" memory contexts, which are so-called because they |
| are typically reset to empty once per tuple. Per-tuple contexts are usually |
| associated with ExprContexts, and commonly each PlanState node has its own |
| ExprContext to evaluate its qual and targetlist expressions in. |
| |
| |
| Query Processing Control Flow |
| ----------------------------- |
| |
| This is a sketch of control flow for full query processing: |
| |
| CreateQueryDesc |
| |
| ExecutorStart |
| CreateExecutorState |
| creates per-query context |
| switch to per-query context to run ExecInitNode |
| ExecInitNode --- recursively scans plan tree |
| CreateExprContext |
| creates per-tuple context |
| ExecInitExpr |
| |
| ExecutorRun |
| ExecProcNode --- recursively called in per-query context |
| ExecEvalExpr --- called in per-tuple context |
| ResetExprContext --- to free memory |
| |
| ExecutorEnd |
| ExecEndNode --- recursively releases resources |
| FreeExecutorState |
| frees per-query context and child contexts |
| |
| FreeQueryDesc |
| |
| Per above comments, it's not really critical for ExecEndNode to free any |
| memory; it'll all go away in FreeExecutorState anyway. However, we do need to |
| be careful to close relations, drop buffer pins, etc, so we do need to scan |
| the plan state tree to find these sorts of resources. |
| |
| |
| The executor can also be used to evaluate simple expressions without any Plan |
| tree ("simple" meaning "no aggregates and no sub-selects", though such might |
| be hidden inside function calls). This case has a flow of control like |
| |
| CreateExecutorState |
| creates per-query context |
| |
| CreateExprContext -- or use GetPerTupleExprContext(estate) |
| creates per-tuple context |
| |
| ExecPrepareExpr |
| temporarily switch to per-query context |
| run the expression through expression_planner |
| ExecInitExpr |
| |
| Repeatedly do: |
| ExecEvalExprSwitchContext |
| ExecEvalExpr --- called in per-tuple context |
| ResetExprContext --- to free memory |
| |
| FreeExecutorState |
| frees per-query context, as well as ExprContext |
| (a separate FreeExprContext call is not necessary) |
| |
| |
| EvalPlanQual (READ COMMITTED Update Checking) |
| --------------------------------------------- |
| |
| For simple SELECTs, the executor need only pay attention to tuples that are |
| valid according to the snapshot seen by the current transaction (ie, they |
| were inserted by a previously committed transaction, and not deleted by any |
| previously committed transaction). However, for UPDATE and DELETE it is not |
| cool to modify or delete a tuple that's been modified by an open or |
| concurrently-committed transaction. If we are running in SERIALIZABLE |
| isolation level then we just raise an error when this condition is seen to |
| occur. In READ COMMITTED isolation level, we must work a lot harder. |
| |
| The basic idea in READ COMMITTED mode is to take the modified tuple |
| committed by the concurrent transaction (after waiting for it to commit, |
| if need be) and re-evaluate the query qualifications to see if it would |
| still meet the quals. If so, we regenerate the updated tuple (if we are |
| doing an UPDATE) from the modified tuple, and finally update/delete the |
| modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its |
| action is just to lock the modified tuple. |
| |
| To implement this checking, we actually re-run the entire query from scratch |
| for each modified tuple, but with the scan node that sourced the original |
| tuple set to return only the modified tuple, not the original tuple or any |
| of the rest of the relation. If this query returns a tuple, then the |
| modified tuple passes the quals (and the query output is the suitably |
| modified update tuple, if we're doing UPDATE). If no tuple is returned, |
| then the modified tuple fails the quals, so we ignore it and continue the |
| original query. (This is reasonably efficient for simple queries, but may |
| be horribly slow for joins. A better design would be nice; one thought for |
| future investigation is to treat the tuple substitution like a parameter, |
| so that we can avoid rescanning unrelated nodes.) |
| |
| Note a fundamental bogosity of this approach: if the relation containing |
| the original tuple is being used in a self-join, the other instance(s) of |
| the relation will be treated as still containing the original tuple, whereas |
| logical consistency would demand that the modified tuple appear in them too. |
| But we'd have to actually substitute the modified tuple for the original, |
| while still returning all the rest of the relation, to ensure consistent |
| answers. Implementing this correctly is a task for future work. |
| |
| In UPDATE/DELETE, only the target relation needs to be handled this way, |
| so only one special recheck query needs to execute at a time. In SELECT FOR |
| UPDATE, there may be multiple relations flagged FOR UPDATE, so it's possible |
| that while we are executing a recheck query for one modified tuple, we will |
| hit another modified tuple in another relation. In this case we "stack up" |
| recheck queries: a sub-recheck query is spawned in which both the first and |
| second modified tuples will be returned as the only components of their |
| relations. (In event of success, all these modified tuples will be locked.) |
| Again, this isn't necessarily quite the right thing ... but in simple cases |
| it works. Potentially, recheck queries could get nested to the depth of the |
| number of FOR UPDATE/SHARE relations in the query. |
| |
| It should be noted also that UPDATE/DELETE expect at most one tuple to |
| result from the modified query, whereas in the FOR UPDATE case it's possible |
| for multiple tuples to result (since we could be dealing with a join in |
| which multiple tuples join to the modified tuple). We want FOR UPDATE to |
| lock all relevant tuples, so we pass all tuples output by all the stacked |
| recheck queries back to the executor toplevel for locking. |