| src/backend/executor/README |
| |
| The Postgres Executor |
| ===================== |
| |
| The executor processes a tree of "plan nodes". The plan tree is essentially |
| a demand-pull pipeline of tuple processing operations. Each node, when |
| called, will produce the next tuple in its output sequence, or NULL if no |
| more tuples are available. If the node is not a primitive relation-scanning |
| node, it will have child node(s) that it calls in turn to obtain input |
| tuples. |
| |
| Refinements on this basic model include: |
| |
| * Choice of scan direction (forwards or backwards). Caution: this is not |
| currently well-supported. It works for primitive scan nodes, but not very |
| well for joins, aggregates, etc. |
| |
| * Rescan command to reset a node and make it generate its output sequence |
| over again. |
| |
| * Parameters that can alter a node's results. After adjusting a parameter, |
| the rescan command must be applied to that node and all nodes above it. |
| There is a moderately intelligent scheme to avoid rescanning nodes |
| unnecessarily (for example, Sort does not rescan its input if no parameters |
| of the input have changed, since it can just reread its stored sorted data). |
| |
| For a SELECT, it is only necessary to deliver the top-level result tuples |
| to the client. For INSERT/UPDATE/DELETE, the actual table modification |
| operations happen in a top-level ModifyTable plan node. If the query |
| includes a RETURNING clause, the ModifyTable node delivers the computed |
| RETURNING rows as output, otherwise it returns nothing. Handling INSERT |
| is pretty straightforward: the tuples returned from the plan tree below |
| ModifyTable are inserted into the correct result relation. For UPDATE, |
| the plan tree returns the new values of the updated columns, plus "junk" |
| (hidden) column(s) identifying which table row is to be updated. The |
| ModifyTable node must fetch that row to extract values for the unchanged |
| columns, combine the values into a new row, and apply the update. (For a |
| heap table, the row-identity junk column is a CTID, but other things may |
| be used for other table types.) For DELETE, the plan tree need only deliver |
| junk row-identity column(s), and the ModifyTable node visits each of those |
| rows and marks the row deleted. |
| |
| XXX a great deal more documentation needs to be written here... |
| |
| |
| Plan Trees and State Trees |
| -------------------------- |
| |
| The plan tree delivered by the planner contains a tree of Plan nodes (struct |
| types derived from struct Plan). During executor startup we build a parallel |
| tree of identical structure containing executor state nodes --- generally, |
| every plan node type has a corresponding executor state node type. Each node |
| in the state tree has a pointer to its corresponding node in the plan tree, |
| plus executor state data as needed to implement that node type. This |
| arrangement allows the plan tree to be completely read-only so far as the |
| executor is concerned: all data that is modified during execution is in the |
| state tree. Read-only plan trees make life much simpler for plan caching and |
| reuse. |
| |
| A corresponding executor state node may not be created during executor startup |
| if the executor determines that an entire subplan is not required due to |
| execution time partition pruning determining that no matching records will be |
| found there. This currently only occurs for Append and MergeAppend nodes. In |
| this case the non-required subplans are ignored and the executor state's |
| subnode array will become out of sequence to the plan's subplan list. |
| |
| Each Plan node may have expression trees associated with it, to represent |
| its target list, qualification conditions, etc. These trees are also |
| read-only to the executor, but the executor state for expression evaluation |
| does not mirror the Plan expression's tree shape, as explained below. |
| Rather, there's just one ExprState node per expression tree, although this |
| may have sub-nodes for some complex expression node types. |
| |
| Altogether there are four classes of nodes used in these trees: Plan nodes, |
| their corresponding PlanState nodes, Expr nodes, and ExprState nodes. |
| (Actually, there are also List nodes, which are used as "glue" in all |
| three tree-based representations.) |
| |
| |
| Expression Trees and ExprState nodes |
| ------------------------------------ |
| |
| Expression trees, in contrast to Plan trees, are not mirrored into a |
| corresponding tree of state nodes. Instead each separately executable |
| expression tree (e.g. a Plan's qual or targetlist) is represented by one |
| ExprState node. The ExprState node contains the information needed to |
| evaluate the expression in a compact, linear form. That compact form is |
| stored as a flat array in ExprState->steps[] (an array of ExprEvalStep, |
| not ExprEvalStep *). |
| |
| The reasons for choosing such a representation include: |
| - commonly the amount of work needed to evaluate one Expr-type node is |
| small enough that the overhead of having to perform a tree-walk |
| during evaluation is significant. |
| - the flat representation can be evaluated non-recursively within a single |
| function, reducing stack depth and function call overhead. |
| - such a representation is usable both for fast interpreted execution, |
| and for compiling into native code. |
| |
| The Plan-tree representation of an expression is compiled into an |
| ExprState node by ExecInitExpr(). As much complexity as possible should |
| be handled by ExecInitExpr() (and helpers), instead of execution time |
| where both interpreted and compiled versions would need to deal with the |
| complexity. Besides duplicating effort between execution approaches, |
| runtime initialization checks also have a small but noticeable cost every |
| time the expression is evaluated. Therefore, we allow ExecInitExpr() to |
| precompute information that we do not expect to vary across execution of a |
| single query, for example the set of CHECK constraint expressions to be |
| applied to a domain type. This could not be done at plan time without |
| greatly increasing the number of events that require plan invalidation. |
| (Previously, some information of this kind was rechecked on each |
| expression evaluation, but that seems like unnecessary overhead.) |
| |
| |
| Expression Initialization |
| ------------------------- |
| |
| During ExecInitExpr() and similar routines, Expr trees are converted |
| into the flat representation. Each Expr node might be represented by |
| zero, one, or more ExprEvalSteps. |
| |
| Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp) |
| and it stores the result of its work into the Datum variable and boolean |
| null flag variable pointed to by ExprEvalStep->resvalue/resnull. |
| Complex expressions are performed by chaining together several steps. |
| For example, "a + b" (one OpExpr, with two Var expressions) would be |
| represented as two steps to fetch the Var values, and one step for the |
| evaluation of the function underlying the + operator. The steps for the |
| Vars would have their resvalue/resnull pointing directly to the appropriate |
| args[].value .isnull elements in the FunctionCallInfoBaseData struct that |
| is used by the function evaluation step, thus avoiding extra work to copy |
| the result values around. |
| |
| The last entry in a completed ExprState->steps array is always an |
| EEOP_DONE step; this removes the need to test for end-of-array while |
| iterating. Also, if the expression contains any variable references (to |
| user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps |
| array begins with EEOP_*_FETCHSOME steps that ensure that the relevant |
| tuples have been deconstructed to make the required columns directly |
| available (cf. slot_getsomeattrs()). This allows individual Var-fetching |
| steps to be little more than an array lookup. |
| |
| Most of ExecInitExpr()'s work is done by the recursive function |
| ExecInitExprRec() and its subroutines. ExecInitExprRec() maps one Expr |
| node into the steps required for execution, recursing as needed for |
| sub-expressions. |
| |
| Each ExecInitExprRec() call has to specify where that subexpression's |
| results are to be stored (via the resv/resnull parameters). This allows |
| the above scenario of evaluating a (sub-)expression directly into |
| fcinfo->args[].value/isnull, but also requires some care: target Datum/isnull |
| variables may not be shared with another ExecInitExprRec() unless the |
| results are only needed by steps executing before further usages of those |
| target Datum/isnull variables. Due to the non-recursiveness of the |
| ExprEvalStep representation that's usually easy to guarantee. |
| |
| ExecInitExprRec() pushes new operations into the ExprState->steps array |
| using ExprEvalPushStep(). To keep the steps as a consecutively laid out |
| array, ExprEvalPushStep() has to repalloc the entire array when there's |
| not enough space. Because of that it is *not* allowed to point directly |
| into any of the steps during expression initialization. Therefore, the |
| resv/resnull for a subexpression usually point to some storage that is |
| palloc'd separately from the steps array. For instance, the |
| FunctionCallInfoBaseData for a function call step is separately allocated |
| rather than being part of the ExprEvalStep array. The overall result |
| of a complete expression is typically returned into the resvalue/resnull |
| fields of the ExprState node itself. |
| |
| Some steps, e.g. boolean expressions, allow skipping evaluation of |
| certain subexpressions. In the flat representation this amounts to |
| jumping to some later step rather than just continuing consecutively |
| with the next step. The target for such a jump is represented by |
| the integer index in the ExprState->steps array of the step to execute |
| next. (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.) |
| |
| Typically, ExecInitExprRec() has to push a jumping step into the steps |
| array, then recursively generate steps for the subexpression that might |
| get skipped over, then go back and fix up the jump target index using |
| the now-known length of the subexpression's steps. This is handled by |
| adjust_jumps lists in execExpr.c. |
| |
| The last step in constructing an ExprState is to apply ExecReadyExpr(), |
| which readies it for execution using whichever execution method has been |
| selected. |
| |
| |
| Expression Evaluation |
| --------------------- |
| |
| To allow for different methods of expression evaluation, and for |
| better branch/jump target prediction, expressions are evaluated by |
| calling ExprState->evalfunc (via ExecEvalExpr() and friends). |
| |
| ExecReadyExpr() can choose the method of interpretation by setting |
| evalfunc to an appropriate function. The default execution function, |
| ExecInterpExpr, is implemented in execExprInterp.c; see its header |
| comment for details. Special-case evalfuncs are used for certain |
| especially-simple expressions. |
| |
| Note that a lot of the more complex expression evaluation steps, which are |
| less performance-critical than the simpler ones, are implemented as |
| separate functions outside the fast-path of expression execution, allowing |
| their implementation to be shared between interpreted and compiled |
| expression evaluation. This means that these helper functions are not |
| allowed to perform expression step dispatch themselves, as the method of |
| dispatch will vary based on the caller. The helpers therefore cannot call |
| for the execution of subexpressions; all subexpression results they need |
| must be computed by earlier steps. And dispatch to the following |
| expression step must be performed after returning from the helper. |
| |
| |
| Targetlist Evaluation |
| --------------------- |
| |
| ExecBuildProjectionInfo builds an ExprState that has the effect of |
| evaluating a targetlist into ExprState->resultslot. A generic targetlist |
| expression is executed by evaluating it as discussed above (storing the |
| result into the ExprState's resvalue/resnull fields) and then using an |
| EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[] |
| and tts_isnull[] array elements of the result slot. There are special |
| fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that |
| are simple Vars using only one step instead of two. |
| |
| |
| Memory Management |
| ----------------- |
| |
| A "per query" memory context is created during CreateExecutorState(); |
| all storage allocated during an executor invocation is allocated in that |
| context or a child context. This allows easy reclamation of storage |
| during executor shutdown --- rather than messing with retail pfree's and |
| probable storage leaks, we just destroy the memory context. |
| |
| In particular, the plan state trees and expression state trees described |
| in the previous section are allocated in the per-query memory context. |
| |
| To avoid intra-query memory leaks, most processing while a query runs |
| is done in "per tuple" memory contexts, which are so-called because they |
| are typically reset to empty once per tuple. Per-tuple contexts are usually |
| associated with ExprContexts, and commonly each PlanState node has its own |
| ExprContext to evaluate its qual and targetlist expressions in. |
| |
| |
| Query Processing Control Flow |
| ----------------------------- |
| |
| This is a sketch of control flow for full query processing: |
| |
| CreateQueryDesc |
| |
| ExecutorStart |
| CreateExecutorState |
| creates per-query context |
| switch to per-query context to run ExecInitNode |
| AfterTriggerBeginQuery |
| ExecInitNode --- recursively scans plan tree |
| ExecInitNode |
| recurse into subsidiary nodes |
| CreateExprContext |
| creates per-tuple context |
| ExecInitExpr |
| |
| ExecutorRun |
| ExecProcNode --- recursively called in per-query context |
| ExecEvalExpr --- called in per-tuple context |
| ResetExprContext --- to free memory |
| |
| ExecutorFinish |
| ExecPostprocessPlan --- run any unfinished ModifyTable nodes |
| AfterTriggerEndQuery |
| |
| ExecutorEnd |
| ExecEndNode --- recursively releases resources |
| FreeExecutorState |
| frees per-query context and child contexts |
| |
| FreeQueryDesc |
| |
| Per above comments, it's not really critical for ExecEndNode to free any |
| memory; it'll all go away in FreeExecutorState anyway. However, we do need to |
| be careful to close relations, drop buffer pins, etc, so we do need to scan |
| the plan state tree to find these sorts of resources. |
| |
| |
| The executor can also be used to evaluate simple expressions without any Plan |
| tree ("simple" meaning "no aggregates and no sub-selects", though such might |
| be hidden inside function calls). This case has a flow of control like |
| |
| CreateExecutorState |
| creates per-query context |
| |
| CreateExprContext -- or use GetPerTupleExprContext(estate) |
| creates per-tuple context |
| |
| ExecPrepareExpr |
| temporarily switch to per-query context |
| run the expression through expression_planner |
| ExecInitExpr |
| |
| Repeatedly do: |
| ExecEvalExprSwitchContext |
| ExecEvalExpr --- called in per-tuple context |
| ResetExprContext --- to free memory |
| |
| FreeExecutorState |
| frees per-query context, as well as ExprContext |
| (a separate FreeExprContext call is not necessary) |
| |
| |
| EvalPlanQual (READ COMMITTED Update Checking) |
| --------------------------------------------- |
| |
| For simple SELECTs, the executor need only pay attention to tuples that are |
| valid according to the snapshot seen by the current transaction (ie, they |
| were inserted by a previously committed transaction, and not deleted by any |
| previously committed transaction). However, for UPDATE and DELETE it is not |
| cool to modify or delete a tuple that's been modified by an open or |
| concurrently-committed transaction. If we are running in SERIALIZABLE |
| isolation level then we just raise an error when this condition is seen to |
| occur. In READ COMMITTED isolation level, we must work a lot harder. |
| |
| The basic idea in READ COMMITTED mode is to take the modified tuple |
| committed by the concurrent transaction (after waiting for it to commit, |
| if need be) and re-evaluate the query qualifications to see if it would |
| still meet the quals. If so, we regenerate the updated tuple (if we are |
| doing an UPDATE) from the modified tuple, and finally update/delete the |
| modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its |
| action is just to lock the modified tuple and return results based on that |
| version of the tuple. |
| |
| To implement this checking, we actually re-run the query from scratch for |
| each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the |
| relation scan nodes tweaked to return only the current tuples --- either |
| the original ones, or the updated (and now locked) versions of the modified |
| tuple(s). If this query returns a tuple, then the modified tuple(s) pass |
| the quals (and the query output is the suitably modified update tuple, if |
| we're doing UPDATE). If no tuple is returned, then the modified tuple(s) |
| fail the quals, so we ignore the current result tuple and continue the |
| original query. |
| |
| In UPDATE/DELETE, only the target relation needs to be handled this way. |
| In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE, |
| so we obtain lock on the current tuple version in each such relation before |
| executing the recheck. |
| |
| It is also possible that there are relations in the query that are not |
| to be locked (they are neither the UPDATE/DELETE target nor specified to |
| be locked in SELECT FOR UPDATE/SHARE). When re-running the test query |
| we want to use the same rows from these relations that were joined to |
| the locked rows. For ordinary relations this can be implemented relatively |
| cheaply by including the row TID in the join outputs and re-fetching that |
| TID. (The re-fetch is expensive, but we're trying to optimize the normal |
| case where no re-test is needed.) We have also to consider non-table |
| relations, such as a ValuesScan or FunctionScan. For these, since there |
| is no equivalent of TID, the only practical solution seems to be to include |
| the entire row value in the join output row. |
| |
| We disallow set-returning functions in the targetlist of SELECT FOR UPDATE, |
| so as to ensure that at most one tuple can be returned for any particular |
| set of scan tuples. Otherwise we'd get duplicates due to the original |
| query returning the same set of scan tuples multiple times. Likewise, |
| SRFs are disallowed in an UPDATE's targetlist. There, they would have the |
| effect of the same row being updated multiple times, which is not very |
| useful --- and updates after the first would have no effect anyway. |
| |
| |
| Asynchronous Execution |
| ---------------------- |
| |
| In cases where a node is waiting on an event external to the database system, |
| such as a ForeignScan awaiting network I/O, it's desirable for the node to |
| indicate that it cannot return any tuple immediately but may be able to do so |
| at a later time. A process which discovers this type of situation can always |
| handle it simply by blocking, but this may waste time that could be spent |
| executing some other part of the plan tree where progress could be made |
| immediately. This is particularly likely to occur when the plan tree contains |
| an Append node. Asynchronous execution runs multiple parts of an Append node |
| concurrently rather than serially to improve performance. |
| |
| For asynchronous execution, an Append node must first request a tuple from an |
| async-capable child node using ExecAsyncRequest. Next, it must execute the |
| asynchronous event loop using ExecAppendAsyncEventWait. Eventually, when a |
| child node to which an asynchronous request has been made produces a tuple, |
| the Append node will receive it from the event loop via ExecAsyncResponse. In |
| the current implementation of asynchronous execution, the only node type that |
| requests tuples from an async-capable child node is an Append, while the only |
| node type that might be async-capable is a ForeignScan. |
| |
| Typically, the ExecAsyncResponse callback is the only one required for nodes |
| that wish to request tuples asynchronously. On the other hand, async-capable |
| nodes generally need to implement three methods: |
| |
| 1. When an asynchronous request is made, the node's ExecAsyncRequest callback |
| will be invoked; it should use ExecAsyncRequestPending to indicate that the |
| request is pending for a callback described below. Alternatively, it can |
| instead use ExecAsyncRequestDone if a result is available immediately. |
| |
| 2. When the event loop wishes to wait or poll for file descriptor events, the |
| node's ExecAsyncConfigureWait callback will be invoked to configure the |
| file descriptor event for which the node wishes to wait. |
| |
| 3. When the file descriptor becomes ready, the node's ExecAsyncNotify callback |
| will be invoked; like #1, it should use ExecAsyncRequestPending for another |
| callback or ExecAsyncRequestDone to return a result immediately. |