src/backend/executor/README - hawq - Git at Google

 $PostgreSQL: pgsql/src/backend/executor/README,v 1.8 2009/01/09 15:46:10 tgl Exp $

 The Postgres Executor
 =====================

 The executor processes a tree of "plan nodes".  The plan tree is essentially
 a demand-pull pipeline of tuple processing operations.  Each node, when
 called, will produce the next tuple in its output sequence, or NULL if no
 more tuples are available.  If the node is not a primitive relation-scanning
 node, it will have child node(s) that it calls in turn to obtain input
 tuples.

 Refinements on this basic model include:

 * Choice of scan direction (forwards or backwards).  Caution: this is not
 currently well-supported.  It works for primitive scan nodes, but not very
 well for joins, aggregates, etc.

 * Rescan command to reset a node and make it generate its output sequence
 over again.

 * Parameters that can alter a node's results.  After adjusting a parameter,
 the rescan command must be applied to that node and all nodes above it.
 There is a moderately intelligent scheme to avoid rescanning nodes
 unnecessarily (for example, Sort does not rescan its input if no parameters
 of the input have changed, since it can just reread its stored sorted data).

 The plan tree concept implements SELECT directly: it is only necessary to
 deliver the top-level result tuples to the client, or insert them into
 another table in the case of INSERT ... SELECT.  (INSERT ... VALUES is
 handled similarly, but the plan tree is just a Result node with no source
 tables.)  For UPDATE, the plan tree selects the tuples that need to be
 updated (WHERE condition) and delivers a new calculated tuple value for each
 such tuple, plus a "junk" (hidden) tuple CTID identifying the target tuple.
 The executor's top level then uses this information to update the correct
 tuple.  DELETE is similar to UPDATE except that only a CTID need be
 delivered by the plan tree.

 XXX a great deal more documentation needs to be written here...


 Plan Trees and State Trees
 --------------------------

 The plan tree delivered by the planner contains a tree of Plan nodes (struct
 types derived from struct Plan).  Each Plan node may have expression trees
 associated with it, to represent its target list, qualification conditions,
 etc.  During executor startup we build a parallel tree of identical structure
 containing executor state nodes --- every plan and expression node type has
 a corresponding executor state node type.  Each node in the state tree has a
 pointer to its corresponding node in the plan tree, plus executor state data
 as needed to implement that node type.  This arrangement allows the plan
 tree to be completely read-only as far as the executor is concerned: all data
 that is modified during execution is in the state tree.  Read-only plan trees
 make life much simpler for plan caching and reuse.

 Altogether there are four classes of nodes used in these trees: Plan nodes,
 their corresponding PlanState nodes, Expr nodes, and their corresponding
 ExprState nodes.  (Actually, there are also List nodes, which are used as
 "glue" in all four kinds of tree.)


 Memory Management
 -----------------

 A "per query" memory context is created during CreateExecutorState();
 all storage allocated during an executor invocation is allocated in that
 context or a child context.  This allows easy reclamation of storage
 during executor shutdown --- rather than messing with retail pfree's and
 probable storage leaks, we just destroy the memory context.

 In particular, the plan state trees and expression state trees described
 in the previous section are allocated in the per-query memory context.

 To avoid intra-query memory leaks, most processing while a query runs
 is done in "per tuple" memory contexts, which are so-called because they
 are typically reset to empty once per tuple.  Per-tuple contexts are usually
 associated with ExprContexts, and commonly each PlanState node has its own
 ExprContext to evaluate its qual and targetlist expressions in.


 Query Processing Control Flow
 -----------------------------

 This is a sketch of control flow for full query processing:

 	CreateQueryDesc

 	ExecutorStart
 		CreateExecutorState
 			creates per-query context
 		switch to per-query context to run ExecInitNode
 		ExecInitNode --- recursively scans plan tree
 			CreateExprContext
 				creates per-tuple context
 			ExecInitExpr

 	ExecutorRun
 		ExecProcNode --- recursively called in per-query context
 			ExecEvalExpr --- called in per-tuple context
 			ResetExprContext --- to free memory

 	ExecutorEnd
 		ExecEndNode --- recursively releases resources
 		FreeExecutorState
 			frees per-query context and child contexts

 	FreeQueryDesc

 Per above comments, it's not really critical for ExecEndNode to free any
 memory; it'll all go away in FreeExecutorState anyway.  However, we do need to
 be careful to close relations, drop buffer pins, etc, so we do need to scan
 the plan state tree to find these sorts of resources.


 The executor can also be used to evaluate simple expressions without any Plan
 tree ("simple" meaning "no aggregates and no sub-selects", though such might
 be hidden inside function calls).  This case has a flow of control like

 	CreateExecutorState
 		creates per-query context

 	CreateExprContext	-- or use GetPerTupleExprContext(estate)
 		creates per-tuple context

 	ExecPrepareExpr
 		temporarily switch to per-query context
 		run the expression through expression_planner
 		ExecInitExpr

 	Repeatedly do:
 		ExecEvalExprSwitchContext
 			ExecEvalExpr --- called in per-tuple context
 		ResetExprContext --- to free memory

 	FreeExecutorState
 		frees per-query context, as well as ExprContext
 		(a separate FreeExprContext call is not necessary)


 EvalPlanQual (READ COMMITTED Update Checking)
 ---------------------------------------------

 For simple SELECTs, the executor need only pay attention to tuples that are
 valid according to the snapshot seen by the current transaction (ie, they
 were inserted by a previously committed transaction, and not deleted by any
 previously committed transaction).  However, for UPDATE and DELETE it is not
 cool to modify or delete a tuple that's been modified by an open or
 concurrently-committed transaction.  If we are running in SERIALIZABLE
 isolation level then we just raise an error when this condition is seen to
 occur.  In READ COMMITTED isolation level, we must work a lot harder.

 The basic idea in READ COMMITTED mode is to take the modified tuple
 committed by the concurrent transaction (after waiting for it to commit,
 if need be) and re-evaluate the query qualifications to see if it would
 still meet the quals.  If so, we regenerate the updated tuple (if we are
 doing an UPDATE) from the modified tuple, and finally update/delete the
 modified tuple.  SELECT FOR UPDATE/SHARE behaves similarly, except that its
 action is just to lock the modified tuple.

 To implement this checking, we actually re-run the entire query from scratch
 for each modified tuple, but with the scan node that sourced the original
 tuple set to return only the modified tuple, not the original tuple or any
 of the rest of the relation.  If this query returns a tuple, then the
 modified tuple passes the quals (and the query output is the suitably
 modified update tuple, if we're doing UPDATE).  If no tuple is returned,
 then the modified tuple fails the quals, so we ignore it and continue the
 original query.  (This is reasonably efficient for simple queries, but may
 be horribly slow for joins.  A better design would be nice; one thought for
 future investigation is to treat the tuple substitution like a parameter,
 so that we can avoid rescanning unrelated nodes.)

 Note a fundamental bogosity of this approach: if the relation containing
 the original tuple is being used in a self-join, the other instance(s) of
 the relation will be treated as still containing the original tuple, whereas
 logical consistency would demand that the modified tuple appear in them too.
 But we'd have to actually substitute the modified tuple for the original,
 while still returning all the rest of the relation, to ensure consistent
 answers.  Implementing this correctly is a task for future work.

 In UPDATE/DELETE, only the target relation needs to be handled this way,
 so only one special recheck query needs to execute at a time.  In SELECT FOR
 UPDATE, there may be multiple relations flagged FOR UPDATE, so it's possible
 that while we are executing a recheck query for one modified tuple, we will
 hit another modified tuple in another relation.  In this case we "stack up"
 recheck queries: a sub-recheck query is spawned in which both the first and
 second modified tuples will be returned as the only components of their
 relations.  (In event of success, all these modified tuples will be locked.)
 Again, this isn't necessarily quite the right thing ... but in simple cases
 it works.  Potentially, recheck queries could get nested to the depth of the
 number of FOR UPDATE/SHARE relations in the query.

 It should be noted also that UPDATE/DELETE expect at most one tuple to
 result from the modified query, whereas in the FOR UPDATE case it's possible
 for multiple tuples to result (since we could be dealing with a join in
 which multiple tuples join to the modified tuple).  We want FOR UPDATE to
 lock all relevant tuples, so we pass all tuples output by all the stacked
 recheck queries back to the executor toplevel for locking.
	$PostgreSQL: pgsql/src/backend/executor/README,v 1.8 2009/01/09 15:46:10 tgl Exp $

	The Postgres Executor
	=====================

	The executor processes a tree of "plan nodes". The plan tree is essentially
	a demand-pull pipeline of tuple processing operations. Each node, when
	called, will produce the next tuple in its output sequence, or NULL if no
	more tuples are available. If the node is not a primitive relation-scanning
	node, it will have child node(s) that it calls in turn to obtain input
	tuples.

	Refinements on this basic model include:

	* Choice of scan direction (forwards or backwards). Caution: this is not
	currently well-supported. It works for primitive scan nodes, but not very
	well for joins, aggregates, etc.

	* Rescan command to reset a node and make it generate its output sequence
	over again.

	* Parameters that can alter a node's results. After adjusting a parameter,
	the rescan command must be applied to that node and all nodes above it.
	There is a moderately intelligent scheme to avoid rescanning nodes
	unnecessarily (for example, Sort does not rescan its input if no parameters
	of the input have changed, since it can just reread its stored sorted data).

	The plan tree concept implements SELECT directly: it is only necessary to
	deliver the top-level result tuples to the client, or insert them into
	another table in the case of INSERT ... SELECT. (INSERT ... VALUES is
	handled similarly, but the plan tree is just a Result node with no source
	tables.) For UPDATE, the plan tree selects the tuples that need to be
	updated (WHERE condition) and delivers a new calculated tuple value for each
	such tuple, plus a "junk" (hidden) tuple CTID identifying the target tuple.
	The executor's top level then uses this information to update the correct
	tuple. DELETE is similar to UPDATE except that only a CTID need be
	delivered by the plan tree.

	XXX a great deal more documentation needs to be written here...


	Plan Trees and State Trees
	--------------------------

	The plan tree delivered by the planner contains a tree of Plan nodes (struct
	types derived from struct Plan). Each Plan node may have expression trees
	associated with it, to represent its target list, qualification conditions,
	etc. During executor startup we build a parallel tree of identical structure
	containing executor state nodes --- every plan and expression node type has
	a corresponding executor state node type. Each node in the state tree has a
	pointer to its corresponding node in the plan tree, plus executor state data
	as needed to implement that node type. This arrangement allows the plan
	tree to be completely read-only as far as the executor is concerned: all data
	that is modified during execution is in the state tree. Read-only plan trees
	make life much simpler for plan caching and reuse.

	Altogether there are four classes of nodes used in these trees: Plan nodes,
	their corresponding PlanState nodes, Expr nodes, and their corresponding
	ExprState nodes. (Actually, there are also List nodes, which are used as
	"glue" in all four kinds of tree.)


	Memory Management
	-----------------

	A "per query" memory context is created during CreateExecutorState();
	all storage allocated during an executor invocation is allocated in that
	context or a child context. This allows easy reclamation of storage
	during executor shutdown --- rather than messing with retail pfree's and
	probable storage leaks, we just destroy the memory context.

	In particular, the plan state trees and expression state trees described
	in the previous section are allocated in the per-query memory context.

	To avoid intra-query memory leaks, most processing while a query runs
	is done in "per tuple" memory contexts, which are so-called because they
	are typically reset to empty once per tuple. Per-tuple contexts are usually
	associated with ExprContexts, and commonly each PlanState node has its own
	ExprContext to evaluate its qual and targetlist expressions in.


	Query Processing Control Flow
	-----------------------------

	This is a sketch of control flow for full query processing:

	CreateQueryDesc

	ExecutorStart
	CreateExecutorState
	creates per-query context
	switch to per-query context to run ExecInitNode
	ExecInitNode --- recursively scans plan tree
	CreateExprContext
	creates per-tuple context
	ExecInitExpr

	ExecutorRun
	ExecProcNode --- recursively called in per-query context
	ExecEvalExpr --- called in per-tuple context
	ResetExprContext --- to free memory

	ExecutorEnd
	ExecEndNode --- recursively releases resources
	FreeExecutorState
	frees per-query context and child contexts

	FreeQueryDesc

	Per above comments, it's not really critical for ExecEndNode to free any
	memory; it'll all go away in FreeExecutorState anyway. However, we do need to
	be careful to close relations, drop buffer pins, etc, so we do need to scan
	the plan state tree to find these sorts of resources.


	The executor can also be used to evaluate simple expressions without any Plan
	tree ("simple" meaning "no aggregates and no sub-selects", though such might
	be hidden inside function calls). This case has a flow of control like

	CreateExecutorState
	creates per-query context

	CreateExprContext -- or use GetPerTupleExprContext(estate)
	creates per-tuple context

	ExecPrepareExpr
	temporarily switch to per-query context
	run the expression through expression_planner
	ExecInitExpr

	Repeatedly do:
	ExecEvalExprSwitchContext
	ExecEvalExpr --- called in per-tuple context
	ResetExprContext --- to free memory

	FreeExecutorState
	frees per-query context, as well as ExprContext
	(a separate FreeExprContext call is not necessary)


	EvalPlanQual (READ COMMITTED Update Checking)
	---------------------------------------------

	For simple SELECTs, the executor need only pay attention to tuples that are
	valid according to the snapshot seen by the current transaction (ie, they
	were inserted by a previously committed transaction, and not deleted by any
	previously committed transaction). However, for UPDATE and DELETE it is not
	cool to modify or delete a tuple that's been modified by an open or
	concurrently-committed transaction. If we are running in SERIALIZABLE
	isolation level then we just raise an error when this condition is seen to
	occur. In READ COMMITTED isolation level, we must work a lot harder.

	The basic idea in READ COMMITTED mode is to take the modified tuple
	committed by the concurrent transaction (after waiting for it to commit,
	if need be) and re-evaluate the query qualifications to see if it would
	still meet the quals. If so, we regenerate the updated tuple (if we are
	doing an UPDATE) from the modified tuple, and finally update/delete the
	modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its
	action is just to lock the modified tuple.

	To implement this checking, we actually re-run the entire query from scratch
	for each modified tuple, but with the scan node that sourced the original
	tuple set to return only the modified tuple, not the original tuple or any
	of the rest of the relation. If this query returns a tuple, then the
	modified tuple passes the quals (and the query output is the suitably
	modified update tuple, if we're doing UPDATE). If no tuple is returned,
	then the modified tuple fails the quals, so we ignore it and continue the
	original query. (This is reasonably efficient for simple queries, but may
	be horribly slow for joins. A better design would be nice; one thought for
	future investigation is to treat the tuple substitution like a parameter,
	so that we can avoid rescanning unrelated nodes.)

	Note a fundamental bogosity of this approach: if the relation containing
	the original tuple is being used in a self-join, the other instance(s) of
	the relation will be treated as still containing the original tuple, whereas
	logical consistency would demand that the modified tuple appear in them too.
	But we'd have to actually substitute the modified tuple for the original,
	while still returning all the rest of the relation, to ensure consistent
	answers. Implementing this correctly is a task for future work.

	In UPDATE/DELETE, only the target relation needs to be handled this way,
	so only one special recheck query needs to execute at a time. In SELECT FOR
	UPDATE, there may be multiple relations flagged FOR UPDATE, so it's possible
	that while we are executing a recheck query for one modified tuple, we will
	hit another modified tuple in another relation. In this case we "stack up"
	recheck queries: a sub-recheck query is spawned in which both the first and
	second modified tuples will be returned as the only components of their
	relations. (In event of success, all these modified tuples will be locked.)
	Again, this isn't necessarily quite the right thing ... but in simple cases
	it works. Potentially, recheck queries could get nested to the depth of the
	number of FOR UPDATE/SHARE relations in the query.

	It should be noted also that UPDATE/DELETE expect at most one tuple to
	result from the modified query, whereas in the FOR UPDATE case it's possible
	for multiple tuples to result (since we could be dealing with a join in
	which multiple tuples join to the modified tuple). We want FOR UPDATE to
	lock all relevant tuples, so we pass all tuples output by all the stacked
	recheck queries back to the executor toplevel for locking.