 $PostgreSQL$

 Resource Scheduling
 ===================

 Overview
 --------

 The resource scheduler is an attempt to provide improved control of query
 resource consumption. It is primarily aimed at DSS workloads, but is amenable
 to more traditional reporting style workloads too.

 -----------------------------------------------------------------------------

 Data Structures
 ---------------

 The basic design makes use of data structures from the standard lock
 implementation, but adapts them to handle the differing semantics required
 for tracking transient resources.

 A pseudo E-R diagram for the various resource scheduling objects is:


 role >---- resource queue ----< limits
               |
               |
             queue  ----< limit thresholds, limit values
               |
               |
            lock ----- proclock -----------------+
                         |                       |
                         |                       ^
                       proc ---< portal ----< portal increment


 The objects that are additional to the standard lock system are described
 below:

 Limits -
     These describe the resource itself. Each limit contains a threshold, a
     counter for current utilization and a type.

 Queues -
     A queue consists of a set of limits. It also has some attributes of its
     own - a flag for whether queries whose cost is > the cost limit will be
     allowed to run (under special circumstances) and a special limit that
     allows queries with a small cost to avoid being locked at all.
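
 The relationship between limits and queues described above can be sketched
 in C as follows. This is only an illustration: every name here (SketchLimit,
 SketchQueue, sketch_queue_create, sketch_queue_limit) is hypothetical, and
 the real scheduler structures are laid out differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; the real scheduler structs differ. */
typedef enum { LIMIT_ACTIVE_COUNT, LIMIT_COST, NUM_LIMIT_TYPES } LimitType;

typedef struct
{
    LimitType type;       /* which resource this limit tracks */
    double    threshold;  /* maximum allowed utilization */
    double    current;    /* counter for current utilization */
} SketchLimit;

typedef struct
{
    SketchLimit limits[NUM_LIMIT_TYPES]; /* the set of limits */
    bool        overcommit;              /* may over-cost queries run? */
    double      ignore_limit;            /* cheaper queries skip locking */
} SketchQueue;

/* Build an idle queue: thresholds set, all counters at zero. */
static SketchQueue
sketch_queue_create(double active_threshold, double cost_threshold,
                    double ignore_limit, bool overcommit)
{
    SketchQueue q;

    q.limits[LIMIT_ACTIVE_COUNT] =
        (SketchLimit) {LIMIT_ACTIVE_COUNT, active_threshold, 0.0};
    q.limits[LIMIT_COST] =
        (SketchLimit) {LIMIT_COST, cost_threshold, 0.0};
    q.overcommit = overcommit;
    q.ignore_limit = ignore_limit;
    return q;
}

/* Look up a limit by type; returns NULL for an out-of-range type. */
static SketchLimit *
sketch_queue_limit(SketchQueue *q, LimitType type)
{
    if ((int) type < 0 || (int) type >= NUM_LIMIT_TYPES)
        return NULL;
    return &q->limits[type];
}
```

 A caller would create the queue once and then consult individual limits by
 type, for example sketch_queue_limit(&q, LIMIT_COST)->threshold.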


 Resource Lock -
     The lock related information for a queue. Each queue has one and only
     one lock.

 Portal Increment -
     The resource related information for a statement. This consists of the
     process id, portal id and the set of additional changes to the limit
     counters that will be applied should this portal execute. A portal that
     is managed by the scheduler [1] *always* has increments - portals for
     queries that are not going to be locked have portalId set to InvalidOid
     (either because the owning role is not in a queue or the query has cost
     smaller than the queue's "ignore" limit - see Queues above).

 Proclock -
      Some changes have been made to the proclock structure. These should be
      backed out, and a new proclock-like structure (similar to locallock)
      should be used to track the additional elements (TODO).

 Portal -
      Additional fields have been added to the portaldata structures: an id
      for indexing, the original nodetag of the query, the queue oid that
      is locking the portal and a flag for whether to release the resource
      lock at portal destruction time.

 -----------------------------------------------------------------------------

 Functions
 ---------

 The basic API consists of:

     ResLockPortal(Portal portal, PlannedStmt *stmt);
     ResUnLockPortal(Portal portal);

 which are inserted into PortalStart and PortalDrop respectively. In PortalStart
 the ResLockPortal call is made after planning is completed, so there is access
 to all elements of the plan structure, thus controlling resources on the basis
 of cost is possible.

 As the resource lock is taken after parse/analyze most standard locks have
 been already acquired before the resource lock is (tuple level locks are the
 exception).

 The internal routines are named similarly to the standard lock equivalents:

     ResLockAcquire(LOCKTAG *locktag, ResPortalIncrement *incrementSet);
     ResLockRelease(LOCKTAG *locktag, uint32 resPortalId);

 They require slightly different arguments, mainly to supply the various
 increments to the replacement for the Conflict routines:

     ResLockCheckLimit(LOCK *lock, RESPROCLOCK *proclock,
                       ResPortalIncrement *incrementSet,
                       bool increment);

 This function encapsulates the fact that resource lock conflict is not
 deterministic, but depends on whether current limit counter values are at their
 thresholds. If there is no conflict, a statement's increments are added to (or
 subtracted from) the shared counters by:

     ResLockUpdateLimit(LOCK *lock, RESPROCLOCK *proclock,
                       ResPortalIncrement *incrementSet,
                       bool increment);
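
 The check-then-update pattern described above can be sketched as follows.
 This is a simplified illustration under stated assumptions, not the actual
 ResLockCheckLimit/ResLockUpdateLimit code: the types and function names are
 hypothetical, the LOCK and RESPROCLOCK arguments are left out, and the check
 only considers adding increments.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NUM_LIMITS 2   /* hypothetical: active count and cost */

typedef struct
{
    double threshold;   /* limit threshold */
    double current;     /* shared counter for current utilization */
} SketchLimit;

typedef struct
{
    double increments[SKETCH_NUM_LIMITS];  /* per-limit change for one portal */
} SketchIncrement;

/*
 * Conflict test: returns true when the increments fit, false when applying
 * them would push any counter past its threshold.
 */
static bool
sketch_check_limit(const SketchLimit *limits, const SketchIncrement *inc)
{
    for (int i = 0; i < SKETCH_NUM_LIMITS; i++)
    {
        if (limits[i].current + inc->increments[i] > limits[i].threshold)
            return false;   /* conflict: the caller would have to wait */
    }
    return true;
}

/* Add (increment = true) or subtract (increment = false) the increments. */
static void
sketch_update_limit(SketchLimit *limits, const SketchIncrement *inc,
                    bool increment)
{
    for (int i = 0; i < SKETCH_NUM_LIMITS; i++)
        limits[i].current += increment ? inc->increments[i]
                                       : -inc->increments[i];
}
```

 The same increment set is used for both directions, which is why each portal
 keeps its own copy: whatever was added when the portal was locked is exactly
 what gets subtracted when it is unlocked.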

 Commit and abort are mostly handled by the existing ResourceOwner code, as
 dropping portals unlocks them! However, for WITH HOLD cursors an extra
 function is required to unlock any remaining portals at session exit:

     AtExitCleanup_ResPortals(void);

 Some of the standard lock manager code is almost duplicated to handle
 interrupts, sleeping and wait queue removal:

     ResLockWaitCancel(void);
     ResProcSleep(LOCKMODE lockmode, LOCALLOCK *locallock);
     ResRemoveFromWaitQueue(PGPROC *proc, uint32 hashcode);

 Deadlocks between standard and resource locks are possible, and are handled
 by the deadlock detector. We pass the lock mode as ExclusiveLock, which
 results in overly aggressive detection and rollback of deadlocks. However,
 safety is the *initial* goal! More sophisticated detection logic is an
 obvious next step.

 Deadlocks with oneself are also possible, although usually only achievable
 with several open cursors:

     ResCheckSelfDeadLock(LOCK *lock);

 -----------------------------------------------------------------------------

 Lock Design
 -----------

 The resource locks share the lock hash table(s) with the standard (and user)
 locks. This means that they use the standard lock manager partitioned
 lightweight lock(s):

   FirstLockMgrLock, ..., FirstLockMgrLock + NUM_LOCK_PARTITIONS - 1

 for lightweight lock operations.
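
 Choosing the partition lock for a given resource lock can be sketched as
 below, assuming the usual modulo mapping from lock tag hash code to
 partition that the standard lock manager uses. The constant values and the
 function names are hypothetical stand-ins; the real values come from the
 server headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define SKETCH_NUM_LOCK_PARTITIONS 16u
#define SKETCH_FIRST_LOCK_MGR_LOCK 100u

/* Partition number for a lock tag hash code. */
static uint32_t
sketch_lock_partition(uint32_t hashcode)
{
    return hashcode % SKETCH_NUM_LOCK_PARTITIONS;
}

/* Id of the lightweight lock that guards that partition. */
static uint32_t
sketch_partition_lock_id(uint32_t hashcode)
{
    return SKETCH_FIRST_LOCK_MGR_LOCK + sketch_lock_partition(hashcode);
}
```

 Because the resource locks live in the same hash table, a resource lock and
 a standard lock whose hash codes fall in the same partition contend for the
 same lightweight lock.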

 The queues themselves and their data are protected by a separate lightweight
 lock:

   ResQueueLock

 so at some points both the partition lock and the queue lock may be held.


 -----------------------------------------------------------------------------

 Admin
 -----

 Some new GUCs are introduced to control the scheduler:

 #resource_scheduler = off         # enable resource scheduling


 Some new commands are added and others amended to allow administration of the
 resource queues:

 Queues can be created, altered and dropped via the CREATE|ALTER|DROP RESOURCE
 QUEUE commands.

 CREATE RESOURCE QUEUE <name> [ACTIVE THRESHOLD <int value>]
                              [COST THRESHOLD <float value>]
                              [IGNORE THRESHOLD <float value>]
                              [OVERCOMMIT|NOOVERCOMMIT];
 ALTER RESOURCE QUEUE <name>  [ACTIVE THRESHOLD <int value>]
                              [COST THRESHOLD <float value>]
                              [IGNORE THRESHOLD <float value>]
                              [OVERCOMMIT|NOOVERCOMMIT];
 DROP RESOURCE QUEUE <name>;

 The value used for the cost threshold *must* be a float, e.g. 2.0 or 2e+0,
 not 2. This slightly annoying syntax is needed to allow values of 1e+19 or
 larger to be accepted. At this point we don't really have a good idea of what
 a maximum reasonable threshold for cost is likely to be, so we allow very
 large values.

 The ignore threshold sets a lower limit for queries to be locked for
 execution: if the query cost is less than this limit, locking is skipped.

 The overcommit parameter controls what happens to a query whose cost exceeds
 the cost limit for the queue: it either aborts (the default) or waits until
 no one else is executing. This avoids queries being stuck in the wait state
 forever.
 These last two parameters complicate the logic considerably, needless to say!
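
 How the ignore threshold and the overcommit setting interact with the cost
 threshold can be sketched as a single decision function. This is a minimal
 illustration that only looks at cost (the active threshold is left out); the
 enum, the function name and its parameters are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    SKETCH_SKIP_LOCK,   /* cost below the ignore threshold: no locking */
    SKETCH_RUN,         /* fits under the cost threshold right now */
    SKETCH_WAIT,        /* must wait for running queries to finish */
    SKETCH_ABORT        /* can never fit and overcommit is off */
} SketchDecision;

static SketchDecision
sketch_admit(double query_cost, double current_cost, int active_queries,
             double cost_threshold, double ignore_threshold, bool overcommit)
{
    /* Cheap queries bypass the queue entirely. */
    if (query_cost < ignore_threshold)
        return SKETCH_SKIP_LOCK;

    if (query_cost > cost_threshold)
    {
        /* Bigger than the whole queue: abort, or run alone if overcommit. */
        if (!overcommit)
            return SKETCH_ABORT;
        return (active_queries == 0) ? SKETCH_RUN : SKETCH_WAIT;
    }

    /* Ordinary case: run if the shared cost counter has room. */
    return (current_cost + query_cost <= cost_threshold) ? SKETCH_RUN
                                                         : SKETCH_WAIT;
}
```

 For example, with a cost threshold of 1000 and an ignore threshold of 10, a
 query of cost 2000 aborts when overcommit is off, but with overcommit on it
 waits until the queue is empty and then runs alone.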

 The ALTER RESOURCE QUEUE command will only allow the thresholds to be amended
 to a value larger than the current counter for the in-memory queues. This is a
 safety feature to prevent unexpected behaviour.

 The DROP RESOURCE QUEUE command refuses to drop a queue that has roles
 assigned to it, or whose in-memory queue has any current value that is not
 zero.

 Only superuser can create, alter or drop resource queues.
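
 The ALTER and DROP safety checks described above amount to two small
 predicates. The sketch below is illustrative only: the function names and
 parameters are hypothetical, and the real commands read the counters from
 the in-memory queue while holding the appropriate locks.

```c
#include <assert.h>
#include <stdbool.h>

/* ALTER: the new threshold must be larger than the in-memory counter. */
static bool
sketch_can_alter_threshold(double new_threshold, double current_counter)
{
    return new_threshold > current_counter;
}

/* DROP: no roles may be assigned and every in-memory counter must be zero. */
static bool
sketch_can_drop_queue(int assigned_roles, const double *counters,
                      int ncounters)
{
    if (assigned_roles > 0)
        return false;

    for (int i = 0; i < ncounters; i++)
    {
        if (counters[i] != 0.0)
            return false;
    }
    return true;
}
```

 Both checks look at live counter values, so the same command can fail on a
 busy queue and succeed a moment later once running statements finish.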


 Roles are assigned to and removed from the resource queues by CREATE|ALTER ROLE
 commands:

 CREATE ROLE <name> [role options] [RESOURCE QUEUE <name>];
 ALTER ROLE <name>  [role options] [RESOURCE QUEUE <name>];
 ALTER ROLE <name>  [role options] [RESOURCE QUEUE none];

 The resource queue name 'none' is used to mean 'no resource queue thanks'.

 -----------------------------------------------------------------------------

 Catalog Changes
 ---------------

 Resource Scheduling adds one new catalog (pg_resqueue) and alters another
 (pg_authid). This alteration is propagated to the derived view (pg_roles).

 The new catalog and indexes have not been added to the catcache system.


 Tables:

   pg_resqueue     (new catalog with OID 6026)
     rsqname        name  not null
     rsqcountlimit  real  not null
     rsqcostlimit   real  not null
     rsqovercommit  boolean  not null
     rsqignorecostlimit  real  not null


   pg_authid       (existing catalog gets 1 new column)
     rolresqueue    oid

 Indexes:

   pg_resqueue_oid_index       (new unique index with OID 6027)
     pg_resqueue.oid

   pg_resqueue_rsqname_index   (new unique index with OID 6028)
     pg_resqueue.rsqname

   pg_authid_rolresqueue_index (new non-unique index with OID 6029)
     pg_authid.rolresqueue


 Views:

   pg_roles     (existing view gets 1 new column, added before pg_authid.oid)
     pg_authid.rolresqueue

   pg_resqueue_status (new view)

 -----------------------------------------------------------------------------

 Debugging Code
 --------------

 There are a number of (hopefully) helpful elog messages at the DEBUG1 level.
 However, the symbol RESLOCK_DEBUG must be defined to enable them
 (usually done in src/include/pg_config_manual.h).

 -----------------------------------------------------------------------------

 Notes
 -----

 [1] "Managed by the Resource Scheduler" - this deserves some explanation.
      In the current incarnation of the code it means:

        portal->queueId != InvalidOid
        portal->releaseResLock == true


      Typically every portal gets queueId set at creation, but this may get
      reset when the to-lock-or-not-to-lock decision comes around (see
      ResLockPortal).
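
      The test in note [1] can be written as a small predicate. This sketch
      assumes both conditions must hold together; SketchPortal, SketchOid
      and sketch_portal_is_managed are hypothetical stand-ins for the real
      portal structure and Oid type.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int SketchOid;               /* stand-in for Oid */
#define SKETCH_INVALID_OID ((SketchOid) 0)    /* stand-in for InvalidOid */

typedef struct
{
    SketchOid queueId;         /* queue locking the portal, if any */
    bool      releaseResLock;  /* release the resource lock at portal drop? */
} SketchPortal;

/* A portal is "managed" when it has a queue and owns a lock to release. */
static bool
sketch_portal_is_managed(const SketchPortal *portal)
{
    return portal->queueId != SKETCH_INVALID_OID && portal->releaseResLock;
}
```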