 $PostgreSQL: pgsql/src/backend/utils/mmgr/README,v 1.15 2008/04/09 01:00:46 momjian Exp $

 Notes About Memory Allocation Redesign
 ======================================

 Up through version 7.0, Postgres had serious problems with memory leakage
 during large queries that process a lot of pass-by-reference data.  There
 was no provision for recycling memory until end of query.  This needed to be
 fixed, even more so with the advent of TOAST which will allow very large
 chunks of data to be passed around in the system.  This document describes
 the new memory management system implemented in 7.1.


 Background
 ----------

 We already do most of our memory allocation in "memory contexts", which
 are usually AllocSets as implemented by backend/utils/mmgr/aset.c.  What
 we need to do is create more contexts and define proper rules about when
 they can be freed.

 The basic operations on a memory context are:

 * create a context

 * allocate a chunk of memory within a context (equivalent of standard
   C library's malloc())

 * delete a context (including freeing all the memory allocated therein)

 * reset a context (free all memory allocated in the context, but not the
   context object itself)

 Given a chunk of memory previously allocated from a context, one can
 free it or reallocate it larger or smaller (corresponding to standard
 library's free() and realloc() routines).  These operations return memory
 to or get more memory from the same context the chunk was originally
 allocated in.

 At all times there is a "current" context denoted by the
 CurrentMemoryContext global variable.  The backend macro palloc()
 implicitly allocates space in that context.  The MemoryContextSwitchTo()
 operation selects a new current context (and returns the previous context,
 so that the caller can restore the previous context before exiting).

 The main advantage of memory contexts over plain use of malloc/free is
 that the entire contents of a memory context can be freed easily, without
 having to request freeing of each individual chunk within it.  This is
 both faster and more reliable than per-chunk bookkeeping.  We already use
 this fact to clean up at transaction end: by resetting all the active
 contexts, we reclaim all memory.  What we need are additional contexts
 that can be reset or deleted at strategic times within a query, such as
 after each tuple.
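
 The operations above can be sketched with a toy model.  This is not the
 aset.c implementation: the toy_* names, the linked-list bookkeeping, and
 the abort() on out-of-memory are all assumptions made to keep the example
 self-contained.  It shows allocation into a "current" context, the
 save-and-restore idiom for switching contexts, and a per-tuple context
 that is reset once per cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: a context is a linked list of malloc'd chunks. */
typedef union ToyChunk
{
    union ToyChunk *next;       /* next chunk in the same context */
    max_align_t     align;      /* keeps the payload suitably aligned */
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk   *chunks;
    size_t      nchunks;
} ToyContext;

/* Plays the role of the CurrentMemoryContext global. */
static ToyContext *toy_current = NULL;

static ToyContext *toy_create(void)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    return cxt;
}

/* Allocate in whatever context is current (the palloc analogue). */
static void *toy_alloc(size_t size)
{
    ToyChunk   *c = malloc(sizeof(ToyChunk) + size);

    if (c == NULL)
        abort();
    c->next = toy_current->chunks;
    toy_current->chunks = c;
    toy_current->nchunks++;
    return c + 1;
}

/* Free everything in the context but keep the context object. */
static void toy_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Free the contents and the context object itself. */
static void toy_delete(ToyContext *cxt)
{
    toy_reset(cxt);
    free(cxt);
}

/* Select a new current context and return the previous one. */
static ToyContext *toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    toy_current = cxt;
    return old;
}

/* Returns the number of chunks still live after three tuple cycles. */
static size_t toy_demo(void)
{
    ToyContext *query = toy_create();
    ToyContext *per_tuple = toy_create();
    size_t      live;

    toy_current = query;
    (void) toy_alloc(64);               /* lives until the query ends */

    for (int tuple = 0; tuple < 3; tuple++)
    {
        ToyContext *old = toy_switch_to(per_tuple);

        (void) toy_alloc(16);           /* per-tuple scratch space */
        (void) toy_alloc(32);
        toy_switch_to(old);             /* restore caller's context */
        toy_reset(per_tuple);           /* one call frees both chunks */
    }

    live = query->nchunks + per_tuple->nchunks;
    toy_delete(per_tuple);
    toy_delete(query);
    toy_current = NULL;
    return live;
}
```

 Only the single query-lifetime chunk survives the loop; none of the
 per-tuple chunks had to be freed individually.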


 Some Notes About the palloc API Versus Standard C Library
 ---------------------------------------------------------

 The behavior of palloc and friends is similar to the standard C library's
 malloc and friends, but there are some deliberate differences too.  Here
 are some notes to clarify the behavior.

 * If out of memory, palloc and repalloc exit via elog(ERROR).  They never
   return NULL, and it is not necessary or useful to test for such a result.

 * palloc(0) is explicitly a valid operation.  It does not return a NULL
   pointer, but a valid chunk of which no bytes may be used.  (However, the
   chunk might later be repalloc'd larger; it can also be pfree'd without
   error.)  (Note: this behavior is new in Postgres 8.0; earlier versions
   disallowed palloc(0).  It seems more consistent to allow it, however.)
   Similarly, repalloc allows realloc'ing to zero size.

 * pfree and repalloc do not accept a NULL pointer.  This is intentional.


 pfree/repalloc No Longer Depend On CurrentMemoryContext
 -------------------------------------------------------

 In this proposal, pfree() and repalloc() can be applied to any chunk
 whether it belongs to CurrentMemoryContext or not --- the chunk's owning
 context will be invoked to handle the operation, regardless.  This is a
 change from the old requirement that CurrentMemoryContext must be set
 to the same context the memory was allocated from before one can use
 pfree() or repalloc().  The old coding requirement is obviously fairly
 error-prone, and will become more so the more context-switching we do;
 so I think it's essential to use CurrentMemoryContext only for palloc.
 We can avoid needing it for pfree/repalloc by putting restrictions on
 context managers as discussed below.

 We could even consider getting rid of CurrentMemoryContext entirely,
 instead requiring the target memory context for allocation to be specified
 explicitly.  But I think that would be too much notational overhead ---
 we'd have to pass an appropriate memory context to called routines in
 many places.  For example, the copyObject routines would need to be passed
 a context, as would function execution routines that return a
 pass-by-reference datatype.  And what of routines that temporarily
 allocate space internally, but don't return it to their caller?  We
 certainly don't want to clutter every call in the system with "here is
 a context to use for any temporary memory allocation you might want to
 do".  So there'd still need to be a global variable specifying a suitable
 temporary-allocation context.  That might as well be CurrentMemoryContext.


 Additions to the Memory-Context Mechanism
 -----------------------------------------

 If we are going to have more contexts, we need more mechanism for keeping
 track of them; else we risk leaking whole contexts under error conditions.

 We can do this by creating trees of "parent" and "child" contexts.  When
 creating a memory context, the new context can be specified to be a child
 of some existing context.  A context can have many children, but only one
 parent.  In this way the contexts form a forest (not necessarily a single
 tree, since there could be more than one top-level context).

 We then say that resetting or deleting any particular context resets or
 deletes all its direct and indirect children as well.  This feature allows
 us to manage a lot of contexts without fear that some will be leaked; we
 only need to keep track of one top-level context that we are going to
 delete at transaction end, and make sure that any shorter-lived contexts
 we create are descendants of that context.  Since the tree can have
 multiple levels, we can deal easily with nested lifetimes of storage,
 such as per-transaction, per-statement, per-scan, per-tuple.  Storage
 lifetimes that only partially overlap can be handled by allocating
 from different trees of the context forest (there are some examples
 in the next section).

 For convenience we will also want operations like "reset/delete all
 children of a given context, but don't reset or delete that context
 itself".
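
 A minimal sketch of the parent/child bookkeeping follows.  The TreeCxt
 type, the tree_* functions, and the live-context counter are hypothetical
 names invented for this illustration; the real context header is discussed
 later in this document.  The point shown is that deleting one context
 removes its whole subtree, so only the top of each subtree needs to be
 tracked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context node carrying only the tree links. */
typedef struct TreeCxt
{
    struct TreeCxt *parent;      /* NULL for a top-level context */
    struct TreeCxt *firstchild;  /* head of linked list of children */
    struct TreeCxt *nextchild;   /* next child of same parent */
} TreeCxt;

static int tree_live_contexts = 0;  /* for demonstration only */

static TreeCxt *tree_create(TreeCxt *parent)
{
    TreeCxt    *c = calloc(1, sizeof(TreeCxt));

    if (c == NULL)
        abort();
    c->parent = parent;
    if (parent != NULL)
    {
        c->nextchild = parent->firstchild;
        parent->firstchild = c;
    }
    tree_live_contexts++;
    return c;
}

static void tree_delete(TreeCxt *c);

/* Delete all children of a context, but not the context itself. */
static void tree_delete_children(TreeCxt *c)
{
    while (c->firstchild != NULL)
        tree_delete(c->firstchild);     /* child unlinks itself from c */
}

/* Delete a context together with all direct and indirect children. */
static void tree_delete(TreeCxt *c)
{
    tree_delete_children(c);
    if (c->parent != NULL)
    {
        TreeCxt   **link = &c->parent->firstchild;

        while (*link != c)
            link = &(*link)->nextchild;
        *link = c->nextchild;
    }
    free(c);
    tree_live_contexts--;
}

/* Returns the number of contexts left after deleting the xact subtree. */
static int tree_demo(void)
{
    TreeCxt    *top = tree_create(NULL);
    TreeCxt    *cache = tree_create(top);
    TreeCxt    *xact = tree_create(top);
    TreeCxt    *stmt = tree_create(xact);
    int         remaining;

    (void) cache;
    (void) tree_create(stmt);           /* per-scan */
    (void) tree_create(stmt);           /* per-tuple */

    tree_delete(xact);                  /* takes stmt and both children too */
    remaining = tree_live_contexts;     /* only top and cache are left */

    tree_delete(top);
    assert(tree_live_contexts == 0);
    return remaining;
}
```

 Four contexts disappear with a single tree_delete(xact) call, and the
 sibling subtree under the same parent is untouched.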


 Globally Known Contexts
 -----------------------

 There will be several widely-known contexts that will typically be
 referenced through global variables.  At any instant the system may
 contain many additional contexts, but all other contexts should be direct
 or indirect children of one of these contexts to ensure they are not
 leaked in event of an error.

 TopMemoryContext --- this is the actual top level of the context tree;
 every other context is a direct or indirect child of this one.  Allocating
 here is essentially the same as "malloc", because this context will never
 be reset or deleted.  This is for stuff that should live forever, or for
 stuff that the controlling module will take care of deleting at the
 appropriate time.  An example is fd.c's tables of open files, as well as
 the context management nodes for memory contexts themselves.  Avoid
 allocating stuff here unless really necessary, and especially avoid
 running with CurrentMemoryContext pointing here.

 PostmasterContext --- this is the postmaster's normal working context.
 After a backend is spawned, it can delete PostmasterContext to free its
 copy of memory the postmaster was using that it doesn't need.  (Anything
 that has to be passed from postmaster to backends will be passed in
 TopMemoryContext.  The postmaster will have only TopMemoryContext,
 PostmasterContext, and ErrorContext --- the remaining top-level contexts
 will be set up in each backend during startup.)

 CacheMemoryContext --- permanent storage for relcache, catcache, and
 related modules.  This will never be reset or deleted, either, so it's
 not truly necessary to distinguish it from TopMemoryContext.  But it
 seems worthwhile to maintain the distinction for debugging purposes.
 (Note: CacheMemoryContext will have child-contexts with shorter lifespans.
 For example, a child context is the best place to keep the subsidiary
 storage associated with a relcache entry; that way we can free rule
 parsetrees and so forth easily, without having to depend on constructing
 a reliable version of freeObject().)

 MessageContext --- this context holds the current command message from the
 frontend, as well as any derived storage that need only live as long as
 the current message (for example, in simple-Query mode the parse and plan
 trees can live here).  This context will be reset, and any children
 deleted, at the top of each cycle of the outer loop of PostgresMain.  This
 is kept separate from per-transaction and per-portal contexts because a
 query string might need to live either a longer or shorter time than any
 single transaction or portal.

 TopTransactionContext --- this holds everything that lives until end of the
 top-level transaction.  This context will be reset, and all its children
 deleted, at conclusion of each top-level transaction cycle.  In most cases
 you don't want to allocate stuff directly here, but in CurTransactionContext;
 what does belong here is control information that exists explicitly to manage
 status across multiple subtransactions.  Note: this context is NOT cleared
 immediately upon error; its contents will survive until the transaction block
 is exited by COMMIT/ROLLBACK.

 CurTransactionContext --- this holds data that has to survive until the end
 of the current transaction, and in particular will be needed at top-level
 transaction commit.  When we are in a top-level transaction this is the same
 as TopTransactionContext, but in subtransactions it points to a child context.
 It is important to understand that if a subtransaction aborts, its
 CurTransactionContext is thrown away after finishing the abort processing;
 but a committed subtransaction's CurTransactionContext is kept until top-level
 commit (unless of course one of the intermediate levels of subtransaction
 aborts).  This ensures that we do not keep data from a failed subtransaction
 longer than necessary.  Because of this behavior, you must be careful to clean
 up properly during subtransaction abort --- the subtransaction's state must be
 delinked from any pointers or lists kept in upper transactions, or you will
 have dangling pointers leading to a crash at top-level commit.  An example of
 data kept here is pending NOTIFY messages, which are sent at top-level commit,
 but only if the generating subtransaction did not abort.

 QueryContext --- this is not actually a separate context, but a global
 variable pointing to the context that holds the current command's parse tree.
 (In simple-Query mode this points to MessageContext; when executing a
 prepared statement it will point to the prepared statement's private context.
 Note that the plan tree may or may not be in this same context.)
 Generally it is not appropriate for any code to use QueryContext as an
 allocation target --- from the point of view of any code that would be
 referencing the QueryContext variable, it's a read-only context.

 PortalContext --- this is not actually a separate context, but a
 global variable pointing to the per-portal context of the currently active
 execution portal.  This can be used if it's necessary to allocate storage
 that will live just as long as the execution of the current portal requires.

 ErrorContext --- this permanent context will be switched into for error
 recovery processing, and then reset on completion of recovery.  We'll
 arrange to have, say, 8K of memory available in it at all times.  In this
 way, we can ensure that some memory is available for error recovery even
 if the backend has run out of memory otherwise.  This allows out-of-memory
 to be treated as a normal ERROR condition, not a FATAL error.


 Contexts For Prepared Statements And Portals
 --------------------------------------------

 A prepared-statement object has an associated private context, in which
 the parse and plan trees for its query are stored.  Because these trees
 are read-only to the executor, the prepared statement can be re-used many
 times without further copying of these trees.

 An execution-portal object has a private context that is referenced by
 PortalContext when the portal is active.  In the case of a portal created
 by DECLARE CURSOR, this private context contains the query parse and plan
 trees (there being no other object that can hold them).  Portals created
 from prepared statements simply reference the prepared statements' trees,
 and won't actually need any storage allocated in their private contexts.


 Transient Contexts During Execution
 -----------------------------------

 When creating a prepared statement, the parse and plan trees will be built
 in a temporary context that's a child of MessageContext (so that it will
 go away automatically upon error).  On success, the finished plan is
 copied to the prepared statement's private context, and the temp context
 is released; this allows planner temporary space to be recovered before
 execution begins.  (In simple-Query mode we'll not bother with the extra
 copy step, so the planner temp space stays around till end of query.)

 The top-level executor routines, as well as most of the "plan node"
 execution code, will normally run in a context that is created by
 ExecutorStart and destroyed by ExecutorEnd; this context also holds the
 "plan state" tree built during ExecutorStart.  Most of the memory
 allocated in these routines is intended to live until end of query,
 so this is appropriate for those purposes.  The executor's top context
 is a child of PortalContext, that is, the per-portal context of the
 portal that represents the query's execution.

 The main improvement needed in the executor is that expression evaluation
 --- both for qual testing and for computation of targetlist entries ---
 needs to not leak memory.  To do this, each ExprContext (expression-eval
 context) created in the executor will now have a private memory context
 associated with it, and we'll arrange to switch into that context when
 evaluating expressions in that ExprContext.  The plan node that owns the
 ExprContext is responsible for resetting the private context to empty
 when it no longer needs the results of expression evaluations.  Typically
 the reset is done at the start of each tuple-fetch cycle in the plan node.

 Note that this design gives each plan node its own expression-eval memory
 context.  This appears necessary to handle nested joins properly, since
 an outer plan node might need to retain expression results it has computed
 while obtaining the next tuple from an inner node --- but the inner node
 might execute many tuple cycles and many expressions before returning a
 tuple.  The inner node must be able to reset its own expression context
 more often than once per outer tuple cycle.  Fortunately, memory contexts
 are cheap enough that giving one to each plan node doesn't seem like a
 problem.

 A problem with running index accesses and sorts in a query-lifespan context
 is that these operations invoke datatype-specific comparison functions,
 and if the comparators leak any memory then that memory won't be recovered
 till end of query.  The comparator functions all return bool or int32,
 so there's no problem with their result data, but there can be a problem
 with leakage of internal temporary data.  In particular, comparator
 functions that operate on TOAST-able data types will need to be careful
 not to leak detoasted versions of their inputs.  This is annoying, but
 it appears a lot easier to make the comparators conform than to fix the
 index and sort routines, so that's what I propose to do for 7.1.  Further
 cleanup can be left for another day.

 There will be some special cases, such as aggregate functions.  nodeAgg.c
 needs to remember the results of evaluation of aggregate transition
 functions from one tuple cycle to the next, so it can't just discard
 all per-tuple state in each cycle.  The easiest way to handle this seems
 to be to have two per-tuple contexts in an aggregate node, and to
 ping-pong between them, so that at each tuple one is the active allocation
 context and the other holds any results allocated by the prior cycle's
 transition function.
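
 The ping-pong idea can be illustrated as follows.  This is a sketch under
 stated assumptions, not nodeAgg.c code: the Arena type is a toy fixed-size
 bump allocator, and concat_trans is a made-up transition function whose
 pass-by-reference state is a string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy per-tuple context: bump allocation, reset by rewinding. */
typedef struct Arena
{
    char        buf[256];
    size_t      used;
} Arena;

static void *arena_alloc(Arena *a, size_t n)
{
    void       *p;

    if (a->used + n > sizeof(a->buf))
        abort();
    p = a->buf + a->used;
    a->used += n;
    return p;
}

static void arena_reset(Arena *a)
{
    a->used = 0;
}

/* Hypothetical transition function: new state is allocated in cxt. */
static char *concat_trans(Arena *cxt, const char *state, const char *input)
{
    char       *result = arena_alloc(cxt, strlen(state) + strlen(input) + 1);

    strcpy(result, state);
    strcat(result, input);
    return result;
}

/* Aggregates four inputs and returns the final transition state. */
static const char *pingpong_demo(void)
{
    static Arena cxt[2];
    const char *inputs[] = {"a", "b", "c", "d"};
    const char *state = "";
    int         active = 0;

    arena_reset(&cxt[0]);
    arena_reset(&cxt[1]);

    for (size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); i++)
    {
        /*
         * The active context holds only the state from two cycles ago,
         * which nothing references any more.  The prior cycle's state
         * lives in the other context and is still readable.
         */
        arena_reset(&cxt[active]);
        state = concat_trans(&cxt[active], state, inputs[i]);
        active = 1 - active;
    }
    return state;
}
```

 Each context never holds more than one cycle's worth of allocations, yet
 the previous state is always intact while the next one is being computed.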

 Executor routines that switch the active CurrentMemoryContext may need
 to copy data into their caller's current memory context before returning.
 I think there will be relatively little need for that, because of the
 convention of resetting the per-tuple context at the *start* of an
 execution cycle rather than at its end.  With that rule, an execution
 node can return a tuple that is palloc'd in its per-tuple context, and
 the tuple will remain good until the node is called for another tuple
 or told to end execution.  This is pretty much the same state of affairs
 that exists now, since a scan node can return a direct pointer to a tuple
 in a disk buffer that is only guaranteed to remain good that long.

 A more common reason for copying data will be to transfer a result from
 per-tuple context to per-run context; for example, a Unique node will
 save the last distinct tuple value in its per-run context, requiring a
 copy step.
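
 The reset-at-start convention and the Unique-style copy step might look
 like this in miniature.  All names here (ScanNode, scan_next, unique_count)
 are hypothetical, and the "per-tuple context" is reduced to one small
 buffer inside the node; the sketch only illustrates the lifetime rule.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy scan node whose per-tuple storage is a single buffer. */
typedef struct ScanNode
{
    const char **rows;          /* input data, standing in for a table */
    int         nrows;
    int         next;
    char        per_tuple[32];  /* toy per-tuple context */
} ScanNode;

/*
 * Return the next tuple, or NULL at end of data.  The per-tuple storage
 * is cleared at the START of the cycle, so the pointer returned by the
 * previous call stays valid until this function is called again.
 */
static const char *scan_next(ScanNode *node)
{
    memset(node->per_tuple, 0, sizeof(node->per_tuple));
    if (node->next >= node->nrows)
        return NULL;
    strncpy(node->per_tuple, node->rows[node->next++],
            sizeof(node->per_tuple) - 1);
    return node->per_tuple;
}

/* Unique-like consumer: counts runs of equal adjacent tuples. */
static int unique_count(const char **rows, int nrows)
{
    ScanNode    scan = {rows, nrows, 0, {0}};
    char        last[32] = "";  /* per-run storage owned by this node */
    int         have_last = 0;
    int         distinct = 0;
    const char *tup;

    while ((tup = scan_next(&scan)) != NULL)
    {
        if (!have_last || strcmp(tup, last) != 0)
        {
            /* tup dies at the next scan_next call, so copy it */
            strcpy(last, tup);
            have_last = 1;
            distinct++;
        }
    }
    return distinct;
}

static int unique_demo(void)
{
    const char *rows[] = {"x", "x", "y", "y", "y", "z"};

    return unique_count(rows, 6);
}
```

 The scan node never copies anything for its caller; only the consumer
 that needs a value beyond one cycle pays for a copy.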

 Another interesting special case is VACUUM, which needs to allocate
 working space that will survive its forced transaction commits, yet
 be released on error.  Currently it does that through a "portal",
 which is essentially a child context of TopMemoryContext.  While that
 way still works, it's ugly since xact abort needs special processing
 to delete the portal.  Better would be to use a context that's a child
 of PortalContext and hence is certain to go away as part of normal
 processing.  (Eventually we might have an even better solution from
 nested transactions, but this'll do fine for now.)


 Mechanisms to Allow Multiple Types of Contexts
 ----------------------------------------------

 We may want several different types of memory contexts with different
 allocation policies but similar external behavior.  To handle this,
 memory allocation functions will be accessed via function pointers,
 and we will require all context types to obey the conventions given here.
 (This is not very far different from the existing code.)

 A memory context will be represented by an object like

 typedef struct MemoryContextData
 {
     NodeTag        type;           /* identifies exact kind of context */
     MemoryContextMethods methods;
     struct MemoryContextData *parent;     /* NULL if no parent (toplevel context) */
     struct MemoryContextData *firstchild; /* head of linked list of children */
     struct MemoryContextData *nextchild;  /* next child of same parent */
     char          *name;           /* context name (just for debugging) */
 } MemoryContextData, *MemoryContext;

 This is essentially an abstract superclass, and the "methods" pointer is
 its virtual function table.  Specific memory context types will use
 derived structs having these fields as their first fields.  All the
 contexts of a specific type will have methods pointers that point to the
 same static table of function pointers, which will look like

 typedef struct MemoryContextMethodsData
 {
     Pointer     (*alloc) (MemoryContext c, Size size);
     void        (*free_p) (Pointer chunk);
     Pointer     (*realloc) (Pointer chunk, Size newsize);
     void        (*reset) (MemoryContext c);
     void        (*delete) (MemoryContext c);
 } MemoryContextMethodsData, *MemoryContextMethods;

 Alloc, reset, and delete requests will take a MemoryContext pointer
 as parameter, so they'll have no trouble finding the method pointer
 to call.  Free and realloc are trickier.  To make those work, we will
 require all memory context types to produce allocated chunks that
 are immediately preceded by a standard chunk header, which has the
 layout

 typedef struct StandardChunkHeader
 {
     MemoryContext mycontext;         /* Link to owning context object */
     Size          size;              /* Allocated size of chunk */
 } StandardChunkHeader;

 It turns out that the existing aset.c memory context type does this
 already, and probably any other kind of context would need to have the
 same data available to support realloc, so this is not really creating
 any additional overhead.  (Note that if a context type needs more per-
 allocated-chunk information than this, it can make an additional
 nonstandard header that precedes the standard header.  So we're not
 constraining context-type designers very much.)

 Given this, the pfree routine will look something like

     StandardChunkHeader * header =
         (StandardChunkHeader *) ((char *) p - sizeof(StandardChunkHeader));

     (*header->mycontext->methods->free_p) (p);

 We could do it as a macro, but the macro would have to evaluate its
 argument twice, which seems like a bad idea (the current pfree macro
 does not do that).  This is already saving two levels of function call
 compared to the existing code, so I think we're doing fine without
 squeezing out that last little bit ...
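
 To make the dispatch concrete, here is a self-contained sketch of the
 scheme.  It deliberately uses its own Demo* names rather than the structs
 above, trims the method table to alloc and free_p, and backs every chunk
 with a separate malloc() call; all of that is assumption for the sake of a
 small example.  The union with max_align_t is likewise only a device to
 keep the payload aligned in this toy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct DemoContext DemoContext;

/* Trimmed-down virtual function table. */
typedef struct DemoMethods
{
    void       *(*alloc) (DemoContext *c, size_t size);
    void        (*free_p) (void *chunk);
} DemoMethods;

struct DemoContext
{
    const DemoMethods *methods;
    const char *name;           /* just for debugging */
    int         live;           /* live chunks, for demonstration */
};

/* Header that immediately precedes every allocated chunk. */
typedef union DemoChunkHeader
{
    struct
    {
        DemoContext *mycontext; /* link to owning context object */
        size_t      size;       /* allocated size of chunk */
    }           h;
    max_align_t align;
} DemoChunkHeader;

static DemoChunkHeader *demo_header(void *chunk)
{
    return (DemoChunkHeader *) ((char *) chunk - sizeof(DemoChunkHeader));
}

/* One concrete context type: each chunk is its own malloc block. */
static void *simple_alloc(DemoContext *c, size_t size)
{
    DemoChunkHeader *hdr = malloc(sizeof(DemoChunkHeader) + size);

    if (hdr == NULL)
        abort();
    hdr->h.mycontext = c;
    hdr->h.size = size;
    c->live++;
    return hdr + 1;
}

static void simple_free(void *chunk)
{
    DemoChunkHeader *hdr = demo_header(chunk);

    hdr->h.mycontext->live--;
    free(hdr);
}

static const DemoMethods simple_methods = {simple_alloc, simple_free};

/* Generic entry points: no notion of a current context is needed. */
static void *demo_alloc_in(DemoContext *c, size_t size)
{
    return c->methods->alloc(c, size);
}

static void demo_pfree(void *chunk)
{
    demo_header(chunk)->h.mycontext->methods->free_p(chunk);
}

static DemoContext *demo_chunk_context(void *chunk)
{
    return demo_header(chunk)->h.mycontext;
}

/* Returns the number of chunks still live in both contexts. */
static int dispatch_demo(void)
{
    DemoContext a = {&simple_methods, "a", 0};
    DemoContext b = {&simple_methods, "b", 0};
    void       *pa = demo_alloc_in(&a, 10);
    void       *pb = demo_alloc_in(&b, 20);

    assert(demo_chunk_context(pa) == &a);
    assert(demo_chunk_context(pb) == &b);

    demo_pfree(pb);             /* routed to b via the chunk header */
    demo_pfree(pa);             /* routed to a via the chunk header */
    return a.live + b.live;
}
```

 Because the owner is recorded in the header, the free path needs only the
 chunk pointer, and the same lookup answers "which context owns this chunk?".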


 More Control Over aset.c Behavior
 ---------------------------------

 Currently, aset.c allocates an 8K block upon the first allocation in
 a context, and doubles that size for each successive block request.
 That's good behavior for a context that might hold *lots* of data, and
 the overhead wasn't bad when we had only a few contexts in existence.
 With dozens if not hundreds of smaller contexts in the system, we will
 want to be able to fine-tune things a little better.

 The creator of a context will be able to specify an initial block size
 and a maximum block size.  Selecting smaller values will prevent wastage
 of space in contexts that aren't expected to hold very much (an example is
 the relcache's per-relation contexts).

 Also, it will be possible to specify a minimum context size.  If this
 value is greater than zero then a block of that size will be grabbed
 immediately upon context creation, and cleared but not released during
 context resets.  This feature is needed for ErrorContext (see above),
 but will most likely not be used for other contexts.

 We expect that per-tuple contexts will be reset frequently and typically
 will not allocate very much space per tuple cycle.  To make this usage
 pattern cheap, the first block allocated in a context is not given
 back to malloc() during reset, but just cleared.  This avoids malloc
 thrashing.
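
 The block-size policy can be sketched as a pure function.  This assumes
 that doubling starts from the creator-specified initial size and is capped
 at the maximum block size; the text above describes both knobs but does
 not spell out how they interact, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: size of the next block a context would request,
 * given the size of the previous block (0 if none allocated yet).
 */
static size_t sketch_next_block_size(size_t prev_size, size_t init_size,
                                     size_t max_size)
{
    if (prev_size == 0)
        return init_size;       /* first block in the context */
    if (prev_size >= max_size / 2)
        return max_size;        /* doubling would exceed the cap */
    return prev_size * 2;
}

/* Total bytes requested from malloc after nblocks block requests. */
static size_t sketch_total_after(int nblocks, size_t init_size,
                                 size_t max_size)
{
    size_t      total = 0;
    size_t      size = 0;

    for (int i = 0; i < nblocks; i++)
    {
        size = sketch_next_block_size(size, init_size, max_size);
        total += size;
    }
    return total;
}
```

 With a 1K initial size and an 8K maximum, five block requests come to
 1K + 2K + 4K + 8K + 8K = 23552 bytes, whereas starting at 8K with a much
 larger cap the same five requests come to 253952 bytes.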


 Other Notes
 -----------

 The original version of this proposal suggested that functions returning
 pass-by-reference datatypes should be required to return a value freshly
 palloc'd in their caller's memory context, never a pointer to an input
 value.  I've abandoned that notion since it clearly is prone to error.
 In the current proposal, it is possible to discover which context a
 chunk of memory is allocated in (by checking the required standard chunk
 header), so nodeAgg can determine whether or not it's safe to reset
 its working context; it doesn't have to rely on the transition function
 to do what it's expecting.