| $PostgreSQL: pgsql/src/backend/utils/mmgr/README,v 1.15 2008/04/09 01:00:46 momjian Exp $ |
| |
| Notes About Memory Allocation Redesign |
| ====================================== |
| |
| Up through version 7.0, Postgres had serious problems with memory leakage |
| during large queries that process a lot of pass-by-reference data. There |
| was no provision for recycling memory until end of query. This needed to be |
| fixed, even more so with the advent of TOAST, which allows very large |
| chunks of data to be passed around in the system. This document describes |
| the new memory management system implemented in 7.1. |
| |
| |
| Background |
| ---------- |
| |
| We already do most of our memory allocation in "memory contexts", which |
| are usually AllocSets as implemented by backend/utils/mmgr/aset.c. What |
| we need to do is create more contexts and define proper rules about when |
| they can be freed. |
| |
| The basic operations on a memory context are: |
| |
| * create a context |
| |
| * allocate a chunk of memory within a context (equivalent of standard |
| C library's malloc()) |
| |
| * delete a context (including freeing all the memory allocated therein) |
| |
| * reset a context (free all memory allocated in the context, but not the |
| context object itself) |
| |
| Given a chunk of memory previously allocated from a context, one can |
| free it or reallocate it larger or smaller (corresponding to standard |
| library's free() and realloc() routines). These operations return memory |
| to or get more memory from the same context the chunk was originally |
| allocated in. |
| |
| At all times there is a "current" context denoted by the |
| CurrentMemoryContext global variable. The backend macro palloc() |
| implicitly allocates space in that context. The MemoryContextSwitchTo() |
| operation selects a new current context (and returns the previous context, |
| so that the caller can restore the previous context before exiting). |
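
The switch-and-restore idiom described above can be sketched with a toy
model in plain C. This is only an illustration: ToyContext, toy_current,
toy_switch_to() and toy_palloc_note() are invented names standing in for
MemoryContext, CurrentMemoryContext, MemoryContextSwitchTo() and palloc(),
and no real memory is allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a memory context: it only counts allocations. */
typedef struct ToyContext
{
    const char *name;           /* label, just for debugging */
    size_t      nallocs;        /* allocations charged to this context */
} ToyContext;

/* Toy stand-in for the CurrentMemoryContext global. */
ToyContext *toy_current = NULL;

/* Select a new current context and hand back the previous one. */
ToyContext *
toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    assert(cxt != NULL);
    toy_current = cxt;
    return old;
}

/* Toy stand-in for palloc(): charges the current context. */
void
toy_palloc_note(void)
{
    assert(toy_current != NULL);
    toy_current->nallocs++;
}

/* A routine whose result must outlive the caller's current context. */
void
build_long_lived_result(ToyContext *result_cxt)
{
    ToyContext *old = toy_switch_to(result_cxt);

    toy_palloc_note();          /* charged to result_cxt */
    toy_palloc_note();          /* charged to result_cxt */

    (void) toy_switch_to(old);  /* restore the caller's context */
}

/* Returns 1 if every allocation was charged to the expected context. */
int
toy_demo_switch(void)
{
    ToyContext  per_tuple = {"per-tuple", 0};
    ToyContext  per_query = {"per-query", 0};

    toy_current = &per_tuple;
    toy_palloc_note();                      /* per_tuple */
    build_long_lived_result(&per_query);    /* per_query, twice */
    toy_palloc_note();                      /* per_tuple again */

    return toy_current == &per_tuple &&
        per_tuple.nallocs == 2 &&
        per_query.nallocs == 2;
}
```

The point of returning the old context is that the callee needs no
knowledge of what its caller was using; it just puts things back.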
| |
| The main advantage of memory contexts over plain use of malloc/free is |
| that the entire contents of a memory context can be freed easily, without |
| having to request freeing of each individual chunk within it. This is |
| both faster and more reliable than per-chunk bookkeeping. We already use |
| this fact to clean up at transaction end: by resetting all the active |
| contexts, we reclaim all memory. What we need are additional contexts |
| that can be reset or deleted at strategic times within a query, such as |
| after each tuple. |
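
To make the "free everything at once" advantage concrete, here is a
minimal sketch of a context that remembers its chunks and releases them
all on reset. It is a toy built on malloc(), not the aset.c
implementation; ToyContext, ToyChunk and the toy_* functions are
hypothetical names, and the payload is only guaranteed to be aligned as
strictly as a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Per-chunk bookkeeping: just a link to the next chunk in the context. */
typedef struct ToyChunk
{
    struct ToyChunk *next;
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk   *chunks;         /* all live chunks in this context */
    size_t      nchunks;        /* how many there are */
} ToyContext;

ToyContext *
toy_create(void)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    return cxt;
}

/* Allocate a chunk and remember it in the context's list. */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
    ToyChunk   *chunk = malloc(sizeof(ToyChunk) + size);

    if (chunk == NULL)
        abort();
    chunk->next = cxt->chunks;
    cxt->chunks = chunk;
    cxt->nchunks++;
    return chunk + 1;           /* payload starts right after the link */
}

/* Free every chunk, but keep the context object itself. */
void
toy_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Free every chunk and the context object too. */
void
toy_delete(ToyContext *cxt)
{
    toy_reset(cxt);
    free(cxt);
}

/* Per-tuple usage: callers never free individual chunks. */
size_t
toy_demo_peak_chunks(int ntuples, int allocs_per_tuple)
{
    ToyContext *per_tuple = toy_create();
    size_t      peak = 0;

    for (int t = 0; t < ntuples; t++)
    {
        toy_reset(per_tuple);   /* drop the previous tuple's garbage */
        for (int i = 0; i < allocs_per_tuple; i++)
            (void) toy_alloc(per_tuple, 32);
        if (per_tuple->nchunks > peak)
            peak = per_tuple->nchunks;
    }
    toy_delete(per_tuple);
    return peak;
}
```

In this sketch the number of live chunks never exceeds one tuple's worth,
no matter how many tuples are processed.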
| |
| |
| Some Notes About the palloc API Versus Standard C Library |
| --------------------------------------------------------- |
| |
| The behavior of palloc and friends is similar to the standard C library's |
| malloc and friends, but there are some deliberate differences too. Here |
| are some notes to clarify the behavior. |
| |
| * If out of memory, palloc and repalloc exit via elog(ERROR). They never |
| return NULL, and it is not necessary or useful to test for such a result. |
| |
| * palloc(0) is explicitly a valid operation. It does not return a NULL |
| pointer, but a valid chunk of which no bytes may be used. (However, the |
| chunk might later be repalloc'd larger; it can also be pfree'd without |
| error.) (Note: this behavior is new in Postgres 8.0; earlier versions |
| disallowed palloc(0). It seems more consistent to allow it, however.) |
| Similarly, repalloc allows realloc'ing to zero size. |
| |
| * pfree and repalloc do not accept a NULL pointer. This is intentional. |
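
The three rules above can be mimicked with thin wrappers around the C
library. This is a sketch under stated assumptions, not the backend
code: toy_palloc(), toy_repalloc() and toy_pfree() are invented names,
and exit() stands in for the elog(ERROR) exit, which in the backend does
not terminate the process.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Never returns NULL; a zero-size request still yields a valid chunk. */
void *
toy_palloc(size_t size)
{
    void       *p = malloc(size > 0 ? size : 1);

    if (p == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);     /* stand-in for elog(ERROR) */
    }
    return p;
}

/* Rejects NULL input; resizing to zero is allowed. */
void *
toy_repalloc(void *chunk, size_t size)
{
    void       *p;

    assert(chunk != NULL);
    p = realloc(chunk, size > 0 ? size : 1);
    if (p == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Rejects NULL input, unlike free(). */
void
toy_pfree(void *chunk)
{
    assert(chunk != NULL);
    free(chunk);
}

/* Returns 1 if a zero-size chunk is usable as described. */
int
toy_demo_zero_size(void)
{
    char       *buf = toy_palloc(0);
    int         nonnull = (buf != NULL);
    int         grown_ok;

    buf = toy_repalloc(buf, 6);     /* grow the empty chunk */
    memcpy(buf, "hello", 6);
    grown_ok = (strcmp(buf, "hello") == 0);
    toy_pfree(buf);
    return nonnull && grown_ok;
}
```

Callers written against this style contain no NULL checks after
allocation, which is the main practical difference from malloc-based
code.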
| |
| |
| pfree/repalloc No Longer Depend On CurrentMemoryContext |
| ------------------------------------------------------- |
| |
| In this proposal, pfree() and repalloc() can be applied to any chunk |
| whether it belongs to CurrentMemoryContext or not --- the chunk's owning |
| context will be invoked to handle the operation, regardless. This is a |
| change from the old requirement that CurrentMemoryContext must be set |
| to the same context the memory was allocated from before one can use |
| pfree() or repalloc(). The old coding requirement is obviously fairly |
| error-prone, and will become more so the more context-switching we do; |
| so I think it's essential to use CurrentMemoryContext only for palloc. |
| We can avoid needing it for pfree/repalloc by putting restrictions on |
| context managers as discussed below. |
| |
| We could even consider getting rid of CurrentMemoryContext entirely, |
| instead requiring the target memory context for allocation to be specified |
| explicitly. But I think that would be too much notational overhead --- |
| we'd have to pass an appropriate memory context to called routines in |
| many places. For example, the copyObject routines would need to be passed |
| a context, as would function execution routines that return a |
| pass-by-reference datatype. And what of routines that temporarily |
| allocate space internally, but don't return it to their caller? We |
| certainly don't want to clutter every call in the system with "here is |
| a context to use for any temporary memory allocation you might want to |
| do". So there'd still need to be a global variable specifying a suitable |
| temporary-allocation context. That might as well be CurrentMemoryContext. |
| |
| |
| Additions to the Memory-Context Mechanism |
| ----------------------------------------- |
| |
| If we are going to have more contexts, we need more mechanism for keeping |
| track of them; else we risk leaking whole contexts under error conditions. |
| |
| We can do this by creating trees of "parent" and "child" contexts. When |
| creating a memory context, the new context can be specified to be a child |
| of some existing context. A context can have many children, but only one |
| parent. In this way the contexts form a forest (not necessarily a single |
| tree, since there could be more than one top-level context). |
| |
| We then say that resetting or deleting any particular context resets or |
| deletes all its direct and indirect children as well. This feature allows |
| us to manage a lot of contexts without fear that some will be leaked; we |
| only need to keep track of one top-level context that we are going to |
| delete at transaction end, and make sure that any shorter-lived contexts |
| we create are descendants of that context. Since the tree can have |
| multiple levels, we can deal easily with nested lifetimes of storage, |
| such as per-transaction, per-statement, per-scan, per-tuple. Storage |
| lifetimes that only partially overlap can be handled by allocating |
| from different trees of the context forest (there are some examples |
| in the next section). |
| |
| For convenience we will also want operations like "reset/delete all |
| children of a given context, but don't reset or delete that context |
| itself". |
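
The parent/child rules can be sketched as follows. This toy tracks only
the tree links (no chunk storage), and ToyContext, toy_create(),
toy_delete() and toy_delete_children() are hypothetical names chosen for
the illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyContext
{
    struct ToyContext *parent;      /* NULL for a top-level context */
    struct ToyContext *firstchild;  /* head of list of children */
    struct ToyContext *nextchild;   /* next child of the same parent */
} ToyContext;

int         toy_live_contexts = 0;  /* lets the demo detect leaks */

/* Create a context, optionally linking it under a parent. */
ToyContext *
toy_create(ToyContext *parent)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    cxt->parent = parent;
    if (parent != NULL)
    {
        cxt->nextchild = parent->firstchild;
        parent->firstchild = cxt;
    }
    toy_live_contexts++;
    return cxt;
}

void        toy_delete(ToyContext *cxt);

/* Delete all descendants, but keep the given context itself. */
void
toy_delete_children(ToyContext *cxt)
{
    /* toy_delete() unlinks the child, so firstchild advances each time */
    while (cxt->firstchild != NULL)
        toy_delete(cxt->firstchild);
}

/* Delete a context together with all direct and indirect children. */
void
toy_delete(ToyContext *cxt)
{
    toy_delete_children(cxt);
    if (cxt->parent != NULL)
    {
        /* unlink from the parent's list of children */
        ToyContext **link = &cxt->parent->firstchild;

        while (*link != cxt)
            link = &(*link)->nextchild;
        *link = cxt->nextchild;
    }
    free(cxt);
    toy_live_contexts--;
}

/* Returns 10: one context left after delete_children, none at the end. */
int
toy_demo_tree(void)
{
    ToyContext *xact = toy_create(NULL);
    ToyContext *stmt = toy_create(xact);
    ToyContext *scan = toy_create(stmt);
    int         after_children;

    (void) toy_create(scan);        /* per-tuple context */
    (void) toy_create(stmt);        /* a second scan */
    assert(toy_live_contexts == 5);

    toy_delete_children(xact);      /* everything below xact goes away */
    after_children = toy_live_contexts;
    toy_delete(xact);
    return after_children * 10 + toy_live_contexts;
}
```

Only the pointer to the topmost context had to be remembered; the four
contexts below it were reclaimed without being tracked individually.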
| |
| |
| Globally Known Contexts |
| ----------------------- |
| |
| There will be several widely-known contexts that will typically be |
| referenced through global variables. At any instant the system may |
| contain many additional contexts, but all other contexts should be direct |
| or indirect children of one of these contexts to ensure they are not |
| leaked in event of an error. |
| |
| TopMemoryContext --- this is the actual top level of the context tree; |
| every other context is a direct or indirect child of this one. Allocating |
| here is essentially the same as "malloc", because this context will never |
| be reset or deleted. This is for stuff that should live forever, or for |
| stuff that the controlling module will take care of deleting at the |
| appropriate time. Examples are fd.c's tables of open files and the |
| context management nodes for memory contexts themselves. Avoid |
| allocating stuff here unless really necessary, and especially avoid |
| running with CurrentMemoryContext pointing here. |
| |
| PostmasterContext --- this is the postmaster's normal working context. |
| After a backend is spawned, it can delete PostmasterContext to free its |
| copy of memory the postmaster was using that it doesn't need. (Anything |
| that has to be passed from postmaster to backends will be passed in |
| TopMemoryContext. The postmaster will have only TopMemoryContext, |
| PostmasterContext, and ErrorContext --- the remaining top-level contexts |
| will be set up in each backend during startup.) |
| |
| CacheMemoryContext --- permanent storage for relcache, catcache, and |
| related modules. This will never be reset or deleted, either, so it's |
| not truly necessary to distinguish it from TopMemoryContext. But it |
| seems worthwhile to maintain the distinction for debugging purposes. |
| (Note: CacheMemoryContext will have child-contexts with shorter lifespans. |
| For example, a child context is the best place to keep the subsidiary |
| storage associated with a relcache entry; that way we can free rule |
| parsetrees and so forth easily, without having to depend on constructing |
| a reliable version of freeObject().) |
| |
| MessageContext --- this context holds the current command message from the |
| frontend, as well as any derived storage that need only live as long as |
| the current message (for example, in simple-Query mode the parse and plan |
| trees can live here). This context will be reset, and any children |
| deleted, at the top of each cycle of the outer loop of PostgresMain. This |
| is kept separate from per-transaction and per-portal contexts because a |
| query string might need to live either a longer or shorter time than any |
| single transaction or portal. |
| |
| TopTransactionContext --- this holds everything that lives until end of the |
| top-level transaction. This context will be reset, and all its children |
| deleted, at conclusion of each top-level transaction cycle. In most cases |
| you don't want to allocate stuff directly here, but in CurTransactionContext; |
| what does belong here is control information that exists explicitly to manage |
| status across multiple subtransactions. Note: this context is NOT cleared |
| immediately upon error; its contents will survive until the transaction block |
| is exited by COMMIT/ROLLBACK. |
| |
| CurTransactionContext --- this holds data that has to survive until the end |
| of the current transaction, and in particular will be needed at top-level |
| transaction commit. When we are in a top-level transaction this is the same |
| as TopTransactionContext, but in subtransactions it points to a child context. |
| It is important to understand that if a subtransaction aborts, its |
| CurTransactionContext is thrown away after finishing the abort processing; |
| but a committed subtransaction's CurTransactionContext is kept until top-level |
| commit (unless of course one of the intermediate levels of subtransaction |
| aborts). This ensures that we do not keep data from a failed subtransaction |
| longer than necessary. Because of this behavior, you must be careful to clean |
| up properly during subtransaction abort --- the subtransaction's state must be |
| delinked from any pointers or lists kept in upper transactions, or you will |
| have dangling pointers leading to a crash at top-level commit. An example of |
| data kept here is pending NOTIFY messages, which are sent at top-level commit, |
| but only if the generating subtransaction did not abort. |
| |
| QueryContext --- this is not actually a separate context, but a global |
| variable pointing to the context that holds the current command's parse tree. |
| (In simple-Query mode this points to MessageContext; when executing a |
| prepared statement it will point to the prepared statement's private context. |
| Note that the plan tree may or may not be in this same context.) |
| Generally it is not appropriate for any code to use QueryContext as an |
| allocation target --- from the point of view of any code that would be |
| referencing the QueryContext variable, it's a read-only context. |
| |
| PortalContext --- this is not actually a separate context, but a |
| global variable pointing to the per-portal context of the currently active |
| execution portal. This can be used if it's necessary to allocate storage |
| that will live just as long as the execution of the current portal requires. |
| |
| ErrorContext --- this permanent context will be switched into for error |
| recovery processing, and then reset on completion of recovery. We'll |
| arrange to have, say, 8K of memory available in it at all times. In this |
| way, we can ensure that some memory is available for error recovery even |
| if the backend has run out of memory otherwise. This allows out-of-memory |
| to be treated as a normal ERROR condition, not a FATAL error. |
| |
| |
| Contexts For Prepared Statements And Portals |
| -------------------------------------------- |
| |
| A prepared-statement object has an associated private context, in which |
| the parse and plan trees for its query are stored. Because these trees |
| are read-only to the executor, the prepared statement can be re-used many |
| times without further copying of these trees. |
| |
| An execution-portal object has a private context that is referenced by |
| PortalContext when the portal is active. In the case of a portal created |
| by DECLARE CURSOR, this private context contains the query parse and plan |
| trees (there being no other object that can hold them). Portals created |
| from prepared statements simply reference the prepared statements' trees, |
| and won't actually need any storage allocated in their private contexts. |
| |
| |
| Transient Contexts During Execution |
| ----------------------------------- |
| |
| When creating a prepared statement, the parse and plan trees will be built |
| in a temporary context that's a child of MessageContext (so that it will |
| go away automatically upon error). On success, the finished plan is |
| copied to the prepared statement's private context, and the temp context |
| is released; this allows planner temporary space to be recovered before |
| execution begins. (In simple-Query mode we'll not bother with the extra |
| copy step, so the planner temp space stays around till end of query.) |
| |
| The top-level executor routines, as well as most of the "plan node" |
| execution code, will normally run in a context that is created by |
| ExecutorStart and destroyed by ExecutorEnd; this context also holds the |
| "plan state" tree built during ExecutorStart. Most of the memory |
| allocated in these routines is intended to live until end of query, |
| so this is appropriate for those purposes. The executor's top context |
| is a child of PortalContext, that is, the per-portal context of the |
| portal that represents the query's execution. |
| |
| The main improvement needed in the executor is that expression evaluation |
| --- both for qual testing and for computation of targetlist entries --- |
| needs to not leak memory. To do this, each ExprContext (expression-eval |
| context) created in the executor will now have a private memory context |
| associated with it, and we'll arrange to switch into that context when |
| evaluating expressions in that ExprContext. The plan node that owns the |
| ExprContext is responsible for resetting the private context to empty |
| when it no longer needs the results of expression evaluations. Typically |
| the reset is done at the start of each tuple-fetch cycle in the plan node. |
| |
| Note that this design gives each plan node its own expression-eval memory |
| context. This appears necessary to handle nested joins properly, since |
| an outer plan node might need to retain expression results it has computed |
| while obtaining the next tuple from an inner node --- but the inner node |
| might execute many tuple cycles and many expressions before returning a |
| tuple. The inner node must be able to reset its own expression context |
| more often than once per outer tuple cycle. Fortunately, memory contexts |
| are cheap enough that giving one to each plan node doesn't seem like a |
| problem. |
| |
| A problem with running index accesses and sorts in a query-lifespan context |
| is that these operations invoke datatype-specific comparison functions, |
| and if the comparators leak any memory then that memory won't be recovered |
| till end of query. The comparator functions all return bool or int32, |
| so there's no problem with their result data, but there can be a problem |
| with leakage of internal temporary data. In particular, comparator |
| functions that operate on TOAST-able data types will need to be careful |
| not to leak detoasted versions of their inputs. This is annoying, but |
| it appears a lot easier to make the comparators conform than to fix the |
| index and sort routines, so that's what I propose to do for 7.1. Further |
| cleanup can be left for another day. |
| |
| There will be some special cases, such as aggregate functions. nodeAgg.c |
| needs to remember the results of evaluation of aggregate transition |
| functions from one tuple cycle to the next, so it can't just discard |
| all per-tuple state in each cycle. The easiest way to handle this seems |
| to be to have two per-tuple contexts in an aggregate node, and to |
| ping-pong between them, so that at each tuple one is the active allocation |
| context and the other holds any results allocated by the prior cycle's |
| transition function. |
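
The ping-pong arrangement might look roughly like this. It is a toy
sketch, not nodeAgg.c: ToyContext here is a tiny fixed pool of long
slots, and toy_sum_trans() is an invented transition function that
returns its new state as freshly allocated, pass-by-reference data.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Toy context: a small pool of long-sized slots. */
typedef struct ToyContext
{
    long        slots[TOY_SLOTS];
    size_t      used;
} ToyContext;

long *
toy_alloc_long(ToyContext *cxt)
{
    assert(cxt->used < TOY_SLOTS);
    return &cxt->slots[cxt->used++];
}

void
toy_reset(ToyContext *cxt)
{
    cxt->used = 0;
}

/*
 * Transition function: allocates its result (and some scratch space it
 * never frees) in the context it is given.
 */
long *
toy_sum_trans(ToyContext *cxt, const long *oldstate, long input)
{
    long       *scratch = toy_alloc_long(cxt);
    long       *newstate = toy_alloc_long(cxt);

    *scratch = input;
    *newstate = *oldstate + *scratch;
    return newstate;
}

/* Sums 1..4 while alternating between two per-tuple contexts. */
long
toy_demo_pingpong(void)
{
    static const long inputs[] = {1, 2, 3, 4};
    ToyContext  cxts[2] = {{{0}, 0}, {{0}, 0}};
    int         active = 0;
    long       *state = toy_alloc_long(&cxts[1]);   /* initial state */

    *state = 0;
    for (int i = 0; i < 4; i++)
    {
        /* the active context holds nothing we still need: clear it */
        toy_reset(&cxts[active]);
        /* old state is read from the other context, new one built here */
        state = toy_sum_trans(&cxts[active], state, inputs[i]);
        /* the context just filled becomes the "other" one next cycle */
        active = 1 - active;
    }
    return *state;
}
```

Each context is reset every second cycle, so the scratch space leaked by
the transition function never accumulates beyond one cycle's worth per
context.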
| |
| Executor routines that switch the active CurrentMemoryContext may need |
| to copy data into their caller's current memory context before returning. |
| I think there will be relatively little need for that, because of the |
| convention of resetting the per-tuple context at the *start* of an |
| execution cycle rather than at its end. With that rule, an execution |
| node can return a tuple that is palloc'd in its per-tuple context, and |
| the tuple will remain good until the node is called for another tuple |
| or told to end execution. This is pretty much the same state of affairs |
| that exists now, since a scan node can return a direct pointer to a tuple |
| in a disk buffer that is only guaranteed to remain good that long. |
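
The reset-at-start convention can be illustrated with a toy node whose
"tuples" are short strings. All names here (ToyContext, ToyNode,
toy_node_next) are hypothetical, and the context is a simple bump
allocator over a fixed buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy context: bump allocation out of a fixed buffer. */
typedef struct ToyContext
{
    char        space[256];
    size_t      used;
} ToyContext;

void *
toy_alloc(ToyContext *cxt, size_t size)
{
    void       *p;

    assert(size <= sizeof(cxt->space) - cxt->used);
    p = cxt->space + cxt->used;
    cxt->used += size;
    return p;
}

void
toy_reset(ToyContext *cxt)
{
    cxt->used = 0;
}

typedef struct ToyNode
{
    ToyContext  per_tuple;      /* reset at the start of each cycle */
    int         next_value;
} ToyNode;

/* Return the next "tuple", allocated in the node's per-tuple context. */
const char *
toy_node_next(ToyNode *node)
{
    char       *tuple;

    toy_reset(&node->per_tuple);    /* reclaim the previous cycle first */
    tuple = toy_alloc(&node->per_tuple, 16);
    snprintf(tuple, 16, "row %d", node->next_value++);
    return tuple;                   /* valid until the next call */
}

/* Returns 1 if tuple lifetimes behave as described in the text. */
int
toy_demo_lifetime(void)
{
    ToyNode     node = {.next_value = 1};
    char        saved[16];
    const char *first;
    const char *second;
    int         first_ok;

    first = toy_node_next(&node);
    first_ok = (strcmp(first, "row 1") == 0);   /* no reset since return */
    strcpy(saved, first);           /* caller copies what it must keep */

    second = toy_node_next(&node);  /* this call recycles the storage */
    return first_ok &&
        second == first &&
        strcmp(second, "row 2") == 0 &&
        strcmp(saved, "row 1") == 0;
}
```

Had the reset been done at the end of toy_node_next() instead, the
returned pointer would already be dead when the caller received it.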
| |
| A more common reason for copying data will be to transfer a result from |
| per-tuple context to per-run context; for example, a Unique node will |
| save the last distinct tuple value in its per-run context, requiring a |
| copy step. |
| |
| Another interesting special case is VACUUM, which needs to allocate |
| working space that will survive its forced transaction commits, yet |
| be released on error. Currently it does that through a "portal", |
| which is essentially a child context of TopMemoryContext. While that |
| way still works, it's ugly since xact abort needs special processing |
| to delete the portal. Better would be to use a context that's a child |
| of PortalContext and hence is certain to go away as part of normal |
| processing. (Eventually we might have an even better solution from |
| nested transactions, but this'll do fine for now.) |
| |
| |
| Mechanisms to Allow Multiple Types of Contexts |
| ---------------------------------------------- |
| |
| We may want several different types of memory contexts with different |
| allocation policies but similar external behavior. To handle this, |
| memory allocation functions will be accessed via function pointers, |
| and we will require all context types to obey the conventions given here. |
| (This is not very far different from the existing code.) |
| |
| A memory context will be represented by an object like |
| |
| typedef struct MemoryContextData |
| { |
|     NodeTag type; /* identifies exact kind of context */ |
|     MemoryContextMethods methods; |
|     struct MemoryContextData *parent; /* NULL if no parent (toplevel context) */ |
|     struct MemoryContextData *firstchild; /* head of linked list of children */ |
|     struct MemoryContextData *nextchild; /* next child of same parent */ |
|     char *name; /* context name (just for debugging) */ |
| } MemoryContextData, *MemoryContext; |
| |
| This is essentially an abstract superclass, and the "methods" pointer is |
| its virtual function table. Specific memory context types will use |
| derived structs having these fields as their first fields. All the |
| contexts of a specific type will have methods pointers that point to the |
| same static table of function pointers, which will look like |
| |
| typedef struct MemoryContextMethodsData |
| { |
|     Pointer (*alloc) (MemoryContext c, Size size); |
|     void (*free_p) (Pointer chunk); |
|     Pointer (*realloc) (Pointer chunk, Size newsize); |
|     void (*reset) (MemoryContext c); |
|     void (*delete) (MemoryContext c); |
| } MemoryContextMethodsData, *MemoryContextMethods; |
| |
| Alloc, reset, and delete requests will take a MemoryContext pointer |
| as parameter, so they'll have no trouble finding the method pointer |
| to call. Free and realloc are trickier. To make those work, we will |
| require all memory context types to produce allocated chunks that |
| are immediately preceded by a standard chunk header, which has the |
| layout |
| |
| typedef struct StandardChunkHeader |
| { |
|     MemoryContext mycontext; /* Link to owning context object */ |
|     Size size; /* Allocated size of chunk */ |
| } StandardChunkHeader; |
| |
| It turns out that the existing aset.c memory context type does this |
| already, and probably any other kind of context would need to have the |
| same data available to support realloc, so this is not really creating |
| any additional overhead. (Note that if a context type needs more |
| per-allocated-chunk information than this, it can make an additional |
| nonstandard header that precedes the standard header. So we're not |
| constraining context-type designers very much.) |
| |
| Given this, the pfree routine will look something like |
| |
|     StandardChunkHeader *header = |
|         (StandardChunkHeader *) ((char *) p - sizeof(StandardChunkHeader)); |
| |
|     (*header->mycontext->methods->free_p) (p); |
| |
| We could do it as a macro, but the macro would have to evaluate its |
| argument twice, which seems like a bad idea (the current pfree macro |
| does not do that). This is already saving two levels of function call |
| compared to the existing code, so I think we're doing fine without |
| squeezing out that last little bit ... |
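
Putting the pieces together, a self-contained toy version of the
header-based dispatch could look like this. The Toy* names are invented
for the sketch, malloc() stands in for a real block allocator, and only
the alloc and free_p methods are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyContext ToyContext;

/* Toy method table: only the two methods needed for the demo. */
typedef struct ToyMethods
{
    void       *(*alloc) (ToyContext *cxt, size_t size);
    void        (*free_p) (void *chunk);
} ToyMethods;

struct ToyContext
{
    const ToyMethods *methods;  /* shared by all contexts of this type */
    int         live_chunks;    /* chunks currently allocated here */
};

/* Header that immediately precedes every chunk handed to callers. */
typedef struct ToyChunkHeader
{
    ToyContext *mycontext;      /* link to owning context */
    size_t      size;           /* allocated size of chunk */
} ToyChunkHeader;

/* Step back from a chunk pointer to its header. */
ToyChunkHeader *
toy_header(void *chunk)
{
    return (ToyChunkHeader *) ((char *) chunk - sizeof(ToyChunkHeader));
}

void *
toy_simple_alloc(ToyContext *cxt, size_t size)
{
    ToyChunkHeader *h = malloc(sizeof(ToyChunkHeader) + size);

    if (h == NULL)
        abort();
    h->mycontext = cxt;
    h->size = size;
    cxt->live_chunks++;
    return (char *) h + sizeof(ToyChunkHeader);
}

void
toy_simple_free(void *chunk)
{
    ToyChunkHeader *h = toy_header(chunk);

    h->mycontext->live_chunks--;
    free(h);
}

const ToyMethods toy_simple_methods = {toy_simple_alloc, toy_simple_free};

/* Frees a chunk without being told which context owns it. */
void
toy_pfree(void *chunk)
{
    ToyChunkHeader *h = toy_header(chunk);

    (*h->mycontext->methods->free_p) (chunk);
}

/* Returns 1 if each chunk was credited back to its own context. */
int
toy_demo_pfree(void)
{
    ToyContext  a = {&toy_simple_methods, 0};
    ToyContext  b = {&toy_simple_methods, 0};
    void       *pa = (*a.methods->alloc) (&a, 24);
    void       *pb = (*b.methods->alloc) (&b, 8);
    int         owner_ok;

    /* the header also answers "which context owns this chunk?" */
    owner_ok = toy_header(pa)->mycontext == &a &&
        toy_header(pb)->mycontext == &b;

    toy_pfree(pa);
    toy_pfree(pb);
    return owner_ok && a.live_chunks == 0 && b.live_chunks == 0;
}
```

Note that toy_pfree() never consults any notion of a current context;
the chunk header alone is enough to route the request.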
| |
| |
| More Control Over aset.c Behavior |
| --------------------------------- |
| |
| Currently, aset.c allocates an 8K block upon the first allocation in |
| a context, and doubles that size for each successive block request. |
| That's good behavior for a context that might hold *lots* of data, and |
| the overhead wasn't bad when we had only a few contexts in existence. |
| With dozens if not hundreds of smaller contexts in the system, we will |
| want to be able to fine-tune things a little better. |
| |
| The creator of a context will be able to specify an initial block size |
| and a maximum block size. Selecting smaller values will prevent wastage |
| of space in contexts that aren't expected to hold very much (an example is |
| the relcache's per-relation contexts). |
| |
| Also, it will be possible to specify a minimum context size. If this |
| value is greater than zero then a block of that size will be grabbed |
| immediately upon context creation, and cleared but not released during |
| context resets. This feature is needed for ErrorContext (see above), |
| but will most likely not be used for other contexts. |
| |
| We expect that per-tuple contexts will be reset frequently and typically |
| will not allocate very much space per tuple cycle. To make this usage |
| pattern cheap, the first block allocated in a context is not given |
| back to malloc() during reset, but just cleared. This avoids malloc |
| thrashing. |
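
As a rough illustration of how the initial and maximum block sizes
interact, the sketch below computes the sequence of block sizes a
context would request. It assumes the doubling simply stops at the
maximum; ToyBlockPolicy and the toy_* functions are hypothetical names,
not aset.c's interface.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the block-size parameters chosen by a context's creator. */
typedef struct ToyBlockPolicy
{
    size_t      max_block_size;     /* never request more than this */
    size_t      next_block_size;    /* size of the next block to request */
} ToyBlockPolicy;

void
toy_policy_init(ToyBlockPolicy *p, size_t init_size, size_t max_size)
{
    assert(init_size > 0 && init_size <= max_size);
    p->max_block_size = max_size;
    p->next_block_size = init_size;
}

/* Return the size for this block request and double it for the next. */
size_t
toy_next_block(ToyBlockPolicy *p)
{
    size_t      size = p->next_block_size;

    if (size <= p->max_block_size / 2)
        p->next_block_size = size * 2;
    else
        p->next_block_size = p->max_block_size;     /* assumed cap */
    return size;
}

/* Total bytes requested after nblocks successive block requests. */
size_t
toy_total_after(size_t init_size, size_t max_size, int nblocks)
{
    ToyBlockPolicy p;
    size_t      total = 0;

    toy_policy_init(&p, init_size, max_size);
    for (int i = 0; i < nblocks; i++)
        total += toy_next_block(&p);
    return total;
}
```

For example, with a 1K initial size and an 8K maximum, six block
requests come to 1K + 2K + 4K + 8K + 8K + 8K, so a small context that
never needs a second block costs only its first 1K.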
| |
| |
| Other Notes |
| ----------- |
| |
| The original version of this proposal suggested that functions returning |
| pass-by-reference datatypes should be required to return a value freshly |
| palloc'd in their caller's memory context, never a pointer to an input |
| value. I've abandoned that notion since it clearly is prone to error. |
| In the current proposal, it is possible to discover which context a |
| chunk of memory is allocated in (by checking the required standard chunk |
| header), so nodeAgg can determine whether or not it's safe to reset |
| its working context; it doesn't have to rely on the transition function |
| to do what it's expecting. |