| src/backend/utils/mmgr/README |
| |
| Memory Context System Design Overview |
| ===================================== |
| |
| Background |
| ---------- |
| |
| We do most of our memory allocation in "memory contexts", which are usually |
| AllocSets as implemented by src/backend/utils/mmgr/aset.c. The key to |
| successful memory management without lots of overhead is to define a useful |
| set of contexts with appropriate lifespans. |
| |
| The basic operations on a memory context are: |
| |
| * create a context |
| |
| * allocate a chunk of memory within a context (equivalent of standard |
| C library's malloc()) |
| |
| * delete a context (including freeing all the memory allocated therein) |
| |
| * reset a context (free all memory allocated in the context, but not the |
| context object itself) |
| |
| * inquire about the total amount of memory allocated to the context |
| (the raw memory from which the context allocates chunks; not the |
| chunks themselves) |
| |
| Given a chunk of memory previously allocated from a context, one can |
| free it or reallocate it larger or smaller (corresponding to standard C |
| library's free() and realloc() routines). These operations return memory |
| to or get more memory from the same context the chunk was originally |
| allocated in. |
| |
| At all times there is a "current" context denoted by the |
| CurrentMemoryContext global variable. palloc() implicitly allocates space |
| in that context. The MemoryContextSwitchTo() operation selects a new current |
| context (and returns the previous context, so that the caller can restore the |
| previous context before exiting). |
| |
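The save-and-restore idiom around MemoryContextSwitchTo() can be sketched
with a toy model. This is standalone C, not PostgreSQL code: ToyContext,
toy_current, toy_switch_to and toy_work_in are hypothetical names that only
mimic the roles of MemoryContext, CurrentMemoryContext and
MemoryContextSwitchTo().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory context; only its identity matters. */
typedef struct ToyContext
{
    const char *name;
} ToyContext;

/* Plays the role of the CurrentMemoryContext global. */
ToyContext *toy_current = NULL;

/* Select a new current context and hand back the previous one. */
ToyContext *
toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    assert(cxt != NULL);
    toy_current = cxt;
    return old;
}

/* Do some work in "scratch", then restore whatever was current before. */
void
toy_work_in(ToyContext *scratch)
{
    ToyContext *old = toy_switch_to(scratch);

    /* ... allocations made here would land in "scratch" ... */

    toy_switch_to(old);
}
```

Because the switch returns the previous context, a routine can restore its
caller's context without knowing which context that was.
| |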
| The main advantage of memory contexts over plain use of malloc/free is |
| that the entire contents of a memory context can be freed easily, without |
| having to request freeing of each individual chunk within it. This is |
| both faster and more reliable than per-chunk bookkeeping. We use this |
| fact to clean up at transaction end: by resetting all the active contexts |
| of transaction or shorter lifespan, we can reclaim all transient memory. |
| Similarly, we can clean up at the end of each query, or after each tuple |
| is processed during a query. |
| |
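As a rough illustration of why freeing a whole context is cheap, here is a
minimal standalone sketch. It is a toy model built on malloc(), not how
aset.c works; ToyContext, ToyChunk, toy_alloc and toy_reset are hypothetical
names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chunk header: links every chunk into its context's list. */
typedef struct ToyChunk
{
    struct ToyChunk *next;
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk   *chunks;     /* all chunks allocated in this context */
    size_t      nchunks;    /* number of live chunks */
} ToyContext;

/* Allocate "size" bytes owned by "cxt"; aborts instead of returning NULL. */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
    ToyChunk   *chunk;

    assert(cxt != NULL);
    chunk = malloc(sizeof(ToyChunk) + size);
    if (chunk == NULL)
        abort();
    chunk->next = cxt->chunks;
    cxt->chunks = chunk;
    cxt->nchunks++;
    return chunk + 1;       /* usable space starts after the header */
}

/* Free every chunk in one pass; the context itself stays usable. */
void
toy_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}
```

The caller never tracks individual chunks: one toy_reset() call at the end
of a unit of work reclaims everything allocated during it.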
| |
| Some Notes About the palloc API Versus Standard C Library |
| --------------------------------------------------------- |
| |
| The behavior of palloc and friends is similar to the standard C library's |
| malloc and friends, but there are some deliberate differences too. Here |
| are some notes to clarify the behavior. |
| |
| * If out of memory, palloc and repalloc exit via elog(ERROR). They |
| never return NULL, and it is not necessary or useful to test for such |
| a result. With palloc_extended() that behavior can be overridden |
| using the MCXT_ALLOC_NO_OOM flag. |
| |
| * palloc(0) is explicitly a valid operation. It does not return a NULL |
| pointer, but a valid chunk of which no bytes may be used. However, the |
| chunk might later be repalloc'd larger; it can also be pfree'd without |
| error. Similarly, repalloc allows realloc'ing to zero size. |
| |
| * pfree and repalloc do not accept a NULL pointer. This is intentional. |
| |
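The three rules above can be mimicked with a small wrapper around the
standard C library. This is only a sketch under stated assumptions:
toy_alloc and toy_realloc are hypothetical names, and abort() merely stands
in for elog(ERROR), which in PostgreSQL abandons the current operation
rather than the whole process.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical malloc wrapper: never returns NULL, and a zero-size request
 * still yields a valid pointer that can later be resized or freed.
 */
void *
toy_alloc(size_t size)
{
    /* malloc(0) may legally return NULL, so request at least one byte */
    void       *ptr = malloc(size > 0 ? size : 1);

    if (ptr == NULL)
    {
        fprintf(stderr, "out of memory\n");
        abort();            /* stands in for elog(ERROR) */
    }
    return ptr;
}

/* Resize a chunk; a NULL input is a caller bug, not a request to allocate. */
void *
toy_realloc(void *ptr, size_t size)
{
    void       *result;

    assert(ptr != NULL);
    result = realloc(ptr, size > 0 ? size : 1);
    if (result == NULL)
    {
        fprintf(stderr, "out of memory\n");
        abort();
    }
    return result;
}
```

Callers of such wrappers need no NULL checks on the result, which is the
practical payoff of the first rule.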
| |
| The Current Memory Context |
| -------------------------- |
| |
| Because it would be too much notational overhead to always pass an |
| appropriate memory context to called routines, there always exists the |
| notion of the current memory context CurrentMemoryContext. Without it, |
| for example, the copyObject routines would need to be passed a context, as |
| would function execution routines that return a pass-by-reference |
| datatype. Similarly for routines that temporarily allocate space |
| internally, but don't return it to their caller. We certainly don't |
| want to clutter every call in the system with "here is a context to |
| use for any temporary memory allocation you might want to do". |
| |
| The upshot of that reasoning, though, is that CurrentMemoryContext should |
| generally point at a short-lifespan context if at all possible. During |
| query execution it usually points to a context that gets reset after each |
| tuple. Only in *very* circumscribed code should it ever point at a |
| context having greater than transaction lifespan, since doing so risks |
| permanent memory leaks. |
| |
| |
| pfree/repalloc Do Not Depend On CurrentMemoryContext |
| ---------------------------------------------------- |
| |
| pfree() and repalloc() can be applied to any chunk whether it belongs |
| to CurrentMemoryContext or not --- the chunk's owning context will be |
| invoked to handle the operation, regardless. |
| |
| |
| "Parent" and "Child" Contexts |
| ----------------------------- |
| |
| If all contexts were independent, it'd be hard to keep track of them, |
| especially in error cases. That is solved by creating a tree of |
| "parent" and "child" contexts. When creating a memory context, the |
| new context can be specified to be a child of some existing context. |
| A context can have many children, but only one parent. In this way |
| the contexts form a forest (not necessarily a single tree, since there |
| could be more than one top-level context; although in current practice |
| there is only one top context, TopMemoryContext). |
| |
| Deleting a context deletes all its direct and indirect children as |
| well. When resetting a context it's almost always more useful to |
| delete child contexts, thus MemoryContextReset() means that, and if |
| you really do want a tree of empty contexts you need to call |
| MemoryContextResetOnly() plus MemoryContextResetChildren(). |
| |
| These features allow us to manage a lot of contexts without fear that |
| some will be leaked; we only need to keep track of one top-level |
| context that we are going to delete at transaction end, and make sure |
| that any shorter-lived contexts we create are descendants of that |
| context. Since the tree can have multiple levels, we can deal easily |
| with nested lifetimes of storage, such as per-transaction, |
| per-statement, per-scan, per-tuple. Storage lifetimes that only |
| partially overlap can be handled by allocating from different trees of |
| the context forest (there are some examples in the next section). |
| |
| For convenience we also provide operations like "reset/delete all children |
| of a given context, but don't reset or delete that context itself". |
| |
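The tree behavior described in this section can be sketched in standalone
C. The names below (ToyContext, toy_create, toy_delete,
toy_delete_children, toy_live_contexts) are hypothetical, and the model
tracks only the tree links, not any memory owned by the contexts.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *firstchild;
    struct ToyContext *nextchild;   /* next sibling under the same parent */
} ToyContext;

/* Number of contexts currently alive, so leaks are easy to observe. */
int         toy_live_contexts = 0;

/* Create a context, optionally as a child of "parent". */
ToyContext *
toy_create(ToyContext *parent)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    cxt->parent = parent;
    if (parent != NULL)
    {
        cxt->nextchild = parent->firstchild;
        parent->firstchild = cxt;
    }
    toy_live_contexts++;
    return cxt;
}

void        toy_delete(ToyContext *cxt);

/* Delete all direct and indirect children, but keep "cxt" itself. */
void
toy_delete_children(ToyContext *cxt)
{
    assert(cxt != NULL);
    while (cxt->firstchild != NULL)
        toy_delete(cxt->firstchild);
}

/* Delete "cxt" together with its whole subtree. */
void
toy_delete(ToyContext *cxt)
{
    toy_delete_children(cxt);
    if (cxt->parent != NULL)
    {
        /* unlink from the parent's list of children */
        ToyContext **link = &cxt->parent->firstchild;

        while (*link != cxt)
            link = &(*link)->nextchild;
        *link = cxt->nextchild;
    }
    free(cxt);
    toy_live_contexts--;
}
```

Only the top of a subtree has to be remembered; deleting it takes every
descendant along, however deep the nesting.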
| |
| Memory Context Reset/Delete Callbacks |
| ------------------------------------- |
| |
| A feature introduced in Postgres 9.5 allows memory contexts to be used |
| for managing more resources than just plain palloc'd memory. This is |
| done by registering a "reset callback function" for a memory context. |
| Such a function will be called, once, just before the context is next |
| reset or deleted. It can be used to give up resources that are in some |
| sense associated with an object allocated within the context. Possible |
| use-cases include: |
| * closing open files associated with a tuplesort object; |
| * releasing reference counts on long-lived cache objects that are held |
| by some object within the context being reset; |
| * freeing malloc-managed memory associated with some palloc'd object. |
| That last case would just represent bad programming practice for pure |
| Postgres code; better to have made all the allocations using palloc, |
| in the target context or some child context. However, it could well |
| come in handy for code that interfaces to non-Postgres libraries. |
| |
| Any number of reset callbacks can be established for a memory context; |
| they are called in reverse order of registration. Also, callbacks |
| attached to child contexts are called before callbacks attached to |
| parent contexts, if a tree of contexts is being reset or deleted. |
| |
| The API for this requires the caller to provide a MemoryContextCallback |
| memory chunk to hold the state for a callback. Typically this should be |
| allocated in the same context it is logically attached to, so that it |
| will be released automatically after use. The reason for asking the |
| caller to provide this memory is that in most usage scenarios, the caller |
| will be creating some larger struct within the target context, and the |
| MemoryContextCallback struct can be made "for free" without a separate |
| palloc() call by including it in this larger struct. |
| |
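A standalone sketch of the callback mechanism follows. It is modeled only
loosely on the description above: ToyCallback, ToyContext, ToyResource and
the toy_* functions are hypothetical, and a real context would invoke the
callbacks itself as part of reset or delete.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ToyCallbackFunction) (void *arg);

/* Caller-provided callback state; the context only links it into a list. */
typedef struct ToyCallback
{
    ToyCallbackFunction func;
    void       *arg;
    struct ToyCallback *next;
} ToyCallback;

typedef struct ToyContext
{
    ToyCallback *callbacks; /* most recently registered first */
} ToyContext;

void
toy_register_callback(ToyContext *cxt, ToyCallback *cb)
{
    assert(cb->func != NULL);
    cb->next = cxt->callbacks;
    cxt->callbacks = cb;
}

/* Run each callback once, newest first, unlinking it before the call. */
void
toy_call_callbacks(ToyContext *cxt)
{
    ToyCallback *cb;

    while ((cb = cxt->callbacks) != NULL)
    {
        cxt->callbacks = cb->next;
        cb->func(cb->arg);
    }
}

/* Counter used to stamp the order in which resources get released. */
int         toy_release_counter = 0;

/* A larger object that embeds its callback struct: no extra allocation. */
typedef struct ToyResource
{
    int         release_order;  /* 0 while held, else 1-based release order */
    ToyCallback cleanup;
} ToyResource;

void
toy_release(void *arg)
{
    ((ToyResource *) arg)->release_order = ++toy_release_counter;
}
```

Pushing each new callback onto the head of the list is what yields the
reverse-of-registration call order.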
| |
| Memory Contexts in Practice |
| =========================== |
| |
| Globally Known Contexts |
| ----------------------- |
| |
| There are a few widely-known contexts that are typically referenced |
| through global variables. At any instant the system may contain many |
| additional contexts, but all other contexts should be direct or indirect |
| children of one of these contexts to ensure they are not leaked in event |
| of an error. |
| |
| TopMemoryContext --- this is the actual top level of the context tree; |
| every other context is a direct or indirect child of this one. Allocating |
| here is essentially the same as "malloc", because this context will never |
| be reset or deleted. This is for stuff that should live forever, or for |
| stuff that the controlling module will take care of deleting at the |
| appropriate time. An example is fd.c's tables of open files. Avoid |
| allocating stuff here unless really necessary, and especially avoid |
| running with CurrentMemoryContext pointing here. |
| |
| PostmasterContext --- this is the postmaster's normal working context. |
| After a backend is spawned, it can delete PostmasterContext to free its |
| copy of memory the postmaster was using that it doesn't need. |
| Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf |
| and pg_ident.conf data is used directly during authentication in backend |
| processes; so backends can't delete PostmasterContext until that's done. |
| (The postmaster has only TopMemoryContext, PostmasterContext, and |
| ErrorContext --- the remaining top-level contexts are set up in each |
| backend during startup.) |
| |
| CacheMemoryContext --- permanent storage for relcache, catcache, and |
| related modules. This will never be reset or deleted, either, so it's |
| not truly necessary to distinguish it from TopMemoryContext. But it |
| seems worthwhile to maintain the distinction for debugging purposes. |
| (Note: CacheMemoryContext has child contexts with shorter lifespans. |
| For example, a child context is the best place to keep the subsidiary |
| storage associated with a relcache entry; that way we can free rule |
| parsetrees and so forth easily, without having to depend on constructing |
| a reliable version of freeObject().) |
| |
| MessageContext --- this context holds the current command message from the |
| frontend, as well as any derived storage that need only live as long as |
| the current message (for example, in simple-Query mode the parse and plan |
| trees can live here). This context will be reset, and any children |
| deleted, at the top of each cycle of the outer loop of PostgresMain. This |
| is kept separate from per-transaction and per-portal contexts because a |
| query string might need to live either a longer or shorter time than any |
| single transaction or portal. |
| |
| TopTransactionContext --- this holds everything that lives until end of the |
| top-level transaction. This context will be reset, and all its children |
| deleted, at conclusion of each top-level transaction cycle. In most cases |
| you don't want to allocate stuff directly here, but in CurTransactionContext; |
| what does belong here is control information that exists explicitly to manage |
| status across multiple subtransactions. Note: this context is NOT cleared |
| immediately upon error; its contents will survive until the transaction block |
| is exited by COMMIT/ROLLBACK. |
| |
| CurTransactionContext --- this holds data that has to survive until the end |
| of the current transaction, and in particular will be needed at top-level |
| transaction commit. When we are in a top-level transaction this is the same |
| as TopTransactionContext, but in subtransactions it points to a child context. |
| It is important to understand that if a subtransaction aborts, its |
| CurTransactionContext is thrown away after finishing the abort processing; |
| but a committed subtransaction's CurTransactionContext is kept until top-level |
| commit (unless of course one of the intermediate levels of subtransaction |
| aborts). This ensures that we do not keep data from a failed subtransaction |
| longer than necessary. Because of this behavior, you must be careful to clean |
| up properly during subtransaction abort --- the subtransaction's state must be |
| delinked from any pointers or lists kept in upper transactions, or you will |
| have dangling pointers leading to a crash at top-level commit. An example of |
| data kept here is pending NOTIFY messages, which are sent at top-level commit, |
| but only if the generating subtransaction did not abort. |
| |
| QueryContext --- this is not actually a separate context, but a global |
| variable pointing to the context that holds the current command's parse tree. |
| (In simple-Query mode this points to MessageContext; when executing a |
| prepared statement it will point to the prepared statement's private context. |
| Note that the plan tree may or may not be in this same context.) |
| Generally it is not appropriate for any code to use QueryContext as an |
| allocation target --- from the point of view of any code that would be |
| referencing the QueryContext variable, it's a read-only context. |
| |
| PortalContext --- this is not actually a separate context, but a |
| global variable pointing to the per-portal context of the currently active |
| execution portal. This can be used if it's necessary to allocate storage |
| that will live just as long as the execution of the current portal requires. |
| |
| ErrorContext --- this permanent context is switched into for error |
| recovery processing, and then reset on completion of recovery. We arrange |
| to have a few KB of memory available in it at all times. In this way, we |
| can ensure that some memory is available for error recovery even if the |
| backend has run out of memory otherwise. This allows out-of-memory to be |
| treated as a normal ERROR condition, not a FATAL error. |
| |
| |
| Contexts For Prepared Statements And Portals |
| -------------------------------------------- |
| |
| A prepared-statement object has an associated private context, in which |
| the parse and plan trees for its query are stored. Because these trees |
| are read-only to the executor, the prepared statement can be re-used many |
| times without further copying of these trees. |
| |
| An execution-portal object has a private context that is referenced by |
| PortalContext when the portal is active. In the case of a portal created |
| by DECLARE CURSOR, this private context contains the query parse and plan |
| trees (there being no other object that can hold them). Portals created |
| from prepared statements simply reference the prepared statements' trees, |
| and don't actually need any storage allocated in their private contexts. |
| |
| |
| Logical Replication Worker Contexts |
| ----------------------------------- |
| |
| ApplyContext --- permanent during whole lifetime of apply worker. It |
| is possible to use TopMemoryContext here as well, but for simplicity |
| of memory usage analysis we spin up a different context. |
| |
| ApplyMessageContext --- short-lived context that is reset after each |
| logical replication protocol message is processed. |
| |
| |
| Transient Contexts During Execution |
| ----------------------------------- |
| |
| When creating a prepared statement, the parse and plan trees will be built |
| in a temporary context that's a child of MessageContext (so that it will |
| go away automatically upon error). On success, the finished plan is |
| copied to the prepared statement's private context, and the temp context |
| is released; this allows planner temporary space to be recovered before |
| execution begins. (In simple-Query mode we don't bother with the extra |
| copy step, so the planner temp space stays around till end of query.) |
| |
| The top-level executor routines, as well as most of the "plan node" |
| execution code, will normally run in a context that is created by |
| ExecutorStart and destroyed by ExecutorEnd; this context also holds the |
| "plan state" tree built during ExecutorStart. Most of the memory |
| allocated in these routines is intended to live until end of query, |
| so this is appropriate for those purposes. The executor's top context |
| is a child of PortalContext, that is, the per-portal context of the |
| portal that represents the query's execution. |
| |
| The main memory-management consideration in the executor is that |
| expression evaluation --- both for qual testing and for computation of |
| targetlist entries --- needs to not leak memory. To do this, each |
| ExprContext (expression-eval context) created in the executor has a |
| private memory context associated with it, and we switch into that context |
| when evaluating expressions in that ExprContext. The plan node that owns |
| the ExprContext is responsible for resetting the private context to empty |
| when it no longer needs the results of expression evaluations. Typically |
| the reset is done at the start of each tuple-fetch cycle in the plan node. |
| |
| Note that this design gives each plan node its own expression-eval memory |
| context. This appears necessary to handle nested joins properly, since |
| an outer plan node might need to retain expression results it has computed |
| while obtaining the next tuple from an inner node --- but the inner node |
| might execute many tuple cycles and many expressions before returning a |
| tuple. The inner node must be able to reset its own expression context |
| more often than once per outer tuple cycle. Fortunately, memory contexts |
| are cheap enough that giving one to each plan node doesn't seem like a |
| problem. |
| |
| A problem with running index accesses and sorts in a query-lifespan context |
| is that these operations invoke datatype-specific comparison functions, |
| and if the comparators leak any memory then that memory won't be recovered |
| till end of query. The comparator functions all return bool or int32, |
| so there's no problem with their result data, but there can be a problem |
| with leakage of internal temporary data. In particular, comparator |
| functions that operate on TOAST-able data types need to be careful |
| not to leak detoasted versions of their inputs. This is annoying, but |
| it appeared a lot easier to make the comparators conform than to fix the |
| index and sort routines, so that's what was done for 7.1. This remains |
| the state of affairs in btree and hash indexes, so btree and hash support |
| functions still need to not leak memory. Most of the other index AMs |
| have been modified to run opclass support functions in short-lived |
| contexts, so that leakage is not a problem; this is necessary in view |
| of the fact that their support functions tend to be far more complex. |
| |
| There are some special cases, such as aggregate functions. nodeAgg.c |
| needs to remember the results of evaluation of aggregate transition |
| functions from one tuple cycle to the next, so it can't just discard |
| all per-tuple state in each cycle. The easiest way to handle this seems |
| to be to have two per-tuple contexts in an aggregate node, and to |
| ping-pong between them, so that at each tuple one is the active allocation |
| context and the other holds any results allocated by the prior cycle's |
| transition function. |
| |
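The ping-pong idea can be shown with two fixed buffers standing in for the
two per-tuple contexts. This is a toy sketch, not nodeAgg.c's code:
ToyAggState and the toy_agg_* functions are hypothetical, and the
"transition function" here simply appends each input string to the running
state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_BUF_SIZE 64

typedef struct ToyAggState
{
    char        buf[2][TOY_BUF_SIZE];   /* stand-ins for the two contexts */
    int         active;                 /* buffer holding the current state */
} ToyAggState;

void
toy_agg_init(ToyAggState *agg)
{
    agg->active = 0;
    agg->buf[0][0] = '\0';
    agg->buf[1][0] = '\0';
}

/* Transition step: build the new state in the idle buffer, then swap. */
void
toy_agg_advance(ToyAggState *agg, const char *input)
{
    int         next = 1 - agg->active;

    assert(input != NULL);
    assert(strlen(agg->buf[agg->active]) + strlen(input) < TOY_BUF_SIZE);

    /* "reset" the idle buffer; the prior state stays intact in the other */
    agg->buf[next][0] = '\0';
    snprintf(agg->buf[next], TOY_BUF_SIZE, "%s%s",
             agg->buf[agg->active], input);
    agg->active = next;
}

const char *
toy_agg_value(const ToyAggState *agg)
{
    return agg->buf[agg->active];
}
```

The previous state is read from one buffer while the new state is written
into the other, so a reset never destroys data that is still needed.
| |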
| Executor routines that switch the active CurrentMemoryContext may need |
| to copy data into their caller's current memory context before returning. |
| However, we have minimized the need for that, because of the convention |
| of resetting the per-tuple context at the *start* of an execution cycle |
| rather than at its end. With that rule, an execution node can return a |
| tuple that is palloc'd in its per-tuple context, and the tuple will remain |
| good until the node is called for another tuple or told to end execution. |
| This parallels the situation with pass-by-reference values at the table |
| scan level, since a scan node can return a direct pointer to a tuple in a |
| disk buffer that is only guaranteed to remain good that long. |
| |
| A more common reason for copying data is to transfer a result from |
| per-tuple context to per-query context; for example, a Unique node will |
| save the last distinct tuple value in its per-query context, requiring a |
| copy step. |
| |
| |
| Mechanisms to Allow Multiple Types of Contexts |
| ---------------------------------------------- |
| |
| To efficiently allow for different allocation patterns, and for |
| experimentation, we allow for different types of memory contexts with |
| different allocation policies but similar external behavior. To |
| handle this, memory allocation functions are accessed via function |
| pointers, and we require all context types to obey the conventions |
| given here. |
| |
| A memory context is represented by struct MemoryContextData (see |
| memnodes.h). This struct identifies the exact type of the context, and |
| contains information common between the different types of |
| MemoryContext like the parent and child contexts, and the name of the |
| context. |
| |
| This is essentially an abstract superclass, and the behavior is |
| determined by the "methods" pointer, which is its virtual function table |
| (struct MemoryContextMethods). Specific memory context types will use |
| derived structs having these fields as their first fields. All the |
| contexts of a specific type will have methods pointers that point to |
| the same static table of function pointers. |
| |
| While operations like allocating from and resetting a context take the |
| relevant MemoryContext as a parameter, operations like free and |
| realloc are trickier. To make those work, we require all memory |
| context types to produce allocated chunks that are immediately, |
| without any padding, preceded by a pointer to the corresponding |
| MemoryContext. |
| |
| If a type of allocator needs additional information about its chunks, |
| like e.g. the size of the allocation, that information can in turn |
| precede the MemoryContext. This means the only overhead implied by |
| the memory context mechanism is a pointer to its context, so we're not |
| constraining context-type designers very much. |
| |
| Given this, routines like pfree determine their corresponding context |
| with an operation like (although that is usually encapsulated in |
| GetMemoryChunkContext()) |
| |
| MemoryContext context = *(MemoryContext*) (((char *) pointer) - sizeof(void *)); |
| |
| and then invoke the corresponding method for the context |
| |
| context->methods->free_p(pointer); |
| |
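The lookup above can be tried out in a self-contained toy allocator. This
is a sketch under stated assumptions, not PostgreSQL's implementation:
ToyContext, ToyMethods and the toy_* functions are hypothetical, and each
chunk is simply a malloc() block with the owner pointer stored in front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyContext ToyContext;

/* Per-type "virtual function table"; only free is modeled here. */
typedef struct ToyMethods
{
    void        (*free_p) (void *pointer);
} ToyMethods;

struct ToyContext
{
    const ToyMethods *methods;
    int         nfreed;     /* chunks freed so far, for demonstration */
};

/* Allocate a chunk whose usable space is directly preceded by its owner. */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
    char       *raw = malloc(sizeof(ToyContext *) + size);

    if (raw == NULL)
        abort();
    *(ToyContext **) raw = cxt;
    return raw + sizeof(ToyContext *);
}

/* Recover the owning context from the pointer stored before the chunk. */
ToyContext *
toy_chunk_context(void *pointer)
{
    assert(pointer != NULL);
    return *(ToyContext **) ((char *) pointer - sizeof(ToyContext *));
}

/* Context-type-specific free method: release the raw allocation. */
void
toy_simple_free(void *pointer)
{
    toy_chunk_context(pointer)->nfreed++;
    free((char *) pointer - sizeof(ToyContext *));
}

const ToyMethods toy_simple_methods = {toy_simple_free};

/* Generic entry point: takes no context and ignores any "current" one. */
void
toy_free(void *pointer)
{
    ToyContext *context = toy_chunk_context(pointer);

    context->methods->free_p(pointer);
}
```

Since the owner is found from the chunk itself, toy_free() works the same
way no matter which context happens to be current.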
| |
| More Control Over aset.c Behavior |
| --------------------------------- |
| |
| By default aset.c always allocates an 8K block upon the first |
| allocation in a context, and doubles that size for each successive |
| block request. That's good behavior for a context that might hold |
| *lots* of data. But if there are dozens if not hundreds of smaller |
| contexts in the system, we need to be able to fine-tune things a |
| little better. |
| |
| The creator of a context is able to specify an initial block size and |
| a maximum block size. Selecting smaller values can prevent wastage of |
| space in contexts that aren't expected to hold very much (an example |
| is the relcache's per-relation contexts). |
| |
| Also, it is possible to specify a minimum context size, in case for some |
| reason that should be different from the initial size for additional |
| blocks. An aset.c context will always contain at least one block, |
| of size minContextSize if that is specified, otherwise initBlockSize. |
| |
| We expect that per-tuple contexts will be reset frequently and typically |
| will not allocate very much space per tuple cycle. To make this usage |
| pattern cheap, the first block allocated in a context is not given |
| back to malloc() during reset, but just cleared. This avoids malloc |
| thrashing. |
| |
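The doubling policy with a caller-chosen initial and maximum block size
can be sketched as below. This is a simplified model, not aset.c's actual
logic: it ignores minContextSize, oversized requests and the retained first
block, and ToyBlockSizer, toy_sizer_init and toy_next_block_size are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyBlockSizer
{
    size_t      nextBlockSize;  /* size to use for the next block */
    size_t      maxBlockSize;   /* upper limit for the doubling */
} ToyBlockSizer;

void
toy_sizer_init(ToyBlockSizer *s, size_t initBlockSize, size_t maxBlockSize)
{
    assert(initBlockSize > 0 && initBlockSize <= maxBlockSize);
    s->nextBlockSize = initBlockSize;
    s->maxBlockSize = maxBlockSize;
}

/* Return the size for the next block, then double it up to the maximum. */
size_t
toy_next_block_size(ToyBlockSizer *s)
{
    size_t      size = s->nextBlockSize;

    /* compare against max/2 so the doubling itself cannot overflow */
    if (size > s->maxBlockSize / 2)
        s->nextBlockSize = s->maxBlockSize;
    else
        s->nextBlockSize = size * 2;
    return size;
}
```

With an initial size of 8192 and a maximum of 65536, successive calls
return 8192, 16384, 32768, 65536 and then stay at 65536.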
| |
| Alternative Memory Context Implementations |
| ------------------------------------------ |
| |
| aset.c is our default general-purpose implementation, working fine |
| in most situations. We also have two implementations optimized for |
| special use cases, providing either better performance or lower memory |
| usage compared to aset.c (or both). |
| |
| * slab.c (SlabContext) is designed for allocations of fixed-length |
| chunks, and does not allow allocations of chunks with different size. |
| |
| * generation.c (GenerationContext) is designed for cases when chunks |
| are allocated in groups with similar lifespan (generations), or |
| roughly in FIFO order. |
| |
| Both memory contexts aim to free memory back to the operating system |
| (unlike aset.c, which keeps the freed chunks in a freelist, and only |
| returns the memory when reset/deleted). |
| |
| These memory contexts were initially developed for ReorderBuffer, but |
| may be useful elsewhere as long as the allocation patterns match. |
| |
| |
| Memory Accounting |
| ----------------- |
| |
| One of the basic memory context operations is determining the amount of |
| memory used in the context (and its children). We have multiple places |
| that implement their own ad hoc memory accounting, and this is meant to |
| provide a unified approach. Ad hoc accounting solutions work for places |
| with tight control over the allocations or when it's easy to determine |
| sizes of allocated chunks (e.g. places that only work with tuples). |
| |
| The accounting built into the memory contexts works transparently for |
| all allocations as long as they end up in the right memory context |
| subtree. |
| |
| Consider for example aggregate functions - the aggregate state is often |
| represented by an arbitrary structure, allocated from the transition |
| function, so the ad hoc accounting is unlikely to work. The built-in |
| accounting will however handle such cases just fine. |
| |
| To minimize overhead, the accounting is done at the block level, not for |
| individual allocation chunks. |
| |
| The accounting is lazy - after a block is allocated (or freed), only the |
| context owning that block is updated. This means that when inquiring |
| about the memory usage in a given context, we have to walk all children |
| contexts recursively. As a result, the memory accounting is not intended |
| for cases with too many memory contexts (in the relevant subtree). |
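| |
The lazy, block-level accounting can be illustrated with a small
standalone model. The names here (ToyContext, toy_block_allocated,
toy_block_freed, toy_total_allocated) are hypothetical; the sketch shows
only the bookkeeping pattern, not the real data structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyContext
{
    size_t      mem_allocated;      /* bytes of blocks owned by this context */
    struct ToyContext *firstchild;
    struct ToyContext *nextchild;   /* next sibling under the same parent */
} ToyContext;

/* Block-level bookkeeping touches only the owning context. */
void
toy_block_allocated(ToyContext *cxt, size_t blksize)
{
    cxt->mem_allocated += blksize;
}

void
toy_block_freed(ToyContext *cxt, size_t blksize)
{
    assert(cxt->mem_allocated >= blksize);
    cxt->mem_allocated -= blksize;
}

/* The total for a subtree is computed on demand by visiting descendants. */
size_t
toy_total_allocated(const ToyContext *cxt)
{
    size_t      total = cxt->mem_allocated;
    const ToyContext *child;

    for (child = cxt->firstchild; child != NULL; child = child->nextchild)
        total += toy_total_allocated(child);
    return total;
}
```

Updates cost a single addition or subtraction, while each inquiry costs
time proportional to the number of contexts in the subtree, which is why
very large subtrees are a poor fit.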