Overview
========

PostgreSQL provides some simple facilities to make writing parallel algorithms
easier. Using a data structure called a ParallelContext, you can arrange to
launch background worker processes, initialize their state to match that of
the backend which initiated parallelism, communicate with them via dynamic
shared memory, and write reasonably complex code that can run either in the
user backend or in one of the parallel workers without needing to be aware of
where it's running.

The backend which starts a parallel operation (hereafter, the initiating
backend) starts by creating a dynamic shared memory segment which will last
for the lifetime of the parallel operation. This dynamic shared memory segment
will contain (1) a shm_mq that can be used to transport errors (and other
messages reported via elog/ereport) from the worker back to the initiating
backend; (2) serialized representations of the initiating backend's private
state, so that the worker can synchronize its state with that of the
initiating backend; and (3) any other data structures which a particular user
of the ParallelContext data structure may wish to add for its own purposes.
Once the initiating backend has initialized the dynamic shared memory segment,
it asks the postmaster to launch the appropriate number of parallel workers.
These workers then connect to the dynamic shared memory segment, initialize
their state, and then invoke the appropriate entrypoint, as further detailed
below.

Error Reporting
===============

When started, each parallel worker begins by attaching the dynamic shared
memory segment and locating the shm_mq to be used for error reporting; it
redirects all of its protocol messages to this shm_mq. Prior to this point,
any failure of the background worker will not be reported to the initiating
backend; from the point of view of the initiating backend, the worker simply
failed to start. The initiating backend must in any case be prepared to cope
with fewer parallel workers than it originally requested, so catering to
this case imposes no additional burden.
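
For example, here is a hedged sketch of how a caller might cope with such a
shortfall; pcxt->nworkers and pcxt->nworkers_launched are the real
ParallelContext fields, but the surrounding logic is illustrative only:

	LaunchParallelWorkers(pcxt);

	/*
	 * Not every requested worker necessarily started;
	 * nworkers_launched says how many actually did. The code must
	 * also cope with zero, in which case the leader simply performs
	 * all of the work itself.
	 */
	if (pcxt->nworkers_launched < pcxt->nworkers)
		elog(DEBUG1, "launched %d of %d requested workers",
			 pcxt->nworkers_launched, pcxt->nworkers);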

Whenever a new message (or partial message; very large messages may wrap) is
sent to the error-reporting queue, PROCSIG_PARALLEL_MESSAGE is sent to the
initiating backend. This causes the next CHECK_FOR_INTERRUPTS() in the
initiating backend to read and rethrow the message. For the most part, this
makes error reporting in parallel mode "just work". Of course, to work
properly, it is important that the code the initiating backend is executing
call CHECK_FOR_INTERRUPTS() regularly and avoid blocking interrupt processing
for long periods of time, but those are good things to do anyway.
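
As a minimal illustration, a long-running loop in the leader should look
something like this; do_one_unit_of_work() is a hypothetical stand-in for
whatever the loop actually does:

	for (;;)
	{
		/* Lets any pending worker error be read and rethrown. */
		CHECK_FOR_INTERRUPTS();

		if (!do_one_unit_of_work())
			break;
	}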

(A currently-unsolved problem is that some messages may get written to the
system log twice, once in the backend where the report was originally
generated, and again when the initiating backend rethrows the message. If
we decide to suppress one of these reports, it should probably be the second
one; otherwise, if the worker is for some reason unable to propagate the
message back to the initiating backend, the message will be lost altogether.)

State Sharing
=============

It's possible to write C code which works correctly without parallelism, but
which fails when parallelism is used. No parallel infrastructure can
completely eliminate this problem, because any global variable is a risk.
There's no general mechanism for ensuring that every global variable in the
worker will have the same value that it does in the initiating backend; even
if we could ensure that, some function we're calling could update the variable
after each call, and only the backend where that update is performed will see
the new value. Similar problems can arise with any more-complex data
structure we might choose to use. For example, a pseudo-random number
generator should, given a particular seed value, produce the same predictable
series of values every time. But it does this by relying on some private
state which won't automatically be shared between cooperating backends. A
parallel-safe PRNG would need to store its state in dynamic shared memory, and
would require locking. The parallelism infrastructure has no way of knowing
whether the user intends to call code that has this sort of problem, and can't
do anything about it anyway.
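
As an entirely hypothetical sketch of the hazard, consider a counter kept in
a static variable:

	static uint64 rows_seen = 0;	/* process-local, not shared */

	void
	count_row(void)
	{
		/*
		 * Each backend increments its own copy. Under parallelism,
		 * the leader and every worker see only the rows they
		 * themselves processed, so anything that reads rows_seen
		 * afterward gets a partial answer.
		 */
		rows_seen++;
	}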

Instead, we take a more pragmatic approach. First, we try to make as many of
the operations that are safe outside of parallel mode work correctly in
parallel mode as well. Second, we try to prohibit common unsafe operations
via suitable error checks. These checks are intended to catch 100% of
unsafe things that a user might do from the SQL interface, but code written
in C can do unsafe things that won't trigger these checks. The error checks
are engaged via EnterParallelMode(), which should be called before creating
a parallel context, and disarmed via ExitParallelMode(), which should be
called after all parallel contexts have been destroyed. The most
significant restriction imposed by parallel mode is that all operations must
be strictly read-only; we allow no writes to the database and no DDL. We
might try to relax these restrictions in the future.
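
These guards generally take the shape sketched below; IsInParallelMode() and
the errcode are real, but the message text here is only illustrative:

	if (IsInParallelMode())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot perform this operation during a parallel operation")));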

To make as many operations as possible safe in parallel mode, we try to copy
the most important pieces of state from the initiating backend to each parallel
worker. This includes the following (a sketch of the serialization pattern
appears after the list):

  - The set of libraries dynamically loaded by dfmgr.c.

  - The authenticated user ID and current database. Each parallel worker
    will connect to the same database as the initiating backend, using the
    same user ID.

  - The values of all GUCs. Accordingly, permanent changes to the value of
    any GUC are forbidden while in parallel mode; but temporary changes,
    such as entering a function with non-NULL proconfig, are OK.

  - The current subtransaction's XID, the top-level transaction's XID, and
    the list of XIDs considered current (that is, they are in-progress or
    subcommitted). This information is needed to ensure that tuple visibility
    checks return the same results in the worker as they do in the
    initiating backend. See also the section Transaction Integration, below.

  - The combo CID mappings. This is needed to ensure consistent answers to
    tuple visibility checks. The need to synchronize this data structure is
    a major reason why we can't support writes in parallel mode: such writes
    might create new combo CIDs, and we have no way to let other workers
    (or the initiating backend) know about them.

  - The transaction snapshot.

  - The active snapshot, which might be different from the transaction
    snapshot.

  - The currently active user ID and security context. Note that this is
    the fourth user ID we restore: the initial step of binding to the correct
    database also involves restoring the authenticated user ID. When GUC
    values are restored, this incidentally sets SessionUserId and OuterUserId
    to the correct values. This final step restores CurrentUserId.

  - State related to pending REINDEX operations, which prevents access to
    an index that is currently being rebuilt.

  - Active relmapper.c mapping state. This is needed to allow consistent
    answers when fetching the current relfilenumber for relation oids of
    mapped relations.
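
Each of the pieces of state above is copied using the same three-step
pattern: estimate space, serialize into the DSM segment, restore in the
worker. parallel.c already does all of this itself for the built-in state,
but as a hedged sketch of the shape of that pattern, here is roughly how the
GUC state round-trips (the entry points are the real ones from guc.c; the
MY_KEY_GUC toc key is purely illustrative):

	/* Leader, while sizing the segment: */
	gucsize = EstimateGUCStateSpace();
	shm_toc_estimate_chunk(&pcxt->estimator, gucsize);
	shm_toc_estimate_keys(&pcxt->estimator, 1);

	/* Leader, after the segment exists: */
	gucspace = shm_toc_allocate(pcxt->toc, gucsize);
	SerializeGUCState(gucsize, gucspace);
	shm_toc_insert(pcxt->toc, MY_KEY_GUC, gucspace);

	/* Worker, during startup: */
	gucspace = shm_toc_lookup(toc, MY_KEY_GUC, false);
	RestoreGUCState(gucspace);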

To prevent unprincipled deadlocks when running in parallel mode, this code
also arranges for the leader and all workers to participate in group
locking. See src/backend/storage/lmgr/README for more details.

Transaction Integration
=======================

Regardless of what the TransactionState stack looks like in the parallel
leader, each parallel worker ends up with a stack of depth 1. This stack
entry is marked with the special transaction block state
TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
toplevel transaction. The XID of this TransactionState is set to the XID of
the innermost currently-active subtransaction in the initiating backend. The
initiating backend's toplevel XID, and all of the XIDs considered current
(in-progress or subcommitted), are stored separately from the TransactionState
stack, but in such a way that GetTopTransactionId(), GetTopTransactionIdIfAny(),
and TransactionIdIsCurrentTransactionId() return the same values that they
would in the initiating backend. We could copy the entire transaction state
stack, but most of it would be useless: for example, you can't roll back to a
savepoint from within a parallel worker, and there are no resources
associated with the memory contexts or resource owners of intermediate
subtransactions.
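
A hedged illustration of that invariant: the function below is hypothetical,
but the accessors are the real ones from xact.c, and inside a worker they
answer exactly as they would in the leader.

	/* Hypothetical sanity check run inside a parallel worker. */
	static void
	check_leader_xid_view(TransactionId leader_top_xid,
						  TransactionId leader_subxid)
	{
		/* Reports the leader's toplevel XID, not a worker-local one. */
		Assert(GetTopTransactionIdIfAny() == leader_top_xid);

		/* XIDs that are current in the leader are current here, too. */
		Assert(TransactionIdIsCurrentTransactionId(leader_subxid));
	}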

No meaningful change to the transaction state can be made while in parallel
mode. No XIDs can be assigned, and no subtransactions can start or end,
because we have no way of communicating these state changes to cooperating
backends, or of synchronizing them. It's clearly unworkable for the initiating
backend to exit any transaction or subtransaction that was in progress when
parallelism was started before all parallel workers have exited; and it's even
more clearly crazy for a parallel worker to try to subcommit or subabort the
current subtransaction and execute in some other transaction context than was
present in the initiating backend. It might be practical to allow internal
subtransactions (e.g. to implement a PL/pgSQL EXCEPTION block) to be used in
parallel mode, provided that they are XID-less, because other backends
wouldn't really need to know about those transactions or do anything
differently because of them. Right now, we don't even allow that.
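
For instance, xact.c refuses to hand out new XIDs under these conditions with
a guard of roughly this shape (a paraphrase, not the verbatim source):

	if (IsInParallelMode() || IsParallelWorker())
		elog(ERROR, "cannot assign TransactionIds during a parallel operation");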

At the end of a parallel operation, which can happen either because it
completed successfully or because it was interrupted by an error, parallel
workers associated with that operation exit. In the error case, transaction
abort processing in the parallel leader kills off any remaining workers, and
the parallel leader then waits for them to die. In the case of a successful
parallel operation, the parallel leader does not send any signals, but must
wait for workers to complete and exit of their own volition. In either
case, it is very important that all workers actually exit before the
parallel leader cleans up the (sub)transaction in which they were created;
otherwise, chaos can ensue. For example, if the leader is rolling back the
transaction that created the relation being scanned by a worker, the
relation could disappear while the worker is still busy scanning it. That's
not safe.

Generally, the cleanup performed by each worker at this point is similar to
top-level commit or abort. Each backend has its own resource owners: buffer
pins, catcache or relcache reference counts, tuple descriptors, and so on
are managed separately by each backend, which must free them before exiting.
There are, however, some important differences between parallel worker
commit or abort and a real top-level transaction commit or abort. Most
importantly:

  - No commit or abort record is written; the initiating backend is
    responsible for this.

  - Cleanup of pg_temp namespaces is not done. Parallel workers cannot
    safely access the initiating backend's pg_temp namespace, and should
    not create one of their own.

Coding Conventions
==================

Before beginning any parallel operation, call EnterParallelMode(); after all
parallel operations are completed, call ExitParallelMode(). To actually
parallelize a particular operation, use a ParallelContext. The basic coding
pattern looks like this:

	EnterParallelMode();		/* prohibit unsafe state changes */

	pcxt = CreateParallelContext("library_name", "function_name", nworkers);

	/* Allow space for application-specific data here. */
	shm_toc_estimate_chunk(&pcxt->estimator, size);
	shm_toc_estimate_keys(&pcxt->estimator, keys);

	InitializeParallelDSM(pcxt);	/* create DSM and copy state to it */

	/* Store the data for which we reserved space. */
	space = shm_toc_allocate(pcxt->toc, size);
	shm_toc_insert(pcxt->toc, key, space);

	LaunchParallelWorkers(pcxt);

	/* do parallel stuff */

	WaitForParallelWorkersToFinish(pcxt);

	/* read any final results from dynamic shared memory */

	DestroyParallelContext(pcxt);

	ExitParallelMode();
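
The worker side is the mirror image. Entry points have the signature that
parallel.c invokes them with; the sketch below is illustrative, with
my_worker_main and the reuse of "key" standing in for whatever the caller
actually defines:

	void
	my_worker_main(dsm_segment *seg, shm_toc *toc)
	{
		char	   *space;

		/*
		 * By the time control arrives here, parallel.c has already
		 * attached the DSM segment, restored the leader's state, and
		 * joined the transaction; only application-specific setup
		 * remains.
		 */
		space = shm_toc_lookup(toc, key, false);

		/* do parallel stuff, reading and writing the shared space */
	}

parallel.c finds this function using the library name and function name that
were given to CreateParallelContext().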

If desired, after WaitForParallelWorkersToFinish() has been called, the
context can be reset so that workers can be launched anew using the same
parallel context. To do this, first call ReinitializeParallelDSM() to
reinitialize state managed by the parallel context machinery itself; then,
perform any other necessary resetting of state; after that, you can again
call LaunchParallelWorkers().
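
A hedged sketch of that relaunch loop; nbatches and reset_my_shared_state()
are hypothetical placeholders for caller-specific logic:

	for (int i = 0; i < nbatches; i++)
	{
		if (i > 0)
		{
			/* Reset the state parallel.c manages... */
			ReinitializeParallelDSM(pcxt);
			/* ...and whatever application state needs it. */
			reset_my_shared_state(pcxt->toc);
		}

		LaunchParallelWorkers(pcxt);

		/* do parallel stuff */

		WaitForParallelWorkersToFinish(pcxt);
	}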