| <!-- doc/src/sgml/parallel.sgml --> |
| |
| <chapter id="parallel-query"> |
| <title>Parallel Query</title> |
| |
| <indexterm zone="parallel-query"> |
| <primary>parallel query</primary> |
| </indexterm> |
| |
| <para> |
| <productname>PostgreSQL</productname> can devise query plans that can leverage |
| multiple CPUs in order to answer queries faster. This feature is known |
| as parallel query. Many queries cannot benefit from parallel query, either |
| due to limitations of the current implementation or because there is no |
| imaginable query plan that is any faster than the serial query plan. |
| However, for queries that can benefit, the speedup from parallel query |
| is often very significant. Many queries can run more than twice as fast |
| when using parallel query, and some queries can run four times faster or |
| even more. Queries that touch a large amount of data but return only a |
| few rows to the user will typically benefit most. This chapter explains |
| some details of how parallel query works and in which situations it can be |
| used so that users who wish to make use of it can understand what to expect. |
| </para> |
| |
| <sect1 id="how-parallel-query-works"> |
| <title>How Parallel Query Works</title> |
| |
| <para> |
| When the optimizer determines that parallel query is the fastest execution |
| strategy for a particular query, it will create a query plan that includes |
| a <firstterm>Gather</firstterm> or <firstterm>Gather Merge</firstterm> |
| node. Here is a simple example: |
| |
| <screen> |
| EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%'; |
| QUERY PLAN |
| -------------------------------------------------------------------&zwsp;------------------ |
| Gather (cost=1000.00..217018.43 rows=1 width=97) |
| Workers Planned: 2 |
| -> Parallel Seq Scan on pgbench_accounts (cost=0.00..216018.33 rows=1 width=97) |
| Filter: (filler ~~ '%x%'::text) |
| (4 rows) |
| </screen> |
| </para> |
| |
| <para> |
| In all cases, the <literal>Gather</literal> or |
| <literal>Gather Merge</literal> node will have exactly one |
| child plan, which is the portion of the plan that will be executed in |
| parallel. If the <literal>Gather</literal> or <literal>Gather Merge</literal> node is |
| at the very top of the plan tree, then the entire query will execute in |
| parallel. If it is somewhere else in the plan tree, then only the portion |
| of the plan below it will run in parallel. In the example above, the |
| query accesses only one table, so there is only one plan node other than |
| the <literal>Gather</literal> node itself; since that plan node is a child of the |
| <literal>Gather</literal> node, it will run in parallel. |
| </para> |
| |
| <para> |
| <link linkend="using-explain">Using EXPLAIN</link>, you can see the number of |
| workers chosen by the planner. When the <literal>Gather</literal> node is reached |
| during query execution, the process that is implementing the user's |
| session will request a number of <link linkend="bgworker">background |
| worker processes</link> equal to the number |
| of workers chosen by the planner. The number of background workers that |
| the planner will consider using is limited to at most |
| <xref linkend="guc-max-parallel-workers-per-gather"/>. The total number |
| of background workers that can exist at any one time is limited by both |
| <xref linkend="guc-max-worker-processes"/> and |
| <xref linkend="guc-max-parallel-workers"/>. Therefore, it is possible for a |
| parallel query to run with fewer workers than planned, or even with |
| no workers at all. The optimal plan may depend on the number of workers |
| that are available, so this can result in poor query performance. If this |
| occurrence is frequent, consider increasing |
| <varname>max_worker_processes</varname> and <varname>max_parallel_workers</varname> |
| so that more workers can be run simultaneously or alternatively reducing |
| <varname>max_parallel_workers_per_gather</varname> so that the planner |
| requests fewer workers. |
| </para> |
| |
| <para> |
| Every background worker process that is successfully started for a given |
| parallel query will execute the parallel portion of the plan. The leader |
| will also execute that portion of the plan, but it has an additional |
| responsibility: it must also read all of the tuples generated by the |
| workers. When the parallel portion of the plan generates only a small |
| number of tuples, the leader will often behave very much like an additional |
| worker, speeding up query execution. Conversely, when the parallel portion |
| of the plan generates a large number of tuples, the leader may be almost |
| entirely occupied with reading the tuples generated by the workers and |
| performing any further processing steps that are required by plan nodes |
| above the level of the <literal>Gather</literal> node or |
| <literal>Gather Merge</literal> node. In such cases, the leader will |
| do very little of the work of executing the parallel portion of the plan. |
| </para> |
| |
| <para> |
| When the node at the top of the parallel portion of the plan is |
| <literal>Gather Merge</literal> rather than <literal>Gather</literal>, it indicates that |
| each process executing the parallel portion of the plan is producing |
| tuples in sorted order, and that the leader is performing an |
| order-preserving merge. In contrast, <literal>Gather</literal> reads tuples |
| from the workers in whatever order is convenient, destroying any sort |
| order that may have existed. |
| </para> |
| </sect1> |
| |
| <sect1 id="when-can-parallel-query-be-used"> |
| <title>When Can Parallel Query Be Used?</title> |
| |
| <para> |
| There are several settings that can cause the query planner not to |
| generate a parallel query plan under any circumstances. In order for |
| any parallel query plans whatsoever to be generated, the following |
| settings must be configured as indicated. |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <xref linkend="guc-max-parallel-workers-per-gather"/> must be set to a |
| value that is greater than zero. This is a special case of the more |
| general principle that no more workers should be used than the number |
| configured via <varname>max_parallel_workers_per_gather</varname>. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| In addition, the system must not be running in single-user mode. Since |
| the entire database system is running in single process in this situation, |
| no background workers will be available. |
| </para> |
| |
| <para> |
| Even when it is in general possible for parallel query plans to be |
| generated, the planner will not generate them for a given query |
| if any of the following are true: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| The query writes any data or locks any database rows. If a query |
| contains a data-modifying operation either at the top level or within |
| a CTE, no parallel plans for that query will be generated. As an |
| exception, the following commands, which create a new table and populate |
| it, can use a parallel plan for the underlying <literal>SELECT</literal> |
| part of the query: |
| |
| <itemizedlist> |
| <listitem> |
| <para><command>CREATE TABLE ... AS</command></para> |
| </listitem> |
| <listitem> |
| <para><command>SELECT INTO</command></para> |
| </listitem> |
| <listitem> |
| <para><command>CREATE MATERIALIZED VIEW</command></para> |
| </listitem> |
| <listitem> |
| <para><command>REFRESH MATERIALIZED VIEW</command></para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The query might be suspended during execution. In any situation in |
| which the system thinks that partial or incremental execution might |
| occur, no parallel plan is generated. For example, a cursor created |
| using <link linkend="sql-declare">DECLARE CURSOR</link> will never use |
| a parallel plan. Similarly, a PL/pgSQL loop of the form |
| <literal>FOR x IN query LOOP .. END LOOP</literal> will never use a |
| parallel plan, because the parallel query system is unable to verify |
| that the code in the loop is safe to execute while parallel query is |
| active. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The query uses any function marked <literal>PARALLEL UNSAFE</literal>. |
| Most system-defined functions are <literal>PARALLEL SAFE</literal>, |
| but user-defined functions are marked <literal>PARALLEL |
| UNSAFE</literal> by default. See the discussion of |
| <xref linkend="parallel-safety"/>. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The query is running inside of another query that is already parallel. |
| For example, if a function called by a parallel query issues an SQL |
| query itself, that query will never use a parallel plan. This is a |
| limitation of the current implementation, but it may not be desirable |
| to remove this limitation, since it could result in a single query |
| using a very large number of processes. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| Even when parallel query plan is generated for a particular query, there |
| are several circumstances under which it will be impossible to execute |
| that plan in parallel at execution time. If this occurs, the leader |
| will execute the portion of the plan below the <literal>Gather</literal> |
| node entirely by itself, almost as if the <literal>Gather</literal> node were |
| not present. This will happen if any of the following conditions are met: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| No background workers can be obtained because of the limitation that |
| the total number of background workers cannot exceed |
| <xref linkend="guc-max-worker-processes"/>. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| No background workers can be obtained because of the limitation that |
| the total number of background workers launched for purposes of |
| parallel query cannot exceed <xref linkend="guc-max-parallel-workers"/>. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The client sends an Execute message with a non-zero fetch count. |
| See the discussion of the |
| <link linkend="protocol-flow-ext-query">extended query protocol</link>. |
| Since <link linkend="libpq">libpq</link> currently provides no way to |
| send such a message, this can only occur when using a client that |
| does not rely on libpq. If this is a frequent |
| occurrence, it may be a good idea to set |
| <xref linkend="guc-max-parallel-workers-per-gather"/> to zero in |
| sessions where it is likely, so as to avoid generating query plans |
| that may be suboptimal when run serially. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </sect1> |
| |
| <sect1 id="parallel-plans"> |
| <title>Parallel Plans</title> |
| |
| <para> |
| Because each worker executes the parallel portion of the plan to |
| completion, it is not possible to simply take an ordinary query plan |
| and run it using multiple workers. Each worker would produce a full |
| copy of the output result set, so the query would not run any faster |
| than normal but would produce incorrect results. Instead, the parallel |
| portion of the plan must be what is known internally to the query |
| optimizer as a <firstterm>partial plan</firstterm>; that is, it must be constructed |
| so that each process that executes the plan will generate only a |
| subset of the output rows in such a way that each required output row |
| is guaranteed to be generated by exactly one of the cooperating processes. |
| Generally, this means that the scan on the driving table of the query |
| must be a parallel-aware scan. |
| </para> |
| |
| <sect2 id="parallel-scans"> |
| <title>Parallel Scans</title> |
| |
| <para> |
| The following types of parallel-aware table scans are currently supported. |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| In a <emphasis>parallel sequential scan</emphasis>, the table's blocks will |
| be divided among the cooperating processes. Blocks are handed out one |
| at a time, so that access to the table remains sequential. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| In a <emphasis>parallel bitmap heap scan</emphasis>, one process is chosen |
| as the leader. That process performs a scan of one or more indexes |
| and builds a bitmap indicating which table blocks need to be visited. |
| These blocks are then divided among the cooperating processes as in |
| a parallel sequential scan. In other words, the heap scan is performed |
| in parallel, but the underlying index scan is not. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| In a <emphasis>parallel index scan</emphasis> or <emphasis>parallel index-only |
| scan</emphasis>, the cooperating processes take turns reading data from the |
| index. Currently, parallel index scans are supported only for |
| btree indexes. Each process will claim a single index block and will |
| scan and return all tuples referenced by that block; other processes can |
| at the same time be returning tuples from a different index block. |
| The results of a parallel btree scan are returned in sorted order |
| within each worker process. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| Other scan types, such as scans of non-btree indexes, may support |
| parallel scans in the future. |
| </para> |
| </sect2> |
| |
| <sect2 id="parallel-joins"> |
| <title>Parallel Joins</title> |
| |
| <para> |
| Just as in a non-parallel plan, the driving table may be joined to one or |
| more other tables using a nested loop, hash join, or merge join. The |
| inner side of the join may be any kind of non-parallel plan that is |
| otherwise supported by the planner provided that it is safe to run within |
| a parallel worker. Depending on the join type, the inner side may also be |
| a parallel plan. |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| In a <emphasis>nested loop join</emphasis>, the inner side is always |
| non-parallel. Although it is executed in full, this is efficient if |
| the inner side is an index scan, because the outer tuples and thus |
| the loops that look up values in the index are divided over the |
| cooperating processes. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| In a <emphasis>merge join</emphasis>, the inner side is always |
| a non-parallel plan and therefore executed in full. This may be |
| inefficient, especially if a sort must be performed, because the work |
| and resulting data are duplicated in every cooperating process. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| In a <emphasis>hash join</emphasis> (without the "parallel" prefix), |
| the inner side is executed in full by every cooperating process |
| to build identical copies of the hash table. This may be inefficient |
| if the hash table is large or the plan is expensive. In a |
| <emphasis>parallel hash join</emphasis>, the inner side is a |
| <emphasis>parallel hash</emphasis> that divides the work of building |
| a shared hash table over the cooperating processes. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </sect2> |
| |
| <sect2 id="parallel-aggregation"> |
| <title>Parallel Aggregation</title> |
| <para> |
| <productname>PostgreSQL</productname> supports parallel aggregation by aggregating in |
| two stages. First, each process participating in the parallel portion of |
| the query performs an aggregation step, producing a partial result for |
| each group of which that process is aware. This is reflected in the plan |
| as a <literal>Partial Aggregate</literal> node. Second, the partial results are |
| transferred to the leader via <literal>Gather</literal> or <literal>Gather |
| Merge</literal>. Finally, the leader re-aggregates the results across all |
| workers in order to produce the final result. This is reflected in the |
| plan as a <literal>Finalize Aggregate</literal> node. |
| </para> |
| |
| <para> |
| Because the <literal>Finalize Aggregate</literal> node runs on the leader |
| process, queries that produce a relatively large number of groups in |
| comparison to the number of input rows will appear less favorable to the |
| query planner. For example, in the worst-case scenario the number of |
| groups seen by the <literal>Finalize Aggregate</literal> node could be as many as |
| the number of input rows that were seen by all worker processes in the |
| <literal>Partial Aggregate</literal> stage. For such cases, there is clearly |
| going to be no performance benefit to using parallel aggregation. The |
| query planner takes this into account during the planning process and is |
| unlikely to choose parallel aggregate in this scenario. |
| </para> |
| |
| <para> |
| Parallel aggregation is not supported in all situations. Each aggregate |
| must be <link linkend="parallel-safety">safe</link> for parallelism and must |
| have a combine function. If the aggregate has a transition state of type |
| <literal>internal</literal>, it must have serialization and deserialization |
| functions. See <xref linkend="sql-createaggregate"/> for more details. |
| Parallel aggregation is not supported if any aggregate function call |
| contains <literal>DISTINCT</literal> or <literal>ORDER BY</literal> clause and is also |
| not supported for ordered set aggregates or when the query involves |
| <literal>GROUPING SETS</literal>. It can only be used when all joins involved in |
| the query are also part of the parallel portion of the plan. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="parallel-append"> |
| <title>Parallel Append</title> |
| |
| <para> |
| Whenever <productname>PostgreSQL</productname> needs to combine rows |
| from multiple sources into a single result set, it uses an |
| <literal>Append</literal> or <literal>MergeAppend</literal> plan node. |
| This commonly happens when implementing <literal>UNION ALL</literal> or |
| when scanning a partitioned table. Such nodes can be used in parallel |
| plans just as they can in any other plan. However, in a parallel plan, |
| the planner may instead use a <literal>Parallel Append</literal> node. |
| </para> |
| |
| <para> |
| When an <literal>Append</literal> node is used in a parallel plan, each |
| process will execute the child plans in the order in which they appear, |
| so that all participating processes cooperate to execute the first child |
| plan until it is complete and then move to the second plan at around the |
| same time. When a <literal>Parallel Append</literal> is used instead, the |
| executor will instead spread out the participating processes as evenly as |
| possible across its child plans, so that multiple child plans are executed |
| simultaneously. This avoids contention, and also avoids paying the startup |
| cost of a child plan in those processes that never execute it. |
| </para> |
| |
| <para> |
| Also, unlike a regular <literal>Append</literal> node, which can only have |
| partial children when used within a parallel plan, a <literal>Parallel |
| Append</literal> node can have both partial and non-partial child plans. |
| Non-partial children will be scanned by only a single process, since |
| scanning them more than once would produce duplicate results. Plans that |
| involve appending multiple results sets can therefore achieve |
| coarse-grained parallelism even when efficient partial plans are not |
| available. For example, consider a query against a partitioned table |
| that can only be implemented efficiently by using an index that does |
| not support parallel scans. The planner might choose a <literal>Parallel |
| Append</literal> of regular <literal>Index Scan</literal> plans; each |
| individual index scan would have to be executed to completion by a single |
| process, but different scans could be performed at the same time by |
| different processes. |
| </para> |
| |
| <para> |
| <xref linkend="guc-enable-parallel-append" /> can be used to disable |
| this feature. |
| </para> |
| </sect2> |
| |
| <sect2 id="parallel-plan-tips"> |
| <title>Parallel Plan Tips</title> |
| |
| <para> |
| If a query that is expected to do so does not produce a parallel plan, |
| you can try reducing <xref linkend="guc-parallel-setup-cost"/> or |
| <xref linkend="guc-parallel-tuple-cost"/>. Of course, this plan may turn |
| out to be slower than the serial plan that the planner preferred, but |
| this will not always be the case. If you don't get a parallel |
| plan even with very small values of these settings (e.g., after setting |
| them both to zero), there may be some reason why the query planner is |
| unable to generate a parallel plan for your query. See |
| <xref linkend="when-can-parallel-query-be-used"/> and |
| <xref linkend="parallel-safety"/> for information on why this may be |
| the case. |
| </para> |
| |
| <para> |
| When executing a parallel plan, you can use <literal>EXPLAIN (ANALYZE, |
| VERBOSE)</literal> to display per-worker statistics for each plan node. |
| This may be useful in determining whether the work is being evenly |
| distributed between all plan nodes and more generally in understanding the |
| performance characteristics of the plan. |
| </para> |
| |
| </sect2> |
| </sect1> |
| |
| <sect1 id="parallel-safety"> |
| <title>Parallel Safety</title> |
| |
| <para> |
| The planner classifies operations involved in a query as either |
| <firstterm>parallel safe</firstterm>, <firstterm>parallel restricted</firstterm>, |
| or <firstterm>parallel unsafe</firstterm>. A parallel safe operation is one that |
| does not conflict with the use of parallel query. A parallel restricted |
| operation is one that cannot be performed in a parallel worker, but that |
| can be performed in the leader while parallel query is in use. Therefore, |
| parallel restricted operations can never occur below a <literal>Gather</literal> |
| or <literal>Gather Merge</literal> node, but can occur elsewhere in a plan that |
| contains such a node. A parallel unsafe operation is one that cannot |
| be performed while parallel query is in use, not even in the leader. |
| When a query contains anything that is parallel unsafe, parallel query |
| is completely disabled for that query. |
| </para> |
| |
| <para> |
| The following operations are always parallel restricted: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Scans of common table expressions (CTEs). |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Scans of temporary tables. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Scans of foreign tables, unless the foreign data wrapper has |
| an <literal>IsForeignScanParallelSafe</literal> API that indicates otherwise. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Plan nodes to which an <literal>InitPlan</literal> is attached. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Plan nodes that reference a correlated <literal>SubPlan</literal>. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <sect2 id="parallel-labeling"> |
| <title>Parallel Labeling for Functions and Aggregates</title> |
| |
| <para> |
| The planner cannot automatically determine whether a user-defined |
| function or aggregate is parallel safe, parallel restricted, or parallel |
| unsafe, because this would require predicting every operation that the |
| function could possibly perform. In general, this is equivalent to the |
| Halting Problem and therefore impossible. Even for simple functions |
| where it could conceivably be done, we do not try, since this would be expensive |
| and error-prone. Instead, all user-defined functions are assumed to |
| be parallel unsafe unless otherwise marked. When using |
| <xref linkend="sql-createfunction"/> or |
| <xref linkend="sql-alterfunction"/>, markings can be set by specifying |
| <literal>PARALLEL SAFE</literal>, <literal>PARALLEL RESTRICTED</literal>, or |
| <literal>PARALLEL UNSAFE</literal> as appropriate. When using |
| <xref linkend="sql-createaggregate"/>, the |
| <literal>PARALLEL</literal> option can be specified with <literal>SAFE</literal>, |
| <literal>RESTRICTED</literal>, or <literal>UNSAFE</literal> as the corresponding value. |
| </para> |
| |
| <para> |
| Functions and aggregates must be marked <literal>PARALLEL UNSAFE</literal> if |
| they write to the database, access sequences, change the transaction state |
| even temporarily (e.g., a PL/pgSQL function that establishes an |
| <literal>EXCEPTION</literal> block to catch errors), or make persistent changes to |
| settings. Similarly, functions must be marked <literal>PARALLEL |
| RESTRICTED</literal> if they access temporary tables, client connection state, |
| cursors, prepared statements, or miscellaneous backend-local state that |
| the system cannot synchronize across workers. For example, |
| <literal>setseed</literal> and <literal>random</literal> are parallel restricted for |
| this last reason. |
| </para> |
| |
| <para> |
| In general, if a function is labeled as being safe when it is restricted or |
| unsafe, or if it is labeled as being restricted when it is in fact unsafe, |
| it may throw errors or produce wrong answers when used in a parallel query. |
| C-language functions could in theory exhibit totally undefined behavior if |
| mislabeled, since there is no way for the system to protect itself against |
| arbitrary C code, but in most likely cases the result will be no worse than |
| for any other function. If in doubt, it is probably best to label functions |
| as <literal>UNSAFE</literal>. |
| </para> |
| |
| <para> |
| If a function executed within a parallel worker acquires locks that are |
| not held by the leader, for example by querying a table not referenced in |
| the query, those locks will be released at worker exit, not end of |
| transaction. If you write a function that does this, and this behavior |
| difference is important to you, mark such functions as |
| <literal>PARALLEL RESTRICTED</literal> |
| to ensure that they execute only in the leader. |
| </para> |
| |
| <para> |
| Note that the query planner does not consider deferring the evaluation of |
| parallel-restricted functions or aggregates involved in the query in |
| order to obtain a superior plan. So, for example, if a <literal>WHERE</literal> |
| clause applied to a particular table is parallel restricted, the query |
| planner will not consider performing a scan of that table in the parallel |
| portion of a plan. In some cases, it would be |
| possible (and perhaps even efficient) to include the scan of that table in |
| the parallel portion of the query and defer the evaluation of the |
| <literal>WHERE</literal> clause so that it happens above the <literal>Gather</literal> |
| node. However, the planner does not do this. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |