| <!-- doc/src/sgml/high-availability.sgml --> |
| |
| <chapter id="high-availability"> |
| <title>High Availability, Load Balancing, and Replication</title> |
| |
| <indexterm><primary>high availability</primary></indexterm> |
| <indexterm><primary>failover</primary></indexterm> |
| <indexterm><primary>replication</primary></indexterm> |
| <indexterm><primary>load balancing</primary></indexterm> |
| <indexterm><primary>clustering</primary></indexterm> |
| <indexterm><primary>data partitioning</primary></indexterm> |
| |
| <para> |
| Database servers can work together to allow a second server to |
| take over quickly if the primary server fails (high |
| availability), or to allow several computers to serve the same |
| data (load balancing). Ideally, database servers could work |
| together seamlessly. Web servers serving static web pages can |
| be combined quite easily by merely load-balancing web requests |
| to multiple machines. In fact, read-only database servers can |
| be combined relatively easily too. Unfortunately, most database |
| servers have a read/write mix of requests, and read/write servers |
| are much harder to combine. This is because though read-only |
| data needs to be placed on each server only once, a write to any |
| server has to be propagated to all servers so that future read |
| requests to those servers return consistent results. |
| </para> |
| |
| <para> |
| This synchronization problem is the fundamental difficulty for |
| servers working together. Because there is no single solution |
| that eliminates the impact of the sync problem for all use cases, |
| there are multiple solutions. Each solution addresses this |
| problem in a different way, and minimizes its impact for a specific |
| workload. |
| </para> |
| |
| <para> |
| Some solutions deal with synchronization by allowing only one |
| server to modify the data. Servers that can modify data are |
| called read/write, <firstterm>master</firstterm> or <firstterm>primary</firstterm> servers. |
| Servers that track changes in the primary are called <firstterm>standby</firstterm> |
| or <firstterm>secondary</firstterm> servers. A standby server that cannot be connected |
| to until it is promoted to a primary server is called a <firstterm>warm |
| standby</firstterm> server, and one that can accept connections and serves read-only |
| queries is called a <firstterm>hot standby</firstterm> server. |
| </para> |
| |
| <para> |
| Some solutions are synchronous, |
| meaning that a data-modifying transaction is not considered |
| committed until all servers have committed the transaction. This |
| guarantees that a failover will not lose any data and that all |
| load-balanced servers will return consistent results no matter |
| which server is queried. In contrast, asynchronous solutions allow some |
| delay between the time of a commit and its propagation to the other servers, |
| opening the possibility that some transactions might be lost in |
| the switch to a backup server, and that load-balanced servers |
| might return slightly stale results. Asynchronous communication |
| is used when synchronous would be too slow. |
| </para> |
| |
| <para> |
| Solutions can also be categorized by their granularity. Some solutions |
| can deal only with an entire database server, while others allow control |
| at the per-table or per-database level. |
| </para> |
| |
| <para> |
| Performance must be considered in any choice. There is usually a |
| trade-off between functionality and |
| performance. For example, a fully synchronous solution over a slow |
| network might cut performance by more than half, while an asynchronous |
| one might have a minimal performance impact. |
| </para> |
| |
| <para> |
| The remainder of this section outlines various failover, replication, |
| and load balancing solutions. |
| </para> |
| |
| <sect1 id="different-replication-solutions"> |
| <title>Comparison of Different Solutions</title> |
| |
| <variablelist> |
| |
| <varlistentry> |
| <term>Shared Disk Failover</term> |
| <listitem> |
| |
| <para> |
| Shared disk failover avoids synchronization overhead by having only one |
| copy of the database. It uses a single disk array that is shared by |
| multiple servers. If the main database server fails, the standby server |
| is able to mount and start the database as though it were recovering from |
| a database crash. This allows rapid failover with no data loss. |
| </para> |
| |
| <para> |
| Shared hardware functionality is common in network storage devices. |
| Using a network file system is also possible, though care must be |
| taken that the file system has full <acronym>POSIX</acronym> behavior (see <xref |
| linkend="creating-cluster-nfs"/>). One significant limitation of this |
| method is that if the shared disk array fails or becomes corrupt, the |
| primary and standby servers are both nonfunctional. Another issue is |
| that the standby server should never access the shared storage while |
| the primary server is running. |
| </para> |
| |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>File System (Block Device) Replication</term> |
| <listitem> |
| |
| <para> |
| A modified version of shared hardware functionality is file system |
| replication, where all changes to a file system are mirrored to a file |
| system residing on another computer. The only restriction is that |
| the mirroring must be done in a way that ensures the standby server |
| has a consistent copy of the file system — specifically, writes |
| to the standby must be done in the same order as those on the primary. |
| <productname>DRBD</productname> is a popular file system replication solution |
| for Linux. |
| </para> |
| |
| <!-- |
| https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html |
| |
| Oracle RAC is a shared disk approach and just send cache invalidations |
| to other nodes but not actual data. As the disk is shared, data is |
| only committed once to disk and there is a distributed locking |
| protocol to make nodes agree on a serializable transactional order. |
| --> |
| |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Write-Ahead Log Shipping</term> |
| <listitem> |
| |
| <para> |
| Warm and hot standby servers can be kept current by reading a |
| stream of write-ahead log (<acronym>WAL</acronym>) |
| records. If the main server fails, the standby contains |
| almost all of the data of the main server, and can be quickly |
| made the new primary database server. This can be synchronous or |
| asynchronous and can only be done for the entire database server. |
| </para> |
| <para> |
| A standby server can be implemented using file-based log shipping |
| (<xref linkend="warm-standby"/>) or streaming replication (see |
| <xref linkend="streaming-replication"/>), or a combination of both. For |
| information on hot standby, see <xref linkend="hot-standby"/>. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Logical Replication</term> |
| <listitem> |
| <para> |
| Logical replication allows a database server to send a stream of data |
| modifications to another server. <productname>PostgreSQL</productname> |
| logical replication constructs a stream of logical data modifications |
| from the WAL. Logical replication allows replication of data changes on |
| a per-table basis. In addition, a server that is publishing its own |
| changes can also subscribe to changes from another server, allowing data |
| to flow in multiple directions. For more information on logical |
| replication, see <xref linkend="logical-replication"/>. Through the |
| logical decoding interface (<xref linkend="logicaldecoding"/>), |
| third-party extensions can also provide similar functionality. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Trigger-Based Primary-Standby Replication</term> |
| <listitem> |
| |
| <para> |
| A trigger-based replication setup typically funnels data modification |
| queries to a designated primary server. Operating on a per-table basis, |
| the primary server sends data changes (typically) asynchronously to the |
| standby servers. Standby servers can answer queries while the primary is |
| running, and may allow some local data changes or write activity. This |
| form of replication is often used for offloading large analytical or data |
| warehouse queries. |
| </para> |
| |
| <para> |
| <productname>Slony-I</productname> is an example of this type of |
| replication, with per-table granularity, and support for multiple standby |
| servers. Because it updates the standby server asynchronously (in |
| batches), data can be lost during failover. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>SQL-Based Replication Middleware</term> |
| <listitem> |
| |
| <para> |
| With SQL-based replication middleware, a program intercepts |
| every SQL query and sends it to one or all servers. Each server |
| operates independently. Read-write queries must be sent to all servers, |
| so that every server receives any changes. But read-only queries can be |
| sent to just one server, allowing the read workload to be distributed |
| among them. |
| </para> |
| |
| <para> |
| If queries are simply broadcast unmodified, functions like |
| <function>random()</function>, <function>CURRENT_TIMESTAMP</function>, and |
| sequences can have different values on different servers. |
| This is because each server operates independently, and because |
| SQL queries are broadcast rather than actual data changes. If |
| this is unacceptable, either the middleware or the application |
| must determine such values from a single source and then use those |
| values in write queries. Care must also be taken that all |
| transactions either commit or abort on all servers, perhaps |
| using two-phase commit (<xref linkend="sql-prepare-transaction"/> |
| and <xref linkend="sql-commit-prepared"/>). |
| <productname>Pgpool-II</productname> and <productname>Continuent Tungsten</productname> |
| are examples of this type of replication. |
| </para> |
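| |
| <para> |
| As a minimal sketch of this two-phase commit approach (the transaction |
| identifier and table are illustrative, and <varname>max_prepared_transactions</varname> |
| must be set above zero on every server), the middleware would run the same |
| statements on each server and commit only once all of them have prepared |
| successfully: |
| <programlisting> |
| BEGIN; |
| UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 42; |
| PREPARE TRANSACTION 'middleware_tx_42';    -- run on every server |
| |
| -- once every server has prepared successfully: |
| COMMIT PREPARED 'middleware_tx_42'; |
| -- if any server failed to prepare, instead run: |
| -- ROLLBACK PREPARED 'middleware_tx_42'; |
| </programlisting> |
| </para> |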
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Asynchronous Multimaster Replication</term> |
| <listitem> |
| |
| <para> |
| For servers that are not regularly connected or have slow |
| communication links, like laptops or |
| remote servers, keeping data consistent among servers is a |
| challenge. Using asynchronous multimaster replication, each |
| server works independently, and periodically communicates with |
| the other servers to identify conflicting transactions. The |
| conflicts can be resolved by users or conflict resolution rules. |
| Bucardo is an example of this type of replication. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Synchronous Multimaster Replication</term> |
| <listitem> |
| |
| <para> |
| In synchronous multimaster replication, each server can accept |
| write requests, and modified data is transmitted from the |
| original server to every other server before each transaction |
| commits. Heavy write activity can cause excessive locking and |
| commit delays, leading to poor performance. Read requests can |
| be sent to any server. Some implementations use shared disk |
| to reduce the communication overhead. Synchronous multimaster |
| replication is best for mostly read workloads, though its big |
| advantage is that any server can accept write requests — |
| there is no need to partition workloads between primary and |
| standby servers, and because the data changes are sent from one |
| server to another, there is no problem with non-deterministic |
| functions like <function>random()</function>. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> does not offer this type of replication, |
| though <productname>PostgreSQL</productname> two-phase commit (<xref |
| linkend="sql-prepare-transaction"/> and <xref |
| linkend="sql-commit-prepared"/>) |
| can be used to implement this in application code or middleware. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| <para> |
| <xref linkend="high-availability-matrix"/> summarizes |
| the capabilities of the various solutions listed above. |
| </para> |
| |
| <table id="high-availability-matrix"> |
| <title>High Availability, Load Balancing, and Replication Feature Matrix</title> |
| <tgroup cols="9"> |
| <colspec colname="col1" colwidth="1.1*"/> |
| <colspec colname="col2" colwidth="1*"/> |
| <colspec colname="col3" colwidth="1*"/> |
| <colspec colname="col4" colwidth="1*"/> |
| <colspec colname="col5" colwidth="1*"/> |
| <colspec colname="col6" colwidth="1*"/> |
| <colspec colname="col7" colwidth="1*"/> |
| <colspec colname="col8" colwidth="1*"/> |
| <colspec colname="col9" colwidth="1*"/> |
| <thead> |
| <row> |
| <entry>Feature</entry> |
| <entry>Shared Disk</entry> |
| <entry>File System Repl.</entry> |
| <entry>Write-Ahead Log Shipping</entry> |
| <entry>Logical Repl.</entry> |
| <entry>Trigger-&zwsp;Based Repl.</entry> |
| <entry>SQL Repl. Middle-ware</entry> |
| <entry>Async. MM Repl.</entry> |
| <entry>Sync. MM Repl.</entry> |
| </row> |
| </thead> |
| |
| <tbody> |
| |
| <row> |
| <entry>Popular examples</entry> |
| <entry align="center">NAS</entry> |
| <entry align="center">DRBD</entry> |
| <entry align="center">built-in streaming repl.</entry> |
| <entry align="center">built-in logical repl., pglogical</entry> |
| <entry align="center">Londiste, Slony</entry> |
| <entry align="center">pgpool-II</entry> |
| <entry align="center">Bucardo</entry> |
| <entry align="center"></entry> |
| </row> |
| |
| <row> |
| <entry>Comm. method</entry> |
| <entry align="center">shared disk</entry> |
| <entry align="center">disk blocks</entry> |
| <entry align="center">WAL</entry> |
| <entry align="center">logical decoding</entry> |
| <entry align="center">table rows</entry> |
| <entry align="center">SQL</entry> |
| <entry align="center">table rows</entry> |
| <entry align="center">table rows and row locks</entry> |
| </row> |
| |
| <row> |
| <entry>No special hardware required</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>Allows multiple primary servers</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>No overhead on primary</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| </row> |
| |
| <row> |
| <entry>No waiting for multiple servers</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">with sync off</entry> |
| <entry align="center">with sync off</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| </row> |
| |
| <row> |
| <entry>Primary failure will never lose data</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">with sync on</entry> |
| <entry align="center">with sync on</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>Replicas accept read-only queries</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center">with hot standby</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>Per-table granularity</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>No conflict resolution necessary</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| There are a few solutions that do not fit into the above categories: |
| </para> |
| |
| <variablelist> |
| |
| <varlistentry> |
| <term>Data Partitioning</term> |
| <listitem> |
| |
| <para> |
| Data partitioning splits tables into data sets. Each set can |
| be modified by only one server. For example, data can be |
| partitioned by offices, e.g., London and Paris, with a server |
| in each office. If queries combining London and Paris data |
| are necessary, an application can query both servers, or |
| primary/standby replication can be used to keep a read-only copy |
| of the other office's data on each server. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Multiple-Server Parallel Query Execution</term> |
| <listitem> |
| |
| <para> |
| Many of the above solutions allow multiple servers to handle multiple |
| queries, but none allow a single query to use multiple servers to |
| complete faster. This solution allows multiple servers to work |
| concurrently on a single query. It is usually accomplished by |
| splitting the data among servers and having each server execute its |
| part of the query and return results to a central server where they |
| are combined and returned to the user. This can be implemented using the |
| <productname>PL/Proxy</productname> tool set. |
| </para> |
| |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| <para> |
| It should also be noted that because <productname>PostgreSQL</productname> |
| is open source and easily extended, a number of companies have |
| taken <productname>PostgreSQL</productname> and created commercial |
| closed-source solutions with unique failover, replication, and load |
| balancing capabilities. These are not discussed here. |
| </para> |
| |
| </sect1> |
| |
| |
| <sect1 id="warm-standby"> |
| <title>Log-Shipping Standby Servers</title> |
| |
| |
| <para> |
| Continuous archiving can be used to create a <firstterm>high |
| availability</firstterm> (HA) cluster configuration with one or more |
| <firstterm>standby servers</firstterm> ready to take over operations if the |
| primary server fails. This capability is widely referred to as |
| <firstterm>warm standby</firstterm> or <firstterm>log shipping</firstterm>. |
| </para> |
| |
| <para> |
| The primary and standby server work together to provide this capability, |
| though the servers are only loosely coupled. The primary server operates |
| in continuous archiving mode, while each standby server operates in |
| continuous recovery mode, reading the WAL files from the primary. No |
| changes to the database tables are required to enable this capability, |
| so it offers low administration overhead compared to some other |
| replication solutions. This configuration also has relatively low |
| performance impact on the primary server. |
| </para> |
| |
| <para> |
| Directly moving WAL records from one database server to another |
| is typically described as log shipping. <productname>PostgreSQL</productname> |
| implements file-based log shipping by transferring WAL records |
| one file (WAL segment) at a time. WAL files (16MB) can be |
| shipped easily and cheaply over any distance, whether it be to an |
| adjacent system, another system at the same site, or another system on |
| the far side of the globe. The bandwidth required for this technique |
| varies according to the transaction rate of the primary server. |
| Record-based log shipping is more granular and streams WAL changes |
| incrementally over a network connection (see <xref |
| linkend="streaming-replication"/>). |
| </para> |
| |
| <para> |
| It should be noted that log shipping is asynchronous, i.e., the WAL |
| records are shipped after transaction commit. As a result, there is a |
| window for data loss should the primary server suffer a catastrophic |
| failure; transactions not yet shipped will be lost. The size of the |
| data loss window in file-based log shipping can be limited by use of the |
| <varname>archive_timeout</varname> parameter, which can be set as low |
| as a few seconds. However such a low setting will |
| substantially increase the bandwidth required for file shipping. |
| Streaming replication (see <xref linkend="streaming-replication"/>) |
| allows a much smaller window of data loss. |
| </para> |
| |
| <para> |
| Recovery performance is sufficiently good that the standby will |
| typically be only moments away from full |
| availability once it has been activated. As a result, this is called |
| a warm standby configuration which offers high |
| availability. Restoring a server from an archived base backup and |
| rollforward will take considerably longer, so that technique only |
| offers a solution for disaster recovery, not high availability. |
| A standby server can also be used for read-only queries, in which case |
| it is called a Hot Standby server. See <xref linkend="hot-standby"/> for |
| more information. |
| </para> |
| |
| <indexterm zone="high-availability"> |
| <primary>warm standby</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>PITR standby</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>standby server</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>log shipping</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>witness server</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>STONITH</primary> |
| </indexterm> |
| |
| <sect2 id="standby-planning"> |
| <title>Planning</title> |
| |
| <para> |
| It is usually wise to create the primary and standby servers |
| so that they are as similar as possible, at least from the |
| perspective of the database server. In particular, the path names |
| associated with tablespaces will be passed across unmodified, so both |
| primary and standby servers must have the same mount paths for |
| tablespaces if that feature is used. Keep in mind that if |
| <xref linkend="sql-createtablespace"/> |
| is executed on the primary, any new mount point needed for it must |
| be created on the primary and all standby servers before the command |
| is executed. Hardware need not be exactly the same, but experience shows |
| that maintaining two identical systems is easier than maintaining two |
| dissimilar ones over the lifetime of the application and system. |
| In any case the hardware architecture must be the same — shipping |
| from, say, a 32-bit to a 64-bit system will not work. |
| </para> |
| |
| <para> |
| In general, log shipping between servers running different major |
| <productname>PostgreSQL</productname> release |
| levels is not possible. It is the policy of the PostgreSQL Global |
| Development Group not to make changes to disk formats during minor release |
| upgrades, so it is likely that running different minor release levels |
| on primary and standby servers will work successfully. However, no |
| formal support for that is offered and you are advised to keep primary |
| and standby servers at the same release level as much as possible. |
| When updating to a new minor release, the safest policy is to update |
| the standby servers first — a new minor release is more likely |
| to be able to read WAL files from a previous minor release than vice |
| versa. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="standby-server-operation" xreflabel="Standby Server Operation"> |
| <title>Standby Server Operation</title> |
| |
| <para> |
| A server enters standby mode if a |
| <anchor id="file-standby-signal" xreflabel="standby.signal"/> |
| <filename>standby.signal</filename> |
| <indexterm><primary><filename>standby.signal</filename></primary></indexterm> |
| file exists in the data directory when the server is started. |
| </para> |
| |
| <para> |
| In standby mode, the server continuously applies WAL received from the |
| primary server. The standby server can read WAL from a WAL archive |
| (see <xref linkend="guc-restore-command"/>) or directly from the primary |
| over a TCP connection (streaming replication). The standby server will |
| also attempt to restore any WAL found in the standby cluster's |
| <filename>pg_wal</filename> directory. That typically happens after a server |
| restart, when the standby again replays WAL that was streamed from the |
| primary before the restart, but you can also manually copy files to |
| <filename>pg_wal</filename> at any time to have them replayed. |
| </para> |
| |
| <para> |
| At startup, the standby begins by restoring all WAL available in the |
| archive location, calling <varname>restore_command</varname>. Once it |
| reaches the end of WAL available there and <varname>restore_command</varname> |
| fails, it tries to restore any WAL available in the <filename>pg_wal</filename> directory. |
| If that fails, and streaming replication has been configured, the |
| standby tries to connect to the primary server and start streaming WAL |
| from the last valid record found in archive or <filename>pg_wal</filename>. If that fails |
| or streaming replication is not configured, or if the connection is |
| later disconnected, the standby goes back to the beginning and tries to |
| restore the file from the archive again. This loop of retries from the |
| archive, <filename>pg_wal</filename>, and via streaming replication goes on until the server |
| is stopped or failover is triggered by a trigger file. |
| </para> |
| |
| <para> |
| Standby mode is exited and the server switches to normal operation |
| when <command>pg_ctl promote</command> is run, |
| <function>pg_promote()</function> is called, or a trigger file is found |
| (<varname>promote_trigger_file</varname>). Before failover, |
| any WAL immediately available in the archive or in <filename>pg_wal</filename> will be |
| restored, but no attempt is made to connect to the primary. |
| </para> |
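| |
| <para> |
| For example (the data directory path shown is only illustrative), promotion |
| can be triggered from the shell or from a connection to the standby: |
| <programlisting> |
| $ pg_ctl promote -D /path/to/standby/data |
| </programlisting> |
| or: |
| <programlisting> |
| SELECT pg_promote(); |
| </programlisting> |
| </para> |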
| </sect2> |
| |
| <sect2 id="preparing-primary-for-standby"> |
| <title>Preparing the Primary for Standby Servers</title> |
| |
| <para> |
| Set up continuous archiving on the primary to an archive directory |
| accessible from the standby, as described |
| in <xref linkend="continuous-archiving"/>. The archive location should be |
| accessible from the standby even when the primary is down, i.e., it should |
| reside on the standby server itself or another trusted server, not on |
| the primary server. |
| </para> |
| |
| <para> |
| If you want to use streaming replication, set up authentication on the |
| primary server to allow replication connections from the standby |
| server(s); that is, create a role and provide a suitable entry or |
| entries in <filename>pg_hba.conf</filename> with the database field set to |
| <literal>replication</literal>. Also ensure <varname>max_wal_senders</varname> is set |
| to a sufficiently large value in the configuration file of the primary |
| server. If replication slots will be used, |
| ensure that <varname>max_replication_slots</varname> is set sufficiently |
| high as well. |
| </para> |
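| |
| <para> |
| A minimal sketch of these steps follows; the role name, password, standby |
| address, and parameter values are only examples: |
| <programlisting> |
| -- on the primary, create a dedicated replication role |
| CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret'; |
| </programlisting> |
| <programlisting> |
| # pg_hba.conf on the primary |
| host  replication  replicator  192.168.1.100/32  md5 |
| |
| # postgresql.conf on the primary |
| max_wal_senders = 10 |
| max_replication_slots = 10 |
| </programlisting> |
| </para> |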
| |
| <para> |
| Take a base backup as described in <xref linkend="backup-base-backup"/> |
| to bootstrap the standby server. |
| </para> |
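| |
| <para> |
| For example, using <application>pg_basebackup</application> from the standby |
| host (the primary address, user, and target directory are illustrative): |
| <programlisting> |
| $ pg_basebackup -h 192.168.1.50 -U replicator -D /path/to/standby/data -P |
| </programlisting> |
| </para> |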
| </sect2> |
| |
| <sect2 id="standby-server-setup"> |
| <title>Setting Up a Standby Server</title> |
| |
| <para> |
| To set up the standby server, restore the base backup taken from primary |
| server (see <xref linkend="backup-pitr-recovery"/>). Create a file |
| <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary></indexterm> |
| in the standby's cluster data |
| directory. Set <xref linkend="guc-restore-command"/> to a simple command to copy files from |
| the WAL archive. If you plan to have multiple standby servers for high |
| availability purposes, make sure that <varname>recovery_target_timeline</varname> is set to |
| <literal>latest</literal> (the default), to make the standby server follow the timeline change |
| that occurs at failover to another standby. |
| </para> |
| |
| <note> |
| <para> |
| <xref linkend="guc-restore-command"/> should return immediately |
| if the file does not exist; the server will retry the command again if |
| necessary. |
| </para> |
| </note> |
| |
| <para> |
| If you want to use streaming replication, fill in |
| <xref linkend="guc-primary-conninfo"/> with a libpq connection string, including |
| the host name (or IP address) and any additional details needed to |
| connect to the primary server. If the primary needs a password for |
| authentication, the password needs to be specified in |
| <xref linkend="guc-primary-conninfo"/> as well. |
| </para> |
| |
| <para> |
| If you're setting up the standby server for high availability purposes, |
| set up WAL archiving, connections and authentication like the primary |
| server, because the standby server will work as a primary server after |
| failover. |
| </para> |
| |
| <para> |
| If you're using a WAL archive, its size can be minimized using the <xref |
| linkend="guc-archive-cleanup-command"/> parameter to remove files that are no |
| longer required by the standby server. |
| The <application>pg_archivecleanup</application> utility is designed specifically to |
| be used with <varname>archive_cleanup_command</varname> in typical single-standby |
| configurations, see <xref linkend="pgarchivecleanup"/>. |
| Note however, that if you're using the archive for backup purposes, you |
| need to retain files needed to recover from at least the latest base |
| backup, even if they're no longer needed by the standby. |
| </para> |
| |
| <para> |
| A simple example of configuration is: |
| <programlisting> |
| primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000''' |
| restore_command = 'cp /path/to/archive/%f %p' |
| archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r' |
| </programlisting> |
| </para> |
| |
| <para> |
| You can have any number of standby servers, but if you use streaming |
| replication, make sure you set <varname>max_wal_senders</varname> high enough in |
| the primary to allow them to be connected simultaneously. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="streaming-replication"> |
| <title>Streaming Replication</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Streaming Replication</primary> |
| </indexterm> |
| |
| <para> |
| Streaming replication allows a standby server to stay more up-to-date |
| than is possible with file-based log shipping. The standby connects |
| to the primary, which streams WAL records to the standby as they're |
| generated, without waiting for the WAL file to be filled. |
| </para> |
| |
| <para> |
| Streaming replication is asynchronous by default |
| (see <xref linkend="synchronous-replication"/>), in which case there is |
| a small delay between committing a transaction in the primary and the |
| changes becoming visible in the standby. This delay is however much |
| smaller than with file-based log shipping, typically under one second |
| assuming the standby is powerful enough to keep up with the load. With |
| streaming replication, <varname>archive_timeout</varname> is not required to |
| reduce the data loss window. |
| </para> |
| |
| <para> |
| If you use streaming replication without file-based continuous |
| archiving, the server might recycle old WAL segments before the standby |
| has received them. If this occurs, the standby will need to be |
| reinitialized from a new base backup. You can avoid this by setting |
| <varname>wal_keep_size</varname> to a value large enough to ensure that |
| WAL segments are not recycled too early, or by configuring a replication |
| slot for the standby. If you set up a WAL archive that's accessible from |
| the standby, these solutions are not required, since the standby can |
| always use the archive to catch up provided it retains enough segments. |
| </para> |
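| |
| <para> |
| For instance, either of the following on the primary (the value and slot |
| name are illustrative) can be used to retain extra WAL for the standby: |
| <programlisting> |
| # postgresql.conf on the primary |
| wal_keep_size = 1GB |
| </programlisting> |
| or, using a replication slot (see <xref linkend="streaming-replication-slots"/>): |
| <programlisting> |
| SELECT pg_create_physical_replication_slot('standby1_slot'); |
| </programlisting> |
| </para> |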
| |
| <para> |
| To use streaming replication, set up a file-based log-shipping standby |
| server as described in <xref linkend="warm-standby"/>. The step that |
| turns a file-based log-shipping standby into streaming replication |
| standby is setting the <varname>primary_conninfo</varname> setting |
| to point to the primary server. Set |
| <xref linkend="guc-listen-addresses"/> and authentication options |
| (see <filename>pg_hba.conf</filename>) on the primary so that the standby server |
| can connect to the <literal>replication</literal> pseudo-database on the primary |
| server (see <xref linkend="streaming-replication-authentication"/>). |
| </para> |
| |
| <para> |
| On systems that support the keepalive socket option, setting |
| <xref linkend="guc-tcp-keepalives-idle"/>, |
| <xref linkend="guc-tcp-keepalives-interval"/> and |
| <xref linkend="guc-tcp-keepalives-count"/> helps the primary promptly |
| notice a broken connection. |
| </para> |
| |
| <para> |
| Set the maximum number of concurrent connections from the standby servers |
| (see <xref linkend="guc-max-wal-senders"/> for details). |
| </para> |
| |
| <para> |
| When the standby is started and <varname>primary_conninfo</varname> is set |
| correctly, the standby will connect to the primary after replaying all |
| WAL files available in the archive. If the connection is established |
| successfully, you will see a <literal>walreceiver</literal> in the standby, and |
| a corresponding <literal>walsender</literal> process in the primary. |
| </para> |
| |
| <sect3 id="streaming-replication-authentication"> |
| <title>Authentication</title> |
| <para> |
| It is very important that the access privileges for replication be set up |
| so that only trusted users can read the WAL stream, because it is |
| easy to extract privileged information from it. Standby servers must |
| authenticate to the primary as an account that has the |
| <literal>REPLICATION</literal> privilege or a superuser. It is |
| recommended to create a dedicated user account with |
| <literal>REPLICATION</literal> and <literal>LOGIN</literal> |
| privileges for replication. While <literal>REPLICATION</literal> |
| privilege gives very high permissions, it does not allow the user to |
| modify any data on the primary system, which the |
| <literal>SUPERUSER</literal> privilege does. |
| </para> |
| |
| <para> |
| Client authentication for replication is controlled by a |
| <filename>pg_hba.conf</filename> record specifying <literal>replication</literal> in the |
| <replaceable>database</replaceable> field. For example, if the standby is running on |
| host IP <literal>192.168.1.100</literal> and the account name for replication |
| is <literal>foo</literal>, the administrator can add the following line to the |
| <filename>pg_hba.conf</filename> file on the primary: |
| |
| <programlisting> |
| # Allow the user "foo" from host 192.168.1.100 to connect to the primary |
| # as a replication standby if the user's password is correctly supplied. |
| # |
| # TYPE DATABASE USER ADDRESS METHOD |
| host replication foo 192.168.1.100/32 md5 |
| </programlisting> |
| </para> |
| <para> |
| The host name and port number of the primary, connection user name, |
| and password are specified in the <xref linkend="guc-primary-conninfo"/>. |
| The password can also be set in the <filename>~/.pgpass</filename> file on the |
| standby (specify <literal>replication</literal> in the <replaceable>database</replaceable> |
| field). |
| For example, if the primary is running on host IP <literal>192.168.1.50</literal>, |
| port <literal>5432</literal>, the account name for replication is |
| <literal>foo</literal>, and the password is <literal>foopass</literal>, the administrator |
| can add the following line to the <filename>postgresql.conf</filename> file on the |
| standby: |
| |
| <programlisting> |
| # The standby connects to the primary that is running on host 192.168.1.50 |
| # and port 5432 as the user "foo" whose password is "foopass". |
| primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' |
| </programlisting> |
| </para> |
| </sect3> |
| |
| <sect3 id="streaming-replication-monitoring"> |
| <title>Monitoring</title> |
| <para> |
| An important health indicator of streaming replication is the amount |
| of WAL records generated in the primary, but not yet applied in the |
| standby. You can calculate this lag by comparing the current WAL write |
| location on the primary with the last WAL location received by the |
| standby. These locations can be retrieved using |
| <function>pg_current_wal_lsn</function> on the primary and |
| <function>pg_last_wal_receive_lsn</function> on the standby, |
| respectively (see <xref linkend="functions-admin-backup-table"/> and |
| <xref linkend="functions-recovery-info-table"/> for details). |
| The last WAL receive location in the standby is also displayed in the |
| process status of the WAL receiver process, displayed using the |
| <command>ps</command> command (see <xref linkend="monitoring-ps"/> for details). |
| </para> |
| <para> |
| You can retrieve a list of WAL sender processes via the |
| <link linkend="monitoring-pg-stat-replication-view"><structname> |
| pg_stat_replication</structname></link> view. Large differences between |
| <function>pg_current_wal_lsn</function> and the view's <literal>sent_lsn</literal> field |
| might indicate that the primary server is under heavy load, while |
| differences between <literal>sent_lsn</literal> and |
| <function>pg_last_wal_receive_lsn</function> on the standby might indicate |
| network delay, or that the standby is under heavy load. |
| </para> |
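| <para> |
| As an illustration, the per-standby lag can be computed on the primary with |
| a query along these lines (the column selection is only an example): |
| <programlisting> |
| SELECT application_name, state, sent_lsn, replay_lsn, |
|        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes |
| FROM pg_stat_replication; |
| </programlisting> |
| </para> |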
| <para> |
| On a hot standby, the status of the WAL receiver process can be retrieved |
| via the <link linkend="monitoring-pg-stat-wal-receiver-view"> |
| <structname>pg_stat_wal_receiver</structname></link> view. A large |
| difference between <function>pg_last_wal_replay_lsn</function> and the |
| view's <literal>flushed_lsn</literal> indicates that WAL is being |
| received faster than it can be replayed. |
| </para> |
| </sect3> |
| </sect2> |
| |
| <sect2 id="streaming-replication-slots"> |
| <title>Replication Slots</title> |
| <indexterm> |
| <primary>replication slot</primary> |
| <secondary>streaming replication</secondary> |
| </indexterm> |
| <para> |
| Replication slots provide an automated way to ensure that the primary does |
| not remove WAL segments until they have been received by all standbys, |
| and that the primary does not remove rows which could cause a |
| <link linkend="hot-standby-conflict">recovery conflict</link> even when the |
| standby is disconnected. |
| </para> |
| <para> |
| In lieu of using replication slots, it is possible to prevent the removal |
| of old WAL segments using <xref linkend="guc-wal-keep-size"/>, or by |
| storing the segments in an archive using |
| <xref linkend="guc-archive-command"/>. |
| However, these methods often result in retaining more WAL segments than |
| required, whereas replication slots retain only the number of segments |
| known to be needed. On the other hand, replication slots can retain so |
| many WAL segments that they fill up the space allocated |
| for <literal>pg_wal</literal>; |
| <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files |
| retained by replication slots. |
| </para> |
| <para> |
| Similarly, <xref linkend="guc-hot-standby-feedback"/> |
| and <xref linkend="guc-vacuum-defer-cleanup-age"/> provide protection against |
| relevant rows being removed by vacuum, but the former provides no |
| protection during any time period when the standby is not connected, |
| and the latter often needs to be set to a high value to provide adequate |
| protection. Replication slots overcome these disadvantages. |
| </para> |
| <sect3 id="streaming-replication-slots-manipulation"> |
| <title>Querying and Manipulating Replication Slots</title> |
| <para> |
| Each replication slot has a name, which can contain lower-case letters, |
| numbers, and the underscore character. |
| </para> |
| <para> |
| Existing replication slots and their state can be seen in the |
| <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link> |
| view. |
| </para> |
| <para> |
| Slots can be created and dropped either via the streaming replication |
| protocol (see <xref linkend="protocol-replication"/>) or via SQL |
| functions (see <xref linkend="functions-replication"/>). |
| </para> |
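| <para> |
| For example, the amount of WAL currently retained on behalf of each slot |
| can be estimated with a query along these lines (the query itself is only |
| an illustration): |
| <programlisting> |
| SELECT slot_name, active, restart_lsn, |
|        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal |
| FROM pg_replication_slots; |
| </programlisting> |
| </para> |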
| </sect3> |
| <sect3 id="streaming-replication-slots-config"> |
| <title>Configuration Example</title> |
| <para> |
| You can create a replication slot like this: |
| <programlisting> |
| postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); |
| slot_name | lsn |
| -------------+----- |
| node_a_slot | |
| |
| postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots; |
| slot_name | slot_type | active |
| -------------+-----------+-------- |
| node_a_slot | physical | f |
| (1 row) |
| </programlisting> |
| To configure the standby to use this slot, <varname>primary_slot_name</varname> |
| should be configured on the standby. Here is a simple example: |
| <programlisting> |
| primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' |
| primary_slot_name = 'node_a_slot' |
| </programlisting> |
| </para> |
| </sect3> |
| </sect2> |
| |
| <sect2 id="cascading-replication"> |
| <title>Cascading Replication</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Cascading Replication</primary> |
| </indexterm> |
| |
| <para> |
| The cascading replication feature allows a standby server to accept replication |
| connections and stream WAL records to other standbys, acting as a relay. |
| This can be used to reduce the number of direct connections to the primary |
| and also to minimize inter-site bandwidth overheads. |
| </para> |
| |
| <para> |
| A standby acting as both a receiver and a sender is known as a cascading |
| standby. Standbys that are more directly connected to the primary are known |
| as upstream servers, while those standby servers further away are downstream |
| servers. Cascading replication does not place limits on the number or |
| arrangement of downstream servers, though each standby connects to only |
| one upstream server which eventually links to a single primary server. |
| </para> |
| |
| <para> |
| A cascading standby sends not only WAL records received from the |
| primary but also those restored from the archive. So even if the replication |
| connection to some upstream server is terminated, streaming replication |
| continues downstream for as long as new WAL records are available. |
| </para> |
| |
| <para> |
| Cascading replication is currently asynchronous. Synchronous replication |
| (see <xref linkend="synchronous-replication"/>) settings have no effect on |
| cascading replication at present. |
| </para> |
| |
| <para> |
| Hot Standby feedback propagates upstream, whatever the cascaded arrangement. |
| </para> |
| |
| <para> |
| If an upstream standby server is promoted to become the new primary, downstream |
| servers will continue to stream from the new primary if |
| <varname>recovery_target_timeline</varname> is set to <literal>'latest'</literal> (the default). |
| </para> |
| |
| <para> |
| To use cascading replication, set up the cascading standby so that it can |
| accept replication connections (that is, set |
| <xref linkend="guc-max-wal-senders"/> and <xref linkend="guc-hot-standby"/>, |
| and configure |
| <link linkend="auth-pg-hba-conf">host-based authentication</link>). |
| You will also need to set <varname>primary_conninfo</varname> in the downstream |
| standby to point to the cascading standby. |
| </para> |
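| |
| <para> |
| A minimal sketch (host names, addresses, and values are illustrative): |
| <programlisting> |
| # postgresql.conf on the cascading standby |
| hot_standby = on |
| max_wal_senders = 5 |
| primary_conninfo = 'host=primary.example.com port=5432 user=replicator' |
| |
| # pg_hba.conf on the cascading standby |
| host  replication  replicator  192.168.2.0/24  md5 |
| |
| # postgresql.conf on the downstream standby |
| primary_conninfo = 'host=cascading.example.com port=5432 user=replicator' |
| </programlisting> |
| </para> |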
| </sect2> |
| |
| <sect2 id="synchronous-replication"> |
| <title>Synchronous Replication</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Synchronous Replication</primary> |
| </indexterm> |
| |
| <para> |
| <productname>PostgreSQL</productname> streaming replication is asynchronous by |
| default. If the primary server |
| crashes then some transactions that were committed may not have been |
| replicated to the standby server, causing data loss. The amount |
| of data loss is proportional to the replication delay at the time of |
| failover. |
| </para> |
| |
| <para> |
| Synchronous replication offers the ability to confirm that all changes |
| made by a transaction have been transferred to one or more synchronous |
| standby servers. This extends the standard level of durability |
| offered by a transaction commit. This level of protection is referred |
| to as 2-safe replication in computer science theory, and group-1-safe |
| (group-safe and 1-safe) when <varname>synchronous_commit</varname> is set to |
| <literal>remote_write</literal>. |
| </para> |
| |
| <para> |
| When requesting synchronous replication, each commit of a |
| write transaction will wait until confirmation is |
| received that the commit has been written to the write-ahead log on disk |
| of both the primary and standby server. The only possibility that data |
| can be lost is if both the primary and the standby suffer crashes at the |
| same time. This can provide a much higher level of durability, though only |
| if the sysadmin is cautious about the placement and management of the two |
| servers. Waiting for confirmation increases the user's confidence that the |
| changes will not be lost in the event of server crashes but it also |
| necessarily increases the response time for the requesting transaction. |
| The minimum wait time is the round-trip time between primary and standby. |
| </para> |
| |
| <para> |
| Read-only transactions and transaction rollbacks need not wait for |
| replies from standby servers. Subtransaction commits do not wait for |
| responses from standby servers, only top-level commits. Long |
| running actions such as data loading or index building do not wait |
| until the very final commit message. All two-phase commit actions |
| require commit waits, including both prepare and commit. |
| </para> |
| |
| <para> |
| A synchronous standby can be a physical replication standby or a logical |
| replication subscriber. It can also be any other physical or logical WAL |
| replication stream consumer that knows how to send the appropriate |
| feedback messages. Besides the built-in physical and logical replication |
| systems, this includes special programs such |
| as <command>pg_receivewal</command> and <command>pg_recvlogical</command> |
| as well as some third-party replication systems and custom programs. |
| Check the respective documentation for details on synchronous replication |
| support. |
| </para> |
| |
| <sect3 id="synchronous-replication-config"> |
| <title>Basic Configuration</title> |
| |
| <para> |
| Once streaming replication has been configured, configuring synchronous |
| replication requires only one additional configuration step: |
| <xref linkend="guc-synchronous-standby-names"/> must be set to |
| a non-empty value. <varname>synchronous_commit</varname> must also be set to |
| <literal>on</literal>, but since this is the default value, typically no change is |
| required. (See <xref linkend="runtime-config-wal-settings"/> and |
| <xref linkend="runtime-config-replication-primary"/>.) |
| This configuration will cause each commit to wait for |
| confirmation that the standby has written the commit record to durable |
| storage. |
| <varname>synchronous_commit</varname> can be set by individual |
| users, so it can be configured in the configuration file, for particular |
| users or databases, or dynamically by applications, in order to control |
| the durability guarantee on a per-transaction basis. |
| </para> |
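| |
| <para> |
| A minimal example (the standby name is illustrative and must match the |
| standby's <varname>application_name</varname>, which can be set in its |
| <varname>primary_conninfo</varname> connection string): |
| <programlisting> |
| # postgresql.conf on the primary |
| synchronous_standby_names = 'standby1' |
| synchronous_commit = on        # the default |
| </programlisting> |
| </para> |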
| |
| <para> |
| After a commit record has been written to disk on the primary, the |
| WAL record is then sent to the standby. The standby sends reply |
| messages each time a new batch of WAL data is written to disk, unless |
| <varname>wal_receiver_status_interval</varname> is set to zero on the standby. |
| In the case that <varname>synchronous_commit</varname> is set to |
| <literal>remote_apply</literal>, the standby sends reply messages when the commit |
| record is replayed, making the transaction visible. |
| If the standby is chosen as a synchronous standby, according to the setting |
| of <varname>synchronous_standby_names</varname> on the primary, the reply |
| messages from that standby will be considered along with those from other |
| synchronous standbys to decide when to release transactions waiting for |
| confirmation that the commit record has been received. These parameters |
| allow the administrator to specify which standby servers should be |
| synchronous standbys. Note that the configuration of synchronous |
| replication is mainly on the primary. Named standbys must be directly |
| connected to the primary; the primary knows nothing about downstream |
| standby servers using cascaded replication. |
| </para> |
| |
| <para> |
| Setting <varname>synchronous_commit</varname> to <literal>remote_write</literal> will |
| cause each commit to wait for confirmation that the standby has received |
| the commit record and written it out to its own operating system, but not |
| for the data to be flushed to disk on the standby. This |
| setting provides a weaker guarantee of durability than <literal>on</literal> |
| does: the standby could lose the data in the event of an operating system |
| crash, though not a <productname>PostgreSQL</productname> crash. |
| However, it's a useful setting in practice |
| because it can decrease the response time for the transaction. |
| Data loss could only occur if both the primary and the standby crash and |
| the database of the primary gets corrupted at the same time. |
| </para> |
| |
| <para> |
| Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will |
| cause each commit to wait until the current synchronous standbys report |
| that they have replayed the transaction, making it visible to user |
| queries. In simple cases, this allows for load balancing with causal |
| consistency. |
| </para> |
| |
| <para> |
| Users will stop waiting if a fast shutdown is requested. However, as |
| when using asynchronous replication, the server will not fully |
| shut down until all outstanding WAL records are transferred to the currently |
| connected standby servers. |
| </para> |
| |
| </sect3> |
| |
| <sect3 id="synchronous-replication-multiple-standbys"> |
| <title>Multiple Synchronous Standbys</title> |
| |
| <para> |
| Synchronous replication supports one or more synchronous standby servers; |
| transactions will wait until all the standby servers which are considered |
| as synchronous confirm receipt of their data. The number of synchronous |
| standbys that transactions must wait for replies from is specified in |
| <varname>synchronous_standby_names</varname>. This parameter also specifies |
| a list of standby names and the method (<literal>FIRST</literal> and |
| <literal>ANY</literal>) to choose synchronous standbys from the listed ones. |
| </para> |
| <para> |
| The method <literal>FIRST</literal> specifies a priority-based synchronous |
| replication and makes transaction commits wait until their WAL records are |
| replicated to the requested number of synchronous standbys chosen based on |
| their priorities. The standbys whose names appear earlier in the list are |
| given higher priority and will be considered as synchronous. Other standby |
| servers appearing later in this list represent potential synchronous |
| standbys. If any of the current synchronous standbys disconnects for |
| whatever reason, it will be replaced immediately with the |
| next-highest-priority standby. |
| </para> |
| <para> |
| An example of <varname>synchronous_standby_names</varname> for |
| a priority-based multiple synchronous standbys is: |
| <programlisting> |
| synchronous_standby_names = 'FIRST 2 (s1, s2, s3)' |
| </programlisting> |
| In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>, |
| <literal>s3</literal> and <literal>s4</literal> are running, the two standbys |
| <literal>s1</literal> and <literal>s2</literal> will be chosen as synchronous standbys |
| because their names appear early in the list of standby names. |
| <literal>s3</literal> is a potential synchronous standby and will take over |
| the role of synchronous standby when either of <literal>s1</literal> or |
| <literal>s2</literal> fails. <literal>s4</literal> is an asynchronous standby since |
| its name is not in the list. |
| </para> |
| <para> |
| The method <literal>ANY</literal> specifies a quorum-based synchronous |
| replication and makes transaction commits wait until their WAL records |
| are replicated to <emphasis>at least</emphasis> the requested number of |
| synchronous standbys in the list. |
| </para> |
| <para> |
| An example of <varname>synchronous_standby_names</varname> for |
| a quorum-based multiple synchronous standbys is: |
| <programlisting> |
| synchronous_standby_names = 'ANY 2 (s1, s2, s3)' |
| </programlisting> |
| In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>, |
| <literal>s3</literal> and <literal>s4</literal> are running, transaction commits will |
| wait for replies from at least any two standbys of <literal>s1</literal>, |
| <literal>s2</literal> and <literal>s3</literal>. <literal>s4</literal> is an asynchronous |
| standby since its name is not in the list. |
| </para> |
| <para> |
| The synchronous states of standby servers can be viewed using |
| the <structname>pg_stat_replication</structname> view. |
| </para> |
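| <para> |
| For example, the name, priority, and synchronous state of each connected |
| standby can be listed with: |
| <programlisting> |
| SELECT application_name, sync_priority, sync_state |
| FROM pg_stat_replication; |
| </programlisting> |
| </para> |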
| </sect3> |
| |
| <sect3 id="synchronous-replication-performance"> |
| <title>Planning for Performance</title> |
| |
| <para> |
| Synchronous replication usually requires carefully planned and placed |
| standby servers to ensure applications perform acceptably. Waiting |
| doesn't utilize system resources, but transaction locks continue to be |
| held until the transfer is confirmed. As a result, incautious use of |
| synchronous replication will reduce performance for database |
| applications because of increased response times and higher contention. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> allows the application developer |
| to specify the durability level required via replication. This can be |
| specified for the system overall, though it can also be specified for |
| specific users or connections, or even individual transactions. |
| </para> |
| |
| <para> |
| For example, an application workload might consist of: |
| 10% of changes are important customer details, while |
| 90% of changes are less important data that the business can more |
| easily survive if it is lost, such as chat messages between users. |
| </para> |
| |
| <para> |
| With synchronous replication options specified at the application level |
| (on the primary) we can offer synchronous replication for the most |
| important changes, without slowing down the bulk of the total workload. |
| Application level options are an important and practical tool for allowing |
| the benefits of synchronous replication for high performance applications. |
| </para> |
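| |
| <para> |
| As a sketch of this approach (the tables are hypothetical), a low-value |
| transaction can opt out of the synchronous wait while the rest of the |
| workload keeps the stronger guarantee configured on the primary: |
| <programlisting> |
| -- important change: uses the configured synchronous_commit default |
| UPDATE customers SET email = 'new@example.com' WHERE customer_id = 42; |
| |
| -- low-value change: skip the synchronous wait for this transaction only |
| BEGIN; |
| SET LOCAL synchronous_commit TO local; |
| INSERT INTO chat_messages (sender_id, body) VALUES (7, 'hello'); |
| COMMIT; |
| </programlisting> |
| </para> |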
| |
| <para> |
| You should consider that the network bandwidth must be higher than |
| the rate of generation of WAL data. |
| </para> |
| |
| </sect3> |
| |
| <sect3 id="synchronous-replication-ha"> |
| <title>Planning for High Availability</title> |
| |
| <para> |
    <varname>synchronous_standby_names</varname> specifies the number and
    names of synchronous standbys from which transaction commits made with
    <varname>synchronous_commit</varname> set to <literal>on</literal>,
    <literal>remote_apply</literal> or <literal>remote_write</literal> must
    receive responses. Such transaction commits may never complete
    if any one of the synchronous standbys crashes.
| </para> |
| |
| <para> |
| The best solution for high availability is to ensure you keep as many |
| synchronous standbys as requested. This can be achieved by naming multiple |
| potential synchronous standbys using <varname>synchronous_standby_names</varname>. |
| </para> |
| |
| <para> |
    In priority-based synchronous replication, the standbys whose names
    appear earlier in the list will be used as synchronous standbys.
    Standbys listed after these will take over the role of synchronous standby
    if one of the current ones should fail.
| </para> |
| |
| <para> |
    In quorum-based synchronous replication, all the standbys appearing
    in the list will be used as candidates for synchronous standbys.
    Even if one of them should fail, the other standbys will continue to
    serve as candidates for synchronous standby.
| </para> |
| |
| <para> |
| When a standby first attaches to the primary, it will not yet be properly |
| synchronized. This is described as <literal>catchup</literal> mode. Once |
| the lag between standby and primary reaches zero for the first time |
| we move to real-time <literal>streaming</literal> state. |
| The catch-up duration may be long immediately after the standby has |
| been created. If the standby is shut down, then the catch-up period |
| will increase according to the length of time the standby has been down. |
| The standby is only able to become a synchronous standby |
| once it has reached <literal>streaming</literal> state. |
| This state can be viewed using |
| the <structname>pg_stat_replication</structname> view. |
| </para> |
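   <para>
    For example, the replication state of each attached standby can be
    checked on the primary with:
<programlisting>
SELECT application_name, state, sync_state FROM pg_stat_replication;
</programlisting>
    A standby reported in the <literal>catchup</literal> state cannot yet
    act as a synchronous standby; it can do so only once
    <structfield>state</structfield> reports <literal>streaming</literal>.
   </para>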
| |
| <para> |
    If the primary restarts while commits are waiting for acknowledgment, those
    waiting transactions will be marked fully committed once the primary
    database recovers.
    There is no way to be certain that all standbys have received all
    outstanding WAL data at the time the primary crashed. Some
| transactions may not show as committed on the standby, even though |
| they show as committed on the primary. The guarantee we offer is that |
| the application will not receive explicit acknowledgment of the |
| successful commit of a transaction until the WAL data is known to be |
| safely received by all the synchronous standbys. |
| </para> |
| |
| <para> |
| If you really cannot keep as many synchronous standbys as requested |
| then you should decrease the number of synchronous standbys that |
| transaction commits must wait for responses from |
| in <varname>synchronous_standby_names</varname> (or disable it) and |
| reload the configuration file on the primary server. |
| </para> |
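   <para>
    For example, to stop waiting for synchronous standbys entirely, disable
    the setting in <filename>postgresql.conf</filename> on the primary:
<programlisting>
synchronous_standby_names = ''
</programlisting>
    and then reload the configuration, for instance with
    <command>pg_ctl reload</command> or by calling
    <function>pg_reload_conf()</function>.
   </para>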
| |
| <para> |
    If the primary is isolated from the remaining standby servers, you should
    fail over to the best candidate among those standbys.
| </para> |
| |
| <para> |
| If you need to re-create a standby server while transactions are |
    waiting, make sure that the commands <function>pg_start_backup()</function> and
    <function>pg_stop_backup()</function> are run in a session with
| <varname>synchronous_commit</varname> = <literal>off</literal>, otherwise those |
| requests will wait forever for the standby to appear. |
| </para> |
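   <para>
    For instance, such a backup session on the primary might look like the
    following sketch (the backup label is arbitrary):
<programlisting>
SET synchronous_commit = off;
SELECT pg_start_backup('rebuild standby');
-- copy the data directory to the new standby here
SELECT pg_stop_backup();
</programlisting>
   </para>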
| |
| </sect3> |
| </sect2> |
| |
| <sect2 id="continuous-archiving-in-standby"> |
| <title>Continuous Archiving in Standby</title> |
| |
| <indexterm> |
| <primary>continuous archiving</primary> |
| <secondary>in standby</secondary> |
| </indexterm> |
| |
| <para> |
| When continuous WAL archiving is used in a standby, there are two |
| different scenarios: the WAL archive can be shared between the primary |
| and the standby, or the standby can have its own WAL archive. When |
| the standby has its own WAL archive, set <varname>archive_mode</varname> |
| to <literal>always</literal>, and the standby will call the archive |
| command for every WAL segment it receives, whether it's by restoring |
    from the archive or by streaming replication. The shared archive can
    be handled similarly, but the <varname>archive_command</varname> must
    test whether the file being archived already exists, and whether the
    existing file has identical contents. This requires more care in the
    <varname>archive_command</varname>: it must not overwrite an existing
    file with different contents, but must return success if exactly the
    same file is archived twice. And all that must be done free of race
    conditions, in case two servers attempt to archive the same file at
    the same time.
| </para> |
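   <para>
    A much-simplified sketch of such a command, which illustrates only the
    content check and is <emphasis>not</emphasis> safe against race
    conditions or partially-written files, might look like:
<programlisting>
archive_command = 'test ! -f /mnt/server/archivedir/%f &amp;&amp; cp %p /mnt/server/archivedir/%f || cmp -s %p /mnt/server/archivedir/%f'
</programlisting>
    A production-quality command needs additional locking or an atomic
    rename so that it stays correct when both servers archive the same
    file concurrently.
   </para>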
| |
| <para> |
| If <varname>archive_mode</varname> is set to <literal>on</literal>, the |
| archiver is not enabled during recovery or standby mode. If the standby |
| server is promoted, it will start archiving after the promotion, but |
| will not archive any WAL or timeline history files that |
    it did not generate itself. To get a complete
    series of WAL files in the archive, you must ensure that all WAL is
    archived before it reaches the standby. This is inherently true with
    file-based log shipping, as the standby can only restore files that
    are found in the archive, but it is not true if streaming replication
    is enabled.
| When a server is not in recovery mode, there is no difference between |
| <literal>on</literal> and <literal>always</literal> modes. |
| </para> |
| </sect2> |
| </sect1> |
| |
| <sect1 id="warm-standby-failover"> |
| <title>Failover</title> |
| |
| <para> |
| If the primary server fails then the standby server should begin |
| failover procedures. |
| </para> |
| |
| <para> |
| If the standby server fails then no failover need take place. If the |
| standby server can be restarted, even some time later, then the recovery |
| process can also be restarted immediately, taking advantage of |
| restartable recovery. If the standby server cannot be restarted, then a |
| full new standby server instance should be created. |
| </para> |
| |
| <para> |
| If the primary server fails and the standby server becomes the |
| new primary, and then the old primary restarts, you must have |
| a mechanism for informing the old primary that it is no longer the primary. This is |
| sometimes known as <acronym>STONITH</acronym> (Shoot The Other Node In The Head), which is |
| necessary to avoid situations where both systems think they are the |
| primary, which will lead to confusion and ultimately data loss. |
| </para> |
| |
| <para> |
| Many failover systems use just two systems, the primary and the standby, |
| connected by some kind of heartbeat mechanism to continually verify the |
| connectivity between the two and the viability of the primary. It is |
| also possible to use a third system (called a witness server) to prevent |
| some cases of inappropriate failover, but the additional complexity |
| might not be worthwhile unless it is set up with sufficient care and |
| rigorous testing. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> does not provide the system |
| software required to identify a failure on the primary and notify |
| the standby database server. Many such tools exist and are well |
| integrated with the operating system facilities required for |
| successful failover, such as IP address migration. |
| </para> |
| |
| <para> |
| Once failover to the standby occurs, there is only a |
| single server in operation. This is known as a degenerate state. |
| The former standby is now the primary, but the former primary is down |
| and might stay down. To return to normal operation, a standby server |
| must be recreated, |
| either on the former primary system when it comes up, or on a third, |
| possibly new, system. The <xref linkend="app-pgrewind"/> utility can be |
| used to speed up this process on large clusters. |
| Once complete, the primary and standby can be |
| considered to have switched roles. Some people choose to use a third |
| server to provide backup for the new primary until the new standby |
| server is recreated, |
| though clearly this complicates the system configuration and |
| operational processes. |
| </para> |
| |
| <para> |
| So, switching from primary to standby server can be fast but requires |
| some time to re-prepare the failover cluster. Regular switching from |
| primary to standby is useful, since it allows regular downtime on |
| each system for maintenance. This also serves as a test of the |
| failover mechanism to ensure that it will really work when you need it. |
| Written administration procedures are advised. |
| </para> |
| |
| <para> |
| To trigger failover of a log-shipping standby server, run |
| <command>pg_ctl promote</command>, call <function>pg_promote()</function>, |
| or create a trigger file with the file name and path specified by the |
| <varname>promote_trigger_file</varname>. If you're planning to use |
| <command>pg_ctl promote</command> or to call |
| <function>pg_promote()</function> to fail over, |
| <varname>promote_trigger_file</varname> is not required. If you're |
    setting up reporting servers that are only used to offload read-only
    queries from the primary, not for high availability purposes, you don't
    need to promote them.
| </para> |
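   <para>
    For example, connected to the standby with <application>psql</application>,
    promotion can be requested with:
<programlisting>
SELECT pg_promote(wait => true, wait_seconds => 60);
</programlisting>
    which waits up to 60 seconds for the promotion to complete; the same
    effect can be achieved from the command line with
    <command>pg_ctl promote</command>.
   </para>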
| </sect1> |
| |
| <sect1 id="hot-standby"> |
| <title>Hot Standby</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Hot Standby</primary> |
| </indexterm> |
| |
| <para> |
| Hot Standby is the term used to describe the ability to connect to |
| the server and run read-only queries while the server is in archive |
| recovery or standby mode. This |
| is useful both for replication purposes and for restoring a backup |
| to a desired state with great precision. |
| The term Hot Standby also refers to the ability of the server to move |
| from recovery through to normal operation while users continue running |
| queries and/or keep their connections open. |
| </para> |
| |
| <para> |
| Running queries in hot standby mode is similar to normal query operation, |
| though there are several usage and administrative differences |
| explained below. |
| </para> |
| |
| <sect2 id="hot-standby-users"> |
| <title>User's Overview</title> |
| |
| <para> |
| When the <xref linkend="guc-hot-standby"/> parameter is set to true on a |
| standby server, it will begin accepting connections once the recovery has |
| brought the system to a consistent state. All such connections are |
| strictly read-only; not even temporary tables may be written. |
| </para> |
| |
| <para> |
| The data on the standby takes some time to arrive from the primary server |
| so there will be a measurable delay between primary and standby. Running the |
| same query nearly simultaneously on both primary and standby might therefore |
| return differing results. We say that data on the standby is |
| <firstterm>eventually consistent</firstterm> with the primary. Once the |
| commit record for a transaction is replayed on the standby, the changes |
| made by that transaction will be visible to any new snapshots taken on |
| the standby. Snapshots may be taken at the start of each query or at the |
| start of each transaction, depending on the current transaction isolation |
| level. For more details, see <xref linkend="transaction-iso"/>. |
| </para> |
| |
| <para> |
| Transactions started during hot standby may issue the following commands: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Query access: <command>SELECT</command>, <command>COPY TO</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Cursor commands: <command>DECLARE</command>, <command>FETCH</command>, <command>CLOSE</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Settings: <command>SHOW</command>, <command>SET</command>, <command>RESET</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Transaction management commands: |
| <itemizedlist> |
| <listitem> |
| <para> |
| <command>BEGIN</command>, <command>END</command>, <command>ABORT</command>, <command>START TRANSACTION</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SAVEPOINT</command>, <command>RELEASE</command>, <command>ROLLBACK TO SAVEPOINT</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>EXCEPTION</command> blocks and other internal subtransactions |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LOCK TABLE</command>, though only when explicitly in one of these modes: |
| <literal>ACCESS SHARE</literal>, <literal>ROW SHARE</literal> or <literal>ROW EXCLUSIVE</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Plans and resources: <command>PREPARE</command>, <command>EXECUTE</command>, |
| <command>DEALLOCATE</command>, <command>DISCARD</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Plugins and extensions: <command>LOAD</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>UNLISTEN</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
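   <para>
    As a simple illustration (the <literal>accounts</literal> table is
    hypothetical), a hot standby session might run:
<programlisting>
BEGIN;
DECLARE cur CURSOR FOR SELECT * FROM accounts ORDER BY account_id;
FETCH 10 FROM cur;
CLOSE cur;
LOCK TABLE accounts IN ACCESS SHARE MODE;
COMMIT;
</programlisting>
   </para>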
| |
| <para> |
| Transactions started during hot standby will never be assigned a |
| transaction ID and cannot write to the system write-ahead log. |
| Therefore, the following actions will produce error messages: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Data Manipulation Language (DML): <command>INSERT</command>, |
| <command>UPDATE</command>, <command>DELETE</command>, <command>COPY FROM</command>, |
| <command>TRUNCATE</command>. |
| Note that there are no allowed actions that result in a trigger |
| being executed during recovery. This restriction applies even to |
| temporary tables, because table rows cannot be read or written without |
| assigning a transaction ID, which is currently not possible in a |
| Hot Standby environment. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Data Definition Language (DDL): <command>CREATE</command>, |
| <command>DROP</command>, <command>ALTER</command>, <command>COMMENT</command>. |
| This restriction applies even to temporary tables, because carrying |
| out these operations would require updating the system catalog tables. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SELECT ... FOR SHARE | UPDATE</command>, because row locks cannot be |
| taken without updating the underlying data files. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Rules on <command>SELECT</command> statements that generate DML commands. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LOCK</command> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LOCK</command> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Transaction management commands that explicitly set non-read-only state: |
| <itemizedlist> |
| <listitem> |
| <para> |
| <command>BEGIN READ WRITE</command>, |
| <command>START TRANSACTION READ WRITE</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SET TRANSACTION READ WRITE</command>, |
| <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SET transaction_read_only = off</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Two-phase commit commands: <command>PREPARE TRANSACTION</command>, |
| <command>COMMIT PREPARED</command>, <command>ROLLBACK PREPARED</command> |
| because even read-only transactions need to write WAL in the |
| prepare phase (the first phase of two phase commit). |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Sequence updates: <function>nextval()</function>, <function>setval()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LISTEN</command>, <command>NOTIFY</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| In normal operation, <quote>read-only</quote> transactions are allowed to |
| use <command>LISTEN</command> and <command>NOTIFY</command>, |
| so Hot Standby sessions operate under slightly tighter |
| restrictions than ordinary read-only sessions. It is possible that some |
| of these restrictions might be loosened in a future release. |
| </para> |
| |
| <para> |
| During hot standby, the parameter <varname>transaction_read_only</varname> is always |
| true and may not be changed. But as long as no attempt is made to modify |
| the database, connections during hot standby will act much like any other |
| database connection. If failover or switchover occurs, the database will |
| switch to normal processing mode. Sessions will remain connected while the |
| server changes mode. Once hot standby finishes, it will be possible to |
| initiate read-write transactions (even from a session begun during |
| hot standby). |
| </para> |
| |
| <para> |
| Users can determine whether hot standby is currently active for their |
| session by issuing <command>SHOW in_hot_standby</command>. |
| (In server versions before 14, the <varname>in_hot_standby</varname> |
| parameter did not exist; a workable substitute method for older servers |
| is <command>SHOW transaction_read_only</command>.) In addition, a set of |
| functions (<xref linkend="functions-recovery-info-table"/>) allow users to |
| access information about the standby server. These allow you to write |
| programs that are aware of the current state of the database. These |
| can be used to monitor the progress of recovery, or to allow you to |
| write complex programs that restore the database to particular states. |
| </para> |
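   <para>
    For example, the following queries report whether the current session
    is running on a standby and how far WAL replay has progressed:
<programlisting>
SHOW in_hot_standby;
SELECT pg_is_in_recovery();
SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
</programlisting>
   </para>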
| </sect2> |
| |
| <sect2 id="hot-standby-conflict"> |
| <title>Handling Query Conflicts</title> |
| |
| <para> |
| The primary and standby servers are in many ways loosely connected. Actions |
| on the primary will have an effect on the standby. As a result, there is |
| potential for negative interactions or conflicts between them. The easiest |
| conflict to understand is performance: if a huge data load is taking place |
| on the primary then this will generate a similar stream of WAL records on the |
| standby, so standby queries may contend for system resources, such as I/O. |
| </para> |
| |
| <para> |
| There are also additional types of conflict that can occur with Hot Standby. |
| These conflicts are <emphasis>hard conflicts</emphasis> in the sense that queries |
| might need to be canceled and, in some cases, sessions disconnected to resolve them. |
| The user is provided with several ways to handle these |
| conflicts. Conflict cases include: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Access Exclusive locks taken on the primary server, including both |
| explicit <command>LOCK</command> commands and various <acronym>DDL</acronym> |
| actions, conflict with table accesses in standby queries. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Dropping a tablespace on the primary conflicts with standby queries |
| using that tablespace for temporary work files. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Dropping a database on the primary conflicts with sessions connected |
| to that database on the standby. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Application of a vacuum cleanup record from WAL conflicts with |
| standby transactions whose snapshots can still <quote>see</quote> any of |
| the rows to be removed. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Application of a vacuum cleanup record from WAL conflicts with |
| queries accessing the target page on the standby, whether or not |
| the data to be removed is visible. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| On the primary server, these cases simply result in waiting; and the |
| user might choose to cancel either of the conflicting actions. However, |
| on the standby there is no choice: the WAL-logged action already occurred |
| on the primary so the standby must not fail to apply it. Furthermore, |
| allowing WAL application to wait indefinitely may be very undesirable, |
| because the standby's state will become increasingly far behind the |
| primary's. Therefore, a mechanism is provided to forcibly cancel standby |
| queries that conflict with to-be-applied WAL records. |
| </para> |
| |
| <para> |
| An example of the problem situation is an administrator on the primary |
| server running <command>DROP TABLE</command> on a table that is currently being |
| queried on the standby server. Clearly the standby query cannot continue |
| if the <command>DROP TABLE</command> is applied on the standby. If this situation |
| occurred on the primary, the <command>DROP TABLE</command> would wait until the |
| other query had finished. But when <command>DROP TABLE</command> is run on the |
| primary, the primary doesn't have information about what queries are |
| running on the standby, so it will not wait for any such standby |
| queries. The WAL change records come through to the standby while the |
| standby query is still running, causing a conflict. The standby server |
| must either delay application of the WAL records (and everything after |
| them, too) or else cancel the conflicting query so that the <command>DROP |
| TABLE</command> can be applied. |
| </para> |
| |
| <para> |
| When a conflicting query is short, it's typically desirable to allow it to |
| complete by delaying WAL application for a little bit; but a long delay in |
| WAL application is usually not desirable. So the cancel mechanism has |
| parameters, <xref linkend="guc-max-standby-archive-delay"/> and <xref |
| linkend="guc-max-standby-streaming-delay"/>, that define the maximum |
| allowed delay in WAL application. Conflicting queries will be canceled |
| once it has taken longer than the relevant delay setting to apply any |
| newly-received WAL data. There are two parameters so that different delay |
| values can be specified for the case of reading WAL data from an archive |
| (i.e., initial recovery from a base backup or <quote>catching up</quote> a |
| standby server that has fallen far behind) versus reading WAL data via |
| streaming replication. |
| </para> |
| |
| <para> |
| In a standby server that exists primarily for high availability, it's |
| best to set the delay parameters relatively short, so that the server |
| cannot fall far behind the primary due to delays caused by standby |
| queries. However, if the standby server is meant for executing |
| long-running queries, then a high or even infinite delay value may be |
| preferable. Keep in mind however that a long-running query could |
| cause other sessions on the standby server to not see recent changes |
| on the primary, if it delays application of WAL records. |
| </para> |
| |
| <para> |
| Once the delay specified by <varname>max_standby_archive_delay</varname> or |
| <varname>max_standby_streaming_delay</varname> has been exceeded, conflicting |
| queries will be canceled. This usually results just in a cancellation |
| error, although in the case of replaying a <command>DROP DATABASE</command> |
| the entire conflicting session will be terminated. Also, if the conflict |
| is over a lock held by an idle transaction, the conflicting session is |
| terminated (this behavior might change in the future). |
| </para> |
| |
| <para> |
| Canceled queries may be retried immediately (after beginning a new |
| transaction, of course). Since query cancellation depends on |
| the nature of the WAL records being replayed, a query that was |
| canceled may well succeed if it is executed again. |
| </para> |
| |
| <para> |
| Keep in mind that the delay parameters are compared to the elapsed time |
| since the WAL data was received by the standby server. Thus, the grace |
| period allowed to any one query on the standby is never more than the |
| delay parameter, and could be considerably less if the standby has already |
| fallen behind as a result of waiting for previous queries to complete, or |
| as a result of being unable to keep up with a heavy update load. |
| </para> |
| |
| <para> |
| The most common reason for conflict between standby queries and WAL replay |
| is <quote>early cleanup</quote>. Normally, <productname>PostgreSQL</productname> allows |
| cleanup of old row versions when there are no transactions that need to |
| see them to ensure correct visibility of data according to MVCC rules. |
| However, this rule can only be applied for transactions executing on the |
| primary. So it is possible that cleanup on the primary will remove row |
| versions that are still visible to a transaction on the standby. |
| </para> |
| |
| <para> |
| Experienced users should note that both row version cleanup and row version |
| freezing will potentially conflict with standby queries. Running a manual |
| <command>VACUUM FREEZE</command> is likely to cause conflicts even on tables with |
| no updated or deleted rows. |
| </para> |
| |
| <para> |
| Users should be clear that tables that are regularly and heavily updated |
| on the primary server will quickly cause cancellation of longer running |
| queries on the standby. In such cases the setting of a finite value for |
| <varname>max_standby_archive_delay</varname> or |
| <varname>max_standby_streaming_delay</varname> can be considered similar to |
| setting <varname>statement_timeout</varname>. |
| </para> |
| |
| <para> |
| Remedial possibilities exist if the number of standby-query cancellations |
| is found to be unacceptable. The first option is to set the parameter |
| <varname>hot_standby_feedback</varname>, which prevents <command>VACUUM</command> from |
| removing recently-dead rows and so cleanup conflicts do not occur. |
| If you do this, you |
| should note that this will delay cleanup of dead rows on the primary, |
| which may result in undesirable table bloat. However, the cleanup |
| situation will be no worse than if the standby queries were running |
| directly on the primary server, and you are still getting the benefit of |
| off-loading execution onto the standby. |
| If standby servers connect and disconnect frequently, you |
| might want to make adjustments to handle the period when |
| <varname>hot_standby_feedback</varname> feedback is not being provided. |
| For example, consider increasing <varname>max_standby_archive_delay</varname> |
| so that queries are not rapidly canceled by conflicts in WAL archive |
| files during disconnected periods. You should also consider increasing |
| <varname>max_standby_streaming_delay</varname> to avoid rapid cancellations |
| by newly-arrived streaming WAL entries after reconnection. |
| </para> |
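   <para>
    A sketch of such a configuration on the standby, with example values
    only, might be:
<programlisting>
hot_standby_feedback = on
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
</programlisting>
   </para>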
| |
| <para> |
| Another option is to increase <xref linkend="guc-vacuum-defer-cleanup-age"/> |
| on the primary server, so that dead rows will not be cleaned up as quickly |
| as they normally would be. This will allow more time for queries to |
| execute before they are canceled on the standby, without having to set |
| a high <varname>max_standby_streaming_delay</varname>. However it is |
| difficult to guarantee any specific execution-time window with this |
| approach, since <varname>vacuum_defer_cleanup_age</varname> is measured in |
| transactions executed on the primary server. |
| </para> |
| |
| <para> |
| The number of query cancels and the reason for them can be viewed using |
| the <structname>pg_stat_database_conflicts</structname> system view on the standby |
| server. The <structname>pg_stat_database</structname> system view also contains |
| summary information. |
| </para> |
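   <para>
    For example, per-database conflict counts can be inspected on the
    standby with:
<programlisting>
SELECT datname, confl_snapshot, confl_lock, confl_tablespace,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;
</programlisting>
   </para>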
| |
| <para> |
| Users can control whether a log message is produced when WAL replay is waiting |
| longer than <varname>deadlock_timeout</varname> for conflicts. This |
| is controlled by the <xref linkend="guc-log-recovery-conflict-waits"/> parameter. |
| </para> |
| </sect2> |
| |
| <sect2 id="hot-standby-admin"> |
| <title>Administrator's Overview</title> |
| |
| <para> |
| If <varname>hot_standby</varname> is <literal>on</literal> in <filename>postgresql.conf</filename> |
| (the default value) and there is a |
| <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary><secondary>for hot standby</secondary></indexterm> |
| file present, the server will run in Hot Standby mode. |
| However, it may take some time for Hot Standby connections to be allowed, |
| because the server will not accept connections until it has completed |
| sufficient recovery to provide a consistent state against which queries |
| can run. During this period, |
| clients that attempt to connect will be refused with an error message. |
| To confirm the server has come up, either loop trying to connect from |
| the application, or look for these messages in the server logs: |
| |
| <programlisting> |
| LOG: entering standby mode |
| |
| ... then some time later ... |
| |
| LOG: consistent recovery state reached |
| LOG: database system is ready to accept read-only connections |
| </programlisting> |
| |
| Consistency information is recorded once per checkpoint on the primary. |
| It is not possible to enable hot standby when reading WAL |
| written during a period when <varname>wal_level</varname> was not set to |
| <literal>replica</literal> or <literal>logical</literal> on the primary. Reaching |
| a consistent state can also be delayed in the presence of both of these |
| conditions: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| A write transaction has more than 64 subtransactions |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Very long-lived write transactions |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
    If you are running file-based log shipping (<quote>warm standby</quote>), you might need
| to wait until the next WAL file arrives, which could be as long as the |
| <varname>archive_timeout</varname> setting on the primary. |
| </para> |
| |
| <para> |
| The settings of some parameters determine the size of shared memory for |
| tracking transaction IDs, locks, and prepared transactions. These shared |
| memory structures must be no smaller on a standby than on the primary in |
| order to ensure that the standby does not run out of shared memory during |
| recovery. For example, if the primary had used a prepared transaction but |
| the standby had not allocated any shared memory for tracking prepared |
| transactions, then recovery could not continue until the standby's |
| configuration is changed. The parameters affected are: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <varname>max_connections</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_prepared_transactions</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_locks_per_transaction</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_wal_senders</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_worker_processes</varname> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| The easiest way to ensure this does not become a problem is to have these |
| parameters set on the standbys to values equal to or greater than on the |
| primary. Therefore, if you want to increase these values, you should do |
| so on all standby servers first, before applying the changes to the |
| primary server. Conversely, if you want to decrease these values, you |
| should do so on the primary server first, before applying the changes to |
| all standby servers. Keep in mind that when a standby is promoted, it |
| becomes the new reference for the required parameter settings for the |
| standbys that follow it. Therefore, to avoid this becoming a problem |
| during a switchover or failover, it is recommended to keep these settings |
| the same on all standby servers. |
| </para> |
| |
| <para> |
| The WAL tracks changes to these parameters on the |
| primary. If a hot standby processes WAL that indicates that the current |
| value on the primary is higher than its own value, it will log a warning |
| and pause recovery, for example: |
| <screen> |
| WARNING: hot standby is not possible because of insufficient parameter settings |
| DETAIL: max_connections = 80 is a lower setting than on the primary server, where its value was 100. |
| LOG: recovery has paused |
| DETAIL: If recovery is unpaused, the server will shut down. |
| HINT: You can then restart the server after making the necessary configuration changes. |
| </screen> |
| At that point, the settings on the standby need to be updated and the |
| instance restarted before recovery can continue. If the standby is not a |
| hot standby, then when it encounters the incompatible parameter change, it |
| will shut down immediately without pausing, since there is then no value |
| in keeping it up. |
| </para> |
| |
| <para> |
| It is important that the administrator select appropriate settings for |
| <xref linkend="guc-max-standby-archive-delay"/> and <xref |
| linkend="guc-max-standby-streaming-delay"/>. The best choices vary |
| depending on business priorities. For example if the server is primarily |
| tasked as a High Availability server, then you will want low delay |
| settings, perhaps even zero, though that is a very aggressive setting. If |
| the standby server is tasked as an additional server for decision support |
| queries then it might be acceptable to set the maximum delay values to |
| many hours, or even -1 which means wait forever for queries to complete. |
| </para> |
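   <para>
    As an illustration, a pure high-availability standby and a reporting
    standby might use settings along these lines (the values are examples
    only):
<programlisting>
# high-availability standby: keep replay as current as possible
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s

# reporting standby: let long-running queries finish
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
</programlisting>
   </para>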
| |
| <para> |
    Transaction status <quote>hint bits</quote> written on the primary are not
    WAL-logged, so the standby will likely re-write the hints itself.
| Thus, the standby server will still perform disk writes even though |
| all users are read-only; no changes occur to the data values |
| themselves. Users will still write large sort temporary files and |
| re-generate relcache info files, so no part of the database |
| is truly read-only during hot standby mode. |
    Note also that writes to remote databases using the
    <application>dblink</application> module, and other operations outside the
| database using PL functions will still be possible, even though the |
| transaction is read-only locally. |
| </para> |
| |
| <para> |
| The following types of administration commands are not accepted |
| during recovery mode: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Data Definition Language (DDL): e.g., <command>CREATE INDEX</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Privilege and Ownership: <command>GRANT</command>, <command>REVOKE</command>, |
| <command>REASSIGN</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Maintenance commands: <command>ANALYZE</command>, <command>VACUUM</command>, |
| <command>CLUSTER</command>, <command>REINDEX</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
    Again, note that some of these commands are actually allowed during
    <quote>read-only</quote> transactions on the primary.
| </para> |
| |
| <para> |
| As a result, you cannot create additional indexes that exist solely |
| on the standby, nor statistics that exist solely on the standby. |
| If these administration commands are needed, they should be executed |
| on the primary, and eventually those changes will propagate to the |
| standby. |
| </para> |
| |
| <para> |
| <function>pg_cancel_backend()</function> |
| and <function>pg_terminate_backend()</function> will work on user backends, |
| but not the Startup process, which performs |
| recovery. <structname>pg_stat_activity</structname> does not show |
| recovering transactions as active. As a result, |
| <structname>pg_prepared_xacts</structname> is always empty during |
| recovery. If you wish to resolve in-doubt prepared transactions, view |
| <literal>pg_prepared_xacts</literal> on the primary and issue commands to |
| resolve transactions there or resolve them after the end of recovery. |
| </para> |
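   <para>
    For example, on the primary (the transaction identifier shown is
    hypothetical):
<programlisting>
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
COMMIT PREPARED 'some_gid';      -- or ROLLBACK PREPARED 'some_gid'
</programlisting>
   </para>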
| |
| <para> |
| <structname>pg_locks</structname> will show locks held by backends, |
| as normal. <structname>pg_locks</structname> also shows |
| a virtual transaction managed by the Startup process that owns all |
| <literal>AccessExclusiveLocks</literal> held by transactions being replayed by recovery. |
| Note that the Startup process does not acquire locks to |
| make database changes, and thus locks other than <literal>AccessExclusiveLocks</literal> |
| do not show in <structname>pg_locks</structname> for the Startup |
| process; they are just presumed to exist. |
| </para> |
| |
| <para> |
| The <productname>Nagios</productname> plugin <productname>check_pgsql</productname> will |
| work, because the simple information it checks for exists. |
| The <productname>check_postgres</productname> monitoring script will also work, |
| though some reported values could give different or confusing results. |
| For example, last vacuum time will not be maintained, since no |
| vacuum occurs on the standby. Vacuums running on the primary |
| do still send their changes to the standby. |
| </para> |
| |
| <para> |
| WAL file control commands will not work during recovery, |
| e.g., <function>pg_start_backup</function>, <function>pg_switch_wal</function> etc. |
| </para> |
| |
| <para> |
| Dynamically loadable modules work, including <structname>pg_stat_statements</structname>. |
| </para> |
| |
| <para> |
| Advisory locks work normally in recovery, including deadlock detection. |
| Note that advisory locks are never WAL logged, so it is impossible for |
| an advisory lock on either the primary or the standby to conflict with WAL |
| replay. Nor is it possible to acquire an advisory lock on the primary |
| and have it initiate a similar advisory lock on the standby. Advisory |
| locks relate only to the server on which they are acquired. |
| </para> |
| |
| <para> |
| Trigger-based replication systems such as <productname>Slony</productname>, |
| <productname>Londiste</productname> and <productname>Bucardo</productname> won't run on the |
| standby at all, though they will run happily on the primary server as |
| long as the changes are not sent to standby servers to be applied. |
| WAL replay is not trigger-based so you cannot relay from the |
| standby to any system that requires additional database writes or |
| relies on the use of triggers. |
| </para> |
| |
| <para> |
| New OIDs cannot be assigned, though some <acronym>UUID</acronym> generators may still |
| work as long as they do not rely on writing new status to the database. |
| </para> |
| |
| <para> |
| Currently, temporary table creation is not allowed during read-only |
| transactions, so in some cases existing scripts will not run correctly. |
| This restriction might be relaxed in a later release. This is |
| both an SQL Standard compliance issue and a technical issue. |
| </para> |
| |
| <para> |
| <command>DROP TABLESPACE</command> can only succeed if the tablespace is empty. |
| Some standby users may be actively using the tablespace via their |
| <varname>temp_tablespaces</varname> parameter. If there are temporary files in the |
| tablespace, all active queries are canceled to ensure that temporary |
| files are removed, so the tablespace can be removed and WAL replay |
| can continue. |
| </para> |
| |
| <para> |
| Running <command>DROP DATABASE</command> or <command>ALTER DATABASE ... SET |
| TABLESPACE</command> on the primary |
| will generate a WAL entry that will cause all users connected to that |
| database on the standby to be forcibly disconnected. This action occurs |
| immediately, whatever the setting of |
| <varname>max_standby_streaming_delay</varname>. Note that |
    <command>ALTER DATABASE ... RENAME</command> does not disconnect users, which
    in most cases will go unnoticed, though it might in some cases confuse a
    program that depends in some way upon the database name.
| </para> |
| |
| <para> |
| In normal (non-recovery) mode, if you issue <command>DROP USER</command> or <command>DROP ROLE</command> |
| for a role with login capability while that user is still connected then |
| nothing happens to the connected user — they remain connected. The user cannot |
| reconnect however. This behavior applies in recovery also, so a |
| <command>DROP USER</command> on the primary does not disconnect that user on the standby. |
| </para> |
| |
| <para> |
| The statistics collector is active during recovery. All scans, reads, blocks, |
    index usage, etc., will be recorded normally on the standby. Replayed
    actions will not duplicate their effects on the primary, so replaying an
    insert will not increment the Inserts column of
    <structname>pg_stat_user_tables</structname>.
| The stats file is deleted at the start of recovery, so stats from primary |
| and standby will differ; this is considered a feature, not a bug. |
| </para> |
| |
| <para> |
| Autovacuum is not active during recovery. It will start normally at the |
| end of recovery. |
| </para> |
| |
| <para> |
| The checkpointer process and the background writer process are active during |
| recovery. The checkpointer process will perform restartpoints (similar to |
| checkpoints on the primary) and the background writer process will perform |
| normal block cleaning activities. This can include updates of the hint bit |
| information stored on the standby server. |
| The <command>CHECKPOINT</command> command is accepted during recovery, |
| though it performs a restartpoint rather than a new checkpoint. |
| </para> |
| </sect2> |
| |
| <sect2 id="hot-standby-parameters"> |
| <title>Hot Standby Parameter Reference</title> |
| |
| <para> |
| Various parameters have been mentioned above in |
| <xref linkend="hot-standby-conflict"/> and |
| <xref linkend="hot-standby-admin"/>. |
| </para> |
| |
| <para> |
| On the primary, parameters <xref linkend="guc-wal-level"/> and |
| <xref linkend="guc-vacuum-defer-cleanup-age"/> can be used. |
| <xref linkend="guc-max-standby-archive-delay"/> and |
| <xref linkend="guc-max-standby-streaming-delay"/> have no effect if set on |
| the primary. |
| </para> |
| |
| <para> |
| On the standby, parameters <xref linkend="guc-hot-standby"/>, |
| <xref linkend="guc-max-standby-archive-delay"/> and |
| <xref linkend="guc-max-standby-streaming-delay"/> can be used. |
| <xref linkend="guc-vacuum-defer-cleanup-age"/> has no effect |
| as long as the server remains in standby mode, though it will |
| become relevant if the standby becomes primary. |
| </para> |
| </sect2> |
| |
| <sect2 id="hot-standby-caveats"> |
| <title>Caveats</title> |
| |
| <para> |
| There are several limitations of Hot Standby. |
| These can and probably will be fixed in future releases: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Full knowledge of running transactions is required before snapshots |
| can be taken. Transactions that use large numbers of subtransactions |
| (currently greater than 64) will delay the start of read-only |
| connections until the completion of the longest running write transaction. |
| If this situation occurs, explanatory messages will be sent to the server log. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Valid starting points for standby queries are generated at each |
| checkpoint on the primary. If the standby is shut down while the primary |
| is in a shutdown state, it might not be possible to re-enter Hot Standby |
| until the primary is started up, so that it generates further starting |
| points in the WAL logs. This situation isn't a problem in the most |
| common situations where it might happen. Generally, if the primary is |
| shut down and not available anymore, that's likely due to a serious |
| failure that requires the standby being converted to operate as |
| the new primary anyway. And in situations where the primary is |
| being intentionally taken down, coordinating to make sure the standby |
| becomes the new primary smoothly is also standard procedure. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| At the end of recovery, <literal>AccessExclusiveLocks</literal> held by prepared transactions |
| will require twice the normal number of lock table entries. If you plan |
| on running either a large number of concurrent prepared transactions |
| that normally take <literal>AccessExclusiveLocks</literal>, or you plan on having one |
| large transaction that takes many <literal>AccessExclusiveLocks</literal>, you are |
| advised to select a larger value of <varname>max_locks_per_transaction</varname>, |
| perhaps as much as twice the value of the parameter on |
| the primary server. You need not consider this at all if |
| your setting of <varname>max_prepared_transactions</varname> is 0. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| The Serializable transaction isolation level is not yet available in hot |
| standby. (See <xref linkend="xact-serializable"/> and |
| <xref linkend="serializable-consistency"/> for details.) |
| An attempt to set a transaction to the serializable isolation level in |
| hot standby mode will generate an error. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </para> |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |