| <!-- doc/src/sgml/high-availability.sgml --> |
| |
| <chapter id="high-availability"> |
| <title>High Availability, Load Balancing, and Replication</title> |
| |
| <indexterm><primary>high availability</primary></indexterm> |
| <indexterm><primary>failover</primary></indexterm> |
| <indexterm><primary>replication</primary></indexterm> |
| <indexterm><primary>load balancing</primary></indexterm> |
| <indexterm><primary>clustering</primary></indexterm> |
| <indexterm><primary>data partitioning</primary></indexterm> |
| |
| <para> |
| Database servers can work together to allow a second server to |
| take over quickly if the primary server fails (high |
| availability), or to allow several computers to serve the same |
| data (load balancing). Ideally, database servers could work |
| together seamlessly. Web servers serving static web pages can |
| be combined quite easily by merely load-balancing web requests |
| to multiple machines. In fact, read-only database servers can |
| be combined relatively easily too. Unfortunately, most database |
| servers have a read/write mix of requests, and read/write servers |
| are much harder to combine. This is because though read-only |
| data needs to be placed on each server only once, a write to any |
| server has to be propagated to all servers so that future read |
| requests to those servers return consistent results. |
| </para> |
| |
| <para> |
| This synchronization problem is the fundamental difficulty for |
| servers working together. Because there is no single solution |
| that eliminates the impact of the sync problem for all use cases, |
| there are multiple solutions. Each solution addresses this |
| problem in a different way, and minimizes its impact for a specific |
| workload. |
| </para> |
| |
| <para> |
| Some solutions deal with synchronization by allowing only one |
| server to modify the data. Servers that can modify data are |
| called read/write, <firstterm>master</firstterm> or <firstterm>primary</firstterm> servers. |
| Servers that track changes in the primary are called <firstterm>standby</firstterm> |
| or <firstterm>secondary</firstterm> servers. A standby server that cannot be connected |
| to until it is promoted to a primary server is called a <firstterm>warm |
| standby</firstterm> server, and one that can accept connections and serves read-only |
| queries is called a <firstterm>hot standby</firstterm> server. |
| </para> |
| |
| <para> |
| Some solutions are synchronous, |
| meaning that a data-modifying transaction is not considered |
| committed until all servers have committed the transaction. This |
| guarantees that a failover will not lose any data and that all |
| load-balanced servers will return consistent results no matter |
| which server is queried. In contrast, asynchronous solutions allow some |
| delay between the time of a commit and its propagation to the other servers, |
| opening the possibility that some transactions might be lost in |
| the switch to a backup server, and that load-balanced servers |
| might return slightly stale results. Asynchronous communication |
| is used when synchronous would be too slow. |
| </para> |
| |
| <para> |
| Solutions can also be categorized by their granularity. Some solutions |
| can deal only with an entire database server, while others allow control |
| at the per-table or per-database level. |
| </para> |
| |
| <para> |
| Performance must be considered in any choice. There is usually a |
| trade-off between functionality and |
| performance. For example, a fully synchronous solution over a slow |
| network might cut performance by more than half, while an asynchronous |
| one might have a minimal performance impact. |
| </para> |
| |
| <para> |
| The remainder of this section outlines various failover, replication, |
| and load balancing solutions. |
| </para> |
| |
| <sect1 id="different-replication-solutions"> |
| <title>Comparison of Different Solutions</title> |
| |
| <variablelist> |
| |
| <varlistentry> |
| <term>Shared Disk Failover</term> |
| <listitem> |
| |
| <para> |
| Shared disk failover avoids synchronization overhead by having only one |
| copy of the database. It uses a single disk array that is shared by |
| multiple servers. If the main database server fails, the standby server |
| is able to mount and start the database as though it were recovering from |
| a database crash. This allows rapid failover with no data loss. |
| </para> |
| |
| <para> |
| Shared hardware functionality is common in network storage devices. |
| Using a network file system is also possible, though care must be |
| taken that the file system has full <acronym>POSIX</acronym> behavior (see <xref |
| linkend="creating-cluster-nfs"/>). One significant limitation of this |
| method is that if the shared disk array fails or becomes corrupt, the |
| primary and standby servers are both nonfunctional. Another issue is |
| that the standby server should never access the shared storage while |
| the primary server is running. |
| </para> |
| |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>File System (Block Device) Replication</term> |
| <listitem> |
| |
| <para> |
| A modified version of shared hardware functionality is file system |
| replication, where all changes to a file system are mirrored to a file |
| system residing on another computer. The only restriction is that |
| the mirroring must be done in a way that ensures the standby server |
| has a consistent copy of the file system — specifically, writes |
| to the standby must be done in the same order as those on the primary. |
| <productname>DRBD</productname> is a popular file system replication solution |
| for Linux. |
| </para> |
| |
| <!-- |
| https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html |
| |
| Oracle RAC is a shared disk approach and just send cache invalidations |
| to other nodes but not actual data. As the disk is shared, data is |
| only committed once to disk and there is a distributed locking |
| protocol to make nodes agree on a serializable transactional order. |
| --> |
| |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Write-Ahead Log Shipping</term> |
| <listitem> |
| |
| <para> |
| Warm and hot standby servers can be kept current by reading a |
| stream of write-ahead log (<acronym>WAL</acronym>) |
| records. If the main server fails, the standby contains |
| almost all of the data of the main server, and can be quickly |
| made the new primary database server. This can be synchronous or |
| asynchronous and can only be done for the entire database server. |
| </para> |
| <para> |
| A standby server can be implemented using file-based log shipping |
| (<xref linkend="warm-standby"/>) or streaming replication (see |
| <xref linkend="streaming-replication"/>), or a combination of both. For |
| information on hot standby, see <xref linkend="hot-standby"/>. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Logical Replication</term> |
| <listitem> |
| <para> |
| Logical replication allows a database server to send a stream of data |
| modifications to another server. <productname>PostgreSQL</productname> |
| logical replication constructs a stream of logical data modifications |
| from the WAL. Logical replication allows replication of data changes on |
| a per-table basis. In addition, a server that is publishing its own |
| changes can also subscribe to changes from another server, allowing data |
| to flow in multiple directions. For more information on logical |
| replication, see <xref linkend="logical-replication"/>. Through the |
| logical decoding interface (<xref linkend="logicaldecoding"/>), |
| third-party extensions can also provide similar functionality. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Trigger-Based Primary-Standby Replication</term> |
| <listitem> |
| |
| <para> |
| A trigger-based replication setup typically funnels data modification |
| queries to a designated primary server. Operating on a per-table basis, |
| the primary server sends data changes (typically) asynchronously to the |
| standby servers. Standby servers can answer queries while the primary is |
| running, and may allow some local data changes or write activity. This |
| form of replication is often used for offloading large analytical or data |
| warehouse queries. |
| </para> |
| |
| <para> |
| <productname>Slony-I</productname> is an example of this type of |
| replication, with per-table granularity, and support for multiple standby |
| servers. Because it updates the standby server asynchronously (in |
| batches), data can be lost during failover. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>SQL-Based Replication Middleware</term> |
| <listitem> |
| |
| <para> |
| With SQL-based replication middleware, a program intercepts |
| every SQL query and sends it to one or all servers. Each server |
| operates independently. Read-write queries must be sent to all servers, |
| so that every server receives any changes. But read-only queries can be |
| sent to just one server, allowing the read workload to be distributed |
| among them. |
| </para> |
| |
| <para> |
| If queries are simply broadcast unmodified, functions like |
| <function>random()</function>, <function>CURRENT_TIMESTAMP</function>, and |
| sequences can have different values on different servers. |
| This is because each server operates independently, and because |
| SQL queries are broadcast rather than actual data changes. If |
| this is unacceptable, either the middleware or the application |
| must determine such values from a single source and then use those |
| values in write queries. Care must also be taken that all |
| transactions either commit or abort on all servers, perhaps |
| using two-phase commit (<xref linkend="sql-prepare-transaction"/> |
| and <xref linkend="sql-commit-prepared"/>). |
| <productname>Pgpool-II</productname> and <productname>Continuent Tungsten</productname> |
| are examples of this type of replication. |
| </para> |
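| |
| <para> |
| As a minimal sketch of this two-phase commit approach (the transaction |
| identifier and table are illustrative, and <varname>max_prepared_transactions</varname> |
| must be set above zero on every server), the middleware would run the same |
| statements on each server and commit only once all of them have prepared |
| successfully: |
| <programlisting> |
| BEGIN; |
| UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 42; |
| PREPARE TRANSACTION 'middleware_tx_42';    -- run on every server |
| |
| -- once every server has prepared successfully: |
| COMMIT PREPARED 'middleware_tx_42'; |
| -- if any server failed to prepare, instead run: |
| -- ROLLBACK PREPARED 'middleware_tx_42'; |
| </programlisting> |
| </para> |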
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Asynchronous Multimaster Replication</term> |
| <listitem> |
| |
| <para> |
| For servers that are not regularly connected or have slow |
| communication links, like laptops or |
| remote servers, keeping data consistent among servers is a |
| challenge. Using asynchronous multimaster replication, each |
| server works independently, and periodically communicates with |
| the other servers to identify conflicting transactions. The |
| conflicts can be resolved by users or conflict resolution rules. |
| Bucardo is an example of this type of replication. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Synchronous Multimaster Replication</term> |
| <listitem> |
| |
| <para> |
| In synchronous multimaster replication, each server can accept |
| write requests, and modified data is transmitted from the |
| original server to every other server before each transaction |
| commits. Heavy write activity can cause excessive locking and |
| commit delays, leading to poor performance. Read requests can |
| be sent to any server. Some implementations use shared disk |
| to reduce the communication overhead. Synchronous multimaster |
| replication is best for mostly read workloads, though its big |
| advantage is that any server can accept write requests — |
| there is no need to partition workloads between primary and |
| standby servers, and because the data changes are sent from one |
| server to another, there is no problem with non-deterministic |
| functions like <function>random()</function>. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> does not offer this type of replication, |
| though <productname>PostgreSQL</productname> two-phase commit (<xref |
| linkend="sql-prepare-transaction"/> and <xref |
| linkend="sql-commit-prepared"/>) |
| can be used to implement this in application code or middleware. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| <para> |
| <xref linkend="high-availability-matrix"/> summarizes |
| the capabilities of the various solutions listed above. |
| </para> |
| |
| <table id="high-availability-matrix"> |
| <title>High Availability, Load Balancing, and Replication Feature Matrix</title> |
| <tgroup cols="9"> |
| <colspec colname="col1" colwidth="1.1*"/> |
| <colspec colname="col2" colwidth="1*"/> |
| <colspec colname="col3" colwidth="1*"/> |
| <colspec colname="col4" colwidth="1*"/> |
| <colspec colname="col5" colwidth="1*"/> |
| <colspec colname="col6" colwidth="1*"/> |
| <colspec colname="col7" colwidth="1*"/> |
| <colspec colname="col8" colwidth="1*"/> |
| <colspec colname="col9" colwidth="1*"/> |
| <thead> |
| <row> |
| <entry>Feature</entry> |
| <entry>Shared Disk</entry> |
| <entry>File System Repl.</entry> |
| <entry>Write-Ahead Log Shipping</entry> |
| <entry>Logical Repl.</entry> |
| <entry>Trigger-&zwsp;Based Repl.</entry> |
| <entry>SQL Repl. Middle-ware</entry> |
| <entry>Async. MM Repl.</entry> |
| <entry>Sync. MM Repl.</entry> |
| </row> |
| </thead> |
| |
| <tbody> |
| |
| <row> |
| <entry>Popular examples</entry> |
| <entry align="center">NAS</entry> |
| <entry align="center">DRBD</entry> |
| <entry align="center">built-in streaming repl.</entry> |
| <entry align="center">built-in logical repl., pglogical</entry> |
| <entry align="center">Londiste, Slony</entry> |
| <entry align="center">pgpool-II</entry> |
| <entry align="center">Bucardo</entry> |
| <entry align="center"></entry> |
| </row> |
| |
| <row> |
| <entry>Comm. method</entry> |
| <entry align="center">shared disk</entry> |
| <entry align="center">disk blocks</entry> |
| <entry align="center">WAL</entry> |
| <entry align="center">logical decoding</entry> |
| <entry align="center">table rows</entry> |
| <entry align="center">SQL</entry> |
| <entry align="center">table rows</entry> |
| <entry align="center">table rows and row locks</entry> |
| </row> |
| |
| <row> |
| <entry>No special hardware required</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>Allows multiple primary servers</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>No overhead on primary</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| </row> |
| |
| <row> |
| <entry>No waiting for multiple servers</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">with sync off</entry> |
| <entry align="center">with sync off</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| </row> |
| |
| <row> |
| <entry>Primary failure will never lose data</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">with sync on</entry> |
| <entry align="center">with sync on</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>Replicas accept read-only queries</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center">with hot standby</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>Per-table granularity</entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| <row> |
| <entry>No conflict resolution necessary</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| <entry align="center">•</entry> |
| <entry align="center"></entry> |
| <entry align="center">•</entry> |
| </row> |
| |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| There are a few solutions that do not fit into the above categories: |
| </para> |
| |
| <variablelist> |
| |
| <varlistentry> |
| <term>Data Partitioning</term> |
| <listitem> |
| |
| <para> |
| Data partitioning splits tables into data sets. Each set can |
| be modified by only one server. For example, data can be |
| partitioned by offices, e.g., London and Paris, with a server |
| in each office. If queries combining London and Paris data |
| are necessary, an application can query both servers, or |
| primary/standby replication can be used to keep a read-only copy |
| of the other office's data on each server. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>Multiple-Server Parallel Query Execution</term> |
| <listitem> |
| |
| <para> |
| Many of the above solutions allow multiple servers to handle multiple |
| queries, but none allow a single query to use multiple servers to |
| complete faster. This solution allows multiple servers to work |
| concurrently on a single query. It is usually accomplished by |
| splitting the data among servers and having each server execute its |
| part of the query and return results to a central server where they |
| are combined and returned to the user. This can be implemented using the |
| <productname>PL/Proxy</productname> tool set. |
| </para> |
| |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| <para> |
| It should also be noted that because <productname>PostgreSQL</productname> |
| is open source and easily extended, a number of companies have |
| taken <productname>PostgreSQL</productname> and created commercial |
| closed-source solutions with unique failover, replication, and load |
| balancing capabilities. These are not discussed here. |
| </para> |
| |
| </sect1> |
| |
| |
| <sect1 id="warm-standby"> |
| <title>Log-Shipping Standby Servers</title> |
| |
| |
| <para> |
| Continuous archiving can be used to create a <firstterm>high |
| availability</firstterm> (HA) cluster configuration with one or more |
| <firstterm>standby servers</firstterm> ready to take over operations if the |
| primary server fails. This capability is widely referred to as |
| <firstterm>warm standby</firstterm> or <firstterm>log shipping</firstterm>. |
| </para> |
| |
| <para> |
| The primary and standby server work together to provide this capability, |
| though the servers are only loosely coupled. The primary server operates |
| in continuous archiving mode, while each standby server operates in |
| continuous recovery mode, reading the WAL files from the primary. No |
| changes to the database tables are required to enable this capability, |
| so it offers low administration overhead compared to some other |
| replication solutions. This configuration also has relatively low |
| performance impact on the primary server. |
| </para> |
| |
| <para> |
| Directly moving WAL records from one database server to another |
| is typically described as log shipping. <productname>PostgreSQL</productname> |
| implements file-based log shipping by transferring WAL records |
| one file (WAL segment) at a time. WAL files (16MB) can be |
| shipped easily and cheaply over any distance, whether it be to an |
| adjacent system, another system at the same site, or another system on |
| the far side of the globe. The bandwidth required for this technique |
| varies according to the transaction rate of the primary server. |
| Record-based log shipping is more granular and streams WAL changes |
| incrementally over a network connection (see <xref |
| linkend="streaming-replication"/>). |
| </para> |
| |
| <para> |
| It should be noted that log shipping is asynchronous, i.e., the WAL |
| records are shipped after transaction commit. As a result, there is a |
| window for data loss should the primary server suffer a catastrophic |
| failure; transactions not yet shipped will be lost. The size of the |
| data loss window in file-based log shipping can be limited by use of the |
| <varname>archive_timeout</varname> parameter, which can be set as low |
| as a few seconds. However such a low setting will |
| substantially increase the bandwidth required for file shipping. |
| Streaming replication (see <xref linkend="streaming-replication"/>) |
| allows a much smaller window of data loss. |
| </para> |
| |
| <para> |
| Recovery performance is sufficiently good that the standby will |
| typically be only moments away from full |
| availability once it has been activated. As a result, this is called |
| a warm standby configuration which offers high |
| availability. Restoring a server from an archived base backup and |
| rollforward will take considerably longer, so that technique only |
| offers a solution for disaster recovery, not high availability. |
| A standby server can also be used for read-only queries, in which case |
| it is called a Hot Standby server. See <xref linkend="hot-standby"/> for |
| more information. |
| </para> |
| |
| <indexterm zone="high-availability"> |
| <primary>warm standby</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>PITR standby</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>standby server</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>log shipping</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>witness server</primary> |
| </indexterm> |
| |
| <indexterm zone="high-availability"> |
| <primary>STONITH</primary> |
| </indexterm> |
| |
| <sect2 id="standby-planning"> |
| <title>Planning</title> |
| |
| <para> |
| It is usually wise to create the primary and standby servers |
| so that they are as similar as possible, at least from the |
| perspective of the database server. In particular, the path names |
| associated with tablespaces will be passed across unmodified, so both |
| primary and standby servers must have the same mount paths for |
| tablespaces if that feature is used. Keep in mind that if |
| <xref linkend="sql-createtablespace"/> |
| is executed on the primary, any new mount point needed for it must |
| be created on the primary and all standby servers before the command |
| is executed. Hardware need not be exactly the same, but experience shows |
| that maintaining two identical systems is easier than maintaining two |
| dissimilar ones over the lifetime of the application and system. |
| In any case the hardware architecture must be the same — shipping |
| from, say, a 32-bit to a 64-bit system will not work. |
| </para> |
| |
| <para> |
| In general, log shipping between servers running different major |
| <productname>PostgreSQL</productname> release |
| levels is not possible. It is the policy of the PostgreSQL Global |
| Development Group not to make changes to disk formats during minor release |
| upgrades, so it is likely that running different minor release levels |
| on primary and standby servers will work successfully. However, no |
| formal support for that is offered and you are advised to keep primary |
| and standby servers at the same release level as much as possible. |
| When updating to a new minor release, the safest policy is to update |
| the standby servers first — a new minor release is more likely |
| to be able to read WAL files from a previous minor release than vice |
| versa. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="standby-server-operation" xreflabel="Standby Server Operation"> |
| <title>Standby Server Operation</title> |
| |
| <para> |
| A server enters standby mode if a |
| <anchor id="file-standby-signal" xreflabel="standby.signal"/> |
| <filename>standby.signal</filename> |
| <indexterm><primary><filename>standby.signal</filename></primary></indexterm> |
| file exists in the data directory when the server is started. |
| </para> |
| |
| <para> |
| In standby mode, the server continuously applies WAL received from the |
| primary server. The standby server can read WAL from a WAL archive |
| (see <xref linkend="guc-restore-command"/>) or directly from the primary |
| over a TCP connection (streaming replication). The standby server will |
| also attempt to restore any WAL found in the standby cluster's |
| <filename>pg_wal</filename> directory. That typically happens after a server |
| restart, when the standby again replays WAL that was streamed from the |
| primary before the restart, but you can also manually copy files to |
| <filename>pg_wal</filename> at any time to have them replayed. |
| </para> |
| |
| <para> |
| At startup, the standby begins by restoring all WAL available in the |
| archive location, calling <varname>restore_command</varname>. Once it |
| reaches the end of WAL available there and <varname>restore_command</varname> |
| fails, it tries to restore any WAL available in the <filename>pg_wal</filename> directory. |
| If that fails, and streaming replication has been configured, the |
| standby tries to connect to the primary server and start streaming WAL |
| from the last valid record found in archive or <filename>pg_wal</filename>. If that fails |
| or streaming replication is not configured, or if the connection is |
| later disconnected, the standby goes back to the beginning and tries to |
| restore the file from the archive again. This loop of retries from the |
| archive, <filename>pg_wal</filename>, and via streaming replication goes on until the server |
| is stopped or failover is triggered by a trigger file. |
| </para> |
| |
| <para> |
| Standby mode is exited and the server switches to normal operation |
| when <command>pg_ctl promote</command> is run, |
| <function>pg_promote()</function> is called, or a trigger file is found |
| (<varname>promote_trigger_file</varname>). Before failover, |
| any WAL immediately available in the archive or in <filename>pg_wal</filename> will be |
| restored, but no attempt is made to connect to the primary. |
| </para> |
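| |
| <para> |
| For example (the data directory path shown is only illustrative), promotion |
| can be triggered from the shell or from a connection to the standby: |
| <programlisting> |
| $ pg_ctl promote -D /path/to/standby/data |
| </programlisting> |
| or: |
| <programlisting> |
| SELECT pg_promote(); |
| </programlisting> |
| </para> |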
| </sect2> |
| |
| <sect2 id="preparing-primary-for-standby"> |
| <title>Preparing the Primary for Standby Servers</title> |
| |
| <para> |
| Set up continuous archiving on the primary to an archive directory |
| accessible from the standby, as described |
| in <xref linkend="continuous-archiving"/>. The archive location should be |
| accessible from the standby even when the primary is down, i.e., it should |
| reside on the standby server itself or another trusted server, not on |
| the primary server. |
| </para> |
| |
| <para> |
| If you want to use streaming replication, set up authentication on the |
| primary server to allow replication connections from the standby |
| server(s); that is, create a role and provide a suitable entry or |
| entries in <filename>pg_hba.conf</filename> with the database field set to |
| <literal>replication</literal>. Also ensure <varname>max_wal_senders</varname> is set |
| to a sufficiently large value in the configuration file of the primary |
| server. If replication slots will be used, |
| ensure that <varname>max_replication_slots</varname> is set sufficiently |
| high as well. |
| </para> |
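| |
| <para> |
| A minimal sketch of these steps follows; the role name, password, standby |
| address, and parameter values are only examples: |
| <programlisting> |
| -- on the primary, create a dedicated replication role |
| CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret'; |
| </programlisting> |
| <programlisting> |
| # pg_hba.conf on the primary |
| host  replication  replicator  192.168.1.100/32  md5 |
| |
| # postgresql.conf on the primary |
| max_wal_senders = 10 |
| max_replication_slots = 10 |
| </programlisting> |
| </para> |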
| |
| <para> |
| Take a base backup as described in <xref linkend="backup-base-backup"/> |
| to bootstrap the standby server. |
| </para> |
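| |
| <para> |
| For example, using <application>pg_basebackup</application> from the standby |
| host (the primary address, user, and target directory are illustrative): |
| <programlisting> |
| $ pg_basebackup -h 192.168.1.50 -U replicator -D /path/to/standby/data -P |
| </programlisting> |
| </para> |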
| </sect2> |
| |
| <sect2 id="standby-server-setup"> |
| <title>Setting Up a Standby Server</title> |
| |
| <para> |
| To set up the standby server, restore the base backup taken from primary |
| server (see <xref linkend="backup-pitr-recovery"/>). Create a file |
| <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary></indexterm> |
| in the standby's cluster data |
| directory. Set <xref linkend="guc-restore-command"/> to a simple command to copy files from |
| the WAL archive. If you plan to have multiple standby servers for high |
| availability purposes, make sure that <varname>recovery_target_timeline</varname> is set to |
| <literal>latest</literal> (the default), to make the standby server follow the timeline change |
| that occurs at failover to another standby. |
| </para> |
| |
| <note> |
| <para> |
| <xref linkend="guc-restore-command"/> should return immediately |
| if the file does not exist; the server will retry the command again if |
| necessary. |
| </para> |
| </note> |
| |
| <para> |
| If you want to use streaming replication, fill in |
| <xref linkend="guc-primary-conninfo"/> with a libpq connection string, including |
| the host name (or IP address) and any additional details needed to |
| connect to the primary server. If the primary needs a password for |
| authentication, the password needs to be specified in |
| <xref linkend="guc-primary-conninfo"/> as well. |
| </para> |
| |
| <para> |
| If you're setting up the standby server for high availability purposes, |
| set up WAL archiving, connections and authentication like the primary |
| server, because the standby server will work as a primary server after |
| failover. |
| </para> |
| |
| <para> |
| If you're using a WAL archive, its size can be minimized using the <xref |
| linkend="guc-archive-cleanup-command"/> parameter to remove files that are no |
| longer required by the standby server. |
| The <application>pg_archivecleanup</application> utility is designed specifically to |
| be used with <varname>archive_cleanup_command</varname> in typical single-standby |
| configurations, see <xref linkend="pgarchivecleanup"/>. |
| Note however, that if you're using the archive for backup purposes, you |
| need to retain files needed to recover from at least the latest base |
| backup, even if they're no longer needed by the standby. |
| </para> |
| |
| <para> |
| A simple example of configuration is: |
| <programlisting> |
| primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000''' |
| restore_command = 'cp /path/to/archive/%f %p' |
| archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r' |
| </programlisting> |
| </para> |
| |
| <para> |
| You can have any number of standby servers, but if you use streaming |
| replication, make sure you set <varname>max_wal_senders</varname> high enough in |
| the primary to allow them to be connected simultaneously. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="streaming-replication"> |
| <title>Streaming Replication</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Streaming Replication</primary> |
| </indexterm> |
| |
| <para> |
| Streaming replication allows a standby server to stay more up-to-date |
| than is possible with file-based log shipping. The standby connects |
| to the primary, which streams WAL records to the standby as they're |
| generated, without waiting for the WAL file to be filled. |
| </para> |
| |
| <para> |
| Streaming replication is asynchronous by default |
| (see <xref linkend="synchronous-replication"/>), in which case there is |
| a small delay between committing a transaction in the primary and the |
| changes becoming visible in the standby. This delay is however much |
| smaller than with file-based log shipping, typically under one second |
| assuming the standby is powerful enough to keep up with the load. With |
| streaming replication, <varname>archive_timeout</varname> is not required to |
| reduce the data loss window. |
| </para> |
| |
| <para> |
| If you use streaming replication without file-based continuous |
| archiving, the server might recycle old WAL segments before the standby |
| has received them. If this occurs, the standby will need to be |
| reinitialized from a new base backup. You can avoid this by setting |
| <varname>wal_keep_size</varname> to a value large enough to ensure that |
| WAL segments are not recycled too early, or by configuring a replication |
| slot for the standby. If you set up a WAL archive that's accessible from |
| the standby, these solutions are not required, since the standby can |
| always use the archive to catch up provided it retains enough segments. |
| </para> |
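| |
| <para> |
| For instance, either of the following on the primary (the value and slot |
| name are illustrative) can be used to retain extra WAL for the standby: |
| <programlisting> |
| # postgresql.conf on the primary |
| wal_keep_size = 1GB |
| </programlisting> |
| or, using a replication slot (see <xref linkend="streaming-replication-slots"/>): |
| <programlisting> |
| SELECT pg_create_physical_replication_slot('standby1_slot'); |
| </programlisting> |
| </para> |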
| |
| <para> |
| To use streaming replication, set up a file-based log-shipping standby |
| server as described in <xref linkend="warm-standby"/>. The step that |
| turns a file-based log-shipping standby into streaming replication |
| standby is setting the <varname>primary_conninfo</varname> setting |
| to point to the primary server. Set |
| <xref linkend="guc-listen-addresses"/> and authentication options |
| (see <filename>pg_hba.conf</filename>) on the primary so that the standby server |
| can connect to the <literal>replication</literal> pseudo-database on the primary |
| server (see <xref linkend="streaming-replication-authentication"/>). |
| </para> |
| |
| <para> |
| On systems that support the keepalive socket option, setting |
| <xref linkend="guc-tcp-keepalives-idle"/>, |
| <xref linkend="guc-tcp-keepalives-interval"/> and |
| <xref linkend="guc-tcp-keepalives-count"/> helps the primary promptly |
| notice a broken connection. |
| </para> |
| |
| <para> |
| Set the maximum number of concurrent connections from the standby servers |
| (see <xref linkend="guc-max-wal-senders"/> for details). |
| </para> |
| |
| <para> |
| When the standby is started and <varname>primary_conninfo</varname> is set |
| correctly, the standby will connect to the primary after replaying all |
| WAL files available in the archive. If the connection is established |
| successfully, you will see a <literal>walreceiver</literal> in the standby, and |
| a corresponding <literal>walsender</literal> process in the primary. |
| </para> |
| |
| <sect3 id="streaming-replication-authentication"> |
| <title>Authentication</title> |
| <para> |
| It is very important that the access privileges for replication be set up |
| so that only trusted users can read the WAL stream, because it is |
| easy to extract privileged information from it. Standby servers must |
| authenticate to the primary as an account that has the |
| <literal>REPLICATION</literal> privilege or a superuser. It is |
| recommended to create a dedicated user account with |
| <literal>REPLICATION</literal> and <literal>LOGIN</literal> |
| privileges for replication. While <literal>REPLICATION</literal> |
| privilege gives very high permissions, it does not allow the user to |
| modify any data on the primary system, which the |
| <literal>SUPERUSER</literal> privilege does. |
| </para> |
| |
| <para> |
| Client authentication for replication is controlled by a |
| <filename>pg_hba.conf</filename> record specifying <literal>replication</literal> in the |
| <replaceable>database</replaceable> field. For example, if the standby is running on |
| host IP <literal>192.168.1.100</literal> and the account name for replication |
| is <literal>foo</literal>, the administrator can add the following line to the |
| <filename>pg_hba.conf</filename> file on the primary: |
| |
| <programlisting> |
| # Allow the user "foo" from host 192.168.1.100 to connect to the primary |
| # as a replication standby if the user's password is correctly supplied. |
| # |
| # TYPE DATABASE USER ADDRESS METHOD |
| host replication foo 192.168.1.100/32 md5 |
| </programlisting> |
| </para> |
| <para> |
| The host name and port number of the primary, connection user name, |
| and password are specified in the <xref linkend="guc-primary-conninfo"/>. |
| The password can also be set in the <filename>~/.pgpass</filename> file on the |
| standby (specify <literal>replication</literal> in the <replaceable>database</replaceable> |
| field). |
| For example, if the primary is running on host IP <literal>192.168.1.50</literal>, |
| port <literal>5432</literal>, the account name for replication is |
| <literal>foo</literal>, and the password is <literal>foopass</literal>, the administrator |
| can add the following line to the <filename>postgresql.conf</filename> file on the |
| standby: |
| |
| <programlisting> |
| # The standby connects to the primary that is running on host 192.168.1.50 |
| # and port 5432 as the user "foo" whose password is "foopass". |
| primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' |
| </programlisting> |
| </para> |
| </sect3> |
| |
| <sect3 id="streaming-replication-monitoring"> |
| <title>Monitoring</title> |
| <para> |
| An important health indicator of streaming replication is the amount |
| of WAL records generated in the primary, but not yet applied in the |
| standby. You can calculate this lag by comparing the current WAL write |
| location on the primary with the last WAL location received by the |
| standby. These locations can be retrieved using |
| <function>pg_current_wal_lsn</function> on the primary and |
| <function>pg_last_wal_receive_lsn</function> on the standby, |
| respectively (see <xref linkend="functions-admin-backup-table"/> and |
| <xref linkend="functions-recovery-info-table"/> for details). |
| The last WAL receive location in the standby is also displayed in the |
| process status of the WAL receiver process, displayed using the |
| <command>ps</command> command (see <xref linkend="monitoring-ps"/> for details). |
| </para> |
| <para> |
| You can retrieve a list of WAL sender processes via the |
| <link linkend="monitoring-pg-stat-replication-view"><structname> |
| pg_stat_replication</structname></link> view. Large differences between |
| <function>pg_current_wal_lsn</function> and the view's <literal>sent_lsn</literal> field |
| might indicate that the primary server is under heavy load, while |
| differences between <literal>sent_lsn</literal> and |
| <function>pg_last_wal_receive_lsn</function> on the standby might indicate |
| network delay, or that the standby is under heavy load. |
| </para> |
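| <para> |
| As an illustration, the per-standby lag can be computed on the primary with |
| a query along these lines (the column selection is only an example): |
| <programlisting> |
| SELECT application_name, state, sent_lsn, replay_lsn, |
|        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes |
| FROM pg_stat_replication; |
| </programlisting> |
| </para> |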
| <para> |
| On a hot standby, the status of the WAL receiver process can be retrieved |
| via the <link linkend="monitoring-pg-stat-wal-receiver-view"> |
| <structname>pg_stat_wal_receiver</structname></link> view. A large |
| difference between <function>pg_last_wal_replay_lsn</function> and the |
| view's <literal>flushed_lsn</literal> indicates that WAL is being |
| received faster than it can be replayed. |
| </para> |
| </sect3> |
| </sect2> |
| |
| <sect2 id="streaming-replication-slots"> |
| <title>Replication Slots</title> |
| <indexterm> |
| <primary>replication slot</primary> |
| <secondary>streaming replication</secondary> |
| </indexterm> |
| <para> |
| Replication slots provide an automated way to ensure that the primary does |
| not remove WAL segments until they have been received by all standbys, |
| and that the primary does not remove rows which could cause a |
| <link linkend="hot-standby-conflict">recovery conflict</link> even when the |
| standby is disconnected. |
| </para> |
| <para> |
| In lieu of using replication slots, it is possible to prevent the removal |
| of old WAL segments using <xref linkend="guc-wal-keep-size"/>, or by |
| storing the segments in an archive using |
| <xref linkend="guc-archive-command"/>. |
| However, these methods often result in retaining more WAL segments than |
| required, whereas replication slots retain only the number of segments |
| known to be needed. On the other hand, replication slots can retain so |
| many WAL segments that they fill up the space allocated |
| for <literal>pg_wal</literal>; |
| <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files |
| retained by replication slots. |
| </para> |
| <para> |
| Similarly, <xref linkend="guc-hot-standby-feedback"/> |
| and <xref linkend="guc-vacuum-defer-cleanup-age"/> provide protection against |
| relevant rows being removed by vacuum, but the former provides no |
| protection during any time period when the standby is not connected, |
| and the latter often needs to be set to a high value to provide adequate |
| protection. Replication slots overcome these disadvantages. |
| </para> |
| <sect3 id="streaming-replication-slots-manipulation"> |
| <title>Querying and Manipulating Replication Slots</title> |
| <para> |
| Each replication slot has a name, which can contain lower-case letters, |
| numbers, and the underscore character. |
| </para> |
| <para> |
| Existing replication slots and their state can be seen in the |
| <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link> |
| view. |
| </para> |
| <para> |
| Slots can be created and dropped either via the streaming replication |
| protocol (see <xref linkend="protocol-replication"/>) or via SQL |
| functions (see <xref linkend="functions-replication"/>). |
| </para> |
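| <para> |
| For example, the amount of WAL currently retained on behalf of each slot |
| can be estimated with a query along these lines (the query itself is only |
| an illustration): |
| <programlisting> |
| SELECT slot_name, active, restart_lsn, |
|        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal |
| FROM pg_replication_slots; |
| </programlisting> |
| </para> |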
| </sect3> |
| <sect3 id="streaming-replication-slots-config"> |
| <title>Configuration Example</title> |
| <para> |
| You can create a replication slot like this: |
| <programlisting> |
| postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot'); |
| slot_name | lsn |
| -------------+----- |
| node_a_slot | |
| |
| postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots; |
| slot_name | slot_type | active |
| -------------+-----------+-------- |
| node_a_slot | physical | f |
| (1 row) |
| </programlisting> |
| To configure the standby to use this slot, <varname>primary_slot_name</varname> |
| should be configured on the standby. Here is a simple example: |
| <programlisting> |
| primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' |
| primary_slot_name = 'node_a_slot' |
| </programlisting> |
| </para> |
| </sect3> |
| </sect2> |
| |
| <sect2 id="cascading-replication"> |
| <title>Cascading Replication</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Cascading Replication</primary> |
| </indexterm> |
| |
| <para> |
| The cascading replication feature allows a standby server to accept replication |
| connections and stream WAL records to other standbys, acting as a relay. |
| This can be used to reduce the number of direct connections to the primary |
| and also to minimize inter-site bandwidth overheads. |
| </para> |
| |
| <para> |
| A standby acting as both a receiver and a sender is known as a cascading |
| standby. Standbys that are more directly connected to the primary are known |
| as upstream servers, while those standby servers further away are downstream |
| servers. Cascading replication does not place limits on the number or |
| arrangement of downstream servers, though each standby connects to only |
| one upstream server which eventually links to a single primary server. |
| </para> |
| |
| <para> |
| A cascading standby sends not only WAL records received from the |
| primary but also those restored from the archive. So even if the replication |
| connection to some upstream server is terminated, streaming replication |
| continues downstream for as long as new WAL records are available. |
| </para> |
| |
| <para> |
| Cascading replication is currently asynchronous. Synchronous replication |
| (see <xref linkend="synchronous-replication"/>) settings have no effect on |
| cascading replication at present. |
| </para> |
| |
| <para> |
| Hot Standby feedback propagates upstream, whatever the cascaded arrangement. |
| </para> |
| |
| <para> |
| If an upstream standby server is promoted to become the new primary, downstream |
| servers will continue to stream from the new primary if |
| <varname>recovery_target_timeline</varname> is set to <literal>'latest'</literal> (the default). |
| </para> |
| |
| <para> |
| To use cascading replication, set up the cascading standby so that it can |
| accept replication connections (that is, set |
| <xref linkend="guc-max-wal-senders"/> and <xref linkend="guc-hot-standby"/>, |
| and configure |
| <link linkend="auth-pg-hba-conf">host-based authentication</link>). |
| You will also need to set <varname>primary_conninfo</varname> in the downstream |
| standby to point to the cascading standby. |
| </para> |
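| |
| <para> |
| A minimal sketch (host names, addresses, and values are illustrative): |
| <programlisting> |
| # postgresql.conf on the cascading standby |
| hot_standby = on |
| max_wal_senders = 5 |
| primary_conninfo = 'host=primary.example.com port=5432 user=replicator' |
| |
| # pg_hba.conf on the cascading standby |
| host  replication  replicator  192.168.2.0/24  md5 |
| |
| # postgresql.conf on the downstream standby |
| primary_conninfo = 'host=cascading.example.com port=5432 user=replicator' |
| </programlisting> |
| </para> |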
| </sect2> |
| |
| <sect2 id="synchronous-replication"> |
| <title>Synchronous Replication</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Synchronous Replication</primary> |
| </indexterm> |
| |
| <para> |
| <productname>PostgreSQL</productname> streaming replication is asynchronous by |
| default. If the primary server |
| crashes then some transactions that were committed may not have been |
| replicated to the standby server, causing data loss. The amount |
| of data loss is proportional to the replication delay at the time of |
| failover. |
| </para> |
| |
| <para> |
| Synchronous replication offers the ability to confirm that all changes |
| made by a transaction have been transferred to one or more synchronous |
| standby servers. This extends the standard level of durability |
| offered by a transaction commit. This level of protection is referred |
| to as 2-safe replication in computer science theory, and group-1-safe |
| (group-safe and 1-safe) when <varname>synchronous_commit</varname> is set to |
| <literal>remote_write</literal>. |
| </para> |
| |
| <para> |
| When requesting synchronous replication, each commit of a |
| write transaction will wait until confirmation is |
| received that the commit has been written to the write-ahead log on disk |
| of both the primary and standby server. The only possibility that data |
| can be lost is if both the primary and the standby suffer crashes at the |
| same time. This can provide a much higher level of durability, though only |
| if the sysadmin is cautious about the placement and management of the two |
| servers. Waiting for confirmation increases the user's confidence that the |
| changes will not be lost in the event of server crashes but it also |
| necessarily increases the response time for the requesting transaction. |
| The minimum wait time is the round-trip time between primary and standby. |
| </para> |
| |
| <para> |
| Read-only transactions and transaction rollbacks need not wait for |
| replies from standby servers. Subtransaction commits do not wait for |
| responses from standby servers, only top-level commits. Long |
| running actions such as data loading or index building do not wait |
| until the very final commit message. All two-phase commit actions |
| require commit waits, including both prepare and commit. |
| </para> |
| |
| <para> |
| A synchronous standby can be a physical replication standby or a logical |
| replication subscriber. It can also be any other physical or logical WAL |
| replication stream consumer that knows how to send the appropriate |
| feedback messages. Besides the built-in physical and logical replication |
| systems, this includes special programs such |
| as <command>pg_receivewal</command> and <command>pg_recvlogical</command> |
| as well as some third-party replication systems and custom programs. |
| Check the respective documentation for details on synchronous replication |
| support. |
| </para> |
| |
| <sect3 id="synchronous-replication-config"> |
| <title>Basic Configuration</title> |
| |
| <para> |
| Once streaming replication has been configured, configuring synchronous |
| replication requires only one additional configuration step: |
| <xref linkend="guc-synchronous-standby-names"/> must be set to |
| a non-empty value. <varname>synchronous_commit</varname> must also be set to |
| <literal>on</literal>, but since this is the default value, typically no change is |
| required. (See <xref linkend="runtime-config-wal-settings"/> and |
| <xref linkend="runtime-config-replication-primary"/>.) |
| This configuration will cause each commit to wait for |
| confirmation that the standby has written the commit record to durable |
| storage. |
| <varname>synchronous_commit</varname> can be set by individual |
| users, so it can be configured in the configuration file, for particular |
| users or databases, or dynamically by applications, in order to control |
| the durability guarantee on a per-transaction basis. |
| </para> |
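| |
| <para> |
| A minimal example (the standby name is illustrative and must match the |
| standby's <varname>application_name</varname>, which can be set in its |
| <varname>primary_conninfo</varname> connection string): |
| <programlisting> |
| # postgresql.conf on the primary |
| synchronous_standby_names = 'standby1' |
| synchronous_commit = on        # the default |
| </programlisting> |
| </para> |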
| |
| <para> |
| After a commit record has been written to disk on the primary, the |
| WAL record is then sent to the standby. The standby sends reply |
| messages each time a new batch of WAL data is written to disk, unless |
| <varname>wal_receiver_status_interval</varname> is set to zero on the standby. |
| In the case that <varname>synchronous_commit</varname> is set to |
| <literal>remote_apply</literal>, the standby sends reply messages when the commit |
| record is replayed, making the transaction visible. |
| If the standby is chosen as a synchronous standby, according to the setting |
| of <varname>synchronous_standby_names</varname> on the primary, the reply |
| messages from that standby will be considered along with those from other |
| synchronous standbys to decide when to release transactions waiting for |
| confirmation that the commit record has been received. These parameters |
| allow the administrator to specify which standby servers should be |
| synchronous standbys. Note that the configuration of synchronous |
| replication is mainly on the primary. Named standbys must be directly |
| connected to the primary; the primary knows nothing about downstream |
| standby servers using cascaded replication. |
| </para> |
| |
| <para> |
| Setting <varname>synchronous_commit</varname> to <literal>remote_write</literal> will |
| cause each commit to wait for confirmation that the standby has received |
| the commit record and written it out to its own operating system, but not |
| for the data to be flushed to disk on the standby. This |
| setting provides a weaker guarantee of durability than <literal>on</literal> |
| does: the standby could lose the data in the event of an operating system |
| crash, though not a <productname>PostgreSQL</productname> crash. |
| However, it's a useful setting in practice |
| because it can decrease the response time for the transaction. |
| Data loss could only occur if both the primary and the standby crash and |
| the database of the primary gets corrupted at the same time. |
| </para> |
| |
| <para> |
| Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will |
| cause each commit to wait until the current synchronous standbys report |
| that they have replayed the transaction, making it visible to user |
| queries. In simple cases, this allows for load balancing with causal |
| consistency. |
| </para> |
| |
| <para> |
| Users will stop waiting if a fast shutdown is requested. However, as |
| when using asynchronous replication, the server will not fully |
| shut down until all outstanding WAL records are transferred to the currently |
| connected standby servers. |
| </para> |
| |
| </sect3> |
| |
| <sect3 id="synchronous-replication-multiple-standbys"> |
| <title>Multiple Synchronous Standbys</title> |
| |
| <para> |
| Synchronous replication supports one or more synchronous standby servers; |
| transactions will wait until all the standby servers which are considered |
| as synchronous confirm receipt of their data. The number of synchronous |
| standbys that transactions must wait for replies from is specified in |
| <varname>synchronous_standby_names</varname>. This parameter also specifies |
| a list of standby names and the method (<literal>FIRST</literal> and |
| <literal>ANY</literal>) to choose synchronous standbys from the listed ones. |
| </para> |
| <para> |
| The method <literal>FIRST</literal> specifies a priority-based synchronous |
| replication and makes transaction commits wait until their WAL records are |
| replicated to the requested number of synchronous standbys chosen based on |
| their priorities. The standbys whose names appear earlier in the list are |
| given higher priority and will be considered as synchronous. Other standby |
| servers appearing later in this list represent potential synchronous |
| standbys. If any of the current synchronous standbys disconnects for |
| whatever reason, it will be replaced immediately with the |
| next-highest-priority standby. |
| </para> |
| <para> |
| An example of <varname>synchronous_standby_names</varname> for |
| a priority-based multiple synchronous standbys is: |
| <programlisting> |
| synchronous_standby_names = 'FIRST 2 (s1, s2, s3)' |
| </programlisting> |
| In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>, |
| <literal>s3</literal> and <literal>s4</literal> are running, the two standbys |
| <literal>s1</literal> and <literal>s2</literal> will be chosen as synchronous standbys |
| because their names appear early in the list of standby names. |
| <literal>s3</literal> is a potential synchronous standby and will take over |
| the role of synchronous standby when either of <literal>s1</literal> or |
| <literal>s2</literal> fails. <literal>s4</literal> is an asynchronous standby since |
| its name is not in the list. |
| </para> |
| <para> |
| The method <literal>ANY</literal> specifies a quorum-based synchronous |
| replication and makes transaction commits wait until their WAL records |
| are replicated to <emphasis>at least</emphasis> the requested number of |
| synchronous standbys in the list. |
| </para> |
| <para> |
| An example of <varname>synchronous_standby_names</varname> for |
| a quorum-based multiple synchronous standbys is: |
| <programlisting> |
| synchronous_standby_names = 'ANY 2 (s1, s2, s3)' |
| </programlisting> |
| In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>, |
| <literal>s3</literal> and <literal>s4</literal> are running, transaction commits will |
| wait for replies from at least any two standbys of <literal>s1</literal>, |
| <literal>s2</literal> and <literal>s3</literal>. <literal>s4</literal> is an asynchronous |
| standby since its name is not in the list. |
| </para> |
| <para> |
| The synchronous states of standby servers can be viewed using |
| the <structname>pg_stat_replication</structname> view. |
| </para> |
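| <para> |
| For example, the name, priority, and synchronous state of each connected |
| standby can be listed with: |
| <programlisting> |
| SELECT application_name, sync_priority, sync_state |
| FROM pg_stat_replication; |
| </programlisting> |
| </para> |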
| </sect3> |
| |
| <sect3 id="synchronous-replication-performance"> |
| <title>Planning for Performance</title> |
| |
| <para> |
| Synchronous replication usually requires carefully planned and placed |
| standby servers to ensure applications perform acceptably. Waiting |
| doesn't utilize system resources, but transaction locks continue to be |
| held until the transfer is confirmed. As a result, incautious use of |
| synchronous replication will reduce performance for database |
| applications because of increased response times and higher contention. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> allows the application developer |
| to specify the durability level required via replication. This can be |
| specified for the system overall, though it can also be specified for |
| specific users or connections, or even individual transactions. |
| </para> |
| |
| <para> |
| For example, an application workload might consist of: |
| 10% of changes are important customer details, while |
| 90% of changes are less important data that the business can more |
| easily survive if it is lost, such as chat messages between users. |
| </para> |
| |
| <para> |
| With synchronous replication options specified at the application level |
| (on the primary) we can offer synchronous replication for the most |
| important changes, without slowing down the bulk of the total workload. |
| Application level options are an important and practical tool for allowing |
| the benefits of synchronous replication for high performance applications. |
| </para> |
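| |
| <para> |
| As a sketch of this approach (the tables are hypothetical), a low-value |
| transaction can opt out of the synchronous wait while the rest of the |
| workload keeps the stronger guarantee configured on the primary: |
| <programlisting> |
| -- important change: uses the configured synchronous_commit default |
| UPDATE customers SET email = 'new@example.com' WHERE customer_id = 42; |
| |
| -- low-value change: skip the synchronous wait for this transaction only |
| BEGIN; |
| SET LOCAL synchronous_commit TO local; |
| INSERT INTO chat_messages (sender_id, body) VALUES (7, 'hello'); |
| COMMIT; |
| </programlisting> |
| </para> |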
| |
| <para> |
| You should consider that the network bandwidth must be higher than |
| the rate of generation of WAL data. |
| </para> |
| |
| </sect3> |
| |
| <sect3 id="synchronous-replication-ha"> |
| <title>Planning for High Availability</title> |
| |
| <para> |
    <varname>synchronous_standby_names</varname> specifies the number and
    names of synchronous standbys from which transaction commits made with
    <varname>synchronous_commit</varname> set to <literal>on</literal>,
    <literal>remote_apply</literal> or <literal>remote_write</literal> must
    receive responses. Such transaction commits may never complete
    if any one of the synchronous standbys crashes.
| </para> |
| |
| <para> |
| The best solution for high availability is to ensure you keep as many |
| synchronous standbys as requested. This can be achieved by naming multiple |
| potential synchronous standbys using <varname>synchronous_standby_names</varname>. |
| </para> |
| |
| <para> |
    In priority-based synchronous replication, the standbys whose names
    appear earlier in the list will be used as synchronous standbys.
    Standbys listed after these will take over the role of synchronous standby
    if one of the current ones should fail.
| </para> |
| |
| <para> |
    In quorum-based synchronous replication, all the standbys appearing
    in the list will be used as candidates for synchronous standbys.
    Even if one of them should fail, the other standbys will continue to
    serve as candidates for synchronous standby.
| </para> |
| |
| <para> |
| When a standby first attaches to the primary, it will not yet be properly |
| synchronized. This is described as <literal>catchup</literal> mode. Once |
| the lag between standby and primary reaches zero for the first time |
| we move to real-time <literal>streaming</literal> state. |
| The catch-up duration may be long immediately after the standby has |
| been created. If the standby is shut down, then the catch-up period |
| will increase according to the length of time the standby has been down. |
| The standby is only able to become a synchronous standby |
| once it has reached <literal>streaming</literal> state. |
| This state can be viewed using |
| the <structname>pg_stat_replication</structname> view. |
| </para> |
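   <para>
    For example, the replication state of each attached standby can be
    checked on the primary with:
<programlisting>
SELECT application_name, state, sync_state FROM pg_stat_replication;
</programlisting>
    A standby reported in the <literal>catchup</literal> state cannot yet
    act as a synchronous standby; it can do so only once
    <structfield>state</structfield> reports <literal>streaming</literal>.
   </para>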
| |
| <para> |
    If the primary restarts while commits are waiting for acknowledgment, those
    waiting transactions will be marked fully committed once the primary
    database recovers.
    There is no way to be certain that all standbys have received all
    outstanding WAL data at the time the primary crashed. Some
| transactions may not show as committed on the standby, even though |
| they show as committed on the primary. The guarantee we offer is that |
| the application will not receive explicit acknowledgment of the |
| successful commit of a transaction until the WAL data is known to be |
| safely received by all the synchronous standbys. |
| </para> |
| |
| <para> |
| If you really cannot keep as many synchronous standbys as requested |
| then you should decrease the number of synchronous standbys that |
| transaction commits must wait for responses from |
| in <varname>synchronous_standby_names</varname> (or disable it) and |
| reload the configuration file on the primary server. |
| </para> |
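   <para>
    For example, to stop waiting for synchronous standbys entirely, disable
    the setting in <filename>postgresql.conf</filename> on the primary:
<programlisting>
synchronous_standby_names = ''
</programlisting>
    and then reload the configuration, for instance with
    <command>pg_ctl reload</command> or by calling
    <function>pg_reload_conf()</function>.
   </para>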
| |
| <para> |
    If the primary is isolated from the remaining standby servers, you should
    fail over to the best candidate among those standbys.
| </para> |
| |
| <para> |
| If you need to re-create a standby server while transactions are |
    waiting, make sure that the commands <function>pg_start_backup()</function> and
    <function>pg_stop_backup()</function> are run in a session with
| <varname>synchronous_commit</varname> = <literal>off</literal>, otherwise those |
| requests will wait forever for the standby to appear. |
| </para> |
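   <para>
    For instance, such a backup session on the primary might look like the
    following sketch (the backup label is arbitrary):
<programlisting>
SET synchronous_commit = off;
SELECT pg_start_backup('rebuild standby');
-- copy the data directory to the new standby here
SELECT pg_stop_backup();
</programlisting>
   </para>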
| |
| </sect3> |
| </sect2> |
| |
| <sect2 id="continuous-archiving-in-standby"> |
| <title>Continuous Archiving in Standby</title> |
| |
| <indexterm> |
| <primary>continuous archiving</primary> |
| <secondary>in standby</secondary> |
| </indexterm> |
| |
| <para> |
| When continuous WAL archiving is used in a standby, there are two |
| different scenarios: the WAL archive can be shared between the primary |
| and the standby, or the standby can have its own WAL archive. When |
| the standby has its own WAL archive, set <varname>archive_mode</varname> |
| to <literal>always</literal>, and the standby will call the archive |
| command for every WAL segment it receives, whether it's by restoring |
    from the archive or by streaming replication. The shared archive can
    be handled similarly, but the <varname>archive_command</varname> must
    test whether the file being archived already exists, and whether the
    existing file has identical contents. This requires more care in the
    <varname>archive_command</varname>: it must not overwrite an existing
    file with different contents, but must return success if exactly the
    same file is archived twice. And all that must be done free of race
    conditions, in case two servers attempt to archive the same file at
    the same time.
| </para> |
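   <para>
    A much-simplified sketch of such a command, which illustrates only the
    content check and is <emphasis>not</emphasis> safe against race
    conditions or partially-written files, might look like:
<programlisting>
archive_command = 'test ! -f /mnt/server/archivedir/%f &amp;&amp; cp %p /mnt/server/archivedir/%f || cmp -s %p /mnt/server/archivedir/%f'
</programlisting>
    A production-quality command needs additional locking or an atomic
    rename so that it stays correct when both servers archive the same
    file concurrently.
   </para>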
| |
| <para> |
| If <varname>archive_mode</varname> is set to <literal>on</literal>, the |
| archiver is not enabled during recovery or standby mode. If the standby |
| server is promoted, it will start archiving after the promotion, but |
| will not archive any WAL or timeline history files that |
    it did not generate itself. To get a complete
    series of WAL files in the archive, you must ensure that all WAL is
    archived before it reaches the standby. This is inherently true with
    file-based log shipping, as the standby can only restore files that
    are found in the archive, but it is not true if streaming replication
    is enabled.
| When a server is not in recovery mode, there is no difference between |
| <literal>on</literal> and <literal>always</literal> modes. |
| </para> |
| </sect2> |
| </sect1> |
| |
| <sect1 id="warm-standby-failover"> |
| <title>Failover</title> |
| |
| <para> |
| If the primary server fails then the standby server should begin |
| failover procedures. |
| </para> |
| |
| <para> |
| If the standby server fails then no failover need take place. If the |
| standby server can be restarted, even some time later, then the recovery |
| process can also be restarted immediately, taking advantage of |
| restartable recovery. If the standby server cannot be restarted, then a |
| full new standby server instance should be created. |
| </para> |
| |
| <para> |
| If the primary server fails and the standby server becomes the |
| new primary, and then the old primary restarts, you must have |
| a mechanism for informing the old primary that it is no longer the primary. This is |
| sometimes known as <acronym>STONITH</acronym> (Shoot The Other Node In The Head), which is |
| necessary to avoid situations where both systems think they are the |
| primary, which will lead to confusion and ultimately data loss. |
| </para> |
| |
| <para> |
| Many failover systems use just two systems, the primary and the standby, |
| connected by some kind of heartbeat mechanism to continually verify the |
| connectivity between the two and the viability of the primary. It is |
| also possible to use a third system (called a witness server) to prevent |
| some cases of inappropriate failover, but the additional complexity |
| might not be worthwhile unless it is set up with sufficient care and |
| rigorous testing. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> does not provide the system |
| software required to identify a failure on the primary and notify |
| the standby database server. Many such tools exist and are well |
| integrated with the operating system facilities required for |
| successful failover, such as IP address migration. |
| </para> |
| |
| <para> |
| Once failover to the standby occurs, there is only a |
| single server in operation. This is known as a degenerate state. |
| The former standby is now the primary, but the former primary is down |
| and might stay down. To return to normal operation, a standby server |
| must be recreated, |
| either on the former primary system when it comes up, or on a third, |
| possibly new, system. The <xref linkend="app-pgrewind"/> utility can be |
| used to speed up this process on large clusters. |
| Once complete, the primary and standby can be |
| considered to have switched roles. Some people choose to use a third |
| server to provide backup for the new primary until the new standby |
| server is recreated, |
| though clearly this complicates the system configuration and |
| operational processes. |
| </para> |
| |
| <para> |
| So, switching from primary to standby server can be fast but requires |
| some time to re-prepare the failover cluster. Regular switching from |
| primary to standby is useful, since it allows regular downtime on |
| each system for maintenance. This also serves as a test of the |
| failover mechanism to ensure that it will really work when you need it. |
| Written administration procedures are advised. |
| </para> |
| |
| <para> |
| To trigger failover of a log-shipping standby server, run |
| <command>pg_ctl promote</command>, call <function>pg_promote()</function>, |
| or create a trigger file with the file name and path specified by the |
| <varname>promote_trigger_file</varname>. If you're planning to use |
| <command>pg_ctl promote</command> or to call |
| <function>pg_promote()</function> to fail over, |
| <varname>promote_trigger_file</varname> is not required. If you're |
    setting up reporting servers that are only used to offload read-only
    queries from the primary, not for high availability purposes, you don't
    need to promote them.
| </para> |
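   <para>
    For example, connected to the standby with <application>psql</application>,
    promotion can be requested with:
<programlisting>
SELECT pg_promote(wait => true, wait_seconds => 60);
</programlisting>
    which waits up to 60 seconds for the promotion to complete; the same
    effect can be achieved from the command line with
    <command>pg_ctl promote</command>.
   </para>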
| </sect1> |
| |
| <sect1 id="hot-standby"> |
| <title>Hot Standby</title> |
| |
| <indexterm zone="high-availability"> |
| <primary>Hot Standby</primary> |
| </indexterm> |
| |
| <para> |
| Hot Standby is the term used to describe the ability to connect to |
| the server and run read-only queries while the server is in archive |
| recovery or standby mode. This |
| is useful both for replication purposes and for restoring a backup |
| to a desired state with great precision. |
| The term Hot Standby also refers to the ability of the server to move |
| from recovery through to normal operation while users continue running |
| queries and/or keep their connections open. |
| </para> |
| |
| <para> |
| Running queries in hot standby mode is similar to normal query operation, |
| though there are several usage and administrative differences |
| explained below. |
| </para> |
| |
| <sect2 id="hot-standby-users"> |
| <title>User's Overview</title> |
| |
| <para> |
| When the <xref linkend="guc-hot-standby"/> parameter is set to true on a |
| standby server, it will begin accepting connections once the recovery has |
| brought the system to a consistent state. All such connections are |
| strictly read-only; not even temporary tables may be written. |
| </para> |
| |
| <para> |
| The data on the standby takes some time to arrive from the primary server |
| so there will be a measurable delay between primary and standby. Running the |
| same query nearly simultaneously on both primary and standby might therefore |
| return differing results. We say that data on the standby is |
| <firstterm>eventually consistent</firstterm> with the primary. Once the |
| commit record for a transaction is replayed on the standby, the changes |
| made by that transaction will be visible to any new snapshots taken on |
| the standby. Snapshots may be taken at the start of each query or at the |
| start of each transaction, depending on the current transaction isolation |
| level. For more details, see <xref linkend="transaction-iso"/>. |
| </para> |
| |
| <para> |
| Transactions started during hot standby may issue the following commands: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Query access: <command>SELECT</command>, <command>COPY TO</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Cursor commands: <command>DECLARE</command>, <command>FETCH</command>, <command>CLOSE</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Settings: <command>SHOW</command>, <command>SET</command>, <command>RESET</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Transaction management commands: |
| <itemizedlist> |
| <listitem> |
| <para> |
| <command>BEGIN</command>, <command>END</command>, <command>ABORT</command>, <command>START TRANSACTION</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SAVEPOINT</command>, <command>RELEASE</command>, <command>ROLLBACK TO SAVEPOINT</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>EXCEPTION</command> blocks and other internal subtransactions |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LOCK TABLE</command>, though only when explicitly in one of these modes: |
| <literal>ACCESS SHARE</literal>, <literal>ROW SHARE</literal> or <literal>ROW EXCLUSIVE</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Plans and resources: <command>PREPARE</command>, <command>EXECUTE</command>, |
| <command>DEALLOCATE</command>, <command>DISCARD</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Plugins and extensions: <command>LOAD</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>UNLISTEN</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
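   <para>
    As a simple illustration (the <literal>accounts</literal> table is
    hypothetical), a hot standby session might run:
<programlisting>
BEGIN;
DECLARE cur CURSOR FOR SELECT * FROM accounts ORDER BY account_id;
FETCH 10 FROM cur;
CLOSE cur;
LOCK TABLE accounts IN ACCESS SHARE MODE;
COMMIT;
</programlisting>
   </para>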
| |
| <para> |
| Transactions started during hot standby will never be assigned a |
| transaction ID and cannot write to the system write-ahead log. |
| Therefore, the following actions will produce error messages: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Data Manipulation Language (DML): <command>INSERT</command>, |
| <command>UPDATE</command>, <command>DELETE</command>, <command>COPY FROM</command>, |
| <command>TRUNCATE</command>. |
| Note that there are no allowed actions that result in a trigger |
| being executed during recovery. This restriction applies even to |
| temporary tables, because table rows cannot be read or written without |
| assigning a transaction ID, which is currently not possible in a |
| Hot Standby environment. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Data Definition Language (DDL): <command>CREATE</command>, |
| <command>DROP</command>, <command>ALTER</command>, <command>COMMENT</command>. |
| This restriction applies even to temporary tables, because carrying |
| out these operations would require updating the system catalog tables. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SELECT ... FOR SHARE | UPDATE</command>, because row locks cannot be |
| taken without updating the underlying data files. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Rules on <command>SELECT</command> statements that generate DML commands. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LOCK</command> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LOCK</command> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Transaction management commands that explicitly set non-read-only state: |
| <itemizedlist> |
| <listitem> |
| <para> |
| <command>BEGIN READ WRITE</command>, |
| <command>START TRANSACTION READ WRITE</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SET TRANSACTION READ WRITE</command>, |
| <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>SET transaction_read_only = off</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Two-phase commit commands: <command>PREPARE TRANSACTION</command>, |
| <command>COMMIT PREPARED</command>, <command>ROLLBACK PREPARED</command> |
| because even read-only transactions need to write WAL in the |
| prepare phase (the first phase of two phase commit). |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Sequence updates: <function>nextval()</function>, <function>setval()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <command>LISTEN</command>, <command>NOTIFY</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| In normal operation, <quote>read-only</quote> transactions are allowed to |
| use <command>LISTEN</command> and <command>NOTIFY</command>, |
| so Hot Standby sessions operate under slightly tighter |
| restrictions than ordinary read-only sessions. It is possible that some |
| of these restrictions might be loosened in a future release. |
| </para> |
| |
| <para> |
| During hot standby, the parameter <varname>transaction_read_only</varname> is always |
| true and may not be changed. But as long as no attempt is made to modify |
| the database, connections during hot standby will act much like any other |
| database connection. If failover or switchover occurs, the database will |
| switch to normal processing mode. Sessions will remain connected while the |
| server changes mode. Once hot standby finishes, it will be possible to |
| initiate read-write transactions (even from a session begun during |
| hot standby). |
| </para> |
| |
| <para> |
| Users can determine whether hot standby is currently active for their |
| session by issuing <command>SHOW in_hot_standby</command>. |
| (In server versions before 14, the <varname>in_hot_standby</varname> |
| parameter did not exist; a workable substitute method for older servers |
| is <command>SHOW transaction_read_only</command>.) In addition, a set of |
| functions (<xref linkend="functions-recovery-info-table"/>) allow users to |
| access information about the standby server. These allow you to write |
| programs that are aware of the current state of the database. These |
| can be used to monitor the progress of recovery, or to allow you to |
| write complex programs that restore the database to particular states. |
| </para> |
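   <para>
    For example, the following queries report whether the current session
    is running on a standby and how far WAL replay has progressed:
<programlisting>
SHOW in_hot_standby;
SELECT pg_is_in_recovery();
SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
</programlisting>
   </para>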
| </sect2> |
| |
| <sect2 id="hot-standby-conflict"> |
| <title>Handling Query Conflicts</title> |
| |
| <para> |
| The primary and standby servers are in many ways loosely connected. Actions |
| on the primary will have an effect on the standby. As a result, there is |
| potential for negative interactions or conflicts between them. The easiest |
| conflict to understand is performance: if a huge data load is taking place |
| on the primary then this will generate a similar stream of WAL records on the |
| standby, so standby queries may contend for system resources, such as I/O. |
| </para> |
| |
| <para> |
| There are also additional types of conflict that can occur with Hot Standby. |
| These conflicts are <emphasis>hard conflicts</emphasis> in the sense that queries |
| might need to be canceled and, in some cases, sessions disconnected to resolve them. |
| The user is provided with several ways to handle these |
| conflicts. Conflict cases include: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Access Exclusive locks taken on the primary server, including both |
| explicit <command>LOCK</command> commands and various <acronym>DDL</acronym> |
| actions, conflict with table accesses in standby queries. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Dropping a tablespace on the primary conflicts with standby queries |
| using that tablespace for temporary work files. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Dropping a database on the primary conflicts with sessions connected |
| to that database on the standby. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Application of a vacuum cleanup record from WAL conflicts with |
| standby transactions whose snapshots can still <quote>see</quote> any of |
| the rows to be removed. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Application of a vacuum cleanup record from WAL conflicts with |
| queries accessing the target page on the standby, whether or not |
| the data to be removed is visible. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| On the primary server, these cases simply result in waiting; and the |
| user might choose to cancel either of the conflicting actions. However, |
| on the standby there is no choice: the WAL-logged action already occurred |
| on the primary so the standby must not fail to apply it. Furthermore, |
| allowing WAL application to wait indefinitely may be very undesirable, |
| because the standby's state will become increasingly far behind the |
| primary's. Therefore, a mechanism is provided to forcibly cancel standby |
| queries that conflict with to-be-applied WAL records. |
| </para> |
| |
| <para> |
| An example of the problem situation is an administrator on the primary |
| server running <command>DROP TABLE</command> on a table that is currently being |
| queried on the standby server. Clearly the standby query cannot continue |
| if the <command>DROP TABLE</command> is applied on the standby. If this situation |
| occurred on the primary, the <command>DROP TABLE</command> would wait until the |
| other query had finished. But when <command>DROP TABLE</command> is run on the |
| primary, the primary doesn't have information about what queries are |
| running on the standby, so it will not wait for any such standby |
| queries. The WAL change records come through to the standby while the |
| standby query is still running, causing a conflict. The standby server |
| must either delay application of the WAL records (and everything after |
| them, too) or else cancel the conflicting query so that the <command>DROP |
| TABLE</command> can be applied. |
| </para> |
| |
| <para> |
| When a conflicting query is short, it's typically desirable to allow it to |
| complete by delaying WAL application for a little bit; but a long delay in |
| WAL application is usually not desirable. So the cancel mechanism has |
| parameters, <xref linkend="guc-max-standby-archive-delay"/> and <xref |
| linkend="guc-max-standby-streaming-delay"/>, that define the maximum |
| allowed delay in WAL application. Conflicting queries will be canceled |
| once it has taken longer than the relevant delay setting to apply any |
| newly-received WAL data. There are two parameters so that different delay |
| values can be specified for the case of reading WAL data from an archive |
| (i.e., initial recovery from a base backup or <quote>catching up</quote> a |
| standby server that has fallen far behind) versus reading WAL data via |
| streaming replication. |
| </para> |
| |
| <para> |
| In a standby server that exists primarily for high availability, it's |
| best to set the delay parameters relatively short, so that the server |
| cannot fall far behind the primary due to delays caused by standby |
| queries. However, if the standby server is meant for executing |
| long-running queries, then a high or even infinite delay value may be |
| preferable. Keep in mind however that a long-running query could |
| cause other sessions on the standby server to not see recent changes |
| on the primary, if it delays application of WAL records. |
| </para> |
| |
| <para> |
| Once the delay specified by <varname>max_standby_archive_delay</varname> or |
| <varname>max_standby_streaming_delay</varname> has been exceeded, conflicting |
| queries will be canceled. This usually results just in a cancellation |
| error, although in the case of replaying a <command>DROP DATABASE</command> |
| the entire conflicting session will be terminated. Also, if the conflict |
| is over a lock held by an idle transaction, the conflicting session is |
| terminated (this behavior might change in the future). |
| </para> |
| |
| <para> |
| Canceled queries may be retried immediately (after beginning a new |
| transaction, of course). Since query cancellation depends on |
| the nature of the WAL records being replayed, a query that was |
| canceled may well succeed if it is executed again. |
| </para> |
| |
| <para> |
| Keep in mind that the delay parameters are compared to the elapsed time |
| since the WAL data was received by the standby server. Thus, the grace |
| period allowed to any one query on the standby is never more than the |
| delay parameter, and could be considerably less if the standby has already |
| fallen behind as a result of waiting for previous queries to complete, or |
| as a result of being unable to keep up with a heavy update load. |
| </para> |
| |
| <para> |
| The most common reason for conflict between standby queries and WAL replay |
| is <quote>early cleanup</quote>. Normally, <productname>PostgreSQL</productname> allows |
| cleanup of old row versions when there are no transactions that need to |
| see them to ensure correct visibility of data according to MVCC rules. |
| However, this rule can only be applied for transactions executing on the |
| primary. So it is possible that cleanup on the primary will remove row |
| versions that are still visible to a transaction on the standby. |
| </para> |
| |
| <para> |
| Experienced users should note that both row version cleanup and row version |
| freezing will potentially conflict with standby queries. Running a manual |
| <command>VACUUM FREEZE</command> is likely to cause conflicts even on tables with |
| no updated or deleted rows. |
| </para> |
| |
| <para> |
| Users should be clear that tables that are regularly and heavily updated |
| on the primary server will quickly cause cancellation of longer running |
| queries on the standby. In such cases the setting of a finite value for |
| <varname>max_standby_archive_delay</varname> or |
| <varname>max_standby_streaming_delay</varname> can be considered similar to |
| setting <varname>statement_timeout</varname>. |
| </para> |
| |
| <para> |
| Remedial possibilities exist if the number of standby-query cancellations |
| is found to be unacceptable. The first option is to set the parameter |
| <varname>hot_standby_feedback</varname>, which prevents <command>VACUUM</command> from |
| removing recently-dead rows and so cleanup conflicts do not occur. |
| If you do this, you |
| should note that this will delay cleanup of dead rows on the primary, |
| which may result in undesirable table bloat. However, the cleanup |
| situation will be no worse than if the standby queries were running |
| directly on the primary server, and you are still getting the benefit of |
| off-loading execution onto the standby. |
| If standby servers connect and disconnect frequently, you |
| might want to make adjustments to handle the period when |
| <varname>hot_standby_feedback</varname> feedback is not being provided. |
| For example, consider increasing <varname>max_standby_archive_delay</varname> |
| so that queries are not rapidly canceled by conflicts in WAL archive |
| files during disconnected periods. You should also consider increasing |
| <varname>max_standby_streaming_delay</varname> to avoid rapid cancellations |
| by newly-arrived streaming WAL entries after reconnection. |
| </para> |
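   <para>
    A sketch of such a configuration on the standby, with example values
    only, might be:
<programlisting>
hot_standby_feedback = on
max_standby_archive_delay = 300s
max_standby_streaming_delay = 300s
</programlisting>
   </para>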
| |
| <para> |
| Another option is to increase <xref linkend="guc-vacuum-defer-cleanup-age"/> |
| on the primary server, so that dead rows will not be cleaned up as quickly |
| as they normally would be. This will allow more time for queries to |
| execute before they are canceled on the standby, without having to set |
| a high <varname>max_standby_streaming_delay</varname>. However it is |
| difficult to guarantee any specific execution-time window with this |
| approach, since <varname>vacuum_defer_cleanup_age</varname> is measured in |
| transactions executed on the primary server. |
| </para> |
| |
| <para> |
| The number of query cancels and the reason for them can be viewed using |
| the <structname>pg_stat_database_conflicts</structname> system view on the standby |
| server. The <structname>pg_stat_database</structname> system view also contains |
| summary information. |
| </para> |
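   <para>
    For example, per-database conflict counts can be inspected on the
    standby with:
<programlisting>
SELECT datname, confl_snapshot, confl_lock, confl_tablespace,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;
</programlisting>
   </para>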
| |
| <para> |
| Users can control whether a log message is produced when WAL replay is waiting |
| longer than <varname>deadlock_timeout</varname> for conflicts. This |
| is controlled by the <xref linkend="guc-log-recovery-conflict-waits"/> parameter. |
| </para> |
| </sect2> |
| |
| <sect2 id="hot-standby-admin"> |
| <title>Administrator's Overview</title> |
| |
| <para> |
| If <varname>hot_standby</varname> is <literal>on</literal> in <filename>postgresql.conf</filename> |
| (the default value) and there is a |
| <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary><secondary>for hot standby</secondary></indexterm> |
| file present, the server will run in Hot Standby mode. |
| However, it may take some time for Hot Standby connections to be allowed, |
| because the server will not accept connections until it has completed |
| sufficient recovery to provide a consistent state against which queries |
| can run. During this period, |
| clients that attempt to connect will be refused with an error message. |
| To confirm the server has come up, either loop trying to connect from |
| the application, or look for these messages in the server logs: |
| |
| <programlisting> |
| LOG: entering standby mode |
| |
| ... then some time later ... |
| |
| LOG: consistent recovery state reached |
| LOG: database system is ready to accept read-only connections |
| </programlisting> |
| |
| Consistency information is recorded once per checkpoint on the primary. |
| It is not possible to enable hot standby when reading WAL |
| written during a period when <varname>wal_level</varname> was not set to |
| <literal>replica</literal> or <literal>logical</literal> on the primary. Reaching |
| a consistent state can also be delayed in the presence of both of these |
| conditions: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| A write transaction has more than 64 subtransactions |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Very long-lived write transactions |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
    If you are running file-based log shipping (<quote>warm standby</quote>), you might need
| to wait until the next WAL file arrives, which could be as long as the |
| <varname>archive_timeout</varname> setting on the primary. |
| </para> |
| |
| <para> |
| The settings of some parameters determine the size of shared memory for |
| tracking transaction IDs, locks, and prepared transactions. These shared |
| memory structures must be no smaller on a standby than on the primary in |
| order to ensure that the standby does not run out of shared memory during |
| recovery. For example, if the primary had used a prepared transaction but |
| the standby had not allocated any shared memory for tracking prepared |
| transactions, then recovery could not continue until the standby's |
| configuration is changed. The parameters affected are: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <varname>max_connections</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_prepared_transactions</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_locks_per_transaction</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_wal_senders</varname> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <varname>max_worker_processes</varname> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| The easiest way to ensure this does not become a problem is to have these |
| parameters set on the standbys to values equal to or greater than on the |
| primary. Therefore, if you want to increase these values, you should do |
| so on all standby servers first, before applying the changes to the |
| primary server. Conversely, if you want to decrease these values, you |
| should do so on the primary server first, before applying the changes to |
| all standby servers. Keep in mind that when a standby is promoted, it |
| becomes the new reference for the required parameter settings for the |
| standbys that follow it. Therefore, to avoid this becoming a problem |
| during a switchover or failover, it is recommended to keep these settings |
| the same on all standby servers. |
| </para> |
| |
| <para> |
| The WAL tracks changes to these parameters on the |
| primary. If a hot standby processes WAL that indicates that the current |
| value on the primary is higher than its own value, it will log a warning |
| and pause recovery, for example: |
| <screen> |
| WARNING: hot standby is not possible because of insufficient parameter settings |
| DETAIL: max_connections = 80 is a lower setting than on the primary server, where its value was 100. |
| LOG: recovery has paused |
| DETAIL: If recovery is unpaused, the server will shut down. |
| HINT: You can then restart the server after making the necessary configuration changes. |
| </screen> |
| At that point, the settings on the standby need to be updated and the |
| instance restarted before recovery can continue. If the standby is not a |
| hot standby, then when it encounters the incompatible parameter change, it |
| will shut down immediately without pausing, since there is then no value |
| in keeping it up. |
| </para> |
| |
| <para> |
| It is important that the administrator select appropriate settings for |
| <xref linkend="guc-max-standby-archive-delay"/> and <xref |
| linkend="guc-max-standby-streaming-delay"/>. The best choices vary |
| depending on business priorities. For example if the server is primarily |
| tasked as a High Availability server, then you will want low delay |
| settings, perhaps even zero, though that is a very aggressive setting. If |
| the standby server is tasked as an additional server for decision support |
| queries then it might be acceptable to set the maximum delay values to |
| many hours, or even -1 which means wait forever for queries to complete. |
| </para> |
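   <para>
    As an illustration, a pure high-availability standby and a reporting
    standby might use settings along these lines (the values are examples
    only):
<programlisting>
# high-availability standby: keep replay as current as possible
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s

# reporting standby: let long-running queries finish
max_standby_archive_delay = -1
max_standby_streaming_delay = -1
</programlisting>
   </para>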
| |
| <para> |
    Transaction status <quote>hint bits</quote> written on the primary are not
    WAL-logged, so the standby will likely re-write the hints itself.
| Thus, the standby server will still perform disk writes even though |
| all users are read-only; no changes occur to the data values |
| themselves. Users will still write large sort temporary files and |
| re-generate relcache info files, so no part of the database |
| is truly read-only during hot standby mode. |
    Note also that writes to remote databases using the
    <application>dblink</application> module, and other operations outside the
| database using PL functions will still be possible, even though the |
| transaction is read-only locally. |
| </para> |
| |
| <para> |
| The following types of administration commands are not accepted |
| during recovery mode: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Data Definition Language (DDL): e.g., <command>CREATE INDEX</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Privilege and Ownership: <command>GRANT</command>, <command>REVOKE</command>, |
| <command>REASSIGN</command> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Maintenance commands: <command>ANALYZE</command>, <command>VACUUM</command>, |
| <command>CLUSTER</command>, <command>REINDEX</command> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
    Again, note that some of these commands are actually allowed during
    <quote>read-only</quote> transactions on the primary.
| </para> |
| |
| <para> |
| As a result, you cannot create additional indexes that exist solely |
| on the standby, nor statistics that exist solely on the standby. |
| If these administration commands are needed, they should be executed |
| on the primary, and eventually those changes will propagate to the |
| standby. |
| </para> |
| |
| <para> |
| <function>pg_cancel_backend()</function> |
| and <function>pg_terminate_backend()</function> will work on user backends, |
| but not the Startup process, which performs |
| recovery. <structname>pg_stat_activity</structname> does not show |
| recovering transactions as active. As a result, |
| <structname>pg_prepared_xacts</structname> is always empty during |
| recovery. If you wish to resolve in-doubt prepared transactions, view |
| <literal>pg_prepared_xacts</literal> on the primary and issue commands to |
| resolve transactions there or resolve them after the end of recovery. |
| </para> |
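   <para>
    For example, on the primary (the transaction identifier shown is
    hypothetical):
<programlisting>
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
COMMIT PREPARED 'some_gid';      -- or ROLLBACK PREPARED 'some_gid'
</programlisting>
   </para>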
| |
| <para> |
| <structname>pg_locks</structname> will show locks held by backends, |
| as normal. <structname>pg_locks</structname> also shows |
| a virtual transaction managed by the Startup process that owns all |
| <literal>AccessExclusiveLocks</literal> held by transactions being replayed by recovery. |
| Note that the Startup process does not acquire locks to |
| make database changes, and thus locks other than <literal>AccessExclusiveLocks</literal> |
| do not show in <structname>pg_locks</structname> for the Startup |
| process; they are just presumed to exist. |
| </para> |
| |
| <para> |
| The <productname>Nagios</productname> plugin <productname>check_pgsql</productname> will |
| work, because the simple information it checks for exists. |
| The <productname>check_postgres</productname> monitoring script will also work, |
| though some reported values could give different or confusing results. |
| For example, last vacuum time will not be maintained, since no |
| vacuum occurs on the standby. Vacuums running on the primary |
| do still send their changes to the standby. |
| </para> |
| |
| <para> |
| WAL file control commands will not work during recovery, |
| e.g., <function>pg_start_backup</function>, <function>pg_switch_wal</function> etc. |
| </para> |
| |
| <para> |
| Dynamically loadable modules work, including <structname>pg_stat_statements</structname>. |
| </para> |
| |
| <para> |
| Advisory locks work normally in recovery, including deadlock detection. |
| Note that advisory locks are never WAL logged, so it is impossible for |
| an advisory lock on either the primary or the standby to conflict with WAL |
| replay. Nor is it possible to acquire an advisory lock on the primary |
| and have it initiate a similar advisory lock on the standby. Advisory |
| locks relate only to the server on which they are acquired. |
| </para> |
| |
| <para> |
| Trigger-based replication systems such as <productname>Slony</productname>, |
| <productname>Londiste</productname> and <productname>Bucardo</productname> won't run on the |
| standby at all, though they will run happily on the primary server as |
| long as the changes are not sent to standby servers to be applied. |
| WAL replay is not trigger-based so you cannot relay from the |
| standby to any system that requires additional database writes or |
| relies on the use of triggers. |
| </para> |
| |
| <para> |
| New OIDs cannot be assigned, though some <acronym>UUID</acronym> generators may still |
| work as long as they do not rely on writing new status to the database. |
| </para> |
| |
| <para> |
| Currently, temporary table creation is not allowed during read-only |
| transactions, so in some cases existing scripts will not run correctly. |
| This restriction might be relaxed in a later release. This is |
| both an SQL Standard compliance issue and a technical issue. |
| </para> |
| |
| <para> |
| <command>DROP TABLESPACE</command> can only succeed if the tablespace is empty. |
| Some standby users may be actively using the tablespace via their |
| <varname>temp_tablespaces</varname> parameter. If there are temporary files in the |
| tablespace, all active queries are canceled to ensure that temporary |
| files are removed, so the tablespace can be removed and WAL replay |
| can continue. |
| </para> |
| |
| <para> |
| Running <command>DROP DATABASE</command> or <command>ALTER DATABASE ... SET |
| TABLESPACE</command> on the primary |
| will generate a WAL entry that will cause all users connected to that |
| database on the standby to be forcibly disconnected. This action occurs |
| immediately, whatever the setting of |
| <varname>max_standby_streaming_delay</varname>. Note that |
    <command>ALTER DATABASE ... RENAME</command> does not disconnect users, which
    in most cases will go unnoticed, though it might in some cases confuse a
    program that depends in some way upon the database name.
| </para> |
| |
| <para> |
| In normal (non-recovery) mode, if you issue <command>DROP USER</command> or <command>DROP ROLE</command> |
| for a role with login capability while that user is still connected then |
| nothing happens to the connected user — they remain connected. The user cannot |
| reconnect however. This behavior applies in recovery also, so a |
| <command>DROP USER</command> on the primary does not disconnect that user on the standby. |
| </para> |
| |
| <para> |
| The statistics collector is active during recovery. All scans, reads, blocks, |
    index usage, etc., will be recorded normally on the standby. Replayed
    actions will not duplicate their effects on the primary, so replaying an
    insert will not increment the Inserts column of
    <structname>pg_stat_user_tables</structname>.
| The stats file is deleted at the start of recovery, so stats from primary |
| and standby will differ; this is considered a feature, not a bug. |
| </para> |
| |
| <para> |
| Autovacuum is not active during recovery. It will start normally at the |
| end of recovery. |
| </para> |
| |
| <para> |
| The checkpointer process and the background writer process are active during |
| recovery. The checkpointer process will perform restartpoints (similar to |
| checkpoints on the primary) and the background writer process will perform |
| normal block cleaning activities. This can include updates of the hint bit |
| information stored on the standby server. |
| The <command>CHECKPOINT</command> command is accepted during recovery, |
| though it performs a restartpoint rather than a new checkpoint. |
| </para> |
| </sect2> |
| |
| <sect2 id="hot-standby-parameters"> |
| <title>Hot Standby Parameter Reference</title> |
| |
| <para> |
| Various parameters have been mentioned above in |
| <xref linkend="hot-standby-conflict"/> and |
| <xref linkend="hot-standby-admin"/>. |
| </para> |
| |
| <para> |
| On the primary, parameters <xref linkend="guc-wal-level"/> and |
| <xref linkend="guc-vacuum-defer-cleanup-age"/> can be used. |
| <xref linkend="guc-max-standby-archive-delay"/> and |
| <xref linkend="guc-max-standby-streaming-delay"/> have no effect if set on |
| the primary. |
| </para> |
| |
| <para> |
| On the standby, parameters <xref linkend="guc-hot-standby"/>, |
| <xref linkend="guc-max-standby-archive-delay"/> and |
| <xref linkend="guc-max-standby-streaming-delay"/> can be used. |
| <xref linkend="guc-vacuum-defer-cleanup-age"/> has no effect |
| as long as the server remains in standby mode, though it will |
| become relevant if the standby becomes primary. |
| </para> |
| </sect2> |
| |
| <sect2 id="hot-standby-caveats"> |
| <title>Caveats</title> |
| |
| <para> |
| There are several limitations of Hot Standby. |
| These can and probably will be fixed in future releases: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Full knowledge of running transactions is required before snapshots |
| can be taken. Transactions that use large numbers of subtransactions |
| (currently greater than 64) will delay the start of read-only |
| connections until the completion of the longest running write transaction. |
| If this situation occurs, explanatory messages will be sent to the server log. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Valid starting points for standby queries are generated at each |
| checkpoint on the primary. If the standby is shut down while the primary |
| is in a shutdown state, it might not be possible to re-enter Hot Standby |
| until the primary is started up, so that it generates further starting |
| points in the WAL logs. This situation isn't a problem in the most |
| common situations where it might happen. Generally, if the primary is |
| shut down and not available anymore, that's likely due to a serious |
| failure that requires the standby being converted to operate as |
| the new primary anyway. And in situations where the primary is |
| being intentionally taken down, coordinating to make sure the standby |
| becomes the new primary smoothly is also standard procedure. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| At the end of recovery, <literal>AccessExclusiveLocks</literal> held by prepared transactions |
| will require twice the normal number of lock table entries. If you plan |
| on running either a large number of concurrent prepared transactions |
| that normally take <literal>AccessExclusiveLocks</literal>, or you plan on having one |
| large transaction that takes many <literal>AccessExclusiveLocks</literal>, you are |
| advised to select a larger value of <varname>max_locks_per_transaction</varname>, |
| perhaps as much as twice the value of the parameter on |
| the primary server. You need not consider this at all if |
| your setting of <varname>max_prepared_transactions</varname> is 0. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| The Serializable transaction isolation level is not yet available in hot |
| standby. (See <xref linkend="xact-serializable"/> and |
| <xref linkend="serializable-consistency"/> for details.) |
| An attempt to set a transaction to the serializable isolation level in |
| hot standby mode will generate an error. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </para> |
| </sect2> |
| |
| </sect1> |
| |
| </chapter> |