| <!-- doc/src/sgml/wal.sgml --> |
| |
| <chapter id="wal"> |
| <title>Reliability and the Write-Ahead Log</title> |
| |
| <para> |
| This chapter explains how the Write-Ahead Log is used to obtain |
| efficient, reliable operation. |
| </para> |
| |
| <sect1 id="wal-reliability"> |
| <title>Reliability</title> |
| |
| <para> |
| Reliability is an important property of any serious database |
| system, and <productname>PostgreSQL</productname> does everything possible to |
| guarantee reliable operation. One aspect of reliable operation is |
| that all data recorded by a committed transaction should be stored |
| in a nonvolatile area that is safe from power loss, operating |
| system failure, and hardware failure (except failure of the |
| nonvolatile area itself, of course). Successfully writing the data |
| to the computer's permanent storage (disk drive or equivalent) |
| ordinarily meets this requirement. In fact, even if a computer is |
| fatally damaged, if the disk drives survive they can be moved to |
| another computer with similar hardware and all committed |
| transactions will remain intact. |
| </para> |
| |
| <para> |
| While forcing data to the disk platters periodically might seem like |
| a simple operation, it is not. Because disk drives are dramatically |
| slower than main memory and CPUs, several layers of caching exist |
| between the computer's main memory and the disk platters. |
| First, there is the operating system's buffer cache, which caches |
| frequently requested disk blocks and combines disk writes. Fortunately, |
| all operating systems give applications a way to force writes from |
| the buffer cache to disk, and <productname>PostgreSQL</productname> uses those |
| features. (See the <xref linkend="guc-wal-sync-method"/> parameter |
| to adjust how this is done.) |
| </para> |
| |
| <para> |
| Next, there might be a cache in the disk drive controller; this is |
| particularly common on <acronym>RAID</acronym> controller cards. Some of |
| these caches are <firstterm>write-through</firstterm>, meaning writes are sent |
| to the drive as soon as they arrive. Others are |
| <firstterm>write-back</firstterm>, meaning data is sent to the drive at |
| some later time. Such caches can be a reliability hazard because the |
| memory in the disk controller cache is volatile, and will lose its |
| contents in a power failure. Better controller cards have |
| <firstterm>battery-backup units</firstterm> (<acronym>BBU</acronym>s), meaning |
| the card has a battery that |
| maintains power to the cache in case of system power loss. After power |
| is restored the data will be written to the disk drives. |
| </para> |
| |
| <para> |
| And finally, most disk drives have caches. Some are write-through |
| while some are write-back, and the same concerns about data loss |
| exist for write-back drive caches as for disk controller |
| caches. Consumer-grade IDE and SATA drives are particularly likely |
| to have write-back caches that will not survive a power failure. Many |
| solid-state drives (SSD) also have volatile write-back caches. |
| </para> |
| |
| <para> |
| These caches can typically be disabled; however, the method for doing |
| this varies by operating system and drive type: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| On <productname>Linux</productname>, IDE and SATA drives can be queried using |
| <command>hdparm -I</command>; write caching is enabled if there is |
| a <literal>*</literal> next to <literal>Write cache</literal>. <command>hdparm -W 0</command> |
| can be used to turn off write caching. SCSI drives can be queried |
| using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</application></ulink>. |
| Use <command>sdparm --get=WCE</command> to check |
| whether the write cache is enabled and <command>sdparm --clear=WCE</command> |
| to disable it. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| On <productname>FreeBSD</productname>, IDE drives can be queried using |
| <command>atacontrol</command> and write caching turned off using |
| <literal>hw.ata.wc=0</literal> in <filename>/boot/loader.conf</filename>; |
| SCSI drives can be queried using <command>camcontrol identify</command>, |
| and the write cache both queried and changed using |
| <command>sdparm</command> when available. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| On <productname>Solaris</productname>, the disk write cache is controlled by |
| <command>format -e</command>. |
| (The Solaris <acronym>ZFS</acronym> file system is safe with disk write-cache |
| enabled because it issues its own disk cache flush commands.) |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| On <productname>Windows</productname>, if <varname>wal_sync_method</varname> is |
| <literal>open_datasync</literal> (the default), write caching can be disabled |
| by unchecking <literal>My Computer\Open\<replaceable>disk drive</replaceable>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</literal>. |
| Alternatively, set <varname>wal_sync_method</varname> to |
| <literal>fsync</literal> or <literal>fsync_writethrough</literal>, which prevent |
| write caching. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| On <productname>macOS</productname>, write caching can be prevented by |
| setting <varname>wal_sync_method</varname> to <literal>fsync_writethrough</literal>. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| Recent SATA drives (those following <acronym>ATAPI-6</acronym> or later) |
| offer a drive cache flush command (<command>FLUSH CACHE EXT</command>), |
| while SCSI drives have long supported a similar command |
| <command>SYNCHRONIZE CACHE</command>. These commands are not directly |
| accessible to <productname>PostgreSQL</productname>, but some file systems |
| (e.g., <acronym>ZFS</acronym>, <acronym>ext4</acronym>) can use them to flush |
| data to the platters on write-back-enabled drives. Unfortunately, such |
| file systems behave suboptimally when combined with battery-backup unit |
| (<acronym>BBU</acronym>) disk controllers. In such setups, the synchronize |
| command forces all data from the controller cache to the disks, |
| eliminating much of the benefit of the BBU. You can run the |
| <xref linkend="pgtestfsync"/> program to see |
| if you are affected. If you are affected, the performance benefits |
| of the BBU can be regained by turning off write barriers in |
| the file system or reconfiguring the disk controller, if that is |
| an option. If write barriers are turned off, make sure the battery |
| remains functional; a faulty battery can potentially lead to data loss. |
| Hopefully file system and disk controller designers will eventually |
| address this suboptimal behavior. |
| </para> |
| |
| <para> |
| When the operating system sends a write request to the storage hardware, |
| there is little it can do to make sure the data has arrived at a truly |
| non-volatile storage area. Rather, it is the |
| administrator's responsibility to make certain that all storage components |
| ensure integrity for both data and file-system metadata. |
| Avoid disk controllers that have non-battery-backed write caches. |
| At the drive level, disable write-back caching if the |
| drive cannot guarantee the data will be written before shutdown. |
| If you use SSDs, be aware that many of these do not honor cache flush |
| commands by default. |
| You can test for reliable I/O subsystem behavior using <ulink |
| url="https://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>. |
| </para> |
| |
| <para> |
| Another risk of data loss is posed by the disk platter write |
| operations themselves. Disk platters are divided into sectors, |
| commonly 512 bytes each. Every physical read or write operation |
| processes a whole sector. |
| When a write request arrives at the drive, it might be for some multiple |
| of 512 bytes (<productname>PostgreSQL</productname> typically writes 8192 bytes, or |
| 16 sectors, at a time), and the process of writing could fail due |
| to power loss at any time, meaning some of the 512-byte sectors were |
| written while others were not. To guard against such failures, |
| <productname>PostgreSQL</productname> periodically writes full page images to |
| permanent WAL storage <emphasis>before</emphasis> modifying the actual page on |
| disk. By doing this, during crash recovery <productname>PostgreSQL</productname> can |
| restore partially-written pages from WAL. If you have file-system software |
| that prevents partial page writes (e.g., ZFS), you can turn off |
| this page imaging by turning off the <xref |
| linkend="guc-full-page-writes"/> parameter. Battery-Backed Unit |
| (BBU) disk controllers do not prevent partial page writes unless |
| they guarantee that data is written to the BBU as full (8kB) pages. |
| </para> |
| <para> |
| <productname>PostgreSQL</productname> also protects against some kinds of data corruption |
| on storage devices that may occur because of hardware errors or media failure over time, |
| such as reading/writing garbage data. |
| <itemizedlist> |
| <listitem> |
| <para> |
| Each individual record in a WAL file is protected by a CRC-32 (32-bit) check |
| that allows us to tell if record contents are correct. The CRC value |
| is set when we write each WAL record and checked during crash recovery, |
| archive recovery and replication. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Data pages are not currently checksummed by default, though full page images |
| recorded in WAL records will be protected; see <link |
| linkend="app-initdb-data-checksums"><application>initdb</application></link> |
| for details about enabling data checksums. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Internal data structures such as <filename>pg_xact</filename>, <filename>pg_subtrans</filename>, <filename>pg_multixact</filename>, |
| <filename>pg_serial</filename>, <filename>pg_notify</filename>, <filename>pg_stat</filename>, <filename>pg_snapshots</filename> are not directly |
| checksummed, nor are pages protected by full page writes. However, where |
| such data structures are persistent, WAL records are written that allow |
| recent changes to be accurately rebuilt at crash recovery and those |
| WAL records are protected as discussed above. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Individual state files in <filename>pg_twophase</filename> are protected by CRC-32. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Temporary data files used in larger SQL queries for sorts, |
| materializations and intermediate results are not currently checksummed, |
| nor will WAL records be written for changes to those files. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| <productname>PostgreSQL</productname> does not protect against correctable memory errors |
| and it is assumed you will operate using RAM that uses industry standard |
| Error Correcting Codes (ECC) or better protection. |
| </para> |
| </sect1> |
| |
| <sect1 id="checksums"> |
| <title>Data Checksums</title> |
| <indexterm> |
| <primary>checksums</primary> |
| </indexterm> |
| |
| <para> |
| By default, data pages are not protected by checksums, but this can |
| optionally be enabled for a cluster. When enabled, each data page includes |
| a checksum that is updated when the page is written and verified each time |
| the page is read. Only data pages are protected by checksums; internal data |
| structures and temporary files are not. |
| </para> |
| |
| <para> |
| Checksums are normally enabled when the cluster is initialized using <link |
| linkend="app-initdb-data-checksums"><application>initdb</application></link>. |
| They can also be enabled or disabled at a later time as an offline |
| operation. Data checksums are enabled or disabled at the full cluster |
| level, and cannot be specified individually for databases or tables. |
| </para> |
| |
| <para> |
| The current state of checksums in the cluster can be verified by viewing the |
| value of the read-only configuration variable <xref |
| linkend="guc-data-checksums" /> by issuing the command <command>SHOW |
| data_checksums</command>. |
| </para> |
| |
| <para> |
| When attempting to recover from page corruptions, it may be necessary to |
| bypass the checksum protection. To do this, temporarily set the |
| configuration parameter <xref linkend="guc-ignore-checksum-failure" />. |
| </para> |
| |
| <sect2 id="checksums-offline-enable-disable"> |
| <title>Off-line Enabling of Checksums</title> |
| |
| <para> |
| The <link linkend="app-pgchecksums"><application>pg_checksums</application></link> |
| application can be used to enable or disable data checksums, as well as |
| verify checksums, on an offline cluster. |
| </para> |
| |
| </sect2> |
| </sect1> |
| |
| <sect1 id="wal-intro"> |
| <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title> |
| |
| <indexterm zone="wal"> |
| <primary>WAL</primary> |
| </indexterm> |
| |
| <indexterm> |
| <primary>transaction log</primary> |
| <see>WAL</see> |
| </indexterm> |
| |
| <para> |
| <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>) |
| is a standard method for ensuring data integrity. A detailed |
| description can be found in most (if not all) books about |
| transaction processing. Briefly, <acronym>WAL</acronym>'s central |
| concept is that changes to data files (where tables and indexes |
| reside) must be written only after those changes have been logged, |
| that is, after log records describing the changes have been flushed |
| to permanent storage. If we follow this procedure, we do not need |
| to flush data pages to disk on every transaction commit, because we |
| know that in the event of a crash we will be able to recover the |
| database using the log: any changes that have not been applied to |
| the data pages can be redone from the log records. (This is |
| roll-forward recovery, also known as REDO.) |
| </para> |
| |
| <tip> |
| <para> |
| Because <acronym>WAL</acronym> restores database file |
| contents after a crash, journaled file systems are not necessary for |
| reliable storage of the data files or WAL files. In fact, journaling |
| overhead can reduce performance, especially if journaling |
| causes file system <emphasis>data</emphasis> to be flushed |
| to disk. Fortunately, data flushing during journaling can |
| often be disabled with a file system mount option, e.g., |
| <literal>data=writeback</literal> on a Linux ext3 file system. |
| Journaled file systems do improve boot speed after a crash. |
| </para> |
| </tip> |
| |
| |
| <para> |
| Using <acronym>WAL</acronym> results in a |
| significantly reduced number of disk writes, because only the log |
| file needs to be flushed to disk to guarantee that a transaction is |
| committed, rather than every data file changed by the transaction. |
| The log file is written sequentially, |
| and so the cost of syncing the log is much less than the cost of |
| flushing the data pages. This is especially true for servers |
| handling many small transactions touching different parts of the data |
| store. Furthermore, when the server is processing many small concurrent |
| transactions, one <function>fsync</function> of the log file may |
| suffice to commit many transactions. |
| </para> |
| |
| <para> |
| <acronym>WAL</acronym> also makes it possible to support on-line |
| backup and point-in-time recovery, as described in <xref |
| linkend="continuous-archiving"/>. By archiving the WAL data we can support |
| reverting to any time instant covered by the available WAL data: |
| we simply install a prior physical backup of the database, and |
| replay the WAL log just as far as the desired time. What's more, |
| the physical backup doesn't have to be an instantaneous snapshot |
| of the database state — if it is made over some period of time, |
| then replaying the WAL log for that period will fix any internal |
| inconsistencies. |
| </para> |
| </sect1> |
| |
| <sect1 id="wal-async-commit"> |
| <title>Asynchronous Commit</title> |
| |
| <indexterm> |
| <primary>synchronous commit</primary> |
| </indexterm> |
| |
| <indexterm> |
| <primary>asynchronous commit</primary> |
| </indexterm> |
| |
| <para> |
| <firstterm>Asynchronous commit</firstterm> is an option that allows transactions |
| to complete more quickly, at the cost that the most recent transactions may |
| be lost if the database should crash. In many applications this is an |
| acceptable trade-off. |
| </para> |
| |
| <para> |
| As described in the previous section, transaction commit is normally |
| <firstterm>synchronous</firstterm>: the server waits for the transaction's |
| <acronym>WAL</acronym> records to be flushed to permanent storage |
| before returning a success indication to the client. The client is |
| therefore guaranteed that a transaction reported to be committed will |
| be preserved, even in the event of a server crash immediately after. |
| However, for short transactions this delay is a major component of the |
| total transaction time. Selecting asynchronous commit mode means that |
| the server returns success as soon as the transaction is logically |
| completed, before the <acronym>WAL</acronym> records it generated have |
| actually made their way to disk. This can provide a significant boost |
| in throughput for small transactions. |
| </para> |
| |
| <para> |
| Asynchronous commit introduces the risk of data loss. There is a short |
| time window between the report of transaction completion to the client |
| and the time that the transaction is truly committed (that is, it is |
| guaranteed not to be lost if the server crashes). Thus asynchronous |
| commit should not be used if the client will take external actions |
| relying on the assumption that the transaction will be remembered. |
| As an example, a bank would certainly not use asynchronous commit for |
| a transaction recording an ATM's dispensing of cash. But in many |
| scenarios, such as event logging, there is no need for a strong |
| guarantee of this kind. |
| </para> |
| |
| <para> |
| The risk that is taken by using asynchronous commit is of data loss, |
| not data corruption. If the database should crash, it will recover |
| by replaying <acronym>WAL</acronym> up to the last record that was |
| flushed. The database will therefore be restored to a self-consistent |
| state, but any transactions that were not yet flushed to disk will |
| not be reflected in that state. The net effect is therefore loss of |
| the last few transactions. Because the transactions are replayed in |
| commit order, no inconsistency can be introduced — for example, |
| if transaction B made changes relying on the effects of a previous |
| transaction A, it is not possible for A's effects to be lost while B's |
| effects are preserved. |
| </para> |
| |
| <para> |
| The user can select the commit mode of each transaction, so that |
| it is possible to have both synchronous and asynchronous commit |
| transactions running concurrently. This allows flexible trade-offs |
| between performance and certainty of transaction durability. |
| The commit mode is controlled by the user-settable parameter |
| <xref linkend="guc-synchronous-commit"/>, which can be changed in any of |
| the ways that a configuration parameter can be set. The mode used for |
| any one transaction depends on the value of |
| <varname>synchronous_commit</varname> when transaction commit begins. |
| </para> |
| |
| <para> |
| Certain utility commands, for instance <command>DROP TABLE</command>, are |
| forced to commit synchronously regardless of the setting of |
| <varname>synchronous_commit</varname>. This is to ensure consistency |
| between the server's file system and the logical state of the database. |
| The commands supporting two-phase commit, such as <command>PREPARE |
| TRANSACTION</command>, are also always synchronous. |
| </para> |
| |
| <para> |
| If the database crashes during the risk window between an |
| asynchronous commit and the writing of the transaction's |
| <acronym>WAL</acronym> records, |
| then changes made during that transaction <emphasis>will</emphasis> be lost. |
| The duration of the |
| risk window is limited because a background process (the <quote>WAL |
| writer</quote>) flushes unwritten <acronym>WAL</acronym> records to disk |
| every <xref linkend="guc-wal-writer-delay"/> milliseconds. |
| The actual maximum duration of the risk window is three times |
| <varname>wal_writer_delay</varname> because the WAL writer is |
| designed to favor writing whole pages at a time during busy periods. |
| </para> |
| |
| <caution> |
| <para> |
| An immediate-mode shutdown is equivalent to a server crash, and will |
| therefore cause loss of any unflushed asynchronous commits. |
| </para> |
| </caution> |
| |
| <para> |
| Asynchronous commit provides behavior different from setting |
| <xref linkend="guc-fsync"/> = off. |
| <varname>fsync</varname> is a server-wide |
| setting that will alter the behavior of all transactions. It disables |
| all logic within <productname>PostgreSQL</productname> that attempts to synchronize |
| writes to different portions of the database, and therefore a system |
| crash (that is, a hardware or operating system crash, not a failure of |
| <productname>PostgreSQL</productname> itself) could result in arbitrarily bad |
| corruption of the database state. In many scenarios, asynchronous |
| commit provides most of the performance improvement that could be |
| obtained by turning off <varname>fsync</varname>, but without the risk |
| of data corruption. |
| </para> |
| |
| <para> |
| <xref linkend="guc-commit-delay"/> also sounds very similar to |
| asynchronous commit, but it is actually a synchronous commit method |
| (in fact, <varname>commit_delay</varname> is ignored during an |
| asynchronous commit). <varname>commit_delay</varname> causes a delay |
| just before a transaction flushes <acronym>WAL</acronym> to disk, in |
| the hope that a single flush executed by one such transaction can also |
| serve other transactions committing at about the same time. The |
| setting can be thought of as a way of increasing the time window in |
| which transactions can join a group about to participate in a single |
| flush, to amortize the cost of the flush among multiple transactions. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="wal-configuration"> |
| <title><acronym>WAL</acronym> Configuration</title> |
| |
| <para> |
| There are several <acronym>WAL</acronym>-related configuration parameters that |
| affect database performance. This section explains their use. |
| Consult <xref linkend="runtime-config"/> for general information about |
| setting server configuration parameters. |
| </para> |
| |
| <para> |
| <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</primary></indexterm> |
| are points in the sequence of transactions at which it is guaranteed |
| that the heap and index data files have been updated with all |
| information written before that checkpoint. At checkpoint time, all |
| dirty data pages are flushed to disk and a special checkpoint record is |
| written to the log file. (The change records were previously flushed |
| to the <acronym>WAL</acronym> files.) |
| In the event of a crash, the crash recovery procedure looks at the latest |
| checkpoint record to determine the point in the log (known as the redo |
| record) from which it should start the REDO operation. Any changes made to |
| data files before that point are guaranteed to be already on disk. |
| Hence, after a checkpoint, log segments preceding the one containing |
| the redo record are no longer needed and can be recycled or removed. (When |
| <acronym>WAL</acronym> archiving is being done, the log segments must be |
| archived before being recycled or removed.) |
| </para> |
| |
| <para> |
| The checkpoint requirement of flushing all dirty data pages to disk |
| can cause a significant I/O load. For this reason, checkpoint |
| activity is throttled so that I/O begins at checkpoint start and completes |
| before the next checkpoint is due to start; this minimizes performance |
| degradation during checkpoints. |
| </para> |
| |
| <para> |
| The server's checkpointer process automatically performs |
| a checkpoint every so often. A checkpoint is begun every <xref |
| linkend="guc-checkpoint-timeout"/> seconds, or if |
| <xref linkend="guc-max-wal-size"/> is about to be exceeded, |
| whichever comes first. |
| The default settings are 5 minutes and 1 GB, respectively. |
| If no WAL has been written since the previous checkpoint, new checkpoints |
| will be skipped even if <varname>checkpoint_timeout</varname> has passed. |
| (If WAL archiving is being used and you want to put a lower limit on how |
| often files are archived in order to bound potential data loss, you should |
| adjust the <xref linkend="guc-archive-timeout"/> parameter rather than the |
| checkpoint parameters.) |
| It is also possible to force a checkpoint by using the SQL |
| command <command>CHECKPOINT</command>. |
| </para> |
| |
| <para> |
| Reducing <varname>checkpoint_timeout</varname> and/or |
| <varname>max_wal_size</varname> causes checkpoints to occur |
| more often. This allows faster after-crash recovery, since less work |
| will need to be redone. However, one must balance this against the |
| increased cost of flushing dirty data pages more often. If |
| <xref linkend="guc-full-page-writes"/> is set (as is the default), there is |
| another factor to consider. To ensure data page consistency, |
| the first modification of a data page after each checkpoint results in |
| logging the entire page content. In that case, |
| a smaller checkpoint interval increases the volume of output to the WAL log, |
| partially negating the goal of using a smaller interval, |
| and in any case causing more disk I/O. |
| </para> |
| |
| <para> |
| Checkpoints are fairly expensive, first because they require writing |
| out all currently dirty buffers, and second because they result in |
| extra subsequent WAL traffic as discussed above. It is therefore |
| wise to set the checkpointing parameters high enough so that checkpoints |
| don't happen too often. As a simple sanity check on your checkpointing |
| parameters, you can set the <xref linkend="guc-checkpoint-warning"/> |
| parameter. If checkpoints happen closer together than |
| <varname>checkpoint_warning</varname> seconds, |
| a message will be output to the server log recommending increasing |
| <varname>max_wal_size</varname>. Occasional appearance of such |
| a message is not cause for alarm, but if it appears often then the |
| checkpoint control parameters should be increased. Bulk operations such |
| as large <command>COPY</command> transfers might cause a number of such warnings |
| to appear if you have not set <varname>max_wal_size</varname> high |
| enough. |
| </para> |
| |
| <para> |
| To avoid flooding the I/O system with a burst of page writes, |
| writing dirty buffers during a checkpoint is spread over a period of time. |
| That period is controlled by |
| <xref linkend="guc-checkpoint-completion-target"/>, which is |
| given as a fraction of the checkpoint interval (configured by using |
| <varname>checkpoint_timeout</varname>). |
| The I/O rate is adjusted so that the checkpoint finishes when the |
| given fraction of |
| <varname>checkpoint_timeout</varname> seconds have elapsed, or before |
| <varname>max_wal_size</varname> is exceeded, whichever is sooner. |
| With the default value of 0.9, |
| <productname>PostgreSQL</productname> can be expected to complete each checkpoint |
| a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's |
| duration). This spreads out the I/O as much as possible so that the checkpoint |
| I/O load is consistent throughout the checkpoint interval. The disadvantage of |
| this is that prolonging checkpoints affects recovery time, because more WAL |
| segments will need to be kept around for possible use in recovery. A user |
| concerned about the amount of time required to recover might wish to reduce |
| <varname>checkpoint_timeout</varname> so that checkpoints occur more frequently |
| but still spread the I/O across the checkpoint interval. Alternatively, |
| <varname>checkpoint_completion_target</varname> could be reduced, but this would |
| result in times of more intense I/O (during the checkpoint) and times of less I/O |
| (after the checkpoint completed but before the next scheduled checkpoint) and |
| therefore is not recommended. |
| Although <varname>checkpoint_completion_target</varname> could be set as high as |
| 1.0, it is typically recommended to set it to no higher than 0.9 (the default) |
| since checkpoints include some other activities besides writing dirty buffers. |
| A setting of 1.0 is quite likely to result in checkpoints not being |
| completed on time, which would result in performance loss due to |
| unexpected variation in the number of WAL segments needed. |
| </para> |
| |
| <para> |
| On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after"/> |
| allows to force the OS that pages written by the checkpoint should be |
| flushed to disk after a configurable number of bytes. Otherwise, these |
| pages may be kept in the OS's page cache, inducing a stall when |
| <literal>fsync</literal> is issued at the end of a checkpoint. This setting will |
| often help to reduce transaction latency, but it also can have an adverse |
| effect on performance; particularly for workloads that are bigger than |
| <xref linkend="guc-shared-buffers"/>, but smaller than the OS's page cache. |
| </para> |
| |
| <para> |
| The number of WAL segment files in <filename>pg_wal</filename> directory depends on |
| <varname>min_wal_size</varname>, <varname>max_wal_size</varname> and |
| the amount of WAL generated in previous checkpoint cycles. When old log |
| segment files are no longer needed, they are removed or recycled (that is, |
| renamed to become future segments in the numbered sequence). If, due to a |
| short-term peak of log output rate, <varname>max_wal_size</varname> is |
| exceeded, the unneeded segment files will be removed until the system |
| gets back under this limit. Below that limit, the system recycles enough |
| WAL files to cover the estimated need until the next checkpoint, and |
| removes the rest. The estimate is based on a moving average of the number |
| of WAL files used in previous checkpoint cycles. The moving average |
| is increased immediately if the actual usage exceeds the estimate, so it |
| accommodates peak usage rather than average usage to some extent. |
| <varname>min_wal_size</varname> puts a minimum on the amount of WAL files |
| recycled for future usage; that much WAL is always recycled for future use, |
| even if the system is idle and the WAL usage estimate suggests that little |
| WAL is needed. |
| </para> |
| |
| <para> |
| Independently of <varname>max_wal_size</varname>, |
| the most recent <xref linkend="guc-wal-keep-size"/> megabytes of |
| WAL files plus one additional WAL file are |
| kept at all times. Also, if WAL archiving is used, old segments cannot be |
| removed or recycled until they are archived. If WAL archiving cannot keep up |
| with the pace that WAL is generated, or if <varname>archive_command</varname> |
| fails repeatedly, old WAL files will accumulate in <filename>pg_wal</filename> |
| until the situation is resolved. A slow or failed standby server that |
| uses a replication slot will have the same effect (see |
| <xref linkend="streaming-replication-slots"/>). |
| </para> |
| |
| <para> |
| In archive recovery or standby mode, the server periodically performs |
| <firstterm>restartpoints</firstterm>,<indexterm><primary>restartpoint</primary></indexterm> |
| which are similar to checkpoints in normal operation: the server forces |
| all its state to disk, updates the <filename>pg_control</filename> file to |
| indicate that the already-processed WAL data need not be scanned again, |
| and then recycles any old log segment files in the <filename>pg_wal</filename> |
| directory. |
| Restartpoints can't be performed more frequently than checkpoints on the |
| primary because restartpoints can only be performed at checkpoint records. |
| A restartpoint is triggered when a checkpoint record is reached if at |
| least <varname>checkpoint_timeout</varname> seconds have passed since the last |
| restartpoint, or if WAL size is about to exceed |
| <varname>max_wal_size</varname>. However, because of limitations on when a |
| restartpoint can be performed, <varname>max_wal_size</varname> is often exceeded |
| during recovery, by up to one checkpoint cycle's worth of WAL. |
| (<varname>max_wal_size</varname> is never a hard limit anyway, so you should |
| always leave plenty of headroom to avoid running out of disk space.) |
| </para> |
| |
| <para> |
| There are two commonly used internal <acronym>WAL</acronym> functions: |
| <function>XLogInsertRecord</function> and <function>XLogFlush</function>. |
| <function>XLogInsertRecord</function> is used to place a new record into |
| the <acronym>WAL</acronym> buffers in shared memory. If there is no |
| space for the new record, <function>XLogInsertRecord</function> will have |
| to write (move to kernel cache) a few filled <acronym>WAL</acronym> |
| buffers. This is undesirable because <function>XLogInsertRecord</function> |
| is used on every database low level modification (for example, row |
| insertion) at a time when an exclusive lock is held on affected |
| data pages, so the operation needs to be as fast as possible. What |
| is worse, writing <acronym>WAL</acronym> buffers might also force the |
| creation of a new log segment, which takes even more |
| time. Normally, <acronym>WAL</acronym> buffers should be written |
| and flushed by an <function>XLogFlush</function> request, which is |
| made, for the most part, at transaction commit time to ensure that |
| transaction records are flushed to permanent storage. On systems |
| with high log output, <function>XLogFlush</function> requests might |
| not occur often enough to prevent <function>XLogInsertRecord</function> |
| from having to do writes. On such systems |
| one should increase the number of <acronym>WAL</acronym> buffers by |
| modifying the <xref linkend="guc-wal-buffers"/> parameter. When |
| <xref linkend="guc-full-page-writes"/> is set and the system is very busy, |
| setting <varname>wal_buffers</varname> higher will help smooth response times |
| during the period immediately following each checkpoint. |
| </para> |
| |
| <para> |
| The <xref linkend="guc-commit-delay"/> parameter defines for how many |
| microseconds a group commit leader process will sleep after acquiring a |
| lock within <function>XLogFlush</function>, while group commit |
| followers queue up behind the leader. This delay allows other server |
| processes to add their commit records to the WAL buffers so that all of |
| them will be flushed by the leader's eventual sync operation. No sleep |
| will occur if <xref linkend="guc-fsync"/> is not enabled, or if fewer |
| than <xref linkend="guc-commit-siblings"/> other sessions are currently |
| in active transactions; this avoids sleeping when it's unlikely that |
| any other session will commit soon. Note that on some platforms, the |
| resolution of a sleep request is ten milliseconds, so that any nonzero |
| <varname>commit_delay</varname> setting between 1 and 10000 |
| microseconds would have the same effect. Note also that on some |
| platforms, sleep operations may take slightly longer than requested by |
| the parameter. |
| </para> |
| |
| <para> |
| Since the purpose of <varname>commit_delay</varname> is to allow the |
| cost of each flush operation to be amortized across concurrently |
| committing transactions (potentially at the expense of transaction |
| latency), it is necessary to quantify that cost before the setting can |
| be chosen intelligently. The higher that cost is, the more effective |
| <varname>commit_delay</varname> is expected to be in increasing |
| transaction throughput, up to a point. The <xref |
| linkend="pgtestfsync"/> program can be used to measure the average time |
| in microseconds that a single WAL flush operation takes. A value of |
| half of the average time the program reports it takes to flush after a |
| single 8kB write operation is often the most effective setting for |
| <varname>commit_delay</varname>, so this value is recommended as the |
| starting point to use when optimizing for a particular workload. While |
| tuning <varname>commit_delay</varname> is particularly useful when the |
| WAL log is stored on high-latency rotating disks, benefits can be |
| significant even on storage media with very fast sync times, such as |
| solid-state drives or RAID arrays with a battery-backed write cache; |
| but this should definitely be tested against a representative workload. |
| Higher values of <varname>commit_siblings</varname> should be used in |
| such cases, whereas smaller <varname>commit_siblings</varname> values |
| are often helpful on higher latency media. Note that it is quite |
| possible that a setting of <varname>commit_delay</varname> that is too |
| high can increase transaction latency by so much that total transaction |
| throughput suffers. |
| </para> |
| |
| <para> |
| When <varname>commit_delay</varname> is set to zero (the default), it |
| is still possible for a form of group commit to occur, but each group |
| will consist only of sessions that reach the point where they need to |
| flush their commit records during the window in which the previous |
| flush operation (if any) is occurring. At higher client counts a |
| <quote>gangway effect</quote> tends to occur, so that the effects of group |
| commit become significant even when <varname>commit_delay</varname> is |
| zero, and thus explicitly setting <varname>commit_delay</varname> tends |
| to help less. Setting <varname>commit_delay</varname> can only help |
| when (1) there are some concurrently committing transactions, and (2) |
| throughput is limited to some degree by commit rate; but with high |
| rotational latency this setting can be effective in increasing |
| transaction throughput with as few as two clients (that is, a single |
| committing client with one sibling transaction). |
| </para> |
| |
| <para> |
| The <xref linkend="guc-wal-sync-method"/> parameter determines how |
| <productname>PostgreSQL</productname> will ask the kernel to force |
| <acronym>WAL</acronym> updates out to disk. |
| All the options should be the same in terms of reliability, with |
| the exception of <literal>fsync_writethrough</literal>, which can sometimes |
| force a flush of the disk cache even when other options do not do so. |
| However, it's quite platform-specific which one will be the fastest. |
| You can test the speeds of different options using the <xref |
| linkend="pgtestfsync"/> program. |
| Note that this parameter is irrelevant if <varname>fsync</varname> |
| has been turned off. |
| </para> |
| |
| <para> |
| Enabling the <xref linkend="guc-wal-debug"/> configuration parameter |
| (provided that <productname>PostgreSQL</productname> has been |
| compiled with support for it) will result in each |
| <function>XLogInsertRecord</function> and <function>XLogFlush</function> |
| <acronym>WAL</acronym> call being logged to the server log. This |
| option might be replaced by a more general mechanism in the future. |
| </para> |
| |
| <para> |
| There are two internal functions to write WAL data to disk: |
| <function>XLogWrite</function> and <function>issue_xlog_fsync</function>. |
| When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total |
| amounts of time <function>XLogWrite</function> writes and |
| <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as |
| <literal>wal_write_time</literal> and <literal>wal_sync_time</literal> in |
| <xref linkend="pg-stat-wal-view"/>, respectively. |
| <function>XLogWrite</function> is normally called by |
| <function>XLogInsertRecord</function> (when there is no space for the new |
| record in WAL buffers), <function>XLogFlush</function> and the WAL writer, |
| to write WAL buffers to disk and call <function>issue_xlog_fsync</function>. |
| <function>issue_xlog_fsync</function> is normally called by |
| <function>XLogWrite</function> to sync WAL files to disk. |
| If <varname>wal_sync_method</varname> is either |
| <literal>open_datasync</literal> or <literal>open_sync</literal>, |
| a write operation in <function>XLogWrite</function> guarantees to sync written |
| WAL data to disk and <function>issue_xlog_fsync</function> does nothing. |
| If <varname>wal_sync_method</varname> is either <literal>fdatasync</literal>, |
| <literal>fsync</literal>, or <literal>fsync_writethrough</literal>, |
| the write operation moves WAL buffers to kernel cache and |
| <function>issue_xlog_fsync</function> syncs them to disk. Regardless |
| of the setting of <varname>track_wal_io_timing</varname>, the number |
| of times <function>XLogWrite</function> writes and |
| <function>issue_xlog_fsync</function> syncs WAL data to disk are also |
| counted as <literal>wal_write</literal> and <literal>wal_sync</literal> |
| in <structname>pg_stat_wal</structname>, respectively. |
| </para> |
| </sect1> |
| |
| <sect1 id="wal-internals"> |
| <title>WAL Internals</title> |
| |
| <indexterm zone="wal-internals"> |
| <primary>LSN</primary> |
| </indexterm> |
| |
| <para> |
| <acronym>WAL</acronym> is automatically enabled; no action is |
| required from the administrator except ensuring that the |
| disk-space requirements for the <acronym>WAL</acronym> logs are met, |
| and that any necessary tuning is done (see <xref |
| linkend="wal-configuration"/>). |
| </para> |
| |
| <para> |
| <acronym>WAL</acronym> records are appended to the <acronym>WAL</acronym> |
| logs as each new record is written. The insert position is described by |
| a Log Sequence Number (<acronym>LSN</acronym>) that is a byte offset into |
| the logs, increasing monotonically with each new record. |
| <acronym>LSN</acronym> values are returned as the datatype |
| <link linkend="datatype-pg-lsn"><type>pg_lsn</type></link>. Values can be |
| compared to calculate the volume of <acronym>WAL</acronym> data that |
| separates them, so they are used to measure the progress of replication |
| and recovery. |
| </para> |
| |
| <para> |
| <acronym>WAL</acronym> logs are stored in the directory |
| <filename>pg_wal</filename> under the data directory, as a set of |
| segment files, normally each 16 MB in size (but the size can be changed |
| by altering the <option>--wal-segsize</option> <application>initdb</application> option). Each segment is |
| divided into pages, normally 8 kB each (this size can be changed via the |
| <option>--with-wal-blocksize</option> configure option). The log record headers |
| are described in <filename>access/xlogrecord.h</filename>; the record |
| content is dependent on the type of event that is being logged. Segment |
| files are given ever-increasing numbers as names, starting at |
| <filename>000000010000000000000001</filename>. The numbers do not wrap, |
| but it will take a very, very long time to exhaust the |
| available stock of numbers. |
| </para> |
| |
| <para> |
| It is advantageous if the log is located on a different disk from the |
| main database files. This can be achieved by moving the |
| <filename>pg_wal</filename> directory to another location (while the server |
| is shut down, of course) and creating a symbolic link from the |
| original location in the main data directory to the new location. |
| </para> |
| |
| <para> |
| The aim of <acronym>WAL</acronym> is to ensure that the log is |
| written before database records are altered, but this can be subverted by |
| disk drives<indexterm><primary>disk drive</primary></indexterm> that falsely report a |
| successful write to the kernel, |
| when in fact they have only cached the data and not yet stored it |
| on the disk. A power failure in such a situation might lead to |
| irrecoverable data corruption. Administrators should try to ensure |
| that disks holding <productname>PostgreSQL</productname>'s |
| <acronym>WAL</acronym> log files do not make such false reports. |
| (See <xref linkend="wal-reliability"/>.) |
| </para> |
| |
| <para> |
| After a checkpoint has been made and the log flushed, the |
| checkpoint's position is saved in the file |
| <filename>pg_control</filename>. Therefore, at the start of recovery, |
| the server first reads <filename>pg_control</filename> and |
| then the checkpoint record; then it performs the REDO operation by |
| scanning forward from the log location indicated in the checkpoint |
| record. Because the entire content of data pages is saved in the |
| log on the first page modification after a checkpoint (assuming |
| <xref linkend="guc-full-page-writes"/> is not disabled), all pages |
| changed since the checkpoint will be restored to a consistent |
| state. |
| </para> |
| |
| <para> |
| To deal with the case where <filename>pg_control</filename> is |
| corrupt, we should support the possibility of scanning existing log |
| segments in reverse order — newest to oldest — in order to find the |
| latest checkpoint. This has not been implemented yet. |
| <filename>pg_control</filename> is small enough (less than one disk page) |
| that it is not subject to partial-write problems, and as of this writing |
| there have been no reports of database failures due solely to the inability |
| to read <filename>pg_control</filename> itself. So while it is |
| theoretically a weak spot, <filename>pg_control</filename> does not |
| seem to be a problem in practice. |
| </para> |
| </sect1> |
| </chapter> |