| <!-- doc/src/sgml/storage.sgml --> |
| |
| <chapter id="storage"> |
| |
| <title>Database Physical Storage</title> |
| |
| <para> |
| This chapter provides an overview of the physical storage format used by |
| <productname>PostgreSQL</productname> databases. |
| </para> |
| |
| <sect1 id="storage-file-layout"> |
| |
| <title>Database File Layout</title> |
| |
| <para> |
| This section describes the storage format at the level of files and |
| directories. |
| </para> |
| |
| <para> |
| Traditionally, the configuration and data files used by a database |
| cluster are stored together within the cluster's data |
| directory, commonly referred to as <varname>PGDATA</varname> (after the name of the |
| environment variable that can be used to define it). A common location for |
| <varname>PGDATA</varname> is <filename>/var/lib/pgsql/data</filename>. Multiple clusters, |
| managed by different server instances, can exist on the same machine. |
| </para> |
| |
| <para> |
| The <varname>PGDATA</varname> directory contains several subdirectories and control |
| files, as shown in <xref linkend="pgdata-contents-table"/>. In addition to |
| these required items, the cluster configuration files |
| <filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and |
| <filename>pg_ident.conf</filename> are traditionally stored in |
| <varname>PGDATA</varname>, although it is possible to place them elsewhere. |
| </para> |
| |
| <table tocentry="1" id="pgdata-contents-table"> |
| <title>Contents of <varname>PGDATA</varname></title> |
| <tgroup cols="2"> |
| <thead> |
| <row> |
| <entry> |
| Item |
| </entry> |
| <entry>Description</entry> |
| </row> |
| </thead> |
| |
| <tbody> |
| |
| <row> |
| <entry><filename>PG_VERSION</filename></entry> |
| <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry> |
| </row> |
| |
| <row> |
| <entry><filename>base</filename></entry> |
| <entry>Subdirectory containing per-database subdirectories</entry> |
| </row> |
| |
| <row> |
| <entry><filename>current_logfiles</filename></entry> |
| <entry>File recording the log file(s) currently written to by the logging |
| collector</entry> |
| </row> |
| |
| <row> |
| <entry><filename>global</filename></entry> |
| <entry>Subdirectory containing cluster-wide tables, such as |
| <structname>pg_database</structname></entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_commit_ts</filename></entry> |
| <entry>Subdirectory containing transaction commit timestamp data</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_cryptokeys</filename></entry> |
| <entry>Subdirectory containing file encryption keys</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_dynshmem</filename></entry> |
| <entry>Subdirectory containing files used by the dynamic shared memory |
| subsystem</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_logical</filename></entry> |
| <entry>Subdirectory containing status data for logical decoding</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_multixact</filename></entry> |
| <entry>Subdirectory containing multitransaction status data |
| (used for shared row locks)</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_notify</filename></entry> |
| <entry>Subdirectory containing LISTEN/NOTIFY status data</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_replslot</filename></entry> |
| <entry>Subdirectory containing replication slot data</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_serial</filename></entry> |
| <entry>Subdirectory containing information about committed serializable transactions</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_snapshots</filename></entry> |
| <entry>Subdirectory containing exported snapshots</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_stat</filename></entry> |
| <entry>Subdirectory containing permanent files for the statistics |
| subsystem</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_stat_tmp</filename></entry> |
| <entry>Subdirectory containing temporary files for the statistics |
| subsystem</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_subtrans</filename></entry> |
| <entry>Subdirectory containing subtransaction status data</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_tblspc</filename></entry> |
| <entry>Subdirectory containing symbolic links to tablespaces</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_twophase</filename></entry> |
| <entry>Subdirectory containing state files for prepared transactions</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_wal</filename></entry> |
| <entry>Subdirectory containing WAL (Write Ahead Log) files</entry> |
| </row> |
| |
| <row> |
| <entry><filename>pg_xact</filename></entry> |
| <entry>Subdirectory containing transaction commit status data</entry> |
| </row> |
| |
| <row> |
| <entry><filename>postgresql.auto.conf</filename></entry> |
| <entry>A file used for storing configuration parameters that are set by |
| <command>ALTER SYSTEM</command></entry> |
| </row> |
| |
| <row> |
| <entry><filename>postmaster.opts</filename></entry> |
| <entry>A file recording the command-line options the server was |
| last started with</entry> |
| </row> |
| |
| <row> |
| <entry><filename>postmaster.pid</filename></entry> |
| <entry>A lock file recording the current postmaster process ID (PID), |
| cluster data directory path, |
| postmaster start timestamp, |
| port number, |
| Unix-domain socket directory path (could be empty), |
| first valid listen_address (IP address or <literal>*</literal>, or empty if |
| not listening on TCP), |
| and shared memory segment ID |
| (this file is not present after server shutdown)</entry> |
| </row> |
| |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| For each database in the cluster there is a subdirectory within |
| <varname>PGDATA</varname><filename>/base</filename>, named after the database's OID in |
| <structname>pg_database</structname>. This subdirectory is the default location |
| for the database's files; in particular, its system catalogs are stored |
| there. |
| </para> |
| |
| <para> |
| Note that the following sections describe the behavior of the builtin |
| <literal>heap</literal> <link linkend="tableam">table access method</link>, |
| and the builtin <link linkend="indexam">index access methods</link>. Due |
| to the extensible nature of <productname>PostgreSQL</productname>, other |
| access methods might work differently. |
| </para> |
| |
| <para> |
| Each table and index is stored in a separate file. For ordinary relations, |
| these files are named after the table or index's <firstterm>filenode</firstterm> number, |
| which can be found in <structname>pg_class</structname>.<structfield>relfilenode</structfield>. But |
| for temporary relations, the file name is of the form |
| <literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>, where <replaceable>BBB</replaceable> |
| is the backend ID of the backend which created the file, and <replaceable>FFF</replaceable> |
| is the filenode number. In either case, in addition to the main file (a/k/a |
| main fork), each table and index has a <firstterm>free space map</firstterm> (see <xref |
| linkend="storage-fsm"/>), which stores information about free space available in |
| the relation. The free space map is stored in a file named with the filenode |
| number plus the suffix <literal>_fsm</literal>. Tables also have a |
| <firstterm>visibility map</firstterm>, stored in a fork with the suffix <literal>_vm</literal>, |
| to track which pages are known to have no dead tuples. The visibility map is |
| described further in <xref linkend="storage-vm"/>. Unlogged tables and indexes |
| have a third fork, known as the initialization fork, which is stored in a fork |
| with the suffix <literal>_init</literal> (see <xref linkend="storage-init"/>). |
| </para> |
| |
| <caution> |
| <para> |
| Note that while a table's filenode often matches its OID, this is |
| <emphasis>not</emphasis> necessarily the case; some operations, like |
| <command>TRUNCATE</command>, <command>REINDEX</command>, <command>CLUSTER</command> and some forms |
| of <command>ALTER TABLE</command>, can change the filenode while preserving the OID. |
| Avoid assuming that filenode and table OID are the same. |
| Also, for certain system catalogs including <structname>pg_class</structname> itself, |
| <structname>pg_class</structname>.<structfield>relfilenode</structfield> contains zero. The |
| actual filenode number of these catalogs is stored in a lower-level data |
| structure, and can be obtained using the <function>pg_relation_filenode()</function> |
| function. |
| </para> |
| </caution> |
| |
| <para> |
| When a table or index exceeds 1 GB, it is divided into gigabyte-sized |
| <firstterm>segments</firstterm>. The first segment's file name is the same as the |
| filenode; subsequent segments are named filenode.1, filenode.2, etc. |
| This arrangement avoids problems on platforms that have file size limitations. |
| (Actually, 1 GB is just the default segment size. The segment size can be |
| adjusted using the configuration option <option>--with-segsize</option> |
| when building <productname>PostgreSQL</productname>.) |
| In principle, free space map and visibility map forks could require multiple |
| segments as well, though this is unlikely to happen in practice. |
| </para> |
| |
| <para> |
| A table that has columns with potentially large entries will have an |
| associated <firstterm>TOAST</firstterm> table, which is used for out-of-line storage of |
| field values that are too large to keep in the table rows proper. |
| <structname>pg_class</structname>.<structfield>reltoastrelid</structfield> links from a table to |
| its <acronym>TOAST</acronym> table, if any. |
| See <xref linkend="storage-toast"/> for more information. |
| </para> |
| |
| <para> |
| The contents of tables and indexes are discussed further in |
| <xref linkend="storage-page-layout"/>. |
| </para> |
| |
| <para> |
| Tablespaces make the scenario more complicated. Each user-defined tablespace |
| has a symbolic link inside the <varname>PGDATA</varname><filename>/pg_tblspc</filename> |
| directory, which points to the physical tablespace directory (i.e., the |
| location specified in the tablespace's <command>CREATE TABLESPACE</command> command). |
| This symbolic link is named after |
| the tablespace's OID. Inside the physical tablespace directory there is |
| a subdirectory with a name that depends on the <productname>PostgreSQL</productname> |
| server version, such as <literal>PG_9.0_201008051</literal>. (The reason for using |
| this subdirectory is so that successive versions of the database can use |
| the same <command>CREATE TABLESPACE</command> location value without conflicts.) |
| Within the version-specific subdirectory, there is |
| a subdirectory for each database that has elements in the tablespace, named |
| after the database's OID. Tables and indexes are stored within that |
| directory, using the filenode naming scheme. |
| The <literal>pg_default</literal> tablespace is not accessed through |
| <filename>pg_tblspc</filename>, but corresponds to |
| <varname>PGDATA</varname><filename>/base</filename>. Similarly, the <literal>pg_global</literal> |
| tablespace is not accessed through <filename>pg_tblspc</filename>, but corresponds to |
| <varname>PGDATA</varname><filename>/global</filename>. |
| </para> |
| |
| <para> |
| The <function>pg_relation_filepath()</function> function shows the entire path |
| (relative to <varname>PGDATA</varname>) of any relation. It is often useful |
| as a substitute for remembering many of the above rules. But keep in |
| mind that this function just gives the name of the first segment of the |
| main fork of the relation — you may need to append a segment number |
| and/or <literal>_fsm</literal>, <literal>_vm</literal>, or <literal>_init</literal> to find all |
| the files associated with the relation. |
| </para> |
| |
| <para> |
| Temporary files (for operations such as sorting more data than can fit in |
| memory) are created within <varname>PGDATA</varname><filename>/base/pgsql_tmp</filename>, |
| or within a <filename>pgsql_tmp</filename> subdirectory of a tablespace directory |
| if a tablespace other than <literal>pg_default</literal> is specified for them. |
| The name of a temporary file has the form |
| <filename>pgsql_tmp<replaceable>PPP</replaceable>.<replaceable>NNN</replaceable></filename>, |
| where <replaceable>PPP</replaceable> is the PID of the owning backend and |
| <replaceable>NNN</replaceable> distinguishes different temporary files of that backend. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="storage-toast"> |
| |
| <title>TOAST</title> |
| |
| <indexterm> |
| <primary>TOAST</primary> |
| </indexterm> |
| <indexterm><primary>sliced bread</primary><see>TOAST</see></indexterm> |
| |
| <para> |
| This section provides an overview of <acronym>TOAST</acronym> (The |
| Oversized-Attribute Storage Technique). |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> uses a fixed page size (commonly |
| 8 kB), and does not allow tuples to span multiple pages. Therefore, it is |
| not possible to store very large field values directly. To overcome |
| this limitation, large field values are compressed and/or broken up into |
| multiple physical rows. This happens transparently to the user, with only |
| small impact on most of the backend code. The technique is affectionately |
| known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>). |
| The <acronym>TOAST</acronym> infrastructure is also used to improve handling of |
| large data values in-memory. |
| </para> |
| |
| <para> |
| Only certain data types support <acronym>TOAST</acronym> — there is no need to |
| impose the overhead on data types that cannot produce large field values. |
| To support <acronym>TOAST</acronym>, a data type must have a variable-length |
| (<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first |
| four-byte word of any stored value contains the total length of the value in |
| bytes (including itself). <acronym>TOAST</acronym> does not constrain the rest |
| of the data type's representation. The special representations collectively |
| called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or |
| reinterpreting this initial length word. Therefore, the C-level functions |
| supporting a <acronym>TOAST</acronym>-able data type must be careful about how they |
| handle potentially <acronym>TOAST</acronym>ed input values: an input might not |
| actually consist of a four-byte length word and contents until after it's |
| been <firstterm>detoasted</firstterm>. (This is normally done by invoking |
| <function>PG_DETOAST_DATUM</function> before doing anything with an input value, |
| but in some cases more efficient approaches are possible. |
| See <xref linkend="xtypes-toast"/> for more detail.) |
| </para> |
| |
| <para> |
| <acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order |
| bits on big-endian machines, the low-order bits on little-endian machines), |
| thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able |
| data type to 1 GB (2<superscript>30</superscript> - 1 bytes). When both bits are zero, |
| the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and |
| the remaining bits of the length word give the total datum size (including |
| length word) in bytes. When the highest-order or lowest-order bit is set, |
| the value has only a single-byte header instead of the normal four-byte |
| header, and the remaining bits of that byte give the total datum size |
| (including length byte) in bytes. This alternative supports space-efficient |
| storage of values shorter than 127 bytes, while still allowing the data type |
| to grow to 1 GB at need. Values with single-byte headers aren't aligned on |
| any particular boundary, whereas values with four-byte headers are aligned on |
| at least a four-byte boundary; this omission of alignment padding provides |
| additional space savings that is significant compared to short values. |
| As a special case, if the remaining bits of a single-byte header are all |
| zero (which would be impossible for a self-inclusive length), the value is |
| a pointer to out-of-line data, with several possible alternatives as |
| described below. The type and size of such a <firstterm>TOAST pointer</firstterm> |
| are determined by a code stored in the second byte of the datum. |
| Lastly, when the highest-order or lowest-order bit is clear but the adjacent |
| bit is set, the content of the datum has been compressed and must be |
| decompressed before use. In this case the remaining bits of the four-byte |
| length word give the total size of the compressed datum, not the |
| original data. Note that compression is also possible for out-of-line data |
| but the varlena header does not tell whether it has occurred — |
| the content of the <acronym>TOAST</acronym> pointer tells that, instead. |
| </para> |
| |
| <para> |
| The compression technique used for either in-line or out-of-line compressed |
| data can be selected for each column by setting |
| the <literal>COMPRESSION</literal> column option in <command>CREATE |
| TABLE</command> or <command>ALTER TABLE</command>. The default for columns |
| with no explicit setting is to consult the |
| <xref linkend="guc-default-toast-compression"/> parameter at the time data is |
| inserted. |
| </para> |
| |
| <para> |
| As mentioned, there are multiple types of <acronym>TOAST</acronym> pointer datums. |
| The oldest and most common type is a pointer to out-of-line data stored in |
| a <firstterm><acronym>TOAST</acronym> table</firstterm> that is separate from, but |
| associated with, the table containing the <acronym>TOAST</acronym> pointer datum |
| itself. These <firstterm>on-disk</firstterm> pointer datums are created by the |
| <acronym>TOAST</acronym> management code (in <filename>access/common/toast_internals.c</filename>) |
| when a tuple to be stored on disk is too large to be stored as-is. |
| Further details appear in <xref linkend="storage-toast-ondisk"/>. |
| Alternatively, a <acronym>TOAST</acronym> pointer datum can contain a pointer to |
| out-of-line data that appears elsewhere in memory. Such datums are |
| necessarily short-lived, and will never appear on-disk, but they are very |
| useful for avoiding copying and redundant processing of large data values. |
| Further details appear in <xref linkend="storage-toast-inmemory"/>. |
| </para> |
| |
| <sect2 id="storage-toast-ondisk"> |
| <title>Out-of-Line, On-Disk TOAST Storage</title> |
| |
| <para> |
| If any of the columns of a table are <acronym>TOAST</acronym>-able, the table will |
| have an associated <acronym>TOAST</acronym> table, whose OID is stored in the table's |
| <structname>pg_class</structname>.<structfield>reltoastrelid</structfield> entry. On-disk |
| <acronym>TOAST</acronym>ed values are kept in the <acronym>TOAST</acronym> table, as |
| described in more detail below. |
| </para> |
| |
| <para> |
| Out-of-line values are divided (after compression if used) into chunks of at |
| most <symbol>TOAST_MAX_CHUNK_SIZE</symbol> bytes (by default this value is chosen |
| so that four chunk rows will fit on a page, making it about 2000 bytes). |
| Each chunk is stored as a separate row in the <acronym>TOAST</acronym> table |
| belonging to the owning table. Every |
| <acronym>TOAST</acronym> table has the columns <structfield>chunk_id</structfield> (an OID |
| identifying the particular <acronym>TOAST</acronym>ed value), |
| <structfield>chunk_seq</structfield> (a sequence number for the chunk within its value), |
| and <structfield>chunk_data</structfield> (the actual data of the chunk). A unique index |
| on <structfield>chunk_id</structfield> and <structfield>chunk_seq</structfield> provides fast |
| retrieval of the values. A pointer datum representing an out-of-line on-disk |
| <acronym>TOAST</acronym>ed value therefore needs to store the OID of the |
| <acronym>TOAST</acronym> table in which to look and the OID of the specific value |
| (its <structfield>chunk_id</structfield>). For convenience, pointer datums also store the |
| logical datum size (original uncompressed data length), physical stored size |
| (different if compression was applied), and the compression method used, if |
| any. Allowing for the varlena header bytes, |
| the total size of an on-disk <acronym>TOAST</acronym> pointer datum is therefore 18 |
| bytes regardless of the actual size of the represented value. |
| </para> |
| |
| <para> |
| The <acronym>TOAST</acronym> management code is triggered only |
| when a row value to be stored in a table is wider than |
| <symbol>TOAST_TUPLE_THRESHOLD</symbol> bytes (normally 2 kB). |
| The <acronym>TOAST</acronym> code will compress and/or move |
| field values out-of-line until the row value is shorter than |
| <symbol>TOAST_TUPLE_TARGET</symbol> bytes (also normally 2 kB, adjustable) |
| or no more gains can be had. During an UPDATE |
| operation, values of unchanged fields are normally preserved as-is; so an |
| UPDATE of a row with out-of-line values incurs no <acronym>TOAST</acronym> costs if |
| none of the out-of-line values change. |
| </para> |
| |
| <para> |
| The <acronym>TOAST</acronym> management code recognizes four different strategies |
| for storing <acronym>TOAST</acronym>-able columns on disk: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <literal>PLAIN</literal> prevents either compression or |
| out-of-line storage; furthermore it disables use of single-byte headers |
| for varlena types. |
| This is the only possible strategy for |
| columns of non-<acronym>TOAST</acronym>-able data types. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>EXTENDED</literal> allows both compression and out-of-line |
| storage. This is the default for most <acronym>TOAST</acronym>-able data types. |
| Compression will be attempted first, then out-of-line storage if |
| the row is still too big. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>EXTERNAL</literal> allows out-of-line storage but not |
| compression. Use of <literal>EXTERNAL</literal> will |
| make substring operations on wide <type>text</type> and |
| <type>bytea</type> columns faster (at the penalty of increased storage |
| space) because these operations are optimized to fetch only the |
| required parts of the out-of-line value when it is not compressed. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>MAIN</literal> allows compression but not out-of-line |
| storage. (Actually, out-of-line storage will still be performed |
| for such columns, but only as a last resort when there is no other |
| way to make the row small enough to fit on a page.) |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| Each <acronym>TOAST</acronym>-able data type specifies a default strategy for columns |
| of that data type, but the strategy for a given table column can be altered |
| with <link linkend="sql-altertable"><command>ALTER TABLE ... SET STORAGE</command></link>. |
| </para> |
| |
| <para> |
| <symbol>TOAST_TUPLE_TARGET</symbol> can be adjusted for each table using |
| <link linkend="sql-altertable"><command>ALTER TABLE ... SET (toast_tuple_target = N)</command></link> |
| </para> |
| |
| <para> |
| This scheme has a number of advantages compared to a more straightforward |
| approach such as allowing row values to span pages. Assuming that queries are |
| usually qualified by comparisons against relatively small key values, most of |
| the work of the executor will be done using the main row entry. The big values |
| of <acronym>TOAST</acronym>ed attributes will only be pulled out (if selected at all) |
| at the time the result set is sent to the client. Thus, the main table is much |
| smaller and more of its rows fit in the shared buffer cache than would be the |
| case without any out-of-line storage. Sort sets shrink also, and sorts will |
| more often be done entirely in memory. A little test showed that a table |
| containing typical HTML pages and their URLs was stored in about half of the |
| raw data size including the <acronym>TOAST</acronym> table, and that the main table |
| contained only about 10% of the entire data (the URLs and some small HTML |
| pages). There was no run time difference compared to an un-<acronym>TOAST</acronym>ed |
| comparison table, in which all the HTML pages were cut down to 7 kB to fit. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="storage-toast-inmemory"> |
| <title>Out-of-Line, In-Memory TOAST Storage</title> |
| |
| <para> |
| <acronym>TOAST</acronym> pointers can point to data that is not on disk, but is |
| elsewhere in the memory of the current server process. Such pointers |
| obviously cannot be long-lived, but they are nonetheless useful. There |
| are currently two sub-cases: |
| pointers to <firstterm>indirect</firstterm> data and |
| pointers to <firstterm>expanded</firstterm> data. |
| </para> |
| |
| <para> |
| Indirect <acronym>TOAST</acronym> pointers simply point at a non-indirect varlena |
| value stored somewhere in memory. This case was originally created merely |
| as a proof of concept, but it is currently used during logical decoding to |
| avoid possibly having to create physical tuples exceeding 1 GB (as pulling |
| all out-of-line field values into the tuple might do). The case is of |
| limited use since the creator of the pointer datum is entirely responsible |
| that the referenced data survives for as long as the pointer could exist, |
| and there is no infrastructure to help with this. |
| </para> |
| |
| <para> |
| Expanded <acronym>TOAST</acronym> pointers are useful for complex data types |
| whose on-disk representation is not especially suited for computational |
| purposes. As an example, the standard varlena representation of a |
| <productname>PostgreSQL</productname> array includes dimensionality information, a |
| nulls bitmap if there are any null elements, then the values of all the |
| elements in order. When the element type itself is variable-length, the |
| only way to find the <replaceable>N</replaceable>'th element is to scan through all the |
| preceding elements. This representation is appropriate for on-disk storage |
| because of its compactness, but for computations with the array it's much |
| nicer to have an <quote>expanded</quote> or <quote>deconstructed</quote> |
| representation in which all the element starting locations have been |
| identified. The <acronym>TOAST</acronym> pointer mechanism supports this need by |
| allowing a pass-by-reference Datum to point to either a standard varlena |
| value (the on-disk representation) or a <acronym>TOAST</acronym> pointer that |
| points to an expanded representation somewhere in memory. The details of |
| this expanded representation are up to the data type, though it must have |
| a standard header and meet the other API requirements given |
| in <filename>src/include/utils/expandeddatum.h</filename>. C-level functions |
| working with the data type can choose to handle either representation. |
| Functions that do not know about the expanded representation, but simply |
| apply <function>PG_DETOAST_DATUM</function> to their inputs, will automatically |
| receive the traditional varlena representation; so support for an expanded |
| representation can be introduced incrementally, one function at a time. |
| </para> |
| |
| <para> |
| <acronym>TOAST</acronym> pointers to expanded values are further broken down |
| into <firstterm>read-write</firstterm> and <firstterm>read-only</firstterm> pointers. |
| The pointed-to representation is the same either way, but a function that |
| receives a read-write pointer is allowed to modify the referenced value |
| in-place, whereas one that receives a read-only pointer must not; it must |
| first create a copy if it wants to make a modified version of the value. |
| This distinction and some associated conventions make it possible to avoid |
| unnecessary copying of expanded values during query execution. |
| </para> |
| |
| <para> |
| For all types of in-memory <acronym>TOAST</acronym> pointer, the <acronym>TOAST</acronym> |
| management code ensures that no such pointer datum can accidentally get |
| stored on disk. In-memory <acronym>TOAST</acronym> pointers are automatically |
| expanded to normal in-line varlena values before storage — and then |
| possibly converted to on-disk <acronym>TOAST</acronym> pointers, if the containing |
| tuple would otherwise be too big. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="storage-fsm"> |
| |
| <title>Free Space Map</title> |
| |
| <indexterm> |
| <primary>Free Space Map</primary> |
| </indexterm> |
| <indexterm><primary>FSM</primary><see>Free Space Map</see></indexterm> |
| |
| <para> |
| Each heap and index relation, except for hash indexes, has a Free Space Map |
| (FSM) to keep track of available space in the relation. It's stored |
| alongside the main relation data in a separate relation fork, named after the |
| filenode number of the relation, plus a <literal>_fsm</literal> suffix. For example, |
| if the filenode of a relation is 12345, the FSM is stored in a file called |
| <filename>12345_fsm</filename>, in the same directory as the main relation file. |
| </para> |
| |
| <para> |
| The Free Space Map is organized as a tree of <acronym>FSM</acronym> pages. The |
| bottom level <acronym>FSM</acronym> pages store the free space available on each |
| heap (or index) page, using one byte to represent each such page. The upper |
| levels aggregate information from the lower levels. |
| </para> |
| |
| <para> |
| Within each <acronym>FSM</acronym> page is a binary tree, stored in an array with |
| one byte per node. Each leaf node represents a heap page, or a lower level |
| <acronym>FSM</acronym> page. In each non-leaf node, the higher of its children's |
| values is stored. The maximum value in the leaf nodes is therefore stored |
| at the root. |
| </para> |
| |
| <para> |
| See <filename>src/backend/storage/freespace/README</filename> for more details on |
| how the <acronym>FSM</acronym> is structured, and how it's updated and searched. |
| The <xref linkend="pgfreespacemap"/> module |
| can be used to examine the information stored in free space maps. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="storage-vm"> |
| |
| <title>Visibility Map</title> |
| |
| <indexterm> |
| <primary>Visibility Map</primary> |
| </indexterm> |
| <indexterm><primary>VM</primary><see>Visibility Map</see></indexterm> |
| |
| <para> |
| Each heap relation has a Visibility Map |
| (VM) to keep track of which pages contain only tuples that are known to be |
| visible to all active transactions; it also keeps track of which pages contain |
| only frozen tuples. It's stored |
| alongside the main relation data in a separate relation fork, named after the |
| filenode number of the relation, plus a <literal>_vm</literal> suffix. For example, |
| if the filenode of a relation is 12345, the VM is stored in a file called |
| <filename>12345_vm</filename>, in the same directory as the main relation file. |
| Note that indexes do not have VMs. |
| </para> |
| |
| <para> |
| The visibility map stores two bits per heap page. The first bit, if set, |
| indicates that the page is all-visible, or in other words that the page does |
| not contain any tuples that need to be vacuumed. |
| This information can also be used |
| by <link linkend="indexes-index-only-scans"><firstterm>index-only |
| scans</firstterm></link> to answer queries using only the index tuple. |
| The second bit, if set, means that all tuples on the page have been frozen. |
| That means that even an anti-wraparound vacuum need not revisit the page. |
| </para> |
| |
| <para> |
| The map is conservative in the sense that we make sure that whenever a bit is |
| set, we know the condition is true, but if a bit is not set, it might or |
| might not be true. Visibility map bits are only set by vacuum, but are |
| cleared by any data-modifying operations on a page. |
| </para> |
| |
| <para> |
| The <xref linkend="pgvisibility"/> module can be used to examine the |
| information stored in the visibility map. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="storage-init"> |
| |
| <title>The Initialization Fork</title> |
| |
| <indexterm> |
| <primary>Initialization Fork</primary> |
| </indexterm> |
| |
| <para> |
| Each unlogged table, and each index on an unlogged table, has an initialization |
| fork. The initialization fork is an empty table or index of the appropriate |
| type. When an unlogged table must be reset to empty due to a crash, the |
| initialization fork is copied over the main fork, and any other forks are |
| erased (they will be recreated automatically as needed). |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="storage-page-layout"> |
| |
| <title>Database Page Layout</title> |
| |
| <para> |
| This section provides an overview of the page format used within |
| <productname>PostgreSQL</productname> tables and indexes.<footnote> |
| <para> |
| Actually, use of this page format is not required for either table or |
| index access methods. The <literal>heap</literal> table access method |
| always uses this format. All the existing index methods also use the |
| basic format, but the data kept on index metapages usually doesn't follow |
| the item layout rules. |
| </para> |
| </footnote> |
| Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table. |
| </para> |
| |
| <para> |
| In the following explanation, a |
| <firstterm>byte</firstterm> |
| is assumed to contain 8 bits. In addition, the term |
| <firstterm>item</firstterm> |
| refers to an individual data value that is stored on a page. In a table, |
| an item is a row; in an index, an item is an index entry. |
| </para> |
| |
| <para> |
| Every table and index is stored as an array of <firstterm>pages</firstterm> of a |
| fixed size (usually 8 kB, although a different page size can be selected |
| when compiling the server). In a table, all the pages are logically |
| equivalent, so a particular item (row) can be stored in any page. In |
| indexes, the first page is generally reserved as a <firstterm>metapage</firstterm> |
| holding control information, and there can be different types of pages |
| within the index, depending on the index access method. |
| </para> |
| |
| <para> |
| <xref linkend="page-table"/> shows the overall layout of a page. |
| There are five parts to each page. |
| </para> |
| |
| <table tocentry="1" id="page-table"> |
| <title>Overall Page Layout</title> |
| <titleabbrev>Page Layout</titleabbrev> |
| <tgroup cols="2"> |
| <thead> |
| <row> |
| <entry> |
| Item |
| </entry> |
| <entry>Description</entry> |
| </row> |
| </thead> |
| |
| <tbody> |
| |
| <row> |
| <entry>PageHeaderData</entry> |
| <entry>24 bytes long. Contains general information about the page, including |
| free space pointers.</entry> |
| </row> |
| |
| <row> |
| <entry>ItemIdData</entry> |
| <entry>Array of item identifiers pointing to the actual items. Each |
| entry is an (offset,length) pair. 4 bytes per item.</entry> |
| </row> |
| |
| <row> |
| <entry>Free space</entry> |
| <entry>The unallocated space. New item identifiers are allocated from |
| the start of this area, new items from the end.</entry> |
| </row> |
| |
| <row> |
| <entry>Items</entry> |
| <entry>The actual items themselves.</entry> |
| </row> |
| |
| <row> |
| <entry>Special space</entry> |
| <entry>Index access method specific data. Different methods store different |
| data. Empty in ordinary tables.</entry> |
| </row> |
| |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| |
| The first 24 bytes of each page consists of a page header |
| (<structname>PageHeaderData</structname>). Its format is detailed in <xref |
| linkend="pageheaderdata-table"/>. The first field tracks the most |
| recent WAL entry related to this page. The second field contains |
| the page checksum if <xref linkend="app-initdb-data-checksums"/> are |
| enabled. Next is a 2-byte field containing flag bits. This is followed |
| by three 2-byte integer fields (<structfield>pd_lower</structfield>, |
| <structfield>pd_upper</structfield>, and |
| <structfield>pd_special</structfield>). These contain byte offsets |
| from the page start to the start of unallocated space, to the end of |
| unallocated space, and to the start of the special space. The next 2 |
| bytes of the page header, <structfield>pd_pagesize_version</structfield>, |
| store both the page size and a version indicator. Beginning with |
| <productname>PostgreSQL</productname> 8.3 the version number is 4; |
| <productname>PostgreSQL</productname> 8.1 and 8.2 used version number 3; |
| <productname>PostgreSQL</productname> 8.0 used version number 2; |
| <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1; |
| prior releases used version number 0. |
| (The basic page layout and header format has not changed in most of these |
| versions, but the layout of heap row headers has.) The page size |
| is basically only present as a cross-check; there is no support for having |
| more than one page size in an installation. |
| The last field is a hint that shows whether pruning the page is likely |
| to be profitable: it tracks the oldest un-pruned XMAX on the page. |
| |
| </para> |
| |
| <table tocentry="1" id="pageheaderdata-table"> |
| <title>PageHeaderData Layout</title> |
| <titleabbrev>PageHeaderData Layout</titleabbrev> |
| <tgroup cols="4"> |
| <thead> |
| <row> |
| <entry>Field</entry> |
| <entry>Type</entry> |
| <entry>Length</entry> |
| <entry>Description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>pd_lsn</entry> |
| <entry>PageXLogRecPtr</entry> |
| <entry>8 bytes</entry> |
| <entry>LSN: next byte after last byte of WAL record for last change |
| to this page</entry> |
| </row> |
| <row> |
| <entry>pd_checksum</entry> |
| <entry>uint16</entry> |
| <entry>2 bytes</entry> |
| <entry>Page checksum</entry> |
| </row> |
| <row> |
| <entry>pd_flags</entry> |
| <entry>uint16</entry> |
| <entry>2 bytes</entry> |
| <entry>Flag bits</entry> |
| </row> |
| <row> |
| <entry>pd_lower</entry> |
| <entry>LocationIndex</entry> |
| <entry>2 bytes</entry> |
| <entry>Offset to start of free space</entry> |
| </row> |
| <row> |
| <entry>pd_upper</entry> |
| <entry>LocationIndex</entry> |
| <entry>2 bytes</entry> |
| <entry>Offset to end of free space</entry> |
| </row> |
| <row> |
| <entry>pd_special</entry> |
| <entry>LocationIndex</entry> |
| <entry>2 bytes</entry> |
| <entry>Offset to start of special space</entry> |
| </row> |
| <row> |
| <entry>pd_pagesize_version</entry> |
| <entry>uint16</entry> |
| <entry>2 bytes</entry> |
| <entry>Page size and layout version number information</entry> |
| </row> |
| <row> |
| <entry>pd_prune_xid</entry> |
| <entry>TransactionId</entry> |
| <entry>4 bytes</entry> |
| <entry>Oldest unpruned XMAX on page, or zero if none</entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| All the details can be found in |
| <filename>src/include/storage/bufpage.h</filename>. |
| </para> |
| |
| <para> |
| Following the page header are item identifiers |
| (<type>ItemIdData</type>), each requiring four bytes. |
| An item identifier contains a byte-offset to |
| the start of an item, its length in bytes, and a few attribute bits |
| which affect its interpretation. |
| New item identifiers are allocated |
| as needed from the beginning of the unallocated space. |
| The number of item identifiers present can be determined by looking at |
| <structfield>pd_lower</structfield>, which is increased to allocate a new identifier. |
| Because an item |
| identifier is never moved until it is freed, its index can be used on a |
| long-term basis to reference an item, even when the item itself is moved |
| around on the page to compact free space. In fact, every pointer to an |
| item (<type>ItemPointer</type>, also known as |
| <type>CTID</type>) created by |
| <productname>PostgreSQL</productname> consists of a page number and the |
| index of an item identifier. |
| |
| </para> |
| |
| <para> |
| |
| The items themselves are stored in space allocated backwards from the end |
| of unallocated space. The exact structure varies depending on what the |
| table is to contain. Tables and sequences both use a structure named |
| <type>HeapTupleHeaderData</type>, described below. |
| |
| </para> |
| |
| <para> |
| |
| The final section is the <quote>special section</quote> which can |
| contain anything the access method wishes to store. For example, |
| b-tree indexes store links to the page's left and right siblings, |
| as well as some other data relevant to the index structure. |
| Ordinary tables do not use a special section at all (indicated by setting |
| <structfield>pd_special</structfield> to equal the page size). |
| |
| </para> |
| |
| <para> |
| <xref linkend="storage-page-layout-figure"/> illustrates how these parts are |
| laid out in a page. |
| </para> |
| |
| <figure id="storage-page-layout-figure"> |
| <title>Page Layout</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/pagelayout.svg" format="SVG" width="100%"/> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
| <sect2 id="storage-tuple-layout"> |
| |
| <title>Table Row Layout</title> |
| |
| <para> |
| |
| All table rows are structured in the same way. There is a fixed-size |
| header (occupying 23 bytes on most machines), followed by an optional null |
| bitmap, an optional object ID field, and the user data. The header is |
| detailed |
| in <xref linkend="heaptupleheaderdata-table"/>. The actual user data |
| (columns of the row) begins at the offset indicated by |
| <structfield>t_hoff</structfield>, which must always be a multiple of the MAXALIGN |
| distance for the platform. |
| The null bitmap is |
| only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in |
| <structfield>t_infomask</structfield>. If it is present it begins just after |
| the fixed header and occupies enough bytes to have one bit per data column |
| (that is, the number of bits that equals the attribute count in |
| <structfield>t_infomask2</structfield>). In this list of bits, a |
| 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not |
| present, all columns are assumed not-null. |
| The object ID is only present if the <firstterm>HEAP_HASOID_OLD</firstterm> bit |
| is set in <structfield>t_infomask</structfield>. If present, it appears just |
| before the <structfield>t_hoff</structfield> boundary. Any padding needed to make |
| <structfield>t_hoff</structfield> a MAXALIGN multiple will appear between the null |
| bitmap and the object ID. (This in turn ensures that the object ID is |
| suitably aligned.) |
| |
| </para> |
| |
| <table tocentry="1" id="heaptupleheaderdata-table"> |
| <title>HeapTupleHeaderData Layout</title> |
| <titleabbrev>HeapTupleHeaderData Layout</titleabbrev> |
| <tgroup cols="4"> |
| <thead> |
| <row> |
| <entry>Field</entry> |
| <entry>Type</entry> |
| <entry>Length</entry> |
| <entry>Description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>t_xmin</entry> |
| <entry>TransactionId</entry> |
| <entry>4 bytes</entry> |
| <entry>insert XID stamp</entry> |
| </row> |
| <row> |
| <entry>t_xmax</entry> |
| <entry>TransactionId</entry> |
| <entry>4 bytes</entry> |
| <entry>delete XID stamp</entry> |
| </row> |
| <row> |
| <entry>t_cid</entry> |
| <entry>CommandId</entry> |
| <entry>4 bytes</entry> |
| <entry>insert and/or delete CID stamp (overlays with t_xvac)</entry> |
| </row> |
| <row> |
| <entry>t_xvac</entry> |
| <entry>TransactionId</entry> |
| <entry>4 bytes</entry> |
| <entry>XID for VACUUM operation moving a row version</entry> |
| </row> |
| <row> |
| <entry>t_ctid</entry> |
| <entry>ItemPointerData</entry> |
| <entry>6 bytes</entry> |
| <entry>current TID of this or newer row version</entry> |
| </row> |
| <row> |
| <entry>t_infomask2</entry> |
| <entry>uint16</entry> |
| <entry>2 bytes</entry> |
| <entry>number of attributes, plus various flag bits</entry> |
| </row> |
| <row> |
| <entry>t_infomask</entry> |
| <entry>uint16</entry> |
| <entry>2 bytes</entry> |
| <entry>various flag bits</entry> |
| </row> |
| <row> |
| <entry>t_hoff</entry> |
| <entry>uint8</entry> |
| <entry>1 byte</entry> |
| <entry>offset to user data</entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| All the details can be found in |
| <filename>src/include/access/htup_details.h</filename>. |
| </para> |
| |
| <para> |
| |
| Interpreting the actual data can only be done with information obtained |
| from other tables, mostly <structname>pg_attribute</structname>. The |
| key values needed to identify field locations are |
| <structfield>attlen</structfield> and <structfield>attalign</structfield>. |
| There is no way to directly get a |
| particular attribute, except when there are only fixed width fields and no |
| null values. All this trickery is wrapped up in the functions |
| <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm> |
| and <firstterm>heap_getsysattr</firstterm>. |
| |
| </para> |
| <para> |
| |
| To read the data you need to examine each attribute in turn. First check |
| whether the field is NULL according to the null bitmap. If it is, go to |
| the next. Then make sure you have the right alignment. If the field is a |
| fixed width field, then all the bytes are simply placed. If it's a |
| variable length field (attlen = -1) then it's a bit more complicated. |
| All variable-length data types share the common header structure |
| <type>struct varlena</type>, which includes the total length of the stored |
| value and some flag bits. Depending on the flags, the data can be either |
| inline or in a <acronym>TOAST</acronym> table; |
| it might be compressed, too (see <xref linkend="storage-toast"/>). |
| |
| </para> |
| </sect2> |
| </sect1> |
| |
| </chapter> |