| <!-- doc/src/sgml/arch-dev.sgml --> | |
| <chapter id="overview"> | |
| <title>Overview of PostgreSQL Internals</title> | |
| <note> | |
| <title>Author</title> | |
| <para> | |
| This chapter originated as part of | |
| <xref linkend="sim98"/> Stefan Simkovics' | |
| Master's Thesis prepared at Vienna University of Technology under the direction | |
| of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr. | |
| </para> | |
| </note> | |
| <para> | |
| This chapter gives an overview of the internal structure of the | |
| backend of <productname>PostgreSQL</productname>. After having | |
| read the following sections you should have an idea of how a query | |
| is processed. This chapter is intended to help the reader | |
| understand the general sequence of operations that occur within the | |
| backend from the point at which a query is received, to the point | |
| at which the results are returned to the client. | |
| </para> | |
| <sect1 id="query-path"> | |
| <title>The Path of a Query</title> | |
| <para> | |
| Here we give a short overview of the stages a query has to pass | |
| to obtain a result. | |
| </para> | |
| <procedure> | |
| <step> | |
| <para> | |
| A connection from an application program to the <productname>PostgreSQL</productname> | |
| server has to be established. The application program transmits a | |
| query to the server and waits to receive the results sent back by the | |
| server. | |
| </para> | |
| </step> | |
| <step> | |
| <para> | |
| The <firstterm>parser stage</firstterm> checks the query | |
| transmitted by the application | |
| program for correct syntax and creates | |
| a <firstterm>query tree</firstterm>. | |
| </para> | |
| </step> | |
| <step> | |
| <para> | |
| The <firstterm>rewrite system</firstterm> takes | |
| the query tree created by the parser stage and looks for | |
| any <firstterm>rules</firstterm> (stored in the | |
| <firstterm>system catalogs</firstterm>) to apply to | |
| the query tree. It performs the | |
| transformations given in the <firstterm>rule bodies</firstterm>. | |
| </para> | |
| <para> | |
| One application of the rewrite system is in the realization of | |
| <firstterm>views</firstterm>. | |
| Whenever a query against a view | |
| (i.e., a <firstterm>virtual table</firstterm>) is made, | |
| the rewrite system rewrites the user's query to | |
| a query that accesses the <firstterm>base tables</firstterm> given in | |
| the <firstterm>view definition</firstterm> instead. | |
| </para> | |
| </step> | |
| <step> | |
| <para> | |
| The <firstterm>planner/optimizer</firstterm> takes | |
| the (rewritten) query tree and creates a | |
| <firstterm>query plan</firstterm> that will be the input to the | |
| <firstterm>executor</firstterm>. | |
| </para> | |
| <para> | |
| It does so by first creating all possible <firstterm>paths</firstterm> | |
| leading to the same result. For example if there is an index on a | |
| relation to be scanned, there are two paths for the | |
| scan. One possibility is a simple sequential scan and the other | |
| possibility is to use the index. Next the cost for the execution of | |
| each path is estimated and the cheapest path is chosen. The cheapest | |
| path is expanded into a complete plan that the executor can use. | |
| </para> | |
| </step> | |
| <step> | |
| <para> | |
| The executor recursively steps through | |
| the <firstterm>plan tree</firstterm> and | |
| retrieves rows in the way represented by the plan. | |
| The executor makes use of the | |
| <firstterm>storage system</firstterm> while scanning | |
| relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>, | |
| evaluates <firstterm>qualifications</firstterm> and finally hands back the rows derived. | |
| </para> | |
| </step> | |
| </procedure> | |
| <para> | |
| In the following sections we will cover each of the above listed items | |
| in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal | |
| control and data structures. | |
| </para> | |
| </sect1> | |
| <sect1 id="connect-estab"> | |
| <title>How Connections Are Established</title> | |
| <para> | |
| <productname>PostgreSQL</productname> implements a | |
| <quote>process per user</quote> client/server model. | |
| In this model, every | |
| <glossterm linkend="glossary-client">client process</glossterm> | |
| connects to exactly one | |
| <glossterm linkend="glossary-backend">backend process</glossterm>. | |
| As we do not know ahead of time how many connections will be made, | |
| we have to use a <quote>supervisor process</quote> that spawns a new | |
| backend process every time a connection is requested. This supervisor | |
| process is called | |
| <glossterm linkend="glossary-postmaster">postmaster</glossterm> | |
| and listens at a specified TCP/IP port for incoming connections. | |
| Whenever it detects a request for a connection, it spawns a new | |
| backend process. Those backend processes communicate with each | |
| other and with other processes of the | |
| <glossterm linkend="glossary-instance">instance</glossterm> | |
| using <firstterm>semaphores</firstterm> and | |
| <glossterm linkend="glossary-shared-memory">shared memory</glossterm> | |
| to ensure data integrity throughout concurrent data access. | |
| </para> | |
| <para> | |
| The client process can be any program that understands the | |
| <productname>PostgreSQL</productname> protocol described in | |
| <xref linkend="protocol"/>. Many clients are based on the | |
| C-language library <application>libpq</application>, but several independent | |
| implementations of the protocol exist, such as the Java | |
| <application>JDBC</application> driver. | |
| </para> | |
| <para> | |
| Once a connection is established, the client process can send a query | |
| to the backend process it's connected to. The query is transmitted using | |
| plain text, i.e., there is no parsing done in the client. The backend | |
| process parses the query, creates an <firstterm>execution plan</firstterm>, | |
| executes the plan, and returns the retrieved rows to the client | |
| by transmitting them over the established connection. | |
| </para> | |
| </sect1> | |
| <sect1 id="parser-stage"> | |
| <title>The Parser Stage</title> | |
| <para> | |
| The <firstterm>parser stage</firstterm> consists of two parts: | |
| <itemizedlist> | |
| <listitem> | |
| <para> | |
| The <firstterm>parser</firstterm> defined in | |
| <filename>gram.y</filename> and <filename>scan.l</filename> is | |
| built using the Unix tools <application>bison</application> | |
| and <application>flex</application>. | |
| </para> | |
| </listitem> | |
| <listitem> | |
| <para> | |
| The <firstterm>transformation process</firstterm> does | |
| modifications and augmentations to the data structures returned by the parser. | |
| </para> | |
| </listitem> | |
| </itemizedlist> | |
| </para> | |
| <sect2> | |
| <title>Parser</title> | |
| <para> | |
| The parser has to check the query string (which arrives as plain | |
| text) for valid syntax. If the syntax is correct a | |
| <firstterm>parse tree</firstterm> is built up and handed back; | |
| otherwise an error is returned. The parser and lexer are | |
| implemented using the well-known Unix tools <application>bison</application> | |
| and <application>flex</application>. | |
| </para> | |
| <para> | |
| The <firstterm>lexer</firstterm> is defined in the file | |
| <filename>scan.l</filename> and is responsible | |
| for recognizing <firstterm>identifiers</firstterm>, | |
| the <firstterm>SQL key words</firstterm> etc. For | |
| every key word or identifier that is found, a <firstterm>token</firstterm> | |
| is generated and handed to the parser. | |
| </para> | |
| <para> | |
| The parser is defined in the file <filename>gram.y</filename> and | |
| consists of a set of <firstterm>grammar rules</firstterm> and | |
| <firstterm>actions</firstterm> that are executed whenever a rule | |
| is fired. The code of the actions (which is actually C code) is | |
| used to build up the parse tree. | |
| </para> | |
| <para> | |
| The file <filename>scan.l</filename> is transformed to the C | |
| source file <filename>scan.c</filename> using the program | |
| <application>flex</application> and <filename>gram.y</filename> is | |
| transformed to <filename>gram.c</filename> using | |
| <application>bison</application>. After these transformations | |
| have taken place a normal C compiler can be used to create the | |
| parser. Never make any changes to the generated C files as they | |
| will be overwritten the next time <application>flex</application> | |
| or <application>bison</application> is called. | |
| <note> | |
| <para> | |
| The mentioned transformations and compilations are normally done | |
| automatically using the <firstterm>makefiles</firstterm> | |
| shipped with the <productname>PostgreSQL</productname> | |
| source distribution. | |
| </para> | |
| </note> | |
| </para> | |
| <para> | |
| A detailed description of <application>bison</application> or | |
| the grammar rules given in <filename>gram.y</filename> would be | |
| beyond the scope of this manual. There are many books and | |
| documents dealing with <application>flex</application> and | |
| <application>bison</application>. You should be familiar with | |
| <application>bison</application> before you start to study the | |
| grammar given in <filename>gram.y</filename> otherwise you won't | |
| understand what happens there. | |
| </para> | |
| </sect2> | |
| <sect2> | |
| <title>Transformation Process</title> | |
| <para> | |
| The parser stage creates a parse tree using only fixed rules about | |
| the syntactic structure of SQL. It does not make any lookups in the | |
| system catalogs, so there is no possibility to understand the detailed | |
| semantics of the requested operations. After the parser completes, | |
| the <firstterm>transformation process</firstterm> takes the tree handed | |
| back by the parser as input and does the semantic interpretation needed | |
| to understand which tables, functions, and operators are referenced by | |
| the query. The data structure that is built to represent this | |
| information is called the <firstterm>query tree</firstterm>. | |
| </para> | |
| <para> | |
| The reason for separating raw parsing from semantic analysis is that | |
| system catalog lookups can only be done within a transaction, and we | |
| do not wish to start a transaction immediately upon receiving a query | |
| string. The raw parsing stage is sufficient to identify the transaction | |
| control commands (<command>BEGIN</command>, <command>ROLLBACK</command>, etc), and | |
| these can then be correctly executed without any further analysis. | |
| Once we know that we are dealing with an actual query (such as | |
| <command>SELECT</command> or <command>UPDATE</command>), it is okay to | |
| start a transaction if we're not already in one. Only then can the | |
| transformation process be invoked. | |
| </para> | |
| <para> | |
| The query tree created by the transformation process is structurally | |
| similar to the raw parse tree in most places, but it has many differences | |
| in detail. For example, a <structname>FuncCall</structname> node in the | |
| parse tree represents something that looks syntactically like a function | |
| call. This might be transformed to either a <structname>FuncExpr</structname> | |
| or <structname>Aggref</structname> node depending on whether the referenced | |
| name turns out to be an ordinary function or an aggregate function. | |
| Also, information about the actual data types of columns and expression | |
| results is added to the query tree. | |
| </para> | |
| </sect2> | |
| </sect1> | |
| <sect1 id="rule-system"> | |
| <title>The <productname>PostgreSQL</productname> Rule System</title> | |
| <para> | |
| <productname>PostgreSQL</productname> supports a powerful | |
| <firstterm>rule system</firstterm> for the specification | |
| of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>. | |
| Originally the <productname>PostgreSQL</productname> | |
| rule system consisted of two implementations: | |
| <itemizedlist> | |
| <listitem> | |
| <para> | |
| The first one worked using <firstterm>row level</firstterm> processing and was | |
| implemented deep in the <firstterm>executor</firstterm>. The rule system was | |
| called whenever an individual row had been accessed. This | |
| implementation was removed in 1995 when the last official release | |
| of the <productname>Berkeley Postgres</productname> project was | |
| transformed into <productname>Postgres95</productname>. | |
| </para> | |
| </listitem> | |
| <listitem> | |
| <para> | |
| The second implementation of the rule system is a technique | |
| called <firstterm>query rewriting</firstterm>. | |
| The <firstterm>rewrite system</firstterm> is a module | |
| that exists between the <firstterm>parser stage</firstterm> and the | |
| <firstterm>planner/optimizer</firstterm>. This technique is still implemented. | |
| </para> | |
| </listitem> | |
| </itemizedlist> | |
| </para> | |
| <para> | |
| The query rewriter is discussed in some detail in | |
| <xref linkend="rules"/>, so there is no need to cover it here. | |
| We will only point out that both the input and the output of the | |
| rewriter are query trees, that is, there is no change in the | |
| representation or level of semantic detail in the trees. Rewriting | |
| can be thought of as a form of macro expansion. | |
| </para> | |
| </sect1> | |
| <sect1 id="planner-optimizer"> | |
| <title>Planner/Optimizer</title> | |
| <para> | |
| The task of the <firstterm>planner/optimizer</firstterm> is to | |
| create an optimal execution plan. A given SQL query (and hence, a | |
| query tree) can be actually executed in a wide variety of | |
| different ways, each of which will produce the same set of | |
| results. If it is computationally feasible, the query optimizer | |
| will examine each of these possible execution plans, ultimately | |
| selecting the execution plan that is expected to run the fastest. | |
| </para> | |
| <note> | |
| <para> | |
| In some situations, examining each possible way in which a query | |
| can be executed would take an excessive amount of time and memory. | |
| In particular, this occurs when executing queries | |
| involving large numbers of join operations. In order to determine | |
| a reasonable (not necessarily optimal) query plan in a reasonable amount | |
| of time, <productname>PostgreSQL</productname> uses a <firstterm>Genetic | |
| Query Optimizer</firstterm> (see <xref linkend="geqo"/>) when the number of joins | |
| exceeds a threshold (see <xref linkend="guc-geqo-threshold"/>). | |
| </para> | |
| </note> | |
| <para> | |
| The planner's search procedure actually works with data structures | |
| called <firstterm>paths</firstterm>, which are simply cut-down representations of | |
| plans containing only as much information as the planner needs to make | |
| its decisions. After the cheapest path is determined, a full-fledged | |
| <firstterm>plan tree</firstterm> is built to pass to the executor. This represents | |
| the desired execution plan in sufficient detail for the executor to run it. | |
| In the rest of this section we'll ignore the distinction between paths | |
| and plans. | |
| </para> | |
| <sect2> | |
| <title>Generating Possible Plans</title> | |
| <para> | |
| The planner/optimizer starts by generating plans for scanning each | |
| individual relation (table) used in the query. The possible plans | |
| are determined by the available indexes on each relation. | |
| There is always the possibility of performing a | |
| sequential scan on a relation, so a sequential scan plan is always | |
| created. Assume an index is defined on a | |
| relation (for example a B-tree index) and a query contains the | |
| restriction | |
| <literal>relation.attribute OPR constant</literal>. If | |
| <literal>relation.attribute</literal> happens to match the key of the B-tree | |
| index and <literal>OPR</literal> is one of the operators listed in | |
| the index's <firstterm>operator class</firstterm>, another plan is created using | |
| the B-tree index to scan the relation. If there are further indexes | |
| present and the restrictions in the query happen to match a key of an | |
| index, further plans will be considered. Index scan plans are also | |
| generated for indexes that have a sort ordering that can match the | |
| query's <literal>ORDER BY</literal> clause (if any), or a sort ordering that | |
| might be useful for merge joining (see below). | |
| </para> | |
| <para> | |
| If the query requires joining two or more relations, | |
| plans for joining relations are considered | |
| after all feasible plans have been found for scanning single relations. | |
| The three available join strategies are: | |
| <itemizedlist> | |
| <listitem> | |
| <para> | |
| <firstterm>nested loop join</firstterm>: The right relation is scanned | |
| once for every row found in the left relation. This strategy | |
| is easy to implement but can be very time consuming. (However, | |
| if the right relation can be scanned with an index scan, this can | |
| be a good strategy. It is possible to use values from the current | |
| row of the left relation as keys for the index scan of the right.) | |
| </para> | |
| </listitem> | |
| <listitem> | |
| <para> | |
| <firstterm>merge join</firstterm>: Each relation is sorted on the join | |
| attributes before the join starts. Then the two relations are | |
| scanned in parallel, and matching rows are combined to form | |
| join rows. This kind of join is | |
| attractive because each relation has to be scanned only once. | |
| The required sorting might be achieved either by an explicit sort | |
| step, or by scanning the relation in the proper order using an | |
| index on the join key. | |
| </para> | |
| </listitem> | |
| <listitem> | |
| <para> | |
| <firstterm>hash join</firstterm>: the right relation is first scanned | |
| and loaded into a hash table, using its join attributes as hash keys. | |
| Next the left relation is scanned and the | |
| appropriate values of every row found are used as hash keys to | |
| locate the matching rows in the table. | |
| </para> | |
| </listitem> | |
| </itemizedlist> | |
| </para> | |
| <para> | |
| When the query involves more than two relations, the final result | |
| must be built up by a tree of join steps, each with two inputs. | |
| The planner examines different possible join sequences to find the | |
| cheapest one. | |
| </para> | |
| <para> | |
| If the query uses fewer than <xref linkend="guc-geqo-threshold"/> | |
| relations, a near-exhaustive search is conducted to find the best | |
| join sequence. The planner preferentially considers joins between any | |
| two relations for which there exists a corresponding join clause in the | |
| <literal>WHERE</literal> qualification (i.e., for | |
| which a restriction like <literal>where rel1.attr1=rel2.attr2</literal> | |
| exists). Join pairs with no join clause are considered only when there | |
| is no other choice, that is, a particular relation has no available | |
| join clauses to any other relation. All possible plans are generated for | |
| every join pair considered by the planner, and the one that is | |
| (estimated to be) the cheapest is chosen. | |
| </para> | |
| <para> | |
| When <varname>geqo_threshold</varname> is exceeded, the join | |
| sequences considered are determined by heuristics, as described | |
| in <xref linkend="geqo"/>. Otherwise the process is the same. | |
| </para> | |
| <para> | |
| The finished plan tree consists of sequential or index scans of | |
| the base relations, plus nested-loop, merge, or hash join nodes as | |
| needed, plus any auxiliary steps needed, such as sort nodes or | |
| aggregate-function calculation nodes. Most of these plan node | |
| types have the additional ability to do <firstterm>selection</firstterm> | |
| (discarding rows that do not meet a specified Boolean condition) | |
| and <firstterm>projection</firstterm> (computation of a derived column set | |
| based on given column values, that is, evaluation of scalar | |
| expressions where needed). One of the responsibilities of the | |
| planner is to attach selection conditions from the | |
| <literal>WHERE</literal> clause and computation of required | |
| output expressions to the most appropriate nodes of the plan | |
| tree. | |
| </para> | |
| </sect2> | |
| </sect1> | |
| <sect1 id="executor"> | |
| <title>Executor</title> | |
| <para> | |
| The <firstterm>executor</firstterm> takes the plan created by the | |
| planner/optimizer and recursively processes it to extract the required set | |
| of rows. This is essentially a demand-pull pipeline mechanism. | |
| Each time a plan node is called, it must deliver one more row, or | |
| report that it is done delivering rows. | |
| </para> | |
| <para> | |
| To provide a concrete example, assume that the top | |
| node is a <literal>MergeJoin</literal> node. | |
| Before any merge can be done two rows have to be fetched (one from | |
| each subplan). So the executor recursively calls itself to | |
| process the subplans (it starts with the subplan attached to | |
| <literal>lefttree</literal>). The new top node (the top node of the left | |
| subplan) is, let's say, a | |
| <literal>Sort</literal> node and again recursion is needed to obtain | |
| an input row. The child node of the <literal>Sort</literal> might | |
| be a <literal>SeqScan</literal> node, representing actual reading of a table. | |
| Execution of this node causes the executor to fetch a row from the | |
| table and return it up to the calling node. The <literal>Sort</literal> | |
| node will repeatedly call its child to obtain all the rows to be sorted. | |
| When the input is exhausted (as indicated by the child node returning | |
| a NULL instead of a row), the <literal>Sort</literal> code performs | |
| the sort, and finally is able to return its first output row, namely | |
| the first one in sorted order. It keeps the remaining rows stored so | |
| that it can deliver them in sorted order in response to later demands. | |
| </para> | |
| <para> | |
| The <literal>MergeJoin</literal> node similarly demands the first row | |
| from its right subplan. Then it compares the two rows to see if they | |
| can be joined; if so, it returns a join row to its caller. On the next | |
| call, or immediately if it cannot join the current pair of inputs, | |
| it advances to the next row of one table | |
| or the other (depending on how the comparison came out), and again | |
| checks for a match. Eventually, one subplan or the other is exhausted, | |
| and the <literal>MergeJoin</literal> node returns NULL to indicate that | |
| no more join rows can be formed. | |
| </para> | |
| <para> | |
| Complex queries can involve many levels of plan nodes, but the general | |
| approach is the same: each node computes and returns its next output | |
| row each time it is called. Each node is also responsible for applying | |
| any selection or projection expressions that were assigned to it by | |
| the planner. | |
| </para> | |
| <para> | |
| The executor mechanism is used to evaluate all four basic SQL query | |
| types: <command>SELECT</command>, <command>INSERT</command>, | |
| <command>UPDATE</command>, and <command>DELETE</command>. | |
| For <command>SELECT</command>, the top-level executor code | |
| only needs to send each row returned by the query plan tree | |
| off to the client. <command>INSERT ... SELECT</command>, | |
| <command>UPDATE</command>, and <command>DELETE</command> | |
| are effectively <command>SELECT</command>s under a special | |
| top-level plan node called <literal>ModifyTable</literal>. | |
| </para> | |
| <para> | |
| <command>INSERT ... SELECT</command> feeds the rows up | |
| to <literal>ModifyTable</literal> for insertion. For | |
| <command>UPDATE</command>, the planner arranges that each | |
| computed row includes all the updated column values, plus the | |
| <firstterm>TID</firstterm> (tuple ID, or row ID) of the original | |
| target row; this data is fed up to the <literal>ModifyTable</literal> | |
| node, which uses the information to create a new updated row and | |
| mark the old row deleted. For <command>DELETE</command>, the only | |
| column that is actually returned by the plan is the TID, and the | |
| <literal>ModifyTable</literal> node simply uses the TID to visit each | |
| target row and mark it deleted. | |
| </para> | |
| <para> | |
| A simple <command>INSERT ... VALUES</command> command creates a | |
| trivial plan tree consisting of a single <literal>Result</literal> | |
| node, which computes just one result row, feeding that up | |
| to <literal>ModifyTable</literal> to perform the insertion. | |
| </para> | |
| </sect1> | |
| </chapter> |