doc/src/sgml/arch-dev.sgml - cloudberry - Git at Google

 <!-- doc/src/sgml/arch-dev.sgml -->

  <chapter id="overview">
   <title>Overview of PostgreSQL Internals</title>

   <note>
    <title>Author</title>
    <para>
     This chapter originated as part of
     <xref linkend="sim98"/> Stefan Simkovics'
     Master's Thesis prepared at Vienna University of Technology under the direction
     of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
    </para>
   </note>

   <para>
    This chapter gives an overview of the internal structure of the
    backend of <productname>PostgreSQL</productname>.  After having
    read the following sections you should have an idea of how a query
    is processed.  This chapter is intended to help the reader
    understand the general sequence of operations that occur within the
    backend from the point at which a query is received, to the point
    at which the results are returned to the client.
   </para>

   <sect1 id="query-path">
    <title>The Path of a Query</title>

    <para>
     Here we give a short overview of the stages a query has to pass
     to obtain a result.
    </para>

    <procedure>
     <step>
      <para>
       A connection from an application program to the <productname>PostgreSQL</productname>
       server has to be established. The application program transmits a
       query to the server and waits to receive the results sent back by the
       server.
      </para>
     </step>

     <step>
      <para>
       The <firstterm>parser stage</firstterm> checks the query
       transmitted by the application
       program for correct syntax and creates
       a <firstterm>query tree</firstterm>.
      </para>
     </step>

     <step>
      <para>
       The <firstterm>rewrite system</firstterm> takes
       the query tree created by the parser stage and looks for
       any <firstterm>rules</firstterm> (stored in the
       <firstterm>system catalogs</firstterm>) to apply to
       the query tree.  It performs the
       transformations given in the <firstterm>rule bodies</firstterm>.
      </para>

      <para>
       One application of the rewrite system is in the realization of
       <firstterm>views</firstterm>.
       Whenever a query against a view
       (i.e., a <firstterm>virtual table</firstterm>) is made,
       the rewrite system rewrites the user's query to
       a query that accesses the <firstterm>base tables</firstterm> given in
       the <firstterm>view definition</firstterm> instead.
      </para>
     </step>

     <step>
      <para>
       The <firstterm>planner/optimizer</firstterm> takes
       the (rewritten) query tree and creates a
       <firstterm>query plan</firstterm> that will be the input to the
       <firstterm>executor</firstterm>.
      </para>

      <para>
       It does so by first creating all possible <firstterm>paths</firstterm>
       leading to the same result. For example if there is an index on a
       relation to be scanned, there are two paths for the
       scan. One possibility is a simple sequential scan and the other
       possibility is to use the index. Next the cost for the execution of
       each path is estimated and the cheapest path is chosen.  The cheapest
       path is expanded into a complete plan that the executor can use.
      </para>
     </step>

     <step>
      <para>
       The executor recursively steps through
       the <firstterm>plan tree</firstterm> and
       retrieves rows in the way represented by the plan.
       The executor makes use of the
       <firstterm>storage system</firstterm> while scanning
       relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
       evaluates <firstterm>qualifications</firstterm> and finally hands back the rows derived.
      </para>
     </step>
    </procedure>

    <para>
     In the following sections we will cover each of the above listed items
     in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
     control and data structures.
    </para>
   </sect1>

   <sect1 id="connect-estab">
    <title>How Connections Are Established</title>

    <para>
     <productname>PostgreSQL</productname> implements a
     <quote>process per user</quote> client/server model.
     In this model, every
     <glossterm linkend="glossary-client">client process</glossterm>
     connects to exactly one
     <glossterm linkend="glossary-backend">backend process</glossterm>.
     As we do not know ahead of time how many connections will be made,
     we have to use a <quote>supervisor process</quote> that spawns a new
     backend process every time a connection is requested. This supervisor
     process is called
     <glossterm linkend="glossary-postmaster">postmaster</glossterm>
     and listens at a specified TCP/IP port for incoming connections.
     Whenever it detects a request for a connection, it spawns a new
     backend process.  Those backend processes communicate with each
     other and with other processes of the
     <glossterm linkend="glossary-instance">instance</glossterm>
     using <firstterm>semaphores</firstterm> and
     <glossterm linkend="glossary-shared-memory">shared memory</glossterm>
     to ensure data integrity throughout concurrent data access.
    </para>

    <para>
     The client process can be any program that understands the
     <productname>PostgreSQL</productname> protocol described in
     <xref linkend="protocol"/>.  Many clients are based on the
     C-language library <application>libpq</application>, but several independent
     implementations of the protocol exist, such as the Java
     <application>JDBC</application> driver.
    </para>

    <para>
     Once a connection is established, the client process can send a query
     to the backend process it's connected to. The query is transmitted using
     plain text, i.e., there is no parsing done in the client. The backend
     process parses the query, creates an <firstterm>execution plan</firstterm>,
     executes the plan, and returns the retrieved rows to the client
     by transmitting them over the established connection.
    </para>
   </sect1>

   <sect1 id="parser-stage">
    <title>The Parser Stage</title>

    <para>
     The <firstterm>parser stage</firstterm> consists of two parts:

     <itemizedlist>
      <listitem>
       <para>
        The <firstterm>parser</firstterm> defined in
        <filename>gram.y</filename> and <filename>scan.l</filename> is
        built using the Unix tools <application>bison</application>
        and <application>flex</application>.
       </para>
      </listitem>
      <listitem>
       <para>
        The <firstterm>transformation process</firstterm> does
        modifications and augmentations to the data structures returned by the parser.
       </para>
      </listitem>
     </itemizedlist>
    </para>

    <sect2>
     <title>Parser</title>

     <para>
      The parser has to check the query string (which arrives as plain
      text) for valid syntax. If the syntax is correct a
      <firstterm>parse tree</firstterm> is built up and handed back;
      otherwise an error is returned. The parser and lexer are
      implemented using the well-known Unix tools <application>bison</application>
      and <application>flex</application>.
     </para>

     <para>
      The <firstterm>lexer</firstterm> is defined in the file
      <filename>scan.l</filename> and is responsible
      for recognizing <firstterm>identifiers</firstterm>,
      the <firstterm>SQL key words</firstterm> etc. For
      every key word or identifier that is found, a <firstterm>token</firstterm>
      is generated and handed to the parser.
     </para>

     <para>
      The parser is defined in the file <filename>gram.y</filename> and
      consists of a set of <firstterm>grammar rules</firstterm> and
      <firstterm>actions</firstterm> that are executed whenever a rule
      is fired. The code of the actions (which is actually C code) is
      used to build up the parse tree.
     </para>

     <para>
      The file <filename>scan.l</filename> is transformed to the C
      source file <filename>scan.c</filename> using the program
      <application>flex</application> and <filename>gram.y</filename> is
      transformed to <filename>gram.c</filename> using
      <application>bison</application>.  After these transformations
      have taken place a normal C compiler can be used to create the
      parser. Never make any changes to the generated C files as they
      will be overwritten the next time <application>flex</application>
      or <application>bison</application> is called.

      <note>
       <para>
        The mentioned transformations and compilations are normally done
        automatically using the <firstterm>makefiles</firstterm>
        shipped with the <productname>PostgreSQL</productname>
        source distribution.
       </para>
      </note>
     </para>

     <para>
      A detailed description of <application>bison</application> or
      the grammar rules given in <filename>gram.y</filename> would be
      beyond the scope of this manual. There are many books and
      documents dealing with <application>flex</application> and
      <application>bison</application>. You should be familiar with
      <application>bison</application> before you start to study the
      grammar given in <filename>gram.y</filename> otherwise you won't
      understand what happens there.
     </para>

    </sect2>

    <sect2>
      <title>Transformation Process</title>

     <para>
      The parser stage creates a parse tree using only fixed rules about
      the syntactic structure of SQL.  It does not make any lookups in the
      system catalogs, so there is no possibility to understand the detailed
      semantics of the requested operations.  After the parser completes,
      the <firstterm>transformation process</firstterm> takes the tree handed
      back by the parser as input and does the semantic interpretation needed
      to understand which tables, functions, and operators are referenced by
      the query.  The data structure that is built to represent this
      information is called the <firstterm>query tree</firstterm>.
     </para>

     <para>
      The reason for separating raw parsing from semantic analysis is that
      system catalog lookups can only be done within a transaction, and we
      do not wish to start a transaction immediately upon receiving a query
      string.  The raw parsing stage is sufficient to identify the transaction
      control commands (<command>BEGIN</command>, <command>ROLLBACK</command>, etc), and
      these can then be correctly executed without any further analysis.
      Once we know that we are dealing with an actual query (such as
      <command>SELECT</command> or <command>UPDATE</command>), it is okay to
      start a transaction if we're not already in one.  Only then can the
      transformation process be invoked.
     </para>

     <para>
      The query tree created by the transformation process is structurally
      similar to the raw parse tree in most places, but it has many differences
      in detail.  For example, a <structname>FuncCall</structname> node in the
      parse tree represents something that looks syntactically like a function
      call.  This might be transformed to either a <structname>FuncExpr</structname>
      or <structname>Aggref</structname> node depending on whether the referenced
      name turns out to be an ordinary function or an aggregate function.
      Also, information about the actual data types of columns and expression
      results is added to the query tree.
     </para>
    </sect2>
   </sect1>

   <sect1 id="rule-system">
    <title>The <productname>PostgreSQL</productname> Rule System</title>

    <para>
     <productname>PostgreSQL</productname> supports a powerful
     <firstterm>rule system</firstterm> for the specification
     of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
     Originally the <productname>PostgreSQL</productname>
     rule system consisted of two implementations:

     <itemizedlist>
      <listitem>
       <para>
        The first one worked using <firstterm>row level</firstterm> processing and was
        implemented deep in the <firstterm>executor</firstterm>. The rule system was
        called whenever an individual row had been accessed. This
        implementation was removed in 1995 when the last official release
        of the <productname>Berkeley Postgres</productname> project was
        transformed into <productname>Postgres95</productname>.
       </para>
      </listitem>

      <listitem>
       <para>
        The second implementation of the rule system is a technique
        called <firstterm>query rewriting</firstterm>.
        The <firstterm>rewrite system</firstterm> is a module
        that exists between the <firstterm>parser stage</firstterm> and the
        <firstterm>planner/optimizer</firstterm>. This technique is still implemented.
       </para>
      </listitem>
     </itemizedlist>
    </para>

    <para>
     The query rewriter is discussed in some detail in
     <xref linkend="rules"/>, so there is no need to cover it here.
     We will only point out that both the input and the output of the
     rewriter are query trees, that is, there is no change in the
     representation or level of semantic detail in the trees.  Rewriting
     can be thought of as a form of macro expansion.
    </para>

   </sect1>

   <sect1 id="planner-optimizer">
    <title>Planner/Optimizer</title>

    <para>
     The task of the <firstterm>planner/optimizer</firstterm> is to
     create an optimal execution plan. A given SQL query (and hence, a
     query tree) can be actually executed in a wide variety of
     different ways, each of which will produce the same set of
     results.  If it is computationally feasible, the query optimizer
     will examine each of these possible execution plans, ultimately
     selecting the execution plan that is expected to run the fastest.
    </para>

    <note>
     <para>
      In some situations, examining each possible way in which a query
      can be executed would take an excessive amount of time and memory.
      In particular, this occurs when executing queries
      involving large numbers of join operations. In order to determine
      a reasonable (not necessarily optimal) query plan in a reasonable amount
      of time, <productname>PostgreSQL</productname> uses a <firstterm>Genetic
      Query Optimizer</firstterm> (see <xref linkend="geqo"/>) when the number of joins
      exceeds a threshold (see <xref linkend="guc-geqo-threshold"/>).
     </para>
    </note>

    <para>
     The planner's search procedure actually works with data structures
     called <firstterm>paths</firstterm>, which are simply cut-down representations of
     plans containing only as much information as the planner needs to make
     its decisions. After the cheapest path is determined, a full-fledged
     <firstterm>plan tree</firstterm> is built to pass to the executor.  This represents
     the desired execution plan in sufficient detail for the executor to run it.
     In the rest of this section we'll ignore the distinction between paths
     and plans.
    </para>

    <sect2>
     <title>Generating Possible Plans</title>

     <para>
      The planner/optimizer starts by generating plans for scanning each
      individual relation (table) used in the query.  The possible plans
      are determined by the available indexes on each relation.
      There is always the possibility of performing a
      sequential scan on a relation, so a sequential scan plan is always
      created. Assume an index is defined on a
      relation (for example a B-tree index) and a query contains the
      restriction
      <literal>relation.attribute OPR constant</literal>. If
      <literal>relation.attribute</literal> happens to match the key of the B-tree
      index and <literal>OPR</literal> is one of the operators listed in
      the index's <firstterm>operator class</firstterm>, another plan is created using
      the B-tree index to scan the relation. If there are further indexes
      present and the restrictions in the query happen to match a key of an
      index, further plans will be considered.  Index scan plans are also
      generated for indexes that have a sort ordering that can match the
      query's <literal>ORDER BY</literal> clause (if any), or a sort ordering that
      might be useful for merge joining (see below).
     </para>

     <para>
      If the query requires joining two or more relations,
      plans for joining relations are considered
      after all feasible plans have been found for scanning single relations.
      The three available join strategies are:

      <itemizedlist>
       <listitem>
        <para>
         <firstterm>nested loop join</firstterm>: The right relation is scanned
         once for every row found in the left relation. This strategy
         is easy to implement but can be very time consuming.  (However,
         if the right relation can be scanned with an index scan, this can
         be a good strategy.  It is possible to use values from the current
         row of the left relation as keys for the index scan of the right.)
        </para>
       </listitem>

       <listitem>
        <para>
         <firstterm>merge join</firstterm>: Each relation is sorted on the join
         attributes before the join starts. Then the two relations are
         scanned in parallel, and matching rows are combined to form
         join rows. This kind of join is
         attractive because each relation has to be scanned only once.
         The required sorting might be achieved either by an explicit sort
         step, or by scanning the relation in the proper order using an
         index on the join key.
        </para>
       </listitem>

       <listitem>
        <para>
         <firstterm>hash join</firstterm>: the right relation is first scanned
         and loaded into a hash table, using its join attributes as hash keys.
         Next the left relation is scanned and the
         appropriate values of every row found are used as hash keys to
         locate the matching rows in the table.
        </para>
       </listitem>
      </itemizedlist>
     </para>

     <para>
      When the query involves more than two relations, the final result
      must be built up by a tree of join steps, each with two inputs.
      The planner examines different possible join sequences to find the
      cheapest one.
     </para>

     <para>
      If the query uses fewer than <xref linkend="guc-geqo-threshold"/>
      relations, a near-exhaustive search is conducted to find the best
      join sequence.  The planner preferentially considers joins between any
      two relations for which there exists a corresponding join clause in the
      <literal>WHERE</literal> qualification (i.e., for
      which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
      exists). Join pairs with no join clause are considered only when there
      is no other choice, that is, a particular relation has no available
      join clauses to any other relation. All possible plans are generated for
      every join pair considered by the planner, and the one that is
      (estimated to be) the cheapest is chosen.
     </para>

     <para>
      When <varname>geqo_threshold</varname> is exceeded, the join
      sequences considered are determined by heuristics, as described
      in <xref linkend="geqo"/>.  Otherwise the process is the same.
     </para>

     <para>
      The finished plan tree consists of sequential or index scans of
      the base relations, plus nested-loop, merge, or hash join nodes as
      needed, plus any auxiliary steps needed, such as sort nodes or
      aggregate-function calculation nodes.  Most of these plan node
      types have the additional ability to do <firstterm>selection</firstterm>
      (discarding rows that do not meet a specified Boolean condition)
      and <firstterm>projection</firstterm> (computation of a derived column set
      based on given column values, that is, evaluation of scalar
      expressions where needed).  One of the responsibilities of the
      planner is to attach selection conditions from the
      <literal>WHERE</literal> clause and computation of required
      output expressions to the most appropriate nodes of the plan
      tree.
     </para>
    </sect2>
   </sect1>

   <sect1 id="executor">
    <title>Executor</title>

    <para>
     The <firstterm>executor</firstterm> takes the plan created by the
     planner/optimizer and recursively processes it to extract the required set
     of rows.  This is essentially a demand-pull pipeline mechanism.
     Each time a plan node is called, it must deliver one more row, or
     report that it is done delivering rows.
    </para>

    <para>
     To provide a concrete example, assume that the top
     node is a <literal>MergeJoin</literal> node.
     Before any merge can be done two rows have to be fetched (one from
     each subplan). So the executor recursively calls itself to
     process the subplans (it starts with the subplan attached to
     <literal>lefttree</literal>). The new top node (the top node of the left
     subplan) is, let's say, a
     <literal>Sort</literal> node and again recursion is needed to obtain
     an input row.  The child node of the <literal>Sort</literal> might
     be a <literal>SeqScan</literal> node, representing actual reading of a table.
     Execution of this node causes the executor to fetch a row from the
     table and return it up to the calling node.  The <literal>Sort</literal>
     node will repeatedly call its child to obtain all the rows to be sorted.
     When the input is exhausted (as indicated by the child node returning
     a NULL instead of a row), the <literal>Sort</literal> code performs
     the sort, and finally is able to return its first output row, namely
     the first one in sorted order.  It keeps the remaining rows stored so
     that it can deliver them in sorted order in response to later demands.
    </para>

    <para>
     The <literal>MergeJoin</literal> node similarly demands the first row
     from its right subplan.  Then it compares the two rows to see if they
     can be joined; if so, it returns a join row to its caller.  On the next
     call, or immediately if it cannot join the current pair of inputs,
     it advances to the next row of one table
     or the other (depending on how the comparison came out), and again
     checks for a match.  Eventually, one subplan or the other is exhausted,
     and the <literal>MergeJoin</literal> node returns NULL to indicate that
     no more join rows can be formed.
    </para>

    <para>
     Complex queries can involve many levels of plan nodes, but the general
     approach is the same: each node computes and returns its next output
     row each time it is called.  Each node is also responsible for applying
     any selection or projection expressions that were assigned to it by
     the planner.
    </para>

    <para>
     The executor mechanism is used to evaluate all four basic SQL query
     types: <command>SELECT</command>, <command>INSERT</command>,
     <command>UPDATE</command>, and <command>DELETE</command>.
     For <command>SELECT</command>, the top-level executor code
     only needs to send each row returned by the query plan tree
     off to the client.  <command>INSERT ... SELECT</command>,
     <command>UPDATE</command>, and <command>DELETE</command>
     are effectively <command>SELECT</command>s under a special
     top-level plan node called <literal>ModifyTable</literal>.
    </para>

    <para>
     <command>INSERT ... SELECT</command> feeds the rows up
     to <literal>ModifyTable</literal> for insertion.  For
     <command>UPDATE</command>, the planner arranges that each
     computed row includes all the updated column values, plus the
     <firstterm>TID</firstterm> (tuple ID, or row ID) of the original
     target row; this data is fed up to the <literal>ModifyTable</literal>
     node, which uses the information to create a new updated row and
     mark the old row deleted.  For <command>DELETE</command>, the only
     column that is actually returned by the plan is the TID, and the
     <literal>ModifyTable</literal> node simply uses the TID to visit each
     target row and mark it deleted.
    </para>

    <para>
     A simple <command>INSERT ... VALUES</command> command creates a
     trivial plan tree consisting of a single <literal>Result</literal>
     node, which computes just one result row, feeding that up
     to <literal>ModifyTable</literal> to perform the insertion.
    </para>

   </sect1>

  </chapter>