doc/src/sgml/geqo.sgml - age - Git at Google

 <!-- doc/src/sgml/geqo.sgml -->

  <chapter id="geqo">
   <title>Genetic Query Optimizer</title>

   <para>
    <note>
     <title>Author</title>
     <para>
      Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
      for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
     </para>
    </note>
   </para>

   <sect1 id="geqo-intro">
    <title>Query Handling as a Complex Optimization Problem</title>

    <para>
     Among all relational operators the most difficult one to process
     and optimize is the <firstterm>join</firstterm>. The number of
     possible query plans grows exponentially with the
     number of joins in the query. Further optimization effort is
     caused by the support of a variety of <firstterm>join
     methods</firstterm> (e.g., nested loop, hash join, merge join in
     <productname>PostgreSQL</productname>) to process individual joins
     and a diversity of <firstterm>indexes</firstterm> (e.g.,
     B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
     access paths for relations.
    </para>

    <para>
     The normal <productname>PostgreSQL</productname> query optimizer
     performs a <firstterm>near-exhaustive search</firstterm> over the
     space of alternative strategies. This algorithm, first introduced
     in IBM's System R database, produces a near-optimal join order,
     but can take an enormous amount of time and memory space when the
     number of joins in the query grows large. This makes the ordinary
     <productname>PostgreSQL</productname> query optimizer
     inappropriate for queries that join a large number of tables.
    </para>

    <para>
     The Institute of Automatic Control at the University of Mining and
     Technology, in Freiberg, Germany, encountered some problems when
     it wanted to use <productname>PostgreSQL</productname> as the
     backend for a decision support knowledge based system for the
     maintenance of an electrical power grid. The DBMS needed to handle
     large join queries for the inference machine of the knowledge
     based system. The number of joins in these queries made using the
     normal query optimizer infeasible.
    </para>

    <para>
     In the following we describe the implementation of a
     <firstterm>genetic algorithm</firstterm> to solve the join
     ordering problem in a manner that is efficient for queries
     involving large numbers of joins.
    </para>
   </sect1>

   <sect1 id="geqo-intro2">
    <title>Genetic Algorithms</title>

    <para>
     The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
     operates through randomized search. The set of possible solutions for the
     optimization problem is considered as a
     <firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
     The degree of adaptation of an individual to its environment is specified
     by its <firstterm>fitness</firstterm>.
    </para>

    <para>
     The coordinates of an individual in the search space are represented
     by <firstterm>chromosomes</firstterm>, in essence a set of character
     strings. A <firstterm>gene</firstterm> is a
     subsection of a chromosome which encodes the value of a single parameter
     being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
     <firstterm>integer</firstterm>.
    </para>

    <para>
     Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
     <firstterm>mutation</firstterm>, and
     <firstterm>selection</firstterm> new generations of search points are found
     that show a higher average fitness than their ancestors. <xref linkend="geqo-figure"/>
     illustrates these steps.
    </para>

    <figure id="geqo-figure">
     <title>Structure of a Genetic Algorithm</title>
     <mediaobject>
      <imageobject>
       <imagedata fileref="images/genetic-algorithm.svg" format="SVG" width="100%"/>
      </imageobject>
     </mediaobject>
    </figure>

    <para>
     According to the <systemitem class="resource">comp.ai.genetic</systemitem> <acronym>FAQ</acronym> it cannot be stressed too
     strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
     problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
     non-random (better than random).
    </para>

   </sect1>

   <sect1 id="geqo-pg-intro">
    <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>

    <para>
     The <acronym>GEQO</acronym> module approaches the query
     optimization problem as though it were the well-known traveling salesman
     problem (<acronym>TSP</acronym>).
     Possible query plans are encoded as integer strings. Each string
     represents the join order from one relation of the query to the next.
     For example, the join tree
 <literallayout class="monospaced">
    /\
   /\ 2
  /\ 3
 4  1
 </literallayout>
     is encoded by the integer string '4-1-3-2',
     which means, first join relation '4' and '1', then '3', and
     then '2', where 1, 2, 3, 4 are relation IDs within the
     <productname>PostgreSQL</productname> optimizer.
    </para>

    <para>
     Specific characteristics of the <acronym>GEQO</acronym>
     implementation in <productname>PostgreSQL</productname>
     are:

     <itemizedlist spacing="compact" mark="bullet">
      <listitem>
       <para>
        Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
        individuals in a population, not whole-generational replacement)
        allows fast convergence towards improved query plans. This is
        essential for query handling with reasonable time;
       </para>
      </listitem>

      <listitem>
       <para>
        Usage of <firstterm>edge recombination crossover</firstterm>
        which is especially suited to keep edge losses low for the
        solution of the <acronym>TSP</acronym> by means of a
        <acronym>GA</acronym>;
       </para>
      </listitem>

      <listitem>
       <para>
        Mutation as genetic operator is deprecated so that no repair
        mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
       </para>
      </listitem>
     </itemizedlist>
    </para>

    <para>
     Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
     Genitor algorithm.
    </para>

    <para>
     The <acronym>GEQO</acronym> module allows
     the <productname>PostgreSQL</productname> query optimizer to
     support large join queries effectively through
     non-exhaustive search.
    </para>

   <sect2>
    <title>Generating Possible Plans with <acronym>GEQO</acronym></title>

    <para>
     The <acronym>GEQO</acronym> planning process uses the standard planner
     code to generate plans for scans of individual relations.  Then join
     plans are developed using the genetic approach.  As shown above, each
     candidate join plan is represented by a sequence in which to join
     the base relations.  In the initial stage, the <acronym>GEQO</acronym>
     code simply generates some possible join sequences at random.  For each
     join sequence considered, the standard planner code is invoked to
     estimate the cost of performing the query using that join sequence.
     (For each step of the join sequence, all three possible join strategies
     are considered; and all the initially-determined relation scan plans
     are available.  The estimated cost is the cheapest of these
     possibilities.)  Join sequences with lower estimated cost are considered
     <quote>more fit</quote> than those with higher cost.  The genetic algorithm
     discards the least fit candidates.  Then new candidates are generated
     by combining genes of more-fit candidates &mdash; that is, by using
     randomly-chosen portions of known low-cost join sequences to create
     new sequences for consideration.  This process is repeated until a
     preset number of join sequences have been considered; then the best
     one found at any time during the search is used to generate the finished
     plan.
    </para>

    <para>
     This process is inherently nondeterministic, because of the randomized
     choices made during both the initial population selection and subsequent
     <quote>mutation</quote> of the best candidates.  To avoid surprising changes
     of the selected plan, each run of the GEQO algorithm restarts its
     random number generator with the current <xref linkend="guc-geqo-seed"/>
     parameter setting.  As long as <varname>geqo_seed</varname> and the other
     GEQO parameters are kept fixed, the same plan will be generated for a
     given query (and other planner inputs such as statistics).  To experiment
     with different search paths, try changing <varname>geqo_seed</varname>.
    </para>

   </sect2>

   <sect2 id="geqo-future">
    <title>Future Implementation Tasks for
     <productname>PostgreSQL</productname> <acronym>GEQO</acronym></title>

      <para>
       Work is still needed to improve the genetic algorithm parameter
       settings.
       In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
       routines
       <function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
       we have to find a compromise for the parameter settings
       to satisfy two competing demands:
       <itemizedlist spacing="compact">
        <listitem>
         <para>
          Optimality of the query plan
         </para>
        </listitem>
        <listitem>
         <para>
          Computing time
         </para>
        </listitem>
       </itemizedlist>
      </para>

      <para>
       In the current implementation, the fitness of each candidate join
       sequence is estimated by running the standard planner's join selection
       and cost estimation code from scratch.  To the extent that different
       candidates use similar sub-sequences of joins, a great deal of work
       will be repeated.  This could be made significantly faster by retaining
       cost estimates for sub-joins.  The problem is to avoid expending
       unreasonable amounts of memory on retaining that state.
      </para>

      <para>
       At a more basic level, it is not clear that solving query optimization
       with a GA algorithm designed for TSP is appropriate.  In the TSP case,
       the cost associated with any substring (partial tour) is independent
       of the rest of the tour, but this is certainly not true for query
       optimization.  Thus it is questionable whether edge recombination
       crossover is the most effective mutation procedure.
      </para>

    </sect2>
   </sect1>

  <sect1 id="geqo-biblio">
   <title>Further Reading</title>

   <para>
    The following resources contain additional information about
    genetic algorithms:

    <itemizedlist>
     <listitem>
      <para>
       <ulink url="http://www.faqs.org/faqs/ai-faq/genetic/part1/">
       The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
       url="news://comp.ai.genetic"></ulink>)
      </para>
     </listitem>

     <listitem>
      <para>
       <ulink url="https://www.red3d.com/cwr/evolve.html">
       Evolutionary Computation and its application to art and design</ulink>, by
       Craig Reynolds
      </para>
     </listitem>

     <listitem>
      <para>
       <xref linkend="elma04"/>
      </para>
     </listitem>

     <listitem>
      <para>
       <xref linkend="fong"/>
      </para>
     </listitem>
    </itemizedlist>
   </para>

  </sect1>
 </chapter>
	<!-- doc/src/sgml/geqo.sgml -->

	<chapter id="geqo">
	<title>Genetic Query Optimizer</title>

	<para>
	<note>
	<title>Author</title>
	<para>
	Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>)
	for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
	</para>
	</note>
	</para>

	<sect1 id="geqo-intro">
	<title>Query Handling as a Complex Optimization Problem</title>

	<para>
	Among all relational operators the most difficult one to process
	and optimize is the <firstterm>join</firstterm>. The number of
	possible query plans grows exponentially with the
	number of joins in the query. Further optimization effort is
	caused by the support of a variety of <firstterm>join
	methods</firstterm> (e.g., nested loop, hash join, merge join in
	<productname>PostgreSQL</productname>) to process individual joins
	and a diversity of <firstterm>indexes</firstterm> (e.g.,
	B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as
	access paths for relations.
	</para>

	<para>
	The normal <productname>PostgreSQL</productname> query optimizer
	performs a <firstterm>near-exhaustive search</firstterm> over the
	space of alternative strategies. This algorithm, first introduced
	in IBM's System R database, produces a near-optimal join order,
	but can take an enormous amount of time and memory space when the
	number of joins in the query grows large. This makes the ordinary
	<productname>PostgreSQL</productname> query optimizer
	inappropriate for queries that join a large number of tables.
	</para>

	<para>
	The Institute of Automatic Control at the University of Mining and
	Technology, in Freiberg, Germany, encountered some problems when
	it wanted to use <productname>PostgreSQL</productname> as the
	backend for a decision support knowledge based system for the
	maintenance of an electrical power grid. The DBMS needed to handle
	large join queries for the inference machine of the knowledge
	based system. The number of joins in these queries made using the
	normal query optimizer infeasible.
	</para>

	<para>
	In the following we describe the implementation of a
	<firstterm>genetic algorithm</firstterm> to solve the join
	ordering problem in a manner that is efficient for queries
	involving large numbers of joins.
	</para>
	</sect1>

	<sect1 id="geqo-intro2">
	<title>Genetic Algorithms</title>

	<para>
	The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which
	operates through randomized search. The set of possible solutions for the
	optimization problem is considered as a
	<firstterm>population</firstterm> of <firstterm>individuals</firstterm>.
	The degree of adaptation of an individual to its environment is specified
	by its <firstterm>fitness</firstterm>.
	</para>

	<para>
	The coordinates of an individual in the search space are represented
	by <firstterm>chromosomes</firstterm>, in essence a set of character
	strings. A <firstterm>gene</firstterm> is a
	subsection of a chromosome which encodes the value of a single parameter
	being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or
	<firstterm>integer</firstterm>.
	</para>

	<para>
	Through simulation of the evolutionary operations <firstterm>recombination</firstterm>,
	<firstterm>mutation</firstterm>, and
	<firstterm>selection</firstterm> new generations of search points are found
	that show a higher average fitness than their ancestors. <xref linkend="geqo-figure"/>
	illustrates these steps.
	</para>

	<figure id="geqo-figure">
	<title>Structure of a Genetic Algorithm</title>
	<mediaobject>
	<imageobject>
	<imagedata fileref="images/genetic-algorithm.svg" format="SVG" width="100%"/>
	</imageobject>
	</mediaobject>
	</figure>

	<para>
	According to the <systemitem class="resource">comp.ai.genetic</systemitem> <acronym>FAQ</acronym> it cannot be stressed too
	strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a
	problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly
	non-random (better than random).
	</para>

	</sect1>

	<sect1 id="geqo-pg-intro">
	<title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title>

	<para>
	The <acronym>GEQO</acronym> module approaches the query
	optimization problem as though it were the well-known traveling salesman
	problem (<acronym>TSP</acronym>).
	Possible query plans are encoded as integer strings. Each string
	represents the join order from one relation of the query to the next.
	For example, the join tree
	<literallayout class="monospaced">
	/\
	/\ 2
	/\ 3
	4 1
	</literallayout>
	is encoded by the integer string '4-1-3-2',
	which means, first join relation '4' and '1', then '3', and
	then '2', where 1, 2, 3, 4 are relation IDs within the
	<productname>PostgreSQL</productname> optimizer.
	</para>

	<para>
	Specific characteristics of the <acronym>GEQO</acronym>
	implementation in <productname>PostgreSQL</productname>
	are:

	<itemizedlist spacing="compact" mark="bullet">
	<listitem>
	<para>
	Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit
	individuals in a population, not whole-generational replacement)
	allows fast convergence towards improved query plans. This is
	essential for query handling with reasonable time;
	</para>
	</listitem>

	<listitem>
	<para>
	Usage of <firstterm>edge recombination crossover</firstterm>
	which is especially suited to keep edge losses low for the
	solution of the <acronym>TSP</acronym> by means of a
	<acronym>GA</acronym>;
	</para>
	</listitem>

	<listitem>
	<para>
	Mutation as genetic operator is deprecated so that no repair
	mechanisms are needed to generate legal <acronym>TSP</acronym> tours.
	</para>
	</listitem>
	</itemizedlist>
	</para>

	<para>
	Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's
	Genitor algorithm.
	</para>

	<para>
	The <acronym>GEQO</acronym> module allows
	the <productname>PostgreSQL</productname> query optimizer to
	support large join queries effectively through
	non-exhaustive search.
	</para>

	<sect2>
	<title>Generating Possible Plans with <acronym>GEQO</acronym></title>

	<para>
	The <acronym>GEQO</acronym> planning process uses the standard planner
	code to generate plans for scans of individual relations. Then join
	plans are developed using the genetic approach. As shown above, each
	candidate join plan is represented by a sequence in which to join
	the base relations. In the initial stage, the <acronym>GEQO</acronym>
	code simply generates some possible join sequences at random. For each
	join sequence considered, the standard planner code is invoked to
	estimate the cost of performing the query using that join sequence.
	(For each step of the join sequence, all three possible join strategies
	are considered; and all the initially-determined relation scan plans
	are available. The estimated cost is the cheapest of these
	possibilities.) Join sequences with lower estimated cost are considered
	<quote>more fit</quote> than those with higher cost. The genetic algorithm
	discards the least fit candidates. Then new candidates are generated
	by combining genes of more-fit candidates — that is, by using
	randomly-chosen portions of known low-cost join sequences to create
	new sequences for consideration. This process is repeated until a
	preset number of join sequences have been considered; then the best
	one found at any time during the search is used to generate the finished
	plan.
	</para>

	<para>
	This process is inherently nondeterministic, because of the randomized
	choices made during both the initial population selection and subsequent
	<quote>mutation</quote> of the best candidates. To avoid surprising changes
	of the selected plan, each run of the GEQO algorithm restarts its
	random number generator with the current <xref linkend="guc-geqo-seed"/>
	parameter setting. As long as <varname>geqo_seed</varname> and the other
	GEQO parameters are kept fixed, the same plan will be generated for a
	given query (and other planner inputs such as statistics). To experiment
	with different search paths, try changing <varname>geqo_seed</varname>.
	</para>

	</sect2>

	<sect2 id="geqo-future">
	<title>Future Implementation Tasks for
	<productname>PostgreSQL</productname> <acronym>GEQO</acronym></title>

	<para>
	Work is still needed to improve the genetic algorithm parameter
	settings.
	In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>,
	routines
	<function>gimme_pool_size</function> and <function>gimme_number_generations</function>,
	we have to find a compromise for the parameter settings
	to satisfy two competing demands:
	<itemizedlist spacing="compact">
	<listitem>
	<para>
	Optimality of the query plan
	</para>
	</listitem>
	<listitem>
	<para>
	Computing time
	</para>
	</listitem>
	</itemizedlist>
	</para>

	<para>
	In the current implementation, the fitness of each candidate join
	sequence is estimated by running the standard planner's join selection
	and cost estimation code from scratch. To the extent that different
	candidates use similar sub-sequences of joins, a great deal of work
	will be repeated. This could be made significantly faster by retaining
	cost estimates for sub-joins. The problem is to avoid expending
	unreasonable amounts of memory on retaining that state.
	</para>

	<para>
	At a more basic level, it is not clear that solving query optimization
	with a GA algorithm designed for TSP is appropriate. In the TSP case,
	the cost associated with any substring (partial tour) is independent
	of the rest of the tour, but this is certainly not true for query
	optimization. Thus it is questionable whether edge recombination
	crossover is the most effective mutation procedure.
	</para>

	</sect2>
	</sect1>

	<sect1 id="geqo-biblio">
	<title>Further Reading</title>

	<para>
	The following resources contain additional information about
	genetic algorithms:

	<itemizedlist>
	<listitem>
	<para>
	<ulink url="http://www.faqs.org/faqs/ai-faq/genetic/part1/">
	The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink
	url="news://comp.ai.genetic"></ulink>)
	</para>
	</listitem>

	<listitem>
	<para>
	<ulink url="https://www.red3d.com/cwr/evolve.html">
	Evolutionary Computation and its application to art and design</ulink>, by
	Craig Reynolds
	</para>
	</listitem>

	<listitem>
	<para>
	<xref linkend="elma04"/>
	</para>
	</listitem>

	<listitem>
	<para>
	<xref linkend="fong"/>
	</para>
	</listitem>
	</itemizedlist>
	</para>

	</sect1>
	</chapter>