| <!-- doc/src/sgml/geqo.sgml --> |
| |
| <chapter id="geqo"> |
| <title>Genetic Query Optimizer</title> |
| |
| <para> |
| <note> |
| <title>Author</title> |
| <para> |
| Written by Martin Utesch (<email>utesch@aut.tu-freiberg.de</email>) |
| for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany. |
| </para> |
| </note> |
| </para> |
| |
| <sect1 id="geqo-intro"> |
| <title>Query Handling as a Complex Optimization Problem</title> |
| |
| <para> |
| Among all relational operators the most difficult one to process |
| and optimize is the <firstterm>join</firstterm>. The number of |
| possible query plans grows exponentially with the |
| number of joins in the query. Further optimization effort is |
| caused by the support of a variety of <firstterm>join |
| methods</firstterm> (e.g., nested loop, hash join, merge join in |
| <productname>PostgreSQL</productname>) to process individual joins |
| and a diversity of <firstterm>indexes</firstterm> (e.g., |
| B-tree, hash, GiST and GIN in <productname>PostgreSQL</productname>) as |
| access paths for relations. |
| </para> |
| |
| <para> |
| The normal <productname>PostgreSQL</productname> query optimizer |
| performs a <firstterm>near-exhaustive search</firstterm> over the |
| space of alternative strategies. This algorithm, first introduced |
| in IBM's System R database, produces a near-optimal join order, |
| but can take an enormous amount of time and memory space when the |
| number of joins in the query grows large. This makes the ordinary |
| <productname>PostgreSQL</productname> query optimizer |
| inappropriate for queries that join a large number of tables. |
| </para> |
| |
| <para> |
| The Institute of Automatic Control at the University of Mining and |
| Technology, in Freiberg, Germany, encountered some problems when |
| it wanted to use <productname>PostgreSQL</productname> as the |
| backend for a decision support knowledge based system for the |
| maintenance of an electrical power grid. The DBMS needed to handle |
| large join queries for the inference machine of the knowledge |
| based system. The number of joins in these queries made using the |
| normal query optimizer infeasible. |
| </para> |
| |
| <para> |
| In the following we describe the implementation of a |
| <firstterm>genetic algorithm</firstterm> to solve the join |
| ordering problem in a manner that is efficient for queries |
| involving large numbers of joins. |
| </para> |
| </sect1> |
| |
| <sect1 id="geqo-intro2"> |
| <title>Genetic Algorithms</title> |
| |
| <para> |
| The genetic algorithm (<acronym>GA</acronym>) is a heuristic optimization method which |
| operates through randomized search. The set of possible solutions for the |
| optimization problem is considered as a |
| <firstterm>population</firstterm> of <firstterm>individuals</firstterm>. |
| The degree of adaptation of an individual to its environment is specified |
| by its <firstterm>fitness</firstterm>. |
| </para> |
| |
| <para> |
| The coordinates of an individual in the search space are represented |
| by <firstterm>chromosomes</firstterm>, in essence a set of character |
| strings. A <firstterm>gene</firstterm> is a |
| subsection of a chromosome which encodes the value of a single parameter |
| being optimized. Typical encodings for a gene could be <firstterm>binary</firstterm> or |
| <firstterm>integer</firstterm>. |
| </para> |
| |
| <para> |
| Through simulation of the evolutionary operations <firstterm>recombination</firstterm>, |
| <firstterm>mutation</firstterm>, and |
| <firstterm>selection</firstterm> new generations of search points are found |
| that show a higher average fitness than their ancestors. <xref linkend="geqo-figure"/> |
| illustrates these steps. |
| </para> |
| |
| <figure id="geqo-figure"> |
| <title>Structure of a Genetic Algorithm</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/genetic-algorithm.svg" format="SVG" width="100%"/> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| According to the <systemitem class="resource">comp.ai.genetic</systemitem> <acronym>FAQ</acronym> it cannot be stressed too |
| strongly that a <acronym>GA</acronym> is not a pure random search for a solution to a |
| problem. A <acronym>GA</acronym> uses stochastic processes, but the result is distinctly |
| non-random (better than random). |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="geqo-pg-intro"> |
| <title>Genetic Query Optimization (<acronym>GEQO</acronym>) in PostgreSQL</title> |
| |
| <para> |
| The <acronym>GEQO</acronym> module approaches the query |
| optimization problem as though it were the well-known traveling salesman |
| problem (<acronym>TSP</acronym>). |
| Possible query plans are encoded as integer strings. Each string |
| represents the join order from one relation of the query to the next. |
| For example, the join tree |
| <literallayout class="monospaced"> |
| /\ |
| /\ 2 |
| /\ 3 |
| 4 1 |
| </literallayout> |
| is encoded by the integer string '4-1-3-2', |
| which means, first join relation '4' and '1', then '3', and |
| then '2', where 1, 2, 3, 4 are relation IDs within the |
| <productname>PostgreSQL</productname> optimizer. |
| </para> |
| |
| <para> |
| Specific characteristics of the <acronym>GEQO</acronym> |
| implementation in <productname>PostgreSQL</productname> |
| are: |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| Usage of a <firstterm>steady state</firstterm> <acronym>GA</acronym> (replacement of the least fit |
| individuals in a population, not whole-generational replacement) |
| allows fast convergence towards improved query plans. This is |
| essential for query handling with reasonable time; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Usage of <firstterm>edge recombination crossover</firstterm> |
| which is especially suited to keep edge losses low for the |
| solution of the <acronym>TSP</acronym> by means of a |
| <acronym>GA</acronym>; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Mutation as genetic operator is deprecated so that no repair |
| mechanisms are needed to generate legal <acronym>TSP</acronym> tours. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| Parts of the <acronym>GEQO</acronym> module are adapted from D. Whitley's |
| Genitor algorithm. |
| </para> |
| |
| <para> |
| The <acronym>GEQO</acronym> module allows |
| the <productname>PostgreSQL</productname> query optimizer to |
| support large join queries effectively through |
| non-exhaustive search. |
| </para> |
| |
| <sect2> |
| <title>Generating Possible Plans with <acronym>GEQO</acronym></title> |
| |
| <para> |
| The <acronym>GEQO</acronym> planning process uses the standard planner |
| code to generate plans for scans of individual relations. Then join |
| plans are developed using the genetic approach. As shown above, each |
| candidate join plan is represented by a sequence in which to join |
| the base relations. In the initial stage, the <acronym>GEQO</acronym> |
| code simply generates some possible join sequences at random. For each |
| join sequence considered, the standard planner code is invoked to |
| estimate the cost of performing the query using that join sequence. |
| (For each step of the join sequence, all three possible join strategies |
| are considered; and all the initially-determined relation scan plans |
| are available. The estimated cost is the cheapest of these |
| possibilities.) Join sequences with lower estimated cost are considered |
| <quote>more fit</quote> than those with higher cost. The genetic algorithm |
| discards the least fit candidates. Then new candidates are generated |
| by combining genes of more-fit candidates — that is, by using |
| randomly-chosen portions of known low-cost join sequences to create |
| new sequences for consideration. This process is repeated until a |
| preset number of join sequences have been considered; then the best |
| one found at any time during the search is used to generate the finished |
| plan. |
| </para> |
| |
| <para> |
| This process is inherently nondeterministic, because of the randomized |
| choices made during both the initial population selection and subsequent |
| <quote>mutation</quote> of the best candidates. To avoid surprising changes |
| of the selected plan, each run of the GEQO algorithm restarts its |
| random number generator with the current <xref linkend="guc-geqo-seed"/> |
| parameter setting. As long as <varname>geqo_seed</varname> and the other |
| GEQO parameters are kept fixed, the same plan will be generated for a |
| given query (and other planner inputs such as statistics). To experiment |
| with different search paths, try changing <varname>geqo_seed</varname>. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="geqo-future"> |
| <title>Future Implementation Tasks for |
| <productname>PostgreSQL</productname> <acronym>GEQO</acronym></title> |
| |
| <para> |
| Work is still needed to improve the genetic algorithm parameter |
| settings. |
| In file <filename>src/backend/optimizer/geqo/geqo_main.c</filename>, |
| routines |
| <function>gimme_pool_size</function> and <function>gimme_number_generations</function>, |
| we have to find a compromise for the parameter settings |
| to satisfy two competing demands: |
| <itemizedlist spacing="compact"> |
| <listitem> |
| <para> |
| Optimality of the query plan |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Computing time |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| In the current implementation, the fitness of each candidate join |
| sequence is estimated by running the standard planner's join selection |
| and cost estimation code from scratch. To the extent that different |
| candidates use similar sub-sequences of joins, a great deal of work |
| will be repeated. This could be made significantly faster by retaining |
| cost estimates for sub-joins. The problem is to avoid expending |
| unreasonable amounts of memory on retaining that state. |
| </para> |
| |
| <para> |
| At a more basic level, it is not clear that solving query optimization |
| with a GA algorithm designed for TSP is appropriate. In the TSP case, |
| the cost associated with any substring (partial tour) is independent |
| of the rest of the tour, but this is certainly not true for query |
| optimization. Thus it is questionable whether edge recombination |
| crossover is the most effective mutation procedure. |
| </para> |
| |
| </sect2> |
| </sect1> |
| |
| <sect1 id="geqo-biblio"> |
| <title>Further Reading</title> |
| |
| <para> |
| The following resources contain additional information about |
| genetic algorithms: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <ulink url="http://www.faqs.org/faqs/ai-faq/genetic/part1/"> |
| The Hitch-Hiker's Guide to Evolutionary Computation</ulink>, (FAQ for <ulink |
| url="news://comp.ai.genetic"></ulink>) |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <ulink url="https://www.red3d.com/cwr/evolve.html"> |
| Evolutionary Computation and its application to art and design</ulink>, by |
| Craig Reynolds |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <xref linkend="elma04"/> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <xref linkend="fong"/> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| </sect1> |
| </chapter> |