src/backend/regex/README - cloudberry - Git at Google

 Implementation notes about Henry Spencer's regex library
 ========================================================

 If Henry ever had any internals documentation, he didn't publish it.
 So this file is an attempt to reverse-engineer some docs.

 General source-file layout
 --------------------------

 There are six separately-compilable source files, five of which expose
 exactly one exported function apiece:
 	regcomp.c: pg_regcomp
 	regexec.c: pg_regexec
 	regerror.c: pg_regerror
 	regfree.c: pg_regfree
 	regprefix.c: pg_regprefix
 (The pg_ prefixes were added by the Postgres project to distinguish this
 library version from any similar one that might be present on a particular
 system.  They'd need to be removed or replaced in any standalone version
 of the library.)

 The sixth file, regexport.c, exposes multiple functions that allow extraction
 of info about a compiled regex (see regexport.h).

 There are additional source files regc_*.c that are #include'd in regcomp,
 and similarly additional source files rege_*.c that are #include'd in
 regexec.  This was done to avoid exposing internal symbols globally;
 all functions not meant to be part of the library API are static.

 (Actually the above is a lie in one respect: there are two more global
 symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp.  These are
 not meant to be part of the API, but they have to be global because both
 regcomp and regexec call them.  It'd be better to get rid of
 pg_set_regex_collation, as well as the static variables it sets, in favor of
 keeping the needed locale state in the regex structs.  We have not done this
 yet for lack of a design for how to add application-specific state to the
 structs.)

 What's where in src/backend/regex/:

 regcomp.c		Top-level regex compilation code
 regc_color.c		Color map management
 regc_cvec.c		Character vector (cvec) management
 regc_lex.c		Lexer
 regc_nfa.c		NFA handling
 regc_locale.c		Application-specific locale code from Tcl project
 regc_pg_locale.c	Postgres-added application-specific locale code
 regexec.c		Top-level regex execution code
 rege_dfa.c		DFA creation and execution
 regerror.c		pg_regerror: generate text for a regex error code
 regfree.c		pg_regfree: API to free a no-longer-needed regex_t
 regexport.c		Functions for extracting info from a regex_t
 regprefix.c		Code for extracting a common prefix from a regex_t

 The locale-specific code is concerned primarily with case-folding and with
 expanding locale-specific character classes, such as [[:alnum:]].  It
 really needs refactoring if this is ever to become a standalone library.

 The header files for the library are in src/include/regex/:

 regcustom.h		Customizes library for particular application
 regerrs.h		Error message list
 regex.h			Exported API
 regexport.h		Exported API for regexport.c
 regguts.h		Internals declarations


 DFAs, NFAs, and all that
 ------------------------

 This library is a hybrid DFA/NFA regex implementation.  (If you've never
 heard either of those terms, get thee to a first-year comp sci textbook.)
 It might not be clear at first glance what that really means and how it
 relates to what you'll see in the code.  Here's what really happens:

 * Initial parsing of a regex generates an NFA representation, with number
 of states approximately proportional to the length of the regexp.

 * The NFA is then optimized into a "compact NFA" representation, which is
 basically the same idea but without fields that are not going to be needed
 at runtime.  It is simplified too: the compact format only allows "plain"
 and "LACON" arc types.  The cNFA representation is what is passed from
 regcomp to regexec.

 * Unlike traditional NFA-based regex engines, we do not execute directly
 from the NFA representation, as that would require backtracking and so be
 very slow in some cases.  Rather, we execute a DFA, which ideally can
 process an input string in linear time (O(M) for M characters of input)
 without backtracking.  Each state of the DFA corresponds to a set of
 states of the NFA, that is all the states that the NFA might have been in
 upon reaching the current point in the input string.  Therefore, an NFA
 with N states might require as many as 2^N states in the corresponding
 DFA, which could easily require unreasonable amounts of memory.  We deal
 with this by materializing states of the DFA lazily (only when needed) and
 keeping them in a limited-size cache.  The possible need to build the same
 state of the DFA repeatedly makes this approach not truly O(M) time, but
 in the worst case as much as O(M*N).  That's still far better than the
 worst case for a backtracking NFA engine.

 If that were the end of it, we'd just say this is a DFA engine, with the
 use of NFAs being merely an implementation detail.  However, a DFA engine
 cannot handle some important regex features such as capturing parens and
 back-references.  If the parser finds that a regex uses these features
 (collectively called "messy cases" in the code), then we have to use
 NFA-style backtracking search after all.

 When using the NFA mode, the representation constructed by the parser
 consists of a tree of sub-expressions ("subre"s).  Leaf tree nodes are
 either plain regular expressions (which are executed as DFAs in the manner
 described above) or back-references (which try to match the input to some
 previous substring).  Non-leaf nodes are capture nodes (which save the
 location of the substring currently matching their child node),
 concatenation, alternation, or iteration nodes.  At execution time, the
 executor recursively scans the tree.  At concatenation, alternation, or
 iteration nodes, it considers each possible alternative way of matching the
 input string, that is each place where the string could be split for a
 concatenation or iteration, or each child node for an alternation.  It
 tries the next alternative if the match fails according to the child nodes.
 This is exactly the sort of backtracking search done by a traditional NFA
 regex engine.  If there are many tree levels it can get very slow.

 But all is not lost: we can still be smarter than the average pure NFA
 engine.  To do this, each subre node has an associated DFA, which
 represents what the node could possibly match insofar as a mathematically
 pure regex can describe that, which basically means "no backrefs".
 Before we perform any search of possible alternative sub-matches, we run
 the DFA to see if it thinks the proposed substring could possibly match.
 If not, we can reject the match immediately without iterating through many
 possibilities.

 As an example, consider the regex "(a[bc]+)\1".  The compiled
 representation will have a top-level concatenation subre node.  Its first
 child is a plain DFA node for "a[bc]+" (which is marked as being a capture
 node).  The concatenation's second child is a backref node for \1.
 The DFA associated with the concatenation node will be "a[bc]+a[bc]+",
 where the backref has been replaced by a copy of the DFA for its referent
 expression.  When executed, the concatenation node will have to search for
 a possible division of the input string that allows its two child nodes to
 each match their part of the string (and although this specific case can
 only succeed when the division is at the middle, the code does not know
 that, nor would it be true in general).  However, we can first run the DFA
 and quickly reject any input that doesn't start with an "a" and contain
 one more "a" plus some number of b's and c's.  If the DFA doesn't match,
 there is no need to recurse to the two child nodes for each possible
 string division point.  In many cases, this prefiltering makes the search
 run much faster than a pure NFA engine could do.  It is this behavior that
 justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
 library.

 It's perhaps worth noting that separate capture subre nodes are a rarity:
 normally, we just mark a subre as capturing and that's it.  However, it's
 legal to write a regex like "((x))" in which the same substring has to be
 captured by multiple sets of parentheses.  Since a subre has room for only
 one "capno" field, a single subre can't handle that.  We handle such cases
 by wrapping the base subre (which captures the innermost parens) in a
 no-op capture node, or even more than one for "(((x)))" etc.  This is a
 little bit inefficient because we end up with multiple identical NFAs,
 but since the case is pointless and infrequent, it's not worth working
 harder.


 Colors and colormapping
 -----------------------

 In many common regex patterns, there are large numbers of characters that
 can be treated alike by the execution engine.  A simple example is the
 pattern "[[:alpha:]][[:alnum:]]*" for an identifier.  Basically the engine
 only needs to care whether an input symbol is a letter, a digit, or other.
 We could build the NFA or DFA with a separate arc for each possible letter
 and digit, but that's very wasteful of space and not so cheap to execute
 either, especially when dealing with Unicode which can have thousands of
 letters.  Instead, the parser builds a "color map" that maps each possible
 input symbol to a "color", or equivalence class.  The NFA or DFA
 representation then has arcs labeled with colors, not specific input
 symbols.  At execution, the first thing the executor does with each input
 symbol is to look up its color in the color map, and then everything else
 works from the color only.

 To build the colormap, we start by assigning every possible input symbol
 the color WHITE, which means "other" (that is, at the end of parsing, the
 symbols that are still WHITE are those not explicitly referenced anywhere
 in the regex).  When we see a simple literal character or a bracket
 expression in the regex, we want to assign that character, or all the
 characters represented by the bracket expression, a unique new color that
 can be used to label the NFA arc corresponding to the state transition for
 matching this character or bracket expression.  The basic idea is:
 first, change the color assigned to a character to some new value;
 second, run through all the existing arcs in the partially-built NFA,
 and for each one referencing the character's old color, add a parallel
 arc referencing its new color (this keeps the reassignment from changing
 the semantics of what we already built); and third, add a new arc with
 the character's new color to the current pair of NFA states, denoting
 that seeing this character allows the state transition to be made.

 This is complicated a bit by not wanting to create more colors
 (equivalence classes) than absolutely necessary.  In particular, if a
 bracket expression mentions two characters that had the same color before,
 they should still share the same color after we process the bracket, since
 there is still not a need to distinguish them.  But we do need to
 distinguish them from other characters that previously had the same color
 yet are not listed in the bracket expression.  To mechanize this, the code
 has a concept of "parent colors" and "subcolors", where a color's subcolor
 is the new color that we are giving to any characters of that color while
 parsing the current atom.  (The word "parent" is a bit unfortunate here,
 because it suggests a long-lived relationship, but a subcolor link really
 only lasts for the duration of parsing a single atom.)  In other words,
 a subcolor link means that we are in process of splitting the parent color
 into two colors (equivalence classes), depending on whether or not each
 member character should be included by the current regex atom.

 As an example, suppose we have the regex "a\d\wx".  Initially all possible
 character codes are labeled WHITE (color 0).  To parse the atom "a", we
 create a new color (1), update "a"'s color map entry to 1, and create an
 arc labeled 1 between the first two states of the NFA.  Now we see \d,
 which is really a bracket expression containing the digits "0"-"9".
 First we process "0", which is currently WHITE, so we create a new color
 (2), update "0"'s color map entry to 2, and create an arc labeled 2
 between the second and third states of the NFA.  We also mark color WHITE
 as having the subcolor 2, which means that future relabelings of WHITE
 characters should also select 2 as the new color.  Thus, when we process
 "1", we won't create a new color but re-use 2.  We update "1"'s color map
 entry to 2, and then find that we don't need a new arc because there is
 already one labeled 2 between the second and third states of the NFA.
 Similarly for the other 8 digits, so there will be only one arc labeled 2
 between NFA states 2 and 3 for all members of this bracket expression.
 At completion of processing of the bracket expression, we call okcolors()
 which breaks all the existing parent/subcolor links; there is no longer a
 marker saying that WHITE characters should be relabeled 2.  (Note:
 actually, we did the same creation and clearing of a subcolor link for the
 primitive atom "a", but it didn't do anything very interesting.)  Now we
 come to the "\w" bracket expression, which for simplicity assume expands
 to just "[a-z0-9]".  We process "a", but observe that it is already the
 sole member of its color 1.  This means there is no need to subdivide that
 equivalence class more finely, so we do not create any new color.  We just
 make an arc labeled 1 between the third and fourth NFA states.  Next we
 process "b", which is WHITE and far from the only WHITE character, so we
 create a new color (3), link that as WHITE's subcolor, relabel "b" as
 color 3, and make an arc labeled 3.  As we process "c" through "z", each
 is relabeled from WHITE to 3, but no new arc is needed.  Now we come to
 "0", which is not the only member of its color 2, so we suppose that a new
 color is needed and create color 4.  We link 4 as subcolor of 2, relabel
 "0" as color 4 in the map, and add an arc for color 4.  Next "1" through
 "9" are similarly relabeled as color 4, with no additional arcs needed.
 Having finished the bracket expression, we call okcolors(), which breaks
 the subcolor links.  okcolors() further observes that we have removed
 every member of color 2 (the previous color of the digit characters).
 Therefore, it runs through the partial NFA built so far and relabels arcs
 labeled 2 to color 4; in particular the arc from NFA state 2 to state 3 is
 relabeled color 4.  Then it frees up color 2, since we have no more use
 for that color.  We now have an NFA in which transitions for digits are
 consistently labeled with color 4.  Last, we come to the atom "x".
 "x" is currently labeled with color 3, and it's not the only member of
 that color, so we realize that we now need to distinguish "x" from other
 letters when we did not before.  We create a new color, which might have
 been 5 but instead we recycle the unused color 2.  "x" is relabeled 2 in
 the color map and 2 is linked as the subcolor of 3, and we add an arc for
 2 between states 4 and 5 of the NFA.  Now we call okcolors(), which breaks
 the subcolor link between colors 3 and 2 and notices that both colors are
 nonempty.  Therefore, it also runs through the existing NFA arcs and adds
 an additional arc labeled 2 wherever there is an arc labeled 3; this
 action ensures that characters of color 2 (i.e., "x") will still be
 considered as allowing any transitions they did before.  We are now done
 parsing the regex, and we have these final color assignments:
 	color 1: "a"
 	color 2: "x"
 	color 3: other letters
 	color 4: digits
 and the NFA has these arcs:
 	states 1 -> 2 on color 1 (hence, "a" only)
 	states 2 -> 3 on color 4 (digits)
 	states 3 -> 4 on colors 1, 3, 4, and 2 (covering all \w characters)
 	states 4 -> 5 on color 2 ("x" only)
 which can be seen to be a correct representation of the regex.

 There is one more complexity, which is how to handle ".", that is a
 match-anything atom.  We used to do that by generating a "rainbow"
 of arcs of all live colors between the two NFA states before and after
 the dot.  That's expensive in itself when there are lots of colors,
 and it also typically adds lots of follow-on arc-splitting work for the
 color splitting logic.  Now we handle this case by generating a single arc
 labeled with the special color RAINBOW, meaning all colors.  Such arcs
 never need to be split, so they help keep NFAs small in this common case.
 (Note: this optimization doesn't help in REG_NLSTOP mode, where "." is
 not supposed to match newline.  In that case we still handle "." by
 generating an almost-rainbow of all colors except newline's color.)

 Given this summary, we can see we need the following operations for
 colors:

 * A fast way to look up the current color assignment for any character
   code.  (This is needed during both parsing and execution, while the
   remaining operations are needed only during parsing.)
 * A way to alter the color assignment for any given character code.
 * We must track the number of characters currently assigned to each
   color, so that we can detect empty and singleton colors.
 * We must track all existing NFA arcs of a given color, so that we
   can relabel them at need, or add parallel arcs of a new color when
   an existing color has to be subdivided.

 The last two of these are handled with the "struct colordesc" array and
 the "colorchain" links in NFA arc structs.

 Ideally, we'd do the first two operations using a simple linear array
 storing the current color assignment for each character code.
 Unfortunately, that's not terribly workable for large charsets such as
 Unicode.  Our solution is to divide the color map into two parts.  A simple
 linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
 chosen large enough to include all popular characters (so that the
 significantly-slower code paths about to be described are seldom invoked).
 Characters above that need be considered at compile time only if they
 appear explicitly in the regex pattern.  We store each such mentioned
 character or character range as an entry in the "colormaprange" array in
 the colormap.  (Overlapping ranges are split into unique subranges, so that
 each range in the finished list needs only a single color that describes
 all its characters.)  When mapping a character above MAX_SIMPLE_CHR to a
 color at runtime, we search this list of ranges explicitly.

 That's still not quite enough, though, because of locale-dependent
 character classes such as [[:alpha:]].  In Unicode locales these classes
 may have thousands of entries that are above MAX_SIMPLE_CHR, and we
 certainly don't want to be searching large colormaprange arrays at runtime.
 Nor do we even want to spend the time to initialize cvec structures that
 exhaustively describe all of those characters.  Our solution is to compute
 exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR.
 For characters above that, we apply the <ctype.h> or <wctype.h> lookup
 functions at runtime for each locale-dependent character class used in the
 regex pattern, constructing a bitmap that describes which classes the
 runtime character belongs to.  The per-character-range data structure
 mentioned above actually holds, for each range, a separate color entry
 for each possible combination of character class properties.  That is,
 the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
 whose rows correspond to high characters or character ranges that are
 explicitly mentioned in the regex pattern, and whose columns correspond
 to sets of the locale-dependent character classes that are used in the
 regex.

 As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]'
 (and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need
 a high color map with three rows.  One row is for the single character
 U+1234 (represented as a single-element range), one is for the range
 U+1D100..U+1D1FF, and the other row represents all remaining high
 characters.  The color map has two columns, one for characters that
 satisfy iswalnum() and one for those that don't.

 We build this color map in parallel with scanning the regex.  Each time
 we detect a new explicit high character (or range) or a locale-dependent
 character class, we split existing entry(s) in the high color map so that
 characters we need to be able to distinguish will have distinct entries
 that can be given separate colors.  Often, though, single entries in the
 high color map will represent very large sets of characters.

 If there are both explicit high characters/ranges and locale-dependent
 character classes, we may have entries in the high color map array that
 have non-WHITE colors but don't actually represent any real characters.
 (For example, in a row representing a singleton range, only one of the
 columns could possibly be a live entry; it's the one matching the actual
 locale properties for that single character.)  We don't currently make
 any effort to reclaim such colors.  In principle it could be done, but
 it's not clear that it's worth the trouble.


 Detailed semantics of an NFA
 ----------------------------

 When trying to read dumped-out NFAs, it's helpful to know these facts:

 State 0 (additionally marked with "@" in dumpnfa's output) is always the
 goal state, and state 1 (additionally marked with ">") is the start state.
 (The code refers to these as the post state and pre state respectively.)

 The possible arc types are:

     PLAIN arcs, which specify matching of any character of a given "color"
     (see above).  These are dumped as "[color_number]->to_state".
     In addition there can be "rainbow" PLAIN arcs, which are dumped as
     "[*]->to_state".

     EMPTY arcs, which specify a no-op transition to another state.  These
     are dumped as "->to_state".

     AHEAD constraints, which represent a "next character must be of this
     color" constraint.  AHEAD differs from a PLAIN arc in that the input
     character is not consumed when crossing the arc.  These are dumped as
     ">color_number>->to_state", or possibly ">*>->to_state".

     BEHIND constraints, which represent a "previous character must be of
     this color" constraint, which likewise consumes no input.  These are
     dumped as "<color_number<->to_state", or possibly "<*<->to_state".

     '^' arcs, which specify a beginning-of-input constraint.  These are
     dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
     beginning-of-line constraints respectively.

     '$' arcs, which specify an end-of-input constraint.  These are dumped
     as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
     constraints respectively.

     LACON constraints, which represent "(?=re)", "(?!re)", "(?<=re)", and
     "(?<!re)" constraints, i.e. the input starting/ending at this point must
     match (or not match) a given sub-RE, but the matching input is not
     consumed.  These are dumped as ":subtree_number:->to_state".

 If you see anything else (especially any question marks) in the display of
 an arc, it's dumpnfa() trying to tell you that there's something fishy
 about the arc; see the source code.

 The regex executor can only handle PLAIN and LACON transitions.  The regex
 optimize() function is responsible for transforming the parser's output
 to get rid of all the other arc types.  In particular, ^ and $ arcs that
 are not dropped as impossible will always end up adjacent to the pre or
 post state respectively, and then will be converted into PLAIN arcs that
 mention the special "colors" for BOS, BOL, EOS, or EOL.

 To decide whether a thus-transformed NFA matches a given substring of the
 input string, the executor essentially follows these rules:
 1. Start the NFA "looking at" the character *before* the given substring,
 or if the substring is at the start of the input, prepend an imaginary BOS
 character instead.
 2. Run the NFA until it has consumed the character *after* the given
 substring, or an imaginary following EOS character if the substring is at
 the end of the input.
 3. If the NFA is (or can be) in the goal state at this point, it matches.

 This definition is necessary to support regexes that begin or end with
 constraints such as \m and \M, which imply requirements on the adjacent
 character if any.  The executor implements that by checking if the
 adjacent character (or BOS/BOL/EOS/EOL pseudo-character) is of the
 right color, and it does that in the same loop that checks characters
 within the match.

 So one can mentally execute an untransformed NFA by taking ^ and $ as
 ordinary constraints that match at start and end of input; but plain
 arcs out of the start state should be taken as matches for the character
 before the target substring, and similarly, plain arcs leading to the
 post state are matches for the character after the target substring.
 After the optimize() transformation, there are explicit arcs mentioning
 BOS/BOL/EOS/EOL adjacent to the pre-state and post-state.  So a finished
 NFA for a pattern without anchors or adjacent-character constraints will
 have pre-state outarcs for RAINBOW (all possible character colors) as well
 as BOS and BOL, and likewise post-state inarcs for RAINBOW, EOS, and EOL.
 Also note that LACON arcs will never connect to the pre-state
 or post-state.


 Look-around constraints (LACONs)
 --------------------------------

 The regex compiler doesn't have much intelligence about LACONs; it just
 constructs a sub-NFA representing the pattern that the constraint says to
 match or not match, and puts a LACON arc referencing that sub-NFA into the
 main NFA.  At runtime, the executor applies the sub-NFA at each point in
 the string where the constraint is relevant, and then traverses or doesn't
 traverse the arc.  ("Traversal" means including the arc's to-state in the
 set of NFA states that are considered active at the next character.)

 The actual basic matching cycle of the executor is
 1.  Identify the color of the next input character, then advance over it.
 2.  Apply the DFA to follow all the matching "plain" arcs of the NFA.
     (Notionally, the previous DFA state represents the set of states the
     NFA could have been in before the character, and the new DFA state
     represents the set of states the NFA could be in after the character.)
 3.  If there are any LACON arcs leading out of any of the new NFA states,
     apply each LACON constraint starting from the new next input character
     (while not actually consuming any input).  For each successful LACON,
     add its to-state to the current set of NFA states.  If any such
     to-state has outgoing LACON arcs, process those in the same way.
     (Mathematically speaking, we compute the transitive closure of the
     set of states reachable by successful LACONs.)

 Thus, LACONs are always checked immediately after consuming a character
 via a plain arc.  This is okay because the NFA's "pre" state only has
 plain out-arcs, so we can always consume a character (possibly a BOS
 pseudo-character as described above) before we need to worry about LACONs.