| # regexp-tree |
| |
| [](https://travis-ci.org/DmitrySoshnikov/regexp-tree) [](https://badge.fury.io/js/regexp-tree) [](https://www.npmjs.com/package/regexp-tree) |
| |
| Regular expressions processor in JavaScript |
| |
| TL;DR: **RegExp Tree** is a _regular expressions processor_, which includes _parser_, _traversal_, _transformer_, _optimizer_, and _interpreter_ APIs. |
| |
| You can get an overview of the tool in [this article](https://medium.com/@DmitrySoshnikov/regexp-tree-a-regular-expressions-parser-with-a-simple-ast-format-bcd4d5580df6). |
| |
| ### Table of Contents |
| |
| - [Installation](#installation) |
| - [Development](#development) |
| - [Usage as a CLI](#usage-as-a-cli) |
| - [Usage from Node](#usage-from-node) |
| - [Capturing locations](#capturing-locations) |
| - [Using traversal API](#using-traversal-api) |
| - [Using transform API](#using-transform-api) |
| - [Transform plugins](#transform-plugins) |
| - [Using generator API](#using-generator-api) |
| - [Using optimizer API](#using-optimizer-api) |
| - [Optimizer ESLint plugin](#optimizer-eslint-plugin) |
| - [Using compat-transpiler API](#using-compat-transpiler-api) |
| - [Compat-transpiler Babel plugin](#compat-transpiler-babel-plugin) |
| - [RegExp extensions](#regexp-extensions) |
| - [RegExp extensions Babel plugin](#regexp-extensions-babel-plugin) |
| - [Creating RegExp objects](#creating-regexp-objects) |
| - [Executing regexes](#executing-regexes) |
| - [Using interpreter API](#using-interpreter-api) |
| - [Printing NFA/DFA tables](#printing-nfadfa-tables) |
| - [AST nodes specification](#ast-nodes-specification) |
| |
| ### Installation |
| |
| The parser can be installed as an [npm module](https://www.npmjs.com/package/regexp-tree): |
| |
| ``` |
| npm install -g regexp-tree |
| ``` |
| |
| You can also [try it online](https://astexplorer.net/#/gist/4ea2b52f0e546af6fb14f9b2f5671c1c/39b55944da3e5782396ffa1fea3ba68d126cd394) using _AST Explorer_. |
| |
| ### Development |
| |
| 1. Fork https://github.com/DmitrySoshnikov/regexp-tree repo |
| 2. If there is an actual issue from the [issues](https://github.com/DmitrySoshnikov/regexp-tree/issues) list you'd like to work on, feel free to assign it yourself, or comment on it to avoid collisions (open a new issue if needed) |
| 3. Make your changes |
| 4. Make sure `npm test` still passes (add new tests if needed) |
| 5. Submit a PR |
| |
| The _regexp-tree_ parser is implemented as an automatic LR parser using [Syntax](https://www.npmjs.com/package/syntax-cli) tool. The parser module is generated from the [regexp grammar](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/parser/regexp.bnf), which is based on the regular expressions grammar used in ECMAScript. |
| |
| For development from the github repository, run `build` command to generate the parser module, and transpile JS code: |
| |
| ``` |
| git clone https://github.com/<your-github-account>/regexp-tree.git |
| cd regexp-tree |
| npm install |
| npm run build |
| ``` |
| |
| > NOTE: JS code transpilation is used to support older versions of Node. For faster development cycle you can use `npm run watch` command, which continuously transpiles JS code. |
| |
| ### Usage as a CLI |
| |
| **Note:** the CLI is exposed as its own [regexp-tree-cli](https://www.npmjs.com/package/regexp-tree-cli) module. |
| |
| Check the options available from CLI: |
| |
| ``` |
| regexp-tree-cli --help |
| ``` |
| |
| ``` |
| Usage: regexp-tree-cli [options] |
| |
| Options: |
| -e, --expression A regular expression to be parsed |
| -l, --loc Whether to capture AST node locations |
| -o, --optimize Applies optimizer on the passed expression |
| -c, --compat Applies compat-transpiler on the passed expression |
| -t, --table Print NFA/DFA transition tables (nfa/dfa/all) |
| ``` |
| |
| To parse a regular expression, pass `-e` option: |
| |
| ``` |
| regexp-tree-cli -e '/a|b/i' |
| ``` |
| |
| Which produces an AST node corresponding to this regular expression: |
| |
| ```js |
| { |
| type: 'RegExp', |
| body: { |
| type: 'Disjunction', |
| left: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| right: { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| }, |
| flags: 'i', |
| } |
| ``` |
| |
| > NOTE: the format of a regexp is `/ Body / OptionalFlags`. |
| |
| ### Usage from Node |
| |
| The parser can also be used as a Node module: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| console.log(regexpTree.parse(/a|b/i)); // RegExp AST |
| ``` |
| |
| Note, _regexp-tree_ supports parsing regexes from strings, and also from actual `RegExp` objects (in general -- from any object which can be coerced to a string). If some feature is not implemented yet in an actual JavaScript RegExp, it should be passed as a string: |
| |
| ```js |
| // Pass an actual JS RegExp object. |
| regexpTree.parse(/a|b/i); |
| |
| // Pass a string, since `s` flag may not be supported in older versions. |
| regexpTree.parse('/./s'); |
| ``` |
| |
| Also note, that in string-mode, escaping is done using two slashes `\\` per JavaScript: |
| |
| ```js |
| // As an actual regexp. |
| regexpTree.parse(/\n/); |
| |
| // As a string. |
| regexpTree.parse('/\\n/'); |
| ``` |
| |
| ### Capturing locations |
| |
| For source code transformation tools it might be useful also to capture _locations_ of the AST nodes. From the command line it's controlled via the `-l` option: |
| |
| ``` |
| regexp-tree-cli -e '/ab/' -l |
| ``` |
| |
| This attaches `loc` object to each AST node: |
| |
| ```js |
| { |
| type: 'RegExp', |
| body: { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97, |
| loc: { |
| start: { |
| line: 1, |
| column: 1, |
| offset: 1, |
| }, |
| end: { |
| line: 1, |
| column: 2, |
| offset: 2, |
| }, |
| } |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98, |
| loc: { |
| start: { |
| line: 1, |
| column: 2, |
| offset: 2, |
| }, |
| end: { |
| line: 1, |
| column: 3, |
| offset: 3, |
| }, |
| } |
| } |
| ], |
| loc: { |
| start: { |
| line: 1, |
| column: 1, |
| offset: 1, |
| }, |
| end: { |
| line: 1, |
| column: 3, |
| offset: 3, |
| }, |
| } |
| }, |
| flags: '', |
| loc: { |
| start: { |
| line: 1, |
| column: 0, |
| offset: 0, |
| }, |
| end: { |
| line: 1, |
| column: 4, |
| offset: 4, |
| }, |
| } |
| } |
| ``` |
| |
| From Node it's controlled via `setOptions` method exposed on the parser: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| const parsed = regexpTree |
| .parser |
| .setOptions({captureLocations: true}) |
| .parse(/a|b/); |
| ``` |
| |
| The `setOptions` method sets global options, which are preserved between calls. It is also possible to provide options per a single `parse` call, which might be more preferred: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| const parsed = regexpTree.parse(/a|b/, { |
| captureLocations: true, |
| }); |
| ``` |
| |
| ### Using traversal API |
| |
| The [traverse](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/traverse) module allows handling needed AST nodes using the _visitor_ pattern. In Node the module is exposed as the `regexpTree.traverse` method. Handlers receive an instance of the [NodePath](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/traverse/README.md#nodepath-class) class, which encapsulates `node` itself, its `parent` node, `property`, and `index` (in case the node is part of a collection). |
| |
| Visiting a node follows this algorithm: |
| - call `pre` handler. |
| - recurse into node's children. |
| - call `post` handler. |
| |
| For each node type of interest, you can provide either: |
| - a function (`pre`). |
| - an object with members `pre` and `post`. |
| |
| You can also provide a `*` handler which will be executed on every node. |
| |
| Example: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| // Get AST. |
| const ast = regexpTree.parse('/[a-z]{1,}/'); |
| |
| // Traverse AST nodes. |
| regexpTree.traverse(ast, { |
| |
| // Visit every node before any type-specific handlers. |
| '*': function({node}) { |
| ... |
| }, |
| |
| // Handle "Quantifier" node type. |
| Quantifier({node}) { |
| ... |
| }, |
| |
| // Handle "Char" node type, before and after. |
| Char: { |
| pre({node}) { |
| ... |
| }, |
| post({node}) { |
| ... |
| } |
| } |
| |
| }); |
| |
| // Generate the regexp. |
| const re = regexpTree.generate(ast); |
| |
| console.log(re); // '/[a-z]+/' |
| ``` |
| |
| ### Using transform API |
| |
| > NOTE: you can play with transformation APIs, and write actual transforms for quick tests in AST Explorer. See [this example](http://astexplorer.net/#/gist/d293d22742b42cd1f7ee7b7e5dc6f697/39b0aabc42fb6fb106b9e368341d3300098f08c0). |
| |
| While traverse module provides basic traversal API, which can be used for any purposes of AST handling, _transform_ module focuses mainly on _transformation_ of regular expressions. |
| |
| It accepts a regular expressions in different formats (string, an actual `RegExp` object, or an AST), applies a set of transformations, and retuns an instance of [TransformResult](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/transform/README.md#transformresult). Handles receive as a parameter the same [NodePath](https://github.com/DmitrySoshnikov/regexp-tree/blob/master/src/traverse/README.md#nodepath-class) object used in traverse. |
| |
| Example: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| // Handle nodes. |
| const re = regexpTree.transform('/[a-z]{1,}/i', { |
| |
| /** |
| * Handle "Quantifier" node type, |
| * transforming `{1,}` quantifier to `+`. |
| */ |
| Quantifier(path) { |
| const {node} = path; |
| |
| // {1,} -> + |
| if ( |
| node.kind === 'Range' && |
| node.from === 1 && |
| !node.to |
| ) { |
| path.replace({ |
| type: 'Quantifier', |
| kind: '+', |
| greedy: node.greedy, |
| }); |
| } |
| }, |
| }); |
| |
| console.log(re.toString()); // '/[a-z]+/i' |
| console.log(re.toRegExp()); // /[a-z]+/i |
| console.log(re.getAST()); // AST for /[a-z]+/i |
| ``` |
| |
| #### Transform plugins |
| |
| A _transformation plugin_ is a module which exports a _transformation handler_. We have seen [above](#using-transform-api) how we can pass a handler object directly to the `regexpTree.transform` method, here we extract it into a separate module, so it can be implemented and shared independently: |
| |
| Example of a plugin: |
| |
| ```js |
| // file: ./regexp-tree-a-to-b-transform.js |
| |
| |
| /** |
| * This plugin replaces chars 'a' with chars 'b'. |
| */ |
| module.exports = { |
| Char({node}) { |
| if (node.kind === 'simple' && node.value === 'a') { |
| node.value = 'b'; |
| node.symbol = 'b'; |
| node.codePoint = 98; |
| } |
| }, |
| }; |
| ``` |
| |
| Once we have this plugin ready, we can require it, and pass to the `transform` function: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| const plugin = require('./regexp-tree-a-to-b-transform'); |
| |
| const re = regexpTree.transform(/(a|c)a+[a-z]/, plugin); |
| |
| console.log(re.toRegExp()); // /(b|c)b+[b-z]/ |
| ``` |
| |
| > NOTE: we can also pass a _list of plugins_ to the `regexpTree.transform`. In this case the plugins are applied in one pass in order. Another approach is to run several sequential calls to `transform`, setting up a pipeline, when a transformed AST is passed further to another plugin, etc. |
| |
| You can see other examples of transform plugins in the [optimizer/transforms](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer/transforms) or in the [compat-transpiler/transforms](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/compat-transpiler/transforms) directories. |
| |
| ### Using generator API |
| |
| The [generator](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/generator) module generates regular expressions from corresponding AST nodes. In Node the module is exposed as `regexpTree.generate` method. |
| |
| Example: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| const re = regexpTree.generate({ |
| type: 'RegExp', |
| body: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| flags: 'i', |
| }); |
| |
| console.log(re); // '/a/i' |
| ``` |
| |
| ### Using optimizer API |
| |
| [Optimizer](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer) transforms your regexp into an _optimized_ version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines. |
| |
| > NOTE: the Optimizer is implemented as a set of _regexp-tree_ [plugins](#transform-plugins). |
| |
| Example: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| const originalRe = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/; |
| |
| const optimizedRe = regexpTree |
| .optimize(originalRe) |
| .toRegExp(); |
| |
| console.log(optimizedRe); // /\w+e+/ |
| ``` |
| |
| From CLI the optimizer is available via `--optimize` (`-o`) option: |
| |
| ``` |
| regexp-tree-cli -e '/[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/' -o |
| ``` |
| |
| Result: |
| |
| ``` |
| Optimized: /\w+e+/ |
| ``` |
| |
| See the [optimizer README](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer) for more details. |
| |
| #### Optimizer ESLint plugin |
| |
| The [optimizer](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/optimizer) module is also available as an _ESLint plugin_, which can be installed at: [eslint-plugin-optimize-regex](https://www.npmjs.com/package/eslint-plugin-optimize-regex). |
| |
| ### Using compat-transpiler API |
| |
| The [compat-transpiler](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/compat-transpiler) module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don't yet implement the new syntax. |
| |
| > NOTE: the compat-transpiler is implemented as a set of _regexp-tree_ [plugins](#transform-plugins). |
| |
| Example, "dotAll" `s` flag: |
| |
| |
| ```js |
| /./s |
| ``` |
| |
| Is translated into: |
| |
| ```js |
| /[\0-\uFFFF]/ |
| ``` |
| |
| Or [named capturing groups](#named-capturing-group): |
| |
| ```js |
| /(?<value>a)\k<value>\1/ |
| ``` |
| |
| Becomes: |
| |
| ```js |
| /(a)\1\1/ |
| ``` |
| |
| To use the API from Node: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| // Using new syntax. |
| const originalRe = '/(?<all>.)\\k<all>/s'; |
| |
| // For legacy engines. |
| const compatTranspiledRe = regexpTree |
| .compatTranspile(originalRe) |
| .toRegExp(); |
| |
| console.log(compatTranspiledRe); // /([\0-\uFFFF])\1/ |
| ``` |
| |
| From CLI the compat-transpiler is available via `--compat` (`-c`) option: |
| |
| ``` |
| regexp-tree-cli -e '/(?<all>.)\k<all>/s' -c |
| ``` |
| |
| Result: |
| |
| ``` |
| Compat: /([\0-\uFFFF])\1/ |
| ``` |
| |
| #### Compat-transpiler Babel plugin |
| |
| The [compat-transpiler](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/compat-transpiler) module is also available as a _Babel plugin_, which can be installed at: [babel-plugin-transform-modern-regexp](https://www.npmjs.com/package/babel-plugin-transform-modern-regexp). |
| |
| Note, the plugin also includes [extended regexp](#regexp-extensions) features. |
| |
| ### RegExp extensions |
| |
| Besides future proposals, like [named capturing group](#named-capturing-group), and other which are being currently standardized, _regexp-tree_ also supports _non-standard_ features. |
| |
| > NOTE: _"non-standard"_ means specifically ECMAScript standard, since in other regexp egnines, e.g. PCRE, Python, etc. these features are standard. |
| |
| One of such featurs is `x` flag, which enables _extended_ mode of regular expressions. In this mode most of whitespaces are ignored, and expressions can use #-comments. |
| |
| Example: |
| |
| ```regex |
| / |
| # A regular expression for date. |
| |
| (?<year>\d{4})- # year part of a date |
| (?<month>\d{2})- # month part of a date |
| (?<day>\d{2}) # day part of a date |
| |
| /x |
| ``` |
| |
| This is normally parsed by the _regexp-tree_ parser, and [compat-transpiler](#using-compat-transpiler-api) has full support for it; it's translated into: |
| |
| ```regex |
| /(\d{4})-(\d{2})-(\d{2})/ |
| ``` |
| |
| #### RegExp extensions Babel plugin |
| |
| The regexp extensions are also available as a _Babel plugin_, which can be installed at: [babel-plugin-transform-modern-regexp](https://www.npmjs.com/package/babel-plugin-transform-modern-regexp). |
| |
| Note, the plugin also includes [compat-transpiler](#using-compat-transpiler-api) features. |
| |
| ### Creating RegExp objects |
| |
| To create an actual `RegExp` JavaScript object, we can use `regexpTree.toRegExp` method: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| const re = regexpTree.toRegExp('/[a-z]/i'); |
| |
| console.log( |
| re.test('a'), // true |
| re.test('Z'), // true |
| ); |
| ``` |
| |
| ### Executing regexes |
| |
| It is also possible to execute regular expressions using `exec` API method, which has support for new syntax, and features, such as [named capturing group](#named-capturing-group), etc: |
| |
| ```js |
| const regexpTree = require('regexp-tree'); |
| |
| const re = `/ |
| |
| # A regular expression for date. |
| |
| (?<year>\\d{4})- # year part of a date |
| (?<month>\\d{2})- # month part of a date |
| (?<day>\\d{2}) # day part of a date |
| |
| /x`; |
| |
| const string = '2017-04-14'; |
| |
| const result = regexpTree.exec(re, string); |
| |
| console.log(result.groups); // {year: '2017', month: '04', day: '14'} |
| ``` |
| |
| ### Using interpreter API |
| |
| > NOTE: you can read more about implementation details of the interpreter in [this series of articles](https://medium.com/@DmitrySoshnikov/building-a-regexp-machine-part-1-regular-grammars-d4986b585d7e). |
| |
| In addition to executing regular expressions using JavaScript built-in RegExp engine, RegExp Tree also implements own [interpreter](https://github.com/DmitrySoshnikov/regexp-tree/tree/master/src/interpreter/finite-automaton) based on classic NFA/DFA finite automaton engine. |
| |
| Currently it aims educational purposes -- to trace the regexp matching process, transitioning in NFA/DFA states. It also allows building state transitioning table, which can be used for custom implementation. In API the module is exposed as `fa` (finite-automaton) object. |
| |
| Example: |
| |
| ```js |
| const {fa} = require('regexp-tree'); |
| |
| const re = /ab|c*/; |
| |
| console.log(fa.test(re, 'ab')); // true |
| console.log(fa.test(re, '')); // true |
| console.log(fa.test(re, 'c')); // true |
| |
| // NFA, and its transition table. |
| const nfa = fa.toNFA(re); |
| console.log(nfa.getTransitionTable()); |
| |
| // DFA, and its transition table. |
| const dfa = fa.toDFA(re); |
| console.log(dfa.getTransitionTable()); |
| ``` |
| |
| For more granular work with NFA and DFA, `fa` module also exposes convenient builders, so you can build NFA fragments directly: |
| |
| ```js |
| const {fa} = require('regexp-tree'); |
| |
| const { |
| alt, |
| char, |
| or, |
| rep, |
| } = fa.builders; |
| |
| // ab|c* |
| const re = or( |
| alt(char('a'), char('b')), |
| rep(char('c')) |
| ); |
| |
| console.log(re.matches('ab')); // true |
| console.log(re.matches('')); // true |
| console.log(re.matches('c')); // true |
| |
| // Build DFA from NFA |
| const {DFA} = fa; |
| |
| const reDFA = new DFA(re); |
| |
| console.log(reDFA.matches('ab')); // true |
| console.log(reDFA.matches('')); // true |
| console.log(reDFA.matches('c')); // true |
| ``` |
| |
| #### Printing NFA/DFA tables |
| |
| The `--table` option allows displaying NFA/DFA transition tables. RegExp Tree also applies _DFA minimization_ (using _N-equivalence_ algorithm), and produces the minimal transition table as its final result. |
| |
| In the example below for the `/a|b|c/` regexp, we first obtain the NFA transition table, which is further converted to the original DFA transition table (down from the 10 non-deterministic states to 4 deterministic states), and eventually minimized to the final DFA table (from 4 to only 2 states). |
| |
| ``` |
| ./bin/regexp-tree-cli -e '/a|b|c/' --table all |
| ``` |
| |
| Result: |
| |
| ``` |
| > - starting |
| ✓ - accepting |
| |
| NFA transition table: |
| |
| ┌─────┬───┬───┬────┬─────────────┐ |
| │ │ a │ b │ c │ ε* │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 1 > │ │ │ │ {1,2,3,7,9} │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 2 │ │ │ │ {2,3,7} │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 3 │ 4 │ │ │ 3 │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 4 │ │ │ │ {4,5,6} │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 5 │ │ │ │ {5,6} │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 6 ✓ │ │ │ │ 6 │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 7 │ │ 8 │ │ 7 │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 8 │ │ │ │ {8,5,6} │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 9 │ │ │ 10 │ 9 │ |
| ├─────┼───┼───┼────┼─────────────┤ |
| │ 10 │ │ │ │ {10,6} │ |
| └─────┴───┴───┴────┴─────────────┘ |
| |
| |
| DFA: Original transition table: |
| |
| ┌─────┬───┬───┬───┐ |
| │ │ a │ b │ c │ |
| ├─────┼───┼───┼───┤ |
| │ 1 > │ 4 │ 3 │ 2 │ |
| ├─────┼───┼───┼───┤ |
| │ 2 ✓ │ │ │ │ |
| ├─────┼───┼───┼───┤ |
| │ 3 ✓ │ │ │ │ |
| ├─────┼───┼───┼───┤ |
| │ 4 ✓ │ │ │ │ |
| └─────┴───┴───┴───┘ |
| |
| |
| DFA: Minimized transition table: |
| |
| ┌─────┬───┬───┬───┐ |
| │ │ a │ b │ c │ |
| ├─────┼───┼───┼───┤ |
| │ 1 > │ 2 │ 2 │ 2 │ |
| ├─────┼───┼───┼───┤ |
| │ 2 ✓ │ │ │ │ |
| └─────┴───┴───┴───┘ |
| ``` |
| |
| ### AST nodes specification |
| |
| Below are the AST node types for different regular expressions patterns: |
| |
| - [Char](#char) |
| - [Simple char](#simple-char) |
| - [Escaped char](#escaped-char) |
| - [Meta char](#meta-char) |
| - [Control char](#control-char) |
| - [Hex char-code](#hex-char-code) |
| - [Decimal char-code](#decimal-char-code) |
| - [Octal char-code](#octal-char-code) |
| - [Unicode](#unicode) |
| - [Character class](#character-class) |
| - [Positive character class](#positive-character-class) |
| - [Negative character class](#negative-character-class) |
| - [Character class ranges](#character-class-ranges) |
| - [Unicode properties](#unicode-properties) |
| - [Alternative](#alternative) |
| - [Disjunction](#disjunction) |
| - [Groups](#groups) |
| - [Capturing group](#capturing-group) |
| - [Named capturing group](#named-capturing-group) |
| - [Non-capturing group](#non-capturing-group) |
| - [Backreferences](#backreferences) |
| - [Quantifiers](#quantifiers) |
| - [? zero-or-one](#-zero-or-one) |
| - [* zero-or-more](#-zero-or-more) |
| - [+ one-or-more](#-one-or-more) |
| - [Range-based quantifiers](#range-based-quantifiers) |
| - [Exact number of matches](#exact-number-of-matches) |
| - [Open range](#open-range) |
| - [Closed range](#closed-range) |
| - [Non-greedy](#non-greedy) |
| - [Assertions](#assertions) |
| - [^ begin marker](#-begin-marker) |
| - [$ end marker](#-end-marker) |
| - [Boundary assertions](#boundary-assertions) |
| - [Lookahead assertions](#lookahead-assertions) |
| - [Positive lookahead assertion](#positive-lookahead-assertion) |
| - [Negative lookahead assertion](#negative-lookahead-assertion) |
| - [Lookbehind assertions](#lookbehind-assertions) |
| - [Positive lookbehind assertion](#positive-lookbehind-assertion) |
| - [Negative lookbehind assertion](#negative-lookbehind-assertion) |
| |
| #### Char |
| |
| A basic building block, single character. Can be _escaped_, and be of different _kinds_. |
| |
| ##### Simple char |
| |
| Basic _non-escaped_ char in a regexp: |
| |
| ``` |
| z |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: 'z', |
| symbol: 'z', |
| kind: 'simple', |
| codePoint: 122 |
| } |
| ``` |
| |
| > NOTE: to test this from CLI, the char should be in an actual regexp -- `/z/`. |
| |
| ##### Escaped char |
| |
| ``` |
| \z |
| ``` |
| |
| The same value, `escaped` flag is added: |
| |
| ```js |
| { |
| type: 'Char', |
| value: 'z', |
| symbol: 'z', |
| kind: 'simple', |
| codePoint: 122, |
| escaped: true |
| } |
| ``` |
| |
| Escaping is mostly used with meta symbols: |
| |
| ``` |
| // Syntax error |
| * |
| ``` |
| |
| ``` |
| \* |
| ``` |
| |
| OK, node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '*', |
| symbol: '*', |
| kind: 'simple', |
| codePoint: 42, |
| escaped: true |
| } |
| ``` |
| |
| ##### Meta char |
| |
| A _meta character_ should not be confused with an [escaped char](#escaped-char). |
| |
| Example: |
| |
| ``` |
| \n |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\n', |
| symbol: '\n', |
| kind: 'meta', |
| codePoint: 10 |
| } |
| ``` |
| |
| Among other meta character are: `.`, `\f`, `\r`, `\n`, `\t`, `\v`, `\0`, `[\b]` (backspace char), `\s`, `\S`, `\w`, `\W`, `\d`, `\D`. |
| |
| > NOTE: Meta characters representing ranges (like `.`, `\s`, etc.) have `undefined` value for `symbol` and `NaN` for `codePoint`. |
| |
| > NOTE: `\b` and `\B` are parsed as `Assertion` node type, not `Char`. |
| |
| ##### Control char |
| |
| A char preceded with `\c`, e.g. `\cx`, which stands for `CTRL+x`: |
| |
| ``` |
| \cx |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\cx', |
| symbol: undefined, |
| kind: 'control', |
| codePoint: NaN |
| } |
| ``` |
| |
| ##### HEX char-code |
| |
| A char preceded with `\x`, followed by a HEX-code, e.g. `\x3B` (symbol `;`): |
| |
| ``` |
| \x3B |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\x3B', |
| symbol: ';', |
| kind: 'hex', |
| codePoint: 59 |
| } |
| ``` |
| |
| ##### Decimal char-code |
| |
| Char-code: |
| |
| ``` |
| \42 |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\42', |
| symbol: '*', |
| kind: 'decimal', |
| codePoint: 42 |
| } |
| ``` |
| |
| ##### Octal char-code |
| |
| Char-code started with `\0`, followed by an octal number: |
| |
| ``` |
| \073 |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\073', |
| symbol: ';', |
| kind: 'oct', |
| codePoint: 59 |
| } |
| ``` |
| |
| ##### Unicode |
| |
| Unicode char started with `\u`, followed by a hex number: |
| |
| ``` |
| \u003B |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\u003B', |
| symbol: ';', |
| kind: 'unicode', |
| codePoint: 59 |
| } |
| ``` |
| |
| When using the `u` flag, unicode chars can also be represented using `\u` followed by a hex number between curly braces: |
| |
| ``` |
| \u{1F680} |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\u{1F680}', |
| symbol: '🚀', |
| kind: 'unicode', |
| codePoint: 128640 |
| } |
| ``` |
| |
| When using the `u` flag, unicode chars can also be represented using a surrogate pair: |
| |
| ``` |
| \ud83d\ude80 |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Char', |
| value: '\\ud83d\\ude80', |
| symbol: '🚀', |
| kind: 'unicode', |
| codePoint: 128640, |
| isSurrogatePair: true |
| } |
| ``` |
| |
| #### Character class |
| |
| Character classes define a _set_ of characters. A set may include as simple characters, as well as _character ranges_. A class can be _positive_ (any from the characters in the class match), or _negative_ (any _but_ the characters from the class match). |
| |
| ##### Positive character class |
| |
| A positive character class is defined between `[` and `]` brackets: |
| |
| ``` |
| [a*] |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'CharacterClass', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Char', |
| value: '*', |
| symbol: '*', |
| kind: 'simple', |
| codePoint: 42 |
| } |
| ] |
| } |
| ``` |
| |
| > NOTE: some meta symbols are treated as normal characters in a character class. E.g. `*` is not a repetition quantifier, but a simple char. |
| |
| ##### Negative character class |
| |
| A negative character class is defined between `[^` and `]` brackets: |
| |
| ``` |
| [^ab] |
| ``` |
| |
| An AST node is the same, just `negative` property is added: |
| |
| ```js |
| { |
| type: 'CharacterClass', |
| negative: true, |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| ] |
| } |
| ``` |
| |
| ##### Character class ranges |
| |
| As mentioned, a character class may also contain _ranges_ of symbols: |
| |
| ``` |
| [a-z] |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'CharacterClass', |
| expressions: [ |
| { |
| type: 'ClassRange', |
| from: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| to: { |
| type: 'Char', |
| value: 'z', |
| symbol: 'z', |
| kind: 'simple', |
| codePoint: 122 |
| } |
| } |
| ] |
| } |
| ``` |
| |
| > NOTE: it is a _syntax error_ if `to` value is less than `from` value: `/[z-a]/`. |
| |
| The range value can be the same for `from` and `to`, and the special range `-` character is treated as a simple character when it stands in a char position: |
| |
| ``` |
| // from: 'a', to: 'a' |
| [a-a] |
| |
| // from: '-', to: '-' |
| [---] |
| |
| // simple '-' char: |
| [-] |
| |
| // 3 ranges: |
| [a-zA-Z0-9]+ |
| ``` |
| |
| #### Unicode properties |
| |
| Unicode property escapes are a new type of escape sequence available in regular expressions that have the `u` flag set. With this feature it is possible to write Unicode expressions as: |
| |
| ```js |
| const greekSymbolRe = /\p{Script=Greek}/u; |
| |
| greekSymbolRe.test('π'); // true |
| ``` |
| |
| The AST node for this expression is: |
| |
| ```js |
| { |
| type: 'UnicodeProperty', |
| name: 'Script', |
| value: 'Greek', |
| negative: false, |
| shorthand: false, |
| binary: false, |
| canonicalName: 'Script', |
| canonicalValue: 'Greek' |
| } |
| ``` |
| |
| All possible property names, values, and their aliases can be found at the [specification](https://tc39.github.io/ecma262/#sec-runtime-semantics-unicodematchproperty-p). |
| |
| For `General_Category` it is possible to use a shorthand: |
| |
| ```js |
| /\p{Letter}/u; // Shorthand |
| |
| /\p{General_Category=Letter}/u; // Full notation |
| ``` |
| |
| Binary names use the single value as well: |
| |
| ```js |
| /\p{ASCII_Hex_Digit}/u; // Same as: /[0-9A-Fa-f]/ |
| ``` |
| |
| The capitalized `P` defines the negation of the expression: |
| |
| ```js |
| /\P{ASCII_Hex_Digit}/u; // NOT a ASCII Hex digit |
| ``` |
| |
| #### Alternative |
| |
| An _alternative_ (or _concatenation_) defines a chain of patterns followed one after another: |
| |
| ``` |
| abc |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| }, |
| { |
| type: 'Char', |
| value: 'c', |
| symbol: 'c', |
| kind: 'simple', |
| codePoint: 99 |
| } |
| ] |
| } |
| ``` |
| |
| Another examples: |
| |
| ``` |
| // 'a' with a quantifier, followed by 'b' |
| a?b |
| |
| // A group followed by a class: |
| (ab)[a-z] |
| ``` |
| |
| #### Disjunction |
| |
| The _disjunction_ defines "OR" operation for regexp patterns. It's a _binary_ operation, having `left`, and `right` nodes. |
| |
| Matches `a` or `b`: |
| |
| ``` |
| a|b |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Disjunction', |
| left: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| right: { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| } |
| ``` |
| |
| #### Groups |
| |
| The groups play two roles: they define _grouping precedence_, and allow to _capture_ needed sub-expressions in case of a capturing group. |
| |
| ##### Capturing group |
| |
| _"Capturing"_ means the matched string can be referred later by a user, including in the pattern itself -- by using [backreferences](#backreferences). |
| |
| Char `a`, and `b` are grouped, followed by the `c` char: |
| |
| ``` |
| (ab)c |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Group', |
| capturing: true, |
| number: 1, |
| expression: { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| ] |
| } |
| }, |
| { |
| type: 'Char', |
| value: 'c', |
| symbol: 'c', |
| kind: 'simple', |
| codePoint: 99 |
| } |
| ] |
| } |
| ``` |
| |
| As we can see, it also tracks the number of the group. |
| |
| Another example: |
| |
| ``` |
| // A grouped disjunction of a symbol, and a character class: |
| (5|[a-z]) |
| ``` |
| |
| ##### Named capturing group |
| |
| > NOTE: _Named capturing groups_ are not yet supported by JavaScript RegExp. It is an ECMAScript [proposal](https://tc39.github.io/proposal-regexp-named-groups/) which is at stage 3 at the moment. |
| |
| A capturing group can be given a name using the `(?<name>...)` syntax, for any identifier `name`. |
| |
| For example, a regular expressions for a date: |
| |
| ```js |
| /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u |
| ``` |
| |
| For the group: |
| |
| ```js |
| (?<foo>x) |
| ``` |
| |
| We have the following node (the `name` property with value `foo` is added): |
| |
| ```js |
| { |
| type: 'Group', |
| capturing: true, |
| name: 'foo', |
| number: 1, |
| expression: { |
| type: 'Char', |
| value: 'x', |
| symbol: 'x', |
| kind: 'simple', |
| codePoint: 120 |
| } |
| } |
| ``` |
| |
| ##### Non-capturing group |
| |
| Sometimes we don't need to actually capture the matched string from a group. In this case we can use a _non-capturing_ group: |
| |
| Char `a`, and `b` are grouped, _but not captured_, followed by the `c` char: |
| |
| ``` |
| (?:ab)c |
| ``` |
| |
| The same node, the `capturing` flag is `false`: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Group', |
| capturing: false, |
| expression: { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| ] |
| } |
| }, |
| { |
| type: 'Char', |
| value: 'c', |
| symbol: 'c', |
| kind: 'simple', |
| codePoint: 99 |
| } |
| ] |
| } |
| ``` |
| |
| ##### Backreferences |
| |
| A [capturing group](#capturing-group) can be referenced in the pattern using notation of an escaped group number. |
| |
| Matches `abab` string: |
| |
| ``` |
| (ab)\1 |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Group', |
| capturing: true, |
| number: 1, |
| expression: { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| ] |
| } |
| }, |
| { |
| type: 'Backreference', |
| kind: 'number', |
| number: 1, |
| reference: 1, |
| } |
| ] |
| } |
| ``` |
| |
| A [named capturing group](#named-capturing-group) can be accessed using `\k<name>` pattern, and also using a numbered reference. |
| |
| Matches `www`: |
| |
| ```js |
| (?<foo>w)\k<foo>\1 |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Group', |
| capturing: true, |
| name: 'foo', |
| number: 1, |
| expression: { |
| type: 'Char', |
| value: 'w', |
| symbol: 'w', |
| kind: 'simple', |
| codePoint: 119 |
| } |
| }, |
| { |
| type: 'Backreference', |
| kind: 'name', |
| number: 1, |
| reference: 'foo' |
| }, |
| { |
| type: 'Backreference', |
| kind: 'number', |
| number: 1, |
| reference: 1 |
| } |
| ] |
| } |
| ``` |
| |
| #### Quantifiers |
| |
| Quantifiers specify _repetition_ of a regular expression (or of its part). Below are the quantifiers which _wrap_ a parsed expression into a `Repetition` node. The quantifier itself can be of different _kinds_, and has `Quantifier` node type. |
| |
| ##### ? zero-or-one |
| |
| The `?` quantifier is short for `{0,1}`. |
| |
| ``` |
| a? |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: '?', |
| greedy: true |
| } |
| } |
| ``` |
| |
| ##### * zero-or-more |
| |
| The `*` quantifier is short for `{0,}`. |
| |
| ``` |
| a* |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: '*', |
| greedy: true |
| } |
| } |
| ``` |
| |
| ##### + one-or-more |
| |
| The `+` quantifier is short for `{1,}`. |
| |
| ``` |
| // Same as `aa*`, or `a{1,}` |
| a+ |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: '+', |
| greedy: true |
| } |
| } |
| ``` |
| |
| ##### Range-based quantifiers |
| |
| Explicit _range-based_ quantifiers are parsed as follows: |
| |
| ###### Exact number of matches |
| |
| ``` |
| a{3} |
| ``` |
| |
| The type of the quantifier is `Range`, and `from`, and `to` properties have the same value: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: 'Range', |
| from: 3, |
| to: 3, |
| greedy: true |
| } |
| } |
| ``` |
| |
| ###### Open range |
| |
| An open range doesn't have max value (assuming semantic "more", or Infinity value): |
| |
| ``` |
| a{3,} |
| ``` |
| |
| An AST node for such range doesn't contain `to` property: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: 'Range', |
| from: 3, |
| greedy: true |
| } |
| } |
| ``` |
| |
| ###### Closed range |
| |
| A closed range has explicit max value: (which syntactically can be the same as min value): |
| |
| ``` |
| a{3,5} |
| |
| // Same as a{3} |
| a{3,3} |
| ``` |
| |
| An AST node for a closed range: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: 'Range', |
| from: 3, |
| to: 5, |
| greedy: true |
| } |
| } |
| ``` |
| |
| > NOTE: it is a _syntax error_ if the max value is less than min value: `/a{3,2}/` |
| |
| ##### Non-greedy |
| |
| If any quantifier is followed by the `?`, the quantifier becomes _non-greedy_. |
| |
| Example: |
| |
| ``` |
| a+? |
| ``` |
| |
| Node: |
| |
| ```js |
| { |
| type: 'Repetition', |
| expression: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| quantifier: { |
| type: 'Quantifier', |
| kind: '+', |
| greedy: false |
| } |
| } |
| ``` |
| |
| Other examples: |
| |
| ``` |
| a?? |
| a*? |
| a{1}? |
| a{1,}? |
| a{1,3}? |
| ``` |
| |
| #### Assertions |
| |
| Assertions appear as separate AST nodes, however instread of manipulating on the characters themselves, they _assert_ certain conditions of a matching string. Examples: `^` -- beginning of a string (or a line in multiline mode), `$` -- end of a string, etc. |
| |
| ##### ^ begin marker |
| |
| The `^` assertion checks whether a scanner is at the beginning of a string (or a line in multiline mode). |
| |
| In the example below `^` is not a property of the `a` symbol, but a separate AST node for the assertion. The parsed node is actually an `Alternative` with two nodes: |
| |
| ``` |
| ^a |
| ``` |
| |
| The node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Assertion', |
| kind: '^' |
| }, |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| } |
| ] |
| } |
| ``` |
| |
| Since assertion is a separate node, it may appear anywhere in the matching string. The following regexp is completely valid, and asserts beginning of the string; it'll match an empty string: |
| |
| ``` |
| ^^^^^ |
| ``` |
| |
| ##### $ end marker |
| |
| The `$` assertion is similar to `^`, but asserts the end of a string (or a line in a multiline mode): |
| |
| ``` |
| a$ |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Assertion', |
| kind: '$' |
| } |
| ] |
| } |
| ``` |
| |
| And again, this is a completely valid regexp, and matches an empty string: |
| |
| ``` |
| ^^^^$$$$$ |
| |
| // valid too: |
| $^ |
| ``` |
| |
| ##### Boundary assertions |
| |
| The `\b` assertion check for _word boundary_, i.e. the position between a word and a space. |
| |
| Matches `x` in `x y`, but not in `xy`: |
| |
| ``` |
| x\b |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'x', |
| symbol: 'x', |
| kind: 'simple', |
| codePoint: 120 |
| }, |
| { |
| type: 'Assertion', |
| kind: '\\b' |
| } |
| ] |
| } |
| ``` |
| |
| The `\B` is vice-versa checks for _non-word_ boundary. The following example matches `x` in `xy`, but not in `x y`: |
| |
| ``` |
| x\B |
| ``` |
| |
| A node is the same: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'x', |
| symbol: 'x', |
| kind: 'simple', |
| codePoint: 120 |
| }, |
| { |
| type: 'Assertion', |
| kind: '\\B' |
| } |
| ] |
| } |
| ``` |
| |
| ##### Lookahead assertions |
| |
| These assertions check whether a pattern is _followed_ (or not followed for the negative assertion) by another pattern. |
| |
| ###### Positive lookahead assertion |
| |
| Matches `a` only if it's followed by `b`: |
| |
| ``` |
| a(?=b) |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Assertion', |
| kind: 'Lookahead', |
| assertion: { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| } |
| ] |
| } |
| ``` |
| |
| ###### Negative lookahead assertion |
| |
| Matches `a` only if it's _not_ followed by `b`: |
| |
| ``` |
| a(?!b) |
| ``` |
| |
| A node is similar, just `negative` flag is added: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| }, |
| { |
| type: 'Assertion', |
| kind: 'Lookahead', |
| negative: true, |
| assertion: { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| } |
| } |
| ] |
| } |
| ``` |
| |
| ##### Lookbehind assertions |
| |
| > NOTE: _Lookbehind assertions_ are not yet supported by JavaScript RegExp. It is an ECMAScript [proposal](https://tc39.github.io/proposal-regexp-lookbehind/) which is at stage 3 at the moment. |
| |
| These assertions check whether a pattern is _preceded_ (or not preceded for the negative assertion) by another pattern. |
| |
| ###### Positive lookbehind assertion |
| |
| Matches `b` only if it's preceded by `a`: |
| |
| ``` |
| (?<=a)b |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Assertion', |
| kind: 'Lookbehind', |
| assertion: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| } |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| }, |
| ] |
| } |
| ``` |
| |
| ###### Negative lookbehind assertion |
| |
| Matches `b` only if it's _not_ preceded by `a`: |
| |
| ``` |
| (?<!a)b |
| ``` |
| |
| A node: |
| |
| ```js |
| { |
| type: 'Alternative', |
| expressions: [ |
| { |
| type: 'Assertion', |
| kind: 'Lookbehind', |
| negative: true, |
| assertion: { |
| type: 'Char', |
| value: 'a', |
| symbol: 'a', |
| kind: 'simple', |
| codePoint: 97 |
| } |
| }, |
| { |
| type: 'Char', |
| value: 'b', |
| symbol: 'b', |
| kind: 'simple', |
| codePoint: 98 |
| }, |
| ] |
| } |
| ``` |