title: “Pattern Matching with Beam SQL” date: 2020-08-27 00:00:01 +0800 aliases:
SQL is becoming increasingly powerful and useful in the field of data analysis. MATCH_RECOGNIZE, a new SQL component introduced in 2016, brings extra analytical functionality. This project, as part of Google Summer of Code, aims to support basic MATCH_RECOGNIZE functionality. A basic MATCH_RECOGNIZE query would be something like this: {{< highlight sql >}} SELECT T.aid, T.bid, T.cid FROM MyTable MATCH_RECOGNIZE ( PARTITION BY userid ORDER BY proctime MEASURES A.id AS aid, B.id AS bid, C.id AS cid PATTERN (A B C) DEFINE A AS name = ‘a’, B AS name = ‘b’, C AS name = ‘c’ ) AS T {{< /highlight >}}
The above query finds out ordered sets of events that have names ‘a’, ‘b’ and ‘c’. Apart from this basic usage of MATCH_RECOGNIZE, I supported a few of other crucial features such as quantifiers and row pattern navigation. I will spell out the details in later sections.
The implementation is strongly based on BEAM core transforms. Specifically, one MATCH_RECOGNIZE execution composes the following series of transforms:
ParDo transform and then a GroupByKey transform that build up the partitions (PARTITION BY).ParDo transform that sorts within each partition (ORDER BY).ParDo transform that applies pattern-match in each sorted partition.A pattern-match operation was first done with the java regex library. That is, I first transform rows within a partition into a string and then apply regex pattern-match routines. If a row satisfies a condition, then I output the corresponding pattern variable. This is ok under the assumption that the pattern definitions are mutually exclusive. That is, a pattern definition like A AS A.price > 0, B AS b.price < 0 is allowed while a pattern definition like A AS A.price > 0, B AS B.proctime > 0 might results in an incomplete match. For the latter case, an event can satisfy the conditions A and B at the same time. Mutually exclusive conditions gives deterministic pattern-match: each event can only belong to at most one pattern class.
As specified in the SQL 2016 document, MATCH_RECOGNIZE defines a richer set of expression than regular expression. Specifically, it introduces Row Pattern Navigation Operations such as PREV and NEXT. This is perhaps one of the most intriguing feature of MATCH_RECOGNIZE. A regex library would no longer suffice the need since the pattern definition could be back-referencing (PREV) or forward-referencing (NEXT). So for the second version of implementation, we chose to use an NFA regex engine. An NFA brings more flexibility in terms of non-determinism (see Chapter 6 of SQL 2016 Part 5 for a more thorough discussion). My proposed NFA is based on a paper of UMASS.
This is a working project. Many of the components are still not supported. I will list some unimplemented work in the section of future work.
For now, the components I supported are:
The pattern definition evaluation is hard coded. To be more specific, it expects the column reference of the incoming row to be on the left side of a comparator. Additionally, PREV function can only appear on the right side of the comparator.
With these limited tools, we could already write some slightly more complicated queries. Imagine we have the following table:
| transTime | price |
|---|---|
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |
| 4 | 5 |
| 5 | 6 |
This table reflects the price changes of a product with respect to the transaction time. We could write the following query:
{{< highlight sql >}} SELECT * FROM MyTable MATCH_RECOGNIZE ( ORDER BY transTime MEASURES LAST(A.price) AS beforePrice, FIRST(B.price) AS afterPrice PATTERN (A+ B+) DEFINE A AS price < PREV(A.price), B AS price > PREV(B.price) ) AS T {{< /highlight >}}
This will find the local minimum price and the price after it. For the example dataset, the first 3 rows will be mapped to A and the rest of the rows will be mapped to B. Thus, we will have (1, 5) as the result.
Very important: For my NFA implementation, it slightly breaks the rule in the SQL standard. Since the buffered NFA only stores an event to the buffer if the event is a match to some pattern class, There would be no way to get the previous event back if the previous row is discarded. So the first row would always be a match (different from the standard) if PREV is used.