website/www/site/content/en/blog/pattern-match-beam-sql.md

title: “Pattern Matching with Beam SQL” date: 2020-08-27 00:00:01 +0800 aliases:

/blog/2020/08/27/pattern-match-beam-sql.html categories:
blog authors:
markzeng

Introduction

SQL is becoming increasingly powerful and useful in the field of data analysis. MATCH_RECOGNIZE, a new SQL component introduced in 2016, brings extra analytical functionality. This project, as part of Google Summer of Code, aims to support basic MATCH_RECOGNIZE functionality. A basic MATCH_RECOGNIZE query would be something like this: {{< highlight sql >}} SELECT T.aid, T.bid, T.cid FROM MyTable MATCH_RECOGNIZE ( PARTITION BY userid ORDER BY proctime MEASURES A.id AS aid, B.id AS bid, C.id AS cid PATTERN (A B C) DEFINE A AS name = ‘a’, B AS name = ‘b’, C AS name = ‘c’ ) AS T {{< /highlight >}}

The above query finds out ordered sets of events that have names ‘a’, ‘b’ and ‘c’. Apart from this basic usage of MATCH_RECOGNIZE, I supported a few of other crucial features such as quantifiers and row pattern navigation. I will spell out the details in later sections.

Approach & Discussion

The implementation is strongly based on BEAM core transforms. Specifically, one MATCH_RECOGNIZE execution composes the following series of transforms:

A ParDo transform and then a GroupByKey transform that build up the partitions (PARTITION BY).
A ParDo transform that sorts within each partition (ORDER BY).
A ParDo transform that applies pattern-match in each sorted partition.

A pattern-match operation was first done with the java regex library. That is, I first transform rows within a partition into a string and then apply regex pattern-match routines. If a row satisfies a condition, then I output the corresponding pattern variable. This is ok under the assumption that the pattern definitions are mutually exclusive. That is, a pattern definition like A AS A.price > 0, B AS b.price < 0 is allowed while a pattern definition like A AS A.price > 0, B AS B.proctime > 0 might results in an incomplete match. For the latter case, an event can satisfy the conditions A and B at the same time. Mutually exclusive conditions gives deterministic pattern-match: each event can only belong to at most one pattern class.

As specified in the SQL 2016 document, MATCH_RECOGNIZE defines a richer set of expression than regular expression. Specifically, it introduces Row Pattern Navigation Operations such as PREV and NEXT. This is perhaps one of the most intriguing feature of MATCH_RECOGNIZE. A regex library would no longer suffice the need since the pattern definition could be back-referencing (PREV) or forward-referencing (NEXT). So for the second version of implementation, we chose to use an NFA regex engine. An NFA brings more flexibility in terms of non-determinism (see Chapter 6 of SQL 2016 Part 5 for a more thorough discussion). My proposed NFA is based on a paper of UMASS.

This is a working project. Many of the components are still not supported. I will list some unimplemented work in the section of future work.

Usages

For now, the components I supported are:

PARTITION BY
ORDER BY
MEASURES
1. LAST
2. FIRST
ONE ROW PER MATCH/ALL ROWS PER MATCH
DEFINE
1. Left side of the condition
  1. LAST
2. Right side of the condition
  1. PREV
Quantifier
1. Kleene plus

The pattern definition evaluation is hard coded. To be more specific, it expects the column reference of the incoming row to be on the left side of a comparator. Additionally, PREV function can only appear on the right side of the comparator.

With these limited tools, we could already write some slightly more complicated queries. Imagine we have the following table:

transTime	price
1	3
2	2
3	1
4	5
5	6

This table reflects the price changes of a product with respect to the transaction time. We could write the following query:

{{< highlight sql >}} SELECT * FROM MyTable MATCH_RECOGNIZE ( ORDER BY transTime MEASURES LAST(A.price) AS beforePrice, FIRST(B.price) AS afterPrice PATTERN (A+ B+) DEFINE A AS price < PREV(A.price), B AS price > PREV(B.price) ) AS T {{< /highlight >}}

This will find the local minimum price and the price after it. For the example dataset, the first 3 rows will be mapped to A and the rest of the rows will be mapped to B. Thus, we will have (1, 5) as the result.

Very important: For my NFA implementation, it slightly breaks the rule in the SQL standard. Since the buffered NFA only stores an event to the buffer if the event is a match to some pattern class, There would be no way to get the previous event back if the previous row is discarded. So the first row would always be a match (different from the standard) if PREV is used.

Progress

PRs
1. Support MATCH_RECOGNIZE using regex library (merged)
2. Support MATCH_RECOGNIZE using NFA (pending)
Commits
1. partition by: commit 064ada7
2. order by: commit 9cd1a82
3. regex pattern match: commit 8d6ffcc
4. support quantifiers: commit f529b87
5. measures: commit 8793574
6. added NFA implementation: commit fc731f2
7. implemented functions PREV and LAST: commit 35323da

Future Work

Support FINAL/RUNNING keywords.
Support more quantifiers.
Add optimization to the NFA.
A better way to realize MATCH_RECOGNIZE might be having a Complex Event Processing library at BEAM core (rather than using BEAM transforms).

Introduction

Approach & Discussion

Usages

Progress

Future Work

References