Improve worst-case performance of LIKE filters by 20x (#16153)

* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids backtracking.

Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing.

```
Benchmark                                (cardinality)  Mode  Cnt  Before Score       Error  After Score       Error  Units  Before / After
LikeFilterBenchmark.matchBoundPrefix              1000  avgt   10         6.686 ±     0.026        6.765 ±     0.087  us/op           0.99x
LikeFilterBenchmark.matchBoundPrefix            100000  avgt   10       163.936 ±     1.589      140.014 ±     0.563  us/op           1.17x
LikeFilterBenchmark.matchBoundPrefix           1000000  avgt   10      1235.259 ±     7.318     1165.330 ±     9.300  us/op           1.06x
LikeFilterBenchmark.matchLikeContains             1000  avgt   10       255.074 ±     1.530      130.212 ±     3.314  us/op           1.96x
LikeFilterBenchmark.matchLikeContains           100000  avgt   10     34789.639 ±   210.219    18563.644 ±   100.030  us/op           1.87x
LikeFilterBenchmark.matchLikeContains          1000000  avgt   10    287265.302 ±  1790.957   164684.778 ±   317.698  us/op           1.74x
LikeFilterBenchmark.matchLikeEquals               1000  avgt   10         0.410 ±     0.003        0.399 ±     0.001  us/op           1.03x
LikeFilterBenchmark.matchLikeEquals             100000  avgt   10         0.793 ±     0.005        0.719 ±     0.003  us/op           1.10x
LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10         0.864 ±     0.004        0.839 ±     0.005  us/op           1.03x
LikeFilterBenchmark.matchLikeKiller               1000  avgt   10      3077.629 ±     7.928      103.714 ±     2.417  us/op          29.67x
LikeFilterBenchmark.matchLikeKiller             100000  avgt   10    311048.049 ± 13466.911    14777.567 ±    70.242  us/op          21.05x
LikeFilterBenchmark.matchLikeKiller            1000000  avgt   10   3055855.099 ± 18387.839    92476.621 ±  1198.255  us/op          33.04x
LikeFilterBenchmark.matchLikePrefix               1000  avgt   10         6.711 ±     0.035        6.653 ±     0.046  us/op           1.01x
LikeFilterBenchmark.matchLikePrefix             100000  avgt   10       161.535 ±     0.574      163.740 ±     0.833  us/op           0.99x
LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10      1255.696 ±     5.207     1201.378 ±     3.466  us/op           1.05x
LikeFilterBenchmark.matchRegexContains            1000  avgt   10       467.736 ±     2.546      481.431 ±     5.647  us/op           0.97x
LikeFilterBenchmark.matchRegexContains          100000  avgt   10     64871.766 ±   223.341    65483.992 ±   391.249  us/op           0.99x
LikeFilterBenchmark.matchRegexContains         1000000  avgt   10    482906.004 ±  2003.583   477195.835 ±  3094.605  us/op           1.01x
LikeFilterBenchmark.matchRegexKiller              1000  avgt   10      8071.881 ±    18.026     8052.322 ±    17.336  us/op           1.00x
LikeFilterBenchmark.matchRegexKiller            100000  avgt   10   1120094.520 ±  2428.172   808321.542 ±  2411.032  us/op           1.39x
LikeFilterBenchmark.matchRegexKiller           1000000  avgt   10   8096745.012 ± 40782.747  8114114.896 ± 43250.204  us/op           1.00x
LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10       170.843 ±     1.095      175.924 ±     1.144  us/op           0.97x
LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10     17785.280 ±   116.813    18708.888 ±    61.857  us/op           0.95x
LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10    174415.586 ±  1827.478   173190.799 ±   949.224  us/op           1.01x
LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10         0.411 ±     0.003        0.416 ±     0.002  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10         0.728 ±     0.003        0.739 ±     0.003  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10         0.842 ±     0.002        0.879 ±     0.007  us/op           0.96x
```

* Take into account whether druid.generic.useDefaultValueForNull is set in LikeDimFilterTest assertions.

* Attempt to placate CodeQL.

* Fix handling of multi-pattern suffixes.

* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up in [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids the catastrophic backtracking, converting the `LIKE` pattern into a list of `java.util.regex.Pattern` by splitting on the `%` wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop using `Matcher.find()` to progress through the string is used.

Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.15x speed up for a simple "contains" query (`%50%`), and more than a 20x speed up for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially-catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental — or intentional — DoSing.

```
Benchmark                                      (cardinality)  Mode  Cnt  Before Score       Error      After Score     Error  Units  Before/After
LikeFilterBenchmark.matchBoundPrefix                    1000  avgt   10         5.410 ±     0.010          5.582 ±     0.004  us/op         0.97x
LikeFilterBenchmark.matchBoundPrefix                  100000  avgt   10       140.920 ±     0.306        141.082 ±     0.391  us/op         1.00x
LikeFilterBenchmark.matchBoundPrefix                 1000000  avgt   10      1082.762 ±     1.070       1171.407 ±     1.628  us/op         0.92x
LikeFilterBenchmark.matchLikeComplexContains            1000  avgt   10       221.572 ±     0.228        183.742 ±     0.210  us/op         1.21x
LikeFilterBenchmark.matchLikeComplexContains          100000  avgt   10     25461.362 ±    21.481      17373.828 ±    42.577  us/op         1.47x
LikeFilterBenchmark.matchLikeComplexContains         1000000  avgt   10    221075.917 ±   919.238     177454.683 ±   506.420  us/op         1.25x
LikeFilterBenchmark.matchLikeContains                   1000  avgt   10       283.015 ±     0.219        218.835 ±     3.126  us/op         1.29x
LikeFilterBenchmark.matchLikeContains                 100000  avgt   10     30202.910 ±    32.697      26713.488 ±    49.525  us/op         1.13x
LikeFilterBenchmark.matchLikeContains                1000000  avgt   10    284661.411 ±   130.324     243381.857 ±   540.143  us/op         1.17x
LikeFilterBenchmark.matchLikeEquals                     1000  avgt   10         0.386 ±     0.001          0.380 ±     0.001  us/op         1.02x
LikeFilterBenchmark.matchLikeEquals                   100000  avgt   10         0.670 ±     0.001          0.705 ±     0.002  us/op         0.95x
LikeFilterBenchmark.matchLikeEquals                  1000000  avgt   10         0.839 ±     0.001          0.796 ±     0.001  us/op         1.05x
LikeFilterBenchmark.matchLikeKiller                     1000  avgt   10      4882.099 ±     7.953        170.142 ±     0.494  us/op        28.69x
LikeFilterBenchmark.matchLikeKiller                   100000  avgt   10    524122.010 ±   390.170      19461.637 ±   117.090  us/op        26.93x
LikeFilterBenchmark.matchLikeKiller                  1000000  avgt   10   5121795.377 ±  4176.052     181162.978 ±   368.443  us/op        28.27x
LikeFilterBenchmark.matchLikePrefix                     1000  avgt   10         5.708 ±     0.005          5.677 ±     0.011  us/op         1.01x
LikeFilterBenchmark.matchLikePrefix                   100000  avgt   10       141.853 ±     0.554        108.313 ±     0.330  us/op         1.31x
LikeFilterBenchmark.matchLikePrefix                  1000000  avgt   10      1199.148 ±     1.298       1153.297 ±     1.575  us/op         1.04x
LikeFilterBenchmark.matchLikeSuffix                     1000  avgt   10       256.020 ±     0.283        196.339 ±     0.564  us/op         1.30x
LikeFilterBenchmark.matchLikeSuffix                   100000  avgt   10     29917.931 ±    28.218      21450.997 ±    20.341  us/op         1.39x
LikeFilterBenchmark.matchLikeSuffix                  1000000  avgt   10    241225.193 ±   465.824     194034.292 ±   362.312  us/op         1.24x
LikeFilterBenchmark.matchRegexComplexContains           1000  avgt   10       119.597 ±     0.635        135.550 ±     0.697  us/op         0.88x
LikeFilterBenchmark.matchRegexComplexContains         100000  avgt   10     13089.670 ±    13.738      13766.712 ±    12.802  us/op         0.95x
LikeFilterBenchmark.matchRegexComplexContains        1000000  avgt   10    130822.830 ±  1624.048     131076.029 ±  1636.811  us/op         1.00x
LikeFilterBenchmark.matchRegexContains                  1000  avgt   10       573.273 ±     0.421        615.399 ±     0.633  us/op         0.93x
LikeFilterBenchmark.matchRegexContains                100000  avgt   10     57259.313 ±   162.747      62900.380 ±    44.746  us/op         0.91x
LikeFilterBenchmark.matchRegexContains               1000000  avgt   10    571335.768 ±  2822.776     542536.982 ±   780.290  us/op         1.05x
LikeFilterBenchmark.matchRegexKiller                    1000  avgt   10     11525.499 ±     8.741      11061.791 ±    21.746  us/op         1.04x
LikeFilterBenchmark.matchRegexKiller                  100000  avgt   10   1170414.723 ±   766.160    1144437.291 ±   886.263  us/op         1.02x
LikeFilterBenchmark.matchRegexKiller                 1000000  avgt   10  11507668.302 ± 11318.176  110381620.014 ± 10707.974  us/op         1.11x
LikeFilterBenchmark.matchRegexPrefix                    1000  avgt   10       156.460 ±     0.097        155.217 ±     0.431  us/op         1.01x
LikeFilterBenchmark.matchRegexPrefix                  100000  avgt   10     15056.491 ±    23.906      15508.965 ±   763.976  us/op         0.97x
LikeFilterBenchmark.matchRegexPrefix                 1000000  avgt   10    154416.563 ±   473.108     153737.912 ±   273.347  us/op         1.00x
LikeFilterBenchmark.matchRegexSuffix                    1000  avgt   10       610.684 ±     0.462        590.352 ±     0.334  us/op         1.03x
LikeFilterBenchmark.matchRegexSuffix                  100000  avgt   10     53196.517 ±    78.155      59460.261 ±    56.934  us/op         0.89x
LikeFilterBenchmark.matchRegexSuffix                 1000000  avgt   10    536100.944 ±   440.353     550098.917 ±   740.464  us/op         0.97x
LikeFilterBenchmark.matchSelectorEquals                 1000  avgt   10         0.390 ±     0.001          0.366 ±     0.001  us/op         1.07x
LikeFilterBenchmark.matchSelectorEquals               100000  avgt   10         0.724 ±     0.001          0.714 ±     0.001  us/op         1.01x
LikeFilterBenchmark.matchSelectorEquals              1000000  avgt   10         0.826 ±     0.001          0.847 ±     0.001  us/op         0.98x
```
3 files changed
tree: 670b49ac1c184c138a5d9cd28a101a82129fb625
  1. .github/
  2. .idea/
  3. benchmarks/
  4. cloud/
  5. codestyle/
  6. dev/
  7. distribution/
  8. docs/
  9. examples/
  10. extensions-contrib/
  11. extensions-core/
  12. hooks/
  13. indexing-hadoop/
  14. indexing-service/
  15. integration-tests/
  16. integration-tests-ex/
  17. licenses/
  18. processing/
  19. publications/
  20. server/
  21. services/
  22. sql/
  23. web-console/
  24. website/
  25. .asf.yaml
  26. .backportrc.json
  27. .codecov.yml
  28. .dockerignore
  29. .gitignore
  30. .lgtm.yml
  31. check_test_suite.py
  32. check_test_suite_test.py
  33. CONTRIBUTING.md
  34. doap_Druid.rdf
  35. it.sh
  36. LABELS
  37. LICENSE
  38. licenses.yaml
  39. NOTICE
  40. owasp-dependency-check-suppressions.xml
  41. pom.xml
  42. README.md
  43. README.template
  44. rewrite.yml
  45. upload.sh
README.md

Coverage Status Docker Helm

WorkflowStatus
⚙️ CodeQL Configcodeql-config
🔍 CodeQLcodeql
🕒 Cron Job ITScron-job-its
🏷️ Labelerlabeler
♻️ Reusable Revised ITSreusable-revised-its
♻️ Reusable Standard ITSreusable-standard-its
♻️ Reusable Unit Testsreusable-unit-tests
🔄 Revised ITSrevised-its
🔧 Standard ITSstandard-its
🛠️ Static Checksstatic-checks
🧪 Unit and Integration Tests Unifiedunit-and-integration-tests-unified
🔬 Unit Testsunit-tests

Website Twitter Download Get Started Documentation Community Build Contribute License


Apache Druid

Druid is a high performance real-time analytics database. Druid's main value add is to reduce time to insight and action.

Druid is designed for workflows where fast queries and ingest really matter. Druid excels at powering UIs, running operational (ad-hoc) queries, or handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases. The design documentation explains the key concepts.

Getting started

You can get started with Druid with our local or Docker quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in web console (shown below).

Load data

data loader Kafka

Load streaming and batch data using a point-and-click wizard to guide you through ingestion setup. Monitor one off tasks and ingestion supervisors.

Manage the cluster

management

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All powered by SQL systems tables, allowing you to see the underlying query for each view.

Issue queries

query view combo

Use the built-in query workbench to prototype DruidSQL and native queries or connect one of the many tools that help you make the most out of Druid.

Documentation

See the latest documentation for the documentation for the current official release. If you need information on a previous release, you can browse previous releases documentation.

Make documentation and tutorials updates in /docs using Markdown or extended Markdown (MDX). Then, open a pull request.

To build the site locally, you need Node 16.14 or higher and to install Docusaurus 2 with npm|yarn install in the website directory. Then you can run npm|yarn start to launch a local build of the docs.

If you're looking to update non-doc pages like Use Cases, those files are in the druid-website-src repo.

Community

Visit the official project community page to read about getting involved in contributing to Apache Druid, and how we help one another use and operate Druid.

  • Druid users can find help in the druid-user mailing list on Google Groups, and have more technical conversations in #troubleshooting on Slack.
  • Druid development discussions take place in the druid-dev mailing list (dev@druid.apache.org). Subscribe by emailing dev-subscribe@druid.apache.org. For live conversations, join the #dev channel on Slack.

Check out the official community page for details of how to join the community Slack channels.

Find articles written by community members and a calendar of upcoming events on the project site - contribute your own events and articles by submitting a PR in the apache/druid-website-src repository.

Building from source

Please note that JDK 8 or JDK 11 is required to build Druid.

See the latest build guide for instructions on building Apache Druid from source.

Contributing

Please follow the community guidelines for contributing.

For instructions on setting up IntelliJ dev/intellij-setup.md

License

Apache License, Version 2.0