| <!-- |
| # |
| # Licensed to the Apache Software Foundation (ASF) under one |
| # or more contributor license agreements. See the NOTICE file |
| # distributed with this work for additional information |
| # regarding copyright ownership. The ASF licenses this file |
| # to you under the Apache License, Version 2.0 (the |
| # "License"); you may not use this file except in compliance |
| # with the License. You may obtain a copy of the License at |
| # |
| # http://www.apache.org/licenses/LICENSE-2.0 |
| # |
| # Unless required by applicable law or agreed to in writing, software |
| # distributed under the License is distributed on an "AS IS" BASIS, |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| # See the License for the specific language governing permissions and |
| # limitations under the License. |
| # |
| --> |
| |
| # SASIIndex |
| |
| [`SASIIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/SASIIndex.java), |
| or "SASI" for short, is an implementation of Cassandra's |
| `Index` interface that can be used as an alternative to the |
| existing implementations. SASI's indexing and querying improves on |
| existing implementations by tailoring it specifically to Cassandra's |
| needs. SASI has superior performance in cases where queries would |
| previously require filtering. In achieving this performance, SASI aims |
| to be significantly less resource intensive than existing |
| implementations, in memory, disk, and CPU usage. In addition, SASI |
| supports prefix and contains queries on strings (similar to SQL's |
| `LIKE = "foo*"` or `LIKE = "*foo*"'`). |
| |
| The following goes on describe how to get up and running with SASI, |
| demonstrates usage with examples, and provides some details on its |
| implementation. |
| |
| ## Using SASI |
| |
| The examples below walk through creating a table and indexes on its |
| columns, and performing queries on some inserted data. |
| |
| The examples below assume the `demo` keyspace has been created and is |
| in use. |
| |
| ``` |
| cqlsh> CREATE KEYSPACE demo WITH replication = { |
| ... 'class': 'SimpleStrategy', |
| ... 'replication_factor': '1' |
| ... }; |
| cqlsh> USE demo; |
| ``` |
| |
| All examples are performed on the `sasi` table: |
| |
| ``` |
| cqlsh:demo> CREATE TABLE sasi (id uuid, first_name text, last_name text, |
| ... age int, height int, created_at bigint, primary key (id)); |
| ``` |
| |
| #### Creating Indexes |
| |
| To create SASI indexes use CQLs `CREATE CUSTOM INDEX` statement: |
| |
| ``` |
| cqlsh:demo> CREATE CUSTOM INDEX ON sasi (first_name) USING 'org.apache.cassandra.index.sasi.SASIIndex' |
| ... WITH OPTIONS = { |
| ... 'analyzer_class': |
| ... 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', |
| ... 'case_sensitive': 'false' |
| ... }; |
| |
| cqlsh:demo> CREATE CUSTOM INDEX ON sasi (last_name) USING 'org.apache.cassandra.index.sasi.SASIIndex' |
| ... WITH OPTIONS = {'mode': 'CONTAINS'}; |
| |
| cqlsh:demo> CREATE CUSTOM INDEX ON sasi (age) USING 'org.apache.cassandra.index.sasi.SASIIndex'; |
| |
| cqlsh:demo> CREATE CUSTOM INDEX ON sasi (created_at) USING 'org.apache.cassandra.index.sasi.SASIIndex' |
| ... WITH OPTIONS = {'mode': 'SPARSE'}; |
| ``` |
| |
| The indexes created have some options specified that customize their |
| behaviour and potentially performance. The index on `first_name` is |
| case-insensitive. The analyzers are discussed more in a subsequent |
| example. The `NonTokenizingAnalyzer` performs no analysis on the |
| text. Each index has a mode: `PREFIX`, `CONTAINS`, or `SPARSE`, the |
| first being the default. The `last_name` index is created with the |
| mode `CONTAINS` which matches terms on suffixes instead of prefix |
| only. Examples of this are available below and more detail can be |
| found in the section on |
| [OnDiskIndex](#ondiskindexbuilder).The |
| `created_at` column is created with its mode set to `SPARSE`, which is |
| meant to improve performance of querying large, dense number ranges |
| like timestamps for data inserted every millisecond. Details of the |
| `SPARSE` implementation can also be found in the section on the |
| [OnDiskIndex](#ondiskindexbuilder). The `age` |
| index is created with the default `PREFIX` mode and no |
| case-sensitivity or text analysis options are specified since the |
| field is numeric. |
| |
| After inserting the following data and performing a `nodetool flush`, |
| SASI performing index flushes to disk can be seen in Cassandra's logs |
| -- although the direct call to flush is not required (see |
| [IndexMemtable](#indexmemtable) for more details). |
| |
| ``` |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (556ebd54-cbe5-4b75-9aae-bf2a31a24500, 'Pavel', 'Yaskevich', 27, 181, 1442959315018); |
| |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (5770382a-c56f-4f3f-b755-450e24d55217, 'Jordan', 'West', 26, 173, 1442959315019); |
| |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (96053844-45c3-4f15-b1b7-b02c441d3ee1, 'Mikhail', 'Stepura', 36, 173, 1442959315020); |
| |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (f5dfcabe-de96-4148-9b80-a1c41ed276b4, 'Michael', 'Kjellman', 26, 180, 1442959315021); |
| |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (2970da43-e070-41a8-8bcb-35df7a0e608a, 'Johnny', 'Zhang', 32, 175, 1442959315022); |
| |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (6b757016-631d-4fdb-ac62-40b127ccfbc7, 'Jason', 'Brown', 40, 182, 1442959315023); |
| |
| cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at) |
| ... VALUES (8f909e8a-008e-49dd-8d43-1b0df348ed44, 'Vijay', 'Parthasarathy', 34, 183, 1442959315024); |
| |
| cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi; |
| |
| first_name | last_name | age | height | created_at |
| ------------+---------------+-----+--------+--------------- |
| Michael | Kjellman | 26 | 180 | 1442959315021 |
| Mikhail | Stepura | 36 | 173 | 1442959315020 |
| Jason | Brown | 40 | 182 | 1442959315023 |
| Pavel | Yaskevich | 27 | 181 | 1442959315018 |
| Vijay | Parthasarathy | 34 | 183 | 1442959315024 |
| Jordan | West | 26 | 173 | 1442959315019 |
| Johnny | Zhang | 32 | 175 | 1442959315022 |
| |
| (7 rows) |
| ``` |
| |
| #### Equality & Prefix Queries |
| |
| SASI supports all queries already supported by CQL, including LIKE statement |
| for PREFIX, CONTAINS and SUFFIX searches. |
| |
| ``` |
| cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi |
| ... WHERE first_name = 'Pavel'; |
| |
| first_name | last_name | age | height | created_at |
| -------------+-----------+-----+--------+--------------- |
| Pavel | Yaskevich | 27 | 181 | 1442959315018 |
| |
| (1 rows) |
| ``` |
| |
| ``` |
| cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi |
| ... WHERE first_name = 'pavel'; |
| |
| first_name | last_name | age | height | created_at |
| -------------+-----------+-----+--------+--------------- |
| Pavel | Yaskevich | 27 | 181 | 1442959315018 |
| |
| (1 rows) |
| ``` |
| |
| ``` |
| cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi |
| ... WHERE first_name LIKE 'M%'; |
| |
| first_name | last_name | age | height | created_at |
| ------------+-----------+-----+--------+--------------- |
| Michael | Kjellman | 26 | 180 | 1442959315021 |
| Mikhail | Stepura | 36 | 173 | 1442959315020 |
| |
| (2 rows) |
| ``` |
| |
| Of course, the case of the query does not matter for the `first_name` |
| column because of the options provided at index creation time. |
| |
| ``` |
| cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi |
| ... WHERE first_name LIKE 'm%'; |
| |
| first_name | last_name | age | height | created_at |
| ------------+-----------+-----+--------+--------------- |
| Michael | Kjellman | 26 | 180 | 1442959315021 |
| Mikhail | Stepura | 36 | 173 | 1442959315020 |
| |
| (2 rows) |
| ``` |
| |
| #### Compound Queries |
| |
| SASI supports queries with multiple predicates, however, due to the |
| nature of the default indexing implementation, CQL requires the user |
| to specify `ALLOW FILTERING` to opt in to the potential performance |
| pitfalls of such a query. With SASI, while the requirement to include |
| `ALLOW FILTERING` remains, to reduce modifications to the grammar, the |
| performance pitfalls do not exist because filtering is not |
| performed. Details on how SASI joins data from multiple predicates is |
| available below in the |
| [Implementation Details](#implementation-details) |
| section. |
| |
| ``` |
| cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi |
| ... WHERE first_name LIKE 'M%' and age < 30 ALLOW FILTERING; |
| |
| first_name | last_name | age | height | created_at |
| ------------+-----------+-----+--------+--------------- |
| Michael | Kjellman | 26 | 180 | 1442959315021 |
| |
| (1 rows) |
| ``` |
| |
| #### Suffix Queries |
| |
| The next example demonstrates `CONTAINS` mode on the `last_name` |
| column. By using this mode, predicates can search for any strings |
| containing the search string as a sub-string. In this case the strings |
| containing "a" or "an". |
| |
| ``` |
| cqlsh:demo> SELECT * FROM sasi WHERE last_name LIKE '%a%'; |
| |
| id | age | created_at | first_name | height | last_name |
| --------------------------------------+-----+---------------+------------+--------+--------------- |
| f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman |
| 96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | 1442959315020 | Mikhail | 173 | Stepura |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | 1442959315018 | Pavel | 181 | Yaskevich |
| 8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | 1442959315024 | Vijay | 183 | Parthasarathy |
| 2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang |
| |
| (5 rows) |
| |
| cqlsh:demo> SELECT * FROM sasi WHERE last_name LIKE '%an%'; |
| |
| id | age | created_at | first_name | height | last_name |
| --------------------------------------+-----+---------------+------------+--------+----------- |
| f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman |
| 2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang |
| |
| (2 rows) |
| ``` |
| |
| #### Expressions on Non-Indexed Columns |
| |
| SASI also supports filtering on non-indexed columns like `height`. The |
| expression can only narrow down an existing query using `AND`. |
| |
| ``` |
| cqlsh:demo> SELECT * FROM sasi WHERE last_name LIKE '%a%' AND height >= 175 ALLOW FILTERING; |
| |
| id | age | created_at | first_name | height | last_name |
| --------------------------------------+-----+---------------+------------+--------+--------------- |
| f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | 1442959315018 | Pavel | 181 | Yaskevich |
| 8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | 1442959315024 | Vijay | 183 | Parthasarathy |
| 2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang |
| |
| (4 rows) |
| ``` |
| |
| #### Delimiter based Tokenization Analysis |
| |
| A simple text analysis provided is delimiter based tokenization. This provides an alternative to indexing collections, |
| as delimiter separated text can be indexed without the overhead of `CONTAINS` mode nor using `PREFIX` or `SUFFIX` queries. |
| |
| ``` |
| cqlsh:demo> ALTER TABLE sasi ADD aliases text; |
| cqlsh:demo> CREATE CUSTOM INDEX on sasi (aliases) USING 'org.apache.cassandra.index.sasi.SASIIndex' |
| ... WITH OPTIONS = { |
| ... 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer', |
| ... 'delimiter': ',', |
| ... 'mode': 'prefix', |
| ... 'analyzed': 'true'}; |
| cqlsh:demo> UPDATE sasi SET aliases = 'Mike,Mick,Mikey,Mickey' WHERE id = f5dfcabe-de96-4148-9b80-a1c41ed276b4; |
| cqlsh:demo> SELECT * FROM sasi WHERE aliases LIKE 'Mikey' ALLOW FILTERING; |
| |
| id | age | aliases | created_at | first_name | height | last_name |
| --------------------------------------+-----+------------------------+---------------+------------+--------+----------- |
| f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | Mike,Mick,Mikey,Mickey | 1442959315021 | Michael | 180 | Kjellman |
| ``` |
| |
| #### Text Analysis (Tokenization and Stemming) |
| |
| Lastly, to demonstrate text analysis an additional column is needed on |
| the table. Its definition, index, and statements to update rows are shown below. |
| |
| ``` |
| cqlsh:demo> ALTER TABLE sasi ADD bio text; |
| cqlsh:demo> CREATE CUSTOM INDEX ON sasi (bio) USING 'org.apache.cassandra.index.sasi.SASIIndex' |
| ... WITH OPTIONS = { |
| ... 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer', |
| ... 'tokenization_enable_stemming': 'true', |
| ... 'analyzed': 'true', |
| ... 'tokenization_normalize_lowercase': 'true', |
| ... 'tokenization_locale': 'en' |
| ... }; |
| cqlsh:demo> UPDATE sasi SET bio = 'Software Engineer, who likes distributed systems, doesnt like to argue.' WHERE id = 5770382a-c56f-4f3f-b755-450e24d55217; |
| cqlsh:demo> UPDATE sasi SET bio = 'Software Engineer, works on the freight distribution at nights and likes arguing' WHERE id = 556ebd54-cbe5-4b75-9aae-bf2a31a24500; |
| cqlsh:demo> SELECT * FROM sasi; |
| |
| id | age | bio | created_at | first_name | height | last_name |
| --------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+--------------- |
| f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | null | 1442959315021 | Michael | 180 | Kjellman |
| 96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | null | 1442959315020 | Mikhail | 173 | Stepura |
| 6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | null | 1442959315023 | Jason | 182 | Brown |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich |
| 8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | null | 1442959315024 | Vijay | 183 | Parthasarathy |
| 5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West |
| 2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | null | 1442959315022 | Johnny | 175 | Zhang |
| |
| (7 rows) |
| ``` |
| |
| Index terms and query search strings are stemmed for the `bio` column |
| because it was configured to use the |
| [`StandardAnalyzer`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzer.java) |
| and `analyzed` is set to `true`. The |
| `tokenization_normalize_lowercase` is similar to the `case_sensitive` |
| property but for the |
| [`StandardAnalyzer`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzer.java). These |
| query demonstrates the stemming applied by [`StandardAnalyzer`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/analyzer/StandardAnalyzer.java). |
| |
| ``` |
| cqlsh:demo> SELECT * FROM sasi WHERE bio LIKE 'distributing'; |
| |
| id | age | bio | created_at | first_name | height | last_name |
| --------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+----------- |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich |
| 5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West |
| |
| (2 rows) |
| |
| cqlsh:demo> SELECT * FROM sasi WHERE bio LIKE 'they argued'; |
| |
| id | age | bio | created_at | first_name | height | last_name |
| --------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+----------- |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich |
| 5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West |
| |
| (2 rows) |
| |
| cqlsh:demo> SELECT * FROM sasi WHERE bio LIKE 'working at the company'; |
| |
| id | age | bio | created_at | first_name | height | last_name |
| --------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+----------- |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich |
| |
| (1 rows) |
| |
| cqlsh:demo> SELECT * FROM sasi WHERE bio LIKE 'soft eng'; |
| |
| id | age | bio | created_at | first_name | height | last_name |
| --------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+----------- |
| 556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich |
| 5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West |
| |
| (2 rows) |
| ``` |
| |
| ## Implementation Details |
| |
| While SASI, at the surface, is simply an implementation of the |
| `Index` interface, at its core there are several data |
| structures and algorithms used to satisfy it. These are described |
| here. Additionally, the changes internal to Cassandra to support SASI's |
| integration are described. |
| |
| The `Index` interface divides responsibility of the |
| implementer into two parts: Indexing and Querying. Further, Cassandra |
| makes it possible to divide those responsibilities into the memory and |
| disk components. SASI takes advantage of Cassandra's write-once, |
| immutable, ordered data model to build indexes along with the flushing |
| of the memtable to disk -- this is the origin of the name "SSTable |
| Attached Secondary Index". |
| |
| The SASI index data structures are built in memory as the SSTable is |
| being written, and they are flushed to disk before the writing of the |
| SSTable completes. The writing of each index file only requires |
| sequential writes to disk. In some cases, partial flushes are |
| performed, and later stitched back together, to reduce memory |
| usage. These data structures are optimized for this use case. |
| |
| Taking advantage of Cassandra's ordered data model, at query time, |
| candidate indexes are narrowed down for searching, minimizing the amount |
| of work done. Searching is then performed using an efficient method |
| that streams data off disk as needed. |
| |
| ### Indexing |
| |
| Per SSTable, SASI writes an index file for each indexed column. The |
| data for these files is built in memory using the |
| [`OnDiskIndexBuilder`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndexBuilder.java). Once |
| flushed to disk, the data is read using the |
| [`OnDiskIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndex.java) |
| class. These are composed of bytes representing indexed terms, |
| organized for efficient writing or searching respectively. The keys |
| and values they hold represent tokens and positions in an SSTable and |
| these are stored per-indexed term in |
| [`TokenTreeBuilder`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTreeBuilder.java)s |
| for writing, and |
| [`TokenTree`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java)s |
| for querying. These index files are memory mapped after being written |
| to disk, for quicker access. For indexing data in the memtable, SASI |
| uses its |
| [`IndexMemtable`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/IndexMemtable.java) |
| class. |
| |
| #### OnDiskIndex(Builder) |
| |
| Each |
| [`OnDiskIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndex.java) |
| is an instance of a modified |
| [Suffix Array](https://en.wikipedia.org/wiki/Suffix_array) data |
| structure. The |
| [`OnDiskIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndex.java) |
| is comprised of page-size blocks of sorted terms and pointers to the |
| terms' associated data, as well as the data itself, stored also in one |
| or more page-sized blocks. The |
| [`OnDiskIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndex.java) |
| is structured as a tree of arrays, where each level describes the |
| terms in the level below, the final level being the terms |
| themselves. The `PointerLevel`s and their `PointerBlock`s contain |
| terms and pointers to other blocks that *end* with those terms. The |
| `DataLevel`, the final level, and its `DataBlock`s contain terms and |
| point to the data itself, contained in [`TokenTree`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java)s. |
| |
| The terms written to the |
| [`OnDiskIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndex.java) |
| vary depending on its "mode": either `PREFIX`, `CONTAINS`, or |
| `SPARSE`. In the `PREFIX` and `SPARSE` cases, terms' exact values are |
| written exactly once per `OnDiskIndex`. For example, when using a `PREFIX` index |
| with terms `Jason`, `Jordan`, `Pavel`, all three will be included in |
| the index. A `CONTAINS` index writes additional terms for each suffix of |
| each term recursively. Continuing with the example, a `CONTAINS` index |
| storing the previous terms would also store `ason`, `ordan`, `avel`, |
| `son`, `rdan`, `vel`, etc. This allows for queries on the suffix of |
| strings. The `SPARSE` mode differs from `PREFIX` in that for every 64 |
| blocks of terms a |
| [`TokenTree`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java) |
| is built merging all the |
| [`TokenTree`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java)s |
| for each term into a single one. This copy of the data is used for |
| efficient iteration of large ranges of e.g. timestamps. The index |
| "mode" is configurable per column at index creation time. |
| |
| #### TokenTree(Builder) |
| |
| The |
| [`TokenTree`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java) |
| is an implementation of the well-known |
| [B+-tree](https://en.wikipedia.org/wiki/B%2B_tree) that has been |
| modified to optimize for its use-case. In particular, it has been |
| optimized to associate tokens, longs, with a set of positions in an |
| SSTable, also longs. Allowing the set of long values accommodates |
| the possibility of a hash collision in the token, but the data |
| structure is optimized for the unlikely possibility of such a |
| collision. |
| |
| To optimize for its write-once environment the |
| [`TokenTreeBuilder`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTreeBuilder.java) |
| completely loads its interior nodes as the tree is built, and it uses |
| the well-known algorithm optimized for bulk-loading the data |
| structure. |
| |
| [`TokenTree`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/TokenTree.java)s provide the means to iterate over tokens, and file |
| positions, that match a given term, and to skip forward in that |
| iteration, an operation used heavily at query time. |
| |
| #### IndexMemtable |
| |
| The |
| [`IndexMemtable`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/IndexMemtable.java) |
| handles indexing the in-memory data held in the memtable. The |
| [`IndexMemtable`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/IndexMemtable.java) |
| in turn manages either a |
| [`TrieMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/TrieMemIndex.java) |
| or a |
| [`SkipListMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/SkipListMemIndex.java) |
| per-column. The choice of which index type is used is data |
| dependent. The |
| [`TrieMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/TrieMemIndex.java) |
| is used for literal types. `AsciiType` and `UTF8Type` are literal |
| types by default but any column can be configured as a literal type |
| using the `is_literal` option at index creation time. For non-literal |
| types the |
| [`SkipListMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/SkipListMemIndex.java) |
| is used. The |
| [`TrieMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/TrieMemIndex.java) |
| is an implementation that can efficiently support prefix queries on |
| character-like data. The |
| [`SkipListMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/SkipListMemIndex.java), |
| conversely, is better suited for other Cassandra data types like |
| numbers. |
| |
| The |
| [`TrieMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/TrieMemIndex.java) |
| is built using either the `ConcurrentRadixTree` or |
| `ConcurrentSuffixTree` from the `com.goooglecode.concurrenttrees` |
| package. The choice between the two is made based on the indexing |
| mode, `PREFIX` or other modes, and `CONTAINS` mode, respectively. |
| |
| The |
| [`SkipListMemIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/memory/SkipListMemIndex.java) |
| is built on top of `java.util.concurrent.ConcurrentSkipListSet`. |
| |
| ### Querying |
| |
| Responsible for converting the internal `IndexExpression` |
| representation into SASI's |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) |
| and |
| [`Expression`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Expression.java) |
| trees, optimizing the trees to reduce the amount of work done, and |
| driving the query itself, the |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| is the work horse of SASI's querying implementation. To efficiently |
| perform union and intersection operations, SASI provides several |
| iterators similar to Cassandra's `MergeIterator`, but tailored |
| specifically for SASI's use while including more features. The |
| [`RangeUnionIterator`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeUnionIterator.java), |
| like its name suggests, performs set unions over sets of tokens/keys |
| matching the query, only reading as much data as it needs from each |
| set to satisfy the query. The |
| [`RangeIntersectionIterator`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java), |
| similar to its counterpart, performs set intersections over its data. |
| |
| #### QueryPlan |
| |
| The |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| instantiated per search query is at the core of SASI's querying |
| implementation. Its work can be divided in two stages: analysis and |
| execution. |
| |
| During the analysis phase, |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| converts from Cassandra's internal representation of |
| `IndexExpression`s, which has also been modified to support encoding |
| queries that contain ORs and groupings of expressions using |
| parentheses (see the |
| [Cassandra Internal Changes](#cassandra-internal-changes) |
| section below for more details). This process produces a tree of |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java)s, which in turn may contain [`Expression`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Expression.java)s, all of which |
| provide an alternative, more efficient, representation of the query. |
| |
| During execution, the |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| uses the `DecoratedKey`-generating iterator created from the |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) tree. These keys are read from disk and a final check to |
| ensure they satisfy the query is made, once again using the |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) tree. At the point the desired amount of matching data has |
| been found, or there is no more matching data, the result set is |
| returned to the coordinator through the existing internal components. |
| |
| The number of queries (total/failed/timed-out), and their latencies, |
| are maintained per-table/column family. |
| |
| SASI also supports concurrently iterating terms for the same index |
| across SSTables. The concurrency factor is controlled by the |
| `cassandra.search_concurrency_factor` system property. The default is |
| `1`. |
| |
| ##### QueryController |
| |
| Each |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| references a |
| [`QueryController`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryController.java) |
| used throughout the execution phase. The |
| [`QueryController`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryController.java) |
| has two responsibilities: to manage and ensure the proper cleanup of |
| resources (indexes), and to strictly enforce the time bound per query, |
| specified by the user via the range slice timeout. All indexes are |
| accessed via the |
| [`QueryController`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryController.java) |
| so that they can be safely released by it later. The |
| [`QueryController`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryController.java)'s |
| `checkpoint` function is called in specific places in the execution |
| path to ensure the time-bound is enforced. |
| |
| ##### QueryPlan Optimizations |
| |
| While in the analysis phase, the |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| performs several potential optimizations to the query. The goal of |
| these optimizations is to reduce the amount of work performed during |
| the execution phase. |
| |
| The simplest optimization performed is compacting multiple expressions |
| joined by logical intersections (`AND`) into a single [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) with |
| three or more [`Expression`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Expression.java)s. For example, the query `WHERE age < 100 AND |
| fname = 'p*' AND first_name != 'pa*' AND age > 21` would, |
| without modification, have the following tree: |
| |
| ┌───────┐ |
| ┌────────│ AND │──────┐ |
| │ └───────┘ │ |
| ▼ ▼ |
| ┌───────┐ ┌──────────┐ |
| ┌─────│ AND │─────┐ │age < 100 │ |
| │ └───────┘ │ └──────────┘ |
| ▼ ▼ |
| ┌──────────┐ ┌───────┐ |
| │ fname=p* │ ┌─│ AND │───┐ |
| └──────────┘ │ └───────┘ │ |
| ▼ ▼ |
| ┌──────────┐ ┌──────────┐ |
| │fname!=pa*│ │ age > 21 │ |
| └──────────┘ └──────────┘ |
| |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| will remove the redundant right branch whose root is the final `AND` |
| and has leaves `fname != pa*` and `age > 21`. These [`Expression`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Expression.java)s will |
| be compacted into the parent `AND`, a safe operation due to `AND` |
| being associative and commutative. The resulting tree looks like the |
| following: |
| |
| ┌───────┐ |
| ┌────────│ AND │──────┐ |
| │ └───────┘ │ |
| ▼ ▼ |
| ┌───────┐ ┌──────────┐ |
| ┌───────────│ AND │────────┐ │age < 100 │ |
| │ └───────┘ │ └──────────┘ |
| ▼ │ ▼ |
| ┌──────────┐ │ ┌──────────┐ |
| │ fname=p* │ ▼ │ age > 21 │ |
| └──────────┘ ┌──────────┐ └──────────┘ |
| │fname!=pa*│ |
| └──────────┘ |
| |
| When excluding results from the result set, using `!=`, the |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| determines the best method for handling it. For range queries, for |
| example, it may be optimal to divide the range into multiple parts |
| with a hole for the exclusion. For string queries, such as this one, |
| it is more optimal, however, to simply note which data to skip, or |
| exclude, while scanning the index. Following this optimization the |
| tree looks like this: |
| |
| ┌───────┐ |
| ┌────────│ AND │──────┐ |
| │ └───────┘ │ |
| ▼ ▼ |
| ┌───────┐ ┌──────────┐ |
| ┌───────│ AND │────────┐ │age < 100 │ |
| │ └───────┘ │ └──────────┘ |
| ▼ ▼ |
| ┌──────────────────┐ ┌──────────┐ |
| │ fname=p* │ │ age > 21 │ |
| │ exclusions=[pa*] │ └──────────┘ |
| └──────────────────┘ |
| |
| The last type of optimization applied, for this query, is to merge |
| range expressions across branches of the tree -- without modifying the |
| meaning of the query, of course. In this case, because the query |
| contains all `AND`s the `age` expressions can be collapsed. Along with |
| this optimization, the initial collapsing of unneeded `AND`s can also |
| be applied once more to result in this final tree using to execute the |
| query: |
| |
| ┌───────┐ |
| ┌──────│ AND │───────┐ |
| │ └───────┘ │ |
| ▼ ▼ |
| ┌──────────────────┐ ┌────────────────┐ |
| │ fname=p* │ │ 21 < age < 100 │ |
| │ exclusions=[pa*] │ └────────────────┘ |
| └──────────────────┘ |
| |
| #### Operations and Expressions |
| |
| As discussed, the |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| optimizes a tree represented by |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java)s |
| as interior nodes, and |
| [`Expression`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Expression.java)s |
| as leaves. The |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) |
| class, more specifically, can have zero, one, or two |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java)s |
| as children and an unlimited number of expressions. The iterators used |
| to perform the queries, discussed below in the |
| "Range(Union|Intersection)Iterator" section, implement the necessary |
| logic to merge results transparently regardless of the |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java)s |
| children. |
| |
| Besides participating in the optimizations performed by the |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java), |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) |
| is also responsible for taking a row that has been returned by the |
| query and performing a final validation that it in fact does match. This |
| `satisfiesBy` operation is performed recursively from the root of the |
| [`Operation`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java) |
| tree for a given query. These checks are performed directly on the |
| data in a given row. For more details on how `satisfiesBy` works, see |
| the documentation |
| [in the code](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java#L87-L123). |
| |
| #### Range(Union|Intersection)Iterator |
| |
| The abstract `RangeIterator` class provides a unified interface over |
| the two main operations performed by SASI at various layers in the |
| execution path: set intersection and union. These operations are |
| performed in an iterated, or "streaming", fashion to prevent unneeded |
| reads of elements from either set. In both the intersection and union |
| cases the algorithms take advantage of the data being pre-sorted using |
| the same sort order, e.g. term or token order. |
| |
| The |
| [`RangeUnionIterator`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeUnionIterator.java) |
| performs the "Merge-Join" portion of the |
| [Sort-Merge-Join](https://en.wikipedia.org/wiki/Sort-merge_join) |
| algorithm, with the properties of an outer-join, or union. It is |
| implemented with several optimizations to improve its performance over |
| a large number of iterators -- sets to union. Specifically, the |
| iterator exploits the likely case of the data having many subgroups |
| of overlapping ranges and the unlikely case that all ranges will |
| overlap each other. For more details see the |
| [javadoc](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeUnionIterator.java#L9-L21). |
| |
| The |
| [`RangeIntersectionIterator`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java) |
| itself is not a subclass of `RangeIterator`. It is a container for |
| several classes, one of which, `AbstractIntersectionIterator`, |
| sub-classes `RangeIterator`. SASI supports two methods of performing |
| the intersection operation, and the ability to be adaptive in choosing |
| between them based on some properties of the data. |
| |
| `BounceIntersectionIterator`, and the `BOUNCE` strategy, works like |
| the |
| [`RangeUnionIterator`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeUnionIterator.java) |
| in that it performs a "Merge-Join", however, its nature is similar to |
| an inner-join, where like values are merged by a data-specific merge |
| function (e.g. merging two tokens in a list to lookup in a SSTable |
| later). See the |
| [javadoc](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java#L88-L101) |
| for more details on its implementation. |
| |
| `LookupIntersectionIterator`, and the `LOOKUP` strategy, performs a |
| different operation, more similar to a lookup in an associative data |
| structure, or "hash lookup" in database terminology. Once again, |
| details on the implementation can be found in the |
| [javadoc](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java#L199-L208). |
| |
| The choice between the two iterators, or the `ADAPTIVE` strategy, is |
| based upon the ratio of data set sizes of the minimum and maximum |
| range of the sets being intersected. If the number of the elements in |
| minimum range divided by the number of elements is the maximum range |
| is less than or equal to `0.01`, then the `ADAPTIVE` strategy chooses |
| the `LookupIntersectionIterator`, otherwise the |
| `BounceIntersectionIterator` is chosen. |
| |
| ### The SASIIndex Class |
| |
| The above components are glued together by the |
| [`SASIIndex`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/SASIIndex.java) |
| class which implements `Index`, and is instantiated |
| per-table containing SASI indexes. It manages all indexes for a table |
| via the |
| [`sasi.conf.DataTracker`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/conf/DataTracker.java) |
| and |
| [`sasi.conf.view.View`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/conf/view/View.java) |
| components, controls writing of all indexes for an SSTable via its |
| [`PerSSTableIndexWriter`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/PerSSTableIndexWriter.java), and initiates searches with |
| `Searcher`. These classes glue the previously |
| mentioned indexing components together with Cassandra's SSTable |
| life-cycle ensuring indexes are not only written when Memtable's flush, |
| but also as SSTable's are compacted. For querying, the |
| `Searcher` does little but defer to |
| [`QueryPlan`](https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java) |
| and update e.g. latency metrics exposed by SASI. |
| |
| ### Cassandra Internal Changes |
| |
| To support the above changes and integrate them into Cassandra a few |
| minor internal changes were made to Cassandra itself. These are |
| described here. |
| |
| #### SSTable Write Life-cycle Notifications |
| |
| The `SSTableFlushObserver` is an observer pattern-like interface, |
| whose sub-classes can register to be notified about events in the |
| life-cycle of writing out a SSTable. Sub-classes can be notified when a |
| flush begins and ends, as well as when each next row is about to be |
| written, and each next column. SASI's `PerSSTableIndexWriter`, |
| discussed above, is the only current subclass. |
| |
| ### Limitations and Caveats |
| |
| The following are items that can be addressed in future updates but are not |
| available in this repository or are not currently implemented. |
| |
| * The cluster must be configured to use a partitioner that produces |
| `LongToken`s, e.g. `Murmur3Partitioner`. Other existing partitioners which |
| don't produce LongToken e.g. `ByteOrderedPartitioner` and `RandomPartitioner` |
| will not work with SASI. |
| * Not Equals and OR support have been removed in this release while |
| changes are made to Cassandra itself to support them. |
| |
| ### Contributors |
| |
| * [Pavel Yaskevich](https://github.com/xedin) |
| * [Jordan West](https://github.com/jrwest) |
| * [Michael Kjellman](https://github.com/mkjellman) |
| * [Jason Brown](https://github.com/jasobrown) |
| * [Mikhail Stepura](https://github.com/mishail) |