Storage-Attached Indexing

Overview

Storage-attached indexing is a column based local secondary index implementation for Cassandra.

The project was inspired by SASI (SSTable-Attached Secondary Indexes) and retains some of its high-level architectural character (and even some actual code), but makes significant improvements in a number of areas:

The on-disk/SSTable index formats for both string and numeric data have been completely replaced. Strings are indexed on disk using a byte-ordered trie data structure, while numeric types are indexed using a block-oriented balanced tree.
While indexes continue to be managed at the column level from the user's perspective, the storage design at the column index level is row-based, with related offset and token information stored only once at the SSTable level. This drastically reduces our on-disk footprint when several columns are indexed on the same table.
Tracing, metrics, virtual table-based metadata and snapshot-based backup/restore are supported out of the box.
On-disk index components can be streamed completely when entire SSTable streaming is enabled.
Incremental index building is supported, and on-disk index components are included in snapshots.

Many similarities with standard secondary indexes remain:

The full set of C* consistency levels is supported for both reads and writes.
Index updates are synchronous with mutations and do not require any kind of read-before-write.
Global queries are implemented on the back of C* range reads.
Paging is supported.
Only token ordering of results is supported.
Index builds are visible to operators as compactions and are executed on compaction threads.
All DML and DDL statements are CQL-based.
Single-node management operations are available via nodetool. (ex. stop & rebuild_index)

Quick Start

The following short tutorial will get you up-and-running with storage-attached indexing.

Build and Start Cassandra

Follow the instructions to build and start Cassandra in README.asc in root folder of the Cassandra repository

Create a Simple Data Model

1.) Run the following DDL statements to create a table and two indexes:

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};

USE test;

CREATE TABLE person (id int, name text, age int, PRIMARY KEY (id));

CREATE INDEX ON person (name) USING 'sai' WITH OPTIONS = {'case_sensitive': false};

CREATE INDEX ON person (age) USING 'sai';

2.) Add some data.

INSERT INTO person (id, name, age) VALUES (1, 'John', 21);

INSERT INTO person (id, name, age) VALUES (2, 'john', 50);

INSERT INTO person (id, name, age) VALUES (3, 'Boris', 43);

INSERT INTO person (id, name, age) VALUES (4, 'Caleb', 34);

Make Some Queries

1.) Query for everyone named “John”, ignoring case.

SELECT * FROM person WHERE name = 'John';

 id | age | name
----+-----+------
  1 |  21 | John
  2 |  50 | john

2.) Query for everyone between the ages of 18 and 25.

SELECT * FROM person WHERE age >= 18 AND age <= 35;

 id | age | name
----+-----+-------
  1 |  21 |  John
  4 |  34 | Caleb

Contributors

Marc Selwan
Caleb Rackliffe
Zhao Yang
Jason Rutherglen
Maciej Zasada
Andres de la Peña
Mike Adamson
Zahir Patni
Tomek Lasica
Berenguer Blasi
Rocco Varela
Piotr Kołaczkowski