= 22. Cassandra Message inconsistencies
Date: 2020-02-27
== Status
Accepted (lazy consensus)
== Context
Messages are denormalized in Cassandra in order to:
* access them by their unique identifier (messageId), for example through the JMAP protocol
* access them by their mailbox identifier and Unique IDentifier within that mailbox (mailboxId + uid), for example through the IMAP protocol
Here is the table organisation:
* `messageIdTable` Holds mailbox and flags for each message, lookup by mailbox ID + UID
* `imapUidTable` Holds mailbox and flags for each message, lookup by message ID
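As an illustration, here is a minimal sketch of the two lookup patterns, assuming a simplified schema and the DataStax Java driver. The keyspace, column names and placeholder identifiers are assumptions; the actual James tables hold more metadata.

[source,java]
----
import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

// Sketch of the two denormalized lookup patterns. Column names are
// simplified assumptions; the actual James schema differs.
public class MessageLookups {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("apache_james")) {
            UUID messageId = UUID.randomUUID(); // placeholder identifiers
            UUID mailboxId = UUID.randomUUID();
            long uid = 42L;

            // JMAP-style access: lookup by messageId in imapUidTable
            ResultSet byMessageId = session.execute(
                "SELECT mailboxId, uid, flags FROM imapUidTable WHERE messageId = ?",
                messageId);

            // IMAP-style access: lookup by mailboxId + uid in messageIdTable
            ResultSet byMailboxAndUid = session.execute(
                "SELECT messageId, flags FROM messageIdTable WHERE mailboxId = ? AND uid = ?",
                mailboxId, uid);
        }
    }
}
----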
Failures during the denormalization process will lead to inconsistencies between the two tables.
This can lead to the following user experience:
----
BOB receives a message
The denormalization process fails
BOB can read the message via JMAP
BOB cannot read the message via IMAP
BOB marks a message as SEEN
The denormalization process fails
The message is SEEN in JMAP
The message is UNSEEN in IMAP
----
=== Current operations
* Adding a message:
** (CassandraMessageMapper) First reference the message in `messageIdTable` then in `imapUidTable`.
** (CassandraMessageIdMapper) First reference the message in `imapUidTable` then in `messageIdTable`.
* Deleting a message:
** (CassandraMessageMapper) First delete the message in `imapUidTable` then in `messageIdTable`.
** (CassandraMessageIdMapper) Read the message metadata from `imapUidTable`, then delete the message first in `imapUidTable`, then in `messageIdTable`.
* Copying a message:
** (CassandraMessageMapper) Read the message, then reference it first in `messageIdTable`, then in `imapUidTable`.
* Moving a message:
** (CassandraMessageMapper) Logically copy then delete.
A failure in the chain might lead to duplicated messages (present in both the source and destination mailboxes) as well as diverging IMAP/JMAP views (see the sketch after this list).
** (CassandraMessageIdMapper) First reference the message in `imapUidTable` then in `messageIdTable`.
* Updating a message flags:
** (CassandraMessageMapper) First update conditionally the message in `imapUidTable` then in `messageIdTable`.
** (CassandraMessageIdMapper) First update conditionally the message in `imapUidTable` then in `messageIdTable`.
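Whatever the order, the failure window is the gap between the two table writes. Below is a minimal sketch of such a two-step write using Reactor; `ImapUidDAO`, `MessageIdDAO` and `MessageMetadata` are illustrative stand-ins, not the actual James API.

[source,java]
----
import reactor.core.publisher.Mono;

// Sketch of the non-atomic two-step denormalized write. A crash or
// Cassandra error between the two inserts leaves one table updated and
// the other not, producing the IMAP/JMAP divergence described above.
// ImapUidDAO and MessageIdDAO are hypothetical stand-ins.
public class AddMessageSketch {
    interface ImapUidDAO { Mono<Void> insert(MessageMetadata message); }
    interface MessageIdDAO { Mono<Void> insert(MessageMetadata message); }
    static class MessageMetadata { /* messageId, mailboxId, uid, flags */ }

    private final ImapUidDAO imapUidDAO;
    private final MessageIdDAO messageIdDAO;

    AddMessageSketch(ImapUidDAO imapUidDAO, MessageIdDAO messageIdDAO) {
        this.imapUidDAO = imapUidDAO;
        this.messageIdDAO = messageIdDAO;
    }

    Mono<Void> add(MessageMetadata message) {
        // First write (here imapUidTable, the order the Decision below
        // adopts; CassandraMessageMapper currently writes messageIdTable
        // first for add and copy)...
        return imapUidDAO.insert(message)
            // ...then the projection. Failing here is what creates the
            // inconsistency between the two views.
            .then(messageIdDAO.insert(message));
    }
}
----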
== Decision
Adopt `imapUidTable` as the source of truth: `messageId` allows tracking changes to messages across mailboxes upon copies and moves.
Furthermore, it is the table on which conditional flag updates are performed.
All writes will be performed on `imapUidTable` first, then applied to `messageIdTable` if successful.
We thus need to modify the CassandraMessageMapper `add` and `copy` operations to first write to the source of truth (`imapUidTable`).
We can adopt a retry policy on the `messageIdTable` projection update as a mitigation strategy.
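A minimal sketch of such a retry policy, assuming a recent Reactor version; the DAO and the retry parameters are illustrative assumptions:

[source,java]
----
import java.time.Duration;

import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

// Sketch of retrying the messageIdTable projection update a few times
// with backoff before leaving the repair to the reconciliation task.
public class ProjectionUpdateSketch {
    interface MessageIdDAO { Mono<Void> insert(MessageMetadata message); }
    static class MessageMetadata { }

    Mono<Void> updateProjection(MessageIdDAO messageIdDAO, MessageMetadata message) {
        return messageIdDAO.insert(message)
            // 3 attempts and a 100ms initial backoff are arbitrary values.
            .retryWhen(Retry.backoff(3, Duration.ofMillis(100)));
    }
}
----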
Using `imapUidTable` as the source of truth, we can rebuild the `messageIdTable` projection (see the sketch after this list):
* Iterating `imapUidTable` entries, we can rewrite entries in `messageIdTable`
* Iterating `messageIdTable` we can remove entries not referenced in `imapUidTable`
* Adding a delay and a re-check before the actual fix can decrease the occurrence of concurrency issues
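A minimal sketch of that reconciliation pass, under the assumption of hypothetical DAO read/write methods (not the actual James task implementation):

[source,java]
----
import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Sketch of rebuilding the messageIdTable projection from imapUidTable.
// Entry, ImapUidDAO and MessageIdDAO are illustrative assumptions.
public class SolveMessageInconsistenciesSketch {
    interface Entry { String messageId(); }

    interface ImapUidDAO {
        Flux<Entry> readAll();
        Mono<Entry> read(String messageId); // empty if absent
    }

    interface MessageIdDAO {
        Flux<Entry> readAll();
        Mono<Void> insert(Entry entry);
        Mono<Void> delete(Entry entry);
    }

    private final ImapUidDAO imapUidDAO;
    private final MessageIdDAO messageIdDAO;

    SolveMessageInconsistenciesSketch(ImapUidDAO imapUidDAO, MessageIdDAO messageIdDAO) {
        this.imapUidDAO = imapUidDAO;
        this.messageIdDAO = messageIdDAO;
    }

    Mono<Void> run() {
        // 1. Rewrite every source-of-truth entry into the projection.
        Mono<Void> fixMissing = imapUidDAO.readAll()
            .concatMap(messageIdDAO::insert)
            .then();

        // 2. Remove projection entries no longer referenced in the source
        //    of truth. The delay and re-check before the actual delete
        //    reduces the chance of racing a concurrent legitimate write.
        Mono<Void> removeOrphans = messageIdDAO.readAll()
            .filterWhen(entry -> imapUidDAO.read(entry.messageId())
                .hasElement().map(present -> !present))
            .delayElements(Duration.ofSeconds(1))
            .filterWhen(entry -> imapUidDAO.read(entry.messageId())
                .hasElement().map(present -> !present))
            .concatMap(messageIdDAO::delete)
            .then();

        return fixMissing.then(removeOrphans);
    }
}
----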
We will expose a webAdmin task for doing this.
== Consequences
User actions concurrent to the inconsistency fixing task could result in concurrency issues.
New inconsistencies could be created.
However, the table of truth would not be impacted; hence rerunning the inconsistency fixing task will eventually fix all issues.
This task can safely be run online and can be scheduled by an admin on a recurring basis, outside of peak traffic, to ensure Cassandra message consistency.
== References
* https://github.com/linagora/james-project/pull/3125[Plan for fixing Cassandra ACL inconsistencies]
* https://www.mail-archive.com/server-dev@james.apache.org/msg64432.html[General mailing list discussion about inconsistencies]
* https://github.com/linagora/james-project/pull/3130[Pull Request: JAMES-3058 Concurrency testing for fixing Cassandra mailbox inconsistencies]
The delay and re-check strategy used to decrease the occurrence of concurrency issues is described in this pull request.