blob: 7e8b056458d16e251b46f5a40b4d1eb71e4734c5 [file] [log] [blame]
# Advice for JMAP implementors
This document describes a recommended set of data structures and algorithms for efficiently implementing JMAP. It is intended to serve as suggestions only; there may well be better ways to do it. The spec is the authoritative guide on what constitutes a conformant JMAP implementation.
## Assigning Message Ids
A good way of assigning a message id is to sha1 the RFC2822 message and take the first 80 bits (this is sufficient for avoiding collisions in any reasonable sized instance, but the server may choose any length it likes). If this is already in use (i.e. another copy of the message already exists), sha1 the GUID. Repeat until there is no collision. The bytes can be represented in any reasonable encoding for transmission as a `String` within the API.
## Modification sequences
A modification sequence, or **modseq**, is a 63-bit monotonically incrementing counter. Each user has their own modseq counter (in IMAP it's originally per-mailbox, but per user is backwards compatible with this). Every time a change occurs to data within the user, the modseq is incremented by one and the new value is associated with the changes. This is used in a number of data structures and algorithms below to efficiently calculate changes.
## Data structures
As ever in programming, get your data structures right and the server will practically write itself.
There are three types of data structures suggested in this guide (excluding the email indexing for search, which is more complicated than this guide is prepared to go into):
1. A **Map** (associative array) of id to variable-length data blob:
- Insert should be O(1).
- Lookup should be O(1).
- Delete should be O(1).
- Not ordered, but we need to be able to iterate over all the values.
2. An **append-only log** of variable length blobs:
- Insert is only at tail.
- Lookup needs to be able to binary search (based on modseq) to find start location, then read sequentially.
- Delete is only at head.
3. An **Ordered list** of fixed-sized objects:
- Insert is mainly at or near the head, but can be in a random location.
- Want to be able to read sequentially from head (common case), or read whole structure into memory and sort.
- Delete is mainly near head, but can be in a random location.
### 1. User
This is a simple collection of important top-level values for the user (kept in a **Map**):
- Highest ModSeq for the user (`63 bits` for IMAP compatibility)
- Highest ModSeq of any Mailbox.
- Low watermark ModSeq for Thread Log `63 bits` – whenever the thread log (see below) is truncated, the modseq of the new first item in the log is stored here. Any attempt to call *getThreadUpdates* with a modseq lower than this must be rejected with a `cannotCalculateUpdates` error.
- Low watermark ModSeq for Message Log `63 bits` – the same, but for the message log.
- Quota available in bytes (63 bits. -1 => Don't care)
- Quota used in bytes `63 bits`
- List of other users with mailboxes made available to this user (shared mailboxes) (var length).
### 2. Mailboxes
A map of Mailbox Id to all of the mailbox object values (as defined in the spec, including the counts). For the purposes of calculating `getMailboxUpdates`, also include the following for each mailbox:
- **Mailbox ModSeq** `63 bits` Updated when any of the mailbox properties (as defined in the spec) change not updated when the mailbox message list changes.
- **Mailbox Non-Counts ModSeq** `63 bits` Updated when any of the mailbox properties that are not counts change. When calculating `getMailboxUpdates`, if this is lower than the previous mod seq, but the Mailbox ModSeq is higher, then only counts have changed.
- **Message list UID Next** `32 bits` The next UID (incrementing 32-bit counter) to assign to messages added to the mailbox message list.
- **Message list highest ModSeq** `63 bits` The highest ModSeq of a message in the mailbox message list.
- **Low watermark message list ModSeq** `63 bits` Whenever a message is undeleted via IMAP (that is the \Deleted flag is removed) or a deleted message is fully expunged from the mailbox message list (see message list data structure below), set this to the ModSeq of that message if higher than the previous value. Any attempt to call *getMessageListUpdates* with a modseq lower than this must be rejected with a `cannotCalculateUpdates` error.
The suggested way to assign a Mailbox id is to use an incrementing counter. 16 bits should be sufficient, giving support for a maximum of 65536 mailboxes per user. Old ids are reusable without penalty while keeping with the JMAP semantics. Over the wire this would be represented as a prefix plus an encoding of the number, e.g. "m1", "m2" etc.
### 3. Mailbox message list
One of these should exist for each mailbox. It is an ordered list of fixed size objects. Each object is:
- **UID** `32 bits` As per IMAP semantics, this increments each time you append a message to the mailbox. The next UID to use is stored in the Mailbox object.
- **ModSeq** `63 bits` As per IMAP semantics, this is updated whenever any property of the message changes.
- **Message Id** `80 bits` This is the ID always used within JMAP to refer to the message, and does not change as it moves between mailboxes.
- **Thread Id** `64 bits`
- **Deleted** `64 bits` Time stamp from epoch of when the message was removed the mailbox. `0` means not deleted. See below
- **Message Date** `64 bits`
Given that most message list fetches are of the first section of a single mailbox in date descending order, if we keep it stored in this order on disk we can just read the first few bytes to return the desired information; no sorting required (we don't even have to look up anything in the message cache). The date is included in each record so we can insert a new one in the correct position without having to reference any other data.
When a message is removed from the mailbox we don't remove it immediately from the message list. Instead we just set the Deleted field to the time stamp of when it was deleted. This allows the delta update algorithm to work to efficiently update the client to the new list state. At certain intervals (either based on how many deleted nodes there are or how long since they were deleted) this needs to be cleaned up and the deleted objects fully expunged from the message list. At this point the low watermark ModSeq needs to be updated on the Mailbox data structure.
### 4. Message cache
A map of Message Id to flags, mailboxes and common header information for that message (variable length). This lets the server optimise calls to `getMessages` that only fetch this information (common when fetching the required information for a mailbox listing). It also allows optimisation of `getMessageList` as most, even quite complex filters will only require checking this cache (which should be relatively fast) rather than looking up and parsing the whole message (which could be quite slow).
For each message, store:
- **isUnread** `1 bit` (mutable)
- **isFlagged** `1 bit` (mutable)
- **isAnswered** `1 bit` (mutable)
- **isDraft** `1 bit` (mutable by IMAP)
- Other flags + labels + annotations as needed for IMAP compatibility only. (mutable)
- **Mailboxes** (list of mailbox ids this message belongs to) (mutable)
- **From** header (either in the JSON used by JMAP, or whatever IMAP needs)
- **To** header (either in the JSON used by JMAP, or whatever IMAP needs)
- **Subject** header
- **Date**
- **Preview** (see spec)
- **Attachments** (list of file names only; also used for hasAttachments)
- **GUID** (sha1 of raw email)
### 5. Message index
For searching messages at any reasonable speed, an index is required from textual content within the message to the message id. Describing how this should work is beyond the scope of this guide.
### 6. Messages
A map of GUID (sha1 of the message contents) to the raw RFC2822 message itself. This is the bulk of the data to be stored for each user. As this data is immutable and referenced by its sha1, there are many possibilities for how to store it. Each message could be stored as a separate file using its GUID as the name. Or on a separate networked object store etc. etc.
### 7. Message log
An append only log of changes made to the message cache. Each entry consists of:
- **ModSeq** of change `63 bits`
- List of **Message ids** which were affected by the change (to save extra lookups when the client is fetching these changes, should probably include a bit to say whether the message was destroyed by the change or created/modified).
Periodically, truncate the log by chopping records off the beginning. At this point you need to update the low watermark ModSeq for the log in the User object (data structure 1).
### 8. Threads
A map of thread id to an object containing information about the thread (variable length). The information stored for each thread is:
- The **list of messages** belonging to the thread, sorted in date order. For each - message we need to store:
- **Message Id** `80 bits`
- **Mailboxes** The list of mailboxes the message belongs to. For sorting a message list by thread unread/thread flagged purposes only.
- **isUnread** `1 bit` For sorting a message list by thread unread/thread flagged purposes only.
- **isFlagged** `1 bit` For sorting a message list by thread unread/thread flagged purposes only.
- **ModSeq of last change** to isRead/isFlagged for any message in the thread (for sorting a message list by thread unread/thread flagged purposes only).
- **Sha1 of subject** after stripping Fwd/Re etc. (for checking whether we need to split a new conversation out later; see threading algorithm in data structure 10).
### 9. Thread log
An append only log of changes made to the **membership** of threads (in data structure 7; changes to only the isRead/isFlagged/Mailboxes of a message in a thread do not need to go in the log). Each entry consists of:
- **ModSeq** of change
- **List of thread ids** which were affected by the change (to save extra lookups when the client is fetching these changes, should probably include a bit to say whether the thread was destroyed by the change or created/modified).
As with the message log, periodically truncate the log by chopping records off the beginning; at this point you need to update the low watermark ModSeq for the log in the user object.
### 10. Refs to ThreadId
This is solely used to look up the conversation to assign to a message on creation/import/delivery. Maps the RFC2822 Message Id (eughh, so many different types of id!) to the thread id.
The suggested rule for connecting messages into threads is this:
If two messages share a common RFC2822 Message Id in the set of such ids within the `Message-Id`, `References` and `In-Reply-To` headers of each message, and the messages have the same `Subject` header (after stripping any preceding Re:/Fwd: and trimming white space from either end), then they should belong in the same thread. Otherwise they should belong in different threads.
## Message List Algorithms
The `state` property to return with getter calls to each type is:
- Mailbox: Highest ModSeq of any Mailbox (stored in the User object)
- MessageList: Message list UID Next + Message list highest ModSeq (as stored on the Mailbox object). For message lists that aren't just a mailbox, the state string should be the highest message ModSeq (as found at the tail of the Message Log).
- Thread: The highest modseq found at the end of the Thread Log.
- Message: The highest modseq found at the end of the Message Log.
### getMailboxUpdates
Simply iterate through the set of Mailboxes comparing their current mod seqs to the client mod seq. If higher, the mailbox has changed. If the non-counts is not higher, only counts have changed.
### getMessageList
First you need to get the complete list. In the common case of…
filter == { inMailboxes: [ 'inboxId' ], notInMailboxes: [ 'trashId' ] }
…you are simply fetching the list of messages in a mailbox. This list is already pre-calculated for each mailbox and kept on disk (data structure 3). You can simply slurp this into memory and sort if needed (again, in the common case of sorting date-descending, it will be pre-sorted).
If the filter is more complex, you will need to do more work to get the set of matching messages and sort it. If there is a `String` component to the filter, first use the message index (data structure 5) to get a set of matches. Then iterate through and lookup each match in the message cache (data structure 4) to apply any other components of the filter. Finally sort as specified.
Once you have the complete message list, you can calculate the results to return to the client. Since a client is likely to fetch a different portion of the same message list soon after, it is beneficial if the server can keep the last list requested by the user in a cache for a short time.
let collapseThreads = args.collapseThreads
let position = args.position
let anchor = args.anchor
let anchorOffset = args.anchorOffset
let limit = args.limit
let total = 0
let messageIds = [] # NB Max size of array is limit
let threadIds = [] # NB Max size of array is limit
# If not collapsing threads, we can just jump to the required section
if !collapseThreads {
total = messageList.length
for i = position; i < total; i = i + 1 {
messageIds.push( msg.id )
threadIds.push( msg.threadId )
}
} else {
# Optimisation for the common case
let totalIsKnown = filter is just mailbox
let SeenThread = new Set()
let numFound = 0
foreach msg in sortedFilteredList {
if !SeenThread{ msg.threadId } {
SeenThread.add( msg.threadId )
total += 1
if position >= total && numFound < limit {
messageIds.push( msg.id )
threadIds.push( msg.threadId )
numFound += 1
if numFound == limit && totalIsKnown {
break;
}
}
}
}
if totalIsKnown {
total = getTotal( mailbox, collapseThreads )
}
}
### getMessageListUpdates
With the suggested data structures, we can do delta updates for any standard mailbox message list (the common case), but not for a search. The following algorithm correctly calculates the delta update. We're presuming we already have a `messageList` (directly read in from data structure 3 and sorted if required):
let index = -1
let added = []
let removed = []
let collapseThreads = args.collapseThreads
let uptoHasBeenFound = false
# A mutable sort is one which sorts by a mutable property, e.g.
# sort flagged messages/threads first.
let isMutable = sort.isMutable()
# An exemplar is the first message in each thread in the list, given the
# sort order that
# The old exemplar is the exemplar in the client's old state.
let SeenExemplar = collapseThreads ? new Set() : null
let SeenOldExemplar = collapseThreads ? new Set() : null
foreach msg in messageList {
let isNewExemplar = false
let isOldExemplar = false
let isNew = ( msg.uid >= mailbox.uidNext )
let isChanged = ( msg.modSeq > args.state )
let isDeleted = ( msg.deletedTimeStamp != 0 )
let wasDeleted = ( isDeleted && !isChanged )
# Is this message the current exemplar?
if !isDeleted && ( !collapseThreads || !SeenExemplar{ msg.threadId } ) {
isNewExemplar = true
index += 1
if collapseThreads {
SeenExemplar.set( msg.threadId )
}
}
# Was this message an old exemplar?
# 1. Must not have arrived after the client's state
# 2. Must have been deleted before the client's state
# 3. Must not have already found the old exemplar
if !isNew && !wasDeleted &&
( !collapseThreads || !SeenOldExemplar{ msg.threadId } ) {
isOldExemplar = true
if collapseThreads {
SeenOldExemplar.set( msg.threadId )
}
}
if isOldExemplar && !isNewExemplar {
removed.push({
messageId: msg.messageId,
threadId: msg.threadId
})
}
else if !isOldExemplar && isNewExemplar {
added.push({
index: index,
messageId: msg.messageId,
threadId: msg.threadId
})
}
# Special case for mutable sorts (based on isFlagged/isUnread). If
# collapseThread == true, the sort is based on the *conversation*
# isFlagged/isUnread status.
if isMutable && isOldExemplar && isNewExemplar {
# Has the isUnread/isFlagged status of the message/thread
# (as appropriate) possibly changed since the client's state?
let modSeq = collapseThreads ?
msg.modSeq :
getThread( msg.threadId ).modSeq
# If so, we need to remove the exemplar from the client view and add
# it back in at the correct position.
if modSeq > args.modSeq {
removed.push({
messageId: msg.messageId,
threadId: msg.threadId
})
added.push({
index: index,
messageId: msg.messageId,
threadId: msg.threadId
})
}
}
# If this is the last message the client cares about, we can stop here
# and just return what we've calculated so far. We already know the total
# count for this message list as we keep it pre calculated and cached in
# the Mailbox object.
#
# However, if the sort is mutable we can't break early, as messages may
# have moved from the region we care about to lower down the list.
if !isMutable && msg.uid == args.upto {
break
}
} # End loop
total = getTotal( mailbox, collapseThreads )
To implement this algorithm, the server must keep the metadata at least of deleted (expunged) messages for a little while (say 1-2 weeks). This is also very useful for restoring accidentally deleted messages as well. At cleanup time (when actually removing the messages), the server must keep track of the highest modseq of a message it has removed in that mailbox (the modseq would have been set at the original delete time); any requests for updates from a state before this must be rejected, as the results may be incorrect.
## getThreadUpdates
If the client id is lower than the low watermark ModSeq for the Thread Log (as found in the User object), reject with a `cannotCalculateUpdates` error. Otherwise, binary search the Thread log looking for the client modseq, then read sequentially from there to get the changes.
## getMessageUpdates
If the client id is lower than the low watermark ModSeq for the Message Log (as found in the User object), reject with a `cannotCalculateUpdates` error. Otherwise, binary search the Message log looking for the client modseq, then read sequentially from there to get the changes.