Date: 2020-04-13
Accepted (lazy consensus) & implemented
Read Distributed Mail Queue for full context.
enqueuedMailsV3 and deletedMailsV2 are never cleaned up, and the corresponding blobs stay referenced forever. This is not ideal from either a privacy or a storage cost point of view.
Note that enqueuedMailsV3 and deletedMailsV2 rely on timeWindowCompactionStrategy.
Add a new contentStart table referencing the point in time from which a given mailQueue holds data, for each mail queue.
The values contained between contentStart and browseStart can safely be deleted.
We can perform this cleanup upon browseStartUpdate: once it has finished, we can browse then delete the content of enqueuedMailsV3 and deletedMailsV2 contained between contentStart and the new browseStart, then safely set contentStart to the new browseStart.
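The cleanup sequence described above can be sketched as follows. This is a minimal in-memory model, not the Cassandra implementation: the `MailQueueTables` class, the one-hour `SLICE_WINDOW`, and the table layout (enqueued mails grouped by slice start, deleted mails as a set of enqueue ids) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumed slice duration; the real value is a mail queue configuration detail.
SLICE_WINDOW = timedelta(hours=1)


@dataclass
class MailQueueTables:
    """Hypothetical in-memory stand-in for the tables of one mail queue."""
    content_start: datetime
    browse_start: datetime
    # Slice start -> enqueue ids stored in that slice (models enqueuedMailsV3).
    enqueued_mails: dict = field(default_factory=dict)
    # Enqueue ids flagged as deleted (models deletedMailsV2).
    deleted_mails: set = field(default_factory=set)


def slices_between(start, end):
    """Yield the start of every slice in [start, end)."""
    current = start
    while current < end:
        yield current
        current += SLICE_WINDOW


def update_browse_start(tables, new_browse_start):
    """Advance browseStart, clean what lies before it, then advance contentStart."""
    tables.browse_start = new_browse_start
    for slice_start in slices_between(tables.content_start, new_browse_start):
        # Browse the slice first so the matching deletedMails entries can be dropped.
        for enqueue_id in tables.enqueued_mails.get(slice_start, []):
            tables.deleted_mails.discard(enqueue_id)
        tables.enqueued_mails.pop(slice_start, None)
    # Reached only when every deletion above succeeded: if one of them raises,
    # contentStart is left untouched and the next update covers the same range.
    tables.content_start = new_browse_start
```

Because contentStart only moves after the deletions complete, running the function again over an already cleaned range is harmless in this model.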
Content before browseStart can safely be considered deletable, and is no longer exposed by the application. We don't need an additional grace period mechanism for contentStart.
A failed cleanup is not an issue: the content will eventually be cleaned up upon the next browseStart update.
Furthermore, we will delete blobStore content upon dequeue, including when the mail has been deleted or purged via the MailQueue management APIs.
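As a rough illustration of this rule, the sketch below releases the blob of every item the dequeuer consumes, whether the mail is delivered or was already flagged as deleted. `InMemoryBlobStore`, the `dequeue` helper, and the `(enqueue_id, blob_id)` pair layout are hypothetical names chosen for the example, not the MailQueue API.

```python
from collections import deque


class InMemoryBlobStore:
    """Hypothetical blob store keeping payloads in a dict."""

    def __init__(self):
        self.blobs = {}

    def save(self, blob_id, payload):
        self.blobs[blob_id] = payload

    def read(self, blob_id):
        return self.blobs[blob_id]

    def delete(self, blob_id):
        # Deleting an absent blob is a no-op, so retries are safe.
        self.blobs.pop(blob_id, None)


def dequeue(enqueued, deleted_ids, blob_store):
    """Return the next live mail payload, deleting the blob of every consumed item."""
    while enqueued:
        enqueue_id, blob_id = enqueued.popleft()
        if enqueue_id in deleted_ids:
            # Mail deleted or purged through management: still release its blob.
            blob_store.delete(blob_id)
            continue
        payload = blob_store.read(blob_id)
        blob_store.delete(blob_id)
        return payload
    return None
```

In this model no blob outlives the dequeue of its mail, which is the property the decision aims for.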
All Cassandra SSTables before browseStart can safely be dropped as part of the timeWindowCompactionStrategy.
Updating browseStart will then be twice as expensive, as we need to unreference past slices.
Eventually, this will allow reclaiming Cassandra disk space and enforce mail privacy by removing dangling metadata.
A proposal was made to piggyback cleanup upon dequeue/delete operations. The dequeuer/deleter then directly removes the related metadata from enqueuedMailsV3 and deletedMailsV2. This simpler design however has several flaws: