39. Distributed blob garbage collector

Date: 2020-02-18

Status

Proposed, not implemented yet.

Work had been started on this topic.

Context

The body, headers, attachments of the mails are stored as blobs in a blob store. In order to save space in those stores, those blobs are de-duplicated using a hash of their content. To attain that the current blob store will read the content of the blob before saving it, and generate its id based on a hash of this content. This way two blobs with the same content will share the same id and thus be saved only once. This makes the safe deletion of one of those blobs a non trivial problem as we can't delete one blob without ensuring that all references to it are themselves deleted. For example if two messages share the same blob, when we delete one message there is at the time being no way to tell if the blob is still referenced by another message.

Decision

To address this issue, we propose to implement a distributed blob garbage collector built upon the previously developed Distributed Task Manager. The de-duplicating blob store will keep track of the references pointing toward a blob in a References table. It will also keep track of the deletion requests for a blob in a Deletions table. When the garbage collector algorithm runs it will fetch from the Deletions table the blobs considered to be effectively deleted, and will check in the References table if there are still some references to them. If there is no more reference to a blob, it will be effectively deleted from the blob store.

To avoid concurrency issues, where we could garbage collect a blob at the same time a new reference to it appear, a reference generation notion will be added. The de-duplicating id of the blobs which before where constructed using only the hash of their content, will now include this reference generation too. At a given interval a new reference generation will be emitted, since then all new blobs will point to this new generation.

So a garbage collection iteration will run only on the reference generation n-2 to avoid concurrency issues.

The switch of generation will be triggered by a task running on the distributed task manager. This task will emit an event into the event sourcing system to increment the reference generation.

Alternatives

Not de-duplicating the blobs' content, this simple approach which involves storing the same blob a lot of times can in some scenario be really slow and costly. Albeit it can in some case be preferred for the sake of simplicity, data security...

Consequences

This change will necessitate to extract the base blob store responsibilities (store a blob, delete a blob, read a blob) from the current blob store implementation which is doing the de-duplication, id generation... The garbage collector will use this low level blob store in order to effectively delete the blobs.

One other consequence of this work, is the fact that there will be no de-duplication on different reference generation, i.e two blobs with the same content will be stored twice now, if they were created during two different reference generation.

When writing a blob into the de-duplicating blob store, we will need to specify the reference to the object (MessageId, AttachmentId...) we store the blob for. This can make some components harder to implement as we will have to propagate the references.

Since we will not build a distributed task scheduler. To increment the reference generation and launch periodically a garbage collection iteration, the scheduling will be done by an external scheduler (cron job, kubernetes cronjob ...) which will call a webadmin endpoint to launch this task periodically.

Algorithm visualisation

Generation 1 and Iteration 1

Events
- rg1 reference generation is emitted
- gci1 garbage collection iteration is emitted
- An email is sent to user1, a m1 message, and a blob b1 are stored with rg1
- An email is sent to user1 and user2, m2 and m3 messages, and a blob b2 are stored with rg1

Tables

Generations

reference generation id
rg1

garbage collection iteration id
gci1

Blobs

blob id	reference generation id
b1	rg1
b2	rg1

References

message id	blob id	reference generation id
m1	b1	rg1
m2	b2	rg1
m3	b2	rg1

Deletions

Empty

Generation 2 / Iteration 2

Events
- rg2 reference generation is emitted
- gci2 garbage collection iteration is emitted
- An email is sent to user1, a m4 message, and a blob b3 are stored with rg2
- An email is sent to user1 and user2, m5 and m6 messages, and a blob b4 are stored with rg2

Tables

Generations

reference generation id
rg1
rg2

garbage collection iteration id
gci1
gci2

Blobs

blob id	reference generation id
b1	rg1
b2	rg1
b3	rg2
b4	rg2

References

message id	blob id	reference generation id
m1	b1	rg1
m2	b2	rg1
m3	b2	rg1
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2

Deletions

Empty

Generation 3 / Iteration 3

Events
- rg3 reference generation is emitted
- gci3 garbage collection iteration is emitted
- An email is sent to user1, a m7 message, and a blob b5 are stored with rg3
- An email is sent to user1 and user2, m8 and m9 messages, and a blob b6 are stored with rg3
- user1 deletes m1, m2, m7, and m8 with gi3
- user2 deletes m3 with gi3

Tables: before deletions

Generations

reference generation id
rg1
rg2
rg3

garbage collection iteration id
gci1
gci2
gci3

Blobs

blob id	reference generation id
b1	rg1
b2	rg1
b3	rg2
b4	rg2
b5	rg3
b6	rg3

References

message id	blob id	reference generation id
m1	b1	rg1
m2	b2	rg1
m3	b2	rg1
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2
m7	b5	rg3
m8	b6	rg3
m9	b6	rg3

Deletions

Empty

Tables: after deletions

Generations

reference generation id
rg1
rg2
rg3

garbage collection iteration id
gci1
gci2
gci3

Blobs

blob id	reference generation id
b1	rg1
b2	rg1
b3	rg2
b4	rg2
b5	rg3
b6	rg3

References

message id	blob id	reference generation id
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2
m9	b6	rg3

Deletions

blob id	reference generation id	date	garbage collection iteration id
b1	rg1	10:42	gci3
b2	rg1	10:42	gci3
b2	rg1	13:37	gci3
b5	rg3	10:42	gci3
b6	rg3	10:42	gci3

Running the algorithm

fetch Deletions for gci3 in deletions
find distinct reference-generation-id of deletions in generations = {rg1, rg3}
For each generation
- rg1
  - filter deletions to keep only rg1 entries and extract blob-ids in concernedBlobs = {b1, b2}
  - fetch all references to concernedBlobs and build a Bloom-Filter in foundedReferences = {}
  - filter concernedBlobs to keep only those which are not present in foundedReferences in blobsToDelete = {b1, b2}
  - Remove blobsToDelete from Blobs and Deletions
- rg3
  - filter deletions to keep only rg3 entries and extract blob-ids in concernedBlobs = {b5, b6}
  - fetch all references to concernedBlobs and build a Bloom-Filter in foundedReferences = {b6}
  - filter concernedBlobs to keep only those which are not present in foundedReferences in blobsToDelete = {b5}
  - Remove blobsToDelete from Blobs and Deletions

Tables: after garbage collection

Generations

reference generation id
rg1
rg2
rg3

garbage collection iteration id
gci1
gci2
gci3

Blobs

blob id	reference generation id
b3	rg2
b4	rg2
b6	rg3

References

message id	blob id	generation id
m4	b3	g2
m5	b4	g2
m6	b4	g2
m9	b6	g3

Deletions

blob id	reference generation id	date	garbage collection iteration id
b6	rg3	10:42	gci3

Generations 4

Events
- rg4 reference generation is emitted
- gci4 garbage collection iteration is emitted
- user2 deletes m9 with gcg4

Tables: before deletions

Generations

reference generation id
rg1
rg2
rg3
rg4

garbage collection iteration id
gci1
gci2
gci3
gci4

Blobs

blob id	reference generation id
b3	rg2
b4	rg2
b6	rg3

References

message id	blob id	reference generation id
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2
m9	b6	rg3

Deletions

blob id	reference generation id	date	garbage collection iteration id
b6	rg3	10:42	gci3

Tables: after deletions

Generations

reference generation id
rg1
rg2
rg3
rg4

garbage collection iteration id
gci1
gci2
gci3
gci4

Blobs

blob id	reference generation id
b3	rg2
b4	rg2
b6	rg3

References

message id	blob id	reference generation id
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2

Deletions

blob id	reference generation id	date	garbage collection iteration id
b6	rg3	10:42	gci3
b6	rg3	18:42	gci4

Running the algorithm

fetch Deletions for gci4 in deletions
find distinct generation-id of deletions in generations = {rg3}
For each generation
- rg3
  - filter deletions to keep only rg3 entries and extract blob-ids in concernedBlobs = {b6}
  - fetch all references to concernedBlobs and build a Bloom-Filter in foundedReferences = {}
  - filter concernedBlobs to keep only those which are not present in foundedReferences in blobsToDelete = {b6}
  - Remove blobsToDelete from Blobs and Deletions

Tables: after garbage collection

Generations

reference generation id
rg1
rg2
rg3
rg4

garbage collection iteration id
gci1
gci2
gci3
gci4

Blobs

blob id	reference generation id
b3	rg2
b4	rg2

References

message id	blob id	reference generation id
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2

Deletions

Empty

message id	blob id	reference generation id
m1	b1	rg1
m2	b2	rg1
m3	b2	rg1
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2
m7	b5	rg3
m8	b6	rg3
m9	b6	rg3

message id	blob id	reference generation id
m1	b1	rg1
m2	b2	rg1
m3	b2	rg1
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2
m7	b5	rg3
m8	b6	rg3
m9	b6	rg3

message id	blob id	reference generation id
m1	b1	rg1
m2	b2	rg1
m3	b2	rg1
m4	b3	rg2
m5	b4	rg2
m6	b4	rg2
m7	b5	rg3
m8	b6	rg3
m9	b6	rg3