39. Distributed blob garbage collector
Date: 2020-02-18
Status
Proposed, not implemented yet.
Some work has been started on this topic.
Context
The bodies, headers, and attachments of mails are stored as blobs in a blob store. To save space in those stores, blobs are de-duplicated using a hash of their content: the current blob store reads the content of a blob before saving it and generates its id from a hash of this content. Two blobs with the same content thus share the same id and are saved only once. This makes the safe deletion of a blob a non-trivial problem, as we can't delete a blob without ensuring that all references to it have themselves been deleted. For example, if two messages share the same blob, there is currently no way to tell, when we delete one message, whether the blob is still referenced by the other.
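The de-duplication scheme described above can be sketched as follows. SHA-256 is an assumption here, standing in for whatever hash function the store actually uses:

```python
import hashlib

def blob_id(content: bytes) -> str:
    # De-duplicating id: derived solely from a hash of the content, so two
    # blobs with identical bytes get the same id and are stored only once.
    return hashlib.sha256(content).hexdigest()

# The same attachment sent in two different mails maps to a single blob,
# which is why deleting one mail must not delete the blob outright.
assert blob_id(b"same attachment") == blob_id(b"same attachment")
assert blob_id(b"same attachment") != blob_id(b"other content")
```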
Decision
To address this issue, we propose to implement a distributed blob garbage collector built upon the previously developed Distributed Task Manager. The de-duplicating blob store will keep track of the references pointing toward a blob in a `References` table, and of the deletion requests for a blob in a `Deletions` table. When the garbage collector runs, it will fetch from the `Deletions` table the blobs considered to be deleted, and check in the `References` table whether any references to them remain. If a blob has no remaining reference, it will be effectively deleted from the blob store.
To avoid concurrency issues, where we could garbage collect a blob at the same time a new reference to it appears, a `reference generation` notion will be added. The de-duplicating id of a blob, which before was constructed using only the hash of its content, will now include this `reference generation` too. At a given interval a new `reference generation` will be emitted; from then on, all new blobs will point to this new generation. A `garbage collection iteration` will therefore run only on reference generation `n-2`, so that it never races with references being created in the current generation.
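A minimal sketch of the generation-aware id, assuming a hypothetical `rg<n>_<hash>` naming scheme (the real id format is not specified here):

```python
import hashlib

def blob_id(content: bytes, reference_generation: int) -> str:
    # The id now combines the reference generation with the content hash:
    # identical content stored under different generations yields
    # different ids, and therefore distinct copies.
    digest = hashlib.sha256(content).hexdigest()
    return f"rg{reference_generation}_{digest}"

# Same content, same generation: de-duplicated.
assert blob_id(b"body", 1) == blob_id(b"body", 1)
# Same content, different generations: stored separately, which is what
# lets an iteration safely collect generation n-2 while new references
# can only appear in the current generation.
assert blob_id(b"body", 1) != blob_id(b"body", 2)
```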
The switch of generation will be triggered by a task running on the distributed task manager. This task will emit an event into the event sourcing system to increment the `reference generation`.
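A hypothetical sketch of that event-sourced increment (the event name and fold are illustrative, not the actual James event types): the task appends an event, and the current generation is derived from the event history.

```python
from dataclasses import dataclass

# Hypothetical event type emitted by the scheduled task.
@dataclass(frozen=True)
class ReferenceGenerationIncremented:
    pass

def current_generation(history) -> int:
    # A pure fold over the event log: one increment event per generation.
    return sum(1 for event in history
               if isinstance(event, ReferenceGenerationIncremented))

log = [ReferenceGenerationIncremented(), ReferenceGenerationIncremented()]
assert current_generation(log) == 2
```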
Alternatives
Not de-duplicating the blobs' content: this simple approach, which involves storing the same blob many times, can in some scenarios be really slow and costly. It can nonetheless be preferred in some cases for the sake of simplicity or data security.
Consequences
This change will require extracting the base blob store responsibilities (store a blob, delete a blob, read a blob) from the current blob store implementation, which also handles the de-duplication, id generation, etc. The garbage collector will use this low-level blob store to effectively delete the blobs.
Another consequence of this work is that there will be no de-duplication across different `reference generations`: two blobs with the same content will now be stored twice if they were created during two different reference generations.
When writing a blob into the de-duplicating blob store, we will need to specify the reference to the object (MessageId, AttachmentId, ...) we store the blob for. This can make some components harder to implement, as we will have to propagate these references.
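A hypothetical sketch of that consequence: besides the content, the caller passes the reference the blob is stored for, so a `References` row can be recorded alongside it (the `save` signature is illustrative, not the actual James API).

```python
import hashlib

references = []  # rows of (reference id, blob id, reference generation id)

def save(content: bytes, reference_id: str, generation: int) -> str:
    # De-duplicating id: reference generation plus content hash.
    blob_id = f"rg{generation}_" + hashlib.sha256(content).hexdigest()
    references.append((reference_id, blob_id, f"rg{generation}"))
    return blob_id

b_first = save(b"attachment bytes", "m1", 1)
b_second = save(b"attachment bytes", "m2", 1)
assert b_first == b_second   # de-duplicated within the generation
assert len(references) == 2  # but both references are tracked
```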
Since we will not build a distributed task scheduler, incrementing the `reference generation` and periodically launching a `garbage collection iteration` will be driven by an external scheduler (cron job, Kubernetes CronJob, ...) calling a webadmin endpoint to launch these tasks.
Algorithm visualisation
Generation 1 / Iteration 1
- Events
  - `rg1` reference generation is emitted
  - `gci1` garbage collection iteration is emitted
  - An email is sent to `user1`: a message `m1` and a blob `b1` are stored with `rg1`
  - An email is sent to `user1` and `user2`: messages `m2` and `m3` and a blob `b2` are stored with `rg1`
Tables
Generations
| reference generation id |
|---|
| rg1 |

| garbage collection iteration id |
|---|
| gci1 |
Blobs
| blob id | reference generation id |
|---|---|
| b1 | rg1 |
| b2 | rg1 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m1 | b1 | rg1 |
| m2 | b2 | rg1 |
| m3 | b2 | rg1 |
Deletions
Empty
Generation 2 / Iteration 2
- Events
  - `rg2` reference generation is emitted
  - `gci2` garbage collection iteration is emitted
  - An email is sent to `user1`: a message `m4` and a blob `b3` are stored with `rg2`
  - An email is sent to `user1` and `user2`: messages `m5` and `m6` and a blob `b4` are stored with `rg2`
Tables
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
Blobs
| blob id | reference generation id |
|---|---|
| b1 | rg1 |
| b2 | rg1 |
| b3 | rg2 |
| b4 | rg2 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m1 | b1 | rg1 |
| m2 | b2 | rg1 |
| m3 | b2 | rg1 |
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
Deletions
Empty
Generation 3 / Iteration 3
- Events
  - `rg3` reference generation is emitted
  - `gci3` garbage collection iteration is emitted
  - An email is sent to `user1`: a message `m7` and a blob `b5` are stored with `rg3`
  - An email is sent to `user1` and `user2`: messages `m8` and `m9` and a blob `b6` are stored with `rg3`
  - `user1` deletes `m1`, `m2`, `m7`, and `m8` with `gci3`
  - `user2` deletes `m3` with `gci3`
Tables: before deletions
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |
| rg3 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
| gci3 |
Blobs
| blob id | reference generation id |
|---|---|
| b1 | rg1 |
| b2 | rg1 |
| b3 | rg2 |
| b4 | rg2 |
| b5 | rg3 |
| b6 | rg3 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m1 | b1 | rg1 |
| m2 | b2 | rg1 |
| m3 | b2 | rg1 |
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
| m7 | b5 | rg3 |
| m8 | b6 | rg3 |
| m9 | b6 | rg3 |
Deletions
Empty
Tables: after deletions
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |
| rg3 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
| gci3 |
Blobs
| blob id | reference generation id |
|---|---|
| b1 | rg1 |
| b2 | rg1 |
| b3 | rg2 |
| b4 | rg2 |
| b5 | rg3 |
| b6 | rg3 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
| m9 | b6 | rg3 |
Deletions
| blob id | reference generation id | date | garbage collection iteration id |
|---|---|---|---|
| b1 | rg1 | 10:42 | gci3 |
| b2 | rg1 | 10:42 | gci3 |
| b2 | rg1 | 13:37 | gci3 |
| b5 | rg3 | 10:42 | gci3 |
| b6 | rg3 | 10:42 | gci3 |
Running the algorithm
- fetch `Deletions` for `gci3` into `deletions`
- find the distinct reference generation ids of `deletions`: `generations` = {rg1, rg3}
- for each generation:
  - `rg1`
    - filter `deletions` to keep only `rg1` entries and extract their blob ids: `concernedBlobs` = {b1, b2}
    - fetch all references to `concernedBlobs` and build a Bloom filter from them: `foundReferences` = {}
    - filter `concernedBlobs` to keep only those not present in `foundReferences`: `blobsToDelete` = {b1, b2}
    - remove `blobsToDelete` from `Blobs` and `Deletions`
  - `rg3`
    - filter `deletions` to keep only `rg3` entries and extract their blob ids: `concernedBlobs` = {b5, b6}
    - fetch all references to `concernedBlobs` and build a Bloom filter from them: `foundReferences` = {b6}
    - filter `concernedBlobs` to keep only those not present in `foundReferences`: `blobsToDelete` = {b5}
    - remove `blobsToDelete` from `Blobs` and `Deletions`
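The iteration above can be replayed as a small sketch. A plain set stands in for the Bloom filter, which is safe for this purpose: a Bloom filter has no false negatives, so it can only over-report references and keep a blob longer, never delete a still-referenced one.

```python
# State as it stands after the gci3 deletions, before garbage collection.
blobs = {"b1": "rg1", "b2": "rg1", "b3": "rg2", "b4": "rg2",
         "b5": "rg3", "b6": "rg3"}
references = [("m4", "b3", "rg2"), ("m5", "b4", "rg2"),
              ("m6", "b4", "rg2"), ("m9", "b6", "rg3")]
deletions = [("b1", "rg1"), ("b2", "rg1"), ("b2", "rg1"),
             ("b5", "rg3"), ("b6", "rg3")]

def collect(blobs, references, deletions):
    # Distinct reference generations appearing in the deletion requests.
    for generation in sorted({g for _, g in deletions}):
        # Filter deletions to keep only this generation's blob ids.
        concerned = {b for b, g in deletions if g == generation}
        # Fetch all remaining references to those blobs (set = Bloom filter).
        found = {b for _, b, g in references if b in concerned}
        # Blobs with no remaining reference can be deleted for real.
        to_delete = concerned - found
        for b in to_delete:
            del blobs[b]
        deletions[:] = [(b, g) for b, g in deletions if b not in to_delete]

collect(blobs, references, deletions)
assert set(blobs) == {"b3", "b4", "b6"}  # b1, b2, b5 collected
assert deletions == [("b6", "rg3")]      # b6 kept: still referenced by m9
```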
Tables: after garbage collection
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |
| rg3 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
| gci3 |
Blobs
| blob id | reference generation id |
|---|---|
| b3 | rg2 |
| b4 | rg2 |
| b6 | rg3 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
| m9 | b6 | rg3 |
Deletions
| blob id | reference generation id | date | garbage collection iteration id |
|---|---|---|---|
| b6 | rg3 | 10:42 | gci3 |
Generation 4 / Iteration 4
- Events
  - `rg4` reference generation is emitted
  - `gci4` garbage collection iteration is emitted
  - `user2` deletes `m9` with `gci4`
Tables: before deletions
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |
| rg3 |
| rg4 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
| gci3 |
| gci4 |
Blobs
| blob id | reference generation id |
|---|---|
| b3 | rg2 |
| b4 | rg2 |
| b6 | rg3 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
| m9 | b6 | rg3 |
Deletions
| blob id | reference generation id | date | garbage collection iteration id |
|---|---|---|---|
| b6 | rg3 | 10:42 | gci3 |
Tables: after deletions
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |
| rg3 |
| rg4 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
| gci3 |
| gci4 |
Blobs
| blob id | reference generation id |
|---|---|
| b3 | rg2 |
| b4 | rg2 |
| b6 | rg3 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
Deletions
| blob id | reference generation id | date | garbage collection iteration id |
|---|---|---|---|
| b6 | rg3 | 10:42 | gci3 |
| b6 | rg3 | 18:42 | gci4 |
Running the algorithm
- fetch `Deletions` for `gci4` into `deletions`
- find the distinct reference generation ids of `deletions`: `generations` = {rg3}
- for each generation:
  - `rg3`
    - filter `deletions` to keep only `rg3` entries and extract their blob ids: `concernedBlobs` = {b6}
    - fetch all references to `concernedBlobs` and build a Bloom filter from them: `foundReferences` = {}
    - filter `concernedBlobs` to keep only those not present in `foundReferences`: `blobsToDelete` = {b6}
    - remove `blobsToDelete` from `Blobs` and `Deletions`
Tables: after garbage collection
Generations
| reference generation id |
|---|
| rg1 |
| rg2 |
| rg3 |
| rg4 |

| garbage collection iteration id |
|---|
| gci1 |
| gci2 |
| gci3 |
| gci4 |
Blobs
| blob id | reference generation id |
|---|---|
| b3 | rg2 |
| b4 | rg2 |
References
| message id | blob id | reference generation id |
|---|---|---|
| m4 | b3 | rg2 |
| m5 | b4 | rg2 |
| m6 | b4 | rg2 |
Deletions
Empty