Fix purge infos replicating to the wrong shards during shard splitting.

Previously, the internal replicator (mem3_rep) replicated purge infos to/from
all the target shards. Instead, it should push/pull purge infos only to the
target ranges they belong to, as determined by the database's hash function.
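The intended behavior can be sketched roughly as follows. This is a
hypothetical Python illustration, not the actual mem3_rep code: the function
names, the purge-info shape, and the use of crc32 over the doc id are all
assumptions made for the sake of the example.

```python
import zlib

RING_TOP = 2**32 - 1  # top of the assumed 32-bit hash ring

def in_range(doc_id, range_begin, range_end):
    # Hash the doc id and check whether it lands in the shard's range.
    h = zlib.crc32(doc_id.encode("utf-8"))
    return range_begin <= h <= range_end

def filter_purge_infos(purge_infos, range_begin, range_end):
    # Keep only the purge infos whose doc id hashes into the target range,
    # instead of pushing every purge info to every target shard.
    return [pi for pi in purge_infos
            if in_range(pi["docid"], range_begin, range_end)]
```

With this filter in place, splitting a range in half routes each purge info to
exactly one of the two halves.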

Users experienced this bug as a failure when a database containing purges was
split twice in a row. For example, if a Q=8 database is split to Q=16, then
split again from Q=16 to Q=32, the second split operation could fail with a
`split_state:initial_copy ...{{badkey,not_in_range}` error. The misplaced
purge infos would be noticed only during the second split, when the initial
copy phase would crash because some purge infos did not hash to either of the
two target ranges. Moreover, the crash triggered repeated retries, which
generated a huge job history log.

The fix consists of three improvements:

  1) The internal replicator is updated to filter purge infos based on the
     database's hash function.

  2) Account for the fact that some users' databases might already contain
     misplaced purge infos. Since this is a known bug, we anticipate the error
     and ignore misplaced purge infos during the second shard split operation,
     emitting a warning in the logs.

  3) Make similar range errors fatal, and emit a clear error in the logs and
     the job history so any future range errors are immediately obvious.
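The tolerant routing in improvement 2 could look roughly like the sketch
below. Again, this is a hypothetical Python illustration under the same
assumptions as above (crc32 over the doc id, illustrative names), not the
actual implementation:

```python
import logging
import zlib

log = logging.getLogger("shard_split")

def route_purge_infos(purge_infos, targets):
    # targets: list of (range_begin, range_end, bucket) tuples, one per
    # target shard of the split. Returns the misplaced purge infos.
    misplaced = []
    for pi in purge_infos:
        h = zlib.crc32(pi["docid"].encode("utf-8"))
        for begin, end, bucket in targets:
            if begin <= h <= end:
                bucket.append(pi)
                break
        else:
            # Previously the initial copy phase crashed here with
            # {badkey,not_in_range}; now the misplaced purge info is
            # skipped with a warning instead.
            log.warning("ignoring misplaced purge info for doc %s",
                        pi["docid"])
            misplaced.append(pi)
    return misplaced
```

Routing through all target ranges leaves nothing misplaced, while routing
through only a subset skips (and warns about) the purge infos that hash
outside it.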

Fixes #4624