Ratis Configuration Reference

Server Configurations

Most of the server configurations can be found at RaftServerConfigKeys. To customize configurations, we may

  1. create a RaftProperties object,
  2. set the desired values, and then
  3. pass the customized RaftProperties object when building a RaftServer.

For example,

final RaftProperties properties = new RaftProperties();
RaftServerConfigKeys.LeaderElection.setPreVote(properties, false);
final RaftServer raftServer = RaftServer.newBuilder()
    .setServerId(id)
    .setStateMachine(stateMachine)
    .setGroup(raftGroup)
    .setProperties(properties)
    .build();

Server

Property: raft.server.storage.dir
Description: root storage directory to hold RaftServer data
Type: List<File>
Default: /tmp/raft-server/

Property: raft.server.storage.free-space.min
Description: minimal space requirement for storage dir
Type: SizeInBytes
Default: 0MB

Property: raft.server.removed.groups.dir
Description: storage directory to hold removed groups
Type: File
Default: /tmp/raft-server/removed-groups/

// GroupManagementApi
RaftClientReply remove(RaftGroupId groupId,
    boolean deleteDirectory, boolean renameDirectory) throws IOException;

When removing an existing group, if the deleteDirectory flag is set to false and renameDirectory is set to true, the group data will be renamed to this dir instead of being deleted.


Property: raft.server.sleep.deviation.threshold
Description: deviation threshold for election sleep
Type: TimeDuration
Default: 300ms

When a server is a follower, it sleeps and wakes up from time to time for checking the heartbeat condition. If it cannot receive a heartbeat from the leader within the election timeout, it starts a leader election.

When a follower server wakes up from a sleep, if the actual sleep time is longer than the intended sleep time by this threshold, it will immediately go back to sleep again, instead of checking the heartbeat condition. The extra sleep time indicates that the server is too busy, probably due to GC.
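
The wake-up check described above can be sketched as follows. This is a minimal illustration in plain Java, not the Ratis implementation; the class name SleepDeviationSketch and the method oversleptTooMuch are hypothetical.

```java
import java.time.Duration;

/** Hypothetical sketch of the sleep-deviation check; not Ratis code. */
final class SleepDeviationSketch {
  private SleepDeviationSketch() {}

  /**
   * @return true if the actual sleep exceeded the intended sleep by more than
   *         the threshold, i.e. the follower should go back to sleep instead
   *         of checking the heartbeat condition.
   */
  static boolean oversleptTooMuch(Duration intended, Duration actual, Duration threshold) {
    final Duration extra = actual.minus(intended);
    return extra.compareTo(threshold) > 0;
  }
}
```

With the default threshold of 300ms, an intended sleep of 100ms that actually took 450ms counts as a deviation, while one that took 350ms does not.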


Property: raft.server.staging.catchup.gap
Description: catch-up criterion for a new peer
Type: int
Default: 1000

When bootstrapping a new peer, if the gap between the match index of the peer and the leader's latest committed index is less than this gap, we treat the peer as caught-up. Increase this number when write throughput is high.
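
The caught-up test can be written as a one-line comparison. The sketch below is illustrative only; CatchupSketch and isCaughtUp are hypothetical names, not part of the Ratis API.

```java
/** Hypothetical sketch of the caught-up criterion; not Ratis code. */
final class CatchupSketch {
  private CatchupSketch() {}

  /**
   * @param leaderCommitIndex the leader's latest committed index
   * @param peerMatchIndex    the match index of the new peer
   * @param catchupGap        the value of raft.server.staging.catchup.gap
   * @return true if the peer is treated as caught-up
   */
  static boolean isCaughtUp(long leaderCommitIndex, long peerMatchIndex, long catchupGap) {
    return leaderCommitIndex - peerMatchIndex < catchupGap;
  }
}
```

With the default gap of 1000, a peer at match index 4500 is caught-up when the leader has committed index 5000, while a peer at index 3000 is not.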


ThreadPool - Configurations related to server thread pools.

  • Proxy thread pool: threads that recover and initialize RaftGroups when RaftServer starts.
Property: raft.server.threadpool.proxy.cached
Description: use CachedThreadPool; otherwise, use newFixedThreadPool
Type: boolean
Default: true

Property: raft.server.threadpool.proxy.size
Description: the maximum pool size
Type: int
Default: 0 (unlimited for CachedThreadPool; for FixedThreadPool, it must be >0)

  • Server thread pool: threads that handle internal RPCs, such as appendEntries.
Property: raft.server.threadpool.server.cached
Description: use CachedThreadPool; otherwise, use newFixedThreadPool
Type: boolean
Default: true

Property: raft.server.threadpool.server.size
Description: the maximum pool size
Type: int
Default: 0 (unlimited for CachedThreadPool; for FixedThreadPool, it must be >0)

  • Client thread pool: threads that handle client requests, such as client.io().send() and client.async().send().
Property: raft.server.threadpool.client.cached
Description: use CachedThreadPool; otherwise, use newFixedThreadPool
Type: boolean
Default: true

Property: raft.server.threadpool.client.size
Description: the maximum pool size
Type: int
Default: 0 (unlimited for CachedThreadPool; for FixedThreadPool, it must be >0)
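
How a cached/size pair might map onto a java.util.concurrent executor is sketched below. This is an assumption-laden illustration, not the Ratis implementation: ThreadPoolSketch and newPool are hypothetical names, and the handling of a cached pool with a non-zero size (a bounded pool whose idle threads time out) is a guess at one reasonable behavior.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of choosing a pool from the cached/size settings. */
final class ThreadPoolSketch {
  private ThreadPoolSketch() {}

  static ExecutorService newPool(boolean cached, int size) {
    if (size < 0) {
      throw new IllegalArgumentException("size must be >= 0: " + size);
    }
    if (cached) {
      if (size == 0) {
        // 0 means unlimited for a cached pool.
        return Executors.newCachedThreadPool();
      }
      // Assumption: at most `size` threads, and idle threads are reclaimed.
      final ThreadPoolExecutor bounded = new ThreadPoolExecutor(
          size, size, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
      bounded.allowCoreThreadTimeOut(true);
      return bounded;
    }
    if (size == 0) {
      throw new IllegalArgumentException("a fixed pool requires size > 0");
    }
    return Executors.newFixedThreadPool(size);
  }
}
```

Callers are expected to shut the returned executor down when the server stops.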

Read - Configurations related to read-only requests.

Property: raft.server.read.option
Description: option for processing read-only requests
Type: Read.Option enum [DEFAULT, LINEARIZABLE]
Default: Read.Option.DEFAULT

  • Read.Option.DEFAULT - Directly query the state machine:
    • It is efficient but does not provide linearizability.
    • Only the leader can serve read requests. Followers can serve only stale-read requests.
  • Read.Option.LINEARIZABLE - Use ReadIndex (see Raft Paper section 6.4):
    • It provides linearizability.
    • Both the leader and the followers can serve read requests.

Property: raft.server.read.timeout
Description: request timeout for linearizable read-only requests
Type: TimeDuration
Default: 10s

Property: raft.server.read.leader.lease.enabled
Description: whether to enable lease in linearizable read-only requests
Type: boolean
Default: true

Property: raft.server.read.leader.lease.timeout.ratio
Description: maximum timeout ratio of leader lease
Type: double, in the range (0.0, 1.0)
Default: 0.9

Read After Write - Configurations related to read-after-write-consistency

Property: raft.server.read.read-after-write-consistent.write-index-cache.expiry-time
Description: expiration time for the server's memorized last written index of a specific client
Type: TimeDuration
Default: 60s

Write - Configurations related to write requests.

  • Limits on pending write requests
Property: raft.server.write.element-limit
Description: maximum number of pending write requests
Type: int
Default: 4096

Property: raft.server.write.byte-limit
Description: maximum byte size of all pending write requests
Type: SizeInBytes
Default: 64MB

Ratis imposes limitations on pending write requests. If the number of pending requests exceeds element-limit or the request size accumulated in pending requests exceeds byte-limit, the server rejects new incoming write requests until the pending situation is relieved.
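
The admission rule above amounts to tracking two counters. The following is a minimal sketch in plain Java, not the Ratis implementation; PendingWriteLimitSketch, tryAcquire and release are hypothetical names.

```java
/** Hypothetical sketch of element-limit / byte-limit admission control. */
final class PendingWriteLimitSketch {
  private final int elementLimit;
  private final long byteLimit;
  private int pendingCount;
  private long pendingBytes;

  PendingWriteLimitSketch(int elementLimit, long byteLimit) {
    this.elementLimit = elementLimit;
    this.byteLimit = byteLimit;
  }

  /** @return false (request rejected) if admitting it would exceed either limit. */
  synchronized boolean tryAcquire(long requestBytes) {
    if (pendingCount + 1 > elementLimit || pendingBytes + requestBytes > byteLimit) {
      return false;
    }
    pendingCount++;
    pendingBytes += requestBytes;
    return true;
  }

  /** Called when a pending request completes, which relieves the pressure. */
  synchronized void release(long requestBytes) {
    pendingCount--;
    pendingBytes -= requestBytes;
  }
}
```

A server would call tryAcquire when a write request arrives and release once the request has been replied to; a rejected request can be retried by the client later.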


Property: raft.server.write.follower.gap.ratio.max
Description: the threshold between the majority committed index and a slow follower's committed index to guarantee that the data is in cache
Type: int
Default: -1 (feature disabled)

Watch - Configurations related to watch requests.

Property: raft.server.watch.element-limit
Description: maximum number of pending watch requests
Type: int
Default: 65536

Property: raft.server.watch.timeout
Description: watch request timeout
Type: TimeDuration
Default: 10s

Property: raft.server.watch.timeout.denomination
Description: watch request timeout denomination for rounding up
Type: TimeDuration
Default: 1s

Note that watch.timeout must be a multiple of watch.timeout.denomination.
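
The multiple-of constraint can be validated before the properties are handed to the server. The sketch below uses java.time.Duration rather than the Ratis TimeDuration type; WatchTimeoutSketch, isMultiple and roundUp are hypothetical names, and roundUp only illustrates what "rounding up" to a denomination means.

```java
import java.time.Duration;

/** Hypothetical sketch for checking watch.timeout against its denomination. */
final class WatchTimeoutSketch {
  private WatchTimeoutSketch() {}

  /** @return true if timeout is an exact multiple of denomination. */
  static boolean isMultiple(Duration timeout, Duration denomination) {
    requirePositive(denomination);
    return timeout.toNanos() % denomination.toNanos() == 0;
  }

  /** Round a timeout up to the next multiple of the denomination. */
  static Duration roundUp(Duration timeout, Duration denomination) {
    requirePositive(denomination);
    final long unit = denomination.toNanos();
    final long units = (timeout.toNanos() + unit - 1) / unit;
    return denomination.multipliedBy(units);
  }

  private static void requirePositive(Duration d) {
    if (d.isZero() || d.isNegative()) {
      throw new IllegalArgumentException("denomination must be positive: " + d);
    }
  }
}
```

The defaults (10s timeout, 1s denomination) satisfy the constraint; a 1500ms timeout with a 1s denomination does not.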


Log - Configurations related to raft log.

Property: raft.server.log.use.memory
Description: use memory RaftLog
Type: boolean
Default: false

Only use memory RaftLog for testing.


Property: raft.server.log.queue.element-limit
Description: maximum number of pending log tasks
Type: int
Default: 4096

Property: raft.server.log.queue.byte-limit
Description: maximum byte size of all pending log tasks
Type: SizeInBytes
Default: 64MB

Note that log.queue.element-limit and log.queue.byte-limit are similar to write.element-limit and write.byte-limit. When the pending IO tasks reach a limit, Ratis temporarily stalls new IO tasks.


Property: raft.server.log.purge.gap
Description: minimal log gap between two purge tasks
Type: int
Default: 1024

Property: raft.server.log.purge.upto.snapshot.index
Description: purge logs up to snapshot index when taking a new snapshot
Type: boolean
Default: false

Property: raft.server.log.purge.preservation.log.num
Description: number of logs to preserve when purging logs up to snapshot index
Type: long
Default: 0

Property: raft.server.log.segment.size.max
Description: max file size for a single Raft Log Segment
Type: SizeInBytes
Default: 32MB

Property: raft.server.log.segment.cache.num.max
Description: the maximum number of segments caching log entries besides the open segment
Type: int
Default: 6

Property: raft.server.log.segment.cache.size.max
Description: the maximum byte size of segments caching log entries
Type: SizeInBytes
Default: 200MB

Property: raft.server.log.preallocated.size
Description: preallocated size of a log segment
Type: SizeInBytes
Default: 4MB

Property: raft.server.log.write.buffer.size
Description: size of direct byte buffer for SegmentedRaftLog FileChannel
Type: SizeInBytes
Default: 8MB

Property: raft.server.log.force.sync.num
Description: perform RaftLog flush tasks when the number of pending flush tasks exceeds force.sync.num
Type: int
Default: 128

Property: raft.server.log.unsafe-flush.enabled
Description: unsafe-flush allows increasing the flush index without waiting for the actual async-flush to complete
Type: boolean
Default: false

Property: raft.server.log.async-flush.enabled
Description: async-flush enables flushing the RaftLog asynchronously
Type: boolean
Default: false

Property: raft.server.log.corruption.policy
Description: the policy to handle corrupted raft log
Type: Log.CorruptionPolicy enum [EXCEPTION, WARN_AND_RETURN]
Default: CorruptionPolicy.EXCEPTION

  1. Log.CorruptionPolicy.EXCEPTION: Rethrow the exception.
  2. Log.CorruptionPolicy.WARN_AND_RETURN: Print a warning log message and return all uncorrupted log entries up to the corruption.
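
The difference between the two corruption policies can be sketched with a toy log reader. Everything here is hypothetical (the CorruptionPolicySketch class, its Entry type, and the corrupted flag standing in for a failed checksum); it only illustrates the two behaviors listed above and is not how Ratis reads log segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two corruption policies; not Ratis code. */
final class CorruptionPolicySketch {
  enum Policy { EXCEPTION, WARN_AND_RETURN }

  /** Toy log entry; `corrupted` stands in for a failed checksum. */
  static final class Entry {
    final long index;
    final boolean corrupted;

    Entry(long index, boolean corrupted) {
      this.index = index;
      this.corrupted = corrupted;
    }
  }

  private CorruptionPolicySketch() {}

  /** @return the indices of the entries loaded from the segment. */
  static List<Long> load(List<Entry> segment, Policy policy) {
    final List<Long> loaded = new ArrayList<>();
    for (Entry e : segment) {
      if (e.corrupted) {
        final String message = "corrupted entry at index " + e.index;
        if (policy == Policy.EXCEPTION) {
          throw new IllegalStateException(message);
        }
        // WARN_AND_RETURN: keep what was read before the corruption.
        System.err.println("WARN: " + message + "; returning " + loaded.size() + " entries");
        return loaded;
      }
      loaded.add(e.index);
    }
    return loaded;
  }
}
```

For a segment whose third entry is corrupted, WARN_AND_RETURN yields the first two entries, while EXCEPTION propagates the failure to the caller.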

StateMachineData - Configurations related to StateMachine.DataApi

Property: raft.server.log.statemachine.data.sync
Description: RaftLog flush should wait for statemachine data to sync
Type: boolean
Default: true

Property: raft.server.log.statemachine.data.sync.timeout
Description: maximum timeout for statemachine data sync
Type: TimeDuration
Default: 10s

Property: raft.server.log.statemachine.data.sync.timeout.retry
Description: retry policy when statemachine data sync times out
Type: int
Default: -1

  • -1: retry indefinitely
  • 0: no retry
  • >0: the number of retries
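
The three cases of the retry setting can be folded into a single predicate. This is a minimal sketch with hypothetical names (SyncRetrySketch, shouldRetry), not the Ratis implementation.

```java
/** Hypothetical sketch of interpreting the sync.timeout.retry value. */
final class SyncRetrySketch {
  private SyncRetrySketch() {}

  /**
   * @param retryPolicy  the configured value: -1, 0 or a positive count
   * @param retriesSoFar how many retries have already been attempted
   * @return true if another retry is allowed
   */
  static boolean shouldRetry(int retryPolicy, int retriesSoFar) {
    if (retryPolicy < 0) {
      return true; // -1: retry indefinitely
    }
    // 0: never retry; >0: retry until the configured count is used up
    return retriesSoFar < retryPolicy;
  }
}
```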

Property: raft.server.log.statemachine.data.read.timeout
Description: statemachine data read timeout when getting the entire log entry
Type: TimeDuration
Default: 1000ms

Property: raft.server.log.statemachine.data.caching.enabled
Description: enable RaftLogCache to cache statemachine data
Type: boolean
Default: false

If disabled, the state machine is responsible for caching the data. RaftLogCache will remove the state machine data part when caching a LogEntry, which avoids double caching.


Appender - Configurations related to leader's LogAppender

Property: raft.server.log.appender.buffer.element-limit
Description: limit on the number of log entries in a single AppendEntries RPC
Type: int
Default: 0 (no limit)

Property: raft.server.log.appender.buffer.byte-limit
Description: limit on the total byte size of the log entries in a single AppendEntries RPC
Type: SizeInBytes
Default: 4MB

The byte-limit bounds both

  • the max serialized size of a single log entry, and
  • the max payload of a single AppendEntries RPC.

Property: raft.server.log.appender.snapshot.chunk.size.max
Description: max chunk size of the snapshot contained in a single InstallSnapshot RPC
Type: SizeInBytes
Default: 16MB

Property: raft.server.log.appender.install.snapshot.enabled
Description: allow leader to send snapshot to followers
Type: boolean
Default: true

  • When install.snapshot.enabled is true and the leader detects that it does not contain the missing logs of a follower, the leader sends a snapshot to the follower as specified in the Raft Consensus Algorithm.
  • When install.snapshot.enabled is false, the leader won't send snapshots to the follower. It will just send a notification to that follower instead. The follower's statemachine is responsible for fetching and installing the snapshot by some other means.

Property: raft.server.log.appender.wait-time.min
Description: wait time between two subsequent AppendEntries
Type: TimeDuration
Default: 10ms

Property: raft.server.log.appender.retry.policy
Description: retry policy under error conditions
Type: string
Default: 1ms,10, 1s,20, 5s,1000

“1ms,10, 1s,20, 5s,1000” is a list of (wait time, retry count) pairs. It means: wait 1ms (a wait time of 0 is not allowed) before each of the first 10 retries (5 iterations, with 2 gRPC client retries each); then wait 1s before each of the next 20 retries (10 iterations, with 2 gRPC client retries each); then wait 5s before each remaining retry up to the maximum ((5s × 980) / 2 ≈ 40 min).
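
To make the pair format concrete, the sketch below parses such a string and looks up the wait time for a given retry. It is one reading of the format described above, not the Ratis parser: RetryPolicySketch and its methods are hypothetical, only the "ms" and "s" suffixes are handled, and any randomization the real policy may apply to sleep times is ignored.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical parser for "wait,count" retry-policy pairs; not Ratis code. */
final class RetryPolicySketch {
  private final List<Duration> waits = new ArrayList<>();
  private final List<Integer> counts = new ArrayList<>();

  RetryPolicySketch(String spec) {
    final String[] tokens = spec.split(",");
    if (tokens.length % 2 != 0) {
      throw new IllegalArgumentException("expected wait,count pairs: " + spec);
    }
    for (int i = 0; i < tokens.length; i += 2) {
      final Duration wait = parseDuration(tokens[i].trim());
      if (wait.isZero()) {
        throw new IllegalArgumentException("a wait time of 0 is not allowed");
      }
      waits.add(wait);
      counts.add(Integer.parseInt(tokens[i + 1].trim()));
    }
  }

  /** Only the "ms" and "s" suffixes are supported in this sketch. */
  static Duration parseDuration(String s) {
    if (s.endsWith("ms")) {
      return Duration.ofMillis(Long.parseLong(s.substring(0, s.length() - 2)));
    }
    if (s.endsWith("s")) {
      return Duration.ofSeconds(Long.parseLong(s.substring(0, s.length() - 1)));
    }
    throw new IllegalArgumentException("unsupported duration: " + s);
  }

  /** @return the wait before the given 0-based retry, or empty if retries are exhausted. */
  Optional<Duration> waitBefore(int retry) {
    int remaining = retry;
    for (int i = 0; i < waits.size(); i++) {
      if (remaining < counts.get(i)) {
        return Optional.of(waits.get(i));
      }
      remaining -= counts.get(i);
    }
    return Optional.empty();
  }
}
```

With the default string, retries 0 to 9 wait 1ms, retries 10 to 29 wait 1s, and later retries wait 5s until the last pair's count is used up.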


Snapshot - Configurations related to snapshot.

Property: raft.server.snapshot.auto.trigger.enabled
Description: whether to trigger snapshot when log size exceeds limit
Type: boolean
Default: false (by default, the state machine decides when to take a checkpoint)

Property: raft.server.snapshot.trigger-when-stop.enabled
Description: whether to trigger snapshot when raft server stops
Type: boolean
Default: true

Property: raft.server.snapshot.trigger-when-remove.enabled
Description: whether to trigger snapshot when raft server is removed
Type: boolean
Default: true

Property: raft.server.snapshot.creation.gap
Description: the log index gap between two snapshot creations
Type: long
Default: 1024

Property: raft.server.snapshot.auto.trigger.threshold
Description: log size limit (in number of applied log entries) that triggers the snapshot
Type: long
Default: 400000

Property: raft.server.snapshot.retention.file.num
Description: how many old snapshot versions to retain
Type: int
Default: -1 (keep only the latest snapshot)

DataStream - ThreadPool configurations related to DataStream Api.

Property: raft.server.data-stream.async.request.thread.pool.cached
Description: use CachedThreadPool; otherwise, use newFixedThreadPool
Type: boolean
Default: false

Property: raft.server.data-stream.async.request.thread.pool.size
Description: maximumPoolSize for async request pool
Type: int
Default: 32

Property: raft.server.data-stream.async.write.thread.pool.cached
Description: use CachedThreadPool; otherwise, use newFixedThreadPool
Type: boolean
Default: false

Property: raft.server.data-stream.async.write.thread.pool.size
Description: maximumPoolSize for async write pool
Type: int
Default: 16

Property: raft.server.data-stream.client.pool.size
Description: maximumPoolSize for data stream client pool
Type: int
Default: 10

RPC - Configurations related to Server RPC timeout.

Property: raft.server.rpc.request.timeout
Description: timeout for AppendEntries RPC
Type: TimeDuration
Default: 3000ms

Property: raft.server.rpc.sleep.time
Description: sleep time between two subsequent AppendEntries RPCs
Type: TimeDuration
Default: 25ms

Property: raft.server.rpc.slowness.timeout
Description: slowness timeout
Type: TimeDuration
Default: 60s

Note that slowness.timeout is used in two places:

  • The leader considers a follower slow if slowness.timeout elapses without any response from that follower.
  • If a server detects a JVM pause longer than slowness.timeout, it shuts itself down.

RetryCache - Configuration related to server retry cache.

Property: raft.server.retrycache.expire-time
Description: expire time of retry cache entry
Type: TimeDuration
Default: 60s

Note that the expiration time should be longer than the total retry waiting duration of clients in order to ensure exactly-once semantics.

Property: raft.server.retrycache.statistics.expire-time
Description: expire time of retry cache statistics
Type: TimeDuration
Default: 100us
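
The note on raft.server.retrycache.expire-time can be checked with simple arithmetic: sum the waits a client may spend retrying and compare the total with the expire time. The sketch below is illustrative only; RetryCacheExpirySketch and coversClientRetries are hypothetical names, and the list of client waits is an input the caller must derive from its own client retry policy.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical check that the retry cache outlives all client retries. */
final class RetryCacheExpirySketch {
  private RetryCacheExpirySketch() {}

  /**
   * @param expireTime       the value of raft.server.retrycache.expire-time
   * @param clientRetryWaits the wait before each client retry, in order
   * @return true if expireTime is longer than the total retry waiting duration
   */
  static boolean coversClientRetries(Duration expireTime, List<Duration> clientRetryWaits) {
    Duration total = Duration.ZERO;
    for (Duration wait : clientRetryWaits) {
      total = total.plus(wait);
    }
    return expireTime.compareTo(total) > 0;
  }
}
```

For example, the default 60s covers client waits of 1s, 2s and 4s, but not three waits of 30s each.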

Notification - Configurations related to state machine notifications.

Property: raft.server.notification.no-leader.timeout
Description: timeout value to notify the state machine when there is no leader for a period
Type: TimeDuration
Default: 60s

LeaderElection - Configurations related to leader election.

Property: raft.server.rpc.timeout.min
Description: Raft Protocol min election timeout
Type: TimeDuration
Default: 150ms

Property: raft.server.rpc.timeout.max
Description: Raft Protocol max election timeout
Type: TimeDuration
Default: 300ms

First election timeout is introduced to reduce unavailable time when a RaftGroup initially starts up.

Property: raft.server.rpc.first-election.timeout.min
Description: Raft Protocol min election timeout for the first election
Type: TimeDuration
Default: 150ms

Property: raft.server.rpc.first-election.timeout.max
Description: Raft Protocol max election timeout for the first election
Type: TimeDuration
Default: 300ms
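
In the Raft protocol, each server draws its election timeout at random from a fixed interval so that servers rarely time out at the same moment. Assuming the min and max values above delimit that interval, the draw can be sketched as follows; ElectionTimeoutSketch and randomTimeout are hypothetical names, and this is not the Ratis implementation.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of drawing a randomized election timeout. */
final class ElectionTimeoutSketch {
  private ElectionTimeoutSketch() {}

  /** @return a random timeout in [min, max], at millisecond granularity. */
  static Duration randomTimeout(Duration min, Duration max) {
    final long lo = min.toMillis();
    final long hi = max.toMillis();
    if (lo <= 0 || hi < lo) {
      throw new IllegalArgumentException("require 0 < min <= max: " + min + ", " + max);
    }
    // nextLong's upper bound is exclusive, so add 1 to include max.
    return Duration.ofMillis(ThreadLocalRandom.current().nextLong(lo, hi + 1));
  }
}
```

With the defaults, every draw falls between 150ms and 300ms.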

Property: raft.server.leaderelection.leader.step-down.wait-time
Description: when a leader steps down, it can't be re-elected until wait-time has elapsed
Type: TimeDuration
Default: 10s

Property: raft.server.leaderelection.pre-vote
Description: enable pre-vote
Type: boolean
Default: true

In Pre-Vote, the candidate does not change its term; instead, it tries to learn whether a majority of the cluster would be willing to grant it their votes (that is, whether the candidate’s log is sufficiently up-to-date and the voters have not received heartbeats from a valid leader for at least a baseline election timeout).

Property: raft.server.leaderelection.member.majority-add
Description: enable majority-add
Type: boolean
Default: false

Whether to allow majority-add, i.e. adding a majority of members in a single setConf.

  • Note that, when a single setConf removes and adds members at the same time, the majority is counted after the removal. For example,
    1. setConf to a 3-member group by adding 2 new members is NOT a majority-add.
    2. However, setConf to a 3-member group by removing 2 members and adding 2 new members is a majority-add.
  • Note also that adding 1 new member to a 1-member group is always allowed, although it is a majority-add.
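
One reading of the rule that is consistent with all three examples above is: a setConf is a majority-add when the number of newly added members is at least the number of existing members that remain after the removal. The sketch below encodes that reading. It is an assumption, not the Ratis implementation, and MajorityAddSketch and isMajorityAdd are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of classifying a setConf as a majority-add. */
final class MajorityAddSketch {
  private MajorityAddSketch() {}

  /**
   * @param current  member ids of the current configuration
   * @param proposed member ids of the configuration passed to setConf
   * @return true if the added members are at least as many as the
   *         members remaining after the removal (assumed definition)
   */
  static boolean isMajorityAdd(Set<String> current, Set<String> proposed) {
    final Set<String> remaining = new HashSet<>(current);
    remaining.retainAll(proposed); // members kept after the removal
    final Set<String> added = new HashSet<>(proposed);
    added.removeAll(current);      // members that are new
    return !added.isEmpty() && added.size() >= remaining.size();
  }
}
```

Under this reading, going from {A, B, C} to {A, B, C, D, E} is not a majority-add, while going from {A, B, C} to {A, D, E} and going from {A} to {A, B} both are; the special case that the last one is always allowed is not modeled here.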