blob: 0402a4bc1e49f5dc0fff4a3b9846720f7ade815f [file] [log] [blame] [view]
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Overview
The Consensus Package provides the Consensus Layer's service definitions and its implementations.
### What is Consensus Layer used for?
Generally, We maintain multiple copies of application data for the purpose of fault-tolerance and
data-integrity. There are variety of consensus algorithms to manage multiple copies of data, which
differs in levels of data consistency and performance. Each consensus algorithm may have multiple
industrial implementations. The Consensus Layer aims to hide all complications behind different
consensus algorithms and implementations, providing higher level of abstraction to the user.
# Consensus Layer
### Basic Concepts
* `IStateMachine` is the user application that manages a local copy of data.
* `Peer` is the smallest consensus unit (inside a process), which holds a `IStateMachine`
internally.
* `ConsensusGroup` is a group of `Peer` all managing the same copy of data.
* `IConsensus` interface defines the basic functionality provided by Consensus Layer.
* `ConsensusFactory` is the only factory class exposed to other modules to create a consensus layer
implementation.
User application can create a `ConsensusGroup` with k `Peer` to store data, i.e. that there will be
k copies of data.
When write data into a `ConsensusGroup` using `IConsensus::write`, it will be sent to the group
leader's `IStateMachine::write` . The leader makes decision about this write operation first, then
applies write-operation to the local statemachine using `IStateMachine::write`, and forward this
operation to other members' `IStateMachine::write` in the same group
### How to use
1. Define the `IStateMachine` to manage local copy of data.
2. Define the `IConsensusRequest` to customize request format
3. Define the `DataSet` to customize the response format
4. Select a specific "IConsensus" class name and call `ConsensusFactory.getConsensusImpl()` to
instantiate the corresponding consensus protocol
# Ratis Consensus Implementation
RatisConsensus is a multi-raft implementation of `IConsensus` protocol. It is based
on [Apache Ratis](https://ratis.apache.org/).
### 1. build and start a RatisConsensusImpl
```java
IConsensus consensusImpl =
ConsensusFactory.getConsensusImpl(
ConsensusFactory.RatisConsensus,
new Endpoint(conf.getRpcAddress(), conf.getInternalPort()),
new File(conf.getConsensusDir()),
gid -> new ConfigRegionStateMachine())
.orElseThrow(() ->
new IllegalArgumentException(
String.format(
ConsensusFactory.CONSTRUCT_FAILED_MSG,
ConsensusFactory.RatisConsensus)));
consensusImpl.start();
```
* `endpoint` is the communication endpoint for this consensusImpl.
* `StateMachineRegistry` Indicates that the consensus layer should generate different state machines for different gid
* `StoreageDir` specifies the location to store RaftLog. Assign a fix location so that the
RatisConsensus knows where to recover when crashes and restarts.
### 2. assign local RatisConsensus a new Group
```java
ConsensusGroup group = new ConsensusGroup(...);
response = consensusImpl.addConsensusGroup(group.getGroupId(),group.getPeers());
```
The underling consensusImpl will initialize its states, and reaching out to other peers to elect the
raft leader.
**Notice**: this request may fail. It's caller's responsibility to retry / rollback.
### 3. change group configuration
#### (1) remove a member
suppose now the group contains peer[0,1,2], and we want to remove[1,2] from this group
```java
// the following code should be called in peer both 1 & 2
// first use removePeer to inform the group leader of configuration change
consensusImpl.removePeer(gid,myself);
// then use removeConsensusGroup to clean up local states and data
consensusImpl.removeConsensusGroup(gid);
```
**Notice**: either of `removePeer` or `removeConsensusGroup` may fail. It's caller's responsibility
to retry and make these two calls atomic.
#### (2) add a member
adding a new member is similar to removing a member except that you should call `addConsensusGroup`
first and then `addPeer`
```java
// the following code should be called in peer both 1 & 2
// first addConsensusGroup to initialize local states
consensusImpl.addConsensusGroup(gid);
// then use addPeer to inform the previous group members of joining a new member
consensusImpl.addPeer(gid,myself);
```
#### (3) add/remove multiple members
```java
// pre. For each member newly added, call addConsensusGroup locally to initialize
consensusImpl.changePeer(group.getGroupId(),newGroupmember);
// after. For each member removed, call removeConsensusGroup locally to clean up
```
**Notice**: the old group and the new group must overlap in at least one member.
### 4. write data
```java
ConsensusWriteResponse response = consensusImpl.write(gid,request)
if(response.isSuccess() && response.getStates().code() == 200){
...
}
```
### 5. read data
```java
ConsensusReadResponse response = consensusImpl.read(gid,request);
if(response.isSuccess()){
MyDataSet result=(MyDataSet)response.getDataset();
}
```
**NOTICE**: currently in RatisConsensus, read will direct read the local copy. Thus, the result may
be stale and not linearizable!