HDDS-4948. SCM-HA documentation (#2050)

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
diff --git a/hadoop-hdds/docs/content/design/scmha.md b/hadoop-hdds/docs/content/design/scmha.md
index 5b153af..46acffb 100644
--- a/hadoop-hdds/docs/content/design/scmha.md
+++ b/hadoop-hdds/docs/content/design/scmha.md
@@ -4,7 +4,7 @@
 date: 2020-03-05
 jira: HDDS-2823
 status: implementing
-author: Li Cheng, Nandakumar Vadivelu
+author: Li Cheng, Nandakumar Vadivelu, Rui Wang, Glen Geng, Shashikant Banerjee
 ---
 <!--
   Licensed under the Apache License, Version 2.0 (the "License");
@@ -24,6 +24,15 @@
 
  Proposal to implement HA similar to the OM HA: Using Apache Ratis to propagate the 
  
-# Link
+# Links
 
- * https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing
+The main SCM HA design doc is available [here](https://docs.google.com/document/d/1vr_z6mQgtS1dtI0nANoJlzvF1oLV-AtnNJnxAgg69rM/edit?usp=sharing).
+
+During the implementation of SCM-HA, many smaller design docs were created for specific areas:
+
+ * [SCM HA Distributed Sequence ID Generator](https://docs.google.com/document/d/1LaXz_mjeXPmIKys3oogxQSDLVQOzewpIp3baPGT0Vqw/edit): about generating unique identifiers across multiple nodes of the HA quorum
+ * [SCM HA Service Manager](https://docs.google.com/document/d/1DbbqP0m3g_iEpY9qkSGOuQgcCN-QqlSNgWpvBOLv5h0/edit): about starting and stopping the main SCM services (like PipelineManager, ReplicationManager) in case of a failover
+ * [SCM HA SCMContext](https://docs.google.com/document/d/1h_3gpC4o2EpuBlcQiJC7MMoZz9JmaMX9CxObSxWU614/edit): about using a helper object which includes all the key information for all the required service components
+ * [SCM HA Snapshots](https://docs.google.com/document/d/1uy4_ER2V6nNQJ7_5455Wz8NmI142JHPnif6Y1OdPi8E/edit): about RAFT state-machine snapshots
+ * [SCM HA: DeleteBlockLog](https://docs.google.com/document/d/166Aea2EowSGWtAFWNlDv0gu4rA06dQ2rJAsBd-l210Q/edit): about coordinating block deletions in an HA environment
+ * [SCM HA: bootstrap](https://issues.apache.org/jira/secure/attachment/13021254/SCM%20HA%20Bootstrap_updated.pdf): about initializing the SCM HA cluster
\ No newline at end of file
diff --git a/hadoop-hdds/docs/content/feature/HA.md b/hadoop-hdds/docs/content/feature/OM-HA.md
similarity index 78%
rename from hadoop-hdds/docs/content/feature/HA.md
rename to hadoop-hdds/docs/content/feature/OM-HA.md
index 3f8ad53..1da660c 100644
--- a/hadoop-hdds/docs/content/feature/HA.md
+++ b/hadoop-hdds/docs/content/feature/OM-HA.md
@@ -1,10 +1,10 @@
 ---
-title: "High Availability"
+title: "OM High Availability"
 weight: 1
 menu:
    main:
       parent: Features
-summary: HA setup for Ozone to avoid any single point of failure.
+summary: HA setup for Ozone Manager to avoid any single point of failure.
 ---
 <!---
   Licensed to the Apache Software Foundation (ASF) under one or more
@@ -23,19 +23,19 @@
   limitations under the License.
 -->
 
-Ozone has two leader nodes (*Ozone Manager* for key space management and *Storage Container Management* for block space management) and storage nodes (Datanode). Data is replicated between datanodes with the help of RAFT consensus algorithm.
+Ozone has two types of metadata-manager nodes (*Ozone Manager* for key space management and *Storage Container Manager* for block space management) and multiple storage nodes (Datanodes). Data is replicated between Datanodes with the help of the RAFT consensus algorithm.
 
-To avoid any single point of failure the leader nodes also should have a HA setup.
+To avoid any single point of failure, the metadata-manager nodes should also have an HA setup.
 
- 1. HA of Ozone Manager is implemented with the help of RAFT (Apache Ratis)
- 2. HA of Storage Container Manager is [under implementation]({{< ref "scmha.md">}})
+Both Ozone Manager and Storage Container Manager support HA. In this mode the internal state is replicated via RAFT (with Apache Ratis).
+
+This document explains the HA setup of Ozone Manager (OM); please check [this page]({{< ref "SCM-HA" >}}) for SCM HA. While they can be set up for HA independently, a reliable, full HA setup requires enabling HA for both services.
 
 ## Ozone Manager HA
 
-A single Ozone Manager uses [RocksDB](https://github.com/facebook/rocksdb/) to persiste metadata (volumes, buckets, keys) locally. HA version of Ozone Manager does exactly the same but all the data is replicated with the help of the RAFT consensus algorithm to follower Ozone Manager instances.
+A single Ozone Manager uses [RocksDB](https://github.com/facebook/rocksdb/) to persist metadata (volumes, buckets, keys) locally. The HA version of Ozone Manager does exactly the same, but all the data is replicated with the help of the RAFT consensus algorithm to follower Ozone Manager instances.
 
 ![OM HA](HA-OM.png)
-
 Client connects to the Leader Ozone Manager which process the request and schedule the replication with RAFT. When the request is replicated to all the followers the leader can return with the response.
 
 ## Configuration
@@ -112,4 +112,4 @@
 ## References
 
  * Check [this page]({{< ref "design/omha.md" >}}) for the links to the original design docs
- * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
\ No newline at end of file
+ * Ozone distribution contains an example OM HA configuration, under the `compose/ozone-om-ha` directory which can be tested with the help of [docker-compose]({{< ref "start/RunningViaDocker.md" >}}).
diff --git a/hadoop-hdds/docs/content/feature/HA.zh.md b/hadoop-hdds/docs/content/feature/OM-HA.zh.md
similarity index 100%
rename from hadoop-hdds/docs/content/feature/HA.zh.md
rename to hadoop-hdds/docs/content/feature/OM-HA.zh.md
diff --git a/hadoop-hdds/docs/content/feature/SCM-HA.md b/hadoop-hdds/docs/content/feature/SCM-HA.md
new file mode 100644
index 0000000..4a84aa9
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/SCM-HA.md
@@ -0,0 +1,162 @@
+---
+title: "SCM High Availability"
+weight: 1
+menu:
+   main:
+      parent: Features
+summary: HA setup for Storage Container Manager to avoid any single point of failure.
+---
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+Ozone has two types of metadata-manager nodes (*Ozone Manager* for key space management and *Storage Container Manager* for block space management) and multiple storage nodes (Datanodes). Data is replicated between Datanodes with the help of the RAFT consensus algorithm.
+
+<div class="alert alert-warning" role="alert">
+Please note that SCM-HA is not ready for production in secure environments. Security work is in progress and will be finished soon.
+</div>
+
+To avoid any single point of failure, the metadata-manager nodes should also have an HA setup.
+
+Both Ozone Manager and Storage Container Manager support HA. In this mode the internal state is replicated via RAFT (with Apache Ratis).
+
+This document explains the HA setup of Storage Container Manager (SCM); please check [this page]({{< ref "OM-HA" >}}) for the HA setup of Ozone Manager (OM). While they can be set up for HA independently, a reliable, full HA setup requires enabling HA for both services.
+
+## Configuration
+
+HA mode of Storage Container Manager can be enabled with the following settings in `ozone-site.xml`:
+
+```XML
+<property>
+   <name>ozone.scm.ratis.enable</name>
+   <value>true</value>
+</property>
+```
+One Ozone configuration (`ozone-site.xml`) can support multiple SCM HA node sets and multiple Ozone clusters. To select between the available SCM nodes, a logical name is required for each cluster, which can be resolved to the IP addresses (and domain names) of the Storage Container Managers.
+
+This logical name is called `serviceId` and can be configured in `ozone-site.xml`.
+
+Most of the time you need to set only the values of your current cluster:
+
+```XML
+<property>
+   <name>ozone.scm.service.ids</name>
+   <value>cluster1</value>
+</property>
+```
+
+For each defined `serviceId`, a logical configuration name should be defined for each of the servers:
+
+```XML
+<property>
+   <name>ozone.scm.nodes.cluster1</name>
+   <value>scm1,scm2,scm3</value>
+</property>
+```
+
+The defined prefixes can be used to define the address of each of the SCM services:
+
+```XML
+<property>
+   <name>ozone.scm.address.cluster1.scm1</name>
+   <value>host1</value>
+</property>
+<property>
+   <name>ozone.scm.address.cluster1.scm2</name>
+   <value>host2</value>
+</property>
+<property>
+   <name>ozone.scm.address.cluster1.scm3</name>
+   <value>host3</value>
+</property>
+```
+
+For reliable HA support choose 3 independent nodes to form a quorum. 
+
+## Bootstrap
+
+The initialization of the **first** SCM-HA node is the same as for a non-HA SCM:
+
+```
+bin/ozone scm --init
+```
+
+The second and third nodes should be *bootstrapped* instead of initialized. These nodes will join the configured RAFT quorum. The ID of the current server is identified by DNS name or can be set explicitly with `ozone.scm.node.id`. Most of the time you don't need to set it, as DNS-based ID detection works well.
+
+```
+bin/ozone scm --bootstrap
+```
+
+## Auto-bootstrap
+
+In some environments, such as containerized / K8s environments, we need a common, unified way to initialize the SCM HA quorum. As a reminder, the standard initialization flow is the following:
+
+ 1. On the first, "primordial" node, call `scm --init`
+ 2. On second/third nodes call `scm --bootstrap`
+
+This can be changed by setting `ozone.scm.primordial.node.id`, which defines the primordial node. After setting it, you should execute **both** `scm --init` and `scm --bootstrap` on **all** nodes.
+
+Based on `ozone.scm.primordial.node.id`, the init process will be ignored on the second/third nodes and the bootstrap process will be ignored on the primordial node.
+
+## Implementation details
+
+SCM HA uses Apache Ratis to replicate state between the members of the SCM HA quorum. Each node maintains the block management metadata in a local RocksDB.
+
+This replication process is a simpler version of the OM HA replication process, as it doesn't use any double buffer (the overall DB throughput of SCM requests is lower).
+
+Datanodes send all the reports (Container reports, Pipeline reports...) to *all* the SCM nodes in parallel. Only the leader node can assign/create new containers, and only the leader node sends commands back to the Datanodes.
+
+## Verify SCM HA setup
+
+After starting an SCM-HA cluster, you can validate that the SCM nodes form one single quorum instead of 3 individual SCM nodes.
+
+First, check if all the SCM nodes store the same ClusterId metadata:
+
+```bash
+cat /data/metadata/scm/current/VERSION
+```
+
+ClusterId is included in the VERSION file and should be the same in all the SCM nodes:
+
+```bash
+#Tue Mar 16 10:19:33 UTC 2021
+cTime=1615889973116
+clusterID=CID-130fb246-1717-4313-9b62-9ddfe1bcb2e7
+nodeType=SCM
+scmUuid=e6877ce5-56cd-4f0b-ad60-4c8ef9000882
+layoutVersion=0
+```
+
+You can also create data and double-check with the `ozone debug` tool that all the container metadata is replicated.
+
+```shell
+bin/ozone freon randomkeys --numOfVolumes=1 --numOfBuckets=1 --numOfKeys=10000 --keySize=524288 --replicationType=RATIS --numOfThreads=8 --factor=THREE --bufferSize=1048576
+ 
+ 
+# use debug ldb to check scm db on all the machines
+bin/ozone debug ldb --db=/tmp/metadata/scm.db/ ls
+ 
+ 
+bin/ozone debug ldb --db=/tmp/metadata/scm.db/ scan --with-keys --column_family=containers
+```
+
+## Migrating from existing SCM
+
+SCM HA can be turned on for any Ozone cluster. First enable Ratis (`ozone.scm.ratis.enable`) and configure only one node for the Ratis ring (`ozone.scm.nodes.NAME` should have one element).
+
+Start the cluster and test if it works well.
+
+If everything is fine, you can extend the cluster configuration with multiple nodes, restart the SCM node, and initialize the additional nodes with the `scm --bootstrap` command.
\ No newline at end of file