| --- |
| layout: default |
| |
| # Top navigation |
| top-nav-group: user-guide |
| top-nav-pos: 7 |
| top-nav-title: Global Replicated Log |
| |
| # Sub-level navigation |
| sub-nav-group: user-guide |
| sub-nav-parent: user-guide |
| sub-nav-id: global-replicated-log |
| sub-nav-pos: 7 |
| sub-nav-title: Global Replicated Log |
| --- |
| |
| .. contents:: Global Replicated Log |
| |
| Global Replicated Log |
| ===================== |
| |
| A typical setup for DistributedLog is within a datacenter. But a global setup is required for |
| providing global replicated logs for distributed key/value store to achieve strong consistency |
| across multiple datacenters. `Global` here means across datacenters, which is different from |
| `Local` meaning within a datacenter. |
| |
| A global setup of DistributedLog is organized as a set of `regions`, where each region is the |
| rough analog of a local setup. Regions are the unit of administrative deployment. The set of |
| regions is also the set of locations across which data can be replicated. Regions can be added |
| to or removed from a running system as new datacenters are brought into service and old ones |
| are turned off, respectively. Regions are also the unit of physical isolation: there may be one |
| or more regions in a datacenter if they have isolated power or network supplies. |
| |
| .. figure:: ../../images/globalreplicatedlog.png |
| :align: center |
| :width: 600px |
| |
| Figure 1. Global Replicated Log |
| |
| Figure 1 illustrates the servers in a `Global Replicated Log` setup. There is no inter datacenter |
| communication between write proxies or log segment storage nodes. The only component that does |
| inter datacenters communications within its hosts is the 'Global' metadata store, which is a global |
| setup of ZooKeeper. Write clients will talk to the write proxies in its local region to bootstrap |
| the ownership cache and redirect to correct write proxies in other regions through direct TCP |
| connections. While readers will identify the regions of the log segment storage nodes according to |
| the `region aware` placement policy, and try reading from local region at most of the time and |
| speculatively try on remote regions. |
| |
| Region Aware Data Placement Policy |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Region aware placement policy uses hierarchical allocation where-in nodes are allocated so that data |
| is spread uniformly across the available regions and within each region it uses the `rack-aware` |
| placement policy to spread the data uniformly across the available racks. |
| |
| Region aware placement policy is governed by a parameter ensures that the ack quorum covers at least |
| *minRegionsForDurability* distinct regions. This ensures that the system can survive the failure of |
| `(totalRegions - minRegionsForDurability)` regions without loss of availability. For example if we |
| have bookie nodes in *5* regions and if the *minRegionsForDurability* is *3* then we can survive the |
| failure of `(5 - 3) = 2` regions. |
| |
| The placement algorithm follows the following simple invariant: |
| |
| :: |
| |
| There is no combination of nodes that would satisfy the ack quorum with |
| less than "minRegionsForDurability" responses. |
| |
| |
| This invariant ensures that enforcing ack quorum is sufficient to enforce that the entry has been made durable |
| in *minRegionsForDurability* regions. |
| |
| The *minRegionsForDurability* requirement enforces constraints on the minimum ack quorum as we want to ensure |
| that when we run in degraded mode - *i.e. when only a subset of the regions are available* - we would still not |
| be able to allocate nodes in such a way that the ack quorum would be satisfied by fewer than *minRegionsForDurability* |
| regions. |
| |
| For instance consider the following scenario with three regions each containing 20 bookie nodes: |
| |
| :: |
| |
| minRegionsForDurability = 2 |
| ensemble size = write quorum = 15 |
| ack quorum = 8 |
| |
| |
| Let's say that one of the regions is currently unavailable and we want to still ensure that writes can continue. |
| The ensemble placement may then have to choose bookies from the two available regions. Given that *15* bookies have |
| to be allocated, we will have to allocate at least *8* bookies from one of the remaining regions - but with ack quorum |
| of *8* we run the risk of satisfying ack quorum with bookies from a single region. Therefore we must require that |
| the ack quorum is greater than *8*. |
| |
| Cross Region Speculative Reads |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| As discussed before, read requests can be satisfied by any replica of the data, however for high availability |
| speculative requests are sent to multiple copies to ensure that at least one of the requests returns within |
| the time specified by the *SLA*. The reader consults the data placement policy to get the list of replicas that |
| can satisfy the request in the order of preference. This order is decided as follows: |
| |
| * The first node in the list is always the bookie node that is closest to the client - if more than one such nodes exist, one is chosen at random. |
| * The second node is usually the closest node in a different failure domain. In the case of a two level hierarchy that would be a node in a different rack. |
| * The third node is chosen from a different region |
| |
| The delay between successive speculative read requests ensures that the probability of sending the *nth* |
| speculative read request decays exponentially with *n*. This ensures that the number of requests that go to |
| farther nodes is still kept at a minimum. However by making sure that we cross failure domains in the first |
| few speculative requests improves fault-tolerance of the reader. Transient node failures are transparently |
| handled by the reader by this simple and generalizable speculative read policy. This can be thought of as |
| the most granular form of failover where each request essentially fails-over to an alternate node if the |
| primary node it attempted to access is unavailable. In practice we have found this to also better handle |
| network congestion where routes between specific pairs of nodes may become unavailable without necessarily |
| making the nodes completely inaccessible. |
| |
| In addition to static decisions based on the location of the bookie nodes, we can also make dynamic decisions |
| based on observed latency or failure rates from specific bookies. These statistics are tracked by the bookie |
| client and are used to influence the order in which speculative read requests are scheduled. This again is |
| able to capture partial network outages that affect specific routes within the network. |