| --- |
| layout: default |
| |
| # Sub-level navigation |
| sub-nav-group: user-guide |
| sub-nav-parent: implementation |
| sub-nav-pos: 1 |
| sub-nav-title: Storage |
| |
| --- |
| |
| .. contents:: Storage |
| |
| Storage |
| ======= |
| |
| This describes some implementation details of storage layer. |
| |
| Ensemble Placement Policy |
| ------------------------- |
| |
| `EnsemblePlacementPolicy` encapsulates the algorithm that bookkeeper client uses to select a number of bookies from the |
| cluster as an ensemble for storing data. The algorithm is typically based on the data input as well as the network |
| topology properties. |
| |
| By default, BookKeeper offers a `RackawareEnsemblePlacementPolicy` for placing the data across racks within a |
| datacenter, and a `RegionAwareEnsemblePlacementPolicy` for placing the data across multiple datacenters. |
| |
| How does EnsemblePlacementPolicy work? |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The interface of `EnsemblePlacementPolicy` is described as below. |
| |
| :: |
| |
| public interface EnsemblePlacementPolicy { |
| |
| /** |
| * Initialize the policy. |
| * |
| * @param conf client configuration |
| * @param optionalDnsResolver dns resolver |
| * @param hashedWheelTimer timer |
| * @param featureProvider feature provider |
| * @param statsLogger stats logger |
| * @param alertStatsLogger stats logger for alerts |
| */ |
| public EnsemblePlacementPolicy initialize(ClientConfiguration conf, |
| Optional<DNSToSwitchMapping> optionalDnsResolver, |
| HashedWheelTimer hashedWheelTimer, |
| FeatureProvider featureProvider, |
| StatsLogger statsLogger, |
| AlertStatsLogger alertStatsLogger); |
| |
| /** |
| * Uninitialize the policy |
| */ |
| public void uninitalize(); |
| |
| /** |
| * A consistent view of the cluster (what bookies are available as writable, what bookies are available as |
| * readonly) is updated when any changes happen in the cluster. |
| * |
| * @param writableBookies |
| * All the bookies in the cluster available for write/read. |
| * @param readOnlyBookies |
| * All the bookies in the cluster available for readonly. |
| * @return the dead bookies during this cluster change. |
| */ |
| public Set<BookieSocketAddress> onClusterChanged(Set<BookieSocketAddress> writableBookies, |
| Set<BookieSocketAddress> readOnlyBookies); |
| |
| /** |
| * Choose <i>numBookies</i> bookies for ensemble. If the count is more than the number of available |
| * nodes, {@link BKNotEnoughBookiesException} is thrown. |
| * |
| * @param ensembleSize |
| * Ensemble Size |
| * @param writeQuorumSize |
| * Write Quorum Size |
| * @param excludeBookies |
| * Bookies that should not be considered as targets. |
| * @return list of bookies chosen as targets. |
| * @throws BKNotEnoughBookiesException if not enough bookies available. |
| */ |
| public ArrayList<BookieSocketAddress> newEnsemble(int ensembleSize, int writeQuorumSize, int ackQuorumSize, |
| Set<BookieSocketAddress> excludeBookies) throws BKNotEnoughBookiesException; |
| |
| /** |
| * Choose a new bookie to replace <i>bookieToReplace</i>. If no bookie available in the cluster, |
| * {@link BKNotEnoughBookiesException} is thrown. |
| * |
| * @param bookieToReplace |
| * bookie to replace |
| * @param excludeBookies |
| * bookies that should not be considered as candidate. |
| * @return the bookie chosen as target. |
| * @throws BKNotEnoughBookiesException |
| */ |
| public BookieSocketAddress replaceBookie(int ensembleSize, int writeQuorumSize, int ackQuorumSize, |
| Collection<BookieSocketAddress> currentEnsemble, BookieSocketAddress bookieToReplace, |
| Set<BookieSocketAddress> excludeBookies) throws BKNotEnoughBookiesException; |
| |
| /** |
| * Reorder the read sequence of a given write quorum <i>writeSet</i>. |
| * |
| * @param ensemble |
| * Ensemble to read entries. |
| * @param writeSet |
| * Write quorum to read entries. |
| * @param bookieFailureHistory |
| * Observed failures on the bookies |
| * @return read sequence of bookies |
| */ |
| public List<Integer> reorderReadSequence(ArrayList<BookieSocketAddress> ensemble, |
| List<Integer> writeSet, Map<BookieSocketAddress, Long> bookieFailureHistory); |
| |
| |
| /** |
| * Reorder the read last add confirmed sequence of a given write quorum <i>writeSet</i>. |
| * |
| * @param ensemble |
| * Ensemble to read entries. |
| * @param writeSet |
| * Write quorum to read entries. |
| * @param bookieFailureHistory |
| * Observed failures on the bookies |
| * @return read sequence of bookies |
| */ |
| public List<Integer> reorderReadLACSequence(ArrayList<BookieSocketAddress> ensemble, |
| List<Integer> writeSet, Map<BookieSocketAddress, Long> bookieFailureHistory); |
| } |
| |
| The methods in this interface covers three parts - 1) initialization and uninitialization; 2) how to choose bookies to |
| place data; and 3) how to choose bookies to do speculative reads. |
| |
| Initialization and uninitialization |
| ___________________________________ |
| |
| The ensemble placement policy is constructed by jvm reflection during constructing bookkeeper client. After the |
| `EnsemblePlacementPolicy` is constructed, bookkeeper client will call `#initialize` to initialize the placement policy. |
| |
| The `#initialize` method takes a few resources from bookkeeper for instantiating itself. These resources include: |
| |
| 1. `ClientConfiguration` : The client configuration that used for constructing the bookkeeper client. The implementation of the placement policy could obtain its settings from this configuration. |
| 2. `DNSToSwitchMapping`: The DNS resolver for the ensemble policy to build the network topology of the bookies cluster. It is optional. |
| 3. `HashedWheelTimer`: A hashed wheel timer that could be used for timing related work. For example, a stabilize network topology could use it to delay network topology changes to reduce impacts of flapping bookie registrations due to zk session expires. |
| 4. `FeatureProvider`: A feature provider that the policy could use for enabling or disabling its offered features. For example, a region-aware placement policy could offer features to disable placing data to a specific region at runtime. |
| 5. `StatsLogger`: A stats logger for exposing stats. |
| 6. `AlertStatsLogger`: An alert stats logger for exposing critical stats that needs to be alerted. |
| |
| The ensemble placement policy is a single instance per bookkeeper client. The instance will be `#uninitialize` when |
| closing the bookkeeper client. The implementation of a placement policy should be responsible for releasing all the |
| resources that allocated during `#initialize`. |
| |
| How to choose bookies to place |
| ______________________________ |
| |
| The bookkeeper client discovers list of bookies from zookeeper via `BookieWatcher` - whenever there are bookie changes, |
| the ensemble placement policy will be notified with new list of bookies via `onClusterChanged(writableBookie, readOnlyBookies)`. |
| The implementation of the ensemble placement policy will react on those changes to build new network topology. Subsequent |
| operations like `newEnsemble` or `replaceBookie` hence can operate on the new network topology. |
| |
| newEnsemble(ensembleSize, writeQuorumSize, ackQuorumSize, excludeBookies) |
| Choose `ensembleSize` bookies for ensemble. If the count is more than the number of available nodes, |
| `BKNotEnoughBookiesException` is thrown. |
| |
| replaceBookie(ensembleSize, writeQuorumSize, ackQuorumSize, currentEnsemble, bookieToReplace, excludeBookies) |
| Choose a new bookie to replace `bookieToReplace`. If no bookie available in the cluster, |
| `BKNotEnoughBookiesException` is thrown. |
| |
| |
| Both `RackAware` and `RegionAware` placement policies are `TopologyAware` policies. They build a `NetworkTopology` on |
| responding bookie changes, use it for ensemble placement and ensure rack/region coverage for write quorums - a write |
| quorum should be covered by at least two racks or regions. |
| |
| Network Topology |
| ^^^^^^^^^^^^^^^^ |
| |
| The network topology is presenting a cluster of bookies in a tree hierarchical structure. For example, a bookie cluster |
| may be consists of many data centers (aka regions) filled with racks of machines. In this tree structure, leaves |
| represent bookies and inner nodes represent switches/routes that manage traffic in/out of regions or racks. |
| |
| For example, there are 3 bookies in region `A`. They are `bk1`, `bk2` and `bk3`. And their network locations are |
| `/region-a/rack-1/bk1`, `/region-a/rack-1/bk2` and `/region-a/rack-2/bk3`. So the network topology will look like below: |
| |
| :: |
| |
| root |
| | |
| region-a |
| / \ |
| rack-1 rack-2 |
| / \ \ |
| bk1 bk2 bk3 |
| |
| Another example, there are 4 bookies spanning in two regions `A` and `B`. They are `bk1`, `bk2`, `bk3` and `bk4`. And |
| their network locations are `/region-a/rack-1/bk1`, `/region-a/rack-1/bk2`, `/region-b/rack-2/bk3` and `/region-b/rack-2/bk4`. |
| The network topology will look like below: |
| |
| :: |
| |
| root |
| / \ |
| region-a region-b |
| | | |
| rack-1 rack-2 |
| / \ / \ |
| bk1 bk2 bk3 bk4 |
| |
| The network location of each bookie is resolved by a `DNSResolver` (interface is described as below). The `DNSResolver` |
| resolves a list of DNS-names or IP-addresses into a list of network locations. The network location that is returned |
| must be a network path of the form `/region/rack`, where `/` is the root, and `region` is the region id representing |
| the data center where `rack` is located. The network topology of the bookie cluster would determine the number of |
| components in the network path. |
| |
| :: |
| |
| /** |
| * An interface that must be implemented to allow pluggable |
| * DNS-name/IP-address to RackID resolvers. |
| * |
| */ |
| @Beta |
| public interface DNSToSwitchMapping { |
| /** |
| * Resolves a list of DNS-names/IP-addresses and returns back a list of |
| * switch information (network paths). One-to-one correspondence must be |
| * maintained between the elements in the lists. |
| * Consider an element in the argument list - x.y.com. The switch information |
| * that is returned must be a network path of the form /foo/rack, |
| * where / is the root, and 'foo' is the switch where 'rack' is connected. |
| * Note the hostname/ip-address is not part of the returned path. |
| * The network topology of the cluster would determine the number of |
| * components in the network path. |
| * <p/> |
| * |
| * If a name cannot be resolved to a rack, the implementation |
| * should return {@link NetworkTopology#DEFAULT_RACK}. This |
| * is what the bundled implementations do, though it is not a formal requirement |
| * |
| * @param names the list of hosts to resolve (can be empty) |
| * @return list of resolved network paths. |
| * If <i>names</i> is empty, the returned list is also empty |
| */ |
| public List<String> resolve(List<String> names); |
| |
| /** |
| * Reload all of the cached mappings. |
| * |
| * If there is a cache, this method will clear it, so that future accesses |
| * will get a chance to see the new data. |
| */ |
| public void reloadCachedMappings(); |
| } |
| |
| By default, the network topology responds to bookie changes immediately. That means if a bookie's znode appears in or |
| disappears from zookeeper, the network topology will add the bookie or remove the bookie immediately. It introduces |
| instability when bookie's zookeeper registration becomes flapping. In order to address this, there is a `StabilizeNetworkTopology` |
| which delays removing bookies from network topology if they disappear from zookeeper. It could be enabled by setting |
| the following option. |
| |
| :: |
| |
| # enable stabilize network topology by setting it to a positive value. |
| bkc.networkTopologyStabilizePeriodSeconds=10 |
| |
| |
| RackAware and RegionAware |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| `RackAware` placement policy basically just chooses bookies from different racks in the built network topology. It |
| guarantees that a write quorum will cover at least two racks. |
| |
| `RegionAware` placement policy is a hierarchical placement policy, which it chooses equal-sized bookies from regions, and |
| within each region it uses `RackAware` placement policy to choose bookies from racks. For example, if there is 3 regions - |
| `region-a`, `region-b` and `region-c`, an application want to allocate a 15-bookies ensemble. First, it would figure |
| out there are 3 regions and it should allocate 5 bookies from each region. Second, for each region, it would use |
| `RackAware` placement policy to choose 5 bookies. |
| |
| How to choose bookies to do speculative reads? |
| ______________________________________________ |
| |
| `reorderReadSequence` and `reorderReadLACSequence` are two methods exposed by the placement policy, to help client |
| determine a better read sequence according to the network topology and the bookie failure history. |
| |
| In `RackAware` placement policy, the reads will be tried in following sequence: |
| |
| - bookies are writable and didn't experience failures before |
| - bookies are writable and experienced failures before |
| - bookies are readonly |
| - bookies already disappeared from network topology |
| |
| In `RegionAware` placement policy, the reads will be tried in similar following sequence as `RackAware` placement policy. |
| There is a slight different on trying writable bookies: after trying every 2 bookies from local region, it would try |
| a bookie from remote region. Hence it would achieve low latency even there is network issues within local region. |
| |
| How to enable different EnsemblePlacementPolicy? |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Users could configure using different ensemble placement policies by setting following options in distributedlog |
| configuration files. |
| |
| :: |
| |
| # enable rack-aware ensemble placement policy |
| bkc.ensemblePlacementPolicy=org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy |
| # enable region-aware ensemble placement policy |
| bkc.ensemblePlacementPolicy=org.apache.bookkeeper.client.RegionAwareEnsemblePlacementPolicy |
| |
| The network topology of bookies built by either `RackawareEnsemblePlacementPolicy` or `RegionAwareEnsemblePlacementPolicy` |
| is done via a `DNSResolver`. The default `DNSResolver` is a script based DNS resolver. It reads the configuration |
| parameters, executes any defined script, handles errors and resolves domain names to network locations. The script |
| is configured via following settings in distributedlog configuration. |
| |
| :: |
| |
| bkc.networkTopologyScriptFileName=/path/to/dns/resolver/script |
| |
| Alternatively, the `DNSResolver` could be configured in following settings and loaded via reflection. `DNSResolverForRacks` |
| is a good example to check out for customizing your dns resolver based our network environments. |
| |
| :: |
| |
| bkEnsemblePlacementDnsResolverClass=org.apache.distributedlog.net.DNSResolverForRacks |
| |