| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Copyright 2002-2004 The Apache Software Foundation |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" |
| "http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> |
| <article id="bk_Overview"> |
| <title>ZooKeeper</title> |
| |
| <articleinfo> |
| <legalnotice> |
| <para>Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. You may |
| obtain a copy of the License at <ulink |
| url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> |
| |
| <para>Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an "AS IS" |
| BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or |
| implied. See the License for the specific language governing permissions |
| and limitations under the License.</para> |
| </legalnotice> |
| |
| <abstract> |
| <para>This document contains overview information about ZooKeeper. It |
| discusses design goals, key concepts, implementation, and |
| performance.</para> |
| </abstract> |
| </articleinfo> |
| |
| <section id="ch_DesignOverview"> |
| <title>ZooKeeper: A Distributed Coordination Service for Distributed |
| Applications</title> |
| |
<para>ZooKeeper is a distributed, open-source coordination service for
distributed applications. It exposes a simple set of primitives that
distributed applications can build upon to implement higher level services
for synchronization, configuration maintenance, group membership, and
naming. It is designed to be easy to program against, and uses a data model
styled after the familiar directory tree structure of file systems. It runs
in Java and has bindings for both Java and C.</para>
| |
<para>Coordination services are notoriously hard to get right. They are
especially prone to errors such as race conditions and deadlock. The
motivation behind ZooKeeper is to relieve distributed applications of the
responsibility of implementing coordination services from scratch.</para>
| |
| <section id="sc_designGoals"> |
| <title>Design Goals</title> |
| |
<para><emphasis role="bold">ZooKeeper is simple.</emphasis> ZooKeeper
allows distributed processes to coordinate with each other through a
shared hierarchical namespace which is organized similarly to a standard
file system. The name space consists of data registers - called znodes,
in ZooKeeper parlance - and these are similar to files and directories.
Unlike a typical file system, which is designed for storage, ZooKeeper
data is kept in-memory, which means ZooKeeper can achieve high
throughput and low latency.</para>
| |
<para>The ZooKeeper implementation puts a premium on high performance,
highly available, strictly ordered access. The performance aspects of
ZooKeeper mean it can be used in large, distributed systems. The
reliability aspects keep it from being a single point of failure. The
strict ordering means that sophisticated synchronization primitives can
be implemented at the client.</para>
| |
<para><emphasis role="bold">ZooKeeper is replicated.</emphasis> Like the
distributed processes it coordinates, ZooKeeper itself is intended to be
replicated over a set of hosts called an ensemble.</para>
| |
| <figure> |
| <title>ZooKeeper Service</title> |
| |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/zkservice.jpg" /> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
<para>The servers that make up the ZooKeeper service must all know about
each other. They maintain an in-memory image of state, along with
transaction logs and snapshots in a persistent store. As long as a
majority of the servers are available, the ZooKeeper service will be
available.</para>
| |
<para>Clients connect to a single ZooKeeper server. The client maintains
a TCP connection through which it sends requests, gets responses, gets
watch events, and sends heartbeats. If the TCP connection to the server
breaks, the client will connect to a different server.</para>
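<para>As a rough illustration (not drawn from the sections above), a client
using the Java binding passes the whole ensemble in its connection string;
the host names, port, and session timeout below are hypothetical, and the
client library moves to another listed server if the current connection
breaks:</para>

<programlisting><![CDATA[
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        // List every server in the ensemble; the client connects to one of
        // them and fails over to another if its TCP connection breaks.
        ZooKeeper zk = new ZooKeeper(
            "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
            3000,                          // session timeout in milliseconds
            new Watcher() {                // receives connection state changes
                public void process(WatchedEvent event) {
                    System.out.println("Connection event: " + event.getState());
                }
            });
        // ... use the handle ...
        zk.close();
    }
}
]]></programlisting>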
| |
| <para><emphasis role="bold">ZooKeeper is ordered.</emphasis> ZooKeeper |
| stamps each update with a number that reflects the order of all |
| ZooKeeper transactions. Subsequent operations can use the order to |
| implement higher-level abstractions, such as synchronization |
| primitives.</para> |
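<para>For instance (a minimal sketch, with a hypothetical path), this stamp -
the zxid, in ZooKeeper parlance - is exposed through a znode's stat structure
in the Java binding, and comparing the stamps of two updates tells you which
happened first:</para>

<programlisting><![CDATA[
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class OrderingExample {
    // Return the transaction id (zxid) of the update that last modified
    // the given znode; larger zxids correspond to later updates.
    static long lastModifiedZxid(ZooKeeper zk, String path) throws Exception {
        Stat stat = zk.exists(path, false);   // assumes the znode exists
        return stat.getMzxid();
    }
}
]]></programlisting>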
| |
| <para><emphasis role="bold">ZooKeeper is fast.</emphasis> It is |
| especially fast in "read-dominant" workloads. ZooKeeper applications run |
| on thousands of machines, and it performs best where reads are more |
| common than writes, at ratios of around 10:1.</para> |
| </section> |
| |
| <section id="sc_dataModelNameSpace"> |
| <title>Data model and the hierarchical namespace</title> |
| |
| <para>The name space provided by ZooKeeper is much like that of a |
| standard file system. A name is a sequence of path elements separated by |
| a slash (/). Every node in ZooKeeper's name space is identified by a |
| path.</para> |
| |
| <figure> |
| <title>ZooKeeper's Hierarchical Namespace</title> |
| |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/zknamespace.jpg" /> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| </section> |
| |
| <section> |
| <title>Nodes and ephemeral nodes</title> |
| |
<para>Unlike standard file systems, each node in a ZooKeeper
namespace can have data associated with it as well as children. It is
like having a file system that allows a file to also be a directory.
(ZooKeeper was designed to store coordination data: status information,
configuration, location information, etc., so the data stored at each
node is usually small, in the byte to kilobyte range.) We use the term
<emphasis>znode</emphasis> to make it clear that we are talking about
ZooKeeper data nodes.</para>
| |
| <para>Znodes maintain a stat structure that includes version numbers for |
| data changes, ACL changes, and timestamps, to allow cache validations |
| and coordinated updates. Each time a znode's data changes, the version |
| number increases. For instance, whenever a client retrieves data it also |
| receives the version of the data.</para> |
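<para>As a sketch of how the version can support a coordinated update (the
path and helper name below are hypothetical), a client using the Java binding
can write back conditionally on the version it read, so the write fails if
another client changed the znode in the meantime:</para>

<programlisting><![CDATA[
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedUpdate {
    // Overwrite a znode only if nobody has changed it since we read it.
    static boolean compareAndSet(ZooKeeper zk, String path, byte[] newData)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(path, false, stat);            // fills in the current version
        try {
            zk.setData(path, newData, stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                         // lost the race; caller may re-read and retry
        }
    }
}
]]></programlisting>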
| |
| <para>The data stored at each znode in a namespace is read and written |
| atomically. Reads get all the data bytes associated with a znode and a |
| write replaces all the data. Each node has an Access Control List (ACL) |
| that restricts who can do what.</para> |
| |
<para>ZooKeeper also has the notion of ephemeral nodes. These znodes
exist as long as the session that created the znode is active. When the
session ends, the znode is deleted. Ephemeral nodes are useful when you
want to implement <emphasis>[tbd]</emphasis>.</para>
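<para>As one hedged illustration (the parent path and member name are made up
for this sketch), a process can advertise that it is alive by creating an
ephemeral znode under a well-known parent; when the process's session ends,
the znode disappears without any explicit cleanup:</para>

<programlisting><![CDATA[
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralExample {
    // Register this process under a hypothetical "/app1/members" parent.
    // The znode lives only as long as the session that created it.
    static String register(ZooKeeper zk, String memberName, byte[] info)
            throws Exception {
        return zk.create("/app1/members/" + memberName, info,
                         ZooDefs.Ids.OPEN_ACL_UNSAFE,
                         CreateMode.EPHEMERAL);
    }
}
]]></programlisting>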
| </section> |
| |
| <section> |
| <title>Conditional updates and watches</title> |
| |
<para>ZooKeeper supports the concept of <emphasis>watches</emphasis>.
Clients can set a watch on a znode. A watch will be triggered and
removed when the znode changes. When a watch is triggered, the client
receives a packet saying that the znode has changed. If the
connection between the client and one of the ZooKeeper servers is
broken, the client will receive a local notification. These can be used
to <emphasis>[tbd]</emphasis>.</para>
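<para>A minimal sketch of setting a watch with the Java binding might look
like the following (the path is hypothetical); the watch fires once and must
be re-registered if the client wants further notifications:</para>

<programlisting><![CDATA[
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchExample {
    // Ask to be told when a hypothetical configuration znode is created,
    // deleted, or has its data changed.
    static void watchConfig(final ZooKeeper zk) throws Exception {
        zk.exists("/app1/config", new Watcher() {
            public void process(WatchedEvent event) {
                // One-shot notification; set a new watch here to keep listening.
                System.out.println(event.getPath() + " changed: " + event.getType());
            }
        });
    }
}
]]></programlisting>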
| </section> |
| |
| <section> |
| <title>Guarantees</title> |
| |
| <para>ZooKeeper is very fast and very simple. Since its goal, though, is |
| to be a basis for the construction of more complicated services, such as |
| synchronization, it provides a set of guarantees. These are:</para> |
| |
| <itemizedlist> |
| <listitem> |
| <para>Sequential Consistency - Updates from a client will be applied |
| in the order that they were sent.</para> |
| </listitem> |
| |
| <listitem> |
| <para>Atomicity - Updates either succeed or fail. No partial |
| results.</para> |
| </listitem> |
| |
| <listitem> |
| <para>Single System Image - A client will see the same view of the |
| service regardless of the server that it connects to.</para> |
| </listitem> |

      <listitem>
        <para>Reliability - Once an update has been applied, it will persist
        from that time forward until a client overwrites the update.</para>
      </listitem>

      <listitem>
<para>Timeliness - The client's view of the system is guaranteed to
be up-to-date within a certain time bound.</para>
| </listitem> |
| </itemizedlist> |
| |
<para>For more information on these, and how they can be used, see
<emphasis>[tbd]</emphasis>.</para>
| </section> |
| |
| <section> |
| <title>Simple API</title> |
| |
<para>One of the design goals of ZooKeeper is to provide a very simple
programming interface. As a result, it supports only these
operations:</para>
| |
| <variablelist> |
| <varlistentry> |
| <term>create</term> |
| |
| <listitem> |
| <para>creates a node at a location in the tree</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>delete</term> |
| |
| <listitem> |
| <para>deletes a node</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>exists</term> |
| |
| <listitem> |
| <para>tests if a node exists at a location</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>get data</term> |
| |
| <listitem> |
| <para>reads the data from a node</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>set data</term> |
| |
| <listitem> |
| <para>writes data to a node</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>get children</term> |
| |
| <listitem> |
| <para>retrieves a list of children of a node</para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term>sync</term> |
| |
| <listitem> |
| <para>waits for data to be propagated</para> |
| </listitem> |
| </varlistentry> |
| </variablelist> |
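<para>To make the shape of these calls concrete, the following Java sketch
(not taken from this document; the path, data, and port are hypothetical, and
error handling is omitted) exercises each operation in turn:</para>

<programlisting><![CDATA[
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ApiTour {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) { }   // ignore session events
        });

        // create: make a node at a location in the tree
        zk.create("/demo", "hello".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: test whether a node is present (also returns its stat)
        Stat stat = zk.exists("/demo", false);

        // get data: read all the bytes stored at the node
        byte[] data = zk.getData("/demo", false, stat);

        // set data: replace all the bytes stored at the node
        zk.setData("/demo", "world".getBytes(), stat.getVersion());

        // get children: list the names of the node's children
        List<String> children = zk.getChildren("/", false);

        // delete: remove the node (-1 means do not check the version)
        zk.delete("/demo", -1);

        // sync is an asynchronous call that waits for the client's view
        // to catch up; it is omitted here to keep the sketch synchronous.

        zk.close();
    }
}
]]></programlisting>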
| |
<para>For a more in-depth discussion on these, and how they can be used
to implement higher level operations, please refer to
<emphasis>[tbd]</emphasis>.</para>
| </section> |
| |
| <section> |
| <title>Implementation</title> |
| |
<para><xref linkend="fg_zkComponents" /> shows the high-level components
of the ZooKeeper service. With the exception of the request processor,
each of the servers that make up the ZooKeeper service replicates its own
copy of each of the components.</para>
| |
| <figure id="fg_zkComponents"> |
| <title>ZooKeeper Components</title> |
| |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/zkcomponents.jpg" /> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
| <para>The replicated database is an in-memory database containing the |
| entire data tree. Updates are logged to disk for recoverability, and |
| writes are serialized to disk before they are applied to the in-memory |
| database.</para> |
| |
<para>Every ZooKeeper server services clients. Clients connect to
exactly one server to submit requests. Read requests are serviced from
the local replica of each server's database. Requests that change the
state of the service, write requests, are processed by an agreement
protocol.</para>
| |
| <para>As part of the agreement protocol all write requests from clients |
| are forwarded to a single server, called the |
| <emphasis>leader</emphasis>. The rest of the ZooKeeper servers, called |
| <emphasis>followers</emphasis>, receive message proposals from the |
| leader and agree upon message delivery. The messaging layer takes care |
| of replacing leaders on failures and syncing followers with |
| leaders.</para> |
| |
<para>ZooKeeper uses a custom atomic messaging protocol. Since the
messaging layer is atomic, ZooKeeper can guarantee that the local
replicas never diverge. When the leader receives a write request, it
calculates what the state of the system will be when the write is
applied and transforms this into a transaction that captures this new
state.</para>
| </section> |
| |
| <section> |
| <title>Uses</title> |
| |
<para>The programming interface to ZooKeeper is deliberately simple.
With it, however, you can implement higher level operations, such as
synchronization primitives, group membership, ownership, etc. Some
distributed applications have used it to: <emphasis>[tbd: add uses from
white paper and video presentation.]</emphasis> For more information, see
<emphasis>[tbd]</emphasis>.</para>
| </section> |
| |
| <section> |
| <title>Performance</title> |
| |
<para>ZooKeeper is designed to be highly performant. But is it? The
results from ZooKeeper's development team at Yahoo! Research indicate
that it is. (See <xref linkend="fg_zkPerfRW" />.) It performs especially
well in applications where reads outnumber writes, since writes
involve synchronizing the state of all servers. (Reads outnumbering
writes is typically the case for a coordination service.)</para>
| |
| <figure id="fg_zkPerfRW"> |
| <title>ZooKeeper Throughput as the Read-Write Ratio Varies</title> |
| |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/zkperfRW-3.2.jpg" /> |
| </imageobject> |
| </mediaobject> |
| </figure> |
<para><xref linkend="fg_zkPerfRW"/> is a throughput
graph of ZooKeeper release 3.2 running on servers with dual 2GHz
Xeon processors and two SATA 15K RPM drives. One drive was used as a
dedicated ZooKeeper log device. The snapshots were written to
the OS drive. Write requests were 1K writes and the reads were
1K reads. "Servers" indicates the size of the ZooKeeper
ensemble, the number of servers that make up the
service. Approximately 30 other servers were used to simulate
the clients. The ZooKeeper ensemble was configured such that
leaders do not allow connections from clients.</para>
| |
<note><para>In version 3.2 read/write performance improved by roughly 2x
compared to the <ulink
url="http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperOver.html#Performance">previous
3.1 release</ulink>.</para></note>
| |
<para>Benchmarks also indicate that it is reliable. <xref
linkend="fg_zkPerfReliability" /> shows how a deployment responds to
various failures. The events marked in the figure are the
following:</para>
| |
| <orderedlist> |
| <listitem> |
| <para>Failure and recovery of a follower</para> |
| </listitem> |
| |
| <listitem> |
| <para>Failure and recovery of a different follower</para> |
| </listitem> |
| |
| <listitem> |
| <para>Failure of the leader</para> |
| </listitem> |
| |
| <listitem> |
| <para>Failure and recovery of two followers</para> |
| </listitem> |
| |
| <listitem> |
| <para>Failure of another leader</para> |
| </listitem> |
| </orderedlist> |
| </section> |
| |
| <section> |
| <title>Reliability</title> |
| |
| <para>To show the behavior of the system over time as |
| failures are injected we ran a ZooKeeper service made up of |
| 7 machines. We ran the same saturation benchmark as before, |
| but this time we kept the write percentage at a constant |
| 30%, which is a conservative ratio of our expected |
| workloads. |
| </para> |
| <figure id="fg_zkPerfReliability"> |
| <title>Reliability in the Presence of Errors</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/zkperfreliability.jpg" /> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
<para>There are a few important observations from this graph. First, if
followers fail and recover quickly, then ZooKeeper is able to sustain
high throughput despite the failure. Second, and maybe more importantly,
the leader election algorithm allows the system to recover fast enough
to prevent throughput from dropping substantially. In our observations,
ZooKeeper takes less than 200ms to elect a new leader. Third, as
followers recover, ZooKeeper is able to raise throughput again once they
start processing requests.</para>
| </section> |
| |
| <section> |
| <title>The ZooKeeper Project</title> |
| |
| <para>ZooKeeper has been |
| <ulink url="http://wiki.apache.org/hadoop/ZooKeeper/PoweredBy"> |
| successfully used |
| </ulink> |
| in many industrial applications. It is used at Yahoo! as the |
| coordination and failure recovery service for Yahoo! Message |
| Broker, which is a highly scalable publish-subscribe system |
| managing thousands of topics for replication and data |
delivery. It is used by the Fetching Service for the Yahoo!
crawler, where it also manages failure recovery. A number of
| Yahoo! advertising systems also use ZooKeeper to implement |
| reliable services. |
| </para> |
| |
| <para>All users and developers are encouraged to join the |
| community and contribute their expertise. See the |
| <ulink url="http://hadoop.apache.org/zookeeper/"> |
ZooKeeper Project on Apache
| </ulink> |
| for more information. |
| </para> |
| </section> |
| </section> |
| </article> |