| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <html> |
| <body> |
| <h1>ZooKeeper server theory of operation</h1> |
| ZooKeeperServer is designed to work in standalone mode and also to be
| extensible so that it can be used to implement the quorum-based
| version of ZooKeeper.
| <p> |
| ZooKeeper maintains ordering when processing requests:
| <ul> |
| <li>All requests will be processed in order. |
| <li>All responses will return in order. |
| <li>All watches will be sent in the order that the update takes place. |
| </ul> |
| <p> |
| We will explain the three aspects of ZooKeeperServer: request processing, data |
| structure maintenance, and session tracking. |
| |
| <h2>Request processing</h2> |
| |
| Requests are received by the ServerCnxn. Demarshalling of a request is |
| done by ClientRequestHandler. After a request has been demarshalled, |
| ClientRequestHandler invokes the relevant method in ZooKeeper and marshals |
| the result. |
| <p> |
| If the request is just a query, it will be processed by ZooKeeper and returned.
| Otherwise, the request will be validated and a transaction will be generated
| and logged. The request will then wait until the transaction has been logged
| before processing continues.
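| <p>
| The following is a hypothetical sketch of that split, not the actual server code;
| the names readNode, validateAndLog, and applyTxn are made up for illustration. A
| query is answered straight from the in-memory tree, while an update generates a
| transaction and blocks until that transaction has been logged.
| <pre>
| import java.util.concurrent.CountDownLatch;
| 
| class RequestPathSketch {
|     byte[] process(String path, byte[] newData) throws InterruptedException {
|         if (newData == null) {
|             return readNode(path);                        // query: no logging needed
|         }
|         CountDownLatch logged = new CountDownLatch(1);
|         validateAndLog(path, newData, logged::countDown); // callback fires once synced to disk
|         logged.await();                                   // wait for durability before replying
|         return applyTxn(path, newData);                   // apply to the in-memory tree, build reply
|     }
| 
|     byte[] readNode(String path) { return new byte[0]; }
|     void validateAndLog(String path, byte[] data, Runnable onLogged) { onLogged.run(); }
|     byte[] applyTxn(String path, byte[] data) { return data; }
| }
| </pre>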
| <p> |
| Requests are logged as a group. Transactions are queued up, and the SyncThread
| processes them at a predefined interval (currently 20ms). The SyncThread
| interacts with ZooKeeperServer through the txnQueue: transactions are added to
| the SyncThread's txnQueue via queueItem. When a transaction has been synced to
| disk, its callback is invoked, which causes request processing to complete.
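| <p>
| A minimal group-commit sketch follows, assuming a queue of (transaction, callback)
| pairs; the class and method bodies are illustrative, not the real SyncThread
| implementation. One thread drains whatever has been queued, flushes the whole
| batch with a single disk sync, and then runs each callback so the corresponding
| request can complete.
| <pre>
| import java.util.ArrayList;
| import java.util.List;
| import java.util.concurrent.LinkedBlockingQueue;
| import java.util.concurrent.TimeUnit;
| 
| class SyncThreadSketch extends Thread {
|     static final class QueuedTxn {
|         final byte[] data;
|         final Runnable onSynced;
|         QueuedTxn(byte[] data, Runnable onSynced) { this.data = data; this.onSynced = onSynced; }
|     }
| 
|     private final LinkedBlockingQueue&lt;QueuedTxn> txnQueue = new LinkedBlockingQueue&lt;>();
| 
|     void queueItem(byte[] txnData, Runnable onSynced) {
|         txnQueue.add(new QueuedTxn(txnData, onSynced));
|     }
| 
|     @Override
|     public void run() {
|         List&lt;QueuedTxn> batch = new ArrayList&lt;>();
|         while (!isInterrupted()) {
|             try {
|                 QueuedTxn first = txnQueue.poll(20, TimeUnit.MILLISECONDS);
|                 if (first == null) continue;                // nothing to sync this interval
|                 batch.add(first);
|                 txnQueue.drainTo(batch);                    // grab everything queued so far
|                 appendAndFsync(batch);                      // one disk flush for the whole group
|                 for (QueuedTxn t : batch) t.onSynced.run(); // let each request finish
|                 batch.clear();
|             } catch (InterruptedException e) {
|                 Thread.currentThread().interrupt();
|             }
|         }
|     }
| 
|     private void appendAndFsync(List&lt;QueuedTxn> batch) {
|         // write the batch to the transaction log and force it to stable storage
|     }
| }
| </pre>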
| |
| <h2>Data structure maintenance</h2> |
| |
| ZooKeeper data is stored in memory. Each znode is stored in a DataNode object,
| which is accessed through a hash table that maps paths to DataNodes.
| DataNodes also organize themselves into a tree. This tree is only used for
| serializing nodes.
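| <p>
| A rough sketch of those two views onto the same objects is shown below; the names
| are hypothetical and it is not the real DataTree API. Lookups go directly through
| the path-keyed hash table, while each node also records its children so the whole
| tree can be walked when it is serialized.
| <pre>
| import java.util.HashSet;
| import java.util.Set;
| import java.util.concurrent.ConcurrentHashMap;
| 
| class DataTreeSketch {
|     static final class DataNode {
|         byte[] data;
|         final Set&lt;String> children = new HashSet&lt;>();
|     }
| 
|     private final ConcurrentHashMap&lt;String, DataNode> nodes = new ConcurrentHashMap&lt;>();
| 
|     DataTreeSketch() {
|         nodes.put("/", new DataNode());           // root node
|     }
| 
|     DataNode get(String path) {
|         return nodes.get(path);                   // O(1) lookup, no tree walk needed
|     }
| 
|     // Assumes the parent path already exists; error handling is omitted.
|     void create(String path, byte[] data) {
|         String parentPath = path.substring(0, Math.max(path.lastIndexOf('/'), 1));
|         String name = path.substring(path.lastIndexOf('/') + 1);
|         DataNode node = new DataNode();
|         node.data = data;
|         nodes.put(path, node);
|         nodes.get(parentPath).children.add(name); // keep the tree view consistent
|     }
| }
| </pre>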
| <p> |
| We guarantee that changes to nodes are stored to non-volatile media before |
| responding to a client. We do this quickly by writing changes as a sequence |
| of transactions in a log file. Even though we flush transactions as a group, |
| we need to avoid seeks as much as possible. Also, since the server can fail |
| at any point, we need to be careful of partial records. |
| <p> |
| We address the above problems by |
| <ul> |
| <li>Pre-allocating file space in 1MB chunks. This allows us to append to the
| file without causing seeks to update the file size. It also means that we need
| to check for the end of the log by looking for a zero-length transaction
| rather than simply the end of the file.
| <li>Writing a signature at the end of each transaction. When processing |
| transactions, we only use transactions that have a valid signature at the end. |
| </ul> |
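| <p>
| A minimal log-writer sketch that follows both rules above is given below; it is an
| illustration only, not the real transaction log format. Space is pre-allocated in
| 1MB blocks of zeros, so a reader detects the end of the log when it sees a zero
| length, and each record ends with a checksum so a partially written record at the
| tail can be recognized and discarded.
| <pre>
| import java.io.IOException;
| import java.io.RandomAccessFile;
| import java.util.zip.Adler32;
| 
| class TxnLogSketch {
|     private static final int PREALLOC_SIZE = 1024 * 1024;
|     private static final byte[] ZEROS = new byte[PREALLOC_SIZE];
| 
|     private final RandomAccessFile log;
|     private long preallocatedTo = 0;
| 
|     TxnLogSketch(String fileName) throws IOException {
|         log = new RandomAccessFile(fileName, "rw");
|     }
| 
|     void append(byte[] txn) throws IOException {
|         long needed = log.getFilePointer() + 4 + txn.length + 8;
|         while (needed > preallocatedTo) {
|             long pos = log.getFilePointer();
|             log.seek(preallocatedTo);
|             log.write(ZEROS);                 // extend the file in 1MB chunks of zeros
|             preallocatedTo += PREALLOC_SIZE;
|             log.seek(pos);                    // appends no longer change the file size
|         }
|         Adler32 checksum = new Adler32();
|         checksum.update(txn, 0, txn.length);
|         log.writeInt(txn.length);             // a zero length here marks the end of the log
|         log.write(txn);
|         log.writeLong(checksum.getValue());   // signature at the end; record ignored on replay unless it matches
|         log.getFD().sync();                   // force the group to disk
|     }
| }
| </pre>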
| <p> |
| As the server runs, the log file will grow quite large. To avoid long startup |
| times we periodically take a snapshot of the tree of DataNodes. We cannot |
| take the snapshot synchronously as the data takes a while to write out, so |
| instead we asynchronously write out the tree. This means that we end up |
| with a "corrupt" snapshot of the data tree. More formally, if we define T
| to be the real snapshot of the tree at the time we begin taking the snapshot |
| and l as the sequence of transactions that are applied to the tree between |
| the time the snapshot begins and the time the snapshot completes, we write |
| to disk T+l' where l' is a subset of the transactions in l. While we do not |
| have a way of figuring out which transactions make up l', it doesn't really |
| matter. T+l'+l = T+l since the transactions we log are idempotent (applying |
| the transaction multiple times has the same result as applying the transaction |
| once). So when we restore the snapshot we also play all transactions in the log |
| that occur after the snapshot was begun. We can easily figure out where to |
| start the replay because we start a new logfile when we start a snapshot. Both |
| the snapshot file and the log file have a numeric suffix that represents the
| transaction id that created the respective file.
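| <p>
| A sketch of how a restore might use those suffixes follows; the file names and the
| hexadecimal suffix are assumptions made for illustration, and deserialization is
| elided. The newest snapshot (T+l') is loaded, and then every log file whose suffix
| is at least the snapshot's is replayed, which is safe because the logged
| transactions are idempotent.
| <pre>
| import java.io.File;
| import java.util.Arrays;
| import java.util.Comparator;
| 
| class RestoreSketch {
|     // Suffix assumed to be the transaction id in hexadecimal, e.g. "snapshot.1a2b".
|     static long suffixOf(File f) {
|         String name = f.getName();
|         return Long.parseLong(name.substring(name.indexOf('.') + 1), 16);
|     }
| 
|     static void restore(File dataDir) {
|         File[] snapshots = dataDir.listFiles((dir, name) -> name.startsWith("snapshot."));
|         File[] logs = dataDir.listFiles((dir, name) -> name.startsWith("log."));
|         File latestSnap = Arrays.stream(snapshots)
|                 .max(Comparator.comparingLong(RestoreSketch::suffixOf))
|                 .orElseThrow(IllegalStateException::new);
|         long snapZxid = suffixOf(latestSnap);
|         loadSnapshot(latestSnap);                     // the "fuzzy" image T+l'
|         Arrays.stream(logs)
|                 .filter(f -> suffixOf(f) >= snapZxid) // logs started at or after the snapshot
|                 .sorted(Comparator.comparingLong(RestoreSketch::suffixOf))
|                 .forEach(RestoreSketch::replayLog);   // reapplying l is safe: txns are idempotent
|     }
| 
|     static void loadSnapshot(File snapshot) { /* deserialize the DataNode tree */ }
|     static void replayLog(File logFile)     { /* apply each valid transaction in order */ }
| }
| </pre>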
| |
| <h2>Session tracking</h2> |
| Rather than tracking sessions exactly, we track them in batches that are
| processed at fixed intervals. This is easier to implement than exact
| session tracking and performs better. It also
| provides a small grace period for session renewal.
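| <p>
| Below is a minimal sketch of such bucketed tracking, with a hypothetical API rather
| than the real session tracker. Session expiration times are rounded up to the next
| interval boundary, so touching a session simply moves it to a later bucket (the
| rounding is the grace period), and the expirer only has to examine one bucket per
| tick.
| <pre>
| import java.util.Collections;
| import java.util.Set;
| import java.util.concurrent.ConcurrentHashMap;
| 
| class SessionBucketsSketch {
|     private final long expirationInterval;  // tick size in milliseconds
|     private final ConcurrentHashMap&lt;Long, Set&lt;Long>> buckets = new ConcurrentHashMap&lt;>();
|     private final ConcurrentHashMap&lt;Long, Long> sessionBucket = new ConcurrentHashMap&lt;>();
| 
|     SessionBucketsSketch(long expirationIntervalMs) {
|         this.expirationInterval = expirationIntervalMs;
|     }
| 
|     // Called on every request or ping from a client.
|     void touch(long sessionId, long timeoutMs) {
|         long now = System.currentTimeMillis();
|         // Round up to the next interval boundary; the rounding gives the grace period.
|         long expiry = ((now + timeoutMs) / expirationInterval + 1) * expirationInterval;
|         Long previous = sessionBucket.put(sessionId, expiry);
|         if (previous != null) {
|             Set&lt;Long> oldBucket = buckets.get(previous);
|             if (oldBucket != null) oldBucket.remove(sessionId);
|         }
|         buckets.computeIfAbsent(expiry, k -> ConcurrentHashMap.newKeySet()).add(sessionId);
|     }
| 
|     // Called once per interval: every session still left in this bucket has expired.
|     Set&lt;Long> expireBucket(long bucketTime) {
|         Set&lt;Long> expired = buckets.remove(bucketTime);
|         if (expired == null) {
|             return Collections.emptySet();
|         }
|         return expired;
|     }
| }
| </pre>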
| </body> |
| </html> |