| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Copyright 2002-2004 The Apache Software Foundation |
| |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN" |
| "http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd"> |
| <article id="bk_GettStartedGuide"> |
| <title>BookKeeper overview</title> |
| |
| <articleinfo> |
| <legalnotice> |
| <para>Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. You may |
| obtain a copy of the License at <ulink |
| url="http://www.apache.org/licenses/LICENSE-2.0">http://www.apache.org/licenses/LICENSE-2.0</ulink>.</para> |
| |
| <para>Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an "AS IS" |
| BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or |
| implied. See the License for the specific language governing permissions |
| and limitations under the License.</para> |
| </legalnotice> |
| |
| <abstract> |
| <para>This guide contains detailed information about using BookKeeper |
| for logging. It discusses the basic operations BookKeeper supports, |
| and how to create logs and perform basic read and write operations on these |
| logs.</para> |
| </abstract> |
| </articleinfo> |
| <section id="bk_Overview"> |
| <title>BookKeeper overview</title> |
| |
| <section id="bk_Intro"> |
| <title>BookKeeper introduction</title> |
| <para> |
| BookKeeper is a replicated service to reliably log streams of records. In BookKeeper, |
| servers are "bookies", log streams are "ledgers", and each unit of a log (aka record) is a |
| "ledger entry". BookKeeper is designed to be reliable; bookies, the servers that store |
| ledgers, can crash, corrupt data, discard data, but as long as there are enough bookies |
| behaving correctly the service as a whole behaves correctly. |
| </para> |
| |
| <para> |
| The initial motivation for BookKeeper comes from the namenode of HDFS. Namenodes have to |
| log operations in a reliable fashion so that recovery is possible in the case of crashes. |
| We have found the applications for BookKeeper extend far beyond HDFS, however. Essentially, |
| any application that requires an append storage can replace their implementations with |
| BookKeeper. BookKeeper has the advantage of scaling throughput with the number of servers. |
| </para> |
| |
| <para> |
| At a high level, a bookkeeper client receives entries from a client application and stores it to |
| sets of bookies, and there are a few advantages in having such a service: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| We can use hardware that is optimized for such a service. We currently believe that such a |
| system has to be optimized only for disk I/O; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| We can have a pool of servers implementing such a log system, and shared among a number of servers; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| We can have a higher degree of replication with such a pool, which makes sense if the hardware necessary for it is cheaper compared to the one the application uses. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| <section id="bk_moreDetail"> |
| <title>In slightly more detail...</title> |
| |
| <para> BookKeeper implements highly available logs, and it has been designed with write-ahead logging in mind. Besides high availability |
| due to the replicated nature of the service, it provides high throughput due to striping. As we write entries in a subset of bookies of an |
| ensemble and rotate writes across available quorums, we are able to increase throughput with the number of servers for both reads and writes. |
| Scalability is a property that is possible to achieve in this case due to the use of quorums. Other replication techniques, such as |
| state-machine replication, do not enable such a property. |
| </para> |
| |
| <para> An application first creates a ledger before writing to bookies through a local BookKeeper client instance. |
| Upon creating a ledger, a BookKeeper client writes metadata about the ledger to ZooKeeper. Each ledger currently |
| has a single writer. This writer has to execute a close ledger operation before any other client can read from it. |
| If the writer of a ledger does not close a ledger properly because, for example, it has crashed before having the |
| opportunity of closing the ledger, then the next client that tries to open a ledger executes a procedure to recover |
| it. As closing a ledger consists essentially of writing the last entry written to a ledger to ZooKeeper, the recovery |
| procedure simply finds the last entry written correctly and writes it to ZooKeeper. |
| </para> |
| |
| <para> |
| Note that currently this recovery procedure is executed automatically upon trying to open a ledger and no explicit action is necessary. |
| Although two clients may try to recover a ledger concurrently, only one will succeed, the first one that is able to create the close znode |
| for the ledger. |
| </para> |
| </section> |
| |
| <section id="bk_basicComponents"> |
| <title>Bookkeeper elements and concepts</title> |
| <para> |
| BookKeeper uses four basic elements: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <emphasis role="bold">Ledger</emphasis>: A ledger is a sequence of entries, and each entry is a sequence of bytes. Entries are |
| written sequentially to a ledger and at most once. Consequently, ledgers have an append-only semantics; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <emphasis role="bold">BookKeeper client</emphasis>: A client runs along with a BookKeeper application, and it enables applications |
| to execute operations on ledgers, such as creating a ledger and writing to it; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <emphasis role="bold">Bookie</emphasis>: A bookie is a BookKeeper storage server. Bookies store the content of ledgers. For any given |
| ledger L, we call an <emphasis>ensemble</emphasis> the group of bookies storing the content of L. For performance, we store on |
| each bookie of an ensemble only a fragment of a ledger. That is, we stripe when writing entries to a ledger such that |
| each entry is written to sub-group of bookies of the ensemble. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <emphasis role="bold">Metadata storage service</emphasis>: BookKeeper requires a metadata storage service to store information related |
| to ledgers and available bookies. We currently use ZooKeeper for such a task. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </section> |
| |
| <section id="bk_initialDesign"> |
| <title>Bookkeeper initial design</title> |
| <para> |
| A set of bookies implements BookKeeper, and we use a quorum-based protocol to replicate data across the bookies. |
| There are basically two operations to an existing ledger: read and append. Here is the complete API list |
| (mode detail <ulink url="bookkeeperProgrammer.html"> |
| here</ulink>): |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Create ledger: creates a new empty ledger; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Open ledger: opens an existing ledger for reading; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Add entry: adds a record to a ledger either synchronously or asynchronously; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Read entries: reads a sequence of entries from a ledger either synchronously or asynchronously |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| There is only a single client that can write to a ledger. Once that ledger is closed or the client fails, |
| no more entries can be added. (We take advantage of this behavior to provide our strong guarantees.) |
| There will not be gaps in the ledger. Fingers get broken, people get roughed up or end up in prison when |
| books are manipulated, so there is no deleting or changing of entries. |
| </para> |
| |
| <figure> |
| <title>BookKeeper Overview</title> |
| |
| <mediaobject> |
| <imageobject> |
| <imagedata fileref="images/bk-overview.jpg" width="3in" depth="3in" contentwidth="3in" contentdepth="3in" scalefit="0"/> |
| </imageobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| A simple use of BooKeeper is to implement a write-ahead transaction log. A server maintains an in-memory data structure |
| (with periodic snapshots for example) and logs changes to that structure before it applies the change. The application |
| server creates a ledger at startup and store the ledger id and password in a well known place (ZooKeeper maybe). When |
| it needs to make a change, the server adds an entry with the change information to a ledger and apply the change when |
| BookKeeper adds the entry successfully. The server can even use asyncAddEntry to queue up many changes for high change |
| throughput. BooKeeper meticulously logs the changes in order and call the completion functions in order. |
| </para> |
| |
| <para> |
| When the application server dies, a backup server will come online, get the last snapshot and then it will open the |
| ledger of the old server and read all the entries from the time the snapshot was taken. (Since it doesn't know the |
| last entry number it will use MAX_INTEGER). Once all the entries have been processed, it will close the ledger and |
| start a new one for its use. |
| </para> |
| |
| <para> |
| A client library takes care of communicating with bookies and managing entry numbers. An entry has the following fields: |
| </para> |
| |
| <table frame='all'><title>Entry fields</title> |
| <tgroup cols='3' align='left' colsep='1' rowsep='1'> |
| <colspec colname='Field'/> |
| <colspec colname='Type'/> |
| <colspec colname='Description'/> |
| <colspec colnum='5' colname='c5'/> |
| <thead> |
| <row> |
| <entry>Field</entry> |
| <entry>Type</entry> |
| <entry>Description</entry> |
| </row> |
| </thead> |
| <tfoot> |
| <row> |
| <entry>Ledger number</entry> |
| <entry>long</entry> |
| <entry>The id of the ledger of this entry</entry> |
| </row> |
| <row> |
| <entry>Entry number</entry> |
| <entry>long</entry> |
| <entry>The id of this entry</entry> |
| </row> |
| </tfoot> |
| <tbody> |
| <row> |
| <entry>last confirmed (<emphasis>LC</emphasis>)</entry> |
| <entry>long</entry> |
| <entry>id of the last recorded entry</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>byte[]</entry> |
| <entry>the entry data (supplied by application)</entry> |
| </row> |
| <row> |
| <entry>authentication code</entry> |
| <entry>byte[]</entry> |
| <entry>Message authentication code that includes all other fields of the entry</entry> |
| </row> |
| |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <para> |
| The client library generates a ledger entry. None of the fields are modified by the bookies and only the first three |
| fields are interpreted by the bookies. |
| </para> |
| |
| <para> |
| To add to a ledger, the client generates the entry above using the ledger number. The entry number will be one more |
| than the last entry generated. The <emphasis>LC</emphasis> field contains the last entry that has been successfully recorded by BookKeeper. |
| If the client writes entries one at a time, <emphasis>LC</emphasis> is the last entry id. But, if the client is using asyncAddEntry, there |
| may be many entries in flight. An entry is considered recorded when both of the following conditions are met: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| the entry has been accepted by a quorum of bookies |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| all entries with a lower entry id have been accepted by a quorum of bookies |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| <emphasis>LC</emphasis> seems mysterious right now, but it is too early to explain how we use it; just smile and move on. |
| </para> |
| |
| <para> |
| Once all the other fields have been field in, the client generates an authentication code with all of the previous fields. |
| The entry is then sent to a quorum of bookies to be recorded. Any failures will result in the entry being sent to a new |
| quorum of bookies. |
| </para> |
| |
| <para> |
| To read, the client library initially contacts a bookie and starts requesting entries. If an entry is missing or |
| invalid (a bad MAC for example), the client will make a request to a different bookie. By using quorum writes, |
| as long as enough bookies are up we are guaranteed to eventually be able to read an entry. |
| </para> |
| |
| </section> |
| |
| <section id="bk_metadata"> |
| <title>Bookkeeper metadata management</title> |
| |
| <para> |
| There are some meta data that needs to be made available to BookKeeper clients: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| The available bookies; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The list of ledgers; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The list of bookies that have been used for a given ledger; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The last entry of a ledger; |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| We maintain this information in ZooKeeper. Bookies use ephemeral nodes to indicate their availability. Clients |
| use znodes to track ledger creation and deletion and also to know the end of the ledger and the bookies that |
| were used to store the ledger. Bookies also watch the ledger list so that they can cleanup ledgers that get deleted. |
| </para> |
| |
| </section> |
| |
| <section id="bk_closingOut"> |
| <title>Closing out ledgers</title> |
| |
| <para> |
| The process of closing out the ledger and finding the last ledger is difficult due to the durability guarantees of BookKeeper: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| If an entry has been successfully recorded, it must be readable. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| If an entry is read once, it must always be available to be read. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| If the ledger was closed gracefully, ZooKeeper will have the last entry and everything will work well. But, if the |
| BookKeeper client that was writing the ledger dies, there is some recovery that needs to take place. |
| </para> |
| |
| <para> |
| The problematic entries are the ones at the end of the ledger. There can be entries in flight when a BookKeeper client |
| dies. If the entry only gets to one bookie, the entry should not be readable since the entry will disappear if that bookie |
| fails. If the entry is only on one bookie, that doesn't mean that the entry has not been recorded successfully; the other |
| bookies that recorded the entry might have failed. |
| </para> |
| |
| <para> |
| The trick to making everything work is to have a correct idea of a last entry. We do it in roughly three steps: |
| </para> |
| <orderedlist> |
| <listitem> |
| <para> |
| Find the entry with the highest last recorded entry, <emphasis>LC</emphasis>; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Find the highest consecutively recorded entry, <emphasis>LR</emphasis>; |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Make sure that all entries between <emphasis>LC</emphasis> and <emphasis>LR</emphasis> are on a quorum of bookies; |
| </para> |
| </listitem> |
| |
| </orderedlist> |
| </section> |
| </section> |
| </article> |