content/mavibot/user-guide/2.1-mvcc-btree.mdtext - directory-site - Git at Google

 Title: 2 - MVCC B-tree
 NavUp: 2-btree-types.html
 NavUpText: 2 - B-tree flavors
 NavNext: 3-btree-management.html
 NavNextText: 3 - Mavibot B-tree management
 NavPrev: 2-btree-types.html
 NavPrevText: 2 - B-tree flavors
 Notice: Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
     .
     http://www.apache.org/licenses/LICENSE-2.0
     .
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.

 # 2.1 - MVCC B-tree

 **MVCC** stands for **M**ulti **V**ersion **C**oncurrency **C**control. It all boils down to allow readers to **ALWAYS** have a consistent version of the data they reached at the time they started to work with them. In other words, when a reader access the database, it will work on a specific version of it, that will never change.

 The drawback is that the data may be accurate at the time the reader started to work with them, but this reader will never see the updated data if a writer has modified them.

 Most of the time, this is a price the users are ready to pay, for at least two reasons :
 - udpates are way less frequent that reads, and the users 'accept' the fact that what they get is not necessarily up to date
 - there are some mechanisms that workaround this problem, checking afterward that the data represent the latest state of the database, otherwise the operation is retried.

 You can read more about **MVCC** **B-tree** on [Wikipedia, MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)

 ## How does it work ?

 A **MVCC** **B-Tree** is just a plain standard **B+Tree**, except that we keep a track of more than one version of the **B+Tree**. Actually, we have multiple roots, one per version. This can bee seen in this picture :

 ![MVCC sample](images/R4-node.png)

 Here, we have 4 different versions, from r1 to r4. Each one of them might be used at the same time by one or many readers. In any case, the writer update only the latest one, creating a new revision doing so. In this exemple, the writer has just created r4 updating r3, adding the element 'I'. As we can see, some pages are shared across the revisions : the leaf contaning the element 'A' (version r3) is also referenced by the revision r4. This is a key : that means a page (**Node** or **Leaf** may be used by many revisions).

 We don't delete old revision until they are not anymore in use. And even so, we don't delete all the associated pages blindly : they may be referenced by newer versions that are still active.

 ### Browsing the Database

 So we want to fetch some information from a B-Tree. That's just a matter of creating a *ReadTransaction*, which will grab the latest existing revision for thi **B-Tree** and allow you to browse data from this **B-Tree**. One you are done with the **B-Tree**, you just have to close the *ReadTransaction* (this is critical, because keeping it opened will pin the revision, forbidding the removal of old pages. Although if you forget to do it, the revision will be released after a period of time). This is a description of the typical code :

     open a ReadTransaction
     get the B-Tree you want to read
     read the B-tree
     ...
     close the ReadTransaction

 It's important to note that the transaction covers all the access on all the managed **B-Tree**. There is no need to open a new *ReadTransaction* if you want to fetch some other information from another **B-Tree**. That also mean you can be sure that you have a consistent view of all the database for a given revision.

 You may have many threads reading data from many **B-Trees**, and we will have as many *ReadTransaction* created for that purpose. As we may keep a *readTransaction* opened for quite a long period of time, it's likely that at some point, the various Threads will use different versions of the **B-Trees**.


 ### Updating the Database

 This is a bit more complex. Here, you need a *WriteTransaction*, and there can be only one running across all the application. That means the access to the Database is serialized, when it comes to update it. The direct consequence is that writes are **slow**.

 Again, the typical call is like :

     open a WriteTransaction
     get the B-Tree you want to update
     update the B-tree
     ...
     close the WriteTransaction

 and again, many **B-Trees** might be updated during this *writeTransaction*. For a system like a **LDAP** server, this is critical, because updating an entry is impacting many **B-trees**, and we want all of them to be consistent globally : eitehr all the **B-trees** are updated, or none of them.

 It's even more critical to close the *writeTransaction* because all the other writers are waiting for this transaction to be completed to be able to proceed.

 Let's now describe how an update works, because it's quite convoluted.

 #### Updates, from the tranchees...

 Updating a **B-Tree** is not a simple operation. Actually, we want to guarantee that the operation succeed or fail, but in any case, leave the database in a consistant state. We also want to guarantee that even if a crash occurs in the middle of a write will leave the database in a consistant state. Let's see how it's possible.

 A **B-Tree** is composed of 3 elements :

     * The *BTreeHeader*, which contains the reference on the two other elements
     * The *BTreeInfo*, which contains informations relative to this **B-Tree** (this element is never updated)
     * The *RootPage*, which is the root of the **B-Tree**

 Each of these elemenst are stored into *Pages*, and referenced through the offset of those *Pages* on disk (the offset is a long, and it's just the position of a page in the database file).

 The very first step is to retrieve the latest **B-Tree** revision. We have a management **B-Tree** that keeps a track of all the active revisions for every **B-Tree** : the **BtreeOfBtrees**, called **BOB**. It stores a reference on each **B-tree**'s header for every active revision. The key is a composition of the **B-Tree** name and its revision : *<name, revision> and it reference the *BTreeHeader*'s offset (the position of the *BtreeHeader*'s page). Fo that, we browse the *BOB* looking for the newest <Name, Revision> couple (actually, we are looking for the first tuple that is lower that <name, infinite>).

 Once we have found the tuple, we know where to fetch the *BtreeHeader* from. So do we. We can then fetch the *RootPage*, which is the starting point for any update.

 At this point, the update is either an addition or a removal. That will potentially change the whole structure of the **B-Tree**, or a single page. In any case, we will have a new *RootPage* pointing to old pages and to newly created pages. Let's see again the previous picture :

 ![MVCC sample](images/R4-node.png)

 Let's say we are at revision 3, and we want to add an entry 'I' : as we can see, the *rootPage* is updated, the right **Leaf** is split in two, and the left **Leaf** is unchanged. All in all, we have created 3 new pages, and 2 old pages are now unused and can be reclaimed later. The internal modification of pages is not in the scope of this explaination, so we will leave it for the moment.

 Here is an animated image that shows how it's done :

 ![R3 to R4](images/r3tor4.gif)


 So now, we have a new *BtreeHeader*, revision 4. We have to keep a track of it, which means we have to inject it into the **BOB**. As it's a **B-Tree**, we proceed with it the exact same way, except that we always have a reference on the **BOB** *BtreeHeader* in the transaction context.

 <DIV class="note" markdown="1">
 We will see later on what is the Transaction Context
 </DIV>

 We are almost done. We also have to keep a track of the 'old' pages that were copied. In order to be able to reclaim them, we have to store them somewhere convenient. An 'old' page is not necessarily 'old' for everyone : another thread might have a reader on revision 3, thus might need those 'old' pages. Actually, reclaiming 'old' pages is a bit complex and will be explained later on. Enough said that we want to keep a track of every of those pages in a **B-Tree**, to be able to reclaimn them later.

 We use a specific **B-Tree** for taht : the **CopiedPageBtree**, or **CPB**. The key is different than for the **BOB**, it's a composition of the revision we are creating, and the name of the **B-Tree** being updated : <revision, name>. The value is the offset of the copied page. This way, when the revision is not anymore in use by any reader, we can reclaim the associated pages.

 The **CPB** is a **B-Tree**, so the update process is the one we already have described. There is one big difference though : once the **CPB** has been updated, we can immediately reclaim the copied pages, we won't use them at all (this is because there is only one writer thread, and this is the only one allowed to manipulate this **B-Tree**).

 We are almost done now. At this point, the targetted **B-Tree** has been updated, and the **BOB** and **CPB** have also been updated. But no reader can see those updated versions yet. We need to update the **RecordManagerHeader**, a structure that contain the reference to the current **BOB**. Once we have updated this structure, and wrote it on disk, we can safely make the new **BOB** available for any new readers.

 #### Reclaiming unused pages

 As we have seen in the previous paragraph, we always create new pages, we never remove any. This is obviously going to eat a lot of disk space if we don't reclaim unused pages !

 Hopefully, there is a *reclaimer* thread that is executed when no writerer is running, which role is to check if some old pages can be removed. There are strict conditions for a page to be candidate for removal :

     * It must be presnet in the **CPB**
     * The revision that has copied those pages must not be in use by any reader
     * The revision must be the oldest one

 The latest condition is the critical one : it means that if a revision is used for a long period of time, we will not be able to reclaim any pages from this revision nor from any younger revision, even if they are not in use. That may bloat the Database...

 There are ways to deal with this problem. Typically, if a page is copied twice, then we can reclaim it if it's not referenced. The algorithm is a bit complex though...

 The easiest solution is simply to forbid long read transactions.

 ### Dealing with alive revisions

 Every time we have a reader starting a *readTransaction*, we have to keep a track of the current revision, and the number of readers using it. This is done using a specific List of Used Revisions. Every new reader will add a new revision to this list, or increment a counter associated with the existing revision. Everytime a reader close the *readTransaction*, the counter is decremented. When this counter go down to 0, the revision can be discarded, unless it's the latest one.

 Discarding an old revision is possible only when it's the oldest. That means the list may contain many inactive revisions, with a counter equal to 0. Those inactive revisions will be removed when they will become the oldest revision.
	Title: 2 - MVCC B-tree
	NavUp: 2-btree-types.html
	NavUpText: 2 - B-tree flavors
	NavNext: 3-btree-management.html
	NavNextText: 3 - Mavibot B-tree management
	NavPrev: 2-btree-types.html
	NavPrevText: 2 - B-tree flavors
	Notice: Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at
	.
	http://www.apache.org/licenses/LICENSE-2.0
	.
	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	# 2.1 - MVCC B-tree

	MVCC stands for Multi Version Concurrency Ccontrol. It all boils down to allow readers to ALWAYS have a consistent version of the data they reached at the time they started to work with them. In other words, when a reader access the database, it will work on a specific version of it, that will never change.

	The drawback is that the data may be accurate at the time the reader started to work with them, but this reader will never see the updated data if a writer has modified them.

	Most of the time, this is a price the users are ready to pay, for at least two reasons :
	- udpates are way less frequent that reads, and the users 'accept' the fact that what they get is not necessarily up to date
	- there are some mechanisms that workaround this problem, checking afterward that the data represent the latest state of the database, otherwise the operation is retried.

	You can read more about MVCC B-tree on [Wikipedia, MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)

	## How does it work ?

	A MVCC B-Tree is just a plain standard B+Tree, except that we keep a track of more than one version of the B+Tree. Actually, we have multiple roots, one per version. This can bee seen in this picture :

	![MVCC sample](images/R4-node.png)

	Here, we have 4 different versions, from r1 to r4. Each one of them might be used at the same time by one or many readers. In any case, the writer update only the latest one, creating a new revision doing so. In this exemple, the writer has just created r4 updating r3, adding the element 'I'. As we can see, some pages are shared across the revisions : the leaf contaning the element 'A' (version r3) is also referenced by the revision r4. This is a key : that means a page (Node or Leaf may be used by many revisions).

	We don't delete old revision until they are not anymore in use. And even so, we don't delete all the associated pages blindly : they may be referenced by newer versions that are still active.

	### Browsing the Database

	So we want to fetch some information from a B-Tree. That's just a matter of creating a ReadTransaction, which will grab the latest existing revision for thi B-Tree and allow you to browse data from this B-Tree. One you are done with the B-Tree, you just have to close the ReadTransaction (this is critical, because keeping it opened will pin the revision, forbidding the removal of old pages. Although if you forget to do it, the revision will be released after a period of time). This is a description of the typical code :

	open a ReadTransaction
	get the B-Tree you want to read
	read the B-tree
	...
	close the ReadTransaction

	It's important to note that the transaction covers all the access on all the managed B-Tree. There is no need to open a new ReadTransaction if you want to fetch some other information from another B-Tree. That also mean you can be sure that you have a consistent view of all the database for a given revision.

	You may have many threads reading data from many B-Trees, and we will have as many ReadTransaction created for that purpose. As we may keep a readTransaction opened for quite a long period of time, it's likely that at some point, the various Threads will use different versions of the B-Trees.


	### Updating the Database

	This is a bit more complex. Here, you need a WriteTransaction, and there can be only one running across all the application. That means the access to the Database is serialized, when it comes to update it. The direct consequence is that writes are slow.

	Again, the typical call is like :

	open a WriteTransaction
	get the B-Tree you want to update
	update the B-tree
	...
	close the WriteTransaction

	and again, many B-Trees might be updated during this writeTransaction. For a system like a LDAP server, this is critical, because updating an entry is impacting many B-trees, and we want all of them to be consistent globally : eitehr all the B-trees are updated, or none of them.

	It's even more critical to close the writeTransaction because all the other writers are waiting for this transaction to be completed to be able to proceed.

	Let's now describe how an update works, because it's quite convoluted.

	#### Updates, from the tranchees...

	Updating a B-Tree is not a simple operation. Actually, we want to guarantee that the operation succeed or fail, but in any case, leave the database in a consistant state. We also want to guarantee that even if a crash occurs in the middle of a write will leave the database in a consistant state. Let's see how it's possible.

	A B-Tree is composed of 3 elements :

	* The BTreeHeader, which contains the reference on the two other elements
	* The BTreeInfo, which contains informations relative to this B-Tree (this element is never updated)
	* The RootPage, which is the root of the B-Tree

	Each of these elemenst are stored into Pages, and referenced through the offset of those Pages on disk (the offset is a long, and it's just the position of a page in the database file).

	The very first step is to retrieve the latest B-Tree revision. We have a management B-Tree that keeps a track of all the active revisions for every B-Tree : the BtreeOfBtrees, called BOB. It stores a reference on each B-tree's header for every active revision. The key is a composition of the B-Tree name and its revision : <name, revision> and it reference the BTreeHeader's offset (the position of the BtreeHeader's page). Fo that, we browse the BOB* looking for the newest <Name, Revision> couple (actually, we are looking for the first tuple that is lower that <name, infinite>).

	Once we have found the tuple, we know where to fetch the BtreeHeader from. So do we. We can then fetch the RootPage, which is the starting point for any update.

	At this point, the update is either an addition or a removal. That will potentially change the whole structure of the B-Tree, or a single page. In any case, we will have a new RootPage pointing to old pages and to newly created pages. Let's see again the previous picture :

	![MVCC sample](images/R4-node.png)

	Let's say we are at revision 3, and we want to add an entry 'I' : as we can see, the rootPage is updated, the right Leaf is split in two, and the left Leaf is unchanged. All in all, we have created 3 new pages, and 2 old pages are now unused and can be reclaimed later. The internal modification of pages is not in the scope of this explaination, so we will leave it for the moment.

	Here is an animated image that shows how it's done :

	![R3 to R4](images/r3tor4.gif)


	So now, we have a new BtreeHeader, revision 4. We have to keep a track of it, which means we have to inject it into the BOB. As it's a B-Tree, we proceed with it the exact same way, except that we always have a reference on the BOB BtreeHeader in the transaction context.

	<DIV class="note" markdown="1">
	We will see later on what is the Transaction Context
	</DIV>

	We are almost done. We also have to keep a track of the 'old' pages that were copied. In order to be able to reclaim them, we have to store them somewhere convenient. An 'old' page is not necessarily 'old' for everyone : another thread might have a reader on revision 3, thus might need those 'old' pages. Actually, reclaiming 'old' pages is a bit complex and will be explained later on. Enough said that we want to keep a track of every of those pages in a B-Tree, to be able to reclaimn them later.

	We use a specific B-Tree for taht : the CopiedPageBtree, or CPB. The key is different than for the BOB, it's a composition of the revision we are creating, and the name of the B-Tree being updated : <revision, name>. The value is the offset of the copied page. This way, when the revision is not anymore in use by any reader, we can reclaim the associated pages.

	The CPB is a B-Tree, so the update process is the one we already have described. There is one big difference though : once the CPB has been updated, we can immediately reclaim the copied pages, we won't use them at all (this is because there is only one writer thread, and this is the only one allowed to manipulate this B-Tree).

	We are almost done now. At this point, the targetted B-Tree has been updated, and the BOB and CPB have also been updated. But no reader can see those updated versions yet. We need to update the RecordManagerHeader, a structure that contain the reference to the current BOB. Once we have updated this structure, and wrote it on disk, we can safely make the new BOB available for any new readers.

	#### Reclaiming unused pages

	As we have seen in the previous paragraph, we always create new pages, we never remove any. This is obviously going to eat a lot of disk space if we don't reclaim unused pages !

	Hopefully, there is a reclaimer thread that is executed when no writerer is running, which role is to check if some old pages can be removed. There are strict conditions for a page to be candidate for removal :

	* It must be presnet in the CPB
	* The revision that has copied those pages must not be in use by any reader
	* The revision must be the oldest one

	The latest condition is the critical one : it means that if a revision is used for a long period of time, we will not be able to reclaim any pages from this revision nor from any younger revision, even if they are not in use. That may bloat the Database...

	There are ways to deal with this problem. Typically, if a page is copied twice, then we can reclaim it if it's not referenced. The algorithm is a bit complex though...

	The easiest solution is simply to forbid long read transactions.

	### Dealing with alive revisions

	Every time we have a reader starting a readTransaction, we have to keep a track of the current revision, and the number of readers using it. This is done using a specific List of Used Revisions. Every new reader will add a new revision to this list, or increment a counter associated with the existing revision. Everytime a reader close the readTransaction, the counter is decremented. When this counter go down to 0, the revision can be discarded, unless it's the latest one.

	Discarding an old revision is possible only when it's the oldest. That means the list may contain many inactive revisions, with a counter equal to 0. Those inactive revisions will be removed when they will become the oldest revision.