content/mavibot/user-guide/1.1-btree-basics.mdtext - directory-site - Git at Google

 Title: 1.1 - B-tree Basics
 NavUp: 1-introduction.html
 NavUpText: 1 - Introduction
 NavNext: 2-btree-types.html
 NavNextText: 2 - B-tree Types
 Notice: Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
     .
     http://www.apache.org/licenses/LICENSE-2.0
     .
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.

 # 1.1 - B-tree Basics

 **B-tree** was invented back in 1971 by **Bayer** and **McCreight**  (the **B** does not mean anything known, speculations are that it comes either form the **B** of **Bayer** or from **Boing** they were working for back then). It was an extension to binary search tree, which was just able to store 2 elements per page.

 A **B-tree** is a "tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time." (see [Wikipedia](http://en.wikipedia.org/wiki/B-tree))

 The important point here is the last one : it guarantees **O(logn)** operations, compared to any other data structures (a hashed data structure offers **O(n)** average operations, but can degenerate to **O(n2)**, and ordering is not kept). Although it would be optimal to keep only 2 elements in each node, it's way better from a performance point of view to keep more than 2 elements in each node, even if it leads to more comparison than strictly necessary. The reason is that reading a fixed size from disk is better than reading only what we need (considering that a disk access is 3 to 4 orders of magnitude slower than RAM access, it's always a good idea to cache pages in memory, rather than read data from the disk over and over. And these 3 to 4 orders of magnitude is for SSD... For spinning disks, we are more in the 4 to 5 orders of magnitude !)

 Add to that the fact that when *B-tree* were invented, data were stored on magnetic tapes, with a pure sequential access. Reading data sequentially made sense, and reading a block of data was the way to go - thus the page was meant to store more than 2 elements -.

 **B-trees** are everywhere : databases, OS, etc. It's a critical data structure when you are to deal with a large set of data.

 1.1.1 - Inside a B-tree

 A **B-Tree** contains **Nodes** and **Leaves**. A *Node* points to other **Nodes** or **Leaves**. **Nodes** contain only **keys** and references to child **nNodes**, when **Leaves** have **Keys** that are associated with **Values**. The **Nodes**' **Keys* are used to find the right child that will ultimately contain the **Value** we are looking for (if it's present)

 Pretty simple !

 One last thing : **Keys** are ordered, and this is the condition for the easy and fast retrieval of **Values**.

 A few more rules are enforced :
 * A **Node** and a **Leaf** contains up to N values (N being generally a power of 2, so 2, 4, 8, 16...).
 * You can't have less than N/2 **Values** or **keys** in a **Leaf** or a **Node**, except for the root **Node**.

 The second rule allows the **B-tree** to have only one **Leaf**, when up to **N** elements are to be stored. If we want to store one mor element, then the root **Leaf** will be split and the elements will be spread equally in two **Leaves**, with a root **Node$* containing references on those two **Leaves**.

 1.1.2 - Concurrent access

 The real problem with **B-tree**s is that we need to use some locks to allow concurrent access to the data, especially when some updates occur. The rationale is that when browsing a **B-tree**, changing it will just break the browse operation. There are some technics to avoid having such an issue, up to a point no read locks are needed, assuming some extra information is added to the pages : a pointer to the next page at the same level. Sadly, it comes with drawbacks, like it's difficult to remove empty pages from the **B-tree**.

 What we want is a way to guarantee that many users can read data from the **B-tree**, even while the **B-tree** is being updated. Of course, that will heavily favor reads over writes, as writes will have to be serialized, but in many cases, this is a comfortable trade we are ready to make.
	Title: 1.1 - B-tree Basics
	NavUp: 1-introduction.html
	NavUpText: 1 - Introduction
	NavNext: 2-btree-types.html
	NavNextText: 2 - B-tree Types
	Notice: Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at
	.
	http://www.apache.org/licenses/LICENSE-2.0
	.
	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	# 1.1 - B-tree Basics

	B-tree was invented back in 1971 by Bayer and McCreight (the B does not mean anything known, speculations are that it comes either form the B of Bayer or from Boing they were working for back then). It was an extension to binary search tree, which was just able to store 2 elements per page.

	A B-tree is a "tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time." (see [Wikipedia](http://en.wikipedia.org/wiki/B-tree))

	The important point here is the last one : it guarantees O(logn) operations, compared to any other data structures (a hashed data structure offers O(n) average operations, but can degenerate to O(n2), and ordering is not kept). Although it would be optimal to keep only 2 elements in each node, it's way better from a performance point of view to keep more than 2 elements in each node, even if it leads to more comparison than strictly necessary. The reason is that reading a fixed size from disk is better than reading only what we need (considering that a disk access is 3 to 4 orders of magnitude slower than RAM access, it's always a good idea to cache pages in memory, rather than read data from the disk over and over. And these 3 to 4 orders of magnitude is for SSD... For spinning disks, we are more in the 4 to 5 orders of magnitude !)

	Add to that the fact that when B-tree were invented, data were stored on magnetic tapes, with a pure sequential access. Reading data sequentially made sense, and reading a block of data was the way to go - thus the page was meant to store more than 2 elements -.

	B-trees are everywhere : databases, OS, etc. It's a critical data structure when you are to deal with a large set of data.

	1.1.1 - Inside a B-tree

	A B-Tree contains Nodes and Leaves. A Node points to other Nodes or Leaves. Nodes contain only keys and references to child nNodes, when Leaves have Keys that are associated with Values. The Nodes' *Keys are used to find the right child that will ultimately contain the Value we are looking for (if it's present)

	Pretty simple !

	One last thing : Keys are ordered, and this is the condition for the easy and fast retrieval of Values.

	A few more rules are enforced :
	* A Node and a Leaf contains up to N values (N being generally a power of 2, so 2, 4, 8, 16...).
	* You can't have less than N/2 Values or keys in a Leaf or a Node, except for the root Node.

	The second rule allows the B-tree to have only one Leaf, when up to N elements are to be stored. If we want to store one mor element, then the root Leaf will be split and the elements will be spread equally in two Leaves, with a root *Node$ containing references on those two Leaves.

	1.1.2 - Concurrent access

	The real problem with B-trees is that we need to use some locks to allow concurrent access to the data, especially when some updates occur. The rationale is that when browsing a B-tree, changing it will just break the browse operation. There are some technics to avoid having such an issue, up to a point no read locks are needed, assuming some extra information is added to the pages : a pointer to the next page at the same level. Sadly, it comes with drawbacks, like it's difficult to remove empty pages from the B-tree.

	What we want is a way to guarantee that many users can read data from the B-tree, even while the B-tree is being updated. Of course, that will heavily favor reads over writes, as writes will have to be serialized, but in many cases, this is a comfortable trade we are ready to make.