blob: 8b25313e14e715548533922283aa0ce0ea6c17c9 [file] [log] [blame]
Title: 1.1 - B-tree Basics
NavUp: 1-introduction.html
NavUpText: 1 - Introduction
NavNext: 2-btree-types.html
NavNextText: 2 - B-tree Types
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.
http://www.apache.org/licenses/LICENSE-2.0
.
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
# 1.1 - B-tree Basics
**B-tree** was invented back in 1971 by **Bayer** and **McCreight** (the **B** does not mean anything known, speculations are that it comes either form the **B** of **Bayer** or from **Boing** they were working for back then). It was an extension to binary search tree, which was just able to store 2 elements per page.
A **B-tree** is a "tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time." (see [Wikipedia](http://en.wikipedia.org/wiki/B-tree))
The important point here is the last one : it guarantees **O(logn)** operations, compared to any other data structures (a hashed data structure offers **O(n)** average operations, but can degenerate to **O(n2)**, and ordering is not kept). Although it would be optimal to keep only 2 elements in each node, it's way better from a performance point of view to keep more than 2 elements in each node, even if it leads to more comparison than strictly necessary. The reason is that reading a fixed size from disk is better than reading only what we need (considering that a disk access is 3 to 4 orders of magnitude slower than RAM access, it's always a good idea to cache pages in memory, rather than read data from the disk over and over. And these 3 to 4 orders of magnitude is for SSD... For spinning disks, we are more in the 4 to 5 orders of magnitude !)
Add to that the fact that when *B-tree* were invented, data were stored on magnetic tapes, with a pure sequential access. Reading data sequentially made sense, and reading a block of data was the way to go - thus the page was meant to store more than 2 elements -.
**B-trees** are everywhere : databases, OS, etc. It's a critical data structure when you are to deal with a large set of data.
1.1.1 - Inside a B-tree
A **B-Tree** contains **Nodes** and **Leaves**. A *Node* points to other **Nodes** or **Leaves**. **Nodes** contain only **keys** and references to child **nNodes**, when **Leaves** have **Keys** that are associated with **Values**. The **Nodes**' **Keys* are used to find the right child that will ultimately contain the **Value** we are looking for (if it's present)
Pretty simple !
One last thing : **Keys** are ordered, and this is the condition for the easy and fast retrieval of **Values**.
A few more rules are enforced :
* A **Node** and a **Leaf** contains up to N values (N being generally a power of 2, so 2, 4, 8, 16...).
* You can't have less than N/2 **Values** or **keys** in a **Leaf** or a **Node**, except for the root **Node**.
The second rule allows the **B-tree** to have only one **Leaf**, when up to **N** elements are to be stored. If we want to store one mor element, then the root **Leaf** will be split and the elements will be spread equally in two **Leaves**, with a root **Node$* containing references on those two **Leaves**.
1.1.2 - Concurrent access
The real problem with **B-tree**s is that we need to use some locks to allow concurrent access to the data, especially when some updates occur. The rationale is that when browsing a **B-tree**, changing it will just break the browse operation. There are some technics to avoid having such an issue, up to a point no read locks are needed, assuming some extra information is added to the pages : a pointer to the next page at the same level. Sadly, it comes with drawbacks, like it's difficult to remove empty pages from the **B-tree**.
What we want is a way to guarantee that many users can read data from the **B-tree**, even while the **B-tree** is being updated. Of course, that will heavily favor reads over writes, as writes will have to be serialized, but in many cases, this is a comfortable trade we are ready to make.