Title: 7.1 - Logical Structure
NavUp: 7-btree-internals.html
NavUpText: 7 - Mavibot Internals
NavNext: 7.2-physical-storage.html
NavNextText: 7.2 - Physical storage
NavPrev: 7-btree-internals.html
NavPrevText: 7 - Mavibot Internals
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.
http://www.apache.org/licenses/LICENSE-2.0
.
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
# 7.1 - Logical Structure
**Mavibot** stores data in one or more *BTree*s, and defines a few more internal data structures to handle the data and *BTree*s.
There are three ways to use **Mavibot**:
* using in-memory *BTree*s (IN-MEMORY)
* using in-memory *BTree*s that are flushed to disk when closed (PERSISTED)
* storing the *BTree*s on disk, the so-called managed *BTree*s (MANAGED)
## In Memory BTrees
These *BTree*s live only in memory: as soon as the program exits, all the stored data is lost. Their biggest advantage is speed.
Because **Mavibot** handles **MVCC** *BTree*s, each modification copies pages and values, so the *BTree*s can quickly grow and use a lot of memory. On the other hand, copied data that is no longer in use is discarded automatically: once no object using the *BTree* references it, the garbage collector reclaims it, so we don't have to manage these copies ourselves.
The diagram below shows the logical representation of an in-memory *BTree*:
![In-Memory BTree](images/InMemoryBTree.png)
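The copy-on-write behaviour described in this section can be illustrated with a much simpler structure than a real *BTree*. The sketch below is not Mavibot code: `CowTree` and its methods are hypothetical names, and a binary search tree stands in for the *BTree*. It only shows the principle: a modification copies the pages on the path it touches, readers holding an older root keep a consistent view, and copies that nobody references any more are left to the garbage collector.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified stand-in for an MVCC tree (not the Mavibot API). */
final class CowTree {
    /** Immutable node: once built, it is never modified. */
    static final class Node {
        final int key;
        final String value;
        final Node left;
        final Node right;

        Node(int key, String value, Node left, Node right) {
            this.key = key;
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    // Root of the latest revision. A reader grabs it once and keeps using it.
    private final AtomicReference<Node> root = new AtomicReference<>();

    /** Returns the root of the current revision (null for an empty tree). */
    Node snapshot() {
        return root.get();
    }

    /** Creates a new revision. A single writer is assumed in this sketch. */
    void put(int key, String value) {
        root.set(insert(root.get(), key, value));
    }

    // Only the nodes on the path to the key are copied; the rest is shared.
    private static Node insert(Node n, int key, String value) {
        if (n == null) {
            return new Node(key, value, null, null);
        }
        if (key < n.key) {
            return new Node(n.key, n.value, insert(n.left, key, value), n.right);
        }
        if (key > n.key) {
            return new Node(n.key, n.value, n.left, insert(n.right, key, value));
        }
        return new Node(key, value, n.left, n.right);
    }

    /** Looks a key up in the revision identified by the given root. */
    static String get(Node n, int key) {
        while (n != null) {
            if (key == n.key) {
                return n.value;
            }
            n = key < n.key ? n.left : n.right;
        }
        return null;
    }
}
```

A reader that called `snapshot()` before a `put` still sees the old values through its root. Once that root is dropped, the nodes that only it referenced become unreachable and can be collected.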
## Persistent BTrees
A persistent *BTree* is a *BTree* that can be flushed to disk on demand. It is an in-memory *BTree*, but when it is closed, the content of the latest revision is serialized to disk. The data can be loaded back when a persistent *BTree* is opened.
Other than that, there is no difference between an in-memory *BTree* and a persistent *BTree*.
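As an illustration only, the load-on-open and serialize-on-close cycle can be sketched with a plain sorted map. `PersistedMap` is a hypothetical class, and the file layout (an entry count followed by key/value pairs) is an assumption made for this example, not Mavibot's on-disk format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an in-memory structure that is saved on close. */
final class PersistedMap implements AutoCloseable {
    private final Path file;
    private final TreeMap<String, String> data = new TreeMap<>();

    /** Opening loads whatever a previous close() wrote, if anything. */
    PersistedMap(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file) && Files.size(file) > 0) {
            try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
                int count = in.readInt();
                for (int i = 0; i < count; i++) {
                    String key = in.readUTF();
                    data.put(key, in.readUTF());
                }
            }
        }
    }

    // Between open and close, everything happens in memory only.
    void put(String key, String value) {
        data.put(key, value);
    }

    String get(String key) {
        return data.get(key);
    }

    /** Closing serializes the latest content; a crash before this loses it. */
    @Override
    public void close() throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(data.size());
            for (Map.Entry<String, String> e : data.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
    }
}
```

Nothing is written between the open and the close, which is why this mode offers no protection against a crash.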
## Managed BTrees
Managed *BTree*s are very different: data is guaranteed to be preserved on disk after each modification. Even if the program crashes, the disk is guaranteed to contain everything needed to recover the *BTree* to the state it was in just before the crash.
It is important to understand that in managed mode, the *BTree*s of a Mavibot database are not kept entirely in memory. In other words, any node of a *BTree* other than the *root* node may or may not be in memory when it is accessed. **Mavibot** fetches these nodes from disk when needed.
This approach has both pros and cons:
Pros:
* there is no limit on the number of elements one can store in a *BTree*, other than the available disk space
* a *BTree* will always be consistent, even after a crash
* data durability is guaranteed
Cons:
* reads might be costly when the data is not present in memory, because it has to be fetched from disk
* accessing the data on disk requires an extra layer of accessor code, which costs some extra memory
This is a question of tradeoff: depending on the available memory and the level of robustness needed, one may choose an in-memory *BTree*, a persistent *BTree* or a managed one. Most of the time, though, a managed *BTree* is what you want to use.
Also note that an internal cache is used to speed up data access. This cache and its size can be configured.
Managed *BTree*s are stored using *Nodes* and *Leaves*. A *Node* contains only keys and references to underlying nodes or leaves. A *Leaf* contains keys and values. To avoid using too much memory, the references to nodes, leaves, keys and values are stored as offsets, then read and translated to Java objects on demand. For instance, we keep an offset to a key until someone needs to access the key; only then do we deserialize the key and keep it in memory. The same applies to references to nodes, leaves and values.
Here is a picture describing this structure :
![Managed references](images/managedReferences.png)
In this *BTree*, only two pages are present in memory: one node and one leaf. In these pages, the keys are not yet objects; they point into the page's raw data, except for the **D** key and its value, which have been loaded and deserialized.
Here, each element contains an offset and either the byte[] of the serialized value or, if the value has already been accessed, the deserialized value.
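The idea of keeping an offset and deserializing only on first access can be sketched as follows. `LazyStringHolder` is a hypothetical class written for this illustration, and the layout it reads (a 4-byte length followed by UTF-8 bytes) is an assumption, not Mavibot's actual serialization format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder: remembers an offset and deserializes on first access. */
final class LazyStringHolder {
    private final ByteBuffer page; // raw page data, shared with other holders
    private final int offset;      // where this element starts in the page
    private String value;          // stays null until someone asks for it

    LazyStringHolder(ByteBuffer page, int offset) {
        this.page = page;
        this.offset = offset;
    }

    boolean isLoaded() {
        return value != null;
    }

    /** Assumed layout for this sketch: 4-byte length, then UTF-8 bytes. */
    String get() {
        if (value == null) {
            int length = page.getInt(offset);
            byte[] raw = new byte[length];
            // Work on a duplicate so the shared page's position is untouched.
            ByteBuffer view = page.duplicate();
            view.position(offset + Integer.BYTES);
            view.get(raw);
            value = new String(raw, StandardCharsets.UTF_8);
        }
        return value;
    }
}
```

Until `get()` is called, the holder costs only a reference to the shared page and an `int`. After the first call the deserialized object is kept, so later accesses do not touch the raw bytes again.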
### User's BTrees
These are the *BTree*s created by the user; they hold the data.
### Special BTrees
These are the two special *BTree*s used internally to manage the revisions and the copied pages.
#### Revision tree
**Mavibot** uses this tree to keep track of each active revision, so that a search can work with a specific revision. The idea is that a search uses the latest revision available when it starts, but a modification may occur while the search is being processed and create a new revision. We may also sometimes want to keep a revision active for quite a long time.
This revision *BTree* manages the revisions of all the managed *BTree*s.
The key of the revision *BTree* is a combination of the *BTree* name and its revision.
When a revision is no longer used, it can be removed from the revision *BTree*.
Unlike the user *BTree*s, the revision *BTree* is not an **MVCC** *BTree*. In other words, only the latest revision of the revision *BTree* is preserved (i.e., all the modified pages are immediately freed).
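A composite key of this kind can be sketched as below. The class names (`NameRevisionKey`, `RevisionIndex`) are hypothetical, a `TreeMap` stands in for the revision *BTree*, and the stored value (the offset of a root page) is an assumption made for the example.

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Hypothetical composite key: BTree name first, then revision. */
final class NameRevisionKey {
    final String name;
    final long revision;

    NameRevisionKey(String name, long revision) {
        this.name = name;
        this.revision = revision;
    }

    // Keys of the same BTree sort together, ordered by revision.
    static final Comparator<NameRevisionKey> ORDER =
        Comparator.comparing((NameRevisionKey k) -> k.name)
                  .thenComparingLong(k -> k.revision);
}

/** Hypothetical index of active revisions (a TreeMap replaces the BTree). */
final class RevisionIndex {
    private final TreeMap<NameRevisionKey, Long> roots =
        new TreeMap<>(NameRevisionKey.ORDER);

    void register(String name, long revision, long rootOffset) {
        roots.put(new NameRevisionKey(name, revision), rootOffset);
    }

    /** Called when a revision is no longer used by anyone. */
    void release(String name, long revision) {
        roots.remove(new NameRevisionKey(name, revision));
    }

    Long rootOf(String name, long revision) {
        return roots.get(new NameRevisionKey(name, revision));
    }

    /** Highest registered revision for a BTree, or null if there is none. */
    Long latestRevision(String name) {
        NameRevisionKey k = roots.floorKey(new NameRevisionKey(name, Long.MAX_VALUE));
        if (k == null || !k.name.equals(name)) {
            return null;
        }
        return Long.valueOf(k.revision);
    }
}
```

Ordering by name and then by revision keeps all revisions of one *BTree* adjacent, so finding the latest one is a single floor lookup.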
#### Copied pages BTree
Once a new revision is created, the pages that were copied are no longer in use, unless the revisions they are associated with are still in use. These pages cannot be discarded and moved to the free list until the associated revision is released.
A dedicated *BTree* is used to keep track of the copied pages, which will be reclaimed and moved to the free pages list once the associated revision gets released.
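The bookkeeping described here can be sketched in a few lines. This is a simplified model, not Mavibot code: `CopiedPagesTracker` is a hypothetical name, a `HashMap` replaces the dedicated *BTree*, and pages are represented by their offsets only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker: copied pages wait here until their revision is released. */
final class CopiedPagesTracker {
    // revision -> offsets of the pages that were copied for it
    private final Map<Long, List<Long>> copiedByRevision = new HashMap<>();
    // offsets of the pages that can be reused
    private final Deque<Long> freeList = new ArrayDeque<>();

    /** Remembers that a page belonging to the given revision was copied. */
    void recordCopied(long revision, long pageOffset) {
        copiedByRevision.computeIfAbsent(revision, r -> new ArrayList<>()).add(pageOffset);
    }

    /** Moves the pages of a released revision to the free list; returns how many. */
    int release(long revision) {
        List<Long> pages = copiedByRevision.remove(revision);
        if (pages == null) {
            return 0;
        }
        freeList.addAll(pages);
        return pages.size();
    }

    int freeCount() {
        return freeList.size();
    }
}
```

As long as `release` has not been called for a revision, its pages stay out of the free list, so a reader still working on that revision cannot see them overwritten.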
### Managing the free pages
A dedicated mechanism manages the *PageIO*s that are no longer in use: a linked list to which the free pages are added. Whenever one or more new pages are needed, this list is searched first, and as many *PageIO*s as needed are reclaimed from it, until the end of the list is reached. When a page is freed, it is added at the end of the free page list.
Note that only logical pages are released, and a logical page may be stored in several *PageIO*s. These *PageIO*s are already linked together, so adding a logical page to the free page list takes two steps: the last existing free *PageIO* is modified to point to the first *PageIO* of the freed logical page, and the pointer to the last free page is updated to reference the last *PageIO* of that logical page.
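The two pointer updates can be sketched with an in-memory linked list. `FreePageList` and `Block` are hypothetical names: `Block` stands in for a *PageIO*, and object references replace the on-disk offsets that would link real pages.

```java
/** Hypothetical free list: blocks are taken from the front and freed at the end. */
final class FreePageList {
    /** Stand-in for a PageIO: a fixed-size block linked to the next one. */
    static final class Block {
        final long offset;
        Block next;

        Block(long offset) {
            this.offset = offset;
        }
    }

    private Block first; // next block to hand out
    private Block last;  // where freed chains get appended

    /** Appends an already-linked chain (one logical page) to the end of the list. */
    void free(Block head) {
        if (head == null) {
            return;
        }
        Block tail = head;
        while (tail.next != null) {
            tail = tail.next;
        }
        if (last == null) {
            first = head;     // the list was empty
        } else {
            last.next = head; // old last block now points to the freed chain
        }
        last = tail;          // the chain's last block becomes the new last
    }

    /** Takes one block from the front, or returns null when the list is empty. */
    Block allocate() {
        Block b = first;
        if (b == null) {
            return null;
        }
        first = b.next;
        if (first == null) {
            last = null;
        }
        b.next = null;
        return b;
    }

    /** Helper for the example: links blocks at the given offsets into a chain. */
    static Block chain(long... offsets) {
        Block head = null;
        Block tail = null;
        for (long o : offsets) {
            Block b = new Block(o);
            if (head == null) {
                head = b;
            } else {
                tail.next = b;
            }
            tail = b;
        }
        return head;
    }
}
```

Because the freed chain is already linked, `free` only has to touch the old last block and the `last` pointer, however many blocks the logical page spans. In this sketch, a `null` result from `allocate` is where a real store would have to create new pages instead.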