| --- |
| layout: post |
| status: PUBLISHED |
| published: true |
| title: Apache Mavibot history |
| id: 58152764-1378-4668-8820-a8d6e25ced73 |
| date: '2017-02-16 23:23:41 -0500' |
| categories: directory |
| tags: |
| - diretcory |
| - mavibot |
| permalink: directory/entry/apache-mavibot-history |
| --- |
| <h1>Introduction</h1> |
| <p> |
| First of all, let me introduce <b>Apache Mavibot</b>: it's a <a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control">MVCC</a> <a href="https://en.wikipedia.org/wiki/B%2B_tree">B+<br /> |
| tree</a> library in Java under an <a href="https://www.apache.org/licenses/LICENSE-2.0">AL 2.0 license</a> (<b>MVCC</b> stands for<br /> |
| <em>Multi-Version Concurrency Control</em>). The whole idea is to have a B-tree<br /> |
| implementation that never crashes, and does not use locks to protect the<br /> |
| data against concurrent access (well … while reading).</p> |
| <p> |
| The B+ tree is a variant of a B-tree, where values are only stored in<br /> |
| the leaves, not in internal nodes.</p> |
| <p> |
| Ok. Good. You don't know much about Mavibot after this introduction, so<br /> |
| I'll dig a bit deeper in this post. Let's start with the original idea.</p> |
| <h2>Apache Directory, CouchDB, and some other databases...</h2> |
| <p> |
| Back in 2009, I was attending the <a href="http://archive.apachecon.com/c/acus2009/">Apache Conference in Oakland</a>. I had been<br /> |
| working on the <a href="http://directory.apache.org">Apache Directory project</a> for a bit more than a 4 years<br /> |
| and a half. <em>Apache Directory</em> is a <b>LDAP</b> server written in Java, and we<br /> |
| chose to store data in B-trees. There was a very limited choice back<br /> |
| then, and the library we used - and still use as of today - was <a href="http://jdbm.sourceforge.net/">JDBM<br /> |
| </a>, a java avatar of <a href="http://www.gnu.org.ua/software/gdbm/">GDBM</a>.</p> |
| <p> |
| <b>JDBM</b> is written in Java, implements <b>B+trees</b> and has transactions support<br /> |
| (experimental), but it has one big drawback: it's not a cross-B-trees<br /> |
| transaction system. And that does not fit our requirement in <b>LDAP</b>.</p> |
| <p> |
| An alternative could have been <b>Berkeley DB &tm;</b>, which released a Java<br /> |
| edition of its database, but its license was incompatible with the AL<br /> |
| 2.0 license. Moreover Berkeley DB was bought by <b>Oracle</b> in 2006, so it was<br /> |
| simply not an option.</p> |
| <h2>What’s wrong with using JDBM?</h2> |
| <p> |
| In <b>LDAP</b>, an update operation impacts one single entry but this entry<br /> |
| uses many <em>AttributeTypes</em>, which can be indexed. In other words, an<br /> |
| update will impact as many B-trees as we have indexes (and a bit more).<br /> |
| In order to guarantee that an entry UPDATE is consistent, we must be<br /> |
| sure that either all or none of the indexes have been flushed to disk:<br /> |
| otherwise we might end with an inconsistent database, where some indexes<br /> |
| are up to date when some other aren't.</p> |
| <p><a href="https://blogs.apache.org/directory/mediaresource/847a2aa6-dce8-4d6f-9a65-79c74219d785"><img src="https://blogs.apache.org/directory/mediaresource/847a2aa6-dce8-4d6f-9a65-79c74219d785" alt="indexeldap.png" height="50%" width="50%"></img></a></p> |
| <p> |
| Even worse, in the event of a crash, we might simply not be able<br /> |
| to restart the server because the database gets corrupted (and<br /> |
| sadly, we are experiencing this problem today...).</p> |
| <p> |
| So in <b>Oakland</b>, I went to the <a href="http://couchdb.apache.org/">Apache Couch-DB</a> presentation (sadly,<br /> |
| the slides are not anymore available), and was struck by the idea behind<br /> |
| their database: <b>MVCC</b>. Crucially when you start to use the<br /> |
| database at a given revision, you always see everything associated with<br /> |
| this revision, up to the point you are done. Sounds like <b>Git</b> or<br /> |
| <b>Subversion</b> … actually, it's pretty much the same mechanism.</p> |
| <p> |
| Being able to process some read operations on a specific version of the<br /> |
| database guarantees that no update will ever corrupt the data being<br /> |
| processed. And every time we want to access the database, the very first<br /> |
| thing it will do is to select the latest available version: this is<br /> |
| all we will see during the operation processing. Perfect when you don't really care about having a fresh view of the stored data at any time, which is the case in <b>LDAP</b>.</p> |
| <p> |
| But <b>Apache CouchDB</b> was written in <b>Erlang</b> :/ Anyway, the discussion we<br /> |
| had with the Directory team was really about moving to a <b>MVCC</b> database.</p> |
| <h2>Transactions</h2> |
| <p> |
| Transactions are another big missing feature in <b>LDAP</b>. This is not something that was<br /> |
| in the air back then: it was specified only <a href="https://tools.ietf.org/html/rfc5805">one year later</a>. Of course, the original specifications said that every operation is atomic, but there is no requirement for multiple operations to be atomic (and we often need to update two entries in <b>LDAP</b>, and to guarantee that those two operations are either completed, or<br /> |
| roll-backed). Think about user/group management...</p> |
| <p> |
| <b>Alex Karasulu</b> always had in mind that we needed a transactional database<br /> |
| in Apache Directory, too. And his point was proved correct when years<br /> |
| later, we faced the first database corruptions. It's a bit sad that we<br /> |
| ignored this aspect for so long :/</p> |
| <p> |
| Anyway, we needed (a) transactions and (b) a rock solid database that could<br /> |
| resist any type of crash.</p> |
| <h2>Locks</h2> |
| <p> |
| For some time, we tried to mitigate the consistency problems we had by<br /> |
| adding tons of locks. As we weren't able to protect the database against<br /> |
| concurrent reads and writes we made them exclusive (i.e.<br /> |
| when some write is processed, no read can be processed). This was<br /> |
| slightly better, but it came at a huge cost: a major slowdown when<br /> |
| writes were done. Also it was not good enough: long-lasting searches<br /> |
| were just killing us, as there were no solution to guarantee that an<br /> |
| entry for which we had a reference would still be present in the database<br /> |
| when we needed to fetch it. In such cases, we simply ended up by<br /> |
| discarding the entry. Last, not least, a crash in the middle of an<br /> |
| update operation would leave the database in a potential inconsistent<br /> |
| state, which would make it impossible to start again (this was somehow<br /> |
| mitigated by adding a 'repair' mode lately, but this is just an horrible<br /> |
| hack).</p> |
| <h2>Mavibot first steps</h2> |
| <p> |
| So we needed something better, which turned out to be <b>Mavibot</b>. We started working<br /> |
| on Mavibot in June 2012 (<a href="http://svn.apache.org/viewvc/labs/mavibot/README.txt?revision=1349589&view=markup&pathrev=1349589">Jun 13 00:04:10 2012, exactly</a>).</p> |
| <p> |
| The funny thing is that <b>OpenLDAP</b> started to work on the exact same kind<br /> |
| of database 1 year before (<a href="http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=commit;h=d620d4368a7ee17d60f2b381d4c80b22c68ba8e2">LMDB</a>) - even if the <a href="http://marc.info/?l=openldap-devel&m=124253456912512">discussion about the<br /> |
| need for such a database started in 2009</a>. Parallel discussions,<br /> |
| parallel developments, we have always shared a lot!</p> |
| <p> |
| The very first released version of <b>Mavibot</b> was out one year later, in<br /> |
| June 2013, followed by 7 other versions (all of them milestones). At<br /> |
| some point, we added a <b>MVBT</b> partition in <b>ApacheDS</b>, in 2.0.0-M13 (and it<br /> |
| was using a SNAPSHOT!!! Mavibot 1.0.0-M1 was used in <b>ApacheDS</b><br /> |
| 2.0.0-M15). This was 'good enough' to have the <b>LDAP</b< server working (and<br /> |
| it was 2 to 5 times faster than <b>JDBM</b>, too ;-), but it didn't offer all<br /> |
| we wanted to add: typically, we didn't have transaction support.</p> |
| <p> |
| So why isn’t <b>Mavibot</b> the <b>Apache Directory Server</b> backend of choice today?</p> |
| <p> |
| Well, we don't have cross B-tree transactions, so we are pretty much in<br /> |
| the same situation as with <b>JDBM</b> (except that it's faster, and we also<br /> |
| have a bulk-loader for <b>Mavibot</b>). Adding cross-B-trees transaction is not<br /> |
| a piece of cake, and it requires some thinking. Sadly, it arrived at a<br /> |
| moment where the team had less time to work on it (new jobs, family,<br /> |
| you name it).</p> |
| <p> |
| So in 2017, the effort has been rebooted, and we do expect to have a<br /> |
| working version soon enough!</p> |
| <p> |
| I'll blog later on about various technical aspects on <b>Apache Mavibot</b>, so keep tuned !</p> |