We've now taken the grand tour though the entire project and we can discuss futures.
The discussions here discussed not just how the code works, but why it was created as it was. The project spanned many months during which many issues arose, were discussed, and solutions adopted. For this reason, the project, as it exists, evolved to solve very specific issues that Drill presents.
The path of least resistance is to complete the roll-out of the work up through the CSV and JSON readers, then to upgrade other important readers, uch as Parquet.
Once the core readers are upgraded, attention can turn to other readers and operators as discussed below.
After the present work is committed, two paths become open to the team:
The project chose to start with CSV (actually, the “compliant text”) reader because it is simple and proved the basic concept. The project then tackled JSON because it is the hardest. (As noted, the work identified many unresolved design issues in the JSON reader.) All other readers fall somewhere in between.
Readers where the focus of this project because readers are the largest source of batch size problems. While alternative solutions exist for internal (non-scan) operators, no solution exists for readers since, in general, readers do not know the size of incoming data until it is read.
The logical progression is that each community member upgrades the readers that are important to them. MapR is Drill's primary sponsor, so perhaps MapR upgrades readers that are commercially important to MapR (such as the Parquet reader.) Other teams, at their convenience, upgrade other readers.
The timing is flexible. Once the code is available, upgrades can happen at any time. But, once Drill moves to create an admission control solution, then Drill must be able to manage memory, and all readers must be upgraded or deprecated t that time. If any reader declines to limit its batch size, then the entire query may die because the fragment (and each operator in the fragment) will be forced to handle data beyond the budgeted memory size. The only three solutions at that time will be to either:
Drills internal (non-scan) operators must also be upgraded, but here we have two choices.
Frankly, the “batch sizer” solution is the simplest short-term solution if our focus is exclusively batch size control. (Efforts are underway using this approach.)
The cost of upgrading existing operators will be high because:
The above work is not justified just for batch size control since a simpler lternative exists. Instead, the above work is justified only if we get more value. Some long-term architectural advantages include:
Each of the above deserves far more detail than can be provided here. This should at least provide the flavor of the opportunities available.
These are all complex architectural issues which are not at all obvious from a superficial task-oriented view of the code. For this reason, it may take years to understand the need for, and move toward, the above approaches. This again argues for using simpler (if less robust) solutions short-term.
This project started, in part, from a desire to address memory fragmentation in Drill. The project solves one aspect of fragmentation (limiting vector sizes to the Netty block size or smaller.) But, it does not directly solve a more fundamental issue: that database experts have known since at least the '80s that memory fragmentation is unavoidable in a system that does random allocations or random sized blocks without compaction.
C++ applications are subject to memory fragmentation because the follow the above patterns. Database systems written in C++ (such as Impala) solve the issue by allocing fixed size “buffers” that reside in a “buffer pool.” Database systems have long shown that such a solution works well and avoids fragmentation.
Java applications are not subject to fragmentation because the Java GC performs compaction and coalescing of free blocks. This prevents the issue of fragmentation by ensuring a steady supply of large spans of free memory (until the application allocates so many blocks that it exhausts heap.) Many Java developers (including Drill's) see GC as an unwanted cost, and look for alternatives.
Fixed-size blocks work well for row-based systems: the system simply packs a block with rows until no more fit. Fixed-sized blocks are less obvious for columnar systems. Each column grows independently and each has different sized data. This is why Drill adopted a malloc
-like approach. But, in so doing, Drill became subject to memory fragmentation. Is this a Catch-22?
Drill uses direct memory to avoid GC costs. But, in so doing, Drill gave up the solution that allows random-sized allocations to work: compaction. So, to continue to use direct memory without compaction, Drill may find it necessary to follow most other database systems and adopt a memory model based on fixed-size blocks.
One possible solution (for which a prototype exists) is to rethink the value vector. Today each value vector is represented as a single, variable-sized (but power-of-two sized) block. This leads to two issues:
This model has the advantage of representing vectors as, essentially, Netty byte buffers. But, Netty points to an alternative solution: a composite byte buffer.
In Netty, we can have a logical byte buffer which is actually comprised of a series of underlying buffers. Netty does the required offset translations.
The prototype mentioned earlier does something similar, but with fixed size blocks. Rather than a vector being a single block, it is physically comprised of a string of fixed-size blocks:
Current: [The quick brown fox jumps over the lazy dog.........] Revised: Logical: [The quick brown fox jumps over the lazy dog.] Physical: [The quic][k brown ][fox jump][s over t][he lazy ][dog.....]
The implementation can be based on power-of-two blocks. The example uses 8 bytes, but a real solution would use much larger. (The prototype found that 1 MB blocks amortize costs and are at least as fast as the current solution.) Power-of-two blocks allow a very simple indexing schema: just mask and shift similar to how SV4 vectors work.
The prototype adds a simple caching scheme: writes happen sequentially, so the block reference need be resolved only when a block becomes full.
Once fixed size blocks are implemented, various memory management designs become available. For example, if we decide to give each fragment x bytes of memory, then we are effectively giving the fragment (x / block size) blocks of memory. A block pool for the fragment can recycle those blocks:
Scan --> Filter --> SVR --> Sender ^ ^ | | | | | v v +---- Block pool ---+--------+
Such a memory allocator would be, essentially, lock-free for all but the sender. (Sent blocks would be released in a Netty thread.) Since Drill's current allocator is designed to be lock-free, we would retain this benefit in the new solution.
Further, the allocator becomes far simpler: we no longer need to anage variable-sized blocks at the byte level, no longer need ledgers, no longer need to manage operator allocations. All we need to do is manage the overall fragment memory, which should be driven by the kind of memory management plan discussed at the start.
To change topics just a bit, one of Drill's compelling advantages is that it is written in Java and thus can enable community-provided extensions. Drill already enjoys a variety of community-provided readers. However, there seems to e general agreement that it is far harder than necessary to create a reader. Each reader author must:
One of the background goals of this project is to make it far easier to create a format plugin. This work can't solve all the issues, but it did attempt to do the following:
A good task for a community member would be, once the code is available, to create an example reader using the new framework, then publicize that example to encourage new reader contributions.
And so we come full circle. We started with a desire to define a memory budget for a fragment (and thus a query) and to avoid one form of memory allocation. We worked out the technical mechanisms to do that. We have now seen how those solutions allow us to move to a fixed-size-block structure that sweeps away the remaining memory fragmentation issues, while resulting in a much simpler, proven memory allocation model.
As the team moves forward with this project, please keep this big picture in mind. By doing so, we can continue to march forward towards our common goal of making Drill the best big-data query engine available.