<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="http://crail.incubator.apache.org//feed.xml" rel="self" type="application/atom+xml" /><link href="http://crail.incubator.apache.org//" rel="alternate" type="text/html" /><updated>2020-01-14T11:05:49+01:00</updated><id>http://crail.incubator.apache.org//feed.xml</id><title type="html">The Apache Crail (Incubating) Project</title><entry><title type="html">YCSB Benchmark with Crail on DRAM, Flash and Optane over RDMA and NVMe-over-Fabrics</title><link href="http://crail.incubator.apache.org//blog/2019/10/ycsb.html" rel="alternate" type="text/html" title="YCSB Benchmark with Crail on DRAM, Flash and Optane over RDMA and NVMe-over-Fabrics" /><published>2019-10-09T00:00:00+02:00</published><updated>2019-10-09T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2019/10/ycsb</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/10/ycsb.html">&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Recently, support for Crail has been added to the &lt;a href=&quot;https://github.com/brianfrankcooper/YCSB&quot;&gt;YCSB&lt;/a&gt; benchmark suite. In this blog we describe how to run the benchmark and briefly show some performance comparisons between Crail and other key-value stores running on DRAM, Flash and Optane, such as &lt;a href=&quot;https://www.aerospike.com&quot;&gt;Aerospike&lt;/a&gt; or &lt;a href=&quot;https://ramcloud.atlassian.net/wiki/spaces/RAM/overview&quot;&gt;RAMCloud&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;the-crail-key-value-storage-namespace&quot;&gt;The Crail Key-Value Storage Namespace&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Remember that Crail exports a hierarchical storage namespace where individual nodes in the storage hierarchy can have different types. Supported node types are Directory (Dir), File, Table, KeyValue (KV) and Bag. Each node type has slightly different properties and operations users can execute on them, and each type also restricts the possible node types of its child nodes. For instance, directories offer efficient enumeration of all of their children, but restrict children to be either of type Directory or File. Table nodes allow users to insert or retrieve KV nodes using a PUT/GET API, but restrict the children to be of type KV. All nodes, independent of their type, are identified using path names encoding the location in the storage hierarchy, similar to files and directories in a file system. All nodes further consist of metadata, managed by one of Crail's metadata servers, and an arbitrarily large data set, distributed across Crail's storage servers. For a more detailed description of Crail's node types please refer to our recent &lt;a href=&quot;https://www.usenix.org/conference/atc19/presentation/stuedi&quot;&gt;USENIX ATC'19 &lt;/a&gt; paper.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/ycsb/storage_namespace.svg&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In this blog we focus on Crail's KeyValue API available to users through the Table and KV node types. Creating a new table and inserting a key-value pair into it can be done as follows.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CrailStore fs = CrailStore.newInstance();
CrailTable table = fs.create(&quot;/tab1&quot;, CrailNodeType.TABLE, ..., ...).get();
CrailKeyValue kv = fs.create(&quot;/tab1/key1&quot;, CrailNodeType.KEYVALUE, ..., ...).get();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Here the table's name is &quot;/tab1&quot; and the key of the key-value pair is &quot;key1&quot;. Unlike in a traditional key-value store where the value of a key is defined when inserting the key, in Crail the value of a key consists of an arbitrarily sized, append-only data set, that is, a user may set the value of a key by appending data to it as follows.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CrailOutputStream stream = kv.getOutputStream();
CrailBuffer buf = CrailBuffer.wrap(&quot;data&quot;.getBytes());
stream.append(buf);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Looking up and reading a key-value pair is done in a similar fashion.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CrailStore fs = CrailStore.newInstance();
CrailKeyValue kv = fs.lookup(&quot;/tab1/key1&quot;).get().asKeyValue();
CrailInputStream stream = kv.getInputStream();
CrailBuffer buf = fs.createBuffer();
stream.read(buf);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Note that multiple clients may concurrently try to create a key-value pair with the same name in the same table. In that case Crail provides last-put-wins semantics where the most recently created key-value pair prevails. Clients currently reading or writing a stale key-value pair will be notified about the staleness of their object upon the next data access.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;running-ycsb-with-crail&quot;&gt;Running YCSB with Crail&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
&lt;a href=&quot;https://github.com/brianfrankcooper/YCSB&quot;&gt;YCSB&lt;/a&gt; is a popular benchmark for measuring the performance of key-value stores in terms of PUT/GET latency and throughput for different workloads. We recently contributed support for Crail to the benchmark and we are excited that the Crail binding got accepted and integrated into the benchmark last June. With Crail, users can now run NoSQL workloads over Crail's RDMA and NVMe-over-fabrics storage tiers.
&lt;/p&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In order to run the benchmark simply clone the YCSB repository and build the Crail binding as follows.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;crail@clustermaster:~$ git clone http://github.com/brianfrankcooper/YCSB.git
crail@clustermaster:~$ cd YCSB
crail@clustermaster:~$ mvn -pl com.yahoo.ycsb:crail-binding -am clean package
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
You need to have a Crail deployment up and running to run the YCSB benchmark. Follow the &lt;a href=&quot;https://incubator-crail.readthedocs.io/en/latest&quot;&gt;Crail documentation&lt;/a&gt; if you need help with configuring and deploying Crail. Once Crail is up and accessible, data can be generated and loaded into Crail as follows.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;crail@clustermaster:~$ ./bin/ycsb load crail -s -P workloads/workloada -P large.dat -p crail.enumeratekeys=true &amp;gt;outputLoad.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In this case we are running workload A which is an update-heavy workload. Different workloads with different read/update ratios can be specified using the -P switch. The size of the data -- or more precisely, the number of records to be written -- can be defined via the YCSB property &quot;recordcount&quot;. You can define an arbitrary number of YCSB properties in a file (e.g., &quot;large.dat&quot;, an example is shown below) and pass the name of the file to the YCSB benchmark when loading the data. Note that the Crail YCSB binding will pick up all the Crail configuration parameters defined in &quot;$CRAIL_HOME/crail-site.conf&quot;. In the above example we further use &quot;crail.enumeratekeys=true&quot;, a parameter specific to the Crail YCSB binding that enables enumeration of Crail tables. Enumeration support is convenient as it allows browsing of tables using the Crail command line tools. During actual performance measurements, however, it is recommended to turn enumeration support off as this yields better performance.
&lt;/p&gt;
&lt;/div&gt;
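&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
For illustration, a property file such as the &quot;large.dat&quot; file referenced above could contain the following standard YCSB core properties. The concrete values below are just examples and not the exact settings used in our experiments.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# number of records to load and number of operations to run (example values)
recordcount=100000000
operationcount=10000000
# 10 fields of 100 bytes each, i.e., roughly 1K per key-value pair
fieldcount=10
fieldlength=100
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;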
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
So far we have just loaded the data. Let's now run the actual benchmark which consists of a series of read and update operations.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;crail@clustermaster:~$ ./bin/ycsb run crail -s -P workloads/workloada
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
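&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Other workloads and client thread counts can be selected in the same way. For instance, a run of workload B -- the workload used in the measurements below -- with multiple client threads could look as follows (an illustrative invocation, the thread count is just an example).
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;crail@clustermaster:~$ ./bin/ycsb run crail -s -P workloads/workloadb -P large.dat -threads 8 &amp;gt;outputRun.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;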
&lt;h3 id=&quot;ycsb-benchmark-performance-for-dram--intel-optane&quot;&gt;YCSB Benchmark Performance for DRAM &amp;amp; Intel Optane&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
We ran workload B of the YCSB benchmark using the Crail binding. Workload B has 95% read and 5% update operations and the records are selected with a Zipfian distribution. The two figures below show the update latencies for two different sizes of key-value pairs, 1K (10 fields of 100 bytes per KV pair) and 100K (10 fields of 10KB per KV pair). We show the cumulative distribution function of the latency for both Crail's DRAM/RDMA storage tier and the NVMf storage tier running on Intel Optane SSDs. As a reference we also report the update performance for RAMCloud and Aerospike on the same hardware, that is, RAMCloud on RDMA and Aerospike on Optane. All Crail experiments run in a single-namenode, single-datanode configuration with the YCSB benchmark running on a machine remote to both the namenode and the datanode.
&lt;/p&gt;
&lt;p&gt;
Looking at small KV pairs first (left figure below), we can see that Crail's DRAM storage tier delivers update latencies that are about 5-10us higher than those of RAMCloud for a large fraction of updates. At the same time, Crail's Optane storage tier delivers update latencies that are substantially better than those of Aerospike on Optane, namely, 50us for Crail vs 100us for Aerospike for a large fraction of updates.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/ycsb/ycsb_put.svg&quot; width=&quot;700&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
The reason Crail trails RAMCloud for small KV pairs is that Crail's internal message flow for implementing a PUT operation is more complex than the message flow of a PUT in most other distributed key-value stores. This is illustrated by the two figures below. A PUT operation in Crail (left figure below) essentially requires two metadata operations to create and close the key-value object, and one data operation to write the actual &quot;value&quot;. Involving the metadata server adds flexibility because it allows Crail to dynamically choose the best storage server or the best storage tier for the given value, but it also adds two extra roundtrips compared to a PUT in a traditional key-value store (right figure below). In Crail we have designed the RPC subsystem between clients and metadata servers to be extremely light and low-latency, which in turn allowed us to favor flexibility over the absolute lowest PUT/GET latency. As can be seen, despite two extra roundtrips during a PUT compared to RAMCloud, the overall overhead is only 5-10us.
&lt;/p&gt;
&lt;/div&gt;
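&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
For reference, the snippets from the previous section can be combined into the full PUT sequence sketched below. This is only a sketch: we assume that the default storage and location classes are selected via CrailStorageClass.DEFAULT and CrailLocationClass.DEFAULT, and that closing the output stream corresponds to the close metadata operation described above.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CrailStore fs = CrailStore.newInstance();
// metadata operation 1: create the key-value node (the table &quot;/tab1&quot; must already exist)
CrailKeyValue kv = fs.create(&quot;/tab1/key1&quot;, CrailNodeType.KEYVALUE, CrailStorageClass.DEFAULT, CrailLocationClass.DEFAULT).get().asKeyValue();
// data operation: append the value to the key-value pair
CrailOutputStream stream = kv.getOutputStream();
CrailBuffer buf = CrailBuffer.wrap(&quot;value&quot;.getBytes());
stream.append(buf);
// metadata operation 2 (assumed): closing the stream closes the key-value node and makes it visible to readers
stream.close();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;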
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/ycsb/crail_put_anatomy.svg&quot; width=&quot;550&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
As for Crail's superior Optane performance compared to Aerospike (in the CDF plot above), there are two main reasons for this result. First, Aerospike uses synchronous I/O and multiple I/O threads, which cause contention and spend a significant amount of execution time in synchronization functions. Crail on the other hand uses asynchronous I/O and executes I/O requests in one context, avoiding context switching and synchronization completely. Second, Crail uses the NVMe-over-fabrics protocol which is based on RDMA and eliminates data copies at both the client and server ends and generally reduces the code path that is executed during PUT/GET operations.
&lt;/p&gt;
&lt;p&gt;
One observation from the above YCSB experiment is that as we move toward larger key-value sizes (right figure above), Crail's update latency for DRAM is substantially lower than the update latency of RAMCloud. We believe that Crail's use of one-sided RDMA operations and the fact that data is placed directly into application target buffers are key factors that play into this result. Both factors reduce data copies, an important optimization as the data gets bigger. In contrast to Crail, RAMCloud uses two-sided RDMA operations and user data is always copied into separate I/O buffers, which is costly for large key-value pairs.
&lt;/p&gt;
&lt;p&gt;
Besides update latency, we are also showing read latency in the CDF figure below. Here Crail's absolute and relative performance compared to the latency of RAMCloud and Aerospike looks even better than in the case of updates. The main reason is that, compared to a PUT, Crail's internal message flow for a GET is simpler, consisting of a metadata lookup and a data read but no close operation. Consequently, Crail's read latency for small KV pairs (1K) almost matches the read latency of RAMCloud.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/ycsb/ycsb_get.svg&quot; width=&quot;700&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;pushing-the-throughput-limits&quot;&gt;Pushing the throughput limits&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
When measuring the latency performance of a system, what you actually want to see is how the latency is affected as the system gets increasingly loaded. The YCSB benchmark is based on a synchronous database interface for updates and reads, which means that in order to create high system load one essentially needs a large number of threads and, most likely, a large number of machines. Crail, on the other hand, has an asynchronous interface and it is relatively straightforward to manage multiple simultaneous outstanding operations per client.
&lt;/p&gt;
&lt;p&gt;
We used Crail's asynchronous API to benchmark Crail's key-value performance under load. In a first set of experiments, we increase the number of clients from 1 to 64 but each client always only has one outstanding PUT/GET operation in flight. The two figures below show the latency (shown on the y-axis) of Crail's DRAM, Optane and Flash tiers under increasing load measured in terms of operations per second (shown on the x-axis). As can be seen, Crail delivers stable latencies up to a reasonably high throughput. For DRAM, the get latencies stay at 12-15μs up to 4M IOPS, at which point the metadata server became the bottleneck (note: Crail's metadata plane can be scaled out by adding more metadata servers if needed). For the Optane NVM configuration, latencies stay at 20μs up until almost 1M IOPS, which is very close to the device limit (we have two Intel Optane SSDs in a single machine). The Flash latencies are higher but the Samsung drives combined (we have 16 Samsung drives in 4 machines) also have a higher throughput limit. In fact, 64 clients with queue depth 1 could not saturate the Samsung devices.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/ycsb/iops_qd1.svg&quot; width=&quot;650&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
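&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
The sketch below illustrates, at a high level, how a single client can keep multiple operations in flight using Crail's asynchronous API. We assume here that the asynchronous read returns a Java future of type CrailResult; issueNextGet and moreKeysToRead are hypothetical helpers standing in for the lookup-and-read logic and the termination condition of the benchmark.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// sketch: keep a fixed number of GET operations in flight (queue depth)
int queueDepth = 4;
LinkedList&amp;lt;Future&amp;lt;CrailResult&amp;gt;&amp;gt; inflight = new LinkedList&amp;lt;&amp;gt;();
for (int i = 0; i &amp;lt; queueDepth; i++) {
  inflight.add(issueNextGet());   // hypothetical helper: looks up the next key and issues an asynchronous read
}
while (moreKeysToRead()) {        // hypothetical helper: true while the workload has keys left
  inflight.poll().get();          // wait for the oldest outstanding operation to complete
  inflight.add(issueNextGet());   // refill the queue to keep queueDepth operations in flight
}
for (Future&amp;lt;CrailResult&amp;gt; f : inflight) {
  f.get();                        // drain the remaining outstanding operations
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;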
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In order to generate a higher load, we measured throughput and latencies for the case where each client always has four operations in flight. As shown in the figure below, using a queue depth of 4 generally achieves a higher throughput up to a point where the hardware limit is reached, the device queues are overloaded (e.g., for NVM Optane) and latencies skyrocket. For instance, just before the exponential increase in latency, Crail delivers GET latencies of 30.1μs at 4.2M IOPS (DRAM), 60.7μs for 1.1M IOPS (Optane), and 99.86μs for 640.3K IOPS (Flash). The situation for PUT is similar (not shown), though generally with lower performance.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/ycsb/iops_qd4.svg&quot; width=&quot;650&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Previous blog posts have mostly focused on Crail's file system interface. In this blog we gave a brief overview of the key-value interface in Crail and showed some performance results using the Crail YCSB binding that got added to the YCSB benchmark suite earlier this year. The results indicate that Crail offers performance comparable or superior to other state-of-the-art key-value stores for DRAM, NVM (Intel Optane) and Flash. Moreover, measurements have shown that Crail provides stable latencies very close to the hardware limits even under high load, serving millions of operations per second.
&lt;/p&gt;
&lt;/div&gt;</content><author><name>Patrick Stuedi and Jonas Pfefferle</name></author><category term="blog" /><summary type="html">Recently, suppport for Crail has been added to the YCSB benchmark suite. In this blog we describe how to run the benchmark and briefly show some performance comparisons between Crail and other key-value stores running on DRAM, Flash and Optane such as Aerospike or RAMCloud.</summary></entry><entry><title type="html">Atc</title><link href="http://crail.incubator.apache.org//blog/2019/08/atc.html" rel="alternate" type="text/html" title="Atc" /><published>2019-08-05T00:00:00+02:00</published><updated>2019-08-05T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2019/08/atc</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/08/atc.html">&lt;p&gt;Paper on Crail’s system architecture was presented at &lt;a href=&quot;https://www.usenix.org/conference/atc19/presentation/stuedi&quot;&gt;USENIX ATC’19&lt;/a&gt;&lt;/p&gt;</content><author><name></name></author><category term="news" /><summary type="html">Paper on Crail’s system architecture was presented at USENIX ATC’19</summary></entry><entry><title type="html">Ycsb</title><link href="http://crail.incubator.apache.org//blog/2019/06/ycsb.html" rel="alternate" type="text/html" title="Ycsb" /><published>2019-06-11T00:00:00+02:00</published><updated>2019-06-11T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2019/06/ycsb</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/06/ycsb.html">&lt;p&gt;Crail is now part of the &lt;a href=&quot;https://github.com/brianfrankcooper/YCSB&quot;&gt;YCSB benchmark&lt;/a&gt; suite&lt;/p&gt;</content><author><name></name></author><category term="news" /><summary type="html">Crail is now part of the YCSB benchmark suite</summary></entry><entry><title type="html">Strata</title><link href="http://crail.incubator.apache.org//blog/2019/04/strata.html" rel="alternate" type="text/html" title="Strata" /><published>2019-04-11T00:00:00+02:00</published><updated>2019-04-11T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2019/04/strata</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/04/strata.html">&lt;p&gt;Slides from Oreilly’s Strata talk on Crail are now &lt;a href=&quot;https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/71902&quot;&gt;available&lt;/a&gt;&lt;/p&gt;</content><author><name></name></author><category term="news" /><summary type="html">Slides from Oreilly’s Strata talk on Crail are now available</summary></entry><entry><title type="html">Deployment Options for Tiered Storage Disaggregation</title><link href="http://crail.incubator.apache.org//blog/2019/03/deployment.html" rel="alternate" type="text/html" title="Deployment Options for Tiered Storage Disaggregation" /><published>2019-03-13T00:00:00+01:00</published><updated>2019-03-13T00:00:00+01:00</updated><id>http://crail.incubator.apache.org//blog/2019/03/deployment</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/03/deployment.html">&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In the last &lt;a href=&quot;http://crail.incubator.apache.org/blog/2019/03/disaggregation.html&quot;&gt;blog post&lt;/a&gt; we discussed the basic design of the Crail disaggregated shuffler as well as its performance under different configurations for two workloads. In this short follow-up blog we briefly describe the various options in Crail for deploying disaggregated storage.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;mixing-disaggregated-and-co-located-configurations&quot;&gt;Mixing Disaggregated and Co-located Configurations&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In a traditional &quot;non-disaggregated&quot; Crail deployment, the Crail storage servers are deployed co-located with the compute nodes running the data processing workloads. By contrast, a disaggregated Crail deployment refers to a setup where the Crail storage servers -- or more precisely, the storage resources exposed by the storage servers -- are separated (via a network) from the compute servers running the data processing workloads. Storage disaggregation may be implemented at the level of an entire data center (by provisioning dedicated compute and storage racks), or at the level of individual racks (by dedicating some nodes in a rack exclusively for storage).
&lt;/p&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Remember that Crail is a tiered storage system where each storage tier consists of a subset of storage servers. Crail permits each storage tier (e.g., RDMA/DRAM, NVMf/flash, etc.) to be deployed and configured independently. This means we can decide to disaggregate one storage tier but use a co-located setup for another tier. For instance, it is more natural to disaggregate the flash storage tier than to disaggregate the memory tier. High-density all-flash storage enclosures are commonly available and often provide NVMe over Fabrics (NVMf) connectivity, thus, exposing such a flash enclosure in Crail is straightforward. High-density memory servers, on the other hand (e.g., AWS x1e.32xlarge), would be wasted if we were not using the CPU to run memory intensive computations. Exporting the memory of compute servers into Crail, however, may still make sense as it allows any server to operate on remote memory as soon as it runs out of local memory. The figure below illustrates three possible configurations of Crail for a single rack deployment:
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/deployment/three_options.svg&quot; width=&quot;580&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Non-disaggregated (left): each compute server exports some of its local DRAM and flash into Crail by running one Crail storage server instance for each storage type.&lt;/li&gt;
&lt;li&gt;Complete disaggregation (middle): the compute servers do not participate in Crail storage. Instead, dedicated storage servers for DRAM and flash are deployed. The storage servers export their storage resources into Crail by running corresponding Crail storage servers.&lt;/li&gt;
&lt;li&gt;Mixed disaggregation (right): each compute server exports some of its local DRAM into Crail. The Crail storage space is then augmented by disaggregated flash.&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Remember that a Crail storage server is entirely a control path entity, responsible for (a) registering storage resources (and corresponding access endpoints) with Crail metadata servers, and (b) monitoring the health of the storage resources and reporting this information to the Crail metadata servers. Therefore, a storage server does not necessarily need to run co-located with the storage resource it exports. For instance, one may export an all-flash storage enclosure in Crail by deploying a Crail storage server on one of the compute nodes.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;fine-grained-tiering-using-storage-and-location-classes&quot;&gt;Fine-grained Tiering using Storage and Location Classes&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In all of the previously discussed configurations there is a one-to-one mapping between storage media type and storage tier. There are situations, however, where it can be useful to configure multiple storage tiers of a particular media type. For instance, consider a setup where the compute nodes have access to disaggregated flash (e.g., on a remote rack) but are also attached to some amount of local flash. In this case, you may want to prioritize the use of flash in the same rack over disaggregated flash in a different rack. And of course you also want to prioritize DRAM over any flash if DRAM is available. The way this is done in Crail is through storage and location classes. A reasonable configuration would be to create three storage classes. The first storage class contains the combined DRAM of all compute nodes, the second storage class contains all of the local flash, and the third storage class represents disaggregated flash. The figure below illustrates such a configuration with three storage classes in a simplified single-rack deployment.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/deployment/storage_class.svg&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Storage classes can easily be defined in the slaves file as follows (see the &lt;a href=&quot;https://incubator-crail.readthedocs.io/en/latest/config.html#storage-tiers&quot;&gt;Crail documentation&lt;/a&gt; for details):
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;crail@clustermaster:~$ cat $CRAIL_HOME/conf/slaves
clusternode1 -t org.apache.crail.storage.rdma.RdmaStorageTier -c 0
clusternode2 -t org.apache.crail.storage.rdma.RdmaStorageTier -c 0
clusternode1 -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 1
clusternode2 -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 1
disaggnode -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
One can also manually attach a storage server to a particular storage class:
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;crail@clusternode2:~$ $CRAIL_HOME/bin/crail datanode -t org.apache.crail.storage.nvmf.NvmfStorageTier -c 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Remember that the storage class ID is implicitly ordering the storage tiers. During writes, Crail either allocates blocks from the highest priority tier that has free space, or from a specific tier if explicitly requested. The following timeline shows a set of Crail operations and a possible resource allocation in a 3-tier Crail deployment (abbreviations: &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;W-A&lt;/font&gt; refers to the create and write operation of a file &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;A&lt;/font&gt;, &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;D-A&lt;/font&gt; refers to the deletion of file &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;A&lt;/font&gt;). Note that at time &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;t10&lt;/font&gt; the system runs out of DRAM space across the entire rack, forcing file &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;C&lt;/font&gt; to be partially allocated in local flash. At time &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;t11&lt;/font&gt; the system runs out of tier 1 storage, forcing file &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;D&lt;/font&gt; to be partially allocated in disaggregated flash. Subsequently, a set of delete operations (time &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;t13&lt;/font&gt; and &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;t14&lt;/font&gt;) free up space in tier 0, allowing file &lt;font face=&quot;Courier&quot; color=&quot;blue&quot;&gt;F&lt;/font&gt; to be allocated in DRAM.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/deployment/timeline.svg&quot; width=&quot;300&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
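&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Explicitly requesting a specific tier can be sketched as follows, using the storage classes from the figure above. Note that this is only a sketch under assumptions: we assume that CrailStorageClass.get(int) maps a storage class ID to the corresponding storage class object and that CrailLocationClass.DEFAULT selects the default location class.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// request allocation from storage class 1 (local flash in the figure above) instead of the highest priority tier
CrailFile onFlash = fs.create(&quot;/tmp-flash.dat&quot;, CrailNodeType.DATAFILE, CrailStorageClass.get(1), CrailLocationClass.DEFAULT).get().asFile();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;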
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
If applications want to further prioritize the specific local resource of a machine over any other resource in the same storage class they can do so via the location class parameter when creating an object in Crail.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CrailLocationClass local = fs.getLocationClass();
CrailFile file = fs.create(&quot;/tmp.dat&quot;, CrailNodeType.DATAFILE, CrailStorageClass.DEFAULT, local).get().asFile();
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In this case, Crail would first try to allocate storage blocks local to the client machine. Note also that the location class preference is always weighted lower than the storage class preference, therefore Crail would still prioritize a remote block over a local block if the remote block is part of a higher priority storage class. In any case, if no local block can be found, Crail falls back to the default policy of filling up storage tiers in their order of preference.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;resource-provisioning&quot;&gt;Resource Provisioning&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
During the deployment of Crail, one has to decide on the storage capacity of each individual storage tier or storage class, which is a non-trivial task. One approach is to provision sufficient capacity to make sure that under normal operation the storage demands can be served by the highest-performing storage class, and then allocate additional resources in the local and disaggregated flash tiers to absorb the peak storage demands.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/deployment/resource_provisioning.svg&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Ideally, we would want individual storage tiers to be elastic in a way that storage capacities can be adjusted dynamically (and automatically) based on the load. Currently, Crail does not provide elastic storage tiers (adding storage servers on the fly is always possible, but not removing). A recent research project has been exploring how to build elastic storage in the context of serverless computing and in the future we might integrate some of these ideas into Crail as well. Have a look at the &lt;a href=&quot;https://www.usenix.org/system/files/osdi18-klimovic.pdf&quot;&gt;Pocket OSDI'18&lt;/a&gt; paper for more details or check out the system at &lt;a href=&quot;https://github.com/stanford-mast/pocket&quot;&gt;https://github.com/stanford-mast/pocket&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In this blog we discussed various configuration options in Crail for deploying tiered disaggregated storage. Crail allows mixing traditional non-disaggregated storage with disaggregated storage in a single storage namespace and is thereby able to seamlessly absorb peak storage demands while offering excellent performance during regular operation. Storage classes and location classes in Crail further provide fine-grained control over how storage resources are provisioned and allocated. In the future, we are considering making resource provisioning in Crail dynamic and automatic, similar to &lt;a href=&quot;https://www.usenix.org/system/files/osdi18-klimovic.pdf&quot;&gt;Pocket&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;</content><author><name>Patrick Stuedi</name></author><category term="blog" /><summary type="html">In the last blog post we discussed the basic design of the Crail disaggregated shuffler as well as its performance under different configurations for two workloads. In this short follow-up blog we briefly describe the various options in Crail for deploying disaggregated storage.</summary></entry><entry><title type="html">Shuffle Disaggregation using RDMA accessible remote DRAM and NVMe Flash</title><link href="http://crail.incubator.apache.org//blog/2019/03/disaggregation.html" rel="alternate" type="text/html" title="Shuffle Disaggregation using RDMA accessible remote DRAM and NVMe Flash" /><published>2019-03-04T00:00:00+01:00</published><updated>2019-03-04T00:00:00+01:00</updated><id>http://crail.incubator.apache.org//blog/2019/03/disaggregation</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/03/disaggregation.html">&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
One of the goals of Crail has always been to enable efficient storage disaggregation for distributed data processing workloads. Separating storage from compute resources in a cluster is known to have several interesting benefits. For instance, it allows storage resources to scale independently from compute resources, or to run storage systems on specialized hardware (e.g., storage servers with a weak CPU attached to a fast network) for better performance and reduced cost. Storage disaggregation also simplifies system maintenance as one can upgrade compute and storage resources at different cycles.
&lt;/p&gt;
&lt;p&gt;
Today, data processing applications running in the cloud may implicitly use disaggregated storage through cloud storage services like S3. For instance, it is not uncommon for map-reduce workloads in the cloud to use S3 instead of HDFS for storing input and output data. While Crail can offer high-performance disaggregated storage for input/output data as well, in this blog we specifically look at how to use Crail for efficient disaggregation of shuffle data.
&lt;/p&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/spark_disagg.svg&quot; width=&quot;580&quot; /&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;p&gt;
Generally, the arguments for disaggregation hold for any type of data including input, output and shuffle data. One aspect that makes disaggregation of shuffle data particularly interesting is that compute nodes generating shuffle data now have access to ''infinitely'' large pools of storage resources, whereas in traditional ''non-disaggregated'' deployments the amount of shuffle data per compute node is bound by local resources.
&lt;/p&gt;
&lt;p&gt;
Recently, there has been an increased interest in disaggregating shuffle data. For instance, &lt;a href=&quot;https://dl.acm.org/citation.cfm?id=3190534&quot;&gt;Riffle&lt;/a&gt; -- a research effort driven by Facebook -- is a shuffle implementation for Spark that is capable of operating in a disaggregated environment. Facebook's disaggregated shuffle engine has also been presented at &lt;a href=&quot;https://databricks.com/session/sos-optimizing-shuffle-i-o&quot;&gt;SparkSummit'18&lt;/a&gt;. In this blog, we discuss how Spark shuffle disaggregation is done in Crail using RDMA accessible remote DRAM and NVMe flash. We occasionally also point out certain aspects where our approach differs from Riffle and Facebook's disaggregated shuffle manager. Some of the material shown in this blog has also been presented in previous talks on Crail, namely at &lt;a href=&quot;https://databricks.com/session/running-apache-spark-on-a-high-performance-cluster-using-rdma-and-nvme-flash&quot;&gt;SparkSummit'17&lt;/a&gt; and &lt;a href=&quot;https://databricks.com/session/serverless-machine-learning-on-modern-hardware-using-apache-spark&quot;&gt;SparkSummit'18&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;overview&quot;&gt;Overview&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In a traditional shuffle operation, data is exchanged between map and reduce tasks using direct communication. For instance, in a typical Spark deployment map tasks running on worker machines write data to a series of local files -- one per task and partition -- and reduce tasks later on connect to all of the worker machines to fetch the data belonging to their associated partition. By contrast, in a disaggregated shuffle operation, map and reduce tasks exchange data with each other via a remote shared storage system. In the case of Crail, shuffle data is organized hierarchically, meaning, each partition is mapped to a separate directory (the directory is actually a ''MultiFile&quot; as we will see later). Map tasks dispatch data tuples to files in different partitions based on a key, and reduce tasks eventually read all the data in a given partition or directory (each reduce task is associated with exactly one partition).
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/overview.svg&quot; width=&quot;350&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 id=&quot;challenge-large-number-of-small-files&quot;&gt;Challenge: Large Number of Small Files&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
One of the challenges with shuffle implementations in general is the large number of objects they have to deal with. The number and size of shuffle files depend on the workload, but also on the configuration, in particular on the number of map and reduce tasks in a stage of a job. The number of tasks in a job is often indirectly controlled through the partition size specifying the amount of data each task is operating on. Finding the optimal partition size for a job is difficult and often requires manual tuning. What you want is a partition size small enough to generate enough tasks to exploit all the parallelism available in the cluster. In theory a small partition size and therefore a large number of small tasks helps to mitigate stragglers, a major issue for distributed data processing frameworks.
&lt;/p&gt;
&lt;p&gt;
Unfortunately, a small partition size often leads to a large number of small shuffle files. As an illustration, we
performed a simple experiment where we measured the size distribution of Spark shuffle (and broadcast) files of individual tasks when executing (a) PageRank on the Twitter graph, and (b) SQL queries on a TPC-DS dataset. As shown in the figure below, the size of the shuffle data varies widely, ranging from a few bytes to a few GBs per compute task.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/cdf-plot.svg&quot; width=&quot;480&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
From an I/O performance perspective, writing and reading large numbers of small files is much more challenging than, let's say, dealing with a small number of large files. This is true in a 'non-disaggregated' shuffle operation, but even more so in a disaggregated shuffle operation where I/O requests include both networking and storage.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;per-core-aggregation-and-parallel-fetching&quot;&gt;Per-core Aggregation and Parallel Fetching&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
To mitigate the overheads of writing and reading large numbers of small data sets, the disaggregated Crail shuffler implements two simple optimizations. First, subsequent map tasks running on the same core append shuffle data to a per-core set of files. The number of files in a per-core set corresponds to the number of partitions. Logically, the files in a per-core set are actually part of different directories (one directory per partition as discussed before). For instance, the blue map tasks running on the first core all append their data to the same set of files (marked blue) distributed over the different partitions, and the same holds for the light blue and the white tasks. Consequently, the number of files in the system depends on the number of cores and the number of partitions, but not on the number of tasks.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/optimization_tasks.svg&quot; width=&quot;380&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
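&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
To make the layout more concrete, the sketch below shows how a map task could pick the per-core file for a given record. The path scheme and the helper variables (numPartitions, coreId) are purely illustrative and not the actual naming used by the Crail shuffler.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// illustrative sketch: one directory per partition, one file per core within each partition
int partition = Math.floorMod(key.hashCode(), numPartitions);          // target partition for this record
String path = &quot;/shuffle/partition_&quot; + partition + &quot;/core_&quot; + coreId;   // hypothetical path scheme
CrailFile file = fs.lookup(path).get().asFile();                       // per-core file inside the partition directory
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;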
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
The second optimization we use in our disaggregated Crail shuffler is efficient parallel reading of entire partitions using Crail MultiFiles. One problem with large numbers of small files is that it makes efficient parallel reading difficult, mainly because the small file size limits the number of in-flight read operations a reducer can issue on a single file. One may argue that we don't necessarily need to parallelize the reading of a single file. As long as we have large numbers of files we can instead read different files in parallel. The reason this is inefficient is that we want the entire partition to be available at the reducer in a virtually contiguous memory area to simplify sorting (which is often required). If we were to read multiple files concurrently we either have to temporarily store the received data of a file and later copy the data to the right place within the contiguous memory area, or, if we want to avoid copying data, we can directly instruct the readers to receive the data at the correct offset, which leads to random writes and cache thrashing at the reducer. Both copying data and cache thrashing are a concern at a network speed of 100 Gb/s or more.
&lt;/p&gt;
&lt;p&gt;
Crail MultiFiles offer zero-copy parallel reading of large numbers of files in a sequential manner. From the perspective of a map task, MultiFiles are flat directories consisting of files belonging to different per-core file sets. From a reducer perspective, a MultiFile looks like a large file that can be read sequentially using many in-flight operations. For instance, the following code shows how a reduce task during a Crail shuffle operation reads a partition from remote storage.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CrailStore fs = CrailStore.newInstance();
CrailMultiFile multiFile = fs.lookup(&quot;/shuffle/partition1&quot;).get().asMultiFile();
ByteBuffer buffer = ByteBuffer.allocateDirect((int) multiFile.size());
int batchSize = 16;
CrailBufferedInputStream stream = multiFile.getMultiStream(batchSize);
while (stream.read(buffer) &amp;gt; 0);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Internally, a MultiFile manages multiple streams to different files and maintains a fixed number of active in-flight operations at all times (except when the stream reaches its end). The number of in-flight operations is controlled via the batch size parameter which is set to 16 in the above example.
&lt;/p&gt;
&lt;p&gt;
&lt;strong&gt;Comparison with Riffle&lt;/strong&gt;: In contrast to Riffle, the Crail shuffler does not merge files at the map stage. There are two reasons for this. First, even though Riffle shows how to overlap merge operations with map tasks, there is always a certain number of merge operations at the end of a map task that cannot be hidden effectively. Those merge operations delay the map phase. Second, in contrast to Riffle which assumes commodity hardware with low network bandwidth and low metadata throughput (file &quot;open&quot; requests per second), the Crail shuffler is built on top of Crail which offers high bandwidth and high metadata throughput implemented over fast network and storage hardware. As shown in previous posts, Crail supports millions of file metadata operations per second and provides a random read bandwidth for small I/O sizes that is close to 100 Gb/s. At such a storage performance, any optimization involving the CPU is typically just adding overhead. For instance, the Crail shuffler also does not compress shuffle data as compression rates of common compression libraries (LZ4, Snappy, etc.) are lower than the read/write bandwidth of Crail.
&lt;/p&gt;
&lt;p&gt;
The Crail shuffler also differs from Riffle with regard to how file indexes are managed. Riffle, just like the vanilla Spark shuffler, relies on the Spark driver to map partitions to sets of files and requires reduce tasks to interact with the driver while reading the partitions. In contrast, the Crail shuffler totally eliminates the Spark driver from the loop by encoding the mapping between partitions and files implicitly using the hierarchical namespace of Crail.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;robustness-against-machine-skew&quot;&gt;Robustness Against Machine Skew&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Shuffle operations, being essentially barriers between compute stages, are highly sensitive to task runtime variations. Runtime variations may be caused by skew in the input data which leads to variations in partition size, meaning, different reduce tasks get assigned different amounts of data. Dealing with data skew is tricky and typically requires re-partitioning of the data. Another cause of task runtime variation is machine skew. For example, in a heterogeneous cluster some machines are able to process more map tasks than others, thus, generating more data. In traditional non-disaggregated shuffle operations, machines hogging large amounts of shuffle data after the map phase quickly become the bottleneck (links marked red below) during the all-to-all network transfer phase.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/machine_skew.svg&quot; width=&quot;520&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
One way to deal with this problem is through weighted fair scheduling of network transfers, as shown &lt;a href=&quot;https://dl.acm.org/citation.cfm?id=2018448&quot;&gt;here&lt;/a&gt;. But doing so requires global knowledge of the amounts of data to transfer and also demands fine grained scheduling of network transfers. In contrast to a traditional shuffle, the Crail disaggregated shuffler naturally is more robust against machine skew. Even though some machines generate more shuffle data than others during the map phase, the resulting temporary data residing on disaggregated storage is still evenly distributed across storage servers. This is because shuffle data stored on Crail is chopped up into small blocks that get distributed across the servers. Naturally, different storage servers are serving roughly equal amounts of data during the reduce phase.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/machine_skew_crail.svg&quot; width=&quot;520&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
One may argue that chopping shuffle data up into blocks and transferring them over the network does not come for free. However, as it turns out, it does almost come for free. The bandwidth of the Spark I/O pipeline in a map task running on a single core is dominated by the serialization speed. The serialization speed depends on the serializer (e.g., Java serializer, Kryo, etc.), the workload and the object types that need to be serialized. In our measurements we found that even for rather simple object types, like byte arrays, the serialization bandwidth of Kryo -- one of the fastest serializers available -- was on the order of 2-3 Gb/s per core and didn't scale to more than 40 Gb/s aggregated bandwidth on a 16 core machine. The I/O bandwidth in Crail, however, is already at close to 100 Gb/s for a client running on a single core which means that we can write data out to remote storage faster than serialization produces data. In addition, all write (and read) operations in Crail are truly asynchronous and overlap with the map tasks. More precisely, during a map task the RDMA NIC fetches serialized data from host memory and transmits it over the wire while the CPU can continue to serialize the next chunk of data.
&lt;/p&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
&lt;strong&gt;Loadbalancing:&lt;/strong&gt; While shuffle disaggregation mitigates machine skew by distributing data evenly across storage servers, the set of storage blocks read by clients (reduce tasks) at a given time may not always be evenly distributed among the servers. For instance, as shown in the figure below (left part), one of the servers concurrently serves two clients, while the other server only serves one client. Consequently, the bandwidth of the first two clients is bottlenecked by the server whereas the bandwidth of the last client is not. To mitigate the effect of uneven allocation of bandwidth, it is important to maintain a sufficiently large queue depth (number of in-flight read operations) at the clients. In the example below, a queue depth of two is sufficient to make sure neither of the two clients is bottlenecked by the server (right part in the figure below).
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/loadbalancing.svg&quot; width=&quot;520&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Note that Crail disaggregated storage may be provided by a few highly dense storage nodes (e.g., a high density flash enclosure) or by a larger group of storage servers exposing their local DRAM or flash. We will discuss different deployment modes of Crail disaggregated storage in the next blog post.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;disaggregated-spark-map-reduce-sorting&quot;&gt;Disaggregated Spark Map-Reduce (Sorting)&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt; Let's look at some performance data. In the first experiment we measure the runtime of a simple Spark job sorting 200G of data on an 8-node cluster. We compare the performance of different configurations. In the disaggregated configuration the Crail disaggregated shuffler is used, storing shuffle data on disaggregated Crail storage. In this configuration, Crail is deployed on a 4-node storage cluster connected to the compute cluster over a 100 Gb/s RoCE network. As a direct comparison to the disaggregated configuration we also measure the performance of a co-located setup that also uses the Crail shuffler but deploys the Crail storage platform co-located on the compute cluster. The disaggregated and the co-located configurations are shown for both DRAM only and NVMe only. As a reference we also show the performance of vanilla Spark in a non-disaggregated configuration where the shuffle directory is configured to point to a local NVMe mountpoint.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/terasort.svg&quot; width=&quot;450&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
The main observation from the figure is that there is almost no performance difference between the Crail co-located and the Crail disaggregated configurations, which means we can disaggregate shuffle data at literally no performance penalty. In fact, the performance improves slightly in the disaggregated configuration because more CPU cycles are available to execute the Spark workload, cycles that in the co-located setup are used for local storage processing. Generally, storing shuffle data in DRAM is about 25% faster than storing shuffle data on NVMe (matching the results of a previous &lt;a href=&quot;http://crail.incubator.apache.org/blog/2017/08/crail-nvme-fabrics-v1.html&quot;&gt;blog&lt;/a&gt;). The effect, however, is independent from whether Crail is deployed in a co-located or in a disaggregated mode. Also note that using the Crail shuffler generally is about 2-3x faster than vanilla Spark, which is the result of faster I/O as well as improved serialization and sorting as discussed in a previous &lt;a href=&quot;http://crail.incubator.apache.org/blog/2017/01/sorting.html&quot;&gt;blog&lt;/a&gt;.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;disaggregated-spark-sql&quot;&gt;Disaggregated Spark SQL&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
Next we look at Spark SQL performance in a disaggregated configuration. Again we are partitioning our cluster into two separate silos of compute (8 nodes) and storage (4 nodes), but this time we also investigate the effect of different network speeds and network software stacks when connecting the compute cluster to the storage cluster. The Spark SQL job (TPC-DS, query #87) further differs from the I/O heavy sorting job in that it contains many shuffle phases but each shuffle phase is light on data, thus, stressing latency aspects of shuffle disaggregation more than raw bandwidth aspects.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/disaggregation/sql.svg&quot; width=&quot;420&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
The first bar from the left in the figure above shows, as a reference, the runtime of the SQL job using vanilla Spark in a non-disaggregated configuration. Note that the vanilla Spark configuration is using the 100 Gb/s Ethernet network available in the compute cluster. In the next experiment we use the Crail disaggregated shuffler, but configure the network connecting the compute to the storage cluster to be 10 Gb/s. We further use standard TCP communication between the compute nodes and the storage nodes. As one can observe, the overall runtime of this configuration is worse than the runtime of vanilla Spark, mainly due to the extra network transfers required when disaggregating shuffle data.
&lt;/p&gt;
&lt;p&gt;
Often, high network bandwidth is assumed to be the key enabling factor for storage disaggregation. That this is only partially true is shown in the next experiment, where we increase the network bandwidth between the compute and the storage cluster to 100 Gb/s but still use TCP to communicate between compute and storage nodes. As can be observed, this improves the runtime of the SQL job to a point that almost matches the performance of the non-disaggregated configuration based on vanilla Spark. We can, however, carve out even more performance using the RDMA-based version of the Crail disaggregated shuffler and get to a runtime that is lower than that of vanilla Spark in a co-located configuration. One reason why RDMA helps here is that many shuffle network transfers in the Spark SQL workload are small and the Crail RDMA storage tier is able to substantially reduce the latencies of these transfers.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;
&lt;p&gt;Efficient disaggregation of shuffle data is challenging, requiring shuffle managers and storage systems to be co-designed in order to effectively handle large numbers of small files, machine skew and load balancing issues. In this blog post we discussed the basic architecture of the Crail disaggregated shuffle engine and showed that by using Crail we can effectively disaggregate shuffle data in both bandwidth-intensive map-reduce jobs as well as in more latency-sensitive SQL workloads. In the next blog post we will discuss several deployment options of disaggregated storage in a tiered storage environment.&lt;/p&gt;</content><author><name>Patrick Stuedi</name></author><category term="blog" /><summary type="html">One of the goals of Crail has always been to enable efficient storage disaggregation for distributed data processing workloads. Separating storage from compute resources in a cluster is known to have several interesting benefits. For instance, it allows storage resources to scale independently from compute resources, or to run storage systems on specialized hardware (e.g., storage servers with a weak CPU attached to a fast network) for better performance and reduced cost. Storage disaggregation also simplifies system maintenance as one can upgrade compute and storage resources at different cycles. Today, data processing applications running in the cloud may implicitly use disaggregated storage through cloud storage services like S3. For instance, it is not uncommon for map-reduce workloads in the cloud to use S3 instead of HDFS for storing input and output data. While Crail can offer high-performance disaggregated storage for input/output data as well, in this blog we specifically look at how to use Crail for efficient disaggregation of shuffle data. Generally, the arguments for disaggregation hold for any type of data including input, output and shuffle data. One aspect that makes disaggregation of shuffle data particularly interesting is that compute nodes generating shuffle data now have access to ''infinitely'' large pools of storage resources, whereas in traditional ''non-disaggregated'' deployments the amount of shuffle data per compute node is bound by local resources.</summary></entry><entry><title type="html">Crail Python API: Python -&amp;gt; C/C++ call overhead</title><link href="http://crail.incubator.apache.org//blog/2019/01/python.html" rel="alternate" type="text/html" title="Crail Python API: Python -&gt; C/C++ call overhead" /><published>2019-01-22T00:00:00+01:00</published><updated>2019-01-22T00:00:00+01:00</updated><id>http://crail.incubator.apache.org//blog/2019/01/python</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2019/01/python.html">&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
With Python being used as the go-to language in many machine learning applications,
serverless frameworks, etc., we believe a Crail client Python API would be a useful
tool to broaden the use cases for Crail.
Since the Crail core is written in Java, performance has always been a concern due to
just-in-time compilation, garbage collection, etc.
However, with careful engineering (off-heap buffers, stateful verbs calls, ...)
we were able to show that Crail can deliver similar or better performance compared
to other statically compiled storage systems. So how can we engineer the Python
library to deliver the best possible performance?
&lt;/p&gt;
&lt;p&gt;
CPython, Python's reference implementation and also the most widely used one, has
historically been an interpreter and not a JIT compiler like PyPy. We will focus on
CPython since its alternatives are in general not plug-and-play replacements.
&lt;/p&gt;
&lt;p&gt;
Crail is client-driven, so most of its logic is implemented in the client library.
For this reason we do not want to reimplement the client logic for every new
language we want to support, as it would result in a maintenance nightmare.
Interfacing with Java directly is not feasible, however, since it incurs too much
overhead, so we decided to implement a C++ client (more on this in a later blog post).
The C++ client allows us to use a foreign function interface in Python to call
C++ functions directly from Python.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;options-options-options&quot;&gt;Options, Options, Options&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
There are two high-level approaches to integrating (C)Python and C: extension
modules and embedding.
&lt;/p&gt;
&lt;p&gt;
Embedding uses Python as a component in a larger application. Our aim is to
develop a Python API to be used by other Python applications, so embedding is
not what we are looking for.
&lt;/p&gt;
&lt;p&gt;
Extension modules are shared libraries that extend the Python interpreter.
For this use-case CPython offers a C API to interact with the Python interpreter
and allows defining modules, objects and functions in C which can be called
from Python. Note that there is also the option to extend the Python interpreter
through a Python library like ctypes or cffi. These are generally easier to
use and should preserve portability (extension modules are CPython-specific).
However, they do not offer as much flexibility as extension modules and
potentially incur more overhead (see below). There are multiple wrapper frameworks
available for CPython's C API to ease development of extension modules.
Here is an overview of the frameworks and libraries we tested:
&lt;/p&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://cython.org/&quot;&gt;&lt;strong&gt;Cython:&lt;/strong&gt;&lt;/a&gt; an optimising static compiler for Python and the Cython programming
language (based on Pyrex). C/C++ functions, objects, etc. can be directly
accessed from Cython. The compiler generates C code from Cython sources which
interfaces with the CPython C-API.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.swig.org/&quot;&gt;&lt;strong&gt;SWIG:&lt;/strong&gt;&lt;/a&gt; (Simplified Wrapper and Interface Generator) is a tool to connect
C/C++ with various high-level languages. C/C++ interfaces that should be available
in Python have to be defined in a SWIG interface file. The interface files
are compiled to C/C++ wrapper files which interface with the CPython C-API.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.boost.org/&quot;&gt;&lt;strong&gt;Boost.Python:&lt;/strong&gt;&lt;/a&gt; is a C++ library that wraps CPython’s C-API. It uses
advanced metaprogramming techniques to simplify the usage and allows wrapping
C++ interfaces non-intrusively.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.python.org/3.7/library/ctypes.html#module-ctypes&quot;&gt;&lt;strong&gt;ctypes:&lt;/strong&gt;&lt;/a&gt;
is a foreign function library. It allows calling C functions in shared libraries
with predefined compatible data types. It does not require writing any glue code
and does not interface with the CPython C-API directly (a minimal usage sketch
follows the list).&lt;/li&gt;
&lt;/ul&gt;
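&lt;p&gt;To make the zero-glue-code approach concrete, here is a minimal ctypes sketch. The shared library name and the function &lt;code class=&quot;highlighter-rouge&quot;&gt;crail_open&lt;/code&gt; are hypothetical placeholders, not the actual C++ client API.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import ctypes

# Load a (hypothetical) shared library built from the C/C++ client.
lib = ctypes.CDLL('libcrailclient.so')

# Declare the signature of a (hypothetical) C function:
#   int crail_open(const char *path);
lib.crail_open.argtypes = [ctypes.c_char_p]
lib.crail_open.restype = ctypes.c_int

# ctypes expects bytes when the C side takes a const char *.
fd = lib.crail_open(b'/tmp/testfile')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;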
&lt;h3 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;/h3&gt;
&lt;p&gt;In this blog post we focus on the overhead of calling a C/C++ function from Python.
We vary the number of arguments, the argument types and the return types. We also
test passing strings to C/C++ since this is part of the Crail API, e.g. when
opening or creating a file. Some frameworks expect &lt;code class=&quot;highlighter-rouge&quot;&gt;bytes&lt;/code&gt; when passing a string
to an underlying &lt;code class=&quot;highlighter-rouge&quot;&gt;const char *&lt;/code&gt;, some allow passing a &lt;code class=&quot;highlighter-rouge&quot;&gt;str&lt;/code&gt; and others allow both.
If C++ is supported by the framework we also test passing a &lt;code class=&quot;highlighter-rouge&quot;&gt;std::string&lt;/code&gt; to a
C++ function. Note that we perform all benchmarks with CPython version 3.5.2.
We measure the time from calling the Python function until it returns.
The C/C++ functions are empty, except for a &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statement where necessary.&lt;/p&gt;
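&lt;p&gt;As a minimal sketch of the measurement methodology, the ctypes variant can be timed as follows; the library and the empty function &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; are placeholders, and the other frameworks are measured analogously by calling their generated wrappers in the same loop.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import ctypes
import timeit

# Hypothetical shared library containing an empty function: void foo(void) {}
lib = ctypes.CDLL('./libfoo.so')
lib.foo.argtypes = []
lib.foo.restype = None

n = 1000000
total = timeit.timeit(lib.foo, number=n)
print('per-call overhead: %.0f ns' % (total / n * 1e9))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;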
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/python_c/python_c_foo.svg&quot; width=&quot;725&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;The plot shows that adding more arguments to a function increases runtime.
Introducing the first argument increases the runtime the most. Adding an integer
return type only increases runtime slightly.&lt;/p&gt;
&lt;p&gt;As expected, ctypes, the only approach not based on extension modules,
performed the worst. The call overhead for a function without a return value
or arguments is almost 300ns and goes up to half a microsecond with 4
arguments. Considering that RDMA writes can be performed below 1us, this would
introduce a major overhead (more on this below in the discussion section).&lt;/p&gt;
&lt;p&gt;SWIG and Boost.Python show similar performance, with Boost being slightly slower
and the slowest of the implementations based on extension modules.
Cython is also based on extension modules, so it was a surprise to us that it showed
the best performance of all methods tested. Investigating the performance difference
between Cython and our own extension module implementation, we found that Cython
makes better use of the C-API.&lt;/p&gt;
&lt;p&gt;Our extension module implementation follows the official tutorial and uses
&lt;code class=&quot;highlighter-rouge&quot;&gt;PyArg_ParseTuple&lt;/code&gt; to parse the arguments. However, as shown below, we found that
manually unpacking the arguments with &lt;code class=&quot;highlighter-rouge&quot;&gt;PyArg_UnpackTuple&lt;/code&gt; already significantly
increases the performance. Although these numbers still do not match Cython’s
performance, we did not further investigate possible optimizations
to our code.&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/python_c/python_c_foo_opt.svg&quot; width=&quot;725&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;Let’s take a look at the string performance. &lt;code class=&quot;highlighter-rouge&quot;&gt;bytes&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;str&lt;/code&gt; are used wherever
applicable. To pass strings as bytes the ‘b’ prefix is used.&lt;/p&gt;
&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/python_c/python_c_foo_str.svg&quot; width=&quot;725&quot; /&gt;&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;Again, Cython and the extension module implementation with manual unpacking
deliver the best performance. Passing a 64-bit value in the form of a &lt;code class=&quot;highlighter-rouge&quot;&gt;const char *&lt;/code&gt;
pointer seems to be slightly faster than passing an integer argument (up to 20%).
Passing the string to a C++ function which takes a &lt;code class=&quot;highlighter-rouge&quot;&gt;std::string&lt;/code&gt;
is ~50% slower than passing a &lt;code class=&quot;highlighter-rouge&quot;&gt;const char *&lt;/code&gt;, probably because of the
instantiation of the underlying data buffer and the associated copy; however, we have
not confirmed this.&lt;/p&gt;
&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;/h3&gt;
&lt;p&gt;One might think a difference of 100ns should not really matter and that you
should not call into C/C++ too often anyway. However, we believe this is not true
when it comes to latency-sensitive or high-IOPS applications. For example,
using RDMA one can perform IO operations below a 1us RTT, so 100ns is already
a 10% performance hit. Also, batching operations (to reduce the number of calls into C)
is not feasible for low-latency operations since it typically incurs wait
time until the batch is large enough to be posted. Furthermore, even in high-IOPS
applications batching is not always feasible and might lead to an undesired
latency increase.&lt;/p&gt;
&lt;p&gt;Efficient IO is typically performed through an asynchronous
interface, so that one does not have to wait for an IO operation to complete before
issuing the next one. Even with an asynchronous interface, the call overhead not only
affects the latency of each operation but also limits the maximum IOPS. For example,
in the best-case scenario our async call only takes one pointer as an argument, i.e.
around 100ns of call overhead. Say our C library is capable of posting 5 million
requests per second (limited by the speed of posting, not the device); that calculates
to 200ns per operation. If we introduce a 100ns overhead we limit the IOPS to 3.3
million operations per second, which is a 1/3 decrease in performance. This is
already significant; if we consider using ctypes for such an operation, we are
talking about limiting the throughput by a factor of 3.&lt;/p&gt;
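&lt;p&gt;To make the arithmetic explicit, here is the same back-of-the-envelope calculation as a few lines of Python; the numbers are the ones used above, not measurements.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;base_ns = 200.0  # time the C library needs to post one request (5M ops/s)

# call overhead added on top: none, a fast wrapper, a ctypes-like wrapper
for overhead_ns in (0.0, 100.0, 300.0):
    mops = 1e9 / (base_ns + overhead_ns) / 1e6
    print('%5.1f ns overhead: %.2f M ops/s' % (overhead_ns, mops))

# prints roughly 5.00, 3.33 and 2.00 M ops/s respectively
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;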
&lt;p&gt;Besides performance, another aspect is the usability of the different approaches.
Considering only ease of use, &lt;em&gt;ctypes&lt;/em&gt; is a clear winner for us. However, it only
supports interfacing with C and is slow. &lt;em&gt;Cython&lt;/em&gt;, &lt;em&gt;SWIG&lt;/em&gt; and &lt;em&gt;Boost.Python&lt;/em&gt;
require a similar amount of effort to declare the interfaces, however here
&lt;em&gt;Cython&lt;/em&gt; clearly wins the performance crown. Writing your own &lt;em&gt;extension module&lt;/em&gt;
is feasible; however, as shown above, getting the best performance requires
a good understanding of the CPython C-API/internals. Of the tested approaches,
this one requires the most glue code.&lt;/p&gt;
&lt;h3 id=&quot;setup--source-code&quot;&gt;Setup &amp;amp; Source Code&lt;/h3&gt;
&lt;p&gt;All tests were run on the following system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intel(R) Core(TM) i7-3770&lt;/li&gt;
&lt;li&gt;16GB DDR3-1600MHz&lt;/li&gt;
&lt;li&gt;Ubuntu 16.04 / Linux kernel version 4.4.0-142&lt;/li&gt;
&lt;li&gt;CPython 3.5.2&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The source code is available on &lt;a href=&quot;https://github.com/zrlio/Python-c-benchmark&quot;&gt;GitHub&lt;/a&gt;&lt;/p&gt;</content><author><name>Jonas Pfefferle</name></author><category term="blog" /><summary type="html">With python being used in many machine learning applications, serverless frameworks, etc. as the go-to language, we believe a Crail client Python API would be a useful tool to broaden the use-case for Crail. Since the Crail core is written in Java, performance has always been a concern due to just-in-time compilation, garbage collection, etc. However with careful engineering (Off heap buffers, stateful verbs calls, ...) we were able to show that Crail can devliever similar or better performance compared to other statically compiled storage systems. So how can we engineer the Python library to deliver the best possible performance? Python's reference implementation, also the most widely-used, CPython has historically always been an interpreter and not a JIT compiler like PyPy. We will focus on CPython since its alternatives are in general not plug-and-play replacements. Crail is client-driven so most of its logic is implemented in the client library. For this reason we do not want to reimplement the client logic for every new language we want to support as it would result in a maintance nightmare. However interfacing with Java is not feasible since it encurs in to much overhead so we decided to implement a C++ client (more on this in a later blog post). The C++ client allows us to use a foreign function interface in Python to call C++ functions directly from Python.</summary></entry><entry><title type="html">Sql P1 News</title><link href="http://crail.incubator.apache.org//blog/2018/08/sql-p1-news.html" rel="alternate" type="text/html" title="Sql P1 News" /><published>2018-08-09T00:00:00+02:00</published><updated>2018-08-09T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2018/08/sql-p1-news</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2018/08/sql-p1-news.html">&lt;p&gt;A new blog &lt;a href=&quot;//crail.incubator.apache.org/blog/2018/08/sql-p1.html&quot;&gt;post&lt;/a&gt; discussing file formats performance is now online&lt;/p&gt;</content><author><name></name></author><category term="news" /><summary type="html">A new blog post discussing file formats performance is now online</summary></entry><entry><title type="html">SQL Performance: Part 1 - Input File Formats</title><link href="http://crail.incubator.apache.org//blog/2018/08/sql-p1.html" rel="alternate" type="text/html" title="SQL Performance: Part 1 - Input File Formats" /><published>2018-08-08T00:00:00+02:00</published><updated>2018-08-08T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2018/08/sql-p1</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2018/08/sql-p1.html">&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
This is the first user blog post in a multi-part series where we will focus on relational data processing performance (e.g., SQL) in the presence of high-performance network and storage devices - the kind of devices that Crail targets. Relational data processing is one of the most popular and versatile workloads people run in the cloud. The general idea is that data is stored in tables with a schema, and is processed using a domain-specific language like SQL. Examples of some popular systems that support such relational data analytics in the cloud are &lt;a href=&quot;https://spark.apache.org/sql/&quot;&gt;Apache Spark/SQL&lt;/a&gt;, &lt;a href=&quot;https://hive.apache.org/&quot;&gt;Apache Hive&lt;/a&gt;, &lt;a href=&quot;https://impala.apache.org/&quot;&gt;Apache Impala&lt;/a&gt;, etc. In this post, we discuss the important first step in relational data processing, which is the reading of input data tables.
&lt;/p&gt;
&lt;/div&gt;
&lt;h3 id=&quot;hardware-and-software-configuration&quot;&gt;Hardware and Software Configuration&lt;/h3&gt;
&lt;p&gt;The specific cluster configuration used for the experiments in this blog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cluster
&lt;ul&gt;
&lt;li&gt;4 compute + 1 management node x86_64 cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Node configuration
&lt;ul&gt;
&lt;li&gt;CPU: 2 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz&lt;/li&gt;
&lt;li&gt;DRAM: 256 GB DDR3&lt;/li&gt;
&lt;li&gt;Network: 1x100Gbit/s Mellanox ConnectX-5&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Software
&lt;ul&gt;
&lt;li&gt;Ubuntu 16.04.3 LTS (Xenial Xerus) with Linux kernel version 4.10.0-33-generic&lt;/li&gt;
&lt;li&gt;Apache HDFS (2.7.3)&lt;/li&gt;
&lt;li&gt;Apache Parquet (1.8), Apache ORC (1.4), Apache Arrow (0.8), Apache Avro (1.4)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-crail/&quot;&gt;Apache Crail (incubating) with NVMeF support&lt;/a&gt;, commit 64e635e5ce9411041bf47fac5d7fadcb83a84355 (since then Crail has a stable source release v1.0 with a newer NVMeF code-base)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;overview&quot;&gt;Overview&lt;/h3&gt;
&lt;p&gt;In a typical cloud-based relational data processing setup, the input data is stored on an external data storage solution like HDFS or AWS S3. Data tables and their associated schema are converted into a storage-friendly format for optimal performance. Examples of some popular and familiar file formats are &lt;a href=&quot;https://parquet.apache.org/&quot;&gt;Apache Parquet&lt;/a&gt;, &lt;a href=&quot;https://orc.apache.org/&quot;&gt;Apache ORC&lt;/a&gt;, &lt;a href=&quot;https://avro.apache.org/&quot;&gt;Apache Avro&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/JSON&quot;&gt;JSON&lt;/a&gt;, etc. More recently, &lt;a href=&quot;https://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; has been introduced to standardize the in-memory columnar data representation between multiple frameworks. To be precise, Arrow is not a storage format but it defines an &lt;a href=&quot;https://github.com/apache/arrow/blob/master/format/IPC.md&quot;&gt;interprocess communication (IPC) format&lt;/a&gt; that can be used to store data in a storage system (our binding for reading Arrow IPC messages from HDFS is available &lt;a href=&quot;https://github.com/zrlio/fileformat-benchmarks/blob/master/src/main/java/com/github/animeshtrivedi/FileBench/HdfsSeekableByteChannel.java&quot;&gt;here&lt;/a&gt;). There is no one-size-fits-all, as all these formats have their own strengths, weaknesses, and features. In this blog, we are specifically interested in the performance of these formats on modern high-performance networking and storage devices.&lt;/p&gt;
&lt;figure&gt;&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/sql-p1/outline.svg&quot; width=&quot;550&quot; /&gt;&lt;figcaption&gt;Figure 1: The benchmarking setup with HDFS and file formats on a 100 Gbps network with NVMe flash devices. All formats contain routines for compression, encoding, and value materialization, with associated I/O buffer management and data copy routines.&lt;p&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;To benchmark the performance of file formats, we wrote a set of micro-benchmarks which are available at &lt;a href=&quot;https://github.com/zrlio/fileformat-benchmarks&quot;&gt;https://github.com/zrlio/fileformat-benchmarks&lt;/a&gt;. We cannot use typical SQL micro-benchmarks because every SQL engine has its own favorite file format, on which it performs the best. Hence, in order to ensure parity, we decoupled the performance of reading the input file format from the SQL query processing by writing simple table reading micro-benchmarks. Our benchmark reads in the store_sales table from the TPC-DS dataset (scale factor 100), and calculates a sum of values present in the table. The table contains 23 columns of integers, doubles, and longs.&lt;/p&gt;
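&lt;p&gt;The actual micro-benchmarks are written in Java and are linked above. Purely as an illustration of what they measure, a rough Python equivalent of the read-and-sum loop for the Parquet case could look like the following; the path is a placeholder, and the real benchmark reads directly from HDFS through the respective format readers.&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pandas as pd

# Placeholder path to a Parquet copy of the TPC-DS store_sales table.
df = pd.read_parquet('store_sales.parquet')

# Materialize every value and reduce it to a single number, mimicking the
# read-everything-and-sum-it-up structure of the micro-benchmark.
total = df.sum(numeric_only=True).sum()
print(total)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;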
&lt;figure&gt;&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/sql-p1/performance-all.svg&quot; width=&quot;550&quot; /&gt;&lt;figcaption&gt;Figure 2: Performance of JSON, Avro, Parquet, ORC, and Arrow on NVMe devices over a 100 Gbps network.&lt;p&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;We evaluate the performance of the benchmark on a 3-node HDFS cluster connected using 100 Gbps RoCE. One datanode in HDFS contains 4 NVMe devices with a collective aggregate bandwidth of 12.5 GB/sec (equal to 100 Gbps, hence we have balanced network and storage performance). Figure 2 shows our results, where none of the file formats is able to deliver the full hardware performance for reading input files. One third of the performance is already lost in HDFS (maximum throughput 74.9 Gbps out of a possible 100 Gbps). The rest of the performance is lost inside the file format implementation, which needs to deal with encoding, buffer and I/O management, compression, etc. The best performer is Apache Arrow, which is designed for in-memory columnar datasets. The performance of these file formats is bounded by the performance of the CPU, which is 100% loaded during the experiment. For a detailed analysis of the file formats, please refer to our paper - &lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/trivedi&quot;&gt;Albis: High-Performance File Format for Big Data Systems (USENIX, ATC’18)&lt;/a&gt;. As a side note on the Arrow performance - we have evaluated the performance of the &lt;em&gt;implementation of Arrow’s Java library&lt;/em&gt;. As this library has been focused on interactions with off-heap memory, there is headroom for optimizing the HDFS/on-heap reading path of Arrow’s Java library.&lt;/p&gt;
&lt;h3 id=&quot;albis-high-performance-file-format-for-big-data-systems&quot;&gt;Albis: High-Performance File Format for Big Data Systems&lt;/h3&gt;
&lt;p&gt;Based on these findings, we have developed a new file format called Albis. Albis is built on similar design choices as Crail. The top-level idea is to leverage the performance of modern networking and storage devices without being bottlenecked by the CPU. While designing Albis we revisited many outdated assumptions about the nature of I/O in a distributed setting, and came up with the following ideas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No compression or encoding: Modern network and storage devices are fast. Hence, there is no need to trade CPU cycles for performance. A 4 byte integer should be stored as a 4 byte value.&lt;/li&gt;
&lt;li&gt;Keep the data/metadata management simple: Albis splits a table into row and column groups, which are stored in hierarchical files and directories on the underlying file system (e.g., HDFS or Crail).&lt;/li&gt;
&lt;li&gt;Careful object materialization using a binary API: To optimize the runtime representation in managed runtimes like the JVM, only objects which are necessary for SQL processing are materialized. Otherwise, a 4 byte integer can be passed around as a byte array (using the binary API of Albis).&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/sql-p1/core-scalability.svg&quot; width=&quot;550&quot; /&gt;&lt;figcaption&gt;Figure 3: Core scalability of JSON, Avro, Parquet, ORC, Arrow, and Albis on HDFS/NVMe.&lt;p&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;Using the Albis format, we revisit our previous experiment where we read the input store_sales table from HDFS. In the figure above, we show the performance of Albis and the other file formats as a function of the number of CPU cores involved. At the right-hand end of the x-axis, we have the performance with all 16 cores engaged, hence representing the peak possible performance. As is evident, Albis delivers 59.9 Gbps out of the 74.9 Gbps possible bandwidth with HDFS over NVMe. Albis performs 1.9-21.4x better than the other file formats. To give an impression of where the performance is coming from, in the table below we show some micro-architectural features for Parquet, ORC, Arrow, and Albis. Our previously discussed design ideas in Albis result in a shorter code path (shown as fewer instructions required per row), better cache performance (shown as fewer cache misses per row), and clearly better performance (shown as fewer nanoseconds required per row for processing). For a detailed evaluation of Albis please refer to our paper.&lt;/p&gt;
&lt;table style=&quot;width:100%&quot;&gt;
&lt;caption&gt; Table 1: Micro-architectural analysis for Parquet, ORC, Arrow, and Albis on a 16-core Xeon machine.&lt;p&gt;&lt;/p&gt;&lt;/caption&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Parquet&lt;/th&gt;
&lt;th&gt;ORC&lt;/th&gt;
&lt;th&gt;Arrow&lt;/th&gt;
&lt;th&gt;Albis&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th&gt;Instructions/row&lt;/th&gt;
&lt;td&gt;6.6K&lt;/td&gt;
&lt;td&gt;4.9K&lt;/td&gt;
&lt;td&gt;1.9K&lt;/td&gt;
&lt;td&gt;1.6K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th&gt;Cache misses/row&lt;/th&gt;
&lt;td&gt;9.2&lt;/td&gt;
&lt;td&gt;4.6&lt;/td&gt;
&lt;td&gt;5.1&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th&gt;Nanoseconds/row&lt;/th&gt;
&lt;td&gt;105.3&lt;/td&gt;
&lt;td&gt;63.0&lt;/td&gt;
&lt;td&gt;31.2&lt;/td&gt;
&lt;td&gt;20.8&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h3 id=&quot;apache-crail-incubating-with-albis&quot;&gt;Apache Crail (Incubating) with Albis&lt;/h3&gt;
&lt;p&gt;For our final experiment, we try to answer the question of what it would take to deliver the full 100 Gbps bandwidth for Albis. Certainly, the first bottleneck to address is the base storage layer performance. Here we use Apache Crail (Incubating) with its &lt;a href=&quot;https://en.wikipedia.org/wiki/NVM_Express#NVMeOF&quot;&gt;NVMeF&lt;/a&gt; storage tier. This tier uses the &lt;a href=&quot;https://github.com/zrlio/jNVMf&quot;&gt;jNVMf library&lt;/a&gt; to implement the NVMeF stack in Java. As we have shown in a previous blog &lt;a href=&quot;//crail.incubator.apache.org/blog/2017/08/crail-nvme-fabrics-v1.html&quot;&gt;post&lt;/a&gt;, Crail’s NVMeF tier can deliver performance (97.8 Gbps) very close to the hardware limits. Hence, Albis with Crail is a perfect setup to evaluate on high-performance NVMe and RDMA devices. Before we get there, let’s get some calculations right. The store_sales table in the TPC-DS dataset has a data density of 93.9% (out of 100 bytes, only 93.9 bytes are data, the rest are null values). As we measure the goodput, the expected performance of Albis on Crail is 93.9% of 97.8 Gbps, which calculates to 91.8 Gbps. In our experiments, Albis on Crail delivers 85.5 Gbps. Figure 4 shows more detailed results.&lt;/p&gt;
&lt;figure&gt;&lt;div style=&quot;text-align:center&quot;&gt;&lt;img src=&quot;//crail.incubator.apache.org/img/blog/sql-p1/albis-crail.svg&quot; width=&quot;550&quot; /&gt;&lt;figcaption&gt;Figure 4: Performance of Albis on Crail.&lt;p&gt;&lt;/p&gt;&lt;/figcaption&gt;&lt;/div&gt;&lt;/figure&gt;
&lt;p&gt;The left half of the figure shows the performance scalability of Albis on Crail in a setup with 1 core (8.9 Gbps) to 16 cores (85.5 Gbps). In comparison, the right half of the figure shows the performance of Albis on HDFS/NVMe at 59.9 Gbps, and on Crail/NVMe at 85.5 Gbps. The last bar shows the performance of Albis if the benchmark does not materialize Java object values. In this configuration, Albis on Crail delivers 91.3 Gbps, which is very close to the expected peak of 91.8 Gbps.&lt;/p&gt;
&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;
&lt;div style=&quot;text-align: justify&quot;&gt;
&lt;p&gt;
In this first blog of a multi-part series, we have looked at the data ingestion performance of file formats on high-performance networking and storage devices. We found that popular file formats are in need of a performance revision. Based on our analysis, we designed and implemented Albis - a new file format for storing relational data. Albis and Crail share many design choices. Their combined performance of 85+ Gbps on a 100 Gbps network gives us confidence in our approach and underlying software philosophy for both Crail and Albis.
&lt;/p&gt;
&lt;p&gt;
Stay tuned for the next part where we look at workload-level performance in Spark/SQL on modern high-performance networking and storage devices. Meanwhile let us know if you have any feedback or comments.
&lt;/p&gt;
&lt;/div&gt;</content><author><name>Animesh Trivedi</name></author><category term="blog" /><summary type="html">This is the first user blog post in a multi-part series where we will focus on relational data processing performance (e.g., SQL) in presence of high-performance network and storage devices - the kind of devices that Crail targets. Relational data processing is one of the most popular and versatile workloads people run in the cloud. The general idea is that data is stored in tables with a schema, and is processed using a domain specific language like SQL. Examples of some popular systems that support such relational data analytics in the cloud are Apache Spark/SQL, Apache Hive, Apache Impala, etc. In this post, we discuss the important first step in relational data processing, which is the reading of input data tables.</summary></entry><entry><title type="html">Dataworks</title><link href="http://crail.incubator.apache.org//blog/2018/06/dataworks.html" rel="alternate" type="text/html" title="Dataworks" /><published>2018-06-05T00:00:00+02:00</published><updated>2018-06-05T00:00:00+02:00</updated><id>http://crail.incubator.apache.org//blog/2018/06/dataworks</id><content type="html" xml:base="http://crail.incubator.apache.org//blog/2018/06/dataworks.html">&lt;p&gt;Apache Crail (incubating) to feature in the &lt;a href=&quot;https://dataworkssummit.com/san-jose-2018/session/data-processing-at-the-speed-of-100-gbpsapache-crail-incubating/&quot;&gt;DataWorks Summit&lt;/a&gt; on June 21st&lt;/p&gt;</content><author><name></name></author><category term="news" /><summary type="html">Apache Crail (incubating) to feature in the DataWorks Summit on June 21st</summary></entry></feed>