Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/incubator-crail-website
diff --git a/site/_layouts/post.html b/site/_layouts/post.html
index 8d9c93d..e27f85a 100644
--- a/site/_layouts/post.html
+++ b/site/_layouts/post.html
@@ -1,7 +1,7 @@
 ---
 layout: default
 ---
-<p class="meta">{{ page.date | date_to_string }}</p>
+<p class="meta">{{ page.date | date_to_string }}, {% if page.mode == "guest" %} <mark>this is a blog post from a user of the Crail project.</mark> {% endif %} </p>
 
 <div class="post">
 {{ content }}
diff --git a/site/_posts/blog/2017-01-17-sorting.md b/site/_posts/blog/2017-01-17-sorting.md
index 2873ae0..624cf94 100644
--- a/site/_posts/blog/2017-01-17-sorting.md
+++ b/site/_posts/blog/2017-01-17-sorting.md
@@ -3,7 +3,7 @@
 title: "Sorting on a 100Gbit/s Cluster using Spark/Crail"
 author: Patrick Stuedi
 category: blog
-comments: true
+mode: guest
 ---
 
 <div style="text-align: justify"> 
diff --git a/site/_posts/blog/2017-11-17-rdmashuffle.md b/site/_posts/blog/2017-11-17-rdmashuffle.md
index 07ed304..02534b6 100644
--- a/site/_posts/blog/2017-11-17-rdmashuffle.md
+++ b/site/_posts/blog/2017-11-17-rdmashuffle.md
@@ -3,7 +3,7 @@
 title: "Spark Shuffle: SparkRDMA vs Crail"
 author: Jonas Pfefferle, Patrick Stuedi, Animesh Trivedi, Bernard Metzler, Adrian Schuepbach
 category: blog
-comments: true
+mode: guest
 ---
 
 <div style="text-align: justify">
diff --git a/site/_posts/blog/2018-08-08-sql-p1.md b/site/_posts/blog/2018-08-08-sql-p1.md
index 87fc2f8..8179159 100644
--- a/site/_posts/blog/2018-08-08-sql-p1.md
+++ b/site/_posts/blog/2018-08-08-sql-p1.md
@@ -1,14 +1,15 @@
 ---
 layout: post
 title: "SQL Performance: Part 1 - Input File Formats"
-author: Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler
+author: Animesh Trivedi
 category: blog
 comments: true
+mode: guest
 ---
 
 <div style="text-align: justify">
 <p>
-This is the first blog post in a multi-part series where we will focus on relational data processing performance (e.g., SQL) in presence of high-performance network and storage devices - the kind of devices that Crail targets. Relational data processing is one of the most popular and versatile workloads people run in the  cloud. The general idea is that data is stored in tables with a schema, and is processed using a domain specific language like SQL. Examples of some popular systems that support such relational data analytics in the cloud are <a href="https://spark.apache.org/sql/">Apache Spark/SQL</a>, <a href="https://hive.apache.org/">Apache Hive</a>, <a href="https://impala.apache.org/">Apache Impala</a>, etc. In this post, we discuss the important first step in relational data processing, which is the reading of input data tables.
+This is the first user blog post in a multi-part series where we will focus on relational data processing performance (e.g., SQL) in the presence of high-performance network and storage devices - the kind of devices that Crail targets. Relational data processing is one of the most popular and versatile workloads people run in the cloud. The general idea is that data is stored in tables with a schema, and is processed using a domain-specific language like SQL. Examples of some popular systems that support such relational data analytics in the cloud are <a href="https://spark.apache.org/sql/">Apache Spark/SQL</a>, <a href="https://hive.apache.org/">Apache Hive</a>, <a href="https://impala.apache.org/">Apache Impala</a>, etc. In this post, we discuss the important first step in relational data processing, which is the reading of input data tables.
 </p>
 </div>
 
@@ -30,7 +31,7 @@
 
 ### Overview
 
-In a typical cloud-based relational data processing setup, the input data is stored on an external data storage solution like HDFS or AWS S3. Data tables and their associated schema are converted into a storage-friendly format for optimal performance. Examples of some popular and familiar file formats are [Apache Parquet](https://parquet.apache.org/), [Apache ORC](https://orc.apache.org/), [Apache Avro](https://avro.apache.org/), [JSON](https://en.wikipedia.org/wiki/JSON), etc. More recently, [Apache Arrow](https://arrow.apache.org/) has been introduced to standardize the in-memory columnar data representation between multiple frameworks. There is no one size fits all as all these formats have their own strengths, weaknesses, and features. In this blog, we are specifically interested in the performance of these formats on modern high-performance networking and storage devices. 
+In a typical cloud-based relational data processing setup, the input data is stored on an external data storage solution like HDFS or AWS S3. Data tables and their associated schema are converted into a storage-friendly format for optimal performance. Examples of some popular and familiar file formats are [Apache Parquet](https://parquet.apache.org/), [Apache ORC](https://orc.apache.org/), [Apache Avro](https://avro.apache.org/), [JSON](https://en.wikipedia.org/wiki/JSON), etc. More recently, [Apache Arrow](https://arrow.apache.org/) has been introduced to standardize the in-memory columnar data representation between multiple frameworks. To be precise, Arrow is not a storage format; rather, it defines an [interprocess communication (IPC) format](https://github.com/apache/arrow/blob/master/format/IPC.md) that can also be used to store data in a storage system (our binding for reading Arrow IPC messages from HDFS is available [here](https://github.com/zrlio/fileformat-benchmarks/blob/master/src/main/java/com/github/animeshtrivedi/FileBench/HdfsSeekableByteChannel.java)). There is no one-size-fits-all, as all these formats have their own strengths, weaknesses, and features. In this blog, we are specifically interested in the performance of these formats on modern high-performance networking and storage devices.
 
 <figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/outline.svg" width="550"/><figcaption>Figure 1: The benchmarking setup with HDFS and file formats on a 100 Gbps network with NVMe flash devices. All formats contain routines for compression, encoding, and value materialization, with the associated I/O buffer management and data copies.<p></p></figcaption></div></figure>
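
To make the read path concrete, below is a minimal Java sketch, not the benchmark code itself, of how Arrow IPC record batches can be consumed through a `SeekableByteChannel` using Arrow's Java library. The class name and the use of a local file channel are assumptions for illustration; an HDFS-backed channel like the linked `HdfsSeekableByteChannel` binding plugs in the same way.

```java
// A minimal sketch, not the benchmark code: read Arrow IPC record batches
// through a java.nio SeekableByteChannel. An HDFS-backed channel such as
// the HdfsSeekableByteChannel binding linked above drops in the same way.
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class ArrowIpcReadSketch {
    public static void main(String[] args) throws Exception {
        // A local file channel stands in for the HDFS channel in this sketch.
        try (SeekableByteChannel channel = Files.newByteChannel(Paths.get(args[0]));
             RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            long rows = 0;
            // Each loadNextBatch() call materializes one IPC record batch
            // into the vectors held by the schema root.
            while (reader.loadNextBatch()) {
                rows += root.getRowCount();
            }
            System.out.println("rows read: " + rows);
        }
    }
}
```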
 
@@ -38,7 +39,7 @@
 
 <figure><div style="text-align:center"><img src ="{{ site.base }}/img/blog/sql-p1/performance-all.svg" width="550"/><figcaption>Figure 2: Performance of JSON, Avro, Parquet, ORC, and Arrow on NVMe devices over a 100 Gbps network.<p></p></figcaption></div></figure>
 
-We evaluate the performance of the benchmark on a 3 node HDFS cluster connected using 100 Gbps RoCE. One datanode in HDFS contains 4 NVMe devices with a collective aggregate bandwidth of 12.5 GB/sec (equals to 100 Gbps, hence, we have a balanced network and storage performance). Figure 2 shows our results where none of the file formats is able to deliver the full hardware performance for reading input files. One third of the performance is already lost in HDFS (maximum throughput 74.9 Gbps out of possible 100 Gbps). The rest of the performance is lost inside the file format implementation, which needs to deal with encoding, buffer and I/O management, compression, etc. The best performer is Apache Arrow which is designed for in-memory columnar datasets. The performance of these file formats are bounded by the performance of the CPU, which is 100% loaded during the experiment. For a detailed analysis of the file formats, please refer to our paper - [Albis: High-Performance File Format for Big Data Systems (USENIX, ATC’18)](https://www.usenix.org/conference/atc18/presentation/trivedi). 
+We evaluate the performance of the benchmark on a 3-node HDFS cluster connected using 100 Gbps RoCE. One datanode in HDFS contains 4 NVMe devices with an aggregate bandwidth of 12.5 GB/sec (equal to 100 Gbps; hence, we have balanced network and storage performance). Figure 2 shows our results where none of the file formats is able to deliver the full hardware performance for reading input files. One third of the performance is already lost in HDFS (maximum throughput 74.9 Gbps out of a possible 100 Gbps). The rest of the performance is lost inside the file format implementation, which needs to deal with encoding, buffer and I/O management, compression, etc. The best performer is Apache Arrow, which is designed for in-memory columnar datasets. The performance of these file formats is bounded by the performance of the CPU, which is 100% loaded during the experiment. For a detailed analysis of the file formats, please refer to our paper - [Albis: High-Performance File Format for Big Data Systems (USENIX, ATC’18)](https://www.usenix.org/conference/atc18/presentation/trivedi). As a side note on the Arrow performance - we have evaluated the _Java library implementation_ of Arrow. As this library has focused on interactions with off-heap memory, there is headroom for optimizing the HDFS/on-heap reading path of Arrow's Java library.
 
 ### Albis: High-Performance File Format for Big Data Systems
 
@@ -97,7 +98,7 @@
 ### Summary 
 <div style="text-align: justify">
 <p>
-In this first blog of a multipart series, we have looked at the data ingestion performance of file formats on high-performance networking and storage devices. We found that popular file formats are in need for a performance revision. Based on our analysis, we designed and implemented Albis - a new file format for storing relational data. Albis and Crail share many design choices. Their combined performance of 85+ Gbps on a 100 Gbps network, gives us confidence in our approach and underlying software philosophy for both, Crail and Albis.
+In this first blog of a multi-part series, we have looked at the data ingestion performance of file formats on high-performance networking and storage devices. We found that popular file formats are in need of a performance revision. Based on our analysis, we designed and implemented Albis - a new file format for storing relational data. Albis and Crail share many design choices. Their combined performance of 85+ Gbps on a 100 Gbps network gives us confidence in our approach and underlying software philosophy for both Crail and Albis.
 </p>
 
 <p>
diff --git a/site/blog/index.html b/site/blog/index.html
index 28de65a..88732b6 100644
--- a/site/blog/index.html
+++ b/site/blog/index.html
@@ -17,7 +17,11 @@
         </h3>
         {% endif %}
     </a>
-    <p class="post-meta">Posted by {% if post.author %}{{ post.author }}{% else %}{{ site.title }}{% endif %} on {{ post.date | date: "%B %-d, %Y" }}</p>
+    {% if post.mode == "guest" %}
+    <p class="post-meta"><b>User Post</b> by {% if post.author %}{{ post.author }}{% else %}{{ site.title }}{% endif %} on {{ post.date | date: "%B %-d, %Y" }}</p>
+    {% else %}
+    <p class="post-meta"><b>Developer Post</b> by {% if post.author %}{{ post.author }}{% else %}{{ site.title }}{% endif %} on {{ post.date | date: "%B %-d, %Y" }}</p>
+    {% endif %}
 </div>
 <hr>
 {% endfor %}