---
layout: post
title: "Sorting on a 100Gbit/s Cluster using Spark/Crail"
author: Patrick Stuedi
category: blog
mode: guest
---

Hardware Configuration

The specific cluster configuration used for the experiments in this blog post:

  • Cluster
    • 128 node OpenPower cluster
  • Node configuration
    • CPU: 2x OpenPOWER POWER8 10-core @ 2.9 GHz
    • DRAM: 512GB DDR4
    • Storage: 4x Huawei ES3600P V3 1.2TB NVMe SSD
    • Network: 100Gbit/s Ethernet Mellanox ConnectX-4 EN (RoCE)
  • Software
    • Ubuntu 16.04 with Linux kernel version 4.4.0-31
    • Spark 2.0.0
    • Crail 1.0 (Crail only used during shuffle, input/output is on HDFS)

Anatomy of Spark Sorting

Using Vanilla Spark

Using the Crail Shuffler

Spark/Crail Sorting Performance

One key question of interest is the network usage of the Crail shuffler during the sorting benchmark. In the figure below, we show the data rate at which the different reduce tasks fetch data from the network. Each point in the figure corresponds to one reduce task. In our configuration, we run 3 Spark executors per node and 5 Spark cores per executor, so across the 128 nodes 3 x 5 x 128 = 1920 reduce tasks (out of 6400 reduce tasks in total) run concurrently, generating cluster-wide all-to-all traffic of about 70Gbit/s per node during that phase.

In this blog post, we have shown that Crail successfully translates raw network performance into actual workload-level gains. The exercise with TeraSort as an application validates the design decisions we made in Crail. Stay tuned for more results with different workloads and hardware configurations.

How to Run Sorting with Spark/Crail

All the components required to run the sorting benchmark using Spark/Crail are open source. Here is some guidance on how to run the benchmark:

  • Build and deploy Crail by following the instructions in the Crail documentation
  • Enable the Crail shuffler for Spark by building Spark-IO, following the instructions in the documentation
  • Configure the Crail DRAM storage tier so that all the shuffle data fits into it (see the example configuration below)
  • Build the sorting benchmark using the instructions on GitHub
  • Make sure the custom serializer and sorter are specified in spark-defaults.conf (see the snippet below)
  • Run Hadoop TeraGen to produce a valid input data set (an example command is shown below). We used standard HDFS for both input and output data.
  • Run Crail-TeraSort on your Spark cluster. The command line we used on the 128-node cluster is the following:
./bin/spark-submit -v --num-executors 384 --executor-cores 5 \
  --executor-memory 64G --driver-memory 64G --master yarn \
  --class com.ibm.crail.terasort.TeraSort path/to/crail-terasort-2.0.jar \
  -i /terasort-input-1280g -o /terasort-output-1280g
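For the DRAM tier sizing mentioned in the list above, here is a minimal crail-site.conf sketch. The property names follow the Crail documentation for the RDMA-based DRAM storage tier, but treat them as assumptions and verify them against the Crail version you deploy:

# crail-site.conf -- sketch only; verify property names against your
# Crail release. Sorting 1280 GB on 128 nodes produces roughly 10 GB
# of shuffle data per node, so size each storage server comfortably
# above that, e.g. 32 GB:
crail.storage.rdma.storagelimit    34359738368
# Backing the DRAM tier with hugepages helps performance:
crail.storage.rdma.datapath        /dev/hugepages/data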
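For the serializer and sorter step, a minimal spark-defaults.conf sketch looks as follows. The shuffle manager and the TeraSort-specific class names are taken from the Spark-IO and crail-terasort repositories; double-check them against the versions you build:

# spark-defaults.conf -- class names may differ across versions
spark.shuffle.manager            org.apache.spark.shuffle.crail.CrailShuffleManager
spark.crail.shuffle.sorter       com.ibm.crail.terasort.sorter.CrailShuffleNativeRadixSorter
spark.crail.shuffle.serializer   com.ibm.crail.terasort.serializer.F22Serializer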
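For the TeraGen step, the input used above (/terasort-input-1280g) can be generated with the standard Hadoop examples jar. TeraGen counts 100-byte records, so 1280 GB corresponds to 12.8 billion records; the jar path below depends on your Hadoop installation:

# Generate 1280 GB (12,800,000,000 x 100-byte records) on HDFS
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 12800000000 /terasort-input-1280g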

Have questions or comments? Feel free to discuss them on the dev mailing list at dev@crail.incubator.apache.org.