The specific cluster configuration used for the experiments in this blog post:
One key question of interest is the network usage of the Crail shuffler during the sorting benchmark. The figure below shows the rate at which the different reduce tasks fetch data over the network; each point in the figure corresponds to one reduce task. In our configuration, we run 3 Spark executors per node and 5 Spark cores per executor, i.e., 15 concurrent tasks per node. Across the 128 nodes of the cluster, 1920 reduce tasks thus run concurrently (out of 6400 reduce tasks in total), generating cluster-wide all-to-all traffic of about 70Gbit/s per node during that phase.
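For readers who want to verify the arithmetic, here is a small back-of-the-envelope check. The node count of 128 is inferred from the figures above (1920 tasks divided by 15 tasks per node) rather than stated explicitly; all other values are taken directly from the text.

# Sanity-check of the concurrency figures quoted above.
# nodes=128 is inferred from 1920 / (3 * 5); the rest comes from the text.
nodes=128
executors_per_node=3
cores_per_executor=5
concurrent=$((nodes * executors_per_node * cores_per_executor))  # = 1920
total_reduce_tasks=6400
echo "concurrent reduce tasks: $concurrent"
awk -v t="$total_reduce_tasks" -v c="$concurrent" \
    'BEGIN { printf "task waves: %.2f\n", t / c }'               # ~3.33

In other words, the 6400 reduce tasks execute in roughly three and a half waves of 1920 concurrent tasks each.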
In this blog post, we have shown that Crail successfully translates raw network performance into actual workload-level gains. The exercise with TeraSort as an application validates the design decisions we made in Crail. Stay tuned for more results with different workloads and hardware configurations.
All the components required to run the sorting benchmark using Spark/Crail are open source. Here is some guidance on how to run the benchmark:
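Before submitting the job, the benchmark jar has to be built from the open-source tree. A minimal sketch, assuming a standard Maven build of the crail-terasort sources; the checkout directory name and build layout are assumptions on our side, so adjust them to your setup:

# Build the TeraSort benchmark jar from a source checkout.
# Directory name and Maven layout are assumed; adjust as needed.
cd crail-terasort
mvn -DskipTests package

With the jar in place, the benchmark can be submitted to a YARN cluster: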
./bin/spark-submit -v \
  --num-executors 384 \
  --executor-cores 5 \
  --executor-memory 64G \
  --driver-memory 64G \
  --master yarn \
  --class com.ibm.crail.terasort.TeraSort \
  path/to/crail-terasort-2.0.jar \
  -i /terasort-input-1280g -o /terasort-output-1280g
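Note how the submit parameters match the setup described above: --num-executors 384 corresponds to 3 executors on each of the 128 nodes, and --executor-cores 5 yields the 1920 concurrently running reduce tasks. The -i and -o arguments point to the input dataset and the output directory; adjust these paths to wherever your TeraSort input was generated.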
Have questions or comments? Feel free to discuss on the dev mailing list: dev@crail.incubator.apache.org