Data shuffle is a key operation that underlies almost all large-scale data processing jobs. A shuffle operation typically involves writing intermediate data to disk, and reading the data back later when the successive computations are scheduled.
Sailfish is an optimization technique that reduces disk overheads associated with a shuffle operation. Specifically, Sailfish minimizes the number of disk seeks involved in reading intermediate data back from disk. Jobs that handle large volumes of data can especially benefit from the Sailfish technique.
Nemo provides an optimization policy interface that makes it easy for users to employ techniques like Sailfish to improve application performance. To demonstrate the flexibility of Nemo, we have developed and evaluated SailfishPolicy. We summarize preliminary evaluation results as follows.
As shown in Figure 1, Nemo outperforms Spark by 2.26X primarily because Nemo’s reduce stage completes faster than Spark’s.
To understand the performance difference, we’ve measured the mean throughput of the scratch disks that Nemo and Spark use for handling intermediate data. As depicted in Figure 2, Nemo’s reduce stage enjoys much higher disk read throughput with a smaller number of disk seeks. This explains why Nemo’s reduce stage was able to complete more quickly, and validates the effectiveness of SailfishPolicy.
 Sriram Rao, Raghu Ramakrishnan, Adam Silberstein, Mike Ovsiannikov, and Damian Reeves. 2012. Sailfish: a framework for large scale data processing. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12).
 Apache Spark. https://spark.apache.org/.
 Wikipedia pageview statistics. https://dumps.wikimedia.org/other/pagecounts-raw/.