<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="/assets/css/custom.css">
<link rel="stylesheet" href="/assets/css/main.css">
<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
<link rel="shortcut icon" href="/favicon.ico?1">
<!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Optimizing shuffle performance using Nemo | Nemo</title>
<meta name="generator" content="Jekyll v3.9.2" />
<meta property="og:title" content="Optimizing shuffle performance using Nemo" />
<meta name="author" content="John Yang" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Data shuffle is a key operation that underlies almost all large-scale data processing jobs. A shuffle operation typically involves writing intermediate data to disk, and reading the data back later when the successive computations are scheduled." />
<meta property="og:description" content="Data shuffle is a key operation that underlies almost all large-scale data processing jobs. A shuffle operation typically involves writing intermediate data to disk, and reading the data back later when the successive computations are scheduled." />
<link rel="canonical" href="http://nemo.apache.org//blog/2018/03/23/shuffle-on-nemo/" />
<meta property="og:url" content="http://nemo.apache.org//blog/2018/03/23/shuffle-on-nemo/" />
<meta property="og:site_name" content="Nemo" />
<meta property="og:type" content="article" />
<meta property="article:published_time" content="2018-03-23T00:00:00+09:00" />
<meta name="twitter:card" content="summary" />
<meta property="twitter:title" content="Optimizing shuffle performance using Nemo" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"John Yang"},"dateModified":"2018-03-23T00:00:00+09:00","datePublished":"2018-03-23T00:00:00+09:00","description":"Data shuffle is a key operation that underlies almost all large-scale data processing jobs. A shuffle operation typically involves writing intermediate data to disk, and reading the data back later when the successive computations are scheduled.","headline":"Optimizing shuffle performance using Nemo","mainEntityOfPage":{"@type":"WebPage","@id":"http://nemo.apache.org//blog/2018/03/23/shuffle-on-nemo/"},"url":"http://nemo.apache.org//blog/2018/03/23/shuffle-on-nemo/"}</script>
<!-- End Jekyll SEO tag -->
<link rel="alternate" type="application/rss+xml" title="Nemo" href="http://nemo.apache.org//feed.xml" />
</head>
<body>
<nav class="navbar navbar-default navbar-fixed-top">
<div class="container navbar-container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<span><img src="/assets/img/nemo-logo.png" alt="Logo"></span>
</a>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li ><a href="/docs/home/">Docs</a></li>
<li ><a href="/apidocs">APIs</a></li>
<li ><a href="/pages/downloads">Downloads</a></li>
<li ><a href="/pages/talks">Talks</a></li>
<li ><a href="/pages/team">Team</a></li>
<li ><a href="/pages/license">License</a></li>
<li class="active" ><a href="/blog/2022/09/10/release-note-0.4/">Blog</a></li>
</ul>
<div class="navbar-right">
<form class="navbar-form navbar-left">
<div class="form-group has-feedback">
<input id="search-box" type="search" class="form-control" placeholder="Search...">
<i class="fa fa-search form-control-feedback"></i>
</div>
</form>
<ul class="nav navbar-nav">
<li><a href="https://github.com/apache/incubator-nemo"><i class="fa fa-github" aria-hidden="true"></i></a></li>
</ul>
</div>
</div>
</div>
</nav>
<div class="page-content">
<div class="wrapper">
<div class="container">
<div class="row">
<div class="col-md-4">
<div class="well">
<h4>RECENT POSTS</h4>
<ul class="list-unstyled post-list-container">
<li><a href="/blog/2022/09/10/release-note-0.4/" >Nemo Release 0.4</a></li>
<li><a href="/blog/2022/08/15/beam-runner/" >Beam Nemo Runner documents updated!</a></li>
<li><a href="/blog/2022/06/07/release-note-0.3/" >Nemo Release 0.3</a></li>
<li><a href="/blog/2020/03/09/release-note-0.2/" >Nemo Release 0.2</a></li>
<li><a href="/blog/2019/03/02/release-note-0.1/" >Nemo Release 0.1</a></li>
<li><a href="/blog/2018/03/23/shuffle-on-nemo/" class="active" >Optimizing shuffle performance using Nemo</a></li>
<li><a href="/blog/2018/03/23/pado-on-nemo/" >Harnessing transient resources using Nemo</a></li>
<li><a href="/blog/2018/03/20/nemo-blog-published/" >Nemo blog published!</a></li>
<li><a href="/allposts">All posts ...</a></li>
</ul>
</div>
</div>
<div class="col-md-8">
<h1>Optimizing shuffle performance using Nemo</h1>
<p>Mar 23, 2018 • John Yang</p>
<div id="markdown-content-container"><p>Data shuffle is a key operation that underlies almost all large-scale data processing jobs. A shuffle operation typically involves writing intermediate data to disk and reading it back later, when the subsequent computations are scheduled.</p>
<p>Sailfish[1] is an optimization technique that reduces disk overheads associated with a shuffle operation. Specifically, Sailfish minimizes the number of disk seeks involved in reading intermediate data back from disk. Jobs that handle large volumes of data can especially benefit from the Sailfish technique.</p>
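<p>To give a rough sense of why seek counts matter, the sketch below uses a deliberately simplified model that is our own assumption and is not taken from the Sailfish paper or from Nemo's code: without aggregation, each reducer reads one segment from every mapper's output file, whereas with per-partition aggregation each reducer reads one contiguous file. The function name <code>shuffle_read_seeks</code> and the task counts are hypothetical.</p>

```python
def shuffle_read_seeks(num_mappers: int, num_reducers: int, aggregated: bool) -> int:
    """Rough count of disk seeks needed to read shuffle data (toy model).

    Assumption: one seek per contiguous segment read; caching, merging,
    and filesystem effects are ignored.
    """
    if aggregated:
        # Each reducer's partition is stored contiguously: one seek per reducer.
        return num_reducers
    # Each reducer fetches its own segment from every mapper's output file.
    return num_mappers * num_reducers


# Hypothetical task counts, chosen only for illustration.
per_mapper_files = shuffle_read_seeks(1000, 500, aggregated=False)
per_partition_files = shuffle_read_seeks(1000, 500, aggregated=True)
print(per_mapper_files, per_partition_files)
```

<p>Under this model the seek count grows with the product of mapper and reducer counts in the unaggregated case, but only with the reducer count in the aggregated case, which is why the gap widens for larger jobs.</p>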
<p>Nemo provides an optimization policy interface that makes it easy for users to employ techniques like Sailfish to improve application performance. To demonstrate the flexibility of Nemo, we have developed and evaluated SailfishPolicy. We summarize preliminary evaluation results as follows.</p>
<h3 id="experimentation-setup">Experimental setup</h3>
<ul>
<li>Systems: Spark[2] 2.3.0 (a state-of-the-art system), and Nemo with SailfishPolicy</li>
<li>Resources: 20 h1.4xlarge (16 vCPU, 64GB memory, 2 HDDs) AWS instances
<ul>
<li>One of the disks is used by an HDFS cluster, and the other is used as a scratch disk by Nemo and Spark for maintaining intermediate data</li>
</ul>
</li>
<li>Dataset: 2TB Wikipedia pageview statistics[3] stored in the HDFS cluster</li>
<li>Application: A MapReduce application that reads input data from HDFS, computes the sum of pageview counts per Wikipedia project, and writes the results to HDFS
<ul>
<li>The Spark application is written in the Spark DSL, and the Nemo application is written in Beam</li>
</ul>
</li>
</ul>
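<p>The logic of the application described above can be sketched in a few lines of plain Python. This is not the Beam or Spark code used in the evaluation; it is a single-process illustration, and it assumes each input line has four space-separated fields (project, page title, view count, bytes). The helper names <code>map_line</code> and <code>sum_by_project</code> and the sample records are hypothetical.</p>

```python
from collections import defaultdict


def map_line(line):
    """Map step: return (project, count) for one record, or None if malformed."""
    fields = line.split()
    # Assumed record layout: project, page title, view count, bytes.
    if len(fields) != 4 or not fields[2].isdigit():
        return None
    return fields[0], int(fields[2])


def sum_by_project(lines):
    """Reduce step: sum the view counts for each project key."""
    totals = defaultdict(int)
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            project, count = pair
            totals[project] += count
    return dict(totals)


# Made-up sample records, for illustration only.
sample = [
    "en Main_Page 10 2000",
    "en Apache 5 900",
    "ko Main_Page 3 400",
    "bad line",
]
print(sum_by_project(sample))
```

<p>In a distributed run, the grouping by project key is the step that triggers the shuffle: every map task emits pairs for many projects, and each reduce task must collect all pairs for the projects assigned to it.</p>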
<h3 id="job-completion-time-jct">Job completion time (JCT)</h3>
<p><img src="https://user-images.githubusercontent.com/6691311/37783061-d7c62970-2e37-11e8-89d5-9ef3da8fd846.png" alt="Figure 1" /></p>
<center>Figure 1</center>
<p>As shown in Figure 1, Nemo outperforms Spark by 2.26x in job completion time, primarily because Nemo’s reduce stage completes faster than Spark’s.</p>
<h3 id="mean-disk-throughput-mbs">Mean disk throughput (MB/s)</h3>
<p><img src="https://user-images.githubusercontent.com/6691311/37783098-f17b55d4-2e37-11e8-9cf3-bf082562c1e6.png" alt="Figure 2" /></p>
<center>Figure 2</center>
<p>To understand the performance difference, we measured the mean throughput of the scratch disks that Nemo and Spark use for handling intermediate data. As depicted in Figure 2, Nemo’s reduce stage achieves much higher disk read throughput with fewer disk seeks. This explains why Nemo’s reduce stage completed more quickly, and it validates the effectiveness of SailfishPolicy.</p>
<p>[1] Sriram Rao, Raghu Ramakrishnan, Adam Silberstein, Mike Ovsiannikov, and Damian Reeves. 2012. Sailfish: a framework for large scale data processing. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC ’12).</p>
<p>[2] Apache Spark. https://spark.apache.org/.</p>
<p>[3] Wikipedia pageview statistics. https://dumps.wikimedia.org/other/pagecounts-raw/.</p>
</div>
<hr>
<ul class="pager">
<li class="previous">
<a href="/blog/2018/03/23/pado-on-nemo/">
<span aria-hidden="true">&larr;</span> Older
</a>
</li>
<li class="next">
<a href="/blog/2019/03/02/release-note-0.1/">
Newer <span aria-hidden="true">&rarr;</span>
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<footer class="footer">
<div class="container">
<p class="text-center">
Nemo 2022 |
Powered by <a href="https://github.com/aksakalli/jekyll-doc-theme">Jekyll Doc Theme</a>
</p>
<!-- <p class="text-muted">Place sticky footer content here.</p> -->
</div>
</footer>
<script>
var baseurl = ''
</script>
<script src="https://code.jquery.com/jquery-1.12.4.min.js"></script>
<script src="/assets/js/bootstrap.min.js "></script>
<script src="/assets/js/typeahead.bundle.min.js "></script>
<script src="/assets/js/main.js "></script>
</body>
</html>