<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Hardware Provisioning - Spark 2.0.0 Documentation</title>
<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
padding-top: 60px;
padding-bottom: 40px;
}
</style>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/bootstrap-responsive.min.css">
<link rel="stylesheet" href="css/main.css">
<script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
<link rel="stylesheet" href="css/pygments-default.css">
</head>
<body>
<!--[if lt IE 7]>
<p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
<![endif]-->
<!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
<div class="navbar navbar-fixed-top" id="topbar">
<div class="navbar-inner">
<div class="container">
<div class="brand"><a href="index.html">
<img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">2.0.0</span>
</div>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li some how.-->
<li><a href="index.html">Overview</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="programming-guide.html">Spark Programming Guide</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="sparkr.html">SparkR (R on Spark)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
<li><a href="api/R/index.html">R</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
</ul>
</li>
<li class="dropdown">
<a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
<li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li><a href="security.html">Security</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li class="divider"></li>
<li><a href="building-spark.html">Building Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects">Supplemental Projects</a></li>
</ul>
</li>
</ul>
<!--<p class="navbar-text pull-right"><span class="version-text">v2.0.0</span></p>-->
</div>
</div>
</div>
<div class="container-wrapper">
<div class="content" id="content">
<h1 class="title">Hardware Provisioning</h1>
<p>A common question Spark developers receive is how to configure hardware for Spark. While the right
hardware will depend on the situation, we make the following recommendations.</p>
<h1 id="storage-systems">Storage Systems</h1>
<p>Because most Spark jobs will likely have to read input data from an external storage system (e.g.
the Hadoop File System, or HBase), it is important to place Spark <strong>as close to this system as
possible</strong>. We recommend the following:</p>
<ul>
<li>
<p>If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark
<a href="spark-standalone.html">standalone mode cluster</a> on the same nodes, and configure Spark&#8217;s and
Hadoop&#8217;s memory and CPU usage to avoid interference (for Hadoop, the relevant options are
<code>mapred.child.java.opts</code> for the per-task memory, and <code>mapred.tasktracker.map.tasks.maximum</code>
and <code>mapred.tasktracker.reduce.tasks.maximum</code> for the number of tasks); a sketch of such a
configuration appears after this list. Alternatively, you can run
Hadoop and Spark on a common cluster manager like <a href="running-on-mesos.html">Mesos</a> or
<a href="running-on-yarn.html">Hadoop YARN</a>.</p>
</li>
<li>
<p>If this is not possible, run Spark on different nodes in the same local-area network as HDFS.</p>
</li>
<li>
<p>For low-latency data stores like HBase, it may be preferable to run computing jobs on different
nodes than the storage system to avoid interference.</p>
</li>
</ul>
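<p>As a concrete illustration of the shared-node setup above, here is a minimal sketch, assuming a
MapReduce v1 cluster (hence the <code>mapred.*</code> options named earlier); the memory size and
task counts are hypothetical values to be tuned for your hardware:</p>
<pre><code>&lt;!-- Hypothetical mapred-site.xml fragment: cap MapReduce's per-task memory
     and task slots so it leaves room for Spark on the same nodes --&gt;
&lt;property&gt;
  &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
  &lt;value&gt;-Xmx1g&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.tasktracker.map.tasks.maximum&lt;/name&gt;
  &lt;value&gt;4&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/name&gt;
  &lt;value&gt;2&lt;/value&gt;
&lt;/property&gt;</code></pre>
<p>The matching limits on the Spark side go in <code>conf/spark-env.sh</code> (for example
<code>SPARK_WORKER_CORES</code> and <code>SPARK_WORKER_MEMORY</code>), so that the two systems
together stay within each node&#8217;s resources.</p>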
<h1 id="local-disks">Local Disks</h1>
<p>While Spark can perform a lot of its computation in memory, it still uses local disks to store
data that doesn&#8217;t fit in RAM, as well as to preserve intermediate output between stages. We
recommend having <strong>4-8 disks</strong> per node, configured <em>without</em> RAID (just as separate mount points).
In Linux, mount the disks with the <a href="http://www.centos.org/docs/5/html/Global_File_System/s2-manage-mountnoatime.html"><code>noatime</code> option</a>
to reduce unnecessary writes. In Spark, <a href="configuration.html">configure</a> the <code>spark.local.dir</code>
property to be a comma-separated list of local directories, one on each disk. If you are running HDFS,
it&#8217;s fine to use the same disks as HDFS.</p>
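<p>As a minimal sketch, assuming four data disks mounted at hypothetical paths
<code>/mnt/disk1</code> through <code>/mnt/disk4</code>:</p>
<pre><code># /etc/fstab: mount each disk as a separate mount point (no RAID), with noatime
/dev/sdb1  /mnt/disk1  ext4  defaults,noatime  0 0
/dev/sdc1  /mnt/disk2  ext4  defaults,noatime  0 0
/dev/sdd1  /mnt/disk3  ext4  defaults,noatime  0 0
/dev/sde1  /mnt/disk4  ext4  defaults,noatime  0 0

# conf/spark-defaults.conf: point Spark's scratch space at all four disks
spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark,/mnt/disk4/spark</code></pre>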
<h1 id="memory">Memory</h1>
<p>In general, Spark can run well with anywhere from <strong>8 GB to hundreds of gigabytes</strong> of memory per
machine. In all cases, we recommend allocating at most 75% of the memory to Spark; leave the
rest for the operating system and buffer cache.</p>
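<p>For example, on a hypothetical machine with 64 GB of RAM, this rule leaves about 48 GB for
Spark, which in <a href="spark-standalone.html">standalone mode</a> could be set as:</p>
<pre><code># conf/spark-env.sh: give the worker 48 GB of the node's 64 GB
SPARK_WORKER_MEMORY=48g</code></pre>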
<p>How much memory you need depends on your application. To determine how much your
application uses for a certain dataset size, load part of your dataset into a Spark RDD and use the
Storage tab of Spark&#8217;s monitoring UI (<code>http://&lt;driver-node&gt;:4040</code>) to see its size in memory.
Note that memory usage is greatly affected by storage level and serialization format &#8211; see
the <a href="tuning.html">tuning guide</a> for tips on how to reduce it.</p>
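<p>A minimal sketch of this measurement in <code>spark-shell</code>, where the HDFS path is a
placeholder for one part of your dataset:</p>
<pre><code>// Cache a sample of the dataset and materialize it, then inspect its
// in-memory size on the Storage tab at http://&lt;driver-node&gt;:4040
val sample = sc.textFile("hdfs:///data/mydataset/part-00000")
sample.cache()
sample.count() // forces the RDD to be computed and cached</code></pre>
<p>Scaling the size reported there by the ratio of the full dataset to the sample gives a rough
estimate of the total memory your application will need.</p>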
<p>Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you
purchase machines with more RAM than this, you can run <em>multiple worker JVMs per node</em>. In
Spark&#8217;s <a href="spark-standalone.html">standalone mode</a>, you can set the number of workers per node
with the <code>SPARK_WORKER_INSTANCES</code> variable in <code>conf/spark-env.sh</code>, and the number of cores
per worker with <code>SPARK_WORKER_CORES</code>.</p>
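<p>For instance, a minimal <code>conf/spark-env.sh</code> sketch for a hypothetical 256 GB,
32-core node, split into two medium-sized worker JVMs instead of one very large one:</p>
<pre><code># Run two workers per node, each with half the cores and well under 200 GB of heap
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=16
SPARK_WORKER_MEMORY=96g</code></pre>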
<h1 id="network">Network</h1>
<p>In our experience, when the data is in memory, many Spark applications are network-bound.
Using a <strong>10 Gigabit</strong> or higher network is the best way to make these applications faster.
This is especially true for &#8220;distributed reduce&#8221; applications such as group-bys, reduce-bys, and
SQL joins. In any given application, you can see how much data Spark shuffles across the network
in the application&#8217;s monitoring UI (<code>http://&lt;driver-node&gt;:4040</code>).</p>
<h1 id="cpu-cores">CPU Cores</h1>
<p>Spark scales well to tens of CPU cores per machine because it performs minimal sharing between
threads. You should likely provision at least <strong>8-16 cores</strong> per machine. Depending on the CPU
cost of your workload, you may also need more: once data is in memory, most applications are
either CPU- or network-bound.</p>
</div>
<!-- /container -->
</div>
<script src="js/vendor/jquery-1.8.0.min.js"></script>
<script src="js/vendor/bootstrap.min.js"></script>
<script src="js/vendor/anchor.min.js"></script>
<script src="js/main.js"></script>
<!-- MathJax Section -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<script>
// Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS.
// We could use "//cdn.mathjax...", but that won't support "file://".
(function(d, script) {
script = d.createElement('script');
script.type = 'text/javascript';
script.async = true;
script.onload = function(){
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
};
script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
d.getElementsByTagName('head')[0].appendChild(script);
}(document));
</script>
</body>
</html>