<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>
Useful Developer Tools | Apache Spark
</title>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/css/bootstrap.min.css" rel="stylesheet"
integrity="sha384-EVSTQN3/azprG1Anm3QDgpJLIm9Nao0Yz1ztcQTwFspd3yD65VohhpuuCOmLASjC" crossorigin="anonymous">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,wght@0,400;0,500;0,700;1,400;1,500;1,700&Courier+Prime:wght@400;700&display=swap" rel="stylesheet">
<link href="/css/custom.css" rel="stylesheet">
<!-- Code highlighter CSS -->
<link href="/css/pygments-default.css" rel="stylesheet">
<link rel="icon" href="/favicon.ico" type="image/x-icon">
</head>
<body class="global">
<nav class="navbar navbar-expand-lg navbar-dark p-0 px-4" style="background: #1D6890;">
<a class="navbar-brand" href="/">
<img src="/images/spark-logo-rev.svg" alt="" width="141" height="72">
</a>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarContent"
aria-controls="navbarContent" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse col-md-12 col-lg-auto pt-4" id="navbarContent">
<ul class="navbar-nav me-auto">
<li class="nav-item">
<a class="nav-link active" aria-current="page" href="/downloads.html">Download</a>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="libraries" role="button" data-bs-toggle="dropdown"
aria-expanded="false">
Libraries
</a>
<ul class="dropdown-menu" aria-labelledby="libraries">
<li><a class="dropdown-item" href="/sql/">SQL and DataFrames</a></li>
<li><a class="dropdown-item" href="/streaming/">Spark Streaming</a></li>
<li><a class="dropdown-item" href="/mllib/">MLlib (machine learning)</a></li>
<li><a class="dropdown-item" href="/graphx/">GraphX (graph)</a></li>
<li>
<hr class="dropdown-divider">
</li>
<li><a class="dropdown-item" href="/third-party-projects.html">Third-Party Projects</a></li>
</ul>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="documentation" role="button" data-bs-toggle="dropdown"
aria-expanded="false">
Documentation
</a>
<ul class="dropdown-menu" aria-labelledby="documentation">
<li><a class="dropdown-item" href="/docs/latest/">Latest Release (Spark 3.3.0)</a></li>
<li><a class="dropdown-item" href="/documentation.html">Older Versions and Other Resources</a></li>
<li><a class="dropdown-item" href="/faq.html">Frequently Asked Questions</a></li>
</ul>
</li>
<li class="nav-item">
<a class="nav-link active" aria-current="page" href="/examples.html">Examples</a>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="community" role="button" data-bs-toggle="dropdown"
aria-expanded="false">
Community
</a>
<ul class="dropdown-menu" aria-labelledby="community">
<li><a class="dropdown-item" href="/community.html">Mailing Lists &amp; Resources</a></li>
<li><a class="dropdown-item" href="/contributing.html">Contributing to Spark</a></li>
<li><a class="dropdown-item" href="/improvement-proposals.html">Improvement Proposals (SPIP)</a>
</li>
<li><a class="dropdown-item" href="https://issues.apache.org/jira/browse/SPARK">Issue Tracker</a>
</li>
<li><a class="dropdown-item" href="/powered-by.html">Powered By</a></li>
<li><a class="dropdown-item" href="/committers.html">Project Committers</a></li>
<li><a class="dropdown-item" href="/history.html">Project History</a></li>
</ul>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="developers" role="button" data-bs-toggle="dropdown"
aria-expanded="false">
Developers
</a>
<ul class="dropdown-menu" aria-labelledby="developers">
<li><a class="dropdown-item" href="/developer-tools.html">Useful Developer Tools</a></li>
<li><a class="dropdown-item" href="/versioning-policy.html">Versioning Policy</a></li>
<li><a class="dropdown-item" href="/release-process.html">Release Process</a></li>
<li><a class="dropdown-item" href="/security.html">Security</a></li>
</ul>
</li>
</ul>
<ul class="navbar-nav ml-auto">
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="apacheFoundation" role="button"
data-bs-toggle="dropdown" aria-expanded="false">
Apache Software Foundation
</a>
<ul class="dropdown-menu" aria-labelledby="apacheFoundation">
<li><a class="dropdown-item" href="https://www.apache.org/">Apache Homepage</a></li>
<li><a class="dropdown-item" href="https://www.apache.org/licenses/">License</a></li>
<li><a class="dropdown-item"
href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
<li><a class="dropdown-item" href="https://www.apache.org/foundation/thanks.html">Thanks</a></li>
<li><a class="dropdown-item" href="https://www.apache.org/security/">Security</a></li>
<li><a class="dropdown-item" href="https://www.apache.org/events/current-event">Event</a></li>
</ul>
</li>
</ul>
</div>
</nav>
<div class="container">
<div class="row mt-4">
<div class="col-12 col-md-9">
<h2>Test coverage</h2>
<p>The Apache Spark community uses various resources to maintain community test coverage.</p>
<h3 id="github_action">GitHub Action</h3>
<p><a href="https://github.com/apache/spark/actions">GitHub Action</a> provides the following on Ubuntu 20.04.</p>
<ul>
<li>Scala 2.12/2.13 SBT build with Java 8</li>
<li>Scala 2.12 Maven build with Java 11/17</li>
<li>Java/Scala/Python/R unit tests with Java 8/Scala 2.12/SBT</li>
<li>TPC-DS benchmark with scale factor 1</li>
<li>JDBC Docker integration tests</li>
<li>Kubernetes integration tests</li>
<li>Daily Java/Scala/Python/R unit tests with Java 11/17 and Scala 2.12/SBT</li>
</ul>
<h3 id="appveyor">AppVeyor</h3>
<p><a href="https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark">AppVeyor</a> provides the following on Windows.</p>
<ul>
<li>R unit tests with Java 8/Scala 2.12/SBT</li>
</ul>
<h3 id="scaleway">Scaleway</h3>
<p><a href="https://www.scaleway.com">Scaleway</a> provides the following on MacOS and Apple Silicon.</p>
<ul>
<li><a href="https://apache-spark.s3.fr-par.scw.cloud/index.html">Java/Scala/Python/R unit tests with Java 17/Scala 2.12/SBT</a></li>
<li>K8s integration tests (TBD)</li>
</ul>
<h2>Useful developer tools</h2>
<h3 id="reducing-build-times">Reducing build times</h3>
<h4>SBT: Avoiding re-creating the assembly JAR</h4>
<p>Spark&#8217;s default build strategy is to assemble a jar including all of its dependencies. This can
be cumbersome when doing iterative development. When developing locally, it is possible to create
an assembly jar including all of Spark&#8217;s dependencies and then re-package only Spark itself
when making changes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ build/sbt clean package
$ ./bin/spark-shell
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
# ... do some local development ... #
$ build/sbt compile
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ build/sbt ~compile
</code></pre></div></div>
<h3>Building submodules individually</h3>
<p>For instance, you can build the Spark Core module using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ # sbt
$ build/sbt
&gt; project core
&gt; package
$ # or you can build the spark-core module with sbt directly using:
$ build/sbt core/package
$ # Maven
$ build/mvn package -DskipTests -pl core
</code></pre></div></div>
<p><a name="individual-tests"></a></p>
<h3 id="running-individual-tests">Running Individual Tests</h3>
<p>When developing locally, it&#8217;s often convenient to run a single test or a few tests, rather than running the entire test suite.</p>
<h4>Testing with SBT</h4>
<p>The fastest way to run individual tests is to use the <code class="language-plaintext highlighter-rouge">sbt</code> console. Keep an <code class="language-plaintext highlighter-rouge">sbt</code> console open, and use it to re-run tests as necessary. For example, to run all of the tests in a particular project, e.g., <code class="language-plaintext highlighter-rouge">core</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ build/sbt
&gt; project core
&gt; test
</code></pre></div></div>
<p>You can run a single test suite using the <code class="language-plaintext highlighter-rouge">testOnly</code> command. For example, to run the DAGSchedulerSuite:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; testOnly org.apache.spark.scheduler.DAGSchedulerSuite
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">testOnly</code> command accepts wildcards; e.g., you can also run the <code class="language-plaintext highlighter-rouge">DAGSchedulerSuite</code> with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; testOnly *DAGSchedulerSuite
</code></pre></div></div>
<p>Or you could run all of the tests in the scheduler package:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; testOnly org.apache.spark.scheduler.*
</code></pre></div></div>
<p>If you&#8217;d like to run just a single test in the <code class="language-plaintext highlighter-rouge">DAGSchedulerSuite</code>, e.g., a test that includes &#8220;SPARK-12345&#8221; in the name, you run the following command in the sbt console:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; testOnly *DAGSchedulerSuite -- -z "SPARK-12345"
</code></pre></div></div>
<p>If you&#8217;d prefer, you can run all of these commands on the command line (but this will be slower than running tests using an open console). To do this, you need to surround <code class="language-plaintext highlighter-rouge">testOnly</code> and the following arguments in quotes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ build/sbt "core/testOnly *DAGSchedulerSuite -- -z SPARK-12345"
</code></pre></div></div>
<p>For more about how to run individual tests with sbt, see the <a href="https://www.scala-sbt.org/0.13/docs/Testing.html">sbt documentation</a>.</p>
<h4>Testing with Maven</h4>
<p>With Maven, you can use the <code class="language-plaintext highlighter-rouge">-DwildcardSuites</code> flag to run individual Scala tests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.DAGSchedulerSuite test
</code></pre></div></div>
<p>You need <code class="language-plaintext highlighter-rouge">-Dtest=none</code> to avoid running the Java tests. For more information about the ScalaTest Maven Plugin, refer to the <a href="http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin">ScalaTest documentation</a>.</p>
<p>To run individual Java tests, you can use the <code class="language-plaintext highlighter-rouge">-Dtest</code> flag:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>build/mvn test -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite test
</code></pre></div></div>
<h4>Testing PySpark</h4>
<p>To run individual PySpark tests, you can use the <code class="language-plaintext highlighter-rouge">run-tests</code> script under the <code class="language-plaintext highlighter-rouge">python</code> directory. Test cases are located in the <code class="language-plaintext highlighter-rouge">tests</code> package under each PySpark package.
Note that if you change the Scala or Python side of Apache Spark, you need to rebuild Apache Spark before running the PySpark tests so that your changes take effect;
the PySpark test script does not rebuild Spark automatically.</p>
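<p>For example, a typical rebuild before re-running the PySpark tests might look like the following (a minimal sketch; the <code class="language-plaintext highlighter-rouge">-Phive</code> profile is an assumption here and should be replaced with whatever profiles your tests need):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rebuild Spark so that the PySpark tests run against your latest Scala/Python changes
$ build/sbt -Phive clean package
</code></pre></div></div>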
<p>Also, note that there is an ongoing issue with using PySpark on macOS High Sierra and above. <code class="language-plaintext highlighter-rouge">OBJC_DISABLE_INITIALIZE_FORK_SAFETY</code>
should be set to <code class="language-plaintext highlighter-rouge">YES</code> in order to run some of the tests.
See the <a href="https://issues.apache.org/jira/browse/SPARK-25473">PySpark issue</a> and <a href="https://bugs.python.org/issue33725">Python issue</a> for more details.</p>
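<p>A minimal sketch of applying this workaround before running the tests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Work around the macOS fork-safety check (see SPARK-25473) for the current shell session
$ export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests'
</code></pre></div></div>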
<p>To run test cases in a specific module:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python/run-tests --testnames pyspark.sql.tests.test_arrow
</code></pre></div></div>
<p>To run test cases in a specific class:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests'
</code></pre></div></div>
<p>To run a single test case in a specific class:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python/run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests.test_null_conversion'
</code></pre></div></div>
<p>You can also run doctests in a specific module:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python/run-tests --testnames pyspark.sql.dataframe
</code></pre></div></div>
<p>Lastly, there is another script called <code class="language-plaintext highlighter-rouge">run-tests-with-coverage</code> in the same location, which generates a coverage report for PySpark tests. It accepts the same arguments as <code class="language-plaintext highlighter-rouge">run-tests</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python
...
Name Stmts Miss Branch BrPart Cover
-------------------------------------------------------------------
pyspark/__init__.py 42 4 8 2 84%
pyspark/_globals.py 16 3 4 2 75%
...
Generating HTML files for PySpark coverage under /.../spark/python/test_coverage/htmlcov
</code></pre></div></div>
<p>You can inspect the coverage report visually via the HTML files under <code class="language-plaintext highlighter-rouge">/.../spark/python/test_coverage/htmlcov</code>.</p>
<p>Please check the other available options via <code class="language-plaintext highlighter-rouge">python/run-tests[-with-coverage] --help</code>.</p>
<h4>Testing K8S</h4>
<p>Although GitHub Actions provides both K8s unit test and integration test coverage, you can also run the tests locally. For example, the Volcano batch scheduler integration test has to be run manually. Please refer to the integration test documentation for details.</p>
<p><a href="https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md">https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md</a></p>
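<p>As a rough sketch, a local run with sbt looks like the following; the exact profiles, properties, and deploy modes (the Docker Desktop deploy mode below is an assumption) are described in the README above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Build Spark with Kubernetes support and run the K8s integration test suite
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests \
    -Dspark.kubernetes.test.deployMode=docker-desktop \
    "kubernetes-integration-tests/test"
</code></pre></div></div>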
<h3>Testing with GitHub Actions workflows</h3>
<p>Apache Spark leverages GitHub Actions for continuous integration and a wide range of automation. The Apache Spark repository provides several GitHub Actions workflows for developers to run before creating a pull request.</p>
<p><a name="github-workflow-benchmarks"></a></p>
<h4>Running benchmarks in your forked repository</h4>
<p>The Apache Spark repository provides an easy way to run benchmarks in GitHub Actions. When you update benchmark results in a pull request, it is recommended to use GitHub Actions to run and generate the benchmark results so that they are produced in as consistent an environment as possible.</p>
<ul>
<li>Click the &#8220;Actions&#8221; tab in your forked repository.</li>
<li>Select the &#8220;Run benchmarks&#8221; workflow in the &#8220;All workflows&#8221; list.</li>
<li>Click the &#8220;Run workflow&#8221; button and enter the fields appropriately as below:
<ul>
<li><strong>Benchmark class</strong>: the benchmark class which you wish to run. It allows a glob pattern. For example, <code class="language-plaintext highlighter-rouge">org.apache.spark.sql.*</code>.</li>
<li><strong>JDK version</strong>: Java version you want to run the benchmark with. For example, <code class="language-plaintext highlighter-rouge">11</code>.</li>
<li><strong>Failfast</strong>: indicates whether you want to stop the benchmark and workflow as soon as it fails. When <code class="language-plaintext highlighter-rouge">true</code>, it fails right away. When <code class="language-plaintext highlighter-rouge">false</code>, it runs all benchmarks regardless of failures.</li>
<li><strong>Number of job splits</strong>: splits the benchmark jobs into the specified number of groups and runs them in parallel. This is particularly useful for working around the time limits on workflows and jobs in GitHub Actions.</li>
</ul>
</li>
<li>Once the &#8220;Run benchmarks&#8221; workflow is finished, click the workflow and download the benchmark results under &#8220;Artifacts&#8221;.</li>
<li>Go to the root directory of your Apache Spark repository, and unzip/untar the downloaded files; this updates the benchmark results by placing the files in the appropriate locations (see the sketch below).</li>
</ul>
<p><img src="/images/running-benchamrks-using-github-actions.png" style="width: 100%; max-width: 800px;" /></p>
<h3>ScalaTest issues</h3>
<p>If the following error occurs when running ScalaTest:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>An internal error occurred during: "Launching XYZSuite.scala".
java.lang.NullPointerException
</code></pre></div></div>
<p>it is due to an incorrect Scala library in the classpath. To fix it:</p>
<ul>
<li>Right click on project</li>
<li>Select <code class="language-plaintext highlighter-rouge">Build Path | Configure Build Path</code></li>
<li><code class="language-plaintext highlighter-rouge">Add Library | Scala Library</code></li>
<li>Remove <code class="language-plaintext highlighter-rouge">scala-library-2.10.4.jar - lib_managed\jars</code></li>
</ul>
<p>In the event of &#8220;Could not find resource path for Web UI: org/apache/spark/ui/static&#8221;,
it&#8217;s due to a classpath issue (some classes were probably not compiled). To fix this, it is
sufficient to run a test from the command line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>build/sbt "testOnly org.apache.spark.rdd.SortingSuite"
</code></pre></div></div>
<h3>Running different test permutations on Jenkins</h3>
<p>When running tests for a pull request on Jenkins, you can add special phrases to the title of
your pull request to change testing behavior. This includes:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">[test-maven]</code> - signals to test the pull request using maven</li>
<li><code class="language-plaintext highlighter-rouge">[test-hadoop2.7]</code> - signals to test using Spark&#8217;s Hadoop 2.7 profile</li>
<li><code class="language-plaintext highlighter-rouge">[test-hadoop3.2]</code> - signals to test using Spark&#8217;s Hadoop 3.2 profile</li>
<li><code class="language-plaintext highlighter-rouge">[test-hadoop3.2][test-java11]</code> - signals to test using Spark&#8217;s Hadoop 3.2 profile with JDK 11</li>
<li><code class="language-plaintext highlighter-rouge">[test-hive1.2]</code> - signals to test using Spark&#8217;s Hive 1.2 profile</li>
<li><code class="language-plaintext highlighter-rouge">[test-hive2.3]</code> - signals to test using Spark&#8217;s Hive 2.3 profile</li>
</ul>
<h3>Binary compatibility</h3>
<p>To ensure binary compatibility, Spark uses <a href="https://github.com/typesafehub/migration-manager">MiMa</a>.</p>
<h4>Ensuring binary compatibility</h4>
<p>When working on an issue, it&#8217;s always a good idea to check that your changes do
not introduce binary incompatibilities before opening a pull request.</p>
<p>You can do so by running the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dev/mima
</code></pre></div></div>
<p>A binary incompatibility reported by MiMa might look like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[error] method this(org.apache.spark.sql.Dataset)Unit in class org.apache.spark.SomeClass does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
</code></pre></div></div>
<p>If you open a pull request containing binary incompatibilities anyway, Jenkins
will remind you by failing the test build with the following message:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Test build #xx has finished for PR yy at commit ffffff.
This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.
</code></pre></div></div>
<h4>Solving a binary incompatibility</h4>
<p>If you believe that your binary incompatibilities are justified or that MiMa
reported false positives (e.g. the reported binary incompatibilities are about a
non-user facing API), you can filter them out by adding an exclusion in
<a href="https://github.com/apache/spark/blob/master/project/MimaExcludes.scala">project/MimaExcludes.scala</a>
containing what was suggested by the MiMa report and a comment containing the
JIRA number of the issue you&#8217;re working on as well as its title.</p>
<p>For the problem described above, we might add the following:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// [SPARK-zz][CORE] Fix an issue</span>
<span class="nv">ProblemFilters</span><span class="o">.</span><span class="py">exclude</span><span class="o">[</span><span class="kt">DirectMissingMethodProblem</span><span class="o">](</span><span class="s">"org.apache.spark.SomeClass.this"</span><span class="o">)</span></code></pre></figure>
<p>Otherwise, you will have to resolve those incompatibilities before opening or
updating your pull request. Usually, the problems reported by MiMa are
self-explanatory and revolve around missing members (methods or fields) that
you will have to add back in order to maintain binary compatibility.</p>
<h3>Checking out pull requests</h3>
<p>Git provides a mechanism for fetching remote pull requests into your own local repository.
This is useful when reviewing code or testing patches locally. If you haven&#8217;t yet cloned the
Spark Git repository, use the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/apache/spark.git
$ cd spark
</code></pre></div></div>
<p>To enable this feature you&#8217;ll need to configure the git remote repository to fetch pull request
data. Do this by modifying the <code class="language-plaintext highlighter-rouge">.git/config</code> file inside of your Spark directory. The remote may
be named something other than &#8220;origin&#8221; if you&#8217;ve cloned it under a different name:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[remote "origin"]
url = git@github.com:apache/spark.git
fetch = +refs/heads/*:refs/remotes/origin/*
fetch = +refs/pull/*/head:refs/remotes/origin/pr/* # Add this line
</code></pre></div></div>
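<p>Equivalently, you can add the extra fetch refspec from the command line instead of editing the file by hand:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Adds the same fetch line to .git/config; adjust "origin" if your remote has a different name
$ git config --add remote.origin.fetch '+refs/pull/*/head:refs/remotes/origin/pr/*'
</code></pre></div></div>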
<p>Once you&#8217;ve done this, you can fetch remote pull requests:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fetch remote pull requests
$ git fetch origin
# Checkout a remote pull request
$ git checkout origin/pr/112
# Create a local branch from a remote pull request
$ git checkout origin/pr/112 -b new-branch
</code></pre></div></div>
<h3>Generating dependency graphs</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ # sbt
$ build/sbt dependencyTree
$ # Maven
$ build/mvn -DskipTests install
$ build/mvn dependency:tree
</code></pre></div></div>
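<p>The full tree can be very large, so it is often useful to scope it to a single module or filter for one dependency; a small sketch (the Guava filter is just an illustration):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ # sbt: dependency tree for a single module, saved to a file
$ build/sbt "core/dependencyTree" &gt; core-deps.txt
$ # Maven: restrict to one module and show only matching artifacts
$ build/mvn dependency:tree -pl core -Dincludes=com.google.guava:guava
</code></pre></div></div>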
<h3>Organizing imports</h3>
<p>You can use an <a href="https://plugins.jetbrains.com/plugin/7350">IntelliJ Imports Organizer</a>
from Aaron Davidson to help you organize the imports in
your code. It can be configured to match the import ordering from the style guide.</p>
<h3>Formatting code</h3>
<p>To format Scala code, run the following command prior to submitting a PR:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./dev/scalafmt
</code></pre></div></div>
<p>By default, this script will format files that differ from git master. For more information, see the <a href="https://scalameta.org/scalafmt/">scalafmt documentation</a>, but use the existing script, not a locally installed version of scalafmt.</p>
<h3>IDE setup</h3>
<h4>IntelliJ</h4>
<p>While many of the Spark developers use SBT or Maven on the command line, the most common IDE we
use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get
free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from <code class="language-plaintext highlighter-rouge">Preferences &gt; Plugins</code>.</p>
<p>To create a Spark project for IntelliJ:</p>
<ul>
<li>Download IntelliJ and install the
<a href="https://confluence.jetbrains.com/display/SCA/Scala+Plugin+for+IntelliJ+IDEA">Scala plug-in for IntelliJ</a>.</li>
<li>Go to <code class="language-plaintext highlighter-rouge">File -&gt; Import Project</code>, locate the spark source directory, and select &#8220;Maven Project&#8221;.</li>
<li>In the Import wizard, it&#8217;s fine to leave settings at their default. However, it is usually useful
to enable &#8220;Import Maven projects automatically&#8221;, since changes to the project structure will
automatically update the IntelliJ project.</li>
<li>As documented in <a href="https://spark.apache.org/docs/latest/building-spark.html">Building Spark</a>,
some build configurations require specific profiles to be
enabled. The same profiles that are enabled with <code class="language-plaintext highlighter-rouge">-P[profile name]</code> above may be enabled on the
Profiles screen in the Import wizard. For example, if developing for Hadoop 2.7 with YARN support,
enable profiles <code class="language-plaintext highlighter-rouge">yarn</code> and <code class="language-plaintext highlighter-rouge">hadoop-2.7</code>. These selections can be changed later by accessing the
&#8220;Maven Projects&#8221; tool window from the View menu, and expanding the Profiles section.</li>
</ul>
<p>Other tips:</p>
<ul>
<li>&#8220;Rebuild Project&#8221; can fail the first time the project is compiled, because generated source files
are not created automatically. Try clicking the &#8220;Generate Sources and Update Folders For All
Projects&#8221; button in the &#8220;Maven Projects&#8221; tool window to manually generate these sources.</li>
<li>The version of Maven bundled with IntelliJ may not be new enough for Spark. If that happens,
the action &#8220;Generate Sources and Update Folders For All Projects&#8221; could fail silently.
Please remember to reset the Maven home directory
(<code class="language-plaintext highlighter-rouge">Preference -&gt; Build, Execution, Deployment -&gt; Maven -&gt; Maven home directory</code>) of your project to
point to a newer installation of Maven. You may also build Spark with the script <code class="language-plaintext highlighter-rouge">build/mvn</code> first.
If the script cannot locate a new enough Maven installation, it will download and install a recent
version of Maven to folder <code class="language-plaintext highlighter-rouge">build/apache-maven-&lt;version&gt;/</code>.</li>
<li>Some of the modules have pluggable source directories based on Maven profiles (i.e. to support
both Scala 2.11 and 2.10 or to allow cross building against different versions of Hive). In some
cases IntelliJ does not correctly detect use of the build-helper-maven-plugin to add source directories.
In these cases, you may need to add source locations explicitly to compile the entire project. If
so, open the &#8220;Project Settings&#8221; and select &#8220;Modules&#8221;. Based on your selected Maven profiles, you
may need to add source folders to the following modules:
<ul>
<li>spark-hive: add v0.13.1/src/main/scala</li>
<li>spark-streaming-flume-sink: add target\scala-2.11\src_managed\main\compiled_avro</li>
<li>spark-catalyst: add target\scala-2.11\src_managed\main</li>
</ul>
</li>
<li>Compilation may fail with an error like &#8220;scalac: bad option:
-P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar&#8221;.
If so, go to Preferences &gt; Build, Execution, Deployment &gt; Scala Compiler and clear the &#8220;Additional
compiler options&#8221; field. It will work then although the option will come back when the project
reimports. If you try to build any of the projects using quasiquotes (e.g., sql) then you will
need to make that jar a compiler plugin (just below &#8220;Additional compiler options&#8221;).
Otherwise you will see errors like:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/Users/irashid/github/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Error:(147, 9) value q is not a member of StringContext
Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
q"""
^
</code></pre></div> </div>
</li>
</ul>
<h4>Debug Spark remotely</h4>
<p>This part will show you how to debug Spark remotely with IntelliJ.</p>
<h5>Set up remote debug configuration</h5>
<p>Follow <i>Run &gt; Edit Configurations &gt; + &gt; Remote</i> to open a default Remote Configuration template:
<img src="/images/intellij_remote_debug_configuration.png" style="width: 75%; max-width: 660px;" /></p>
<p>Normally, the default values should be good enough to use. Make sure that you choose <b>Listen to remote JVM</b>
as <i>Debugger mode</i> and select the right JDK version to generate proper <i>Command line arguments for remote JVM</i>.</p>
<p>Once you finish the configuration and save it, you can follow <i>Run &gt; Run &gt; Your_Remote_Debug_Name &gt; Debug</i> to start the remote debug
process and wait for the SBT console to connect:</p>
<p><img src="/images/intellij_start_remote_debug.png" style="width: 75%; max-width: 660px;" /></p>
<h5>Trigger the remote debugging</h5>
<p>In general, there are 2 steps:</p>
<ol>
<li>Set JVM options using the <i>Command line arguments for remote JVM</i> generated in the last step.</li>
<li>Start the Spark execution (SBT test, pyspark test, spark-shell, etc.)</li>
</ol>
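<p>For example, if the execution you want to debug is a spark-shell session rather than a test, step 1 can be done by passing the generated arguments as driver JVM options (a sketch; use the exact JDWP string IntelliJ generated for you):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./bin/spark-shell --driver-java-options \
    "-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,address=localhost:5005"
</code></pre></div></div>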
<p>The following is an example of how to trigger remote debugging using SBT unit tests.</p>
<p>Start the SBT console:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./build/sbt
</code></pre></div></div>
<p>Switch to the project containing the target test, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbt &gt; project core
</code></pre></div></div>
<p>Copy and paste the <i>Command line arguments for remote JVM</i>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbt &gt; set javaOptions in Test += "-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,address=localhost:5005"
</code></pre></div></div>
<p>Set breakpoints with IntelliJ and run the test with SBT, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbt &gt; testOnly *SparkContextSuite -- -t "Only one SparkContext may be active at a time"
</code></pre></div></div>
<p>The connection to IntelliJ is established successfully when you see &#8220;Connected to the target VM,
address: &#8216;localhost:5005&#8217;, transport: &#8216;socket&#8217;&#8221; in the IntelliJ console. You can then
debug in IntelliJ as usual.</p>
<p>To exit remote debug mode (so that you don&#8217;t have to keep starting the remote debugger),
type &#8220;session clear&#8221; in the SBT console while you&#8217;re in a project.</p>
<h4>Eclipse</h4>
<p>Eclipse can be used to develop and test Spark. The following configuration is known to work:</p>
<ul>
<li>Eclipse Juno</li>
<li><a href="http://scala-ide.org/">Scala IDE 4.0</a></li>
<li>Scala Test</li>
</ul>
<p>The easiest way is to download the Scala IDE bundle from the Scala IDE download page. It comes
pre-installed with ScalaTest. Alternatively, use the Scala IDE update site or Eclipse Marketplace.</p>
<p>SBT can create Eclipse <code class="language-plaintext highlighter-rouge">.project</code> and <code class="language-plaintext highlighter-rouge">.classpath</code> files. To create these files for each Spark sub
project, use this command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbt/sbt eclipse
</code></pre></div></div>
<p>To import a specific project, e.g. spark-core, select <code class="language-plaintext highlighter-rouge">File | Import | Existing Projects into Workspace</code>.
Do not select &#8220;Copy projects into workspace&#8221;.</p>
<p>If you want to develop on Scala 2.10 you need to configure a Scala installation for the
exact Scala version that’s used to compile Spark.
Since Scala IDE bundles the latest versions (2.10.5 and 2.11.8 at this point), you need to add one
in <code class="language-plaintext highlighter-rouge">Eclipse Preferences -&gt; Scala -&gt; Installations</code> by pointing to the <code class="language-plaintext highlighter-rouge">lib/</code> directory of your
Scala 2.10.5 distribution. Once this is done, select all Spark projects and right-click,
choose <code class="language-plaintext highlighter-rouge">Scala -&gt; Set Scala Installation</code> and point to the 2.10.5 installation.
This should clear all errors about invalid cross-compiled libraries. A clean build should succeed now.</p>
<p>ScalaTest can execute unit tests by right clicking a source file and selecting <code class="language-plaintext highlighter-rouge">Run As | Scala Test</code>.</p>
<p>If Java memory errors occur, it might be necessary to increase the settings in <code class="language-plaintext highlighter-rouge">eclipse.ini</code>
in the Eclipse install directory. Increase the following setting as needed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--launcher.XXMaxPermSize
256M
</code></pre></div></div>
<p><a name="nightly-builds"></a></p>
<h3>Nightly builds</h3>
<p>Spark publishes SNAPSHOT releases of its Maven artifacts for both master and maintenance
branches on a nightly basis. Note that SNAPSHOT artifacts are ephemeral and may change or
be removed. To link against them, add the ASF snapshot repository at
<a href="https://repository.apache.org/snapshots/">https://repository.apache.org/snapshots/</a> to your build.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>groupId: org.apache.spark
artifactId: spark-core_2.12
version: 3.0.0-SNAPSHOT
</code></pre></div></div>
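<p>For example, wiring this into an sbt build amounts to adding a resolver and depending on a SNAPSHOT version (a sketch; the version shown is illustrative and changes as branches evolve):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// build.sbt
resolvers += "Apache Snapshots" at "https://repository.apache.org/snapshots/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0-SNAPSHOT"
</code></pre></div></div>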
<p><a name="profiling"></a></p>
<h3>Profiling Spark applications using YourKit</h3>
<p>Here are instructions on profiling Spark applications using YourKit Java Profiler.</p>
<h4>On Spark EC2 images</h4>
<ul>
<li>After logging into the master node, download the YourKit Java Profiler for Linux from the
<a href="https://www.yourkit.com/download/index.jsp">YourKit downloads page</a>.
This file is pretty big (~100 MB) and the YourKit download site is somewhat slow, so you may
consider mirroring this file or including it on a custom AMI.</li>
<li>Unzip this file somewhere (in <code class="language-plaintext highlighter-rouge">/root</code> in our case): <code class="language-plaintext highlighter-rouge">unzip YourKit-JavaProfiler-2017.02-b66.zip</code></li>
<li>Copy the expanded YourKit files to each node using copy-dir: <code class="language-plaintext highlighter-rouge">~/spark-ec2/copy-dir /root/YourKit-JavaProfiler-2017.02</code></li>
<li>Configure the Spark JVMs to use the YourKit profiling agent by editing <code class="language-plaintext highlighter-rouge">~/spark/conf/spark-env.sh</code>
and adding the lines
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SPARK_DAEMON_JAVA_OPTS+=" -agentpath:/root/YourKit-JavaProfiler-2017.02/bin/linux-x86-64/libyjpagent.so=sampling"
export SPARK_DAEMON_JAVA_OPTS
SPARK_EXECUTOR_OPTS+=" -agentpath:/root/YourKit-JavaProfiler-2017.02/bin/linux-x86-64/libyjpagent.so=sampling"
export SPARK_EXECUTOR_OPTS
</code></pre></div> </div>
</li>
<li>Copy the updated configuration to each node: <code class="language-plaintext highlighter-rouge">~/spark-ec2/copy-dir ~/spark/conf/spark-env.sh</code></li>
<li>Restart your Spark cluster: <code class="language-plaintext highlighter-rouge">~/spark/bin/stop-all.sh</code> and <code class="language-plaintext highlighter-rouge">~/spark/bin/start-all.sh</code></li>
<li>By default, the YourKit profiler agents use ports <code class="language-plaintext highlighter-rouge">10001-10010</code>. To connect the YourKit desktop
application to the remote profiler agents, you&#8217;ll have to open these ports in the cluster&#8217;s EC2
security groups. To do this, sign into the AWS Management Console. Go to the EC2 section and
select <code class="language-plaintext highlighter-rouge">Security Groups</code> from the <code class="language-plaintext highlighter-rouge">Network &amp; Security</code> section on the left side of the page.
Find the security groups corresponding to your cluster; if you launched a cluster named <code class="language-plaintext highlighter-rouge">test_cluster</code>,
then you will want to modify the settings for the <code class="language-plaintext highlighter-rouge">test_cluster-slaves</code> and <code class="language-plaintext highlighter-rouge">test_cluster-master</code>
security groups. For each group, select it from the list, click the <code class="language-plaintext highlighter-rouge">Inbound</code> tab, and create a
new <code class="language-plaintext highlighter-rouge">Custom TCP Rule</code> opening the port range <code class="language-plaintext highlighter-rouge">10001-10010</code>. Finally, click <code class="language-plaintext highlighter-rouge">Apply Rule Changes</code>.
Make sure to do this for both security groups.
Note: by default, <code class="language-plaintext highlighter-rouge">spark-ec2</code> re-uses security groups: if you stop this cluster and launch another
cluster with the same name, your security group settings will be re-used.</li>
<li>Launch the YourKit profiler on your desktop.</li>
<li>Select &#8220;Connect to remote application&#8230;&#8221; from the welcome screen and enter the address of your Spark master or worker machine, e.g. <code class="language-plaintext highlighter-rouge">ec2--.compute-1.amazonaws.com</code></li>
<li>YourKit should now be connected to the remote profiling agent. It may take a few moments for profiling information to appear.</li>
</ul>
<p>Please see the YourKit documentation for the full list of profiler agent
<a href="https://www.yourkit.com/docs/java/help/startup_options.jsp">startup options</a>.</p>
<h4>In Spark unit tests</h4>
<p>When running Spark tests through SBT, add <code class="language-plaintext highlighter-rouge">javaOptions in Test += "-agentpath:/path/to/yjp"</code>
to <code class="language-plaintext highlighter-rouge">SparkBuild.scala</code> to launch the tests with the YourKit profiler agent enabled.<br />
The platform-specific paths to the profiler agents are listed in the
<a href="http://www.yourkit.com/docs/80/help/agent.jsp">YourKit documentation</a>.</p>
</div>
<div class="col-12 col-md-3">
<div class="news" style="margin-bottom: 20px;">
<h5>Latest News</h5>
<ul class="list-unstyled">
<li><a href="/news/spark-3-2-2-released.html">Spark 3.2.2 released</a>
<span class="small">(Jul 17, 2022)</span></li>
<li><a href="/news/spark-3-3-0-released.html">Spark 3.3.0 released</a>
<span class="small">(Jun 16, 2022)</span></li>
<li><a href="/news/sigmod-system-award.html">SIGMOD Systems Award for Apache Spark</a>
<span class="small">(May 13, 2022)</span></li>
<li><a href="/news/3-1-3-released.html">Spark 3.1.3 released</a>
<span class="small">(Feb 18, 2022)</span></li>
</ul>
<p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p>
</div>
<div style="text-align:center; margin-bottom: 20px;">
<a href="https://www.apache.org/events/current-event.html">
<img src="https://www.apache.org/events/current-event-234x60.png" style="max-width: 100%;"/>
</a>
</div>
<div class="hidden-xs hidden-sm">
<a href="/downloads.html" class="btn btn-cta btn-lg d-grid" style="margin-bottom: 30px;">
Download Spark
</a>
<p style="font-size: 16px; font-weight: 500; color: #555;">
Built-in Libraries:
</p>
<ul class="list-none">
<li><a href="/sql/">SQL and DataFrames</a></li>
<li><a href="/streaming/">Spark Streaming</a></li>
<li><a href="/mllib/">MLlib (machine learning)</a></li>
<li><a href="/graphx/">GraphX (graph)</a></li>
</ul>
<a href="/third-party-projects.html">Third-Party Projects</a>
</div>
</div>
</div>
<footer class="small">
<hr>
Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered
trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
See guidance on use of Apache Spark <a href="/trademarks.html">trademarks</a>.
All other marks mentioned may be trademarks or registered trademarks of their respective owners.
Copyright &copy; 2018 The Apache Software Foundation, Licensed under the
<a href="https://www.apache.org/licenses/">Apache License, Version 2.0</a>.
</footer>
</div>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/js/bootstrap.bundle.min.js"
integrity="sha384-MrcW6ZMFYlzcLA8Nl+NtUVF0sA7MsXsP1UyJoMp4YLEuNSfAP+JcXn/tWtIaxVXM"
crossorigin="anonymous"></script>
<script src="https://code.jquery.com/jquery.js"></script>
<script src="/js/lang-tabs.js"></script>
<script src="/js/downloads.js"></script>
</body>
</html>