<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Contributing to PySpark &#8212; PySpark 3.1.1 documentation</title>
<link rel="stylesheet" href="../_static/css/index.73d71520a4ca3b99cfee5594769eaaae.css">
<link rel="stylesheet"
href="../_static/vendor/fontawesome/5.13.0/css/all.min.css">
<link rel="preload" as="font" type="font/woff2" crossorigin
href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2">
<link rel="preload" as="font" type="font/woff2" crossorigin
href="../_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2">
<link rel="stylesheet"
href="../_static/vendor/open-sans_all/1.44.1/index.css">
<link rel="stylesheet"
href="../_static/vendor/lato_latin-ext/1.44.1/index.css">
<link rel="stylesheet" href="../_static/basic.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" type="text/css" href="../_static/css/pyspark.css" />
<link rel="preload" as="script" href="../_static/js/index.3da636dd464baa7582d2.js">
<script id="documentation_options" data-url_root="../" src="../_static/documentation_options.js"></script>
<script src="../_static/jquery.js"></script>
<script src="../_static/underscore.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/language_data.js"></script>
<script src="../_static/copybutton.js"></script>
<script crossorigin="anonymous" integrity="sha256-Ae2Vz/4ePdIu6ZyI/5ZGsYnb+m0JlOmKPjt6XZ9JJkA=" src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.4/require.min.js"></script>
<script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config">MathJax.Hub.Config({"tex2jax": {"inlineMath": [["$", "$"], ["\\(", "\\)"]], "processEscapes": true, "ignoreClass": "document", "processClass": "math|output_area"}})</script>
<link rel="canonical" href="https://spark.apache.org/docs/latest/api/python/development/contributing.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Testing PySpark" href="testing.html" />
<link rel="prev" title="Development" href="index.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="en" />
<!-- Matomo -->
<script type="text/javascript">
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
_paq.push(["disableCookies"]);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '40']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<!-- End Matomo Code -->
</head>
<body data-spy="scroll" data-target="#bd-toc-nav" data-offset="80">
<nav class="navbar navbar-light navbar-expand-lg bg-light fixed-top bd-navbar" id="navbar-main">
<div class="container-xl">
<a class="navbar-brand" href="../index.html">
<img src="../_static/spark-logo-reverse.png" class="logo" alt="logo" />
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbar-menu" aria-controls="navbar-menu" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div id="navbar-menu" class="col-lg-9 collapse navbar-collapse">
<ul id="navbar-main-elements" class="navbar-nav mr-auto">
<li class="nav-item ">
<a class="nav-link" href="../getting_started/index.html">Getting Started</a>
</li>
<li class="nav-item ">
<a class="nav-link" href="../user_guide/index.html">User Guide</a>
</li>
<li class="nav-item ">
<a class="nav-link" href="../reference/index.html">API Reference</a>
</li>
<li class="nav-item active">
<a class="nav-link" href="index.html">Development</a>
</li>
<li class="nav-item ">
<a class="nav-link" href="../migration_guide/index.html">Migration Guide</a>
</li>
</ul>
<ul class="navbar-nav">
</ul>
</div>
</div>
</nav>
<div class="container-xl">
<div class="row">
<div class="col-12 col-md-3 bd-sidebar"><form class="bd-search d-flex align-items-center" action="../search.html" method="get">
<i class="icon fas fa-search"></i>
<input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" >
</form>
<nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation">
<div class="bd-toc-item active">
<ul class="nav bd-sidenav">
<li class="active">
<a href="">Contributing to PySpark</a>
</li>
<li class="">
<a href="testing.html">Testing PySpark</a>
</li>
<li class="">
<a href="debugging.html">Debugging PySpark</a>
</li>
<li class="">
<a href="setting_ide.html">Setting up IDEs</a>
</li>
</ul>
</nav>
</div>
<div class="d-none d-xl-block col-xl-2 bd-toc">
<div class="tocsection onthispage pt-5 pb-3">
<i class="fas fa-list"></i> On this page
</div>
<nav id="bd-toc-nav">
<ul class="nav section-nav flex-column">
<li class="nav-item toc-entry toc-h2">
<a href="#contributing-by-testing-releases" class="nav-link">Contributing by Testing Releases</a>
</li>
<li class="nav-item toc-entry toc-h2">
<a href="#contributing-documentation-changes" class="nav-link">Contributing Documentation Changes</a>
</li>
<li class="nav-item toc-entry toc-h2">
<a href="#preparing-to-contribute-code-changes" class="nav-link">Preparing to Contribute Code Changes</a>
</li>
<li class="nav-item toc-entry toc-h2">
<a href="#contributing-and-maintaining-type-hints" class="nav-link">Contributing and Maintaining Type Hints</a>
</li>
<li class="nav-item toc-entry toc-h2">
<a href="#code-and-docstring-guide" class="nav-link">Code and Docstring Guide</a>
</li>
</ul>
</nav>
</div>
<main class="col-12 col-md-9 col-xl-7 py-md-5 pl-md-5 pr-md-4 bd-content" role="main">
<div>
<div class="section" id="contributing-to-pyspark">
<h1>Contributing to PySpark<a class="headerlink" href="#contributing-to-pyspark" title="Permalink to this headline"></a></h1>
<p>There are many ways to contribute, for example: helping other users, testing releases, reviewing changes,
contributing documentation, reporting bugs, maintaining JIRA, and making code changes.
These are documented in <a class="reference external" href="http://spark.apache.org/contributing.html">the general guidelines</a>.
This page focuses on PySpark and covers additional details specific to PySpark.</p>
<div class="section" id="contributing-by-testing-releases">
<h2>Contributing by Testing Releases<a class="headerlink" href="#contributing-by-testing-releases" title="Permalink to this headline"></a></h2>
<p>Before the official release, PySpark release candidates are shared on the <a class="reference external" href="http://apache-spark-developers-list.1001551.n3.nabble.com/">dev&#64;spark.apache.org</a> mailing list to vote on.
These release candidates can be easily installed via pip. For example, for Spark 3.0.0 RC1, you can install it as below:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip install https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/pyspark-3.0.0.tar.gz
</pre></div>
</div>
<p>The link for release files such as <code class="docutils literal notranslate"><span class="pre">https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin</span></code> can be found in the vote thread.</p>
<p>Testing and verifying users’ existing workloads against release candidates is one of the vital contributions to PySpark,
because it prevents breaking those workloads before the official release.
When there is an issue such as a regression, correctness problem or performance degradation serious enough to sink the release candidate,
the release candidate is usually dropped and the community focuses on fixing the issue so that it can be included in the next release candidate.</p>
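<p>As an illustration, a minimal smoke test of an installed release candidate might look like the sketch below (the app name and assertion are arbitrary placeholders; substitute your own workload):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span># A minimal sketch: run a trivial job against the installed release
# candidate to confirm the installation works end to end.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rc-smoke-test").getOrCreate()
df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
assert df.count() == 10  # trivial correctness check
df.show()
spark.stop()
</pre></div>
</div>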
</div>
<div class="section" id="contributing-documentation-changes">
<h2>Contributing Documentation Changes<a class="headerlink" href="#contributing-documentation-changes" title="Permalink to this headline"></a></h2>
<p>The release documentation is located under Spark’s <a class="reference external" href="https://github.com/apache/spark/tree/master/docs">docs</a> directory.
<a class="reference external" href="https://github.com/apache/spark/blob/master/docs/README.md">README.md</a> describes the required dependencies and steps
to generate the documentation. Usually, the PySpark documentation is tested with the command below
under the <a class="reference external" href="https://github.com/apache/spark/tree/master/docs">docs</a> directory:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">SKIP_SCALADOC</span><span class="o">=</span><span class="m">1</span> <span class="nv">SKIP_RDOC</span><span class="o">=</span><span class="m">1</span> <span class="nv">SKIP_SQLDOC</span><span class="o">=</span><span class="m">1</span> jekyll serve --watch
</pre></div>
</div>
<p>PySpark uses Sphinx to generate its release documentation. Therefore, if you want to build only the PySpark documentation,
you can build it under the <a class="reference external" href="https://github.com/apache/spark/tree/master/python">python/docs</a> directory:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>make html
</pre></div>
</div>
<p>This generates the corresponding HTML files under <code class="docutils literal notranslate"><span class="pre">python/docs/build/html</span></code>.</p>
<p>Lastly, please make sure that new APIs are documented by manually adding methods and/or classes to the corresponding RST files
under <code class="docutils literal notranslate"><span class="pre">python/docs/source/reference</span></code>. Otherwise, they will not appear in the PySpark documentation.</p>
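<p>For example, a new function is typically listed in an <code class="docutils literal notranslate"><span class="pre">autosummary</span></code> directive of the relevant RST file. The sketch below assumes a hypothetical function <code class="docutils literal notranslate"><span class="pre">some_new_function</span></code> in <code class="docutils literal notranslate"><span class="pre">pyspark.sql.functions</span></code>; check the existing RST files for the exact layout:</p>
<div class="highlight-rst notranslate"><div class="highlight"><pre><span></span>.. currentmodule:: pyspark.sql.functions

.. autosummary::
    :toctree: api/

    some_new_function
</pre></div>
</div>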
</div>
<div class="section" id="preparing-to-contribute-code-changes">
<h2>Preparing to Contribute Code Changes<a class="headerlink" href="#preparing-to-contribute-code-changes" title="Permalink to this headline"></a></h2>
<p>Before starting to work on code in PySpark, it is recommended to read <a class="reference external" href="http://spark.apache.org/contributing.html">the general guidelines</a>.
Additionally, there are a couple of notes to keep in mind when contributing code to PySpark:</p>
<ul class="simple">
<li><p>Be Pythonic.</p></li>
<li><p>In general, APIs should match their Scala and Java counterparts.</p></li>
<li><p>PySpark-specific APIs can still be considered as long as they are Pythonic and do not conflict with other existing APIs; one example is the decorator usage of UDFs, as sketched after this list.</p></li>
<li><p>If you extend or modify a public API, please adjust the corresponding type hints. See <a class="reference internal" href="#contributing-and-maintaining-type-hints">Contributing and Maintaining Type Hints</a> for details.</p></li>
</ul>
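<p>As a sketch of the decorator usage of UDFs mentioned above (the function itself is a made-up example):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>from pyspark.sql.functions import udf

# Registering a UDF via the decorator form reads more Pythonically
# than wrapping an existing function by hand.
@udf("string")
def to_upper(s):
    return s.upper() if s is not None else None
</pre></div>
</div>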
</div>
<div class="section" id="contributing-and-maintaining-type-hints">
<h2>Contributing and Maintaining Type Hints<a class="headerlink" href="#contributing-and-maintaining-type-hints" title="Permalink to this headline"></a></h2>
<p>PySpark type hints are provided using stub files, placed in the same directory as the annotated module, with the exception of <code class="docutils literal notranslate"><span class="pre">#</span> <span class="pre">type:</span> <span class="pre">ignore</span></code> comments in modules which don’t have their own stubs (tests, examples and non-public API).
As a rule of thumb, only the public API is annotated.</p>
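<p>For instance, a stub entry for a hypothetical public function <code class="docutils literal notranslate"><span class="pre">do_work</span></code> might look like the following sketch, placed in a <code class="docutils literal notranslate"><span class="pre">.pyi</span></code> file next to the annotated module:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span># module.pyi, alongside module.py (hypothetical example)
from typing import Optional

def do_work(path: str, partitions: Optional[int] = ...) -&gt; None: ...
</pre></div>
</div>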
<p>Annotations should, when possible:</p>
<ul>
<li><p>Reflect expectations of the underlying JVM API, to help avoid type-related failures outside the Python interpreter.</p></li>
<li><p>In case of conflict between a too-broad (<code class="docutils literal notranslate"><span class="pre">Any</span></code>) and a too-narrow argument annotation, prefer the latter, as long as it covers most of the typical use cases.</p></li>
<li><p>Rule out nonsensical combinations of arguments using <code class="docutils literal notranslate"><span class="pre">&#64;overload</span></code> annotations. For example, to indicate that the <code class="docutils literal notranslate"><span class="pre">*Col</span></code> and <code class="docutils literal notranslate"><span class="pre">*Cols</span></code> arguments are mutually exclusive:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="nd">@overload</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span>
<span class="bp">self</span><span class="p">,</span>
<span class="o">*</span><span class="p">,</span>
<span class="n">threshold</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="o">...</span><span class="p">,</span>
<span class="n">inputCol</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="o">...</span><span class="p">,</span>
<span class="n">outputCol</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="o">...</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span> <span class="o">...</span>
<span class="nd">@overload</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span>
<span class="bp">self</span><span class="p">,</span>
<span class="o">*</span><span class="p">,</span>
<span class="n">thresholds</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">]]</span> <span class="o">=</span> <span class="o">...</span><span class="p">,</span>
<span class="n">inputCols</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="o">...</span><span class="p">,</span>
<span class="n">outputCols</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="o">...</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span> <span class="o">...</span>
</pre></div>
</div>
</li>
<li><p>Be compatible with the current stable MyPy release.</p></li>
</ul>
<p>Complex supporting type definitions should be placed in dedicated <code class="docutils literal notranslate"><span class="pre">_typing.pyi</span></code> stubs. See for example <a class="reference external" href="https://github.com/apache/spark/blob/master/python/pyspark/sql/_typing.pyi">pyspark.sql._typing.pyi</a>.</p>
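<p>Such stubs typically contain shared aliases. The sketch below shows the kind of definition one might find there, using <code class="docutils literal notranslate"><span class="pre">ColumnOrName</span></code> as an example:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span># Sketch of a shared alias in a _typing.pyi stub: many DataFrame
# methods accept either a Column object or a column name string.
from typing import Union
from pyspark.sql.column import Column

ColumnOrName = Union[Column, str]
</pre></div>
</div>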
<p>Annotations can be validated using the <code class="docutils literal notranslate"><span class="pre">dev/lint-python</span></code> script or by invoking MyPy directly:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>mypy --config python/mypy.ini python/pyspark
</pre></div>
</div>
</div>
<div class="section" id="code-and-docstring-guide">
<h2>Code and Docstring Guide<a class="headerlink" href="#code-and-docstring-guide" title="Permalink to this headline"></a></h2>
<p>Please follow the style of the existing codebase, which is virtually PEP 8 with one exception: lines can be up
to 100 characters in length, not 79.
For the docstring style, PySpark follows the <a class="reference external" href="https://numpydoc.readthedocs.io/en/latest/format.html">NumPy documentation style</a>.</p>
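<p>As a short sketch of the NumPy docstring style (the function itself is a made-up example):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>def add_one(value):
    """
    Add one to the given value.

    Parameters
    ----------
    value : int
        The value to increment.

    Returns
    -------
    int
        The incremented value.

    Examples
    --------
    &gt;&gt;&gt; add_one(1)
    2
    """
    return value + 1
</pre></div>
</div>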
<p>Note that method and variable naming in PySpark is a similar case to Python’s own <code class="docutils literal notranslate"><span class="pre">threading</span></code> library, whose APIs were
inspired by Java: PySpark follows <cite>camelCase</cite> for exposed APIs that match their Scala and Java counterparts.
One exception is <code class="docutils literal notranslate"><span class="pre">functions.py</span></code>, which uses <cite>snake_case</cite> in order to make its APIs SQL- (and Python-) friendly.</p>
<p>PySpark leverages linters such as <a class="reference external" href="https://pycodestyle.pycqa.org/en/latest/">pycodestyle</a> and <a class="reference external" href="https://flake8.pycqa.org/en/latest/">flake8</a>, which <code class="docutils literal notranslate"><span class="pre">dev/lint-python</span></code> runs. Therefore, make sure to run that script to double-check your changes.</p>
</div>
</div>
</div>
</main>
<script src="../_static/js/index.3da636dd464baa7582d2.js"></script>
<footer class="footer mt-5 mt-md-0">
<div class="container">
<p>
&copy; Copyright .<br/>
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 3.0.4.<br/>
</p>
</div>
</footer>
</body>
</html>