# Python Package Management
When you want to run your PySpark application on a cluster such as YARN, Kubernetes, or Mesos, you need to make sure that your code and all the libraries it uses are available on the executors.

As an example, suppose you want to run the [Pandas UDF examples](sql/arrow_pandas.html#series-to-scalar). Because they use PyArrow as the underlying implementation, PyArrow must be installed on each executor in the cluster; otherwise you may get errors such as `ModuleNotFoundError: No module named 'pyarrow'`.
Here is the script `app.py` from the previous example that will be executed on the cluster:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">pandas_udf</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">spark</span><span class="p">):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span>
<span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">)],</span>
<span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">,</span> <span class="s2">&quot;v&quot;</span><span class="p">))</span>
<span class="nd">@pandas_udf</span><span class="p">(</span><span class="s2">&quot;double&quot;</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">mean_udf</span><span class="p">(</span><span class="n">v</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
<span class="k">return</span> <span class="n">v</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">&quot;id&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">mean_udf</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;v&#39;</span><span class="p">]))</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&quot;__main__&quot;</span><span class="p">:</span>
<span class="n">main</span><span class="p">(</span><span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">())</span>
</pre></div>
</div>
There are multiple ways to manage Python dependencies in a cluster:

- Using PySpark Native Features
- Using Conda
- Using Virtualenv
- Using PEX
<div class="section" id="using-pyspark-native-features">
<h2>Using PySpark Native Features<a class="headerlink" href="#using-pyspark-native-features" title="Permalink to this headline">ΒΆ</a></h2>
<p>PySpark allows to upload Python files (<code class="docutils literal notranslate"><span class="pre">.py</span></code>), zipped Python packages (<code class="docutils literal notranslate"><span class="pre">.zip</span></code>), and Egg files (<code class="docutils literal notranslate"><span class="pre">.egg</span></code>)
to the executors by:</p>
<ul class="simple">
<li><p>Setting the configuration setting <code class="docutils literal notranslate"><span class="pre">spark.submit.pyFiles</span></code></p></li>
<li><p>Setting <code class="docutils literal notranslate"><span class="pre">--py-files</span></code> option in Spark scripts</p></li>
<li><p>Directly calling <a class="reference internal" href="../reference/api/pyspark.SparkContext.addPyFile.html#pyspark.SparkContext.addPyFile" title="pyspark.SparkContext.addPyFile"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyspark.SparkContext.addPyFile()</span></code></a> in applications</p></li>
</ul>
<p>This is a straightforward method to ship additional custom Python code to the cluster. You can just add individual files or zip whole
packages and upload them. Using <a class="reference internal" href="../reference/api/pyspark.SparkContext.addPyFile.html#pyspark.SparkContext.addPyFile" title="pyspark.SparkContext.addPyFile"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyspark.SparkContext.addPyFile()</span></code></a> allows to upload code even after having started your job.</p>
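As a minimal sketch, shipping code with the first two options might look as follows; `dependencies.zip` and `extra_module.py` are hypothetical placeholders for your own packaged code:

```bash
# Ship a zipped package and an individual module with the --py-files option;
# "dependencies.zip" and "extra_module.py" are hypothetical placeholders.
spark-submit --py-files dependencies.zip,extra_module.py app.py

# The same can be expressed through the spark.submit.pyFiles configuration property.
spark-submit --conf spark.submit.pyFiles=dependencies.zip,extra_module.py app.py
```

The uploaded files are placed on the `PYTHONPATH` of the executors so that the modules they contain can be imported by code running on the cluster.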
However, this mechanism does not allow adding packages built as [Wheels](https://www.python.org/dev/peps/pep-0427/), and therefore does not let you include dependencies with native code.
<div class="section" id="using-conda">
<h2>Using Conda<a class="headerlink" href="#using-conda" title="Permalink to this headline">ΒΆ</a></h2>
<p><a class="reference external" href="https://docs.conda.io/en/latest/">Conda</a> is one of the most widely-used Python package management systems. PySpark users can directly
use a Conda environment to ship their third-party Python packages by leveraging
<a class="reference external" href="https://conda.github.io/conda-pack/spark.html">conda-pack</a> which is a command line tool creating
relocatable Conda environments.</p>
<p>The example below creates a Conda environment to use on both the driver and executor and packs
it into an archive file. This archive file captures the Conda environment for Python and stores
both Python interpreter and all its relevant dependencies.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
</pre></div>
</div>
After that, you can ship the archive together with your scripts or code by using the `--archives` option or the `spark.archives` configuration (`spark.yarn.dist.archives` in YARN). The archive is automatically unpacked on the executors.

In the case of a `spark-submit` script, you can use it as follows:
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python <span class="c1"># Do not set in cluster modes.</span>
<span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
</pre></div>
</div>
Note that `PYSPARK_DRIVER_PYTHON` above should not be set for cluster modes in YARN or Kubernetes.

If you're in a regular Python shell or notebook, you can try it as shown below:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="kn">from</span> <span class="nn">app</span> <span class="kn">import</span> <span class="n">main</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;PYSPARK_PYTHON&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;./environment/bin/python&quot;</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span>
<span class="s2">&quot;spark.archives&quot;</span><span class="p">,</span> <span class="c1"># &#39;spark.yarn.dist.archives&#39; in YARN.</span>
<span class="s2">&quot;pyspark_conda_env.tar.gz#environment&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="n">main</span><span class="p">(</span><span class="n">spark</span><span class="p">)</span>
</pre></div>
</div>
For a pyspark shell:
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python
<span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python
pyspark --archives pyspark_conda_env.tar.gz#environment
</pre></div>
</div>
</div>
<div class="section" id="using-virtualenv">
<h2>Using Virtualenv<a class="headerlink" href="#using-virtualenv" title="Permalink to this headline">ΒΆ</a></h2>
<p><a class="reference external" href="https://virtualenv.pypa.io/en/latest/">Virtualenv</a> is a Python tool to create isolated Python environments.
Since Python 3.3, a subset of its features has been integrated into Python as a standard library under
the <a class="reference external" href="https://docs.python.org/3/library/venv.html">venv</a> module. PySpark users can use virtualenv to manage
Python dependencies in their clusters by using <a class="reference external" href="https://jcristharif.com/venv-pack/index.html">venv-pack</a>
in a similar way as conda-pack.</p>
<p>A virtual environment to use on both driver and executor can be created as demonstrated below.
It packs the current virtual environment to an archive file, and it contains both Python interpreter and the dependencies.
However, it requires all nodes in a cluster to have the same Python interpreter installed because
<a class="reference external" href="https://github.com/jcrist/venv-pack/issues/5">venv-pack packs Python interpreter as a symbolic link</a>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python -m venv pyspark_venv
<span class="nb">source</span> pyspark_venv/bin/activate
pip install pyarrow pandas venv-pack
venv-pack -o pyspark_venv.tar.gz
</pre></div>
</div>
You can directly pass and unpack the archive file, and enable the environment on the executors, by leveraging the `--archives` option or the `spark.archives` configuration (`spark.yarn.dist.archives` in YARN).

For `spark-submit`, you can use it by running the command as follows. Again, note that `PYSPARK_DRIVER_PYTHON` has to be unset in Kubernetes or YARN cluster modes.
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python <span class="c1"># Do not set in cluster modes.</span>
<span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
</pre></div>
</div>
For regular Python shells or notebooks:
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>import os
from pyspark.sql import SparkSession
from app import main
os.environ<span class="o">[</span><span class="s1">&#39;PYSPARK_PYTHON&#39;</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;./environment/bin/python&quot;</span>
<span class="nv">spark</span> <span class="o">=</span> SparkSession.builder.config<span class="o">(</span>
<span class="s2">&quot;spark.archives&quot;</span>, <span class="c1"># &#39;spark.yarn.dist.archives&#39; in YARN.</span>
<span class="s2">&quot;pyspark_venv.tar.gz#environment&quot;</span><span class="o">)</span>.getOrCreate<span class="o">()</span>
main<span class="o">(</span>spark<span class="o">)</span>
</pre></div>
</div>
In the case of a pyspark shell:
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python
<span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python
pyspark --archives pyspark_venv.tar.gz#environment
</pre></div>
</div>
</div>
<div class="section" id="using-pex">
<h2>Using PEX<a class="headerlink" href="#using-pex" title="Permalink to this headline">ΒΆ</a></h2>
<p>PySpark can also use <a class="reference external" href="https://github.com/pantsbuild/pex">PEX</a> to ship the Python packages
together. PEX is a tool that creates a self-contained Python environment. This is similar
to Conda or virtualenv, but a <code class="docutils literal notranslate"><span class="pre">.pex</span></code> file is executable by itself.</p>
<p>The following example creates a <code class="docutils literal notranslate"><span class="pre">.pex</span></code> file for the driver and executor to use.
The file contains the Python dependencies specified with the <code class="docutils literal notranslate"><span class="pre">pex</span></code> command.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip install pyarrow pandas pex
pex pyspark pyarrow pandas -o pyspark_pex_env.pex
</pre></div>
</div>
This file behaves similarly to a regular Python interpreter.
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>./pyspark_pex_env.pex -c <span class="s2">&quot;import pandas; print(pandas.__version__)&quot;</span>
<span class="m">1</span>.1.5
</pre></div>
</div>
However, a `.pex` file does not include the Python interpreter itself, so all nodes in the cluster should have the same Python interpreter installed.

To transfer and use the `.pex` file in a cluster, you should ship it via the `spark.files` configuration (`spark.yarn.dist.files` in YARN) or the `--files` option, because it is a regular file rather than a directory or archive file.

For application submission, run the commands as shown below. Note that `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN or Kubernetes.
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python <span class="c1"># Do not set in cluster modes.</span>
<span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./pyspark_pex_env.pex
spark-submit --files pyspark_pex_env.pex app.py
</pre></div>
</div>
For regular Python shells or notebooks:
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="kn">from</span> <span class="nn">app</span> <span class="kn">import</span> <span class="n">main</span>
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;PYSPARK_PYTHON&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s2">&quot;./pyspark_pex_env.pex&quot;</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span>
<span class="s2">&quot;spark.files&quot;</span><span class="p">,</span> <span class="c1"># &#39;spark.yarn.dist.files&#39; in YARN.</span>
<span class="s2">&quot;pyspark_pex_env.pex&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="n">main</span><span class="p">(</span><span class="n">spark</span><span class="p">)</span>
</pre></div>
</div>
For the interactive pyspark shell, the commands are almost the same:
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python
<span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./pyspark_pex_env.pex
pyspark --files pyspark_pex_env.pex
</pre></div>
</div>
An end-to-end Docker example for deploying a standalone PySpark with `SparkSession.builder` and PEX can be found [here](https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md); it uses cluster-pack, a library on top of PEX that automates the intermediate step of creating and uploading the PEX manually.