| |
<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8" />
<title>Python Package Management — PySpark 3.2.0 documentation</title>
<link rel="canonical" href="https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html" />
</head>
<body>
| |
| |
| <nav class="navbar navbar-light navbar-expand-lg bg-light fixed-top bd-navbar" id="navbar-main"><div class="container-xl"> |
| |
| <div id="navbar-start"> |
| |
| |
| |
| <a class="navbar-brand" href="../index.html"> |
| <img src="../_static/spark-logo-reverse.png" class="logo" alt="logo"> |
| </a> |
| |
| |
| |
| </div> |
| |
| <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbar-collapsible" aria-controls="navbar-collapsible" aria-expanded="false" aria-label="Toggle navigation"> |
| <span class="navbar-toggler-icon"></span> |
| </button> |
| |
| |
| <div id="navbar-collapsible" class="col-lg-9 collapse navbar-collapse"> |
| <div id="navbar-center" class="mr-auto"> |
| |
| <div class="navbar-center-item"> |
| <ul id="navbar-main-elements" class="navbar-nav"> |
| <li class="toctree-l1 nav-item"> |
| <a class="reference internal nav-link" href="../getting_started/index.html"> |
| Getting Started |
| </a> |
| </li> |
| |
| <li class="toctree-l1 current active nav-item"> |
| <a class="reference internal nav-link" href="index.html"> |
| User Guide |
| </a> |
| </li> |
| |
| <li class="toctree-l1 nav-item"> |
| <a class="reference internal nav-link" href="../reference/index.html"> |
| API Reference |
| </a> |
| </li> |
| |
| <li class="toctree-l1 nav-item"> |
| <a class="reference internal nav-link" href="../development/index.html"> |
| Development |
| </a> |
| </li> |
| |
| <li class="toctree-l1 nav-item"> |
| <a class="reference internal nav-link" href="../migration_guide/index.html"> |
| Migration Guide |
| </a> |
| </li> |
| |
| |
| </ul> |
| </div> |
| |
| </div> |
| |
| <div id="navbar-end"> |
| |
| <div class="navbar-end-item"> |
| <ul id="navbar-icon-links" class="navbar-nav" aria-label="Icon Links"> |
| </ul> |
| </div> |
| |
| </div> |
| </div> |
| </div> |
| </nav> |
| |
| |
| <div class="container-xl"> |
| <div class="row"> |
| |
| |
| <!-- Only show if we have sidebars configured, else just a small margin --> |
| <div class="col-12 col-md-3 bd-sidebar"><form class="bd-search d-flex align-items-center" action="../search.html" method="get"> |
| <i class="icon fas fa-search"></i> |
| <input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" > |
| </form><nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation"> |
| <div class="bd-toc-item active"> |
| <ul class="current nav bd-sidenav"> |
| <li class="toctree-l1 current active"> |
| <a class="current reference internal" href="#"> |
| Python Package Management |
| </a> |
| </li> |
| <li class="toctree-l1 has-children"> |
| <a class="reference internal" href="sql/index.html"> |
| Spark SQL |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-1" name="toctree-checkbox-1" type="checkbox"/> |
| <label for="toctree-checkbox-1"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="sql/arrow_pandas.html"> |
| Apache Arrow in PySpark |
| </a> |
| </li> |
| </ul> |
| </li> |
| <li class="toctree-l1 has-children"> |
| <a class="reference internal" href="pandas_on_spark/index.html"> |
| Pandas API on Spark |
| </a> |
| <input class="toctree-checkbox" id="toctree-checkbox-2" name="toctree-checkbox-2" type="checkbox"/> |
| <label for="toctree-checkbox-2"> |
| <i class="fas fa-chevron-down"> |
| </i> |
| </label> |
| <ul> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/options.html"> |
| Options and settings |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/pandas_pyspark.html"> |
| From/to pandas and PySpark DataFrames |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/transform_apply.html"> |
| Transform and apply a function |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/types.html"> |
| Type Support in Pandas API on Spark |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/typehints.html"> |
| Type Hints in Pandas API on Spark |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/from_to_dbms.html"> |
| From/to other DBMSes |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/best_practices.html"> |
| Best Practices |
| </a> |
| </li> |
| <li class="toctree-l2"> |
| <a class="reference internal" href="pandas_on_spark/faq.html"> |
| FAQ |
| </a> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| </div> |
| </nav> |
| </div> |
| |
| |
| |
| |
| <div class="d-none d-xl-block col-xl-2 bd-toc"> |
| |
| |
| <div class="toc-item"> |
| |
| <div class="tocsection onthispage pt-5 pb-3"> |
| <i class="fas fa-list"></i> On this page |
| </div> |
| |
| <nav id="bd-toc-nav"> |
| <ul class="visible nav section-nav flex-column"> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#using-pyspark-native-features"> |
| Using PySpark Native Features |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#using-conda"> |
| Using Conda |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#using-virtualenv"> |
| Using Virtualenv |
| </a> |
| </li> |
| <li class="toc-h2 nav-item toc-entry"> |
| <a class="reference internal nav-link" href="#using-pex"> |
| Using PEX |
| </a> |
| </li> |
| </ul> |
| |
| </nav> |
| </div> |
| |
| <div class="toc-item"> |
| |
| </div> |
| |
| |
| </div> |
| |
| |
| |
| |
| |
| |
<main role="main">
| |
| <div> |
| |
| <div class="section" id="python-package-management"> |
<h1>Python Package Management<a class="headerlink" href="#python-package-management" title="Permalink to this headline">¶</a></h1>
<p>When you run your PySpark application on a cluster such as YARN, Kubernetes, or Mesos, you need to make
sure that your code and all the libraries it uses are available on the executors.</p>
<p>As an example, suppose you want to run the <a class="reference internal" href="sql/arrow_pandas.html#series-to-scalar"><span class="std std-ref">Pandas UDF examples</span></a>.
Because they use pyarrow as the underlying implementation, pyarrow must be installed on each executor
in the cluster; otherwise you may get errors such as <code class="docutils literal notranslate"><span class="pre">ModuleNotFoundError:</span> <span class="pre">No</span> <span class="pre">module</span> <span class="pre">named</span> <span class="pre">'pyarrow'</span></code>.</p>
| <p>Here is the script <code class="docutils literal notranslate"><span class="pre">app.py</span></code> from the previous example that will be executed on the cluster:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">pandas_udf</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span> |
| |
| <span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">spark</span><span class="p">):</span> |
| <span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span> |
| <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">)],</span> |
| <span class="p">(</span><span class="s2">"id"</span><span class="p">,</span> <span class="s2">"v"</span><span class="p">))</span> |
| |
| <span class="nd">@pandas_udf</span><span class="p">(</span><span class="s2">"double"</span><span class="p">)</span> |
| <span class="k">def</span> <span class="nf">mean_udf</span><span class="p">(</span><span class="n">v</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span> |
| <span class="k">return</span> <span class="n">v</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> |
| |
| <span class="nb">print</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s2">"id"</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">mean_udf</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'v'</span><span class="p">]))</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> |
| |
| |
| <span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span> |
| <span class="n">main</span><span class="p">(</span><span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">())</span> |
| </pre></div> |
| </div> |
| <p>There are multiple ways to manage Python dependencies in the cluster:</p> |
| <ul class="simple"> |
| <li><p>Using PySpark Native Features</p></li> |
| <li><p>Using Conda</p></li> |
| <li><p>Using Virtualenv</p></li> |
| <li><p>Using PEX</p></li> |
| </ul> |
| <div class="section" id="using-pyspark-native-features"> |
<h2>Using PySpark Native Features<a class="headerlink" href="#using-pyspark-native-features" title="Permalink to this headline">¶</a></h2>
<p>PySpark allows you to upload Python files (<code class="docutils literal notranslate"><span class="pre">.py</span></code>), zipped Python packages (<code class="docutils literal notranslate"><span class="pre">.zip</span></code>), and Egg files (<code class="docutils literal notranslate"><span class="pre">.egg</span></code>)
to the executors in one of three ways:</p>
| <ul class="simple"> |
<li><p>Setting the <code class="docutils literal notranslate"><span class="pre">spark.submit.pyFiles</span></code> configuration</p></li>
<li><p>Passing the <code class="docutils literal notranslate"><span class="pre">--py-files</span></code> option to Spark scripts</p></li>
| <li><p>Directly calling <a class="reference internal" href="../reference/api/pyspark.SparkContext.addPyFile.html#pyspark.SparkContext.addPyFile" title="pyspark.SparkContext.addPyFile"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyspark.SparkContext.addPyFile()</span></code></a> in applications</p></li> |
| </ul> |
<p>This is a straightforward way to ship additional custom Python code to the cluster. You can add individual files or zip whole
packages and upload them. Using <a class="reference internal" href="../reference/api/pyspark.SparkContext.addPyFile.html#pyspark.SparkContext.addPyFile" title="pyspark.SparkContext.addPyFile"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pyspark.SparkContext.addPyFile()</span></code></a> even lets you upload code after your job has started, as sketched below.</p>
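<p>As a minimal sketch of the two programmatic routes (the archive and file names below are hypothetical placeholders, not part of the original example):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>from pyspark.sql import SparkSession

# Ship a zipped package together with the application via configuration.
# "my_package.zip" is a placeholder archive of your own code.
spark = SparkSession.builder.config(
    "spark.submit.pyFiles", "my_package.zip").getOrCreate()

# Code can also be uploaded after the job has started;
# "extra_module.py" is likewise a placeholder.
spark.sparkContext.addPyFile("extra_module.py")
</pre></div>
</div>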
<p>However, this mechanism does not support packages built as <a class="reference external" href="https://www.python.org/dev/peps/pep-0427/">Wheels</a>, and therefore
cannot ship dependencies that contain native code.</p>
| </div> |
| <div class="section" id="using-conda"> |
<h2>Using Conda<a class="headerlink" href="#using-conda" title="Permalink to this headline">¶</a></h2>
| <p><a class="reference external" href="https://docs.conda.io/en/latest/">Conda</a> is one of the most widely-used Python package management systems. PySpark users can directly |
| use a Conda environment to ship their third-party Python packages by leveraging |
| <a class="reference external" href="https://conda.github.io/conda-pack/spark.html">conda-pack</a> which is a command line tool creating |
| relocatable Conda environments.</p> |
<p>The example below creates a Conda environment to use on both the driver and executors and packs
it into an archive file. The archive captures the Conda environment and stores
both the Python interpreter and all of its relevant dependencies.</p>
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack |
| conda activate pyspark_conda_env |
| conda pack -f -o pyspark_conda_env.tar.gz |
| </pre></div> |
| </div> |
<p>After that, you can ship the archive together with your scripts or code by using the <code class="docutils literal notranslate"><span class="pre">--archives</span></code> option
or the <code class="docutils literal notranslate"><span class="pre">spark.archives</span></code> configuration (<code class="docutils literal notranslate"><span class="pre">spark.yarn.dist.archives</span></code> in YARN). Spark automatically unpacks the archive on the executors.</p>
| <p>In the case of a <code class="docutils literal notranslate"><span class="pre">spark-submit</span></code> script, you can use it as follows:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python <span class="c1"># Do not set in cluster modes.</span> |
| <span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python |
| spark-submit --archives pyspark_conda_env.tar.gz#environment app.py |
| </pre></div> |
| </div> |
| <p>Note that <code class="docutils literal notranslate"><span class="pre">PYSPARK_DRIVER_PYTHON</span></code> above should not be set for cluster modes in YARN or Kubernetes.</p> |
<p>If you're in a regular Python shell or notebook, you can try it as shown below:</p>
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span> |
| <span class="kn">from</span> <span class="nn">app</span> <span class="kn">import</span> <span class="n">main</span> |
| |
| <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'PYSPARK_PYTHON'</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"./environment/bin/python"</span> |
| <span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span> |
| <span class="s2">"spark.archives"</span><span class="p">,</span> <span class="c1"># 'spark.yarn.dist.archives' in YARN.</span> |
| <span class="s2">"pyspark_conda_env.tar.gz#environment"</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span> |
| <span class="n">main</span><span class="p">(</span><span class="n">spark</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>For a pyspark shell:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python |
| <span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python |
| pyspark --archives pyspark_conda_env.tar.gz#environment |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="using-virtualenv"> |
<h2>Using Virtualenv<a class="headerlink" href="#using-virtualenv" title="Permalink to this headline">¶</a></h2>
| <p><a class="reference external" href="https://virtualenv.pypa.io/en/latest/">Virtualenv</a> is a Python tool to create isolated Python environments. |
| Since Python 3.3, a subset of its features has been integrated into Python as a standard library under |
| the <a class="reference external" href="https://docs.python.org/3/library/venv.html">venv</a> module. PySpark users can use virtualenv to manage |
| Python dependencies in their clusters by using <a class="reference external" href="https://jcristharif.com/venv-pack/index.html">venv-pack</a> |
| in a similar way as conda-pack.</p> |
<p>A virtual environment to use on both the driver and executors can be created as demonstrated below.
venv-pack packs the current virtual environment into an archive file that contains both the Python interpreter and the dependencies.
However, all nodes in the cluster must have the same Python interpreter installed, because
<a class="reference external" href="https://github.com/jcrist/venv-pack/issues/5">venv-pack packs the Python interpreter as a symbolic link</a>.</p>
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python -m venv pyspark_venv |
| <span class="nb">source</span> pyspark_venv/bin/activate |
| pip install pyarrow pandas venv-pack |
| venv-pack -o pyspark_venv.tar.gz |
| </pre></div> |
| </div> |
<p>You can pass and unpack the archive file and enable the environment on the executors by using
the <code class="docutils literal notranslate"><span class="pre">--archives</span></code> option or the <code class="docutils literal notranslate"><span class="pre">spark.archives</span></code> configuration (<code class="docutils literal notranslate"><span class="pre">spark.yarn.dist.archives</span></code> in YARN).</p>
<p>For <code class="docutils literal notranslate"><span class="pre">spark-submit</span></code>, you can run the command as follows. Note that
<code class="docutils literal notranslate"><span class="pre">PYSPARK_DRIVER_PYTHON</span></code> must be unset in Kubernetes or YARN cluster modes.</p>
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python <span class="c1"># Do not set in cluster modes.</span> |
| <span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python |
| spark-submit --archives pyspark_venv.tar.gz#environment app.py |
| </pre></div> |
| </div> |
| <p>For regular Python shells or notebooks:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>import os |
| from pyspark.sql import SparkSession |
| from app import main |
| |
| os.environ<span class="o">[</span><span class="s1">'PYSPARK_PYTHON'</span><span class="o">]</span> <span class="o">=</span> <span class="s2">"./environment/bin/python"</span> |
| <span class="nv">spark</span> <span class="o">=</span> SparkSession.builder.config<span class="o">(</span> |
| <span class="s2">"spark.archives"</span>, <span class="c1"># 'spark.yarn.dist.archives' in YARN.</span> |
| <span class="s2">"pyspark_venv.tar.gz#environment"</span><span class="o">)</span>.getOrCreate<span class="o">()</span> |
| main<span class="o">(</span>spark<span class="o">)</span> |
| </pre></div> |
| </div> |
| <p>In the case of a pyspark shell:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python |
| <span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./environment/bin/python |
| pyspark --archives pyspark_venv.tar.gz#environment |
| </pre></div> |
| </div> |
| </div> |
| <div class="section" id="using-pex"> |
<h2>Using PEX<a class="headerlink" href="#using-pex" title="Permalink to this headline">¶</a></h2>
<p>PySpark can also use <a class="reference external" href="https://github.com/pantsbuild/pex">PEX</a> to ship Python packages
together. PEX is a tool that creates a self-contained Python environment. It is similar
to Conda or virtualenv, but a <code class="docutils literal notranslate"><span class="pre">.pex</span></code> file is executable by itself.</p>
| <p>The following example creates a <code class="docutils literal notranslate"><span class="pre">.pex</span></code> file for the driver and executor to use. |
| The file contains the Python dependencies specified with the <code class="docutils literal notranslate"><span class="pre">pex</span></code> command.</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip install pyarrow pandas pex |
| pex pyspark pyarrow pandas -o pyspark_pex_env.pex |
| </pre></div> |
| </div> |
<p>This file behaves similarly to a regular Python interpreter.</p>
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>./pyspark_pex_env.pex -c <span class="s2">"import pandas; print(pandas.__version__)"</span> |
| <span class="m">1</span>.1.5 |
| </pre></div> |
| </div> |
<p>However, a <code class="docutils literal notranslate"><span class="pre">.pex</span></code> file does not include a Python interpreter itself, so all
nodes in the cluster should have the same Python interpreter installed.</p>
<p>To transfer and use a <code class="docutils literal notranslate"><span class="pre">.pex</span></code> file in a cluster, you should ship it via the
<code class="docutils literal notranslate"><span class="pre">spark.files</span></code> configuration (<code class="docutils literal notranslate"><span class="pre">spark.yarn.dist.files</span></code> in YARN) or the <code class="docutils literal notranslate"><span class="pre">--files</span></code> option, because it is a regular file rather
than a directory or an archive.</p>
<p>For application submission, run the commands as shown below.
Note that <code class="docutils literal notranslate"><span class="pre">PYSPARK_DRIVER_PYTHON</span></code> should not be set for cluster modes in YARN or Kubernetes.</p>
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python <span class="c1"># Do not set in cluster modes.</span> |
| <span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./pyspark_pex_env.pex |
| spark-submit --files pyspark_pex_env.pex app.py |
| </pre></div> |
| </div> |
| <p>For regular Python shells or notebooks:</p> |
| <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span> |
| <span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span> |
| <span class="kn">from</span> <span class="nn">app</span> <span class="kn">import</span> <span class="n">main</span> |
| |
| <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'PYSPARK_PYTHON'</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"./pyspark_pex_env.pex"</span> |
| <span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">config</span><span class="p">(</span> |
| <span class="s2">"spark.files"</span><span class="p">,</span> <span class="c1"># 'spark.yarn.dist.files' in YARN.</span> |
| <span class="s2">"pyspark_pex_env.pex"</span><span class="p">)</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span> |
| <span class="n">main</span><span class="p">(</span><span class="n">spark</span><span class="p">)</span> |
| </pre></div> |
| </div> |
| <p>For the interactive pyspark shell, the commands are almost the same:</p> |
| <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">PYSPARK_DRIVER_PYTHON</span><span class="o">=</span>python |
| <span class="nb">export</span> <span class="nv">PYSPARK_PYTHON</span><span class="o">=</span>./pyspark_pex_env.pex |
| pyspark --files pyspark_pex_env.pex |
| </pre></div> |
| </div> |
<p>An end-to-end Docker example for deploying standalone PySpark with <code class="docutils literal notranslate"><span class="pre">SparkSession.builder</span></code> and PEX
can be found <a class="reference external" href="https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md">here</a>;
it uses cluster-pack, a library on top of PEX that automates the intermediate step of
creating and uploading the PEX manually.</p>
| </div> |
| </div> |
| |
| |
| </div> |
| |
| |
| |
| </main> |
| |
| |
| </body> |
| </html> |