.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
============
Installation
============
PySpark is included in the official releases of Spark available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
For Python users, PySpark also provides ``pip`` installation from PyPI. This is usually for local usage or for use as
a client to connect to a cluster, rather than for setting up a cluster itself.
This page includes instructions for installing PySpark by using pip, Conda, downloading manually,
and building from source.
Python Versions Supported
-------------------------
Python 3.10 and above.
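You can check which interpreter is active before installing, for example:

.. code-block:: bash

    # Should report Python 3.10 or above.
    python --version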
Using PyPI
----------
PySpark installation using `PyPI (pyspark) <https://pypi.org/project/pyspark/>`_ is as follows:
.. code-block:: bash
pip install pyspark
If you want to install extra dependencies for a specific component, you can install it as below:
.. code-block:: bash
# Spark SQL
pip install pyspark[sql]
# pandas API on Spark
pip install pyspark[pandas_on_spark] plotly # to plot your data, you can install plotly together.
# Spark Connect
pip install pyspark[connect]
See :ref:`optional-dependencies` for more detail about extra dependencies.
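As an optional smoke test, you can start a small local session right after installation; this sketch assumes a supported Java runtime is already available (see the Dependencies section below):

.. code-block:: bash

    # Optional check: start a local Spark session and run a trivial query.
    python -c "
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master('local[1]').getOrCreate()
    spark.range(5).show()
    spark.stop()
    "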
To install PySpark with or without a specific Hadoop version, set the ``PYSPARK_HADOOP_VERSION`` environment variable as below:
.. code-block:: bash
PYSPARK_HADOOP_VERSION=3 pip install pyspark
The default distribution uses Hadoop 3.3 and Hive 2.3. If you specify a different Hadoop version, the pip installation automatically
downloads that version and uses it in PySpark. The download can take a while depending on
the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manually choose the mirror for faster downloading.
.. code-block:: bash
PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=3 pip install pyspark
It is recommended to use the ``-v`` option with ``pip`` to track the installation and download status.
.. code-block:: bash
PYSPARK_HADOOP_VERSION=3 pip install pyspark -v
Supported values in ``PYSPARK_HADOOP_VERSION`` are:
- ``without``: Spark pre-built with user-provided Apache Hadoop
- ``3``: Spark pre-built for Apache Hadoop 3.3 and later (default)
Note that this installation of PySpark with/without a specific Hadoop version is experimental. It can change or be removed between minor releases.
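For example, the distribution pre-built with user-provided Hadoop can be installed as below; in that case you are responsible for providing the Hadoop classes at runtime:

.. code-block:: bash

    # Spark pre-built with user-provided Apache Hadoop; Hadoop classes
    # must be supplied separately on the classpath at runtime.
    PYSPARK_HADOOP_VERSION=without pip install pyspark -v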
Making Spark Connect the default
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to make Spark Connect the default, you can install an additional library via `PyPI (pyspark-connect) <https://pypi.org/project/pyspark-connect/>`_. Execute the following command:
.. code-block:: bash
pip install pyspark-connect
This will automatically install the ``pyspark`` library, as well as the dependencies that are necessary for Spark Connect.
If you want to customize your ``pyspark`` installation, install ``pyspark`` first by following the instructions above.
This package supports both:
- ``spark.master`` (``--master``) with a locally running Spark Connect server
- ``spark.remote`` (``--remote``), including local clusters (e.g., ``local[*]``) as well as connection URIs such as ``sc://localhost``.
See also `Quickstart: Spark Connect <quickstart_connect.html>`_ for how to use it.
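As a minimal sketch, assuming a local Java installation, you can start an interactive shell against a locally launched Spark Connect server, or point it at an already running server (``sc://localhost`` below is only an example address):

.. code-block:: bash

    # Launch a local Spark Connect server and connect the shell to it.
    pyspark --remote "local[*]"

    # Connect to an already running Spark Connect server (example URI).
    pyspark --remote "sc://localhost"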
Python Spark Connect Client
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Python Spark Connect client is a pure Python library that does not rely on any non-Python dependencies, such as JARs or a JRE, in your environment.
To install the Python Spark Connect client via `PyPI (pyspark-client) <https://pypi.org/project/pyspark-client/>`_, execute the following command:
.. code-block:: bash
pip install pyspark-client
This package only supports ``spark.remote`` with connection URIs, e.g., ``sc://localhost``. See also `Quickstart: Spark Connect <quickstart_connect.html>`_ for how to use it.
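As a minimal sketch, assuming a Spark Connect server is already reachable at ``sc://localhost`` (an example URI), a session can be created entirely from Python:

.. code-block:: bash

    # pyspark-client ships no JVM pieces; the server must already be running.
    python -c "
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.remote('sc://localhost').getOrCreate()
    spark.range(3).show()
    spark.stop()
    "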
Using Conda
-----------
Conda is an open-source package management and environment management system (developed by
`Anaconda <https://www.anaconda.com/>`_), which is best installed through
`Miniconda <https://docs.conda.io/en/latest/miniconda.html>`_ or `Miniforge <https://github.com/conda-forge/miniforge/>`_.
The tool is both cross-platform and language agnostic, and in practice, conda can replace both
`pip <https://pip.pypa.io/en/latest/>`_ and `virtualenv <https://virtualenv.pypa.io/en/latest/>`_.
Conda uses so-called channels to distribute packages, and together with the default channels by
Anaconda itself, the most important channel is `conda-forge <https://conda-forge.org/>`_, which
is the community-driven packaging effort that is the most extensive and the most current (and also
serves as the upstream for the Anaconda channels in most cases).
To create a new conda environment from your terminal and activate it, proceed as shown below:
.. code-block:: bash
conda create -n pyspark_env
conda activate pyspark_env
After activating the environment, use the following command to install PySpark,
a Python version of your choice, and any other packages you want to use in
the same session as PySpark (you can also install these in several steps).
.. code-block:: bash
conda install -c conda-forge pyspark # can also add "python=3.10 some_package [etc.]" here
Note that `PySpark for conda <https://anaconda.org/conda-forge/pyspark>`_ is maintained
separately by the community; while new versions generally get packaged quickly, the
availability through conda(-forge) is not directly in sync with the PySpark release cycle.
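If you need a particular PySpark version, you can pin it in the install command; the version below is only an example, so check the conda-forge page above for the versions actually published:

.. code-block:: bash

    # Example only: pin a specific PySpark version from conda-forge.
    conda install -c conda-forge pyspark=4.0.0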
While using pip in a conda environment is technically feasible (with the same command as
`above <#using-pypi>`_), this approach is `discouraged <https://www.anaconda.com/blog/using-pip-in-a-conda-environment/>`_,
because pip does not interoperate with conda.
For a short summary about useful conda commands, see their
`cheat sheet <https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html>`_.
Manually Downloading
--------------------
PySpark is included in the distributions available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
You can download the distribution you want from the site. After that, uncompress the tar file into the directory where you want
to install Spark, for example, as below:
.. code-block:: bash
tar xzvf spark-\ |release|\-bin-hadoop3.tgz
Ensure the ``SPARK_HOME`` environment variable points to the directory where the tar file has been extracted.
Update the ``PYTHONPATH`` environment variable so that it can find PySpark and Py4J under ``SPARK_HOME/python/lib``.
One example of doing this is shown below:
.. code-block:: bash
cd spark-\ |release|\-bin-hadoop3
export SPARK_HOME=`pwd`
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
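To confirm that both variables are set correctly, you can check the Spark and PySpark versions from the same shell:

.. code-block:: bash

    # Spark distribution found via SPARK_HOME.
    "$SPARK_HOME"/bin/spark-submit --version

    # PySpark and Py4J found via PYTHONPATH.
    python -c "import pyspark; print(pyspark.__version__)"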
Installing from Source
----------------------
To install PySpark from source, refer to |building_spark|_.
Dependencies
------------
Required dependencies
~~~~~~~~~~~~~~~~~~~~~
PySpark requires the following dependencies.
========================== ========================= =============================
Package Supported version Note
========================== ========================= =============================
`py4j` >=0.10.9.9 Required to interact with JVM
========================== ========================= =============================
Additional libraries that enhance functionality but are not included in the installation packages:
- **memory-profiler**: Used for PySpark UDF memory profiling, ``spark.profile.show(...)`` and ``spark.sql.pyspark.udf.profiler``.
- **plotly**: Used for PySpark plotting, ``DataFrame.plot``.
Note that PySpark requires Java 17 or later with ``JAVA_HOME`` properly set; refer to |downloading|_ for details.
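A quick way to check the Java setup is shown below; the JDK path is only an example and depends on your system:

.. code-block:: bash

    # Verify that a Java 17 or later runtime is available.
    java -version

    # Example only: point JAVA_HOME at your actual JDK installation.
    export JAVA_HOME=/path/to/jdk-17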
.. _optional-dependencies:
Optional dependencies
~~~~~~~~~~~~~~~~~~~~~
PySpark has several optional dependencies that enhance its functionality for specific modules.
These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark.
If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ``ImportError``
when you try to use features that require these dependencies.
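Extras can also be combined in a single ``pip`` command; the quotes keep the shell from interpreting the brackets:

.. code-block:: bash

    # Install several groups of optional dependencies at once.
    pip install "pyspark[sql,pandas_on_spark,connect]"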
Spark Connect
^^^^^^^^^^^^^
Installable with ``pip install "pyspark[connect]"``.
========================== ================= ==========================
Package Supported version Note
========================== ================= ==========================
`pandas` >=2.2.0 Required for Spark Connect
`pyarrow` >=15.0.0 Required for Spark Connect
`grpcio` >=1.76.0 Required for Spark Connect
`grpcio-status` >=1.76.0 Required for Spark Connect
`googleapis-common-protos` >=1.71.0 Required for Spark Connect
`zstandard` >=0.25.0 Required for Spark Connect
`graphviz` >=0.20 Optional for Spark Connect
========================== ================= ==========================
Spark SQL
^^^^^^^^^
Installable with ``pip install "pyspark[sql]"``.
========= ================= ======================
Package Supported version Note
========= ================= ======================
`pandas` >=2.2.0 Required for Spark SQL
`pyarrow` >=15.0.0 Required for Spark SQL
========= ================= ======================
Additional libraries that enhance functionality but are not included in the installation packages:
- **flameprof**: Provides the default renderer for UDF performance profiling.
Pandas API on Spark
^^^^^^^^^^^^^^^^^^^
Installable with ``pip install "pyspark[pandas_on_spark]"``.
========= ================= ================================
Package Supported version Note
========= ================= ================================
`pandas` >=2.2.0 Required for Pandas API on Spark
`pyarrow` >=15.0.0 Required for Pandas API on Spark
========= ================= ================================
Additional libraries that enhance functionality but are not included in the installation packages:
- **mlflow**: Required for ``pyspark.pandas.mlflow``.
- **plotly**: Provides plotting for visualization. It is recommended to use **plotly** over **matplotlib**.
- **matplotlib**: Provides plotting for visualization. The default plotting backend is **plotly**.
MLlib DataFrame-based API
^^^^^^^^^^^^^^^^^^^^^^^^^
Installable with ``pip install "pyspark[ml]"``.
======= ================= ======================================
Package Supported version Note
======= ================= ======================================
`numpy` >=1.22 Required for MLlib DataFrame-based API
======= ================= ======================================
Additional libraries that enhance functionality but are not included in the installation packages:
- **scipy**: Required for SciPy integration.
- **scikit-learn**: Required for implementing machine learning algorithms.
- **torch**: Required for machine learning model training.
- **torchvision**: Required for supporting image and video processing.
- **torcheval**: Required for facilitating model evaluation metrics.
- **deepspeed**: Required for providing high-performance model training optimizations. Installable on non-Darwin systems.
MLlib
^^^^^
Installable with ``pip install "pyspark[mllib]"``.
======= ================= ==================
Package Supported version Note
======= ================= ==================
`numpy` >=1.22 Required for MLlib
======= ================= ==================
Declarative Pipelines
^^^^^^^^^^^^^^^^^^^^^
Installable with ``pip install "pyspark[pipelines]"``. Includes all dependencies for both Spark SQL and Spark Connect, because Declarative Pipelines is built on top of both.
========================== ================= ===================================================
Package Supported version Note
========================== ================= ===================================================
`pandas` >=2.2.0 Required for Spark Connect and Spark SQL
`pyarrow` >=15.0.0 Required for Spark Connect and Spark SQL
`grpcio` >=1.76.0 Required for Spark Connect
`grpcio-status` >=1.76.0 Required for Spark Connect
`googleapis-common-protos` >=1.71.0 Required for Spark Connect
`zstandard` >=0.25.0 Required for Spark Connect
`pyyaml` >=3.11 Required for spark-pipelines command line interface
`graphviz` >=0.20 Optional for Spark Connect
========================== ================= ===================================================