 ..  Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at

 ..    http://www.apache.org/licenses/LICENSE-2.0

 ..  Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.

 .. PySpark documentation master file

 =================
 PySpark Overview
 =================

 **Date**: |today| **Version**: |release|

 **Useful links**:
 |binder|_ | `GitHub <https://github.com/apache/spark>`_ | `Issues <https://issues.apache.org/jira/projects/SPARK/issues>`_ | |examples|_ | `Community <https://spark.apache.org/community.html>`_

 PySpark is the Python API for Apache Spark. It enables you to perform real-time,
 large-scale data processing in a distributed environment using Python. It also provides a PySpark
 shell for interactively analyzing your data.

 PySpark combines Python's learnability and ease of use with the power of Apache Spark
 to enable processing and analysis of data of any size for everyone familiar with Python.

 PySpark supports all of Spark's features, such as Spark SQL,
 DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core.

 .. list-table::
    :widths: 10 20 20 20 20 10
    :header-rows: 0
    :class: borderless spec_table

    * -
      - .. image:: ../../../docs/img/pyspark-spark_sql_and_dataframes.png
           :target: reference/pyspark.sql/index.html
           :width: 100%
           :alt: Spark SQL
      - .. image:: ../../../docs/img/pyspark-pandas_api_on_spark.png
           :target: reference/pyspark.pandas/index.html
           :width: 100%
           :alt: Pandas API on Spark
      - .. image:: ../../../docs/img/pyspark-structured_streaming.png
           :target: reference/pyspark.ss/index.html
           :width: 100%
           :alt: Streaming
      - .. image:: ../../../docs/img/pyspark-machine_learning.png
           :target: reference/pyspark.ml.html
           :width: 100%
           :alt: Machine Learning
      -

 .. list-table::
    :widths: 10 80 10
    :header-rows: 0
    :class: borderless spec_table

    * -
      - .. image:: ../../../docs/img/pyspark-spark_core_and_rdds.png
           :target: reference/pyspark.html
           :width: 100%
           :alt: Spark Core and RDDs
      -

 .. _Index Page - Spark SQL and DataFrames:

 **Spark SQL and DataFrames**

 Spark SQL is Apache Spark's module for working with structured data.
 It allows you to seamlessly mix SQL queries with Spark programs.
 With PySpark DataFrames you can efficiently read, write, transform,
 and analyze data using Python and SQL.
 Whether you use Python or SQL, the same underlying execution
 engine is used so you will always leverage the full power of Spark.

 - :ref:`/getting_started/quickstart_df.ipynb`
 - |binder_df|_
 - :ref:`Spark SQL API Reference</reference/pyspark.sql/index.rst>`

 **Pandas API on Spark**

 Pandas API on Spark lets you scale your pandas workload to any size
 by running it distributed across multiple nodes. If you are already familiar
 with pandas and want to use Spark for big data, Pandas API on Spark makes
 you immediately productive and lets you migrate your applications with little or no code modification.
 You can have a single codebase that works both with pandas (tests, smaller datasets)
 and with Spark (production, distributed datasets), and you can switch between the
 pandas API and the Pandas API on Spark easily.

 Pandas API on Spark aims to make the transition from pandas to Spark easy, but
 if you are new to Spark or deciding which API to use, we recommend using PySpark
 (see :ref:`Spark SQL and DataFrames <Index Page - Spark SQL and DataFrames>`).

 - :ref:`/getting_started/quickstart_ps.ipynb`
 - |binder_ps|_
 - :ref:`Pandas API on Spark Reference</reference/pyspark.pandas/index.rst>`

 .. _Index Page - Structured Streaming:

 **Structured Streaming**

 Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
 You can express your streaming computation the same way you would express a batch computation on static data.
 The Spark SQL engine will take care of running it incrementally and continuously and updating the final result
 as streaming data continues to arrive.

 - `Structured Streaming Programming Guide <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>`_
 - :ref:`Structured Streaming API Reference</reference/pyspark.ss/index.rst>`

 **Machine Learning (MLlib)**

 Built on top of Spark, MLlib is a scalable machine learning library that provides
 a uniform set of high-level APIs that help users create and tune practical machine
 learning pipelines.

 - `Machine Learning Library (MLlib) Programming Guide <https://spark.apache.org/docs/latest/ml-guide.html>`_
 - :ref:`Machine Learning (MLlib) API Reference</reference/pyspark.ml.rst>`

 **Spark Core and RDDs**

 Spark Core is the underlying general execution engine for the Spark platform, on which all
 other functionality is built. It provides RDDs (Resilient Distributed Datasets)
 and in-memory computing capabilities.

 Note that the RDD API is a low-level API that can be difficult to use, and with it you do not get
 the benefit of Spark's automatic query optimization capabilities.
 We recommend using DataFrames (see :ref:`Spark SQL and DataFrames <Index Page - Spark SQL and DataFrames>` above)
 instead of RDDs, as DataFrames let you express what you want more easily and let Spark automatically
 construct the most efficient query for you.

 - :ref:`Spark Core API Reference</reference/pyspark.rst>`

 **Spark Streaming (Legacy)**

 Spark Streaming is an extension of the core Spark API that enables scalable,
 high-throughput, fault-tolerant stream processing of live data streams.

 Note that Spark Streaming is the previous generation of Spark's streaming engine.
 It is a legacy project and is no longer being updated.
 Spark now provides a newer and easier-to-use streaming engine called
 :ref:`Structured Streaming <Index Page - Structured Streaming>`, which you
 should use for your streaming applications and pipelines.

 - `Spark Streaming Programming Guide (Legacy) <https://spark.apache.org/docs/latest/streaming-programming-guide.html>`_
 - :ref:`Spark Streaming API Reference (Legacy)</reference/pyspark.streaming.rst>`

 .. toctree::
     :maxdepth: 2
     :hidden:

     Overview <self>
     getting_started/index
     user_guide/index
     reference/index
     development/index
     migration_guide/index