| .. Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| .. PySpark documentation master file |
| |
| ===================== |
| PySpark Documentation |
| ===================== |
| |
| |binder|_ | `GitHub <https://github.com/apache/spark>`_ | `Issues <https://issues.apache.org/jira/projects/SPARK/issues>`_ | |examples|_ | `Community <https://spark.apache.org/community.html>`_ |
| |
| PySpark is an interface for Apache Spark in Python. It not only allows you to write |
| Spark applications using Python APIs, but also provides the PySpark shell for |
| interactively analyzing your data in a distributed environment. PySpark supports most |
| of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib |
| (Machine Learning) and Spark Core. |
| |
| .. image:: ../../../docs/img/pyspark-components.png |
| :alt: PySpark Components |
| |
| **Spark SQL and DataFrame** |
| |
| Spark SQL is a Spark module for structured data processing. It provides |
| a programming abstraction called DataFrame and can also act as a distributed |
| SQL query engine. |
| |
| **pandas API on Spark** |
| |
| pandas API on Spark allows you to scale your pandas workload out. |
| With this package, you can: |
| |
| * Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas. |
| * Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets). |
| * Switch between the pandas API and PySpark API contexts easily without any overhead. |
| |
| **Streaming** |
| |
| Running on top of Spark, the streaming feature in Apache Spark enables powerful |
| interactive and analytical applications across both streaming and historical data, |
| while inheriting Spark's ease of use and fault tolerance characteristics. |
| |
| **MLlib** |
| |
| Built on top of Spark, MLlib is a scalable machine learning library that provides |
| a uniform set of high-level APIs that help users create and tune practical machine |
| learning pipelines. |
| |
| **Spark Core** |
| |
| Spark Core is the underlying general execution engine for the Spark platform, on which |
| all other functionality is built. It provides the RDD (Resilient Distributed Dataset) |
| abstraction and in-memory computing capabilities. |
| |
| .. toctree:: |
| :maxdepth: 2 |
| :hidden: |
| |
| getting_started/index |
| user_guide/index |
| reference/index |
| development/index |
| migration_guide/index |