| <!-- |
| ▄▄▄ ██▓███ ▄▄▄ ▄████▄ ██░ ██ ▓█████ ██▓ ▄████ ███▄ █ ██▓▄▄▄█████▓▓█████ |
| ▒████▄ ▓██░ ██▒▒████▄ ▒██▀ ▀█ ▓██░ ██▒▓█ ▀ ▓██▒ ██▒ ▀█▒ ██ ▀█ █ ▓██▒▓ ██▒ ▓▒▓█ ▀ |
| ▒██ ▀█▄ ▓██░ ██▓▒▒██ ▀█▄ ▒▓█ ▄ ▒██▀▀██░▒███ ▒██▒▒██░▄▄▄░▓██ ▀█ ██▒▒██▒▒ ▓██░ ▒░▒███ |
| ░██▄▄▄▄██ ▒██▄█▓▒ ▒░██▄▄▄▄██ ▒▓▓▄ ▄██▒░▓█ ░██ ▒▓█ ▄ ░██░░▓█ ██▓▓██▒ ▐▌██▒░██░░ ▓██▓ ░ ▒▓█ ▄ |
| ▓█ ▓██▒▒██▒ ░ ░ ▓█ ▓██▒▒ ▓███▀ ░░▓█▒░██▓░▒████▒ ░██░░▒▓███▀▒▒██░ ▓██░░██░ ▒██▒ ░ ░▒████▒ |
| ▒▒ ▓▒█░▒▓▒░ ░ ░ ▒▒ ▓▒█░░ ░▒ ▒ ░ ▒ ░░▒░▒░░ ▒░ ░ ░▓ ░▒ ▒ ░ ▒░ ▒ ▒ ░▓ ▒ ░░ ░░ ▒░ ░ |
| ▒ ▒▒ ░░▒ ░ ▒ ▒▒ ░ ░ ▒ ▒ ░▒░ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░░ ░ ▒░ ▒ ░ ░ ░ ░ ░ |
| ░ ▒ ░░ ░ ▒ ░ ░ ░░ ░ ░ ▒ ░░ ░ ░ ░ ░ ░ ▒ ░ ░ ░ |
| ░ ░ ░ ░░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ |
| --> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <link rel="canonical" href="https://ignite.apache.org/use-cases/spark-acceleration.html"/> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <meta name="description" |
| content="Apache Ignite integrates with Apache Spark to accelerate the performance of Spark applications |
| and APIs by keeping data in a shared in-memory cluster."/> |
| |
| <title>Apache Spark Performance Acceleration</title> |
| |
| <!--#include virtual="/includes/styles.html" --> |
| |
| <!--#include virtual="/includes/sh.html" --> |
| </head> |
| <body> |
| <!--#include virtual="/includes/header.html" --> |
| <article> |
| <header> |
| <div class="container"> |
| <h1>Apache Spark <strong>Performance Acceleration</strong></h1> |
| </div> |
| </header> |
| <div class="container"> |
| <p> |
| The performance of Apache Spark® applications can be accelerated by keeping data in a shared |
| Apache Ignite® in-memory cluster. Spark uses Ignite as a data source, much as it uses Hadoop or a |
| relational database. You can start an Ignite cluster, set it as a data source for Spark workers, and |
| continue using the Spark RDD or DataFrame APIs. You can gain even more speed by running Ignite SQL or |
| compute APIs directly on the Spark dataset. Spark workers that need to share both data and state can |
| also use Ignite as a distributed in-memory layer. |
| </p> |
| <img class="img-fluid diagram-right" alt="Apache Spark Performance Acceleration" src="/images/svg-diagrams/spark_acceleration.svg"/> |
| |
| |
| <p> |
| The performance increase is achievable for several reasons. First, Ignite is designed to store data sets |
| in memory across a cluster of nodes, reducing the latency of Spark operations that usually need to pull data |
| from disk-based systems. Second, Ignite tries to minimize data shuffling over the network between its |
| store and Spark applications by running certain Spark tasks, produced by the RDD or DataFrame APIs, |
| in place on Ignite nodes. This optimization helps to reduce the effect of network latency on the |
| performance of Spark calls. Finally, the network impact can be further reduced if the native |
| Ignite APIs, such as SQL, are called from Spark applications directly. By doing so, you can eliminate |
| data shuffling between Spark and Ignite, because Ignite SQL queries are always executed on |
| Ignite nodes and return a much smaller final result set to the application layer. |
| </p> |
| |
| <h2>Ignite Shared RDDs</h2> |
| <p> |
| Apache Ignite provides an implementation of the Spark RDD, which allows any data and state to be shared |
| in memory as RDDs across Spark jobs. The Ignite RDD provides a shared, mutable view of the data stored |
| in Ignite caches across different Spark jobs, workers, or applications. |
| </p> |
| |
| <p> |
| The Ignite RDD is implemented as a view over a distributed Ignite table (also known as a cache). It can be deployed |
| with an Ignite node either within the Spark job executing process, on a Spark worker, or in a separate |
| Ignite cluster. This means that, depending on the chosen deployment mode, the shared state may either |
| exist only during the lifespan of a Spark application (embedded mode), or it may outlive the Spark |
| application (standalone mode). |
| </p> |
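| |
| <p> |
| As a rough illustration of the shared RDD described above, the following Scala sketch writes pairs into |
| an Ignite cache through an <code>IgniteRDD</code> and then filters them as an ordinary Spark RDD. It is a |
| minimal sketch under several assumptions: the <code>ignite-spark</code> module is on the classpath, an Ignite |
| server node is already running and discoverable with the default configuration, and the cache name |
| <code>sharedRDD</code>, the application name, and the <code>local[2]</code> master are placeholders chosen |
| for this example. |
| </p> |
| <pre><code> |
| import org.apache.ignite.configuration.IgniteConfiguration |
| import org.apache.ignite.spark.IgniteContext |
| import org.apache.spark.{SparkConf, SparkContext} |
| |
| object SharedRddSketch { |
|   def main(args: Array[String]): Unit = { |
|     // Placeholder Spark settings; in practice these usually come from spark-submit. |
|     val conf = new SparkConf().setAppName("shared-rdd-sketch").setMaster("local[2]") |
|     val sc = new SparkContext(conf) |
| |
|     // The closure supplies the Ignite configuration used by the driver and the executors. |
|     // Assumption: the default configuration is enough to find the running Ignite cluster. |
|     val ic = new IgniteContext(sc, () =&gt; new IgniteConfiguration()) |
| |
|     // A view over the Ignite cache named "sharedRDD" (hypothetical name). |
|     val sharedRdd = ic.fromCache[Int, Int]("sharedRDD") |
| |
|     // Store key-value pairs in the cache; other Spark jobs can read them through the same cache name. |
|     sharedRdd.savePairs(sc.parallelize(1 to 1000, 10).map(i =&gt; (i, i * i))) |
| |
|     // Regular RDD transformations and actions work on the cache contents. |
|     val bigSquares = sharedRdd.filter(_._2 &gt; 500000).count() |
|     println(s"Pairs with a value above 500000: $bigSquares") |
| |
|     // Release the Ignite resources held by this context, then stop Spark. |
|     ic.close() |
|     sc.stop() |
|   } |
| } |
| </code></pre> |
| <p> |
| A second Spark application that creates its own <code>IgniteContext</code> and calls |
| <code>fromCache</code> with the same cache name would see the same pairs, provided the Ignite cluster is |
| still running. |
| </p> |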
| |
| <h2>Ignite DataFrames</h2> |
| <p> |
| The Apache Spark DataFrame API introduced the concept of a schema to describe the data, |
| allowing Spark to manage the schema and organize the data into a tabular format. To put it simply, |
| a DataFrame is a distributed collection of data organized into named columns. It is conceptually |
| equivalent to a table in a relational database and allows Spark to leverage the Catalyst query |
| optimizer to produce much more efficient query execution plans in comparison to RDDs, which are |
| collections of elements partitioned across the nodes of the cluster. |
| </p> |
| <p> |
| Ignite supports the DataFrame API, allowing Spark to write to and read from Ignite through that interface. |
| Furthermore, Ignite analyzes execution plans produced by Spark's Catalyst engine and can execute |
| parts of the plan on Ignite nodes directly, which reduces data shuffling and consequently makes your |
| Spark SQL queries perform better. |
| </p> |
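| |
| <p> |
| The read and write paths mentioned above could look roughly like the following Scala sketch, which uses |
| the option names from <code>IgniteDataFrameSettings</code>. Everything specific in it is an assumption made |
| for illustration: the configuration file <code>ignite-config.xml</code>, the existing SQL table |
| <code>person</code> with <code>name</code> and <code>age</code> columns, the target table |
| <code>adult</code>, and the <code>local[2]</code> master. It also assumes the <code>ignite-spark</code> |
| module is on the classpath and that an Ignite cluster is reachable with that configuration. |
| </p> |
| <pre><code> |
| import org.apache.ignite.spark.IgniteDataFrameSettings._ |
| import org.apache.spark.sql.SparkSession |
| |
| object IgniteDataFrameSketch { |
|   def main(args: Array[String]): Unit = { |
|     // Placeholder Spark settings; in practice these usually come from spark-submit. |
|     val spark = SparkSession.builder() |
|       .appName("ignite-dataframe-sketch") |
|       .master("local[2]") |
|       .getOrCreate() |
| |
|     // Hypothetical path to an Ignite Spring XML configuration file. |
|     val configPath = "ignite-config.xml" |
| |
|     // Read the (assumed) Ignite SQL table "person" as a Spark DataFrame. |
|     val persons = spark.read |
|       .format(FORMAT_IGNITE) |
|       .option(OPTION_CONFIG_FILE, configPath) |
|       .option(OPTION_TABLE, "person") |
|       .load() |
| |
|     // Query it with Spark SQL like any other DataFrame. |
|     persons.createOrReplaceTempView("person") |
|     val adults = spark.sql("SELECT name, age FROM person WHERE age &gt;= 18") |
|     adults.show() |
| |
|     // Write the result to a new Ignite table; a primary key must be named when the table is created. |
|     adults.write |
|       .format(FORMAT_IGNITE) |
|       .option(OPTION_CONFIG_FILE, configPath) |
|       .option(OPTION_TABLE, "adult") |
|       .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "name") |
|       .save() |
| |
|     spark.stop() |
|   } |
| } |
| </code></pre> |
| <p> |
| Calling <code>adults.explain()</code> before <code>show()</code> prints the physical plan, which is one way |
| to inspect how a given query is executed against the Ignite table. |
| </p> |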
| |
| |
| <div class="jumbotron jumbotron-fluid"> |
| <div class="container"> |
| <div class="title display-6">Learn More</div> |
| <hr class="my-4"> |
| <div class="row"> |
| <div class="col-sm-6"> |
| <ul> |
| <li> |
| <a href="https://apacheignite-fs.readme.io/docs/installation-deployment" target="docs"> |
| Ignite and Spark Installation and Deployment <i |
| class="fa fa-angle-double-right"></i> |
| </a> |
| </li> |
| <li> |
| <a href="https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd" target="docs"> |
| Ignite RDDs in Detail <i class="fa fa-angle-double-right"></i> |
| </a> |
| </li> |
| </ul> |
| </div> |
| <div class="col-sm-6"> |
| <ul> |
| <li> |
| <a href="https://apacheignite-fs.readme.io/docs/ignite-data-frame" target="docs"> |
| Ignite DataFrames in Detail <i class="fa fa-angle-double-right"></i> |
| </a> |
| </li> |
| <li> |
| |
| <a href="/use-cases/digital-integration-hub.html"> |
| Ignite as a Digital Integration Hub <i class="fa fa-angle-double-right"></i> |
| </a> |
| |
| </li> |
| </ul> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| </div> |
| |
| </article> |
| <!--#include virtual="/includes/footer.html" --> |
| <!--#include virtual="/includes/scripts.html" --> |
| </body> |
| </html> |