| <!-- |
| ▄▄▄ ██▓███ ▄▄▄ ▄████▄ ██░ ██ ▓█████ ██▓ ▄████ ███▄ █ ██▓▄▄▄█████▓▓█████ |
| ▒████▄ ▓██░ ██▒▒████▄ ▒██▀ ▀█ ▓██░ ██▒▓█ ▀ ▓██▒ ██▒ ▀█▒ ██ ▀█ █ ▓██▒▓ ██▒ ▓▒▓█ ▀ |
| ▒██ ▀█▄ ▓██░ ██▓▒▒██ ▀█▄ ▒▓█ ▄ ▒██▀▀██░▒███ ▒██▒▒██░▄▄▄░▓██ ▀█ ██▒▒██▒▒ ▓██░ ▒░▒███ |
| ░██▄▄▄▄██ ▒██▄█▓▒ ▒░██▄▄▄▄██ ▒▓▓▄ ▄██▒░▓█ ░██ ▒▓█ ▄ ░██░░▓█ ██▓▓██▒ ▐▌██▒░██░░ ▓██▓ ░ ▒▓█ ▄ |
| ▓█ ▓██▒▒██▒ ░ ░ ▓█ ▓██▒▒ ▓███▀ ░░▓█▒░██▓░▒████▒ ░██░░▒▓███▀▒▒██░ ▓██░░██░ ▒██▒ ░ ░▒████▒ |
| ▒▒ ▓▒█░▒▓▒░ ░ ░ ▒▒ ▓▒█░░ ░▒ ▒ ░ ▒ ░░▒░▒░░ ▒░ ░ ░▓ ░▒ ▒ ░ ▒░ ▒ ▒ ░▓ ▒ ░░ ░░ ▒░ ░ |
| ▒ ▒▒ ░░▒ ░ ▒ ▒▒ ░ ░ ▒ ▒ ░▒░ ░ ░ ░ ░ ▒ ░ ░ ░ ░ ░░ ░ ▒░ ▒ ░ ░ ░ ░ ░ |
| ░ ▒ ░░ ░ ▒ ░ ░ ░░ ░ ░ ▒ ░░ ░ ░ ░ ░ ░ ▒ ░ ░ ░ |
| ░ ░ ░ ░░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ |
| --> |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <!DOCTYPE html> |
| <html lang="en"> |
| <head> |
| <link rel="canonical" href="https://ignite.apache.org/use-cases/spark-acceleration.html"/> |
| <meta charset="utf-8"> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| |
| <meta name="description" |
| content="Apache Ignite integrates with Apache Spark to accelerate the performance of Spark applications |
| and APIs by keeping data in a shared in-memory cluster."/> |
| |
| <title>Apache Spark Performance Acceleration With Apache Ignite</title> |
| |
| <!--#include virtual="/includes/styles.html" --> |
| |
| <!--#include virtual="/includes/sh.html" --> |
| </head> |
| <body> |
| <div id="wrapper"> |
| <!--#include virtual="/includes/header.html" --> |
| |
| <main id="main" role="main" class="container"> |
| <section id="shared-memory-layer" class="page-section"> |
| <h1 class="first">Apache Spark Performance Acceleration With Apache Ignite</h1> |
| <div class="col-sm-12 col-md-12 col-xs-12" style="padding:0 0 10px 0;"> |
| <div class="col-sm-6 col-md-6 col-xs-12" style="padding-left:0; padding-right:0"> |
| <p> |
| Apache Ignite integrates with Apache Spark to accelerate the performance of Spark applications |
| and APIs by keeping data in a shared in-memory cluster. Spark users can use Ignite as a data |
| source in a way similar to Hadoop or a relational database. Start an Ignite cluster, set |
| it as a data source for Spark workers, and keep using the Spark RDD or DataFrame APIs, or gain |
| even more speed by running Ignite SQL or compute APIs directly. |
| </p> |
| |
| <p> |
| In addition to accelerating Spark applications, Ignite serves as a shared |
| in-memory layer for Spark workers that need to share both data and state. |
| </p> |
| |
| </div> |
| |
| <div class="col-sm-6 col-md-6 col-xs-12" style="padding-right:0"> |
| <img class="img-responsive" src="/images/spark_integration.png" width="440" style="float:right;"/> |
| </div> |
| |
| </div> |
| |
| <p> |
| The performance gain comes from several sources. First, Ignite is designed to store data sets |
| in memory across a cluster of nodes, reducing the latency of Spark operations that usually need to pull data |
| from disk-based systems. Second, Ignite tries to minimize data shuffling over the network between its |
| store and Spark applications by running certain Spark tasks, produced by the RDD or DataFrame APIs, |
| in place on Ignite nodes. This optimization reduces the effect of network latency on the |
| performance of Spark calls. Finally, the network impact can be reduced even further if native |
| Ignite APIs such as SQL are called from Spark applications directly. Doing so |
| eliminates data shuffling between Spark and Ignite, because Ignite SQL queries are always executed on |
| Ignite nodes and return a much smaller final result set to the application layer. |
| </p> |
| |
| <div class="page-heading">Ignite Shared RDDs</div> |
| <p> |
| Apache Ignite provides an implementation of the Spark RDD that allows any data and state to be shared |
| in memory as RDDs across Spark jobs. The IgniteRDD provides a shared, mutable view of the same |
| in-memory data in Ignite across different Spark jobs, workers, or applications. |
| </p> |
| |
| <p> |
| An IgniteRDD is implemented as a view over a distributed Ignite table (also known as a cache). |
| It can be deployed with an Ignite node either within the Spark job executing process, on a Spark worker, |
| or in a separate Ignite cluster. Depending on the chosen deployment mode, the shared |
| state may either exist only during the lifespan of a Spark application (embedded mode) or |
| outlive the Spark application (standalone mode). |
| </p> |
| |
| <div class="page-heading">Ignite DataFrames</div> |
| <p> |
| The Apache Spark DataFrame API introduced the concept of a schema to describe the data, |
| allowing Spark to manage the schema and organize the data into a tabular format. To put it simply, |
| a DataFrame is a distributed collection of data organized into named columns. It is conceptually |
| equivalent to a table in a relational database and allows Spark to leverage the Catalyst query |
| optimizer to produce more efficient query execution plans than it can for RDDs, which are plain |
| collections of elements partitioned across the nodes of the cluster. |
| </p> |
| <p> |
| Ignite supports the DataFrame APIs, letting Spark write to and read from Ignite through that interface. |
| Moreover, Ignite analyzes execution plans produced by Spark's Catalyst engine and can execute |
| parts of the plan on Ignite nodes directly, reducing data shuffling. All of this makes your Spark SQL |
| queries more performant. |
| </p> |
| |
| <div class="page-heading">Learn More</div> |
| <p> |
| <a href="https://apacheignite-fs.readme.io/docs/installation-deployment" target="docs"> |
| <b>Ignite and Spark Installation and Deployment <i class="fa fa-angle-double-right"></i></b> |
| </a> |
| </p> |
| <p> |
| <a href="https://apacheignite-fs.readme.io/docs/ignitecontext-igniterdd" target="docs"> |
| <b>Ignite RDDs in Detail <i class="fa fa-angle-double-right"></i></b> |
| </a> |
| </p> |
| <p> |
| <a href="https://apacheignite-fs.readme.io/docs/ignite-data-frame" target="docs"> |
| <b>Ignite DataFrames in Detail <i class="fa fa-angle-double-right"></i></b> |
| </a> |
| </p> |
| |
| </section> |
| </main> |
| |
| <!--#include virtual="/includes/footer.html" --> |
| </div> |
| <!--#include virtual="/includes/scripts.html" --> |
| </body> |
| </html> |