---
title: Apache DataFu Spark - Getting Started
version: 1.7.0
section_name: Getting Started
license: >
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
# DataFu Spark
Apache DataFu Spark is a collection of utils and user-defined functions for working with large scale data in [Apache Spark](http://spark.apache.org/).
A list of some of the things you can do with DataFu Spark is given below:
* ["Dedup" a table](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L139) - remove duplicates based on a key and ordering (typically a date updated field, to get only the mostly recently updated record).
* [Join a table that has a numeric column against a table that has a range](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L361)
* [Do a skewed join between tables](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L274) (where the small table is still too big to fit in memory)
* [Count distinct up to](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala#L224) - an efficient implementation when you just want to verify that a certain minimum of distinct rows appear in a table
* Call Python code from Spark Scala, or Scala code from PySpark
If you'd like to read more details about these functions, check out the [Guide](/docs/spark/guide.html). Otherwise, if you are
ready to get started using DataFu Spark, keep reading.
The rest of this page assumes you already have a built JAR available. If this is not the case, please see the [Download](/docs/download.html) page.
This JAR should be loaded into the Spark classpath. You can verify that you've done this correctly by trying to import one of our DataFu classes, for example, _DataFrameOps_.
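For example, assuming your built JAR is named something like `datafu-spark_2.12-1.7.0.jar` (the exact file name depends on your Scala and DataFu versions), you can put it on the classpath when launching the Spark shell and check that the import resolves:

```scala
// launch the shell with the DataFu JAR on the classpath, for example:
//   spark-shell --jars datafu-spark_2.12-1.7.0.jar

// if this import succeeds, the JAR is on the classpath
import datafu.spark.DataFrameOps._
```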
## Basic Example: Finding the most recent update of a given record
A common scenario in data sent to HDFS (the Hadoop Distributed File System) is multiple rows representing updates for the same logical record. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let's consider the following simplified example.
<br>
<script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
<center>Raw customers data, with more than one row per customer</center>
<br>
We can see that though most of the customers only appear once, _julia_ and _quentin_ have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's _dedupWithOrder_ method.
```scala
import datafu.spark.DataFrameOps._
import spark.implicits._ // for the $"column" syntax

val customers = spark.read.format("csv").option("header", "true").load("customers.csv")

// keep only the most recently updated row for each customer id
customers.dedupWithOrder($"id", $"date_updated".desc).show
```
Our result is as expected: each customer appears only once, as you can see below.
<br>
<script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
<center>“Deduplicated” data, with only the most recent record for each customer (though not in order)</center>
<br>
There are two additional variants of _dedupWithOrder_ in datafu-spark. The _dedupWithCombiner_ method provides similar functionality to _dedupWithOrder_, but uses a UDAF to take advantage of map-side aggregation. The _dedupTopN_ method allows retaining more than one record for each key, as sketched below.
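As a rough sketch of how these variants might be invoked (the calls below follow the pattern of _dedupWithOrder_ above; the exact parameter lists are documented in the Scaladoc, so treat this as illustrative):

```scala
// keep the 2 most recent records per customer instead of just one
customers.dedupTopN(2, $"id", $"date_updated".desc).show

// single most recent record per customer, computed with map-side aggregation
customers.dedupWithCombiner($"id", $"date_updated").show
```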
## Next Steps
Check out the [Guide](/docs/spark/guide.html) for more information on what you can do with DataFu Spark.