| --- |
| title: Apache DataFu Spark - Getting Started |
| version: 1.7.0 |
| section_name: Getting Started |
| license: > |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
| # DataFu Spark |
| |
Apache DataFu Spark is a collection of utilities and user-defined functions for working with large-scale data in [Apache Spark](http://spark.apache.org/).
| |
Here are some of the things you can do with DataFu Spark:
| |
| * ["Dedup" a table](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L139) - remove duplicates based on a key and ordering (typically a date updated field, to get only the mostly recently updated record). |
| |
* [Join a table that has a numeric field with a table that has a range](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L361)
| |
| * [Do a skewed join between tables](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L274) (where the small table is still too big to fit in memory) |
| |
* [Count distinct up to](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala#L224) - an efficient implementation for when you only need to verify that at least a certain number of distinct rows appear in a table
| |
| * Call Python code from Spark Scala, or Scala code from PySpark |
| |
If you'd like to read more details about these functions, check out the [Guide](/docs/spark/guide.html). Otherwise, if you're
ready to get started using DataFu Spark, keep reading.
| |
| The rest of this page assumes you already have a built JAR available. If this is not the case, please see the [Download](/docs/download.html) page. |
| |
This JAR should be added to the Spark classpath. You can verify that you've done this correctly by trying to import one of our DataFu classes, for example, _DataFrameOps_.
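For example, you can launch a Spark shell with the JAR on the classpath and try the import directly. The JAR file name below is illustrative; yours will depend on your Scala and DataFu versions.

```scala
// launched with, for example:
//   spark-shell --jars datafu-spark_2.12-1.7.0.jar
// if the import below succeeds, the JAR is on the classpath
import datafu.spark.DataFrameOps._
```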
| |
| ## Basic Example: Finding the most recent update of a given record |
| |
A common scenario in data sent to HDFS (the Hadoop Distributed File System) is multiple rows representing updates for the same logical record. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let's consider the following simplified example.
| |
| <br> |
| <script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script> |
| <center>Raw customers’ data, with more than one row per customer</center> |
| <br> |
| |
We can see that although most customers appear only once, _julia_ and _quentin_ have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's _dedupWithOrder_ method.
| |
| ```scala |
import datafu.spark.DataFrameOps._
import spark.implicits._ // enables the $"..." column syntax

// each row is an update to a customer record; "id" identifies the customer
val customers = spark.read.format("csv").option("header", "true").load("customers.csv")

// keep only the most recently updated row for each customer
customers.dedupWithOrder($"id", $"date_updated".desc).show
| ``` |
| |
Our result will be as expected: each customer appears only once, as you can see below.
| |
| <br> |
| <script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script> |
| <center>“Deduplicated” data, with only the most recent record for each customer (though not in order)</center> |
| <br> |
| |
There are two additional variants of _dedupWithOrder_ in datafu-spark, both sketched below. The _dedupWithCombiner_ method has similar functionality to _dedupWithOrder_, but uses a UDAF to take advantage of map-side aggregation. The _dedupTopN_ method allows retaining more than one record for each key.
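To give a feel for how they're called, here is a minimal sketch using the customers table from above. The parameters shown are assumptions modeled on the _dedupWithOrder_ example; consult the [Guide](/docs/spark/guide.html) for the authoritative signatures.

```scala
import datafu.spark.DataFrameOps._
import spark.implicits._

// same result as dedupWithOrder, but aggregates on the map side
// (assumes the most recent "date_updated" row is kept by default)
val latest = customers.dedupWithCombiner($"id", $"date_updated")

// keep the two most recent updates per customer instead of just one
val latestTwo = customers.dedupTopN(2, $"id", $"date_updated".desc)
```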
| |
| ## Next Steps |
| |
| Check out the [Guide](/docs/spark/guide.html) for more information on what you can do with DataFu Spark. |