---
title: Apache DataFu Spark - Getting Started
version: 1.7.0
section_name: Getting Started
license: >
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---
# DataFu Spark
Apache DataFu Spark is a collection of utils and user-defined functions for working with large scale data in [Apache Spark](http://spark.apache.org/).
A list of some of the things you can do with DataFu Spark is given below:
* ["Dedup" a table](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L139) - remove duplicates based on a key and ordering (typically a date updated field, to get only the mostly recently updated record).
* [Join a table that has a numeric column against a table that has a range](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L361)
* [Do a skewed join between tables](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkDFUtils.scala#L274) (where the small table is still too big to fit in memory)
* [Count distinct up to](https://github.com/apache/datafu/blob/master/datafu-spark/src/main/scala/datafu/spark/SparkUDAFs.scala#L224) - an efficient implementation when you just want to verify that a certain minimum of distinct rows appear in a table
* Call Python code from Spark Scala, or Scala code from PySpark
If you'd like to read more details about these functions, check out the [Guide](/docs/spark/guide.html). Otherwise, if you are
ready to get started using DataFu Spark, keep reading.
The rest of this page assumes you already have a built JAR available. If this is not the case, please see the [Download](/docs/download.html) page.
This JAR should be loaded into the Spark classpath. You can verify that you've done this correctly by trying to import one of our DataFu classes, for example, _DataFrameOps_.
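For example, assuming your built JAR is named something like `datafu-spark_2.12-1.7.0.jar` (the exact file name depends on your Scala and DataFu versions), you can put it on the classpath when launching the Spark shell and check that the import resolves:

```scala
// launch the shell with the DataFu JAR on the classpath, for example:
//   spark-shell --jars datafu-spark_2.12-1.7.0.jar

// if this import succeeds, the JAR is on the classpath
import datafu.spark.DataFrameOps._
```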
## Basic Example: Finding the most recent update of a given record
A common scenario in data sent to HDFS (the Hadoop Distributed File System) is multiple rows representing updates for the same logical record. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let's consider the following simplified example.
<br>
<script src="https://gist.github.com/eyala/65b6750b2539db5895738a49be3d8c98.js"></script>
<center>Raw customers data, with more than one row per customer</center>
<br>
We can see that though most of the customers only appear once, _julia_ and _quentin_ have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's _dedupWithOrder_ method.
```scala
import datafu.spark.DataFrameOps._
import spark.implicits._ // for the $"column" syntax

val customers = spark.read.format("csv").option("header", "true").load("customers.csv")

// keep only the most recently updated row for each customer id
customers.dedupWithOrder($"id", $"date_updated".desc).show
```
Our result is as expected: each customer appears only once, as you can see below.
<br>
<script src="https://gist.github.com/eyala/1dddebc39e9a3fe4501638a95f577752.js"></script>
<center>“Deduplicated” data, with only the most recent record for each customer (though not in order)</center>
<br>
There are two additional variants of _dedupWithOrder_ in datafu-spark. The _dedupWithCombiner_ method provides similar functionality to _dedupWithOrder_, but uses a UDAF to take advantage of map-side aggregation. The _dedupTopN_ method allows retaining more than one record for each key, as sketched below.
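As a rough sketch of how these variants might be invoked (the calls below follow the pattern of _dedupWithOrder_ above; the exact parameter lists are documented in the Scaladoc, so treat this as illustrative):

```scala
// keep the 2 most recent records per customer instead of just one
customers.dedupTopN(2, $"id", $"date_updated".desc).show

// single most recent record per customer, computed with map-side aggregation
customers.dedupWithCombiner($"id", $"date_updated").show
```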
## Next Steps
Check out the [Guide](/docs/spark/guide.html) for more information on what you can do with DataFu Spark.