In order to call the datafu-spark APIs from PySpark, you can do the following (tested on a Hortonworks VM)
First, call pyspark with the following parameters
```bash
export PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar

pyspark --jars datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar --conf spark.executorEnv.PYTHONPATH=datafu-spark_2.11_2.3.0-1.5.0-SNAPSHOT.jar
```
The following is an example of calling the Spark version of the datafu dedup method
```python
from pyspark_utils.df_utils import PySparkDFUtils

df_utils = PySparkDFUtils()

df_people = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("a", "Sara", 33),
    ("b", "Bob", 36),
    ("b", "Charlie", 30),
    ("c", "David", 29),
    ("c", "Esther", 32),
    ("c", "Fanny", 36),
    ("c", "Zoey", 36)],
    ["id", "name", "age"])

func_dedup_res = df_utils.dedup_with_order(dataFrame=df_people, groupCol=df_people.id,
                                           orderCols=[df_people.age.desc(), df_people.name.desc()])

func_dedup_res.registerTempTable("dedup")

func_dedup_res.show()
```
This should produce output like the following (one deduplicated row per id; the row order from show() may vary)
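```
+---+-----+---+
| id| name|age|
+---+-----+---+
|  a|Alice| 34|
|  b|  Bob| 36|
|  c| Zoey| 36|
+---+-----+---+
```

Since the result was also registered as a temporary table named `dedup`, it can be queried with Spark SQL as well. A minimal follow-up sketch (this query is illustrative and not part of the original example):

```python
# Query the temp table created by registerTempTable("dedup") above
sqlContext.sql("SELECT id, name, age FROM dedup ORDER BY id").show()
```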
Building and testing datafu-spark can be done as described in the main DataFu README.
If you wish to build for a specific Scala/Spark version, there are two options. One is to change the `scalaVersion` and `sparkVersion` in the main `gradle.properties` file.
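For the first option, the relevant `gradle.properties` entries might look like the following (a sketch, assuming the property names match the `-P` parameters shown below; the actual file contains other settings as well):

```
scalaVersion=2.12
sparkVersion=2.4.0
```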
The other is to pass these parameters in the command line. For example, to build and test for Scala 2.12 and Spark 2.4.0, you would use
```bash
./gradlew :datafu-spark:test -PscalaVersion=2.12 -PsparkVersion=2.4.0
```
There is a script, `build_and_test_spark.sh`, for building and testing datafu-spark across multiple Scala/Spark combinations.
To see the available options run it like this:
```bash
./build_and_test_spark.sh -h
```