Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
It consists of three libraries:
For more information please visit the website:
If you'd like to jump in and get started, check out the corresponding guides for each library:
If you are starting from a source release, then you'll want to verify the release is valid and bootstrap the build environment.
To verify that the archive has the correct MD5 checksum, the following two commands can be run. These should produce the same output.
openssl md5 < apache-datafu-sources-x.y.z-incubating.tgz
cat apache-datafu-sources-x.y.z-incubating.tgz.MD5
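The manual comparison can also be scripted. The following is a minimal sketch and not part of DataFu: `verify_md5` is a hypothetical helper name, and it assumes the .MD5 file records the digest as 32 contiguous hex characters, which may not hold for every release.

```shell
# Hypothetical helper (not shipped with DataFu): compare an archive's MD5
# digest with the digest recorded in its .MD5 file.
verify_md5() {
  archive="$1"
  checksum_file="$2"

  # openssl prints something like "MD5(stdin)= <hex>"; keep only the last field.
  actual=$(openssl md5 < "$archive" | awk '{print $NF}')

  # Assumption: the .MD5 file holds the digest as 32 contiguous hex characters.
  expected=$(grep -Eio '[0-9a-f]{32}' "$checksum_file" | head -n 1 | tr 'A-F' 'a-f')

  if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
    echo "MD5 OK: $archive"
  else
    echo "MD5 mismatch: $archive" >&2
    return 1
  fi
}
```

For example, `verify_md5 apache-datafu-sources-x.y.z-incubating.tgz apache-datafu-sources-x.y.z-incubating.tgz.MD5` should print a line beginning with `MD5 OK` and return 0 when the digests match, and return 1 otherwise.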
To verify the archive against its signature, you can run:
gpg2 --verify apache-datafu-sources-x.y.z-incubating.tgz.asc
The command above will assume you are verifying apache-datafu-sources-x.y.z-incubating.tgz and produce “Good signature” if the archive is valid.
To build DataFu from a source release, it is first necessary to download a Gradle wrapper script. This bootstrapping process requires Gradle to be installed on the source machine. Gradle is available through most package managers or directly from its website. Once you have installed Gradle and ensured that gradle is available on your path, you can bootstrap the wrapper with:
gradle -b bootstrap.gradle
After the bootstrap script has completed, you should find a gradlew script in the root of the project. The regular gradlew instructions below should then be available.
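The bootstrap steps can be wrapped in a small guard that fails early when Gradle is missing. This is only a sketch: `bootstrap_wrapper` is a hypothetical helper name, and it assumes it is run from the project root where bootstrap.gradle lives.

```shell
# Hypothetical helper: bootstrap the Gradle wrapper only if gradle is on the PATH.
bootstrap_wrapper() {
  if ! command -v gradle >/dev/null 2>&1; then
    echo "gradle not found on PATH; install Gradle first" >&2
    return 1
  fi

  # Same bootstrap command as in the instructions; run from the project root.
  gradle -b bootstrap.gradle || return 1

  # The bootstrap is expected to leave a gradlew script in the project root.
  if [ ! -f ./gradlew ]; then
    echo "bootstrap finished but ./gradlew was not created" >&2
    return 1
  fi
  echo "gradlew is ready"
}
```

Running `bootstrap_wrapper` from the project root returns a non-zero status if Gradle is not installed or if no gradlew script appears afterwards.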
When building from a source release, the version for all generated artifacts will be of the form x.y.z. If you were to clone the git repo and build, you would find -SNAPSHOT appended to the version. This helps to distinguish official releases from those generated from the code repository for testing purposes.
To build DataFu from a git checkout or binary release, run:
./gradlew clean assemble
Each project's JARs can be found under the corresponding subdirectory. For example, the datafu-pig JAR can be found under datafu-pig/build/libs. The artifact name will be of the form datafu-pig-x.y.z.jar if this is a source release and datafu-pig-x.y.z-SNAPSHOT.jar if this is being built from the code repository.
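To check which kind of build produced a given JAR, the naming convention can be tested directly. The sketch below is illustrative only: `jar_kind` is a hypothetical helper name, and it relies solely on the -SNAPSHOT suffix described here.

```shell
# Hypothetical helper: classify a JAR by the -SNAPSHOT naming convention.
jar_kind() {
  case "$1" in
    *-SNAPSHOT.jar) echo "snapshot" ;;  # built from the code repository
    *.jar)          echo "release" ;;   # built from a source release
    *)              echo "unknown" ;;   # not a JAR file name
  esac
}
```

For instance, `jar_kind datafu-pig-x.y.z-SNAPSHOT.jar` prints `snapshot`, while `jar_kind datafu-pig-x.y.z.jar` prints `release`.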
This command generates the Eclipse project and classpath files:
To load the projects in Eclipse:
To clean up the Eclipse files:
To run all the tests:
To run only one module's tests - for example, only the DataFu Pig tests:
To run tests for a single class, use the test.single property. For example, to run only the QuantileTests:
./gradlew :datafu-pig:test -Dtest.single=QuantileTests
The tests can also be run from within Eclipse. You'll need to install the TestNG plugin for Eclipse for DataFu Pig and Hourglass. See: http://testng.org/doc/download.html.
Potential issues and workarounds: