
Apache Wayang (incubating) Benchmarks

This repository provides example applications and further benchmarking tools to evaluate and get started with Apache Wayang (incubating).

Below we provide detailed information on our various benchmark components, including running instructions. For the configuration of Apache Wayang (incubating) itself, please consult the Apache Wayang (incubating) repository or feel free to reach out on dev@wayang.apache.org.

Apache Wayang (incubating) Example Applications

WordCount

Description. This app takes a text input file and counts the number of occurrences of each word in the text. This simple app has become a kind of “Hello World” program for data processing systems.
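To make the computation concrete, here is a minimal sketch of a word count in plain Java streams. It does not use the Apache Wayang (incubating) API and is not the code of the app itself; the class name `WordCountSketch` and the tokenization rule (lower-casing, splitting on non-word characters) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: plain Java, not the Wayang API and not the app's actual code.
final class WordCountSketch {

    // Lower-cases the text, splits it on non-word characters (an assumed
    // tokenization rule), and counts how often each token occurs.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(countWords("to be or not to be"));
    }
}
```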

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.wordcount.WordCountScala

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Apache Wayang (incubating) in combination with this app:

Word2NVec

Description. Akin to Google's Word2Vec, this app creates vector representations of the words in a corpus based on their neighbors. It is simpler than Word2Vec in that it calculates the average neighborhood of each word rather than determining a lower-dimensional representation. The resulting vectors can be used, e.g., to cluster words and find related terms.
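One possible reading of "average neighborhood" can be sketched as follows: for every word, sum up the words that appear within a fixed window around each of its occurrences, then divide by the number of occurrences. This is a plain Java illustration under that assumption, not the app's implementation; the class name, the `window` parameter, and the sparse map representation are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: one assumed interpretation of "average neighborhood".
final class NeighborhoodVectorsSketch {

    // Returns word -> (neighbor -> average number of times the neighbor
    // appears within `window` positions of an occurrence of the word).
    static Map<String, Map<String, Double>> averageNeighborhoods(String[] tokens, int window) {
        Map<String, Map<String, Double>> sums = new HashMap<>();
        Map<String, Integer> occurrences = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            occurrences.merge(tokens[i], 1, Integer::sum);
            Map<String, Double> vector = sums.computeIfAbsent(tokens[i], w -> new HashMap<>());
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    vector.merge(tokens[j], 1.0, Double::sum);
                }
            }
        }
        // Turn the summed neighbor counts into averages per occurrence.
        for (Map.Entry<String, Map<String, Double>> entry : sums.entrySet()) {
            double n = occurrences.get(entry.getKey());
            entry.getValue().replaceAll((neighbor, sum) -> sum / n);
        }
        return sums;
    }
}
```

For the token sequence `a b a c` with a window of 1, the word `a` occurs twice and sees `b` twice and `c` once, so its vector is `b = 1.0, c = 0.5`.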

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.simwords.Word2NVec

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Apache Wayang (incubating) in combination with this app:

TPC-H Query 3

Description. This app executes a query from the established TPC-H benchmark. We provide several variants that work either on data in databases, in files, or in a mixture of both. Thus, this app requires cross-platform execution.

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.tpch.TpcH

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters. Note that you will have to configure Apache Wayang (incubating) such that it can access the database. Furthermore, this app depends on the following configuration keys:

  • wayang.apps.tpch.csv.customer: URL to the CUSTOMER file
  • wayang.apps.tpch.csv.orders: URL to the ORDERS file
  • wayang.apps.tpch.csv.lineitem: URL to the LINEITEM file
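As a rough illustration, the three keys above could be set as follows, assuming they are supplied in a Java-properties-style configuration file. The file locations are hypothetical placeholders; substitute the URLs of your own generated TPC-H files.

```properties
# Hypothetical locations; replace them with the URLs of your TPC-H files.
wayang.apps.tpch.csv.customer = file:///path/to/tpch/customer.csv
wayang.apps.tpch.csv.orders = file:///path/to/tpch/orders.csv
wayang.apps.tpch.csv.lineitem = file:///path/to/tpch/lineitem.csv
```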

Datasets. The datasets for this app can be generated with the TPC-H tools. The generated datasets can then be loaded into a database, stored in a filesystem, or both.

SINDY

Description. This app provides the data profiling algorithm SINDY that discovers inclusion dependencies in a relational database.
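An inclusion dependency A ⊆ B states that every value of column A also occurs in column B. The following plain Java sketch shows one common way to find unary inclusion dependencies: group columns by the values they contain, then intersect those groups per column. It is only an approximation of the idea for small in-memory data, not the app's implementation; the class name and the map-based input format are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: in-memory discovery of unary inclusion dependencies.
final class InclusionDependencySketch {

    // Returns, for each non-empty column A, the columns B (other than A)
    // such that every value of A also occurs in B.
    static Map<String, Set<String>> discover(Map<String, List<String>> columns) {
        // Step 1: for each distinct value, collect the columns it occurs in.
        Map<String, Set<String>> columnsByValue = new HashMap<>();
        columns.forEach((column, values) -> {
            for (String value : values) {
                columnsByValue.computeIfAbsent(value, v -> new TreeSet<>()).add(column);
            }
        });
        // Step 2: the candidates of a column are the intersection of the
        // column groups of all of its values.
        Map<String, Set<String>> referenced = new TreeMap<>();
        for (Set<String> group : columnsByValue.values()) {
            for (String column : group) {
                Set<String> candidates = referenced.get(column);
                if (candidates == null) {
                    referenced.put(column, new TreeSet<>(group));
                } else {
                    candidates.retainAll(group);
                }
            }
        }
        // Step 3: drop the trivial dependency of each column on itself.
        referenced.forEach((column, candidates) -> candidates.remove(column));
        return referenced;
    }
}
```

For example, if `orders.custkey` holds the values 1, 2, 2 and `customer.custkey` holds 1, 2, 3, the sketch reports that `orders.custkey` is included in `customer.custkey`, but not the other way around.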

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.sindy.Sindy

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Apache Wayang (incubating) in combination with this app:

SGD

Description. This app implements the stochastic gradient descent (SGD) algorithm. SGD is an optimization algorithm that minimizes a loss function and can be used in many supervised machine learning tasks. The current implementation uses the logistic loss and can thus be used for classification. Like many other machine learning techniques, SGD is a highly iterative algorithm.
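The iterative nature of SGD with the logistic loss can be sketched in a few lines of plain Java. This is not the app's code: the class name, the 0/1 label encoding, the constant step size, the bias stored as the last weight, and the uniform sampling of one data point per iteration are all assumptions chosen to keep the example small.

```java
import java.util.Random;

// Illustrative only: single-machine SGD for logistic regression.
final class LogisticSgdSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Predicted probability of label 1; the last weight acts as the bias.
    static double predict(double[] weights, double[] x) {
        double z = weights[weights.length - 1];
        for (int j = 0; j < x.length; j++) {
            z += weights[j] * x[j];
        }
        return sigmoid(z);
    }

    // Performs `iterations` updates, each on one randomly sampled data point.
    static double[] train(double[][] xs, int[] ys, int iterations, double stepSize, long seed) {
        Random random = new Random(seed);
        double[] weights = new double[xs[0].length + 1];
        for (int t = 0; t < iterations; t++) {
            int i = random.nextInt(xs.length);
            // For the logistic loss, the gradient w.r.t. the linear score
            // is (prediction - label).
            double error = predict(weights, xs[i]) - ys[i];
            for (int j = 0; j < xs[i].length; j++) {
                weights[j] -= stepSize * error * xs[i][j];
            }
            weights[weights.length - 1] -= stepSize * error;
        }
        return weights;
    }
}
```

Each iteration depends on the weights produced by the previous one, which is why a data processing system has to execute the update as a loop rather than as a single pass over the data.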

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.sgd.SGD

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Apache Wayang (incubating) in combination with this app:

k-means

Description. k-means is a well-known method to cluster data points in a Euclidean space. Like many other machine learning techniques, k-means is an iterative algorithm.
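A minimal sketch of the classic k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) is given below in plain Java for two-dimensional points. It is not the app's implementation; the class name, the fixed iteration count, and the caller-supplied initial centroids are assumptions.

```java
// Illustrative only: Lloyd-style k-means on 2D points held in memory.
final class KMeansSketch {

    // Runs a fixed number of iterations, starting from the given centroids.
    static double[][] cluster(double[][] points, double[][] initialCentroids, int iterations) {
        int k = initialCentroids.length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            // Assignment step: add each point to the sum of its nearest centroid.
            for (double[] p : points) {
                int nearest = nearest(centroids, p);
                sums[nearest][0] += p[0];
                sums[nearest][1] += p[1];
                counts[nearest]++;
            }
            // Update step: move each centroid to the mean of its points.
            // A centroid without assigned points stays where it is.
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return centroids;
    }

    // Index of the centroid with the smallest squared Euclidean distance to p.
    static int nearest(double[][] centroids, double[] p) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dx = p[0] - centroids[c][0];
            double dy = p[1] - centroids[c][1];
            double distance = dx * dx + dy * dy;
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }
}
```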

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.kmeans.Kmeans

or

org.apache.wayang.apps.kmeans.postgres.Kmeans

The former assumes the data to reside in a filesystem, while the latter assumes the data to reside in PostgreSQL. For the latter, you will have to configure Apache Wayang (incubating) such that it can access the database. Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. We provide a data generator to generate files that can be clustered. You can further load these files into the database assuming the following schema:

CREATE TABLE "<table_name_of_your_choice>" (x float8, y float8);

CrocoPR

Description. This app implements the cross-community PageRank: It takes as input two graphs, merges them, and runs a standard PageRank on the resulting graph. The preprocessing and PageRank steps typically lend themselves to be executed on different platforms.
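The two phases described above (merging the graphs, then ranking) can be sketched in plain Java as follows. This is an in-memory illustration rather than the app's implementation: the class name, the edge-list input format, the damping factor passed by the caller, and the uniform redistribution of rank from vertices without outgoing links are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: merge two edge lists, then run a basic PageRank.
final class CrossCommunityPageRankSketch {

    // Merges two lists of {source, target} edges; duplicate edges collapse.
    static Map<String, Set<String>> merge(List<String[]> edgesA, List<String[]> edgesB) {
        Map<String, Set<String>> adjacency = new TreeMap<>();
        for (List<String[]> edges : Arrays.asList(edgesA, edgesB)) {
            for (String[] edge : edges) {
                adjacency.computeIfAbsent(edge[0], v -> new TreeSet<>()).add(edge[1]);
                adjacency.computeIfAbsent(edge[1], v -> new TreeSet<>());
            }
        }
        return adjacency;
    }

    // Runs a fixed number of PageRank iterations on the merged graph.
    static Map<String, Double> pageRank(Map<String, Set<String>> adjacency,
                                        int iterations, double damping) {
        int n = adjacency.size();
        Map<String, Double> ranks = new TreeMap<>();
        for (String vertex : adjacency.keySet()) {
            ranks.put(vertex, 1.0 / n);
        }
        for (int it = 0; it < iterations; it++) {
            double danglingMass = 0.0;
            Map<String, Double> next = new TreeMap<>();
            for (String vertex : adjacency.keySet()) {
                next.put(vertex, 0.0);
            }
            for (Map.Entry<String, Set<String>> entry : adjacency.entrySet()) {
                double rank = ranks.get(entry.getKey());
                Set<String> targets = entry.getValue();
                if (targets.isEmpty()) {
                    danglingMass += rank; // spread evenly over all vertices below
                    continue;
                }
                for (String target : targets) {
                    next.merge(target, rank / targets.size(), Double::sum);
                }
            }
            final double base = (1.0 - damping) / n + damping * danglingMass / n;
            next.replaceAll((vertex, sum) -> base + damping * sum);
            ranks = next;
        }
        return ranks;
    }
}
```

The merge step is a set-oriented preprocessing task, while the ranking step is an iterative graph computation, which illustrates why the two steps may suit different platforms.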

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.crocopr.CrocoPR

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. This app works on RDF files, more specifically the Wikipedia pagelinks via DBpedia. Note that this app requires two input files. For the purpose of benchmarking, it is fine to use the same input file twice.

Optimizer experiments

Optimizer scalability

Description. This app generates Apache Wayang (incubating) plans with specific predefined topologies but of arbitrary size. This makes it possible to experimentally determine how well the optimizer of Apache Wayang (incubating) scales to large plans.

Running the app. To run the app, launch the main class:

org.apache.wayang.apps.benchmark.OptimizerScalabilityTest

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters. Furthermore, the following configuration keys may be of interest:

  • wayang.core.optimizer.pruning.strategies: controls the pruning strategy to be used when enumerating alternative plans
    • admissible values: empty or comma-separated list of org.apache.wayang.core.optimizer.enumeration.LatentOperatorPruningStrategy (default), org.apache.wayang.core.optimizer.enumeration.TopKPruningStrategy, org.apache.wayang.core.optimizer.enumeration.RandomPruningStrategy, and org.apache.wayang.core.optimizer.enumeration.SinglePlatformPruningStrategy (order-sensitive)
  • wayang.core.optimizer.pruning.topk: controls the k for the top-k pruning
  • wayang.core.optimizer.enumeration.concatenationprio: controls the order of the enumeration
    • admissible values: slots, plans, plans2, none, random
  • wayang.core.optimizer.enumeration.invertconcatenations: inverts the above-mentioned enumeration order
    • admissible values: false (default), true
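As an example, an experiment that uses only top-k pruning could combine the keys above as follows, assuming they are supplied in a Java-properties-style configuration file. The value 8 for k and the choice of `slots` are arbitrary illustrative settings, not recommendations.

```properties
# Illustrative settings only; all values are taken from the admissible values listed above,
# except for the arbitrary k = 8.
wayang.core.optimizer.pruning.strategies = org.apache.wayang.core.optimizer.enumeration.TopKPruningStrategy
wayang.core.optimizer.pruning.topk = 8
wayang.core.optimizer.enumeration.concatenationprio = slots
wayang.core.optimizer.enumeration.invertconcatenations = false
```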