title: “Quick Start: Run K-Means Example”


This guide walks you through the steps of executing an example program (K-Means clustering) on Flink. Along the way, you will see a visualization of the program and its optimized execution plan, and track the progress of its execution.

Setup Flink

Follow the instructions to set up Flink and enter the root directory of your Flink setup.

Generate Input Data

Flink contains a data generator for K-Means.

# Assuming you are in the root directory of your Flink setup
mkdir kmeans
cd kmeans
# Run data generator
java -cp ../examples/batch/KMeans.jar:../lib/flink-dist-{{ site.version }}.jar \
  org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
  -points 500 -k 10 -stddev 0.08 -output `pwd`

The generator has the following arguments (arguments in [] are optional):

-points <num> -k <num clusters> [-output <output-path>] [-stddev <relative stddev>] [-range <centroid range>] [-seed <seed>]

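The relative standard deviation is the most interesting tuning parameter. It determines how closely the points are scattered around the randomly generated centers: smaller values yield tighter, better-separated clusters, while larger values make the clusters overlap.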
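To make the effect of the relative standard deviation concrete, here is a minimal Python sketch that mimics the idea of the generator. It is a simplified stand-in, not the Flink `KMeansDataGenerator`: the function names, the default `value_range`, and the rule "absolute stddev = relative stddev × range" are assumptions made for illustration.

```python
import math
import random


def generate_points(num_points, k, rel_stddev, value_range=100.0, seed=42):
    """Draw k random centers and scatter points around them with Gaussian noise.

    Assumption (illustration only): the absolute standard deviation is
    rel_stddev * value_range.
    """
    rng = random.Random(seed)
    centers = [(rng.uniform(-value_range, value_range),
                rng.uniform(-value_range, value_range)) for _ in range(k)]
    abs_stddev = rel_stddev * value_range
    points = []
    for i in range(num_points):
        center_id = i % k  # distribute points evenly over the centers
        cx, cy = centers[center_id]
        points.append((center_id,
                       rng.gauss(cx, abs_stddev),
                       rng.gauss(cy, abs_stddev)))
    return centers, points


def mean_distance_to_center(centers, points):
    """Average Euclidean distance of each point to the center it was drawn from."""
    total = 0.0
    for center_id, x, y in points:
        cx, cy = centers[center_id]
        total += math.hypot(x - cx, y - cy)
    return total / len(points)


if __name__ == "__main__":
    for rel in (0.03, 0.08, 0.15):
        centers, points = generate_points(500, 10, rel)
        print(rel, round(mean_distance_to_center(centers, points), 2))
```

Running the script prints the average distance for each setting; the value grows with the relative standard deviation, which is why the clusters look more diffuse at higher settings.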

The kmeans/ directory should now contain two files: centers and points. The points file contains the points to cluster and the centers file contains initial cluster centers.
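A quick way to confirm that both files were written is to count their entries. The sketch below assumes that each line holds whitespace-separated values with the two coordinates last (for example `x y` for points and `id x y` for centers); this layout is an assumption, so adjust the parsing if your files differ. The helper names are hypothetical.

```python
import os


def parse_coordinates(lines):
    """Return (x, y) tuples, reading the last two fields of each line.

    Assumption: fields are separated by whitespace and the coordinates
    come last on the line.
    """
    coords = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        coords.append((float(fields[-2]), float(fields[-1])))
    return coords


def load(path):
    """Parse one generated file from disk."""
    with open(path) as handle:
        return parse_coordinates(handle)


if __name__ == "__main__":
    # Run from inside the kmeans/ directory.
    for name in ("points", "centers"):
        if os.path.exists(name):
            print(name, "contains", len(load(name)), "entries")
        else:
            print(name, "not found; run the data generator first")
```

With the generator arguments used above, you would expect 500 entries in points and 10 in centers.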

Inspect the Input Data

Use the plotPoints.py Python script (available for download) to review the generated data points.

python plotPoints.py points ./points input

Note: You might have to install matplotlib (python-matplotlib package on Ubuntu) to use the Python script.

You can review the input data stored in input-plot.pdf, for example with Evince (evince input-plot.pdf).

The following overview presents the impact of the different standard deviations on the input data.

Figure panels: relative stddev = 0.03, relative stddev = 0.08, relative stddev = 0.15

Start Flink

Start Flink and the web job submission client on your local machine.

# return to the Flink root directory
cd ..
# start Flink
./bin/start-local.sh

Inspect and Run the K-Means Example Program

The Flink web interface lets you submit Flink programs through a graphical user interface.

Watch the job executing.

Shut Down Flink

Stop Flink when you are done.

# stop Flink
./bin/stop-local.sh

Analyze the Result

Use the Python script again to visualize the result.

cd kmeans
python plotPoints.py result ./result clusters

The following three pictures show the results for the sample input above. Play around with the parameters (number of iterations, number of clusters) to see how they affect the result.

Figure panels: relative stddev = 0.03, relative stddev = 0.08, relative stddev = 0.15
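To build intuition for how the number of iterations and the number of clusters shape the result, the following sketch implements the textbook K-Means loop (Lloyd's algorithm) in plain Python. It is not the Flink example's implementation; the `kmeans` function and its parameters are hypothetical names chosen for illustration.

```python
import math


def kmeans(points, centers, iterations):
    """Run a fixed number of K-Means iterations.

    points:  list of (x, y) tuples
    centers: list of initial (x, y) centers; its length is the cluster count
    Returns the final centers and the cluster index assigned to each point.
    """
    centers = list(centers)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, (x, y) in enumerate(points):
            assignments[i] = min(
                range(len(centers)),
                key=lambda c: math.hypot(x - centers[c][0], y - centers[c][1]))
        # Update step: move each center to the mean of its assigned points.
        for c in range(len(centers)):
            members = [points[i] for i, a in enumerate(assignments) if a == c]
            if members:  # keep the old center if no point was assigned to it
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignments


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    final_centers, labels = kmeans(sample, [(1.0, 1.0), (9.0, 9.0)], iterations=5)
    print(final_centers)
    print(labels)
```

Passing fewer initial centers merges nearby groups, and lowering `iterations` stops the centers before they settle, which mirrors the kind of experiments suggested above.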