title: "Quick Start: Run K-Means Example"

This guide demonstrates Flink's features by example. You will use Flink's iteration feature to find clusters in a dataset with K-Means clustering. Along the way, you will see the compiler, the status interface, and the result of the algorithm.

Generate Input Data

Flink contains a data generator for K-Means.

 # pick a mirror from: {{ site.FLINK_DOWNLOAD_URL_HADOOP_1_STABLE }}
 # for example
wget http://mirror.lwnetwork.org.uk/APACHE/incubator/flink/flink-0.6-incubating-bin-hadoop1.tgz
tar xzf flink-*.tgz 
cd flink-*
mkdir kmeans
cd kmeans
 # Run data generator
java -cp  ../examples/flink-java-examples-{{ site.FLINK_VERSION_STABLE }}-KMeans.jar org.apache.flink.example.java.clustering.util.KMeansDataGenerator 500 10 0.08
cp /tmp/points .
cp /tmp/centers .

The generator has the following arguments:

KMeansDataGenerator <numberOfDataPoints> <numberOfClusterCenters> [<relative stddev>] [<centroid range>] [<seed>]

The relative standard deviation is an interesting tuning parameter: it determines the closeness of the points to the centers.
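To get a feel for this parameter, here is a minimal Python sketch of how a relative standard deviation can control the spread of points around their centers. It is not the code of KMeansDataGenerator: the helper names, the example centers, and the rule "absolute spread = relative stddev times centroid range" are assumptions made for illustration.

```python
import random


def generate_points(num_points, centers, relative_stddev,
                    centroid_range=100.0, seed=42):
    """Sample 2-D points around the given centers (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: the absolute spread scales with the centroid range.
    absolute_stddev = relative_stddev * centroid_range
    points = []
    for i in range(num_points):
        cx, cy = centers[i % len(centers)]  # cycle through the centers
        points.append((rng.gauss(cx, absolute_stddev),
                       rng.gauss(cy, absolute_stddev)))
    return points


def mean_distance_to_center(points, centers):
    """Average Euclidean distance of each point to the center it was drawn from."""
    total = 0.0
    for i, (x, y) in enumerate(points):
        cx, cy = centers[i % len(centers)]
        total += ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
    return total / len(points)


# Hypothetical centers, chosen only for this example.
centers = [(-50.0, -50.0), (0.0, 40.0), (50.0, -10.0)]
for rel in (0.03, 0.08, 0.15):
    pts = generate_points(500, centers, rel)
    print(rel, round(mean_distance_to_center(pts, centers), 2))
```

The printed mean distance grows with the relative standard deviation: small values give tight, well-separated clusters, while large values make the clusters overlap.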

The kmeans/ directory should now contain two files: centers and points.

Review Input Data

Download the Python script plotPoints.py and use it to review the output of the data generator.

python plotPoints.py points points input

Note: You might have to install matplotlib (python-matplotlib package on Ubuntu) to use the Python script.

You can review the input data stored in input-plot.pdf, for example with Evince (evince input-plot.pdf).

The following overview presents the impact of the different standard deviations on the input data.

[Figure: three plots of the generated input data, for relative stddev = 0.03, relative stddev = 0.08, and relative stddev = 0.15]

Run Clustering

We now use the generated input data to run the clustering as a Flink job.

 # go to the Flink root directory (one level up from kmeans/)
cd ..
 # start Flink (use ./bin/start-cluster.sh if you're on a cluster)
./bin/start-local.sh
 # Start Flink web client
./bin/start-webclient.sh

Review Flink Compiler

The Flink web client lets you submit Flink programs through a graphical user interface.

<div class="col-md-6">
	1. Press the <b>RunJob</b> button to see the optimizer plan. <br>
	2. Inspect the operators and see the properties (input sizes, cost estimation) determined by the optimizer.
</div>

Analyze the Result

Use the Python script again to visualize the result.

python plotPoints.py result result result-pdf

The following three pictures show the results for the sample input above. Play around with the parameters (number of iterations, number of clusters) to see how they affect the result.

[Figure: three plots of the clustering result, for relative stddev = 0.03, relative stddev = 0.08, and relative stddev = 0.15]
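To make the two parameters mentioned above concrete, the following plain-Python sketch runs the standard K-Means loop (assign each point to its nearest center, then move each center to the mean of its points). It is not the Flink job from the examples jar; the function name and the toy data are hypothetical and only illustrate what the number of iterations and the number of clusters control.

```python
def kmeans(points, centers, num_iterations):
    """Plain K-Means on 2-D tuples; returns the final list of centers."""
    for _ in range(num_iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for x, y in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            groups[nearest].append((x, y))
        # Update step: move each center to the mean of its group.
        # A center with no assigned points stays where it is.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


# Toy data: two obvious groups, two initial centers (so two clusters).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans(points, [(1.0, 1.0), (9.0, 1.0)], num_iterations=5))
# prints [(0.0, 1.0), (10.0, 1.0)]
```

The number of clusters is simply the length of the initial center list, and more iterations give the centers more chances to move. On this toy data the centers settle after the first iteration, so additional iterations change nothing.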