title: "Quick Start: Run K-Means Example"
This guide walks you through executing an example program (K-Means clustering) on Flink. Along the way, you will see a visualization of the program and its optimized execution plan, and track the progress of its execution.
Follow the instructions to set up Flink and enter the root directory of your Flink setup.
Flink contains a data generator for K-Means.
    # Assuming you are in the root directory of your Flink setup
    mkdir kmeans
    cd kmeans
    # Run data generator
    java -cp ../examples/batch/KMeans.jar:../lib/flink-dist-{{ site.version }}.jar \
      org.apache.flink.examples.java.clustering.util.KMeansDataGenerator \
      -points 500 -k 10 -stddev 0.08 -output `pwd`
The generator has the following arguments (arguments in [] are optional):
    -points <num> -k <num clusters> [-output <output-path>] [-stddev <relative stddev>] [-range <centroid range>] [-seed <seed>]
The relative standard deviation is an interesting tuning parameter: it determines how closely the points are scattered around the randomly generated centers.
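To get a feel for this parameter without running the generator, the following plain-Python sketch mimics the idea. It is not the Flink generator itself: the function names are hypothetical, and it assumes that the absolute spread equals the relative stddev multiplied by the centroid range (with an assumed default range of 100.0).

```python
import random


def generate_points(num_points, k, rel_stddev=0.08, value_range=100.0, seed=0):
    """Return (centers, points); each point is drawn around one of the k centers."""
    rng = random.Random(seed)
    # Assumption: absolute spread = relative stddev * centroid range.
    abs_stddev = rel_stddev * value_range
    half = value_range / 2
    centers = [(rng.uniform(-half, half), rng.uniform(-half, half)) for _ in range(k)]
    points = []
    for i in range(num_points):
        cx, cy = centers[i % k]  # cycle through the centers
        points.append((rng.gauss(cx, abs_stddev), rng.gauss(cy, abs_stddev)))
    return centers, points


def mean_distance_to_own_center(centers, points):
    """Average Euclidean distance between each point and the center it was drawn around."""
    k = len(centers)
    total = 0.0
    for i, (x, y) in enumerate(points):
        cx, cy = centers[i % k]
        total += ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
    return total / len(points)


# A larger relative stddev spreads the points further from their centers.
for rel in (0.03, 0.08, 0.15):
    centers, points = generate_points(500, 10, rel_stddev=rel)
    print(rel, round(mean_distance_to_own_center(centers, points), 2))
```

The printed average distance grows with the relative stddev, which is why clusters become harder to separate at higher values.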
The kmeans/ directory should now contain two files: centers and points. The points file contains the points to cluster and the centers file contains initial cluster centers.
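If you want to inspect the two files programmatically, a small parser along the following lines may help. The file layout is an assumption here: each line of points is taken to hold two space-separated coordinates, and each line of centers an integer id followed by two coordinates. Check your generated files and adjust if they differ. The function names are hypothetical.

```python
def parse_points(text):
    """Parse lines of the assumed form 'x y' into a list of (x, y) tuples."""
    points = []
    for line in text.splitlines():
        if line.strip():
            x, y = line.split()
            points.append((float(x), float(y)))
    return points


def parse_centers(text):
    """Parse lines of the assumed form 'id x y' into a dict {id: (x, y)}."""
    centers = {}
    for line in text.splitlines():
        if line.strip():
            cid, x, y = line.split()
            centers[int(cid)] = (float(x), float(y))
    return centers


# Made-up sample content in the assumed layout.
sample_points = "1.2 2.3\n5.3 7.2\n"
sample_centers = "1 6.2 3.2\n2 2.9 5.7\n"
print(parse_points(sample_points))
print(parse_centers(sample_centers))
```

To use it on the generated data, read the file contents first, for example with `open("points").read()`.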
Use the plotPoints.py tool to review the generated data points (the Python script is available for download).
    python plotPoints.py points ./points input
Note: You might have to install matplotlib (python-matplotlib package on Ubuntu) to use the Python script.
You can review the input data stored in input-plot.pdf, for example with Evince (evince input-plot.pdf).
The following overview presents the impact of the different standard deviations on the input data.
[Figures: generated input data for relative stddev = 0.03, 0.08, and 0.15]
Start Flink and the web job submission client on your local machine.
    # return to the Flink root directory
    cd ..
    # start Flink
    ./bin/start-local.sh
The Flink web interface allows you to submit Flink programs through a graphical user interface.
Watch the job executing.
Stop Flink when you are done.
    # stop Flink
    ./bin/stop-local.sh
Use the Python script again to visualize the result.
    cd kmeans
    python plotPoints.py result ./result clusters
The following three pictures show the results for the sample input above. Play around with the parameters (number of iterations, number of clusters) to see how they affect the result.
[Figures: clustering results for relative stddev = 0.03, 0.08, and 0.15]
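To see why the number of iterations matters, here is a minimal plain-Python sketch of the standard K-Means loop (assign each point to its nearest center, then move each center to the mean of its points). It is an illustration only, not the Flink example's implementation, and the function names and sample data are hypothetical.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2 + (point[1] - centers[i][1]) ** 2,
    )


def kmeans(points, centers, iterations):
    """Run a fixed number of K-Means iterations and return the final centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Per center: running sum of x, sum of y, and point count.
        sums = [[0.0, 0.0, 0] for _ in centers]
        for p in points:
            s = sums[closest(p, centers)]
            s[0] += p[0]
            s[1] += p[1]
            s[2] += 1
        # Move each center to the mean of its points; empty clusters stay in place.
        centers = [
            (sx / n, sy / n) if n else c
            for (sx, sy, n), c in zip(sums, centers)
        ]
    return centers


# Two obvious groups of points and two deliberately poor starting centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
start = [(1.0, 1.0), (2.0, 1.0)]
for iterations in (0, 1, 2):
    print(iterations, kmeans(points, start, iterations))
```

On this tiny data set the centers settle after a single iteration; with noisier input (a higher relative stddev) more iterations are typically needed before the centers stop moving.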