The k-means algorithm can be used for unsupervised clustering of multivariate data. In this example we use a data set to predict forest cover type. There is also a paper published for the author of this work (PDF)
Here are the steps I used to prepare the data files. They are located under src/main/resources/
# Download data set and uncoompress. wget http://kdd.ics.uci.edu/databases/covertype/covtype.data.gz gunzip covtype.data.gz # Remove some columns and put the class label in the first column. gawk -F "," '{print $55, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' covtype.data > covtype-modified.data # Randomize data set. sort -R covtype-modified.data > covtype-random.data # Check number of data points. wc -l covtype-* # 581012 covtype-modified.data # 581012 covtype-random.data # Create a train and a test set. tail -100000 covtype-random.data > covtype-test.data head -481012 covtype-random.data > covtype-train.data wc -l covtype-{train,test}.data # 481012 covtype-train.data # 100000 covtype-test.data # 581012 total
To estimate the centroids of the clusters using k-means, we need to follow these steps:
Clearly, estimating the k-means centroids requires batch processing. That is we pass the same data through the application several times and we know exactly how many observation vectors we have available. With this knowledge we design the following application graph:
We choose to use the same EventType to communicate across Processing Elements:
The event is immutable and can only be created using the constructor. The fields are: