docs/latest/tutorials/map-reduce/clustering/clusteringyourdata.md - mahout - Git at Google

 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
 ---
 layout: doc-page
 title: (Deprecated)  ClusteringYourData


 ---

 # Clustering your data

 After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
 data. See also [Wikipedia on cluster analysis](en.wikipedia.org/wiki/Cluster_analysis) for more background.

 The following pieces *may* be useful for in getting started:

 <a name="ClusteringYourData-Input"></a>
 # Input

 For starters, you will need your data in an appropriate Vector format, see [Creating Vectors](../basics/creating-vectors.html).
 In particular for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html).


 <a name="ClusteringYourData-RunningtheProcess"></a>
 # Running the Process

 * [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html).

 * [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and
 [fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).

 * [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html).

 * [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html).

 * [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html).

 * TODO: kmeans++/ streaming kMeans documentation


 <a name="ClusteringYourData-RetrievingtheOutput"></a>
 # Retrieving the Output

 Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.

     ./bin/mahout clusterdump <OPTIONS>


 <a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
 ## The cluster dumper options are:

       --help (-h)				   Print out help

       --input (-i) input			   The directory containing Sequence
     					   Files for the Clusters

       --output (-o) output			   The output file.  If not specified,
     					   dumps to the console.

       --outputFormat (-of) outputFormat	   The optional output format to write
     					   the results as. Options: TEXT, CSV, or GRAPH_ML

       --substring (-b) substring		   The number of chars of the
     					   asFormatString() to print

       --pointsDir (-p) pointsDir		   The directory containing points
  					   sequence files mapping input vectors     					   to their cluster.  If specified,
     					   then the program will output the
     					   points associated with a cluster

       --dictionary (-d) dictionary		   The dictionary file.

       --dictionaryType (-dt) dictionaryType    The dictionary file type
     					   (text|sequencefile)

       --distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
     					   Default is SquaredEuclidean.

       --numWords (-n) numWords		   The number of top terms to print

       --tempDir tempDir			   Intermediate output directory

       --startPhase startPhase		   First phase to run

       --endPhase endPhase			   Last phase to run

       --evaluate (-e)			   Run ClusterEvaluator and CDbwEvaluator over the
     					   input. The output will be appended to the rest of
     					   the output at the end.


 More information on using clusterdump utility can be found [here](cluster-dumper.html)

 <a name="ClusteringYourData-ValidatingtheOutput"></a>
 # Validating the Output

 {quote}
 Ted Dunning: A principled approach to cluster evaluation is to measure how well the
 cluster membership captures the structure of unseen data.  A natural
 measure for this is to measure how much of the entropy of the data is
 captured by cluster membership.  For k-means and its natural L_2 metric,
 the natural cluster quality metric is the squared distance from the nearest
 centroid adjusted by the log_2 of the number of clusters.  This can be
 compared to the squared magnitude of the original data or the squared
 deviation from the centroid for all of the data.  The idea is that you are
 changing the representation of the data by allocating some of the bits in
 your original representation to represent which cluster each point is in.
 If those bits aren't made up by the residue being small then your
 clustering is making a bad trade-off.

 In the past, I have used other more heuristic measures as well.  One of the
 key characteristics that I would like to see out of a clustering is a
 degree of stability.  Thus, I look at the fractions of points that are
 assigned to each cluster or the distribution of distances from the cluster
 centroid. These values should be relatively stable when applied to held-out
 data.

 For text, you can actually compute perplexity which measures how well
 cluster membership predicts what words are used.  This is nice because you
 don't have to worry about the entropy of real valued numbers.

 Manual inspection and the so-called laugh test is also important.  The idea
 is that the results should not be so ludicrous as to make you laugh.
 Unfortunately, it is pretty easy to kid yourself into thinking your system
 is working using this kind of inspection.  The problem is that we are too
 good at seeing (making up) patterns.
 {quote}
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->
	---
	layout: doc-page
	title: (Deprecated) ClusteringYourData


	---

	# Clustering your data

	After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
	data. See also [Wikipedia on cluster analysis](en.wikipedia.org/wiki/Cluster_analysis) for more background.

	The following pieces may be useful for in getting started:

	<a name="ClusteringYourData-Input"></a>
	# Input

	For starters, you will need your data in an appropriate Vector format, see [Creating Vectors](../basics/creating-vectors.html).
	In particular for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html).


	<a name="ClusteringYourData-RunningtheProcess"></a>
	# Running the Process

	* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html).

	* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and
	[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).

	* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html).

	* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html).

	* [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html).

	* TODO: kmeans++/ streaming kMeans documentation


	<a name="ClusteringYourData-RetrievingtheOutput"></a>
	# Retrieving the Output

	Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.

	./bin/mahout clusterdump <OPTIONS>


	<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
	## The cluster dumper options are:

	--help (-h) Print out help

	--input (-i) input The directory containing Sequence
	Files for the Clusters

	--output (-o) output The output file. If not specified,
	dumps to the console.

	--outputFormat (-of) outputFormat The optional output format to write
	the results as. Options: TEXT, CSV, or GRAPH_ML

	--substring (-b) substring The number of chars of the
	asFormatString() to print

	--pointsDir (-p) pointsDir The directory containing points
	sequence files mapping input vectors to their cluster. If specified,
	then the program will output the
	points associated with a cluster

	--dictionary (-d) dictionary The dictionary file.

	--dictionaryType (-dt) dictionaryType The dictionary file type
	(text\|sequencefile)

	--distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure.
	Default is SquaredEuclidean.

	--numWords (-n) numWords The number of top terms to print

	--tempDir tempDir Intermediate output directory

	--startPhase startPhase First phase to run

	--endPhase endPhase Last phase to run

	--evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the
	input. The output will be appended to the rest of
	the output at the end.


	More information on using clusterdump utility can be found [here](cluster-dumper.html)

	<a name="ClusteringYourData-ValidatingtheOutput"></a>
	# Validating the Output

	{quote}
	Ted Dunning: A principled approach to cluster evaluation is to measure how well the
	cluster membership captures the structure of unseen data. A natural
	measure for this is to measure how much of the entropy of the data is
	captured by cluster membership. For k-means and its natural L_2 metric,
	the natural cluster quality metric is the squared distance from the nearest
	centroid adjusted by the log_2 of the number of clusters. This can be
	compared to the squared magnitude of the original data or the squared
	deviation from the centroid for all of the data. The idea is that you are
	changing the representation of the data by allocating some of the bits in
	your original representation to represent which cluster each point is in.
	If those bits aren't made up by the residue being small then your
	clustering is making a bad trade-off.

	In the past, I have used other more heuristic measures as well. One of the
	key characteristics that I would like to see out of a clustering is a
	degree of stability. Thus, I look at the fractions of points that are
	assigned to each cluster or the distribution of distances from the cluster
	centroid. These values should be relatively stable when applied to held-out
	data.

	For text, you can actually compute perplexity which measures how well
	cluster membership predicts what words are used. This is nice because you
	don't have to worry about the entropy of real valued numbers.

	Manual inspection and the so-called laugh test is also important. The idea
	is that the results should not be so ludicrous as to make you laugh.
	Unfortunately, it is pretty easy to kid yourself into thinking your system
	is working using this kind of inspection. The problem is that we are too
	good at seeing (making up) patterns.
	{quote}