docs/latest/tutorials/map-reduce/clustering/lda-commandline.md - mahout - Git at Google

 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
 ---
 layout: doc-page
 title: (Deprecated)  lda-commandline


 ---

 <a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
 # Running Latent Dirichlet Allocation (algorithm) from the Command Line
 [Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897)
  lda has been implemented as Collapsed Variable Bayes (cvb).

 Mahout's LDA can be launched from the same command line invocation whether
 you are running on a single machine in stand-alone mode or on a larger
 Hadoop cluster. The difference is determined by the $HADOOP_HOME and
 $HADOOP_CONF_DIR environment variables. If both are set to an operating
 Hadoop cluster on the target machine then the invocation will run the LDA
 algorithm on that cluster. If either of the environment variables are
 missing then the stand-alone Hadoop configuration will be invoked instead.


     ./bin/mahout cvb <OPTIONS>


 * In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
 will be generated in $MAHOUT_HOME/core/target/ and it's name will contain
 the Mahout version number. For example, when using Mahout 0.3 release, the
 job will be mahout-core-0.3.job


 <a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
 ## Testing it on one single machine w/o cluster

 * Put the data: cp <PATH TO DATA> testdata
 * Run the Job:

     ./bin/mahout cvb -i testdata <OTHER OPTIONS>


 <a name="lda-commandline-Runningitonthecluster"></a>
 ## Running it on the cluster

 * (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
 * Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
 * Run the Job:

     export HADOOP_HOME=<Hadoop Home Directory>
     export HADOOP_CONF_DIR=$HADOOP_HOME/conf
     ./bin/mahout cvb -i testdata <OTHER OPTIONS>

 * Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
 to view all outputs.

 <a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
 # Command line options from Mahout cvb version 0.8

     mahout cvb -h
       --input (-i) input					  Path to job input directory.
       --output (-o) output					  The directory pathname for output.
       --maxIter (-x) maxIter				  The maximum number of iterations.
       --convergenceDelta (-cd) convergenceDelta		  The convergence delta value
       --overwrite (-ow)					  If present, overwrite the output directory before running job
       --num_topics (-k) num_topics				  Number of topics to learn
       --num_terms (-nt) num_terms				  Vocabulary size
       --doc_topic_smoothing (-a) doc_topic_smoothing	  Smoothing for document/topic distribution
       --term_topic_smoothing (-e) term_topic_smoothing	  Smoothing for topic/term distribution
       --dictionary (-dict) dictionary			  Path to term-dictionary file(s) (glob expression supported)
       --doc_topic_output (-dt) doc_topic_output		  Output path for the training doc/topic distribution
       --topic_model_temp_dir (-mt) topic_model_temp_dir	  Path to intermediate model path (useful for restarting)
       --iteration_block_size (-block) iteration_block_size	  Number of iterations per perplexity check
       --random_seed (-seed) random_seed			  Random seed
       --test_set_fraction (-tf) test_set_fraction		  Fraction of data to hold out for testing
       --num_train_threads (-ntt) num_train_threads		  number of threads per mapper to train with
       --num_update_threads (-nut) num_update_threads	  number of threads per mapper to update the model with
       --max_doc_topic_iters (-mipd) max_doc_topic_iters	  max number of iterations per doc for p(topic|doc) learning
       --num_reduce_tasks num_reduce_tasks			  number of reducers to use during model estimation
       --backfill_perplexity 				  enable backfilling of missing perplexity values
       --help (-h)						  Print out help
       --tempDir tempDir					  Intermediate output directory
       --startPhase startPhase				  First phase to run
       --endPhase endPhase					  Last phase to run
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->
	---
	layout: doc-page
	title: (Deprecated) lda-commandline


	---

	<a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
	# Running Latent Dirichlet Allocation (algorithm) from the Command Line
	[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897)
	lda has been implemented as Collapsed Variable Bayes (cvb).

	Mahout's LDA can be launched from the same command line invocation whether
	you are running on a single machine in stand-alone mode or on a larger
	Hadoop cluster. The difference is determined by the $HADOOP_HOME and
	$HADOOP_CONF_DIR environment variables. If both are set to an operating
	Hadoop cluster on the target machine then the invocation will run the LDA
	algorithm on that cluster. If either of the environment variables are
	missing then the stand-alone Hadoop configuration will be invoked instead.



	./bin/mahout cvb <OPTIONS>


	* In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
	will be generated in $MAHOUT_HOME/core/target/ and it's name will contain
	the Mahout version number. For example, when using Mahout 0.3 release, the
	job will be mahout-core-0.3.job


	<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
	## Testing it on one single machine w/o cluster

	* Put the data: cp <PATH TO DATA> testdata
	* Run the Job:

	./bin/mahout cvb -i testdata <OTHER OPTIONS>


	<a name="lda-commandline-Runningitonthecluster"></a>
	## Running it on the cluster

	* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
	* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
	* Run the Job:

	export HADOOP_HOME=<Hadoop Home Directory>
	export HADOOP_CONF_DIR=$HADOOP_HOME/conf
	./bin/mahout cvb -i testdata <OTHER OPTIONS>

	* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
	to view all outputs.

	<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
	# Command line options from Mahout cvb version 0.8

	mahout cvb -h
	--input (-i) input Path to job input directory.
	--output (-o) output The directory pathname for output.
	--maxIter (-x) maxIter The maximum number of iterations.
	--convergenceDelta (-cd) convergenceDelta The convergence delta value
	--overwrite (-ow) If present, overwrite the output directory before running job
	--num_topics (-k) num_topics Number of topics to learn
	--num_terms (-nt) num_terms Vocabulary size
	--doc_topic_smoothing (-a) doc_topic_smoothing Smoothing for document/topic distribution
	--term_topic_smoothing (-e) term_topic_smoothing Smoothing for topic/term distribution
	--dictionary (-dict) dictionary Path to term-dictionary file(s) (glob expression supported)
	--doc_topic_output (-dt) doc_topic_output Output path for the training doc/topic distribution
	--topic_model_temp_dir (-mt) topic_model_temp_dir Path to intermediate model path (useful for restarting)
	--iteration_block_size (-block) iteration_block_size Number of iterations per perplexity check
	--random_seed (-seed) random_seed Random seed
	--test_set_fraction (-tf) test_set_fraction Fraction of data to hold out for testing
	--num_train_threads (-ntt) num_train_threads number of threads per mapper to train with
	--num_update_threads (-nut) num_update_threads number of threads per mapper to update the model with
	--max_doc_topic_iters (-mipd) max_doc_topic_iters max number of iterations per doc for p(topic\|doc) learning
	--num_reduce_tasks num_reduce_tasks number of reducers to use during model estimation
	--backfill_perplexity enable backfilling of missing perplexity values
	--help (-h) Print out help
	--tempDir tempDir Intermediate output directory
	--startPhase startPhase First phase to run
	--endPhase endPhase Last phase to run