| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| --- |
| layout: default |
| title: Dimensional Reduction |
| |
| |
| --- |
| |
| # Support for dimensional reduction |
| |
Matrix algebra underpins the way many Big Data algorithms and data
structures are composed: full-text search can be viewed as doing matrix
multiplication of the term-document matrix by the query vector, giving a
vector over documents where the components are the relevance scores;
computing co-occurrences in a collaborative filtering context (people who
viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest)
amounts to squaring the user-item interaction matrix; users who are
k degrees separated from each other in a social network or web-graph can be
found by looking at the k-fold product of the graph adjacency matrix; and
the list goes on. And these are all cases where the linear structure of the
matrix is preserved!
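
To make the first of these concrete, here is a toy sketch using
mahout-math's in-memory types (the matrix, query, and scores are made up
for illustration; real term-document matrices are sparse and vastly
larger):

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.Vector;

    public class SearchAsMatrixMultiply {
      public static void main(String[] args) {
        // Rows are documents, columns are terms (3 documents, 4 terms).
        Matrix docTerm = new DenseMatrix(new double[][] {
            {1, 0, 2, 0},
            {0, 1, 0, 1},
            {1, 1, 0, 0}});
        // A query asking for terms 0 and 2.
        Vector query = new DenseVector(new double[] {1, 0, 1, 0});
        // One multiply yields a relevance score per document.
        Vector scores = docTerm.times(query);
        System.out.println(scores); // document 0 scores highest here
      }
    }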
| |
Each of these examples deals with matrices which tend to be tremendously
large (often millions to tens of millions to hundreds of millions of rows
or more, by sometimes a comparable number of columns), but also rather
sparse. Sparse matrices are nice in some respects: dense matrices which are
10^7 on a side would have 100 trillion non-zero entries! But the sparsity
is often problematic, because any two given rows (or columns) of the matrix
may have zero overlap. Additionally, any machine-learning work done on the
data which comprises the rows has to deal with what is known as "the curse
of dimensionality", and, for example, there are too many columns to train
most regression or classification models on them independently.
| |
One of the more useful approaches to dealing with such huge sparse data
sets is the concept of dimensionality reduction, where a lower-dimensional
subspace of your data's original column (feature) space is found or
constructed, and your rows are mapped into that subspace (or sub-manifold).
In this reduced-dimensional space, the "important" components of the
distance between points are exaggerated and the unimportant ones washed
away; additionally, the sparsity of your rows is traded for drastically
lower-dimensional but dense "signatures". While this loss of sparsity can
lead to its own complications, a proper dimensionality reduction can help
reveal the most important features of your data, expose correlations among
your supposedly independent original variables, and smooth over the zeroes
in your correlation matrix.
| |
One of the most straightforward techniques for dimensionality reduction is
matrix decomposition: singular value decomposition, eigen-decomposition,
non-negative matrix factorization, etc. In their truncated form these
decompositions are an excellent first approach toward linearity-preserving
unsupervised feature selection and dimensional reduction. Of course, sparse
matrices which don't fit in RAM need special treatment as far as
decomposition is concerned: parallelizable and/or stream-oriented
algorithms are needed.
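
For intuition on the truncated form, here is a minimal single-node sketch
using mahout-math's JAMA-derived in-memory solver (the toy matrix and rank
are chosen arbitrarily; the Hadoop jobs described below exist precisely
because real matrices don't fit in RAM):

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SingularValueDecomposition;

    public class TruncatedSvdSketch {
      public static void main(String[] args) {
        Matrix a = new DenseMatrix(new double[][] {
            {2, 0, 1}, {0, 1, 0}, {1, 0, 2}, {0, 2, 0}});
        SingularValueDecomposition svd = new SingularValueDecomposition(a);
        int k = 2; // truncation rank: keep only the top-k singular triples
        // Project the rows of A onto the top-k right singular vectors: A * V_k.
        Matrix vk = svd.getV().viewPart(new int[] {0, 0},
                                        new int[] {a.columnSize(), k});
        Matrix reduced = a.times(vk); // 4 rows, now only k dense columns each
        System.out.println(reduced);
      }
    }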
| |
| <a name="DimensionalReduction-SingularValueDecomposition"></a> |
| # Singular Value Decomposition |
| |
Currently implemented in Mahout (as of 0.3, the first release with
MAHOUT-180 applied) are two scalable implementations of SVD: a
stream-oriented implementation using the Asymmetric Generalized Hebbian
Algorithm outlined in Genevieve Gorrell and Brandyn Webb's paper
([Gorrell and Webb 2005](http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf)),
and a [Lanczos](http://en.wikipedia.org/wiki/Lanczos_algorithm)
implementation, available both single-threaded in the
o.a.m.math.decomposer.lanczos package (math module) and as a (series of)
Hadoop map-reduce job(s) in the o.a.m.math.hadoop.decomposer package (core
module). Coming soon: stochastic decomposition.
| |
See also: [SVD - Singular Value Decomposition](https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition)
| |
| <a name="DimensionalReduction-Lanczos"></a> |
| ## Lanczos |
| |
The Lanczos algorithm is designed for eigen-decomposition, but like any
such algorithm, getting singular vectors out of it is immediate (the
singular vectors of a matrix A are just the eigenvectors of A^t * A, or of
A * A^t). Lanczos works by taking a starting seed vector *v* (with
cardinality equal to the number of columns of the matrix A) and repeatedly
multiplying A by the result: *v'* = A.times(*v*) (then subtracting off what
is proportional to previous *v'*'s, and building up an auxiliary matrix of
projections). In the case where A is not square (in general: not
symmetric), you actually want to repeatedly multiply A^t * A by *v*:
*v'* = (A^t * A).times(*v*), or equivalently, in Mahout,
A.timesSquared(*v*) (timesSquared is merely an optimization: by changing
the order of summation in (A^t * A).times(*v*), you can do the same
computation in one pass over the rows of A instead of two).
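
The one-pass trick is easy to see in a sketch with plain arrays (an
illustration of the idea, not Mahout's DistributedRowMatrix
implementation): each row a_i of A contributes (a_i . v) * a_i, and the
sum over all rows is exactly (A^t * A).times(*v*).

    public class TimesSquaredSketch {
      // One pass over the rows of A computes (A^t * A) * v by accumulating,
      // for each row a_i, the contribution (a_i . v) * a_i.
      static double[] timesSquared(double[][] rowsOfA, double[] v) {
        double[] result = new double[v.length];
        for (double[] row : rowsOfA) {
          double dot = 0.0;
          for (int j = 0; j < v.length; j++) {
            dot += row[j] * v[j];
          }
          for (int j = 0; j < v.length; j++) {
            result[j] += dot * row[j];
          }
        }
        return result;
      }

      public static void main(String[] args) {
        double[][] a = {{1, 0, 2}, {0, 1, 0}}; // toy 2 x 3 matrix
        double[] v = {1, 1, 1};
        System.out.println(java.util.Arrays.toString(timesSquared(a, v)));
      }
    }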
| |
After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*-by-*k*
tridiagonal matrix has been created (the auxiliary matrix mentioned above),
out of which a good (often extremely good) approximation to *k* of the
singular values of A may be efficiently extracted (and, with the basis
spanned by the *v_i*, the corresponding *k* singular *vectors* as well).
Which *k*? The approximations are actually spread across the entire
spectrum: the first few will almost certainly be the largest singular
values, and the bottom few the smallest, but there is no guarantee that
just because you have the n'th-largest singular value of A, you also have
the (n-1)'st. A good rule of thumb, if you primarily want the largest
singular values (which is the case when using Lanczos for dimensional
reduction), is to extract the top 3k singular vectors via Lanczos and then
discard the bottom two thirds.
| |
| <a name="DimensionalReduction-ParallelizationStragegy"></a> |
### Parallelization Strategy
| |
| Lanczos is "embarassingly parallelizable": matrix multiplication of a |
| matrix by a vector may be carried out row-at-a-time without communication |
| until at the end, the results of the intermediate matrix-by-vector outputs |
| are accumulated on one final vector. When it's truly A.times(*v*), the |
| final accumulation doesn't even have collision / synchronization issues |
| (the outputs are individual separate entries on a single vector), and |
| multicore approaches can be very fast, and there should also be tricks to |
| speed things up on Hadoop. In the asymmetric case, where the operation is |
| A.timesSquared(*v*), the accumulation does require synchronization (the |
| vectors to be summed have nonzero elements all across their range), but |
| delaying writing to disk until Mapper close(), and remembering that having |
| a Combiner be the same as the Reducer, the bottleneck in accumulation is |
| nowhere near a single point. |
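
A rough single-machine analogue of this strategy (hypothetical class names;
Mahout's actual Hadoop job is organized differently) gives each worker a
partition of rows, accumulates a local partial sum using the timesSquared
sketch above, and combines the dense partial vectors at the end:

    import java.util.List;

    public class TimesSquaredParallelSketch {
      // Each partition of rows yields a dense partial result via the
      // timesSquared sketch above; summing the partial vectors is the only
      // step needing coordination (the Combiner/Reducer role on Hadoop).
      static double[] timesSquaredParallel(List<double[][]> partitions,
                                           double[] v) {
        return partitions.parallelStream()
            .map(rows -> TimesSquaredSketch.timesSquared(rows, v))
            .reduce((a, b) -> {
              double[] sum = new double[a.length];
              for (int j = 0; j < a.length; j++) {
                sum[j] = a[j] + b[j];
              }
              return sum;
            })
            .orElse(new double[v.length]);
      }
    }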
| |
| <a name="DimensionalReduction-Mahoutusage"></a> |
| ### Mahout usage |
| |
The Mahout DistributedLanczosSolver is invoked by the
<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
arguments (which can be reproduced by just entering the command with no
arguments):
| |
| |
    Job-Specific Options:
      --input (-i) input                      Path to job input directory.
      --output (-o) output                    The directory pathname for output.
      --numRows (-nr) numRows                 Number of rows of the input matrix.
      --numCols (-nc) numCols                 Number of columns of the input matrix.
      --rank (-r) rank                        Desired decomposition rank (note:
                                              only roughly 1/4 to 1/3 of these will
                                              have the top portion of the spectrum).
      --symmetric (-sym) symmetric            Is the input matrix square and
                                              symmetric?
      --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
                                              the eigenvectors after SVD.
      --maxError (-err) maxError              Maximum acceptable error.
      --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector for.
      --inMemory (-mem) inMemory              Buffer eigen matrix into memory
                                              (if you have enough!).
      --help (-h)                             Print out help.
      --tempDir tempDir                       Intermediate output directory.
      --startPhase startPhase                 First phase to run.
      --endPhase endPhase                     Last phase to run.
| |
| |
| The short form invocation may be used to perform the SVD on the input data: |
| |
    <MAHOUT_HOME>/bin/mahout svd \
      --input (-i) <Path to input matrix> \
      --output (-o) <The directory pathname for output> \
      --numRows (-nr) <Number of rows of the input matrix> \
      --numCols (-nc) <Number of columns of the input matrix> \
      --rank (-r) <Desired decomposition rank> \
      --symmetric (-sym) <Is the input matrix square and symmetric>
| |
| |
The --input argument is the location on HDFS of a
SequenceFile<Writable,VectorWritable> (preferably containing
SequentialAccessSparseVector instances) which you wish to decompose. Each
of its vectors has --numCols entries. --numRows is the number of input rows
and is used to properly size the matrix data structures.
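
If your data isn't already in this form, a minimal sketch of writing a
suitable input file might look like the following (the path and toy
dimensions are placeholders; this uses the Hadoop 0.20-era SequenceFile
API that Mahout builds against):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteSvdInput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("svd-input/matrix"); // placeholder path
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
            IntWritable.class, VectorWritable.class);
        try {
          int numCols = 5; // toy dimension
          for (int row = 0; row < 3; row++) {
            Vector v = new SequentialAccessSparseVector(numCols);
            v.set(row, 1.0); // one nonzero entry per row, for illustration
            writer.append(new IntWritable(row), new VectorWritable(v));
          }
        } finally {
          writer.close();
        }
      }
    }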
| |
After execution, the --output directory will have a file named
"rawEigenvectors" containing the raw eigenvectors. As the
DistributedLanczosSolver sometimes produces "extra" eigenvectors whose
eigenvalues aren't valid, and also scales all of the eigenvalues down by
the maximum eigenvalue (to avoid floating-point overflow), there is an
additional step which emits the correctly scaled (and non-spurious)
eigenvector/value pairs. This is done by the "cleansvd" shell script step
(c.f. EigenVerificationJob).
| |
If you have run the short form svd invocation above and require this
"cleaning" of the eigen/singular output, you can run "cleansvd" as a
separate command:
| |
    <MAHOUT_HOME>/bin/mahout cleansvd \
      --eigenInput <path to raw eigenvectors> \
      --corpusInput <path to corpus> \
      --output <path to output directory> \
      --maxError <maximum allowed error. Default is 0.5> \
      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
      --inMemory <true if the eigenvectors can all fit into memory. Default false>
| |
| |
The --corpusInput is the input path from the previous step, --eigenInput is
the output from the previous step (<output>/rawEigenvectors), and --output
is the desired output path (same as the svd argument). The two "cleaning"
params are --maxError, the maximum allowed
1 - cosAngle(v, A.timesSquared(v)), and --minEigenvalue. Eigenvectors which
have too large an error, or too small an eigenvalue, are discarded.
Optional argument: --inMemory; if you have enough memory on your local
machine (not on the Hadoop cluster nodes!) to load all eigenvectors into
memory at once (at least 8 bytes/double * rank * numCols), you will see
some speedup in this cleaning process.
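
In sketch form (reusing the timesSquared sketch from the Lanczos section
above; this mirrors the criterion, not EigenVerificationJob's exact code),
the error being bounded is:

    public class EigenErrorSketch {
      // 1 - cosAngle(v, A.timesSquared(v)): for a true eigenvector of
      // A^t * A, timesSquared(v) is parallel to v and this error is ~0.
      static double eigenError(double[][] rowsOfA, double[] v) {
        double[] av = TimesSquaredSketch.timesSquared(rowsOfA, v);
        double dot = 0.0, normV = 0.0, normAv = 0.0;
        for (int j = 0; j < v.length; j++) {
          dot += v[j] * av[j];
          normV += v[j] * v[j];
          normAv += av[j] * av[j];
        }
        return 1.0 - dot / Math.sqrt(normV * normAv);
      }
    }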
| |
| After execution, the --output directory will have a file named |
| "cleanEigenvectors" containing the clean eigenvectors. |
| |
These two steps can also be invoked together via the long form svd
invocation:
| |
    <MAHOUT_HOME>/bin/mahout svd \
      --input (-i) <Path to input matrix> \
      --output (-o) <The directory pathname for output> \
      --numRows (-nr) <Number of rows of the input matrix> \
      --numCols (-nc) <Number of columns of the input matrix> \
      --rank (-r) <Desired decomposition rank> \
      --symmetric (-sym) <Is the input matrix square and symmetric> \
      --cleansvd "true" \
      --maxError <maximum allowed error. Default is 0.5> \
      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
      --inMemory <true if the eigenvectors can all fit into memory. Default false>
| |
| |
| After execution, the --output directory will contain two files: the |
| "rawEigenvectors" and the "cleanEigenvectors". |
| |
| TODO: also allow exclusion based on improper orthogonality (currently |
| computed, but not checked against constraints). |
| |
| <a name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a> |
| #### Example: SVD of ASF Mail Archives on Amazon Elastic MapReduce |
| |
This section walks you through a complete example of running the Mahout SVD
job on an Amazon Elastic MapReduce cluster and then preparing the output to
be used for clustering. This example was developed as part of the effort to
benchmark Mahout's clustering algorithms using a large document set (see [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)).
Specifically, we use the ASF mail archives located at
http://aws.amazon.com/datasets/7791434387204566. You will likely need to
run seq2sparse on these first; see
$MAHOUT_HOME/examples/bin/build-asf-email.sh (on trunk) for examples of
processing this data.
| |
| At a high level, the steps we're going to perform are: |
| |
    bin/mahout svd (original -> svdOut)
    bin/mahout cleansvd ...
    bin/mahout transpose svdOut -> svdT
    bin/mahout transpose original -> originalT
    bin/mahout matrixmult originalT svdT -> newMatrix
    bin/mahout kmeans newMatrix
| |
| The bulk of the content for this section was extracted from the Mahout user |
| mailing list, see: [Using SVD with Canopy/KMeans](http://search.lucidimagination.com/search/document/6e5889ee6f0f253b/using_svd_with_canopy_kmeans#66a50fe017cebbe8) |
| and [Need a little help with using SVD](http://search.lucidimagination.com/search/document/748181681ae5238b/need_a_little_help_with_using_svd#134fb2771fd52928) |
| |
| Note: Some of this work is due in part to credits donated by the Amazon |
| Elastic MapReduce team. |
| |
| <a name="DimensionalReduction-1.LaunchEMRCluster"></a> |
| ##### 1. Launch EMR Cluster |
| |
For a detailed explanation of the steps involved in launching an Amazon
Elastic MapReduce cluster for running Mahout jobs, please read the
"Building Vectors for Large Document Sets" section of
[Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce).
| |
| In the remaining steps below, remember to replace JOB_ID with the Job ID of |
| your EMR cluster. |
| |
| <a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a> |
| ##### 2. Load Mahout 0.5+ JAR into S3 |
| |
These steps were created with the mahout-0.5-SNAPSHOT because they rely on
the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
| |
| <a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a> |
| ##### 3. Copy TFIDF Vectors into HDFS |
| |
| Before running your SVD job on the vectors, you need to copy them from S3 |
| to your EMR cluster's HDFS. |
| |
| |
    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
      --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
      -j JOB_ID
| |
| |
| <a name="DimensionalReduction-4.RuntheSVDJob"></a> |
| ##### 4. Run the SVD Job |
| |
| Now you're ready to run the SVD job on the vectors stored in HDFS: |
| |
| |
    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
      --main-class org.apache.mahout.driver.MahoutDriver \
      --arg svd \
      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
      --arg -o --arg /asf-mail-archives/mahout/svd \
      --arg --rank --arg 100 \
      --arg --numCols --arg 20444 \
      --arg --numRows --arg 6076937 \
      --arg --cleansvd --arg "true" \
      -j JOB_ID
| |
| |
| This will run 100 iterations of the LanczosSolver SVD job to produce 87 |
| eigenvectors in: |
| |
| |
    /asf-mail-archives/mahout/svd/cleanEigenvectors
| |
| |
Only 87 eigenvectors were produced because of the cleanup step, which
removes any duplicate eigenvectors caused by convergence issues and numeric
overflow, and any that don't appear to be "eigen" enough (i.e., they don't
satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix
| |
| <a name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a> |
| ##### 5. Transform your TFIDF Vectors into Mahout Matrix |
| |
The tfidf vectors created by the seq2sparse job are a
SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
vectors into matrix form: a SequenceFile<IntWritable,VectorWritable> and a
SequenceFile<IntWritable,Text> (where the original file is the join of
these two new ones, on the new int key).
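
To map matrix rows back to their original document keys later (e.g., after
clustering), you can read the docIndex side of that join. A minimal sketch
(the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadDocIndex {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("tfidf-matrix/docIndex"); // placeholder path
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable row = new IntWritable();
        Text docId = new Text();
        while (reader.next(row, docId)) { // int row key -> original Text key
          System.out.println(row.get() + " -> " + docId);
        }
        reader.close();
      }
    }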
| |
| |
    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
      --main-class org.apache.mahout.driver.MahoutDriver \
      --arg rowid \
      --arg -Dmapred.input.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
      --arg -Dmapred.output.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix \
      -j JOB_ID
| |
| |
| This is not a distributed job and will only run on the master server in |
| your EMR cluster. The job produces the following output: |
| |
| |
    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/docIndex
    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix
| |
| |
| where docIndex is the SequenceFile<IntWritable,Text> and matrix is |
| SequenceFile<IntWritable,VectorWritable>. |
| |
| <a name="DimensionalReduction-6.TransposetheMatrix"></a> |
| ##### 6. Transpose the Matrix |
| |
Our ultimate goal is to multiply the TFIDF vector matrix by our SVD
eigenvectors. For the mathematically inclined: from the rowid job, we now
have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
is p x n (p=87, n=20444). To multiply these two matrices, we need to
transpose E so that the number of columns in T equals the number of rows in
E (i.e. E^T is n x p); the result of the matrixmult is then an m x p matrix
(m=6076937, p=87).
| |
However, in practice, computing the matrix product of two matrices as a
map-reduce job is efficiently done as a map-side join on two row-based
matrices with the same number of rows, where the columns are the ones which
differ. In particular, if you take a matrix X which is represented as a set
of numRowsX rows, each of which has numColsX columns, and another matrix Y
with numRowsY == numRowsX rows, each of which has numColsY (!= numColsX)
columns, then by summing the outer products of each of the numRowsX pairs
of vectors, you get a matrix with numRowsZ == numColsX and numColsZ ==
numColsY (if you instead take the reverse outer product of the vector
pairs, you end up with the transpose of this final result, with numRowsZ ==
numColsY and numColsZ == numColsX). - Jake Mannix
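
In sketch form, with plain arrays at toy scale (the real job streams row
pairs through map-reduce rather than materializing anything in memory),
summing outer products of row pairs from X and Y yields X^t * Y:

    public class OuterProductSumSketch {
      // Summing the outer products of corresponding rows of X and Y produces
      // Z = X^t * Y: numRowsZ == numColsX and numColsZ == numColsY.
      static double[][] outerProductSum(double[][] x, double[][] y) {
        int numColsX = x[0].length;
        int numColsY = y[0].length;
        double[][] z = new double[numColsX][numColsY];
        for (int i = 0; i < x.length; i++) { // one row pair per input row
          for (int a = 0; a < numColsX; a++) {
            for (int b = 0; b < numColsY; b++) {
              z[a][b] += x[i][a] * y[i][b];
            }
          }
        }
        return z;
      }
    }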
| |
| Thus, we need to transpose the matrix using Mahout's Transpose Job: |
| |
| |
    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
      --main-class org.apache.mahout.driver.MahoutDriver \
      --arg transpose \
      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix \
      --arg --numRows --arg 6076937 \
      --arg --numCols --arg 20444 \
      --arg --tempDir --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
      -j JOB_ID
| |
| |
This job requires the patch to [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
| |
| The job creates the following output: |
| |
| |
    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose
| |
| |
| <a name="DimensionalReduction-7.TransposeEigenvectors"></a> |
| ##### 7. Transpose Eigenvectors |
| |
| If you followed Jake's explanation in step 6 above, then you know that we |
| also need to transpose the eigenvectors: |
| |
| |
    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
      --main-class org.apache.mahout.driver.MahoutDriver \
      --arg transpose \
      --arg -i --arg /asf-mail-archives/mahout/svd/cleanEigenvectors \
      --arg --numRows --arg 87 \
      --arg --numCols --arg 20444 \
      --arg --tempDir --arg /asf-mail-archives/mahout/svd/transpose \
      -j JOB_ID
| |
| |
| Note: You need to use the same number of reducers that was used for |
| transposing the matrix you are multiplying the vectors with. |
| |
| The job creates the following output: |
| |
| |
    /asf-mail-archives/mahout/svd/transpose
| |
| |
| <a name="DimensionalReduction-8.MatrixMultiplication"></a> |
| ##### 8. Matrix Multiplication |
| |
Lastly, we need to multiply the two transposed matrices using Mahout's
matrixmult job:
| |
| |
    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
      --main-class org.apache.mahout.driver.MahoutDriver \
      --arg matrixmult \
      --arg --numRowsA --arg 20444 \
      --arg --numColsA --arg 6076937 \
      --arg --numRowsB --arg 20444 \
      --arg --numColsB --arg 87 \
      --arg --inputPathA --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
      --arg --inputPathB --arg /asf-mail-archives/mahout/svd/transpose \
      -j JOB_ID
| |
| |
| This job produces output such as: |
| |
| |
    /user/hadoop/productWith-189
| |
| |
| <a name="DimensionalReduction-Resources"></a> |
| # Resources |
| |
| * [LSA tutorial](http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm) |
| * [SVD tutorial](http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html) |