users/basics/matrix-and-vector-needs.md - mahout - Git at Google

 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
 ---
 layout: default
 title: Matrix and Vector Needs


 ---

 <a name="MatrixandVectorNeeds-Intro"></a>
 # Intro

 Most ML algorithms require the ability to represent multidimensional data
 concisely and to be able to easily perform common operations on that data.
 MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
 along with a set of common operations on their instances. Vectors and
 matrices are provided with sparse and dense implementations that are memory
 resident and are suitable for manipulating intermediate results within
 mapper, combiner and reducer implementations. They are not intended for
 applications requiring vectors or matrices that exceed the size of a single
 JVM, though such applications might be able to utilize them within a larger
 organizing framework.

 <a name="MatrixandVectorNeeds-Background"></a>
 ## Background

 See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser)

 <a name="MatrixandVectorNeeds-Vectors"></a>
 ## Vectors

 Mahout supports a Vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements vectors as a double[](.html)
  that is storage and access efficient. The class SparseVector implements
 vectors as a HashMap<Integer, Double> that is surprisingly fast and
 efficient. For sparse vectors, the size() method returns the current number
 of elements whereas the cardinality() method returns the number of
 dimensions it holds. An additional VectorView class allows views of an
 underlying vector to be specified by the viewPart() method. See the
 JavaDocs for more complete definitions.

 <a name="MatrixandVectorNeeds-Matrices"></a>
 ## Matrices

 Mahout also supports a Matrix interface that defines a similar set of operations over all implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements matrices as a double[](.html)
 [] that is storage and access efficient. The class SparseRowMatrix
 implements matrices as a Vector[] holding the rows of the matrix in a
 SparseVector, and the symmetric class SparseColumnMatrix implements
 matrices as a Vector[] holding the columns in a SparseVector. Each of these
 classes can quickly produce a given row or column, respectively. A fourth
 class SparseMatrix, uses a HashMap<Integer, Vector> which is also a
 SparseVector. For sparse matrices, the size() method returns an int\[2\]
 containing the actual row and column sizes whereas the cardinality() method
 returns an int\[2\] with the number of dimensions of each. An additional
 MatrixView class allows views of an underlying matrix to be specified by
 the viewPart() method. See the JavaDocs for more complete definitions.

 The Matrix interface does not currently provide invert or determinant
 methods, though these are desirable. It is arguable that the
 implementations of SparseRowMatrix and SparseColumnMatrix ought to use the
 HashMap<Integer, Vector> implementations and that SparseMatrix should
 instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of
 sparse matrices can also be envisioned that support different storage and
 access characteristics. Because the arguments of assignColumn and assignRow
 operations accept all forms of Vector, it is possible to construct
 instances of sparse matrices containing dense rows or columns. See the
 JavaDocs for more complete definitions.

 For applications like PageRank/TextRank, iterative approaches to calculate
 eigenvectors would also be useful. Batching of row/column operations would
 also be useful, such as perhaps assignRow or assighColumn accepting
 UnaryFunction and BinaryFunction arguments.


 <a name="MatrixandVectorNeeds-Ideas"></a>
 ## Ideas

 As Vector and Matrix implementations are currently memory-resident, very
 large instances greater than available memory are not supported. An
 extended set of implementations that use HBase (BigTable) in Hadoop to
 represent their instances would facilitate applications requiring such
 large collections.
 See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6)
 See [Hama](http://wiki.apache.org/hadoop/Hama)


 <a name="MatrixandVectorNeeds-References"></a>
 ## References

 Have a look at the old parallel computing libraries like [ScalaPACK](http://www.netlib.org/scalapack/)
 , others
	<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	-->
	---
	layout: default
	title: Matrix and Vector Needs


	---

	<a name="MatrixandVectorNeeds-Intro"></a>
	# Intro

	Most ML algorithms require the ability to represent multidimensional data
	concisely and to be able to easily perform common operations on that data.
	MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
	along with a set of common operations on their instances. Vectors and
	matrices are provided with sparse and dense implementations that are memory
	resident and are suitable for manipulating intermediate results within
	mapper, combiner and reducer implementations. They are not intended for
	applications requiring vectors or matrices that exceed the size of a single
	JVM, though such applications might be able to utilize them within a larger
	organizing framework.

	<a name="MatrixandVectorNeeds-Background"></a>
	## Background

	See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser)

	<a name="MatrixandVectorNeeds-Vectors"></a>
	## Vectors

	Mahout supports a Vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements vectors as a double[](.html)
	that is storage and access efficient. The class SparseVector implements
	vectors as a HashMap<Integer, Double> that is surprisingly fast and
	efficient. For sparse vectors, the size() method returns the current number
	of elements whereas the cardinality() method returns the number of
	dimensions it holds. An additional VectorView class allows views of an
	underlying vector to be specified by the viewPart() method. See the
	JavaDocs for more complete definitions.

	<a name="MatrixandVectorNeeds-Matrices"></a>
	## Matrices

	Mahout also supports a Matrix interface that defines a similar set of operations over all implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements matrices as a double[](.html)
	[] that is storage and access efficient. The class SparseRowMatrix
	implements matrices as a Vector[] holding the rows of the matrix in a
	SparseVector, and the symmetric class SparseColumnMatrix implements
	matrices as a Vector[] holding the columns in a SparseVector. Each of these
	classes can quickly produce a given row or column, respectively. A fourth
	class SparseMatrix, uses a HashMap<Integer, Vector> which is also a
	SparseVector. For sparse matrices, the size() method returns an int\[2\]
	containing the actual row and column sizes whereas the cardinality() method
	returns an int\[2\] with the number of dimensions of each. An additional
	MatrixView class allows views of an underlying matrix to be specified by
	the viewPart() method. See the JavaDocs for more complete definitions.

	The Matrix interface does not currently provide invert or determinant
	methods, though these are desirable. It is arguable that the
	implementations of SparseRowMatrix and SparseColumnMatrix ought to use the
	HashMap<Integer, Vector> implementations and that SparseMatrix should
	instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of
	sparse matrices can also be envisioned that support different storage and
	access characteristics. Because the arguments of assignColumn and assignRow
	operations accept all forms of Vector, it is possible to construct
	instances of sparse matrices containing dense rows or columns. See the
	JavaDocs for more complete definitions.

	For applications like PageRank/TextRank, iterative approaches to calculate
	eigenvectors would also be useful. Batching of row/column operations would
	also be useful, such as perhaps assignRow or assighColumn accepting
	UnaryFunction and BinaryFunction arguments.


	<a name="MatrixandVectorNeeds-Ideas"></a>
	## Ideas

	As Vector and Matrix implementations are currently memory-resident, very
	large instances greater than available memory are not supported. An
	extended set of implementations that use HBase (BigTable) in Hadoop to
	represent their instances would facilitate applications requiring such
	large collections.
	See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6)
	See [Hama](http://wiki.apache.org/hadoop/Hama)


	<a name="MatrixandVectorNeeds-References"></a>
	## References

	Have a look at the old parallel computing libraries like [ScalaPACK](http://www.netlib.org/scalapack/)
	, others