---
layout: global
title: SystemML Algorithms Reference - Matrix Factorization
displayTitle: <a href="algorithms-reference.html">SystemML Algorithms Reference</a>
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

# 5 Matrix Factorization


## 5.1 Principal Component Analysis

### Description

Principal Component Analysis (PCA) is a simple, non-parametric method to
transform a given data set with possibly correlated columns into a set
of linearly uncorrelated or orthogonal columns, called *principal
components*. The principal components are ordered in such a way
that the first component accounts for the largest possible variance,
followed by the remaining principal components in decreasing order of
the amount of variance captured from the data. PCA is often used as a
dimensionality reduction technique, where the original data is projected
or rotated onto a low-dimensional space with basis vectors defined by the
top-$K$ (for a given value of $K$) principal components.


### Usage

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f PCA.dml
                            -nvargs INPUT=<file>
                                    K=<int>
                                    CENTER=[int]
                                    SCALE=[int]
                                    PROJDATA=<int>
                                    OFMT=[format]
                                    MODEL=<file>
                                    OUTPUT=<file>

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f PCA.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs INPUT=<file>
                                         K=<int>
                                         CENTER=[int]
                                         SCALE=[int]
                                         PROJDATA=<int>
                                         OFMT=[format]
                                         MODEL=<file>
                                         OUTPUT=<file>

</div>
</div>

#### Arguments

**INPUT**: Location (on HDFS) to read the input matrix.

**K**: Indicates the dimension of the new vector space constructed from the $K$
principal components. It must be a value between `1` and the number
of columns in the input data.

**CENTER**: (default: `0`) `0` or `1`. Indicates whether or not to
*center* the input data prior to the computation of
principal components.

**SCALE**: (default: `0`) `0` or `1`. Indicates whether or not to
*scale* the input data prior to the computation of
principal components.

**PROJDATA**: `0` or `1`. Indicates whether or not the input data should be
projected onto the new vector space defined by the principal components.

**OFMT**: (default: `"csv"`) Matrix file output format, such as `text`,
`mm`, or `csv`; see read/write functions in
SystemML Language Reference for details.

**MODEL**: Either the location (on HDFS) where the computed model is
stored, or the location of an existing model.

**OUTPUT**: Location (on HDFS) to store the data rotated onto the new
vector space.


#### Examples

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f PCA.dml
                            -nvargs INPUT=/user/ml/input.mtx
                                    K=10
                                    CENTER=1
                                    SCALE=1
                                    OFMT=csv
                                    PROJDATA=1
                                    OUTPUT=/user/ml/pca_output/

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f PCA.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs INPUT=/user/ml/input.mtx
                                         K=10
                                         CENTER=1
                                         SCALE=1
                                         OFMT=csv
                                         PROJDATA=1
                                         OUTPUT=/user/ml/pca_output/

</div>
</div>

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f PCA.dml
                            -nvargs INPUT=/user/ml/test_input.mtx
                                    K=10
                                    CENTER=1
                                    SCALE=1
                                    OFMT=csv
                                    PROJDATA=1
                                    MODEL=/user/ml/pca_output/
                                    OUTPUT=/user/ml/test_output.mtx

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f PCA.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs INPUT=/user/ml/test_input.mtx
                                         K=10
                                         CENTER=1
                                         SCALE=1
                                         OFMT=csv
                                         PROJDATA=1
                                         MODEL=/user/ml/pca_output/
                                         OUTPUT=/user/ml/test_output.mtx

</div>
</div>


#### Details

Principal Component Analysis (PCA) is a non-parametric procedure for
orthogonal linear transformation of the input data to a new coordinate
system, such that the greatest variance by some projection of the data
comes to lie on the first coordinate (called the first principal
component), the second greatest variance on the second coordinate, and
so on. In other words, PCA first selects a normalized direction in
$m$-dimensional space ($m$ is the number of columns in the input data)
along which the variance in the input data is maximized; this is referred
to as the first principal component. It then repeatedly finds other
directions (principal components) in which the variance is maximized. At
every step, PCA restricts the search to only those directions that are
perpendicular to all previously selected directions. By doing so, PCA
aims to reduce the redundancy among input variables. To understand the
notion of redundancy, consider an extreme scenario with a data set
comprising two variables, where the first one denotes some quantity
expressed in meters, and the other variable represents the same quantity
but in inches. Both variables evidently capture redundant information,
and hence one of them can be removed. In a general scenario, keeping
only a few suitable linear combinations of the input variables both
expresses the data more concisely and reduces the number of variables.
This is why PCA is often used as a dimensionality reduction technique.

The specific method to compute such a new coordinate system is as
follows: compute a covariance matrix $C$ that measures the strength of
correlation among all pairs of variables in the input data; factorize
$C$ via eigendecomposition to calculate its eigenvalues and
eigenvectors; and finally, order the eigenvectors in decreasing order of
their corresponding eigenvalues. The computed eigenvectors (also known
as *loadings*) define the new coordinate system, and the eigenvalues
specify the amount of variance in the input data explained by each
coordinate or eigenvector (their square roots give the corresponding
standard deviations of the principal components).
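
For intuition, the following is a minimal NumPy sketch of this procedure
(illustrative only; the actual computation is implemented in `PCA.dml`, and
all function and variable names here are chosen for the example):

    import numpy as np

    def pca(A, K, center=True, scale=False):
        """Sketch of PCA via covariance eigendecomposition."""
        A = A.astype(float)
        if center:                            # CENTER=1
            A = A - A.mean(axis=0)
        if scale:                             # SCALE=1
            A = A / A.std(axis=0, ddof=1)
        C = np.cov(A, rowvar=False)           # covariance among columns
        evals, evecs = np.linalg.eigh(C)      # eigendecomposition (C symmetric)
        order = np.argsort(evals)[::-1]       # decreasing eigenvalue order
        evals, evecs = evals[order], evecs[:, order]
        V = evecs[:, :K]                      # loadings: top-K components
        # eigenvalues, standard deviations, and the PROJDATA=1 projection
        return V, evals[:K], np.sqrt(evals[:K]), A @ V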


#### Returns

When MODEL is not provided, the PCA procedure is
applied to the INPUT data to generate the MODEL as well as the rotated
data OUTPUT (if PROJDATA is set to `1`) in the new coordinate system.
The produced model consists of the basis vectors
`MODEL/dominant.eigen.vectors` of the new coordinate system, the
eigenvalues `MODEL/dominant.eigen.values`, and the standard deviations
`MODEL/dominant.eigen.standard.deviations` of the principal components.
When MODEL is provided, the INPUT data is rotated according to the
coordinate system defined by `MODEL/dominant.eigen.vectors`. The
resulting data is stored at location OUTPUT.


* * *

## 5.2 Matrix Completion via Alternating Minimizations

### Description

Low-rank matrix completion is an effective technique for statistical
data analysis that is widely used in data mining and machine learning
applications. Matrix completion is a variant of low-rank matrix
factorization with the goal of recovering a partially observed and
potentially noisy matrix from a subset of its revealed entries. Perhaps
the most popular application of matrix completion is collaborative
filtering in recommender systems. In this setting, the rows in the data
matrix correspond to users, the columns to items such as movies, and the
entries to feedback provided by users for items. The goal is to predict
missing entries of the rating matrix. This implementation uses the
alternating least-squares (ALS) technique for solving large-scale matrix
completion problems.


### Usage

**ALS**:

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f ALS.dml
                            -nvargs V=<file>
                                    L=<file>
                                    R=<file>
                                    rank=[int]
                                    reg=[L2|wL2]
                                    lambda=[double]
                                    maxi=[int]
                                    check=[boolean]
                                    thr=[double]
                                    fmt=[format]

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f ALS.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs V=<file>
                                         L=<file>
                                         R=<file>
                                         rank=[int]
                                         reg=[L2|wL2]
                                         lambda=[double]
                                         maxi=[int]
                                         check=[boolean]
                                         thr=[double]
                                         fmt=[format]

</div>
</div>

**ALS Prediction**:

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f ALS_predict.dml
                            -nvargs X=<file>
                                    Y=<file>
                                    L=<file>
                                    R=<file>
                                    Vrows=<int>
                                    Vcols=<int>
                                    fmt=[format]

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f ALS_predict.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs X=<file>
                                         Y=<file>
                                         L=<file>
                                         R=<file>
                                         Vrows=<int>
                                         Vcols=<int>
                                         fmt=[format]

</div>
</div>

**ALS Top-K Prediction**:

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f ALS_topk_predict.dml
                            -nvargs X=<file>
                                    Y=<file>
                                    L=<file>
                                    R=<file>
                                    V=<file>
                                    K=[int]
                                    fmt=[format]

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f ALS_topk_predict.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs X=<file>
                                         Y=<file>
                                         L=<file>
                                         R=<file>
                                         V=<file>
                                         K=[int]
                                         fmt=[format]

</div>
</div>

### Arguments - ALS

**V**: Location (on HDFS) to read the input (user-item) matrix $V$ to be
factorized

**L**: Location (on HDFS) to write the left (user) factor matrix $L$

**R**: Location (on HDFS) to write the right (item) factor matrix $R$

**rank**: (default: `10`) Rank of the factorization

**reg**: (default: `L2`) Regularization:

* `L2` = `L2` regularization
* `wL2` = weighted `L2` regularization

**lambda**: (default: `0.000001`) Regularization parameter

**maxi**: (default: `50`) Maximum number of iterations

**check**: (default: `FALSE`) Check for convergence after every iteration, i.e.,
after updating $L$ and $R$ once

**thr**: (default: `0.0001`) Assuming `check=TRUE`, the algorithm stops and
convergence is declared if the decrease in loss between any two consecutive
iterations falls below the threshold `thr`; if
`check=FALSE`, parameter `thr` is ignored.

**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
`mm`, or `csv`; see read/write functions in
SystemML Language Reference for details.

### Arguments - ALS Prediction/Top-K Prediction

**X**: Location (on HDFS) to read the input matrix $X$ with the following format:

* for `ALS_predict.dml`: a 2-column matrix that contains
  the user-ids (first column) and the item-ids (second column)
* for `ALS_topk_predict.dml`: a 1-column matrix that
  contains the user-ids

**Y**: Location (on HDFS) to write the output of the prediction with the
following format:

* for `ALS_predict.dml`: a 3-column matrix that contains
  the user-ids (first column), the item-ids (second column), and the
  predicted ratings (third column)
* for `ALS_topk_predict.dml`: a (`K+1`)-column matrix that
  contains the user-ids in the first column and the top-K item-ids in
  the remaining `K` columns. Additionally, a matrix with the same
  dimensions that contains the corresponding actual top-K ratings is
  stored at `Y.ratings`; see below for details

**L**: Location (on HDFS) to read the left (user) factor matrix $L$

**R**: Location (on HDFS) to read the right (item) factor matrix $R$

**V**: Location (on HDFS) to read the user-item matrix $V$

**Vrows**: Number of rows of $V$ (i.e., number of users)

**Vcols**: Number of columns of $V$ (i.e., number of items)

**K**: (default: `5`) Number of top-K items for top-K prediction

**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
`mm`, or `csv`; see read/write functions in
SystemML Language Reference for details.

### Examples

**ALS**:

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f ALS.dml
                            -nvargs V=/user/ml/V
                                    L=/user/ml/L
                                    R=/user/ml/R
                                    rank=10
                                    reg="wL2"
                                    lambda=0.0001
                                    maxi=50
                                    check=TRUE
                                    thr=0.001
                                    fmt=csv

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f ALS.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs V=/user/ml/V
                                         L=/user/ml/L
                                         R=/user/ml/R
                                         rank=10
                                         reg="wL2"
                                         lambda=0.0001
                                         maxi=50
                                         check=TRUE
                                         thr=0.001
                                         fmt=csv

</div>
</div>

**ALS Prediction**:

To compute predicted ratings for a given list of users and items:

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f ALS_predict.dml
                            -nvargs X=/user/ml/X
                                    Y=/user/ml/Y
                                    L=/user/ml/L
                                    R=/user/ml/R
                                    Vrows=100000
                                    Vcols=10000
                                    fmt=csv

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f ALS_predict.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs X=/user/ml/X
                                         Y=/user/ml/Y
                                         L=/user/ml/L
                                         R=/user/ml/R
                                         Vrows=100000
                                         Vcols=10000
                                         fmt=csv

</div>
</div>

**ALS Top-K Prediction**:

To compute the top-K items with the highest predicted ratings together
with their predicted ratings for a given list of users:

<div class="codetabs">
<div data-lang="Hadoop" markdown="1">

    hadoop jar SystemML.jar -f ALS_topk_predict.dml
                            -nvargs X=/user/ml/X
                                    Y=/user/ml/Y
                                    L=/user/ml/L
                                    R=/user/ml/R
                                    V=/user/ml/V
                                    K=10
                                    fmt=csv

</div>
<div data-lang="Spark" markdown="1">

    $SPARK_HOME/bin/spark-submit --master yarn-cluster
                                 --conf spark.driver.maxResultSize=0
                                 --conf spark.akka.frameSize=128
                                 SystemML.jar
                                 -f ALS_topk_predict.dml
                                 -config=SystemML-config.xml
                                 -exec hybrid_spark
                                 -nvargs X=/user/ml/X
                                         Y=/user/ml/Y
                                         L=/user/ml/L
                                         R=/user/ml/R
                                         V=/user/ml/V
                                         K=10
                                         fmt=csv

</div>
</div>

### Details

Given an $m \times n$ input matrix $V$ and a rank parameter
$r \ll \min{(m,n)}$, low-rank matrix factorization seeks to find an
$m \times r$ matrix $L$ and an $r \times n$ matrix $R$ such that
$V \approx LR$, i.e., we aim to approximate $V$ by the low-rank matrix
$LR$. The quality of the approximation is determined by an
application-dependent loss function $\mathcal{L}$. We aim to find the
loss-minimizing factor matrices, i.e.,

$$
\begin{equation}
(L^*, R^*) = \textrm{argmin}_{L,R}{\mathcal{L}(V,L,R)}
\end{equation}
$$

In the
context of collaborative filtering in recommender systems, it is
often the case that the input matrix $V$ contains several missing
entries. Such entries are coded as 0, and the loss function
is computed only over the nonzero entries in $V$, i.e.,

$$\mathcal{L}=\sum_{(i,j)\in\Omega}l(V_{ij},L_{i*},R_{*j})$$

where
$$L_{i*}$$ and $$R_{*j}$$, respectively, denote the $i$th row of $L$ and the
$j$th column of $R$, $$\Omega=\{\omega_1,\dots,\omega_N\}$$ denotes the
training set containing the observed (nonzero) entries in $V$, and $l$
is some local loss function.
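
A minimal NumPy sketch of this loss with the plain squared local loss $l$
(i.e., $$\mathcal{L}_\text{Nzsl}$$ from the table below; an illustration of
the formula above, not code taken from `ALS.dml`) looks as follows:

    import numpy as np

    def nzsl(V, L, R):
        """Squared loss over the observed (nonzero) entries of V only."""
        mask = V != 0                  # Omega: the revealed entries
        E = (V - L @ R) * mask         # residuals restricted to Omega
        return float(np.sum(E ** 2))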

ALS is an optimization technique that can
be used to solve quadratic problems. For matrix completion, the
algorithm repeatedly keeps one of the unknown matrices ($L$ or $R$)
fixed and optimizes the other one. In particular, ALS alternates between
recomputing the rows of $L$ in one step and the columns of $R$ in the
subsequent step. Our implementation of the ALS algorithm supports the
loss functions summarized in [**Table 16**](algorithms-matrix-factorization.html#table16)
commonly used for matrix completion [[ZhouWSP08]](algorithms-bibliography.html).

* * *

<a name="table16" />
**Table 16**: Popular loss functions supported by our ALS implementation; $$N_{i*}$$
and $$N_{*j}$$, respectively, denote the number of nonzero entries in
row $i$ and column $j$ of $V$.

| Loss | Definition |
| --------------------- | ---------- |
| $$\mathcal{L}_\text{Nzsl}$$ | $$\sum_{i,j:V_{ij}\neq 0} (V_{ij} - [LR]_{ij})^2$$ |
| $$\mathcal{L}_\text{Nzsl+L2}$$ | $$\mathcal{L}_\text{Nzsl} + \lambda \Bigl( \sum_{ik} L_{ik}^2 + \sum_{kj} R_{kj}^2 \Bigr)$$ |
| $$\mathcal{L}_\text{Nzsl+wL2}$$ | $$\mathcal{L}_\text{Nzsl} + \lambda \Bigl(\sum_{ik}N_{i*} L_{ik}^2 + \sum_{kj}N_{*j} R_{kj}^2 \Bigr)$$ |


* * *


Note that the matrix completion problem as defined in (1)
is a non-convex problem for all loss functions from
[**Table 16**](algorithms-matrix-factorization.html#table16). However, when fixing one of the matrices
$L$ or $R$, we get a least-squares problem with a globally optimal
solution. For example, for the case of $$\mathcal{L}_\text{Nzsl+wL2}$$ we
have the following closed-form solutions

$$\begin{aligned}
L^\top_{n+1,i*} &\leftarrow (R^{(i)}_n {[R^{(i)}_n]}^\top + \lambda N_1 I)^{-1} R_n V^\top_{i*}, \\
R_{n+1,*j} &\leftarrow ({[L^{(j)}_{n+1}]}^\top L^{(j)}_{n+1} + \lambda N_2 I)^{-1} L^\top_{n+1} V_{*j},
\end{aligned}
$$

where $$L_{n+1,i*}$$ (resp. $$R_{n+1,*j}$$) denotes the
$i$th row of $$L_{n+1}$$ (resp. $j$th column of $$R_{n+1}$$), $\lambda$
denotes the regularization parameter, $I$ is the identity matrix of
appropriate dimensionality, $$V_{i*}$$ (resp. $$V_{*j}$$) denotes the
revealed entries in row $i$ (column $j$), $$R^{(i)}_n$$ (resp.
$$L^{(j)}_{n+1}$$) refers to the corresponding columns of $R_n$ (rows of
$$L_{n+1}$$), and $N_1$ (resp. $N_2$) denotes a diagonal matrix that
contains the number of nonzero entries in row $i$ (column $j$) of $V$.
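
The following NumPy sketch performs one ALS sweep using these closed-form
solutions for $$\mathcal{L}_\text{Nzsl+wL2}$$ (an illustration of the updates
above on dense arrays, not the `ALS.dml` implementation; all names are chosen
for the example):

    import numpy as np

    def als_sweep(V, L, R, lam):
        """One ALS iteration: update the rows of L, then the columns of R."""
        m, n = V.shape
        r = L.shape[1]
        for i in range(m):                          # update L with R fixed
            omega = np.nonzero(V[i, :])[0]          # items rated by user i
            if omega.size == 0:
                continue                            # unrated user: row stays 0
            Ro = R[:, omega]                        # R restricted to omega
            A = Ro @ Ro.T + lam * omega.size * np.eye(r)
            L[i, :] = np.linalg.solve(A, Ro @ V[i, omega])
        for j in range(n):                          # update R with L fixed
            omega = np.nonzero(V[:, j])[0]          # users who rated item j
            if omega.size == 0:
                continue                            # unrated item: column stays 0
            Lo = L[omega, :]                        # L restricted to omega
            A = Lo.T @ Lo + lam * omega.size * np.eye(r)
            R[:, j] = np.linalg.solve(A, Lo.T @ V[omega, j])
        return L, R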


**Prediction.** Based on the factor matrices computed by ALS, we provide
two prediction scripts; a minimal sketch of both modes follows this list.

1. `ALS_predict.dml` computes the predicted ratings for a given
   list of users and items.
2. `ALS_topk_predict.dml` computes the top-K items (where $K$ is
   given as input) with the highest predicted ratings together with their
   corresponding ratings for a given list of users.

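A NumPy sketch of the two prediction modes (illustrative only; the actual
scripts read the factor matrices from HDFS, and `ALS_topk_predict.dml`
additionally uses $V$ to report the corresponding actual ratings):

    import numpy as np

    def predict_ratings(L, R, user_ids, item_ids):
        """ALS_predict-style: predicted rating for (i, j) is L[i,:] . R[:,j]."""
        return np.array([L[i, :] @ R[:, j] for i, j in zip(user_ids, item_ids)])

    def topk_items(L, R, user_id, K=5):
        """ALS_topk_predict-style: top-K items by predicted rating for a user."""
        scores = L[user_id, :] @ R              # predicted ratings for all items
        top = np.argsort(scores)[::-1][:K]      # indices of the K largest scores
        return top, scores[top]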

### Returns

We output the factor matrices $L$ and $R$ after the algorithm has
converged. The algorithm is declared to have converged if one of the two
criteria is met: (1) the decrease in the value of the loss function falls
below the threshold `thr` given as an input parameter (if parameter
`check=TRUE`), or (2) the maximum number of iterations
(defined by parameter `maxi`) is reached. Note that for a
given user $i$, prediction is possible only if user $i$ has rated at
least one item, i.e., row $i$ in matrix $V$ has at least one nonzero
entry. In case some users have not rated any items, the corresponding
rows in $L$ will contain only 0s. Similarly, if some items have not been
rated at all, the corresponding columns in $R$ will contain only 0s. Our
prediction scripts output the predicted ratings for a given list of
users and items as well as the top-K items with the highest predicted
ratings together with the predicted ratings for a given list of users.
Note that predictions are only provided for users who have rated at
least one item, i.e., whose corresponding rows in $V$ contain at least
one nonzero entry.
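
For illustration, the convergence check described above can be sketched as a
simple driver loop around the `als_sweep` and `nzsl` sketches shown earlier
(again an illustration, not `ALS.dml`; note that the loss used by the actual
script for `reg=wL2` may also include the regularization term, which this
sketch omits):

    import numpy as np

    def als(V, rank=10, lam=1e-6, maxi=50, check=True, thr=1e-4):
        """Alternate sweeps until the loss decrease falls below thr or maxi."""
        m, n = V.shape
        rng = np.random.default_rng(0)
        L = 0.01 * rng.standard_normal((m, rank))   # random initialization
        R = 0.01 * rng.standard_normal((rank, n))
        prev = np.inf
        for _ in range(maxi):                       # criterion (2): maxi
            L, R = als_sweep(V, L, R, lam)          # one update of L and R
            if check:
                loss = nzsl(V, L, R)                # loss on observed entries
                if prev - loss < thr:               # criterion (1): thr
                    break
                prev = loss
        return L, R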