\chapter[Linear Systems]{Linear Systems}
\begin{moduleinfo}
\item[Authors] {Srikrishna Sridhar}
\item[History]
\begin{modulehistory}
\item[v1.0] Initial version
\end{modulehistory}
\end{moduleinfo}
\section{Introduction}\label{sec:intro}
In this document, we describe solution methods for consistent systems of linear
equations
\begin{equation}
\label{eq:linear_system}
Ax = b
\end{equation}
where $x \in \R^{n}$, $A \in \R^{m \times n}$ and $b \in \R^{m}$. We assume
that all rows of $A$ are non-zero. We denote the rows of $A$ by $a^T_i$ and the
entries of $b$ by $b_i$. This can be written
as
\begin{alignat}{2}
A &= \begin{bmatrix} a^T_1 \\ a^T_2 \\ \vdots \\ a^T_m \end{bmatrix} \ \ \ \
b &&= \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
\end{alignat}
The algorithms discussed in this document are suitable for large sparse
linear systems for which ordinary elimination is too expensive. Amongst the many
methods for iteratively solving linear systems, algorithms like the Jacobi and
over-relaxation methods are not as effective as the conjugate gradient method.
The preconditioned conjugate gradient (CG) method is one of the most commonly
used algorithms for solving large sparse systems with a symmetric $A$. The
textbook CG algorithm has been modified to be applicable even when $A$ is not
symmetric. The disadvantage of CG is that each iteration must perform a
matrix-vector product. To avoid this computation, some applications instead use
the randomized Kaczmarz (RK) algorithm, which is a popular choice for extremely
large scale applications. The algorithm is known to have a linear convergence
rate and each iteration requires only $O(n)$ operations. In some applications,
it outperforms CG. In general, it is difficult to predict which of CG or RK is
preferable for a given linear system.
We now discuss three different approaches to solving linear systems: direct
methods, conjugate gradient, and randomized Kaczmarz. Each method has its own
advantages and disadvantages, which are highlighted below.
\section{Direct Methods}
Direct methods are suitable for solving small linear systems that fit completely
in memory. The workhorse of several commercial codes is the LU decomposition.
The LU decomposition factors a matrix as the product of a lower triangular matrix
($L$) and an upper triangular matrix ($U$) such that
$$
PA = LU
$$
where $P$ is a permutation matrix which re-orders the rows of $A$. Such an
LU-decomposition can be used to solve \eqref{eq:linear_system} using
$$
Ax = b \Rightarrow LUx = Pb
$$
Now, we can solve for $x$ using the following steps
\begin{enumerate}
\item Solve for $y$ in the equation $Ly = Pb$
\item Solve for $x$ in the equation $Ux = y$
\end{enumerate}
Since both $L$ and $U$ are triangular matrices, we can efficiently solve both
the equations directly using forward and backward substitutions.
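As a concrete illustration, the sketch below solves a small system with an
off-the-shelf dense LU factorization using SciPy's \texttt{lu\_factor} and
\texttt{lu\_solve}, which carry out the two triangular solves described above.
This is only an in-memory example, not the in-database implementation.
\begin{verbatim}
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Small example system; a dense LU factorization is only sensible
# when A fits comfortably in memory.
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

# Factor PA = LU once; lu_factor returns the packed LU factors and pivots.
lu, piv = lu_factor(A)

# lu_solve performs the forward substitution (Ly = Pb) followed by the
# backward substitution (Ux = y).
x = lu_solve((lu, piv), b)

assert np.allclose(A @ x, b)
\end{verbatim}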
The main advantage of solving linear systems with direct methods is that direct
methods are independent of the conditioning of the system. Solve time depends
purely on the sparsity and size of the matrices. The major disadvantage is
that the LU-decomposition has large memory requirements: even when the matrix
$A$ is sparse, the $L$ and $U$ factors might be dense. The large memory
requirements make it unsuitable for solving large sparse linear systems.
\section{Iterative Methods}
In solving \eqref{eq:linear_system} a convergent iterative method starts
with an initial estimate $x_0$ of the solution and generates a sequence of iterates
$x_k$ that are successively closer to the solution $x^*$. Iterative methods
are often useful even for linear problems involving a large number of variables.
Amongst the many iterative methods, we will review the two most popular methods:
the conjugate gradient method (CG) and the randomized Kaczmarz (RK) method.
\subsection{Conjugate Gradient (CG)}
The linear conjugate gradient method (not to be confused with the non-linear
conjugate gradient method) is used to solve large sparse linear systems with a
\textit{symmetric positive definite} matrix $A$. Such a system can be stated as:
\begin{equation}
\label{eq:sym_linear_system}
Ax = b
\end{equation}
where $x \in \R^{n}$, $A \in \R^{n \times n}$ is symmetric and positive
definite and $b \in \R^{n}$.
Unlike direct methods, the time taken to solve a linear system using CG depends on the
distribution of the eigenvalues of the $A$ matrix. In some applications,
the $A$ matrix is appropriately scaled by a process called pre-conditioning to
generate an equivalent system with a more favorable distribution of eigenvalues.
The system \eqref{eq:sym_linear_system} can be restated as the quadratic minimization
$$
\min \phi(x) := \frac{1}{2} x^T A x - b^T x
$$
which allows us to interpret CG as an algorithm to minimize convex quadratic
functions. In the rest of this section, we will refer to the gradient $\nabla \phi(x)$
as the residual of the linear system:
$$
\nabla \phi(x) := r(x) := Ax - b
$$
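Setting this gradient to zero makes the equivalence explicit: since $A$ is
symmetric positive definite, $\phi$ is strictly convex and its unique minimizer
$x^*$ satisfies
$$
\nabla \phi(x^*) = Ax^* - b = 0,
$$
which is precisely the system \eqref{eq:sym_linear_system}.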
The linear conjugate gradient method generates a sequence of directions $\{p_0, p_1, \ldots, p_l\}$
that satisfy an important property called conjugacy, which implies that the method
can minimize the function $\phi(x)$ in exactly $n$ steps. We refer the reader
to the textbook by Nocedal and Wright \cite{nocedal2006numerical} for details on the
theoretical and algorithmic aspects of CG.
We now discuss an efficient implementation of the linear CG algorithm. In each
iteration $k$, we keep track of the direction vector $p_k$, the residual
vector $r_k$ and the solution vector $x_k$. The computational bottleneck
is the matrix-vector product between $A$ and $p_k$.
\begin{algorithm}%
% [Conjugate gradient for symmetric positive definite linear systems]
% \caption{Conjugate gradient for symmetric positive definite linear systems.}\label{alg:cg}
\alginput{Symmetric positive definite matrix $A \in \R^{n \times n}$, $b \in \R^{n}$}
\algoutput{Solution to $Ax=b$}
\begin{algorithmic}[1]
\State Choose $x_0 \in \R^n$, $r_0 \leftarrow Ax_0 - b$, $p_0 \leftarrow -r_0$, $k \leftarrow 0$
\While{$\norm{r_k} > \epsilon$}
\State$z_k \leftarrow Ap_k$
\State$\alpha_k \leftarrow \frac{r_k^T r_k}{p_k^T z_k}$
\State$x_{k+1} \leftarrow x_k + \alpha_k p_k$
\State$r_{k+1} \leftarrow r_k + \alpha_k z_k$
\State$\beta_{k+1} \leftarrow \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
\State$p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k$
\State$k \leftarrow k + 1$
\EndWhile
\end{algorithmic}
\label{alg:cg}
\end{algorithm}
The conjugate gradient method is suitable for large sparse linear systems where
direct methods can often run into memory bottlenecks. This is mainly because
the only memory requirement of the CG method is storing the latest copies
of the vectors $p_k$, $r_k$ and $x_k$. The majority of the computational effort
is spent in the step $z_k \leftarrow A p_k$. Hence, CG tends to perform better
on sparse linear systems.
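For concreteness, the following is a minimal dense-matrix sketch of
Algorithm~\ref{alg:cg} in Python/NumPy. Variable names mirror the algorithm; in
an in-database implementation the product $Ap_k$ would be accumulated over the
stored rows of $A$ rather than computed on an in-memory array.
\begin{verbatim}
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-8, max_iter=None):
    """Solve Ax = b for a symmetric positive definite A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = A @ x - b                       # r_0 = A x_0 - b
    p = -r                              # p_0 = -r_0
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        z = A @ p                       # the single matrix-vector product
        alpha = (r @ r) / (p @ z)
        x = x + alpha * p
        r_new = r + alpha * z
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return x

# Example usage on a small SPD system.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
assert np.allclose(A @ conjugate_gradient(A, b), b)
\end{verbatim}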
\subsubsection*{Conjugate Gradient Least Squares (CGLS)}
In this section, we extend the CG algorithm so that it is numerically suited to any
linear system of the form \eqref{eq:linear_system}. The naive extension of CG to
\eqref{eq:linear_system} solves the normal equations $A^TAx = A^Tb$. In addition to
requiring an expensive matrix-matrix multiplication to form $A^TA$, its use of
vectors of the form $A^TAp$ has poor numerical properties. An algorithm with better
numerical properties was developed by Hestenes and Stiefel \cite{hestenes1952methods}.
\begin{algorithm}
% \caption{Conjugate Gradient (least squares) for general linear systems.} \label{alg:cgls}
\alginput{Matrix $A \in \R^{m \times n}$, $b \in \R^{m}$}
\algoutput{Solution to $Ax=b$}
\begin{algorithmic}[1]
\State Choose $x_0 \in \R^n$, $r_0 \leftarrow b - Ax_0$, $s_0 \leftarrow A^Tr_0$, $p_0 \leftarrow s_0$, $\gamma_0 \leftarrow \norm{s_0}^2$, $k \leftarrow 0$
\While{$\norm{r_k} > \epsilon$}
\State$z_k \leftarrow Ap_k$
\State$\alpha_k \leftarrow \frac{\gamma_{k}}{z_k^Tz_k}$
\State$x_{k+1} \leftarrow x_k + \alpha_k p_k$
\State$r_{k+1} \leftarrow r_k - \alpha_k z_k$
\State$s_{k+1} \leftarrow A^T r_{k+1} $
\State$\gamma_{k+1} \leftarrow s_{k+1}^T s_{k+1}$
\State$\beta_{k+1} \leftarrow \frac{\gamma_{k+1}}{\gamma_k}$
\State$p_{k+1} \leftarrow s_{k+1} + \beta_{k+1} p_k$
\State$k \leftarrow k + 1$
\EndWhile
\end{algorithmic}
\label{alg:cgls}
\end{algorithm}
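For reference, a minimal Python/NumPy sketch of Algorithm~\ref{alg:cgls} is
given below; note that only matrix-vector products with $A$ and $A^T$ are
needed, and $A^TA$ is never formed explicitly.
\begin{verbatim}
import numpy as np

def cgls(A, b, x0=None, eps=1e-8, max_iter=1000):
    """Solve a consistent system Ax = b with CGLS."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                       # r_0 = b - A x_0
    s = A.T @ r                         # s_0 = A^T r_0
    p = s.copy()
    gamma = s @ s                       # gamma_0 = ||s_0||^2
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        z = A @ p
        alpha = gamma / (z @ z)
        x = x + alpha * p
        r = r - alpha * z
        s = A.T @ r
        gamma_new = s @ s
        beta = gamma_new / gamma
        p = s + beta * p
        gamma = gamma_new
    return x

# Example usage on a consistent rectangular system.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
x_true = np.array([1.0, -1.0])
b = A @ x_true
assert np.allclose(cgls(A, b), x_true)
\end{verbatim}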
Paige and Saunders \cite{paige1982lsqr} developed an algorithm called LSQR which has similar performance
to CGLS. We might consider implementing LSQR in case CGLS performs poorly on
linear systems.
\subsection{Randomized Kaczmarz (RK)}
As discussed earlier, the randomized Kaczmarz (RK) algorithm is a popular
algorithm for solving \eqref{eq:linear_system}. Each iteration requires
$O(n)$ storage and computational effort. During each iteration, RK picks a
row $a_i$ of the matrix $A$, and performs an orthogonal projection of the current
solution vector $x_k$ onto the hyperplane $a_i^Tx = b_i$. The update step is given by
$$
x_{k+1} = x_k - \frac{(a_i^T x_k - b_i)}{\norm{a_i}^2} a_i
$$
An alternate interpretation of RK is that the algorithm is identical to the
stochastic gradient descent algorithm applied to the least squares problem
$$
\min \phi(x) := \frac{1}{2} \norm{Ax - b}^2
$$
The algorithm performs best when a row $i$ is chosen randomly with probability
proportional to $\norm{a_i}^2$. Since sequential scans are preferred for in-database
algorithms, a common pre-processing procedure for RK is to rescale the system so that each
equation $a_i^Tx = b_i$ has the same norm.
\begin{algorithm}%[Randomized Kaczmarz for general linear systems]
% \caption{Randomized Kaczmarz for general linear systems.}\label{alg:rk}
\alginput{Matrix $A \in \R^{m \times n}$, $b \in \R^{m}$}
\algoutput{Solution to $Ax=b$}
\begin{algorithmic}[1]
\State Choose $x_0 \in \R^n$, $k \leftarrow 0$
\While{$\norm{Ax_k - b} > \epsilon$}
\State Pick a row $i$ of $A$ at random, with probability proportional to $\norm{a_i}^2$
\State$x_{k+1} \leftarrow x_k - \frac{(a_i^T x_k - b_i)}{\norm{a_i}^2} a_i$
\State$k \leftarrow k + 1$
\EndWhile
\end{algorithmic}
\label{alg:rk}
\end{algorithm}
The termination criterion of the algorithm is implemented by computing the
residual $\norm{Ax-b}$ only infrequently. Typically, this computation
is performed every $K$ epochs, where an epoch is defined as one whole pass over
the data, which in the case of RK is $m$ iterations.
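A minimal in-memory sketch of Algorithm~\ref{alg:rk} is shown below, including
the row sampling weighted by $\norm{a_i}^2$ and the infrequent residual check;
in an in-database setting the rows would instead be read with sequential scans,
for example after the rescaling pre-processing step described above.
\begin{verbatim}
import numpy as np

def randomized_kaczmarz(A, b, eps=1e-8, max_epochs=100,
                        check_every=5, seed=0):
    """Solve a consistent system Ax = b with randomized Kaczmarz."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms_sq = np.einsum('ij,ij->i', A, A)   # ||a_i||^2 for each row
    probs = row_norms_sq / row_norms_sq.sum()    # sampling weights
    for epoch in range(1, max_epochs + 1):
        # One epoch = m single-row projection steps.
        for i in rng.choice(m, size=m, p=probs):
            a_i = A[i]
            x = x - ((a_i @ x - b[i]) / row_norms_sq[i]) * a_i
        # The residual ||Ax - b|| costs a full pass over the data, so it
        # is only computed every `check_every` epochs.
        if epoch % check_every == 0 and np.linalg.norm(A @ x - b) <= eps:
            break
    return x

# Example usage on a small consistent system.
A = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])
x_true = np.array([1.0, 2.0])
b = A @ x_true
assert np.linalg.norm(A @ randomized_kaczmarz(A, b) - b) < 1e-6
\end{verbatim}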