\chapter[Linear Systems]{Linear Systems}
\begin{moduleinfo}
\item[Authors] {Srikrishna Sridhar}
\item[History]
\begin{modulehistory}
\item[v1.0] Initial version
\end{modulehistory}
\end{moduleinfo}
\section{Introduction}\label{sec:intro}
In this document, we describe solution methods for consistent systems of linear
equations
\begin{equation}
\label{eq:linear_system}
Ax = b
\end{equation}
where $x \in \R^{n}$, $A \in \R^{m \times n}$ and $b \in \R^{m}$. We assume
that all rows of $A$ are non-zero. We denote the rows of $A$ by $a^T_i$ and the
entries of $b$ by $b_i$. This can be written
as
\begin{alignat}{2}
A &= \begin{bmatrix} a^T_1 \\ a^T_2 \\ \vdots \\ a^T_m \end{bmatrix} \ \ \ \
b &&= \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
\end{alignat}
The algorithms discussed in this document are suitable for large sparse
linear systems for which ordinary elimination is too expensive. Amongst the many
methods for iteratively solving linear systems, algorithms like the Jacobi and
over-relaxation methods are not as effective as the conjugate gradient method.
The preconditioned conjugate gradient (CG) method is one of the most commonly
used algorithms for solving large sparse systems with a symmetric $A$. The
textbook CG algorithm has been modified to be applicable even when $A$ is not
symmetric. The disadvantage of CG is that each iteration must perform a
matrix-vector product. To avoid this computation, some applications instead use
the randomized Kaczmarz (RK) algorithm, which is a popular choice for extremely
large scale applications. The algorithm is known to have a linear convergence
rate and each iteration requires only $O(n)$ operations. In some applications,
it outperforms CG. In general, it is difficult to predict which of CG or RK is
preferable for a given linear system.
We now discuss three different approaches to solving linear systems: direct
methods, conjugate gradient, and randomized Kaczmarz. Each method has its own
advantages and disadvantages, which are highlighted below.
\section{Direct Methods}
Direct methods are suitable for solving small linear systems that fit completely
in memory. The workhorse of several commercial codes is the LU decomposition.
The LU decomposition factors a matrix as the product of a lower triangular matrix
($L$) and an upper triangular matrix ($U$) such that
$$
PA = LU
$$
where $P$ is a permutation matrix which re-orders the rows of $A$. Such an
LU-decomposition can be used to solve \eqref{eq:linear_system} using
$$
Ax = b \Rightarrow LUx = Pb
$$
Now, we can solve for $x$ using the following steps
\begin{enumerate}
\item Solve for $y$ in the equation $Ly = Pb$
\item Solve for $x$ in the equation $Ux = y$
\end{enumerate}
Since both $L$ and $U$ are triangular matrices, we can efficiently solve both
the equations directly using forward and backward substitutions.
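As a concrete illustration, the sketch below solves a small system with an
off-the-shelf dense LU factorization using SciPy's \texttt{lu\_factor} and
\texttt{lu\_solve}, which carry out the two triangular solves described above.
This is only an in-memory example, not the in-database implementation.
\begin{verbatim}
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Small example system; a dense LU factorization is only sensible
# when A fits comfortably in memory.
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

# Factor PA = LU once; lu_factor returns the packed LU factors and pivots.
lu, piv = lu_factor(A)

# lu_solve performs the forward substitution (Ly = Pb) followed by the
# backward substitution (Ux = y).
x = lu_solve((lu, piv), b)

assert np.allclose(A @ x, b)
\end{verbatim}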
The main advantage of solving linear systems with direct methods is that direct
methods are independent of the conditioning of the system. Solve time depends
purely on the sparsity and size of the matrices. The major disadvantage is
that the LU-decomposition has large memory requirements: even when the matrix
$A$ is sparse, the $L$ and $U$ factors might be dense. The large memory
requirements make it unsuitable for solving large sparse linear systems.
\section{Iterative Methods}
In solving \eqref{eq:linear_system} a convergent iterative method starts
with an initial estimate $x_0$ of the solution and generates a sequence of iterates
$x_k$ that are successively closer to the solution $x^*$. Iterative methods
are often useful even for linear problems involving a large number of variables.
Amongst the many iterative methods, we will review the two most popular methods:
the conjugate gradient method (CG) and the randomized Kaczmarz (RK) method.
\subsection{Conjugate Gradient (CG)}
The linear conjugate gradient method (not to be confused with the non-linear
conjugate gradient method) is used to solve large sparse linear systems with a
\textit{symmetric positive definite} matrix $A$. Such a system can be stated as:
\begin{equation}
\label{eq:sym_linear_system}
Ax = b
\end{equation}
where $x \in \R^{n}$, $A \in \R^{n \times n}$ is symmetric and positive
definite and $b \in \R^{n}$.
Unlike direct methods, the time taken to solve a linear system using CG depends on the
distribution of the eigenvalues of the $A$ matrix. In some applications,
the $A$ matrix is appropriately scaled by a process called pre-conditioning to
generate an equivalent system with a more favorable distribution of eigenvalues.
The system \eqref{eq:sym_linear_system} can be restated as the quadratic minimization
$$
\min \phi(x) := \frac{1}{2} x^T A x - b^T x
$$
which allows us to interpret CG as an algorithm to minimize convex quadratic
functions. In the rest of this section, we will refer to the gradient $\nabla \phi(x)$
as the residual of the linear system:
$$
\nabla \phi(x) := r(x) := Ax - b
$$
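Setting this gradient to zero makes the equivalence explicit: since $A$ is
symmetric positive definite, $\phi$ is strictly convex and its unique minimizer
$x^*$ satisfies
$$
\nabla \phi(x^*) = Ax^* - b = 0,
$$
which is precisely the system \eqref{eq:sym_linear_system}.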
The linear conjugate gradient method generates a sequence of directions $\{p_0, p_1, \ldots, p_l\}$
that satisfy an important property called conjugacy, which implies that the method
can minimize the function $\phi(x)$ in exactly $n$ steps. We refer the reader
to the textbook by Nocedal and Wright \cite{nocedal2006numerical} for details on the
theoretical and algorithmic aspects of CG.
We now discuss an efficient implementation of the linear CG algorithm. In each
iteration $k$, we keep track of the direction vector $p_k$, the residual
vector $r_k$ and the solution vector $x_k$. The computational bottleneck
is the matrix-vector product between $A$ and $p_k$.
\begin{algorithm}%
% [Conjugate gradient for symmetric positive definite linear systems]
% \caption{Conjugate gradient for symmetric positive definite linear systems.}\label{alg:cg}
\alginput{Symmetric positive definite matrix $A \in \R^{n \times n}$, $b \in \R^{n}$}
\algoutput{Solution to $Ax=b$}
\begin{algorithmic}[1]
\State Choose $x_0 \in \R^n$, $r_0 \leftarrow Ax_0 - b$, $p_0 \leftarrow -r_0$, $k \leftarrow 0$
\While{$\norm{r_k} > \epsilon$}
\State$z_k \leftarrow Ap_k$
\State$\alpha_k \leftarrow \frac{r_k^T r_k}{p_k^T z_k}$
\State$x_{k+1} \leftarrow x_k + \alpha_k p_k$
\State$r_{k+1} \leftarrow r_k + \alpha_k z_k$
\State$\beta_{k+1} \leftarrow \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
\State$p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k$
\State$k \leftarrow k + 1$
\EndWhile
\end{algorithmic}
\label{alg:cg}
\end{algorithm}
The conjugate gradient method is suitable for large sparse linear systems where
direct methods can often run into memory bottlenecks. This is mainly because
the only memory requirement of the CG method is storing the latest copies
of the vectors $p_k$, $r_k$ and $x_k$. The majority of the computational effort
is spent in the step $z_k \leftarrow A p_k$. Hence, CG tends to perform better
on sparse linear systems.
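For concreteness, the following is a minimal dense-matrix sketch of
Algorithm~\ref{alg:cg} in Python/NumPy. Variable names mirror the algorithm; in
an in-database implementation the product $Ap_k$ would be accumulated over the
stored rows of $A$ rather than computed on an in-memory array.
\begin{verbatim}
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-8, max_iter=None):
    """Solve Ax = b for a symmetric positive definite A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = A @ x - b                       # r_0 = A x_0 - b
    p = -r                              # p_0 = -r_0
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        z = A @ p                       # the single matrix-vector product
        alpha = (r @ r) / (p @ z)
        x = x + alpha * p
        r_new = r + alpha * z
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return x

# Example usage on a small SPD system.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
assert np.allclose(A @ conjugate_gradient(A, b), b)
\end{verbatim}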
\subsubsection*{Conjugate Gradient Least Squares (CGLS)}
In this section, we extend the CG algorithm so that it is numerically suited to any
linear system of the form \eqref{eq:linear_system}. The naive extension of CG to
\eqref{eq:linear_system} solves the normal equations $A^TAx = A^Tb$. In addition to
requiring an expensive matrix-matrix multiplication to form $A^TA$, its use of
vectors of the form $A^TAp$ has poor numerical properties. An algorithm with better
numerical properties was developed by Hestenes and Stiefel \cite{hestenes1952methods}.
\begin{algorithm}
% \caption{Conjugate Gradient (least squares) for general linear systems.} \label{alg:cgls}
\alginput{Matrix $A \in \R^{m \times n}$, $b \in \R^{m}$}
\algoutput{Solution to $Ax=b$}
\begin{algorithmic}[1]
\State Choose $x_0 \in \R^n$, $r_0 \leftarrow b - Ax_0$, $s_0 \leftarrow A^Tr_0$, $p_0 \leftarrow s_0$, $\gamma_0 \leftarrow \norm{s_0}^2$, $k \leftarrow 0$
\While{$\norm{r_k} > \epsilon$}
\State$z_k \leftarrow Ap_k$
\State$\alpha_k \leftarrow \frac{\gamma_{k}}{z_k^Tz_k}$
\State$x_{k+1} \leftarrow x_k + \alpha_k p_k$
\State$r_{k+1} \leftarrow r_k - \alpha_k z_k$
\State$s_{k+1} \leftarrow A^T r_{k+1} $
\State$\gamma_{k+1} \leftarrow s_{k+1}^T s_{k+1}$
\State$\beta_{k+1} \leftarrow \frac{\gamma_{k+1}}{\gamma_k}$
\State$p_{k+1} \leftarrow s_{k+1} + \beta_{k+1} p_k$
\State$k \leftarrow k + 1$
\EndWhile
\end{algorithmic}
\label{alg:cgls}
\end{algorithm}
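For reference, a minimal Python/NumPy sketch of Algorithm~\ref{alg:cgls} is
given below; note that only matrix-vector products with $A$ and $A^T$ are
needed, and $A^TA$ is never formed explicitly.
\begin{verbatim}
import numpy as np

def cgls(A, b, x0=None, eps=1e-8, max_iter=1000):
    """Solve a consistent system Ax = b with CGLS."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                       # r_0 = b - A x_0
    s = A.T @ r                         # s_0 = A^T r_0
    p = s.copy()
    gamma = s @ s                       # gamma_0 = ||s_0||^2
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        z = A @ p
        alpha = gamma / (z @ z)
        x = x + alpha * p
        r = r - alpha * z
        s = A.T @ r
        gamma_new = s @ s
        beta = gamma_new / gamma
        p = s + beta * p
        gamma = gamma_new
    return x

# Example usage on a consistent rectangular system.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
x_true = np.array([1.0, -1.0])
b = A @ x_true
assert np.allclose(cgls(A, b), x_true)
\end{verbatim}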
Paige and Saunders \cite{paige1982lsqr} developed an algorithm called LSQR which has similar performance
to CGLS. We might consider implementing LSQR in case CGLS performs poorly on
linear systems.
\subsection{Randomized Kaczmarz (RK)}
As discussed earlier, the randomized Kaczmarz (RK) algorithm is a popular
algorithm for solving \eqref{eq:linear_system}. Each iteration requires
$O(n)$ storage and computational effort. During each iteration, RK picks a
row $a_i$ of the matrix $A$, and performs an orthogonal projection of the current
solution vector $x_k$ onto the hyperplane $a_i^Tx = b_i$. The update step is given by
$$
x_{k+1} = x_k - \frac{(a_i^T x_k - b_i)}{\norm{a_i}^2} a_i
$$
An alternate interpretation of RK is that the algorithm is identical to the
stochastic gradient descent algorithm applied to the least squares problem
$$
\min \phi(x) := \frac{1}{2} \norm{Ax - b}^2
$$
The algorithm performs best when a row $i$ is chosen randomly with probability
proportional to $\norm{a_i}^2$. Since sequential scans are preferred for in-database
algorithms, a common pre-processing procedure for RK is to rescale the system so that each
equation $a_i^Tx = b_i$ has the same norm.
\begin{algorithm}%[Randomized Kaczmarz for general linear systems]
% \caption{Randomized Kaczmarz for general linear systems.}\label{alg:rk}
\alginput{Matrix $A \in \R^{m \times n}$, $b \in \R^{m}$}
\algoutput{Solution to $Ax=b$}
\begin{algorithmic}[1]
\State Choose $x_0 \in \R^n$, $k \leftarrow 0$
\While{$\norm{Ax_k - b} > \epsilon$}
\State Pick a row $i$ of $A$ at random, with probability proportional to $\norm{a_i}^2$
\State$x_{k+1} \leftarrow x_k - \frac{(a_i^T x_k - b_i)}{\norm{a_i}^2} a_i$
\State$k \leftarrow k + 1$
\EndWhile
\end{algorithmic}
\label{alg:rk}
\end{algorithm}
The termination criterion of the algorithm is implemented by computing the
residual $\norm{Ax-b}$ only infrequently. Typically, this computation
is performed every $K$ epochs, where an epoch is defined as one whole pass over
the data, which in the case of RK is $m$ iterations.
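A minimal in-memory sketch of Algorithm~\ref{alg:rk} is shown below, including
the row sampling weighted by $\norm{a_i}^2$ and the infrequent residual check;
in an in-database setting the rows would instead be read with sequential scans,
for example after the rescaling pre-processing step described above.
\begin{verbatim}
import numpy as np

def randomized_kaczmarz(A, b, eps=1e-8, max_epochs=100,
                        check_every=5, seed=0):
    """Solve a consistent system Ax = b with randomized Kaczmarz."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    row_norms_sq = np.einsum('ij,ij->i', A, A)   # ||a_i||^2 for each row
    probs = row_norms_sq / row_norms_sq.sum()    # sampling weights
    for epoch in range(1, max_epochs + 1):
        # One epoch = m single-row projection steps.
        for i in rng.choice(m, size=m, p=probs):
            a_i = A[i]
            x = x - ((a_i @ x - b[i]) / row_norms_sq[i]) * a_i
        # The residual ||Ax - b|| costs a full pass over the data, so it
        # is only computed every `check_every` epochs.
        if epoch % check_every == 0 and np.linalg.norm(A @ x - b) <= eps:
            break
    return x

# Example usage on a small consistent system.
A = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])
x_true = np.array([1.0, 2.0])
b = A @ x_true
assert np.linalg.norm(A @ randomized_kaczmarz(A, b) - b) < 1e-6
\end{verbatim}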