International Tables for Crystallography, Volume C: Mathematical, physical and chemical tables. Edited by E. Prince.

International Tables for Crystallography (2006). Vol. C. ch. 8.1, pp. 686-687

Section 8.1.5.2. Conjugate-gradient methods

E. Prince^a and P. T. Boggs^b

^a NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and ^b Scientific Computing Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

8.1.5.2. Conjugate-gradient methods


A numerical procedure that is applicable to large-scale problems that may not be sparse is called the conjugate-gradient method. Conjugate-gradient methods were originally designed to solve the quadratic minimization problem: find the minimum of [S({\bf x})=(1/2){\bf x}^T{\bi H}{\bf x}-{\bf b}^T{\bf x}, \eqno (8.1.5.1)] where H is a symmetric, positive-definite matrix. The gradient of S is [{\bf g}({\bf x})={\bi H}{\bf x}-{\bf b}, \eqno (8.1.5.2)] and its Hessian matrix is H. Given an initial estimate, [{\bf x}_0], the conjugate-gradient algorithm is

  • (1) define [{\bf d}_{0}=-{\bf g}({\bf x}_{0})];

  • (2) for k = 0, 1, 2, ..., p − 1,

    • (a) [\alpha _{k}=-{\bf d}_{k}^{T}{\bf g}({\bf x}_{k})/{\bf d}_{k}^{T}{\bi H}{\bf d}_{k}];

    • (b) [{\bf x}_{k+1}={\bf x}_{k}+\alpha _{k}{\bf d}_{k}];

    • (c) [\gamma _{k}={\bf g}({\bf x}_{k+1})^{T}{\bf g}({\bf x}_{k+1})/{\bf g}({\bf x}_{k})^{T}{\bf g}({\bf x}_{k})];

    • (d) [{\bf d}_{k+1}=-{\bf g}({\bf x}_{k+1})+\gamma _{k}{\bf d}_{k}].

This algorithm finds the exact solution for the quadratic function in not more than p steps.
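To make the steps concrete, here is a minimal sketch of the quadratic algorithm in Python with NumPy (the code and the function name conjugate_gradient_quadratic are illustrative and not part of the original text); H, b and x0 correspond to the quantities in equations (8.1.5.1) and (8.1.5.2).

    import numpy as np

    def conjugate_gradient_quadratic(H, b, x0):
        """Minimize S(x) = (1/2) x^T H x - b^T x, equation (8.1.5.1),
        for symmetric positive-definite H, by steps (1)-(2) above."""
        x = np.asarray(x0, dtype=float).copy()
        g = H @ x - b                          # gradient, equation (8.1.5.2)
        d = -g                                 # step (1): initial direction
        for _ in range(len(b)):                # step (2): at most p iterations
            if np.linalg.norm(g) == 0.0:       # already at the minimum
                break
            alpha = -(d @ g) / (d @ (H @ d))   # step (2a)
            x = x + alpha * d                  # step (2b)
            g_new = H @ x - b
            gamma = (g_new @ g_new) / (g @ g)  # step (2c)
            d = -g_new + gamma * d             # step (2d)
            g = g_new
        return x

For example, with H = [[4, 1], [1, 3]], b = [1, 2] and x0 = (0, 0), two iterations reproduce, to rounding error, the solution of Hx = b, consistent with the p-step property.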

This algorithm cannot be used directly for the nonlinear case because it requires H to compute [\alpha _k], and the goal is to solve the problem without computing the Hessian. To accomplish this, the exact computation of α is replaced by an actual line search, and the termination after at most p steps is replaced by a convergence test. Thus, we obtain, for a given starting value [{\bf x}_0] and a general, nonquadratic function S:

  • (1) define [{\bf d}_{0}=-{\bf g}({\bf x}_{0})];

  • (2) set k = 0;

  • (3) do until convergence

    • (a) [{\bf x}_{k+1}={\bf x}_{k}+\alpha _{k}{\bf d}_{k}], where [\alpha _{k}] is chosen by a line search;

    • (b) [\gamma _{k}={\bf g}({\bf x}_{k+1})^{T}{\bf g}({\bf x}_{k+1})/{\bf g}({\bf x}_{k})^{T}{\bf g}({\bf x}_{k})];

    • (c) [{\bf d}_{k+1}=-{\bf g}({\bf x}_{k+1})+\gamma _{k}{\bf d}_{k}];

    • (d) [k=k+1].

Note that, as promised, H is not needed. In practice, it has been observed that the line search need not be exact, but that periodic restarts in the steepest-descent direction are often helpful. This procedure often requires more iterations and function evaluations than methods that store approximate Hessians, but the cost per iteration is small. Thus, it is often the overall least-expensive method for large problems.
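As an illustrative sketch only (not the text's own implementation), the nonlinear algorithm can be written as follows; the backtracking (Armijo) line search, the restart interval of p iterations, and the tolerance are assumed choices.

    import numpy as np

    def conjugate_gradient_nonlinear(S, grad, x0, tol=1e-8, max_iter=1000):
        """Minimize a general function S with gradient `grad` by steps
        (1)-(3) above, using an inexact (backtracking) line search and a
        periodic restart in the steepest-descent direction."""
        x = np.asarray(x0, dtype=float).copy()
        p = x.size
        g = grad(x)
        d = -g                                   # step (1)
        for k in range(max_iter):                # step (3): do until convergence
            if np.linalg.norm(g) < tol:
                break
            # (a) inexact line search: halve alpha until S decreases enough
            alpha, s0, slope = 1.0, S(x), g @ d  # slope < 0 for a descent direction
            while S(x + alpha * d) > s0 + 1e-4 * alpha * slope and alpha > 1e-12:
                alpha *= 0.5
            x = x + alpha * d
            g_new = grad(x)
            if (k + 1) % p == 0:                 # periodic steepest-descent restart
                d = -g_new
            else:
                gamma = (g_new @ g_new) / (g @ g)   # (b)
                d = -g_new + gamma * d              # (c)
            g = g_new                               # (d) k = k + 1 via the loop
        return x

Note that neither the Hessian nor an approximation to it is stored; only the current gradient, the previous gradient and the search direction are kept.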

For the least-squares problem, recall that we are finding the minimum of [S({\bf x})=(1/2)[{\bf y}^{\prime }-{\bi Z}{\bf x}]^T[{\bf y}^{\prime }-{\bi Z}{\bf x}], \eqno (8.1.5.3)] for which [{\bf g}({\bf x})={\bi Z}^T({\bi Z}{\bf x}-{\bf y}^{\prime }). \eqno (8.1.5.4)] By using these definitions in the conjugate-gradient algorithm, it is possible to formulate a specific algorithm for linear least squares that requires only the calculation of Z times a vector and Z^T times a vector, and never requires the calculation or factorization of Z^TZ.
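A hedged sketch of such a linear least-squares conjugate-gradient routine is given below; it uses Z only through the products Z d and Z^T v, so a sparse or operator representation of Z could be substituted, and the tolerance and iteration limit are illustrative choices, not prescribed by the text.

    import numpy as np

    def cg_least_squares(Z, y, x0=None, tol=1e-10, max_iter=None):
        """Minimize (1/2)||y - Z x||^2, equation (8.1.5.3), using only the
        products Z d and Z^T v; Z^T Z is never formed or factorized."""
        n = Z.shape[1]
        x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
        g = Z.T @ (Z @ x - y)              # gradient, equation (8.1.5.4)
        d = -g
        max_iter = n if max_iter is None else max_iter
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            Zd = Z @ d                     # the Hessian Z^T Z enters only as (Zd)^T(Zd)
            alpha = -(d @ g) / (Zd @ Zd)
            x = x + alpha * d
            g_new = Z.T @ (Z @ x - y)
            gamma = (g_new @ g_new) / (g @ g)
            d = -g_new + gamma * d
            g = g_new
        return x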

In practice, such an algorithm will, due to roundoff error, sometimes require more than p iterations to reach a solution. A detailed examination of the performance of the procedure shows, however, that fewer than p iterations will be required if the eigenvalues of Z^TZ are bunched, that is, if there are sets of multiple eigenvalues. Specifically, if the eigenvalues are bunched into k distinct sets, then the conjugate-gradient method will converge in k iterations. Thus, significant improvements can be made if the problem can be transformed to one with bunched eigenvalues. Such a transformation leads to the so-called preconditioned conjugate-gradient method. In order to analyse the situation, let C be a p × p matrix that transforms the variables, such that [{\bf x}^{\prime }={\bi C}{\bf x}. \eqno (8.1.5.5)] Then, [{\bf y}^{\prime }-{\bi Z}{\bf x}={\bf y}^{\prime }-{\bi ZC}^{-1}{\bf x}^{\prime}. \eqno (8.1.5.6)] Therefore, C should be such that the system Cx = x′ is easy to solve, and [({\bi ZC}^{-1})^T{\bi ZC}^{-1}] has bunched eigenvalues. The ideal choice would be C = R, where R is the upper triangular factor of the QR decomposition, since [{\bi ZR}^{-1}={\bi Q}_{{\bi Z}}], so that [{\bi Q}_{{\bi Z}}^T{\bi Q}_{{\bi Z}}={\bi I}] has all of its eigenvalues equal to one, and, since R is triangular, the system is easy to solve. If R were known, however, the problem would already be exactly solved, so this is not a useful alternative. Unfortunately, no universal best choice seems to exist, but one approach is to choose a sparse approximation to R by ignoring rows that cause too much fill-in, or by taking C to be a diagonal matrix whose elements are the Euclidean norms of the columns of Z. Bear in mind that, in the nonlinear case, an expensive computation to choose C in the first iteration may work very well in subsequent iterations with no further expense. One should be aware of the trade-off between the extra work per iteration of the preconditioned conjugate-gradient method and the reduction in the number of iterations. This is especially important in nonlinear problems.
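As an illustration of the last of these choices, the following sketch (reusing the cg_least_squares routine from the previous example; the function name preconditioned_cg_least_squares is illustrative, not the text's) scales each column of Z to unit Euclidean norm, which is the diagonal C described above.

    import numpy as np

    # relies on cg_least_squares from the previous sketch
    def preconditioned_cg_least_squares(Z, y, tol=1e-10):
        """Diagonal (column-norm) preconditioning: work in the scaled
        variables x' = C x with C = diag(||column_j of Z||), so that
        Z C^{-1} has unit-norm columns, then recover x = C^{-1} x'."""
        c = np.linalg.norm(Z, axis=0)      # Euclidean norms of the columns of Z
        c[c == 0.0] = 1.0                  # guard against all-zero columns
        Zs = Z / c                         # Z C^{-1}: each column divided by its norm
        x_prime = cg_least_squares(Zs, y, tol=tol)
        return x_prime / c                 # x = C^{-1} x' (trivial, since C is diagonal)

Because C is diagonal, the system Cx = x′ is solved by a single division per component, and the only extra work per iteration relative to the unpreconditioned routine is the initial column scaling.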

The solution of large, least-squares problems is currently an active area of research, and we have certainly not given an exhaustive list of methods in this chapter. The choice of method or approach for any particular problem is dependent on many conditions. Some of these are:

  • (1) The size of the problem. Clearly, as computer memories continue to grow, the boundary between small and large problems also grows. Nevertheless, even if a problem can fit into memory, its sparsity structure may be exploited in order to obtain a more efficient algorithm.

  • (2) The number of times the problem (or similar ones) will be solved. If it is a one-shot problem (a rare occurrence), then one is usually most strongly influenced by easy-to-use, existing software. Exceptions, of course, exist where even a single solution of the problem requires extreme care.

  • (3) The expense of evaluating the function. With a complicated, nonlinear function like the structure-factor formula, the computational effort to determine the values of the function and its derivatives usually greatly exceeds that required to solve the linearized problem. Therefore, a full Gauss–Newton, trust-region, or quasi-Newton method may be warranted.

  • (4) Other structure in the problem. Rarely does a problem have a random sparsity pattern. Non-zero values usually occur in blocks or in some regular pattern for which special decomposition methods can be devised.

  • (5) The machine on which the problem is to be solved. We have said nothing about the existing vector and parallel processors. Suffice it to say that the most efficient procedure for a serial machine may not be the right algorithm for one of these novel machines. Appropriate numerical methods for such architectures are also being actively investigated.







































