International Tables for Crystallography
Volume C: Mathematical, physical and chemical tables
Edited by E. Prince

International Tables for Crystallography (2006). Vol. C. ch. 8.1, pp. 683-684

Section 8.1.4.3. Quasi-Newton, or secant, methods

E. Prince^a and P. T. Boggs^b

^a NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and ^b Scientific Computing Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

8.1.4.3. Quasi-Newton, or secant, methods


The second class of methods for solving the general, nonlinear least-squares problem is based on the unconstrained minimization of the function S(x), as defined in (8.1.4.1). Newton's method for solving this problem is derived by constructing a quadratic approximation to S at the current trial point, {\bf x}_c, giving

S_c({\bf d}) = S({\bf x}_c) + {\bf g}({\bf x}_c)^T{\bf d} + (1/2)\,{\bf d}^T{\bi H}({\bf x}_c)\,{\bf d}, \eqno(8.1.4.6)

from which the Newton step is obtained by minimizing S_c with respect to {\bf d}. Here, {\bf g} represents the gradient of S and {\bi H} is the Hessian matrix, the p × p symmetric matrix of second partial derivatives, H_{jk} = \partial^2 S/\partial x_j\,\partial x_k. Thus, {\bf d} is calculated by solving the linear system

{\bi H}({\bf x}_c)\,{\bf d} = -{\bf g}({\bf x}_c). \eqno(8.1.4.7)

[Note: In the literature on optimization, the notation \nabla^2 S({\bf x}) is often used to denote the Hessian matrix of the function S(x). This should not be confused with the Laplacian operator.] While Newton's method is locally quadratically convergent, it suffers from well known drawbacks. First, it is not globally convergent without employing some form of line search or trust region to safeguard it. Second, it requires the computation of the Hessian of S in each iteration.
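
As an illustration only (not part of the original text), the Newton step of (8.1.4.7) can be written in a few lines of Python; the callables grad_S and hess_S, assumed to return the gradient g(x) and the p × p Hessian H(x) of S, are placeholders.

```python
import numpy as np

def newton_step(x_c, grad_S, hess_S):
    """Newton direction from (8.1.4.7): solve H(x_c) d = -g(x_c)."""
    g = grad_S(x_c)   # gradient of S at the current trial point
    H = hess_S(x_c)   # p x p symmetric Hessian of S
    return np.linalg.solve(H, -g)
```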

The Hessian, however, has the form

{\bi H}({\bf x}_c) = {\bi J}({\bf x}_c)^T{\bi W}{\bi J}({\bf x}_c) + {\bi B}({\bf x}_c), \eqno(8.1.4.8)

where B(x) is given by

B({\bf x}_c)_{jk} = \textstyle\sum\limits_{i=1}^n w_i\,[y_i - M_i({\bf x}_c)]\,\partial^2 M_i({\bf x}_c)/\partial x_j\,\partial x_k. \eqno(8.1.4.9)

The first term of the Hessian, which depends only on the Jacobian, is readily available, but B, even if it is available analytically, is typically expensive to compute. Furthermore, in some situations, such as in the vicinity of a more symmetric model, H({\bf x}_c) may not be positive definite.
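
The two terms of (8.1.4.8)-(8.1.4.9) might be computed as in the following sketch, which assumes a diagonal weight matrix W = diag(w), an n × p Jacobian J, and (for the second term) an array of second-derivative matrices of the model; the function names are illustrative, not from the text.

```python
import numpy as np

def gauss_newton_term(J, w):
    """Jacobian-only term of (8.1.4.8): J^T W J, with W = diag(w)."""
    return J.T @ (w[:, None] * J)

def residual_term(w, residuals, d2M):
    """Term (8.1.4.9): sum_i w_i [y_i - M_i] d^2M_i/dx_j dx_k,
    with residuals[i] = y_i - M_i(x_c) and d2M of shape (n, p, p)."""
    return np.einsum('i,i,ijk->jk', w, residuals, d2M)
```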

In order to overcome these difficulties, various methods have been proposed that approximate H without calculating B. Because these methods are based on approximations to Newton's method, they are often referred to as quasi-Newton methods. Two popular versions use approximations to B known as the Davidon–Fletcher–Powell (DFP) update and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, the BFGS update being apparently superior in practice. The basic idea of both procedures is to use values of the gradient to update an approximation to the Hessian in such a way as to preserve as much previous information as possible, while incorporating the new information obtained in the most recent step. From (8.1.4.6), the gradient of S_c({\bf d}) is

{\bf g}({\bf x}_c + {\bf d}) = {\bf g}({\bf x}_c) + {\bi H}({\bf x}_c)\,{\bf d}. \eqno(8.1.4.10)

If the true function is not quadratic, or if H is only approximately known, the gradient at the point {\bf x}_c + {\bf d}, where {\bf d} is the solution of (8.1.4.7), will not vanish. We make use of the value of {\bf g}({\bf x}_c + {\bf d}) to compute a correction to H, giving a Hessian that would have predicted what was actually found. That is, we seek an update matrix, {\bi B}', such that

[{\bi H}({\bf x}_c) + {\bi B}'({\bf x}_c)]\,{\bf d} = {\bf g}({\bf x}_c + {\bf d}) - {\bf g}({\bf x}_c). \eqno(8.1.4.11)

This is known as the secant, or quasi-Newton, relation. An algorithm based on the BFGS update goes as follows (a code sketch of these steps is given after the list). Given {\bf x}_c and {\bi H}_c (usually a scaled identity matrix is used as the initial approximation, in which case the first step is in the steepest-descent direction),

  • (1) solve Hc d = −g(xc) for the direction d;

  • (2) compute α such that S(xc + αd) satisfies condition (8.1.4.3);

  • (3) set xc = xc + αd;

  • (4) update Hc to Hc + g(xc)g(xc)T/[dTg(xc)] + qqT/(qTd) (see Nash & Sofer, 1995), where q = g(xc + αd) − g(xc);

  • (5) if not converged, go to (1), else stop.
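
The five steps above might be rendered roughly as in the following sketch, under assumptions that are not in the text: user-supplied callables S and grad, and a simple backtracking search standing in for condition (8.1.4.3). It is a sketch, not a reference implementation.

```python
import numpy as np

def bfgs_minimize(S, grad, x_c, H_c=None, gtol=1e-8, max_iter=200):
    """Sketch of steps (1)-(5); backtracking stands in for (8.1.4.3)."""
    x = np.asarray(x_c, dtype=float)
    H = np.eye(x.size) if H_c is None else np.array(H_c, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < gtol:            # step (5): gradient has vanished
            break
        d = np.linalg.solve(H, -g)              # step (1): H_c d = -g(x_c)
        S0, alpha = S(x), 1.0
        for _ in range(30):                     # step (2): crude backtracking search
            if S(x + alpha * d) <= S0 + 1e-4 * alpha * (d @ g):
                break
            alpha *= 0.5
        q = grad(x + alpha * d) - g             # q = g(x_c + alpha d) - g(x_c)
        if q @ d > 0:                           # keeps the updated H positive definite
            H = (H + np.outer(g, g) / (d @ g)
                   + np.outer(q, q) / (q @ d))  # step (4): BFGS update
        x = x + alpha * d                       # step (3)
    return x, H
```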

If, in step (2), the additional condition is applied that dTg(xc + αd) = 0, the search becomes an exact line search. This, however, is an unnecessarily severe condition, because it can be shown that, if dT[g(xc + αd) − g(xc)] > 0, the approximation to the Hessian remains positive definite. The two terms in this expression are values of dS/dα, and the condition states that S does not decrease more rapidly as α increases, which will always be true for some range of values of α > 0. This algorithm is globally convergent; it will find a point at which the gradient vanishes, although that point may be a false minimum. It should be noted, however, that the procedure does not produce a final Hessian matrix whose inverse necessarily has any resemblance to a variance–covariance matrix. To obtain that, it is necessary to compute {\bi J}(\widehat{\bf x}) and from it compute ({\bi J}^T{\bi W}{\bi J})^{-1}.
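
The variance–covariance estimate mentioned at the end of the preceding paragraph is simply ({\bi J}^T{\bi W}{\bi J})^{-1} evaluated at the converged point; a minimal sketch, again assuming a diagonal W given by weights w, follows.

```python
import numpy as np

def variance_covariance(J_hat, w):
    """(J^T W J)^(-1) from the Jacobian at the converged point; the final
    quasi-Newton approximation to H is not a substitute for this."""
    JTWJ = J_hat.T @ (w[:, None] * J_hat)
    return np.linalg.inv(JTWJ)
```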

If a scaled identity matrix is used as the initial approximation to H, no use is made of the fact that, at least in the vicinity of the minimum, a major part of the Hessian matrix is given by {\bi J}({\bf x}_c)^T{\bi W}{\bi J}({\bf x}_c). Thus, if there can be reasonable confidence in the general features of the model, it is useful to use {\bi J}({\bf x}_c)^T{\bi W}{\bi J}({\bf x}_c) as the initial approximation to H, in which case the first step is in the Gauss–Newton direction. The line-search provision still guarantees convergence, and, when n and p are both large, as in many crystallographic applications of least squares, a quasi-Newton update gives an adequate approximation to the new Hessian with a great deal less computation.
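
As a brief illustration, with synthetic numbers that are not from the text, the Gauss–Newton initialization amounts to supplying J^T W J at the starting point in place of the scaled identity matrix, for example to a routine such as the sketch given after the algorithm list above.

```python
import numpy as np

rng = np.random.default_rng(0)
J0 = rng.normal(size=(50, 3))     # hypothetical Jacobian at the starting point (n=50, p=3)
w = np.ones(50)                   # unit weights, purely for illustration
H0 = J0.T @ (w[:, None] * J0)     # Gauss-Newton initial approximation to H
# H0 can then be passed as the initial H_c, making the first step a Gauss-Newton step.
```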

Another procedure that makes use of the fact that the linear approximation gives a major part of the Hessian constructs an approximation of the form

{\bi H}({\bf x}) = {\bi J}({\bf x})^T{\bi W}{\bi J}({\bf x}) + {\bi B}({\bf x}), \eqno(8.1.4.12)

where B is generated by quasi-Newton methods. The quasi-Newton condition that must be satisfied in this case is

[{\bi J}({\bf x}_{+})^T{\bi W}{\bi J}({\bf x}_{+}) + {\bi B}({\bf x}_{+})]\,{\bf d}_c = {\bf q}_c, \eqno(8.1.4.13)

where {\bf d}_c = {\bf x}_{+} - {\bf x}_c and

{\bf q}_c = {\bi J}({\bf x}_{+})^T{\bi W}[{\bf y} - {\bi M}({\bf x}_{+})] - {\bi J}({\bf x}_c)^T{\bi W}[{\bf y} - {\bi M}({\bf x}_c)].

A formula based on the BFGS update (Nash & Sofer, 1995) is

{\bi B}({\bf x}_{+}) = {\bi B}({\bf x}_c) - {\bi G}_c{\bf d}_c{\bf d}_c^T{\bi G}_c/{\bf d}_c^T{\bi G}_c{\bf d}_c + {\bf q}_c{\bf q}_c^T/{\bf q}_c^T{\bf d}_c, \eqno(8.1.4.14)

where {\bi G}_c = {\bi J}({\bf x}_{+})^T{\bi W}{\bi J}({\bf x}_{+}) + {\bi B}({\bf x}_c). Since {\bi J}({\bf x}_{+}) and {\bi M}({\bf x}_{+}) must be computed anyway, this technique maximizes the use of information with little additional computation. The resulting approximation to the Hessian, however, may not be positive definite, and precautions must be taken to ensure convergence. In the vicinity of a correct solution, where the residuals are small, the addition of B is not likely to help much, and it can be dropped. Far from the solution, however, it can be very helpful. An implementation of this procedure has been described by Bunch, Gay & Welsch (1993); it appears to be at least as efficient as the trust-region (Levenberg–Marquardt) procedure, and is probably better when residuals are large.
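
The structured update (8.1.4.14) might be sketched as follows, assuming a diagonal W = diag(w) and with q_c computed according to its definition above; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def structured_bfgs_update(B_c, J_plus, w, d_c, q_c):
    """BFGS-style update (8.1.4.14) of the correction term B,
    with G_c = J(x_+)^T W J(x_+) + B(x_c) and W = diag(w)."""
    G_c = J_plus.T @ (w[:, None] * J_plus) + B_c
    Gd = G_c @ d_c                               # G_c d_c
    return (B_c
            - np.outer(Gd, Gd) / (d_c @ Gd)      # - G_c d_c d_c^T G_c / (d_c^T G_c d_c)
            + np.outer(q_c, q_c) / (q_c @ d_c))  # + q_c q_c^T / (q_c^T d_c)
```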

In actual practice, it is not the Hessian matrix itself that is updated, but rather its Cholesky factor (Prince, 1994). This requires approximately the same number of operations and allows the solution of the linear system for computing d in a time that increases in proportion to p^2. This strategy also allows the use of the approximate Hessian in convergence checks with no significant computational overhead and no extra storage, as would be required for storing both the Hessian and its inverse.
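
The O(p^2) solve that motivates storing the Cholesky factor can be illustrated with two triangular solves; the sketch below assumes the upper-triangular factor R, with H = R^T R, is already available (SciPy is used here only for the triangular solver).

```python
import numpy as np
from scipy.linalg import solve_triangular

def solve_from_cholesky(R, g):
    """Given the upper-triangular factor R (H = R^T R), obtain d from
    H d = -g with two O(p^2) triangular solves instead of refactoring H."""
    y = solve_triangular(R, -g, trans='T', lower=False)  # solve R^T y = -g
    return solve_triangular(R, y, lower=False)           # solve R d = y
```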

References

Bunch, D. S., Gay, D. M. & Welsch, R. E. (1993). Algorithm 717: subroutines for maximum likelihood and quasi-likelihood estimation of parameters in nonlinear regression models. ACM Trans. Math. Softw. 19, 109–130.
Nash, S. & Sofer, A. (1995). Linear and nonlinear programming. New York: McGraw-Hill.
Prince, E. (1994). Mathematical techniques in crystallography and materials science, 2nd ed. Berlin: Springer.







































