Methods for nonlinear least squares

Prince, E.; Boggs, P. T.

doi:10.1107/97809553602060000609

International
Tables for
Crystallography
Volume C
Mathematical, physical and chemical tables
Edited by E. Prince

pdf | chapter contents | chapter index | related articles

International Tables for Crystallography (2006). Vol. C. ch. 8.1, pp. 682-685

Section 8.1.4. Methods for nonlinear least squares

E. Prince^a and P. T. Boggs^b

^a NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and ^bScientific Computing Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

8.1.4. Methods for nonlinear least squares

| top | pdf |

Recall (equation 8.1.2.1) that the general, nonlinear problem can be stated in the form: find the minimum of $[S({\bf x})=\sum _{i=1}^n w_i [y_i-M_i({\bf x})]^2, \eqno (8.1.4.1)]$ where x is a vector of p parameters, and M(x) represents a set of model functions that predict the values of observations, y. In this section, we discuss two useful ways of solving this problem and consider the relative merits of each. The first is based on iteratively linearizing the functions $[M_i({\bf x})]$ and approximating (8.1.4.1) by one of the form in (8.1.2.5). The second uses an unconstrained minimization algorithm, based on Newton's method, to minimize S(x).

8.1.4.1. The Gauss–Newton algorithm

| top | pdf |

Let $[{\bf x}_c]$ be the current approximation to $[\widehat {{\bf x}}]$ , the solution to (8.1.4.1). We construct a linear approximation to $[M_i({\bf x})]$ in the vicinity of $[{\bf x}={\bf x}_c]$ by expanding it in a Taylor series through the linear terms, obtaining $[M_i({\bf x})=M_i({\bf x}_c)+\textstyle\sum\limits_{j=1}^pJ_{ij}({\bf x}-{\bf x}_c)_j, \eqno (8.1.4.2)]$ where $[{\bi J}]$ is the Jacobian matrix, defined by $[J_{ij}=\partial M_i({\bf x})/\partial x_j]$ . A straightforward procedure, known as the Gauss–Newton algorithm, may be formally stated as follows:

(1) compute d as the solution of the linear system $[{\bi J}^{T}{\bi WJ}{\bf d}={\bi J}^{T}{\bi W}[{\bf y}-{\bi M}({\bf x}_{c})];]$
(2) set x_c = x_c+ d;
(3) if not converged (see Subsection 8.1.4.4), go to (1), else stop.

The convergence rate of the Gauss–Newton algorithm depends on the size of the residual, that is, on $[S(\widehat {{\bf x}})]$ . If $[S(\widehat {{\bf x}})=0]$ , then the convergence rate is quadratic; if it is small, then the rate is linear; but if $[S(\widehat {{\bf x}})]$ is large, then the procedure is not locally convergent at all. Fortunately, this procedure can be altered so that it is always locally convergent, and even globally convergent, that is, convergent to a relative minimum from any starting point. There are two possibilities. First, the procedure can be modified to include a line search. This is accomplished by computing the direction d as above and then choosing α such that S(x_c + αd) is sufficiently smaller than $[S({\bf x}_c)]$ . In order to guarantee convergence, one uses the test $[S({\bf x}_c+\alpha {\bf d}) \, \lt \, S({\bf x}_c)-\alpha \gamma {\bf d}^T{\bi J}({\bf x}_c)^T[{\bf y}-{\bi M}({\bf x}_c)], \eqno (8.1.4.3)]$ where, as actually implemented in modern codes, γ has values of the order of 10⁻⁴. [In theory, a slightly stronger condition is necessary, but this is usually avoided by other means. See Dennis & Schnabel (1983) for details.] While this improves the situation dramatically, it still suffers from the deficiency that it is very slow on some problems, and it is undefined if the rank of the Jacobian, $[{\bi J}(\widehat {{\bf x}})]$ , is less than p. J(x) usually has rank p near the solution, but it may be rank deficient far away. Also, it may be `close' to rank deficient, and give numerical difficulties (see Subsection 8.1.3.3).

8.1.4.2. Trust-region methods – the Levenberg–Marquardt algorithm

| top | pdf |

These remaining problems can be addressed by the utilization of a trust-region modification to the basic Gauss–Newton algorithm. For this procedure, we compute the step, d, as the solution to the linear least-squares problem subject to the constraint that $[\left \| {\bf d}\right \| \leq \tau _c]$ , where $[\tau _c]$ is the trust-region radius. Here, the linearized model is modified by admitting that the linearization is only valid within a limited region around the current approximation. It is clear that, if the trust region is sufficiently large, this constrained step will in fact be unconstrained, and the step will be the same as the Gauss–Newton step. If the constraint is active, however, the step has the form $[\lbrack {\bi J}({\bf x}_c)^T{\bi WJ}({\bf x}_c)+\mu {\bi I}]{\bf d}(\mu)={\bi J}({\bf x}_c){\bi W}[{\bf y}-{\bi M}({\bf x}_c)], \eqno (8.1.4.4)]$ where μ is the Lagrange multiplier corresponding to the constraint (see Section 8.3.1 ), that is, μ is the value such that $[\| {\bf d}(\mu)\| =\tau _c]$ . Formula (8.1.4.4) is known as the Levenberg–Marquardt equation. It can be seen from this formula that the step direction is intermediate between the Gauss–Newton direction and the direction of steepest descent, for which reason it is frequently known as ``Marquardt's compromise'' (Draper & Smith, 1981).

Efficient numerical calculation of d for a given value of μ is accomplished by noting that (8.1.4.4) is equivalent to the linear least-squares problem, find the minimum of $[S=\left \| \left ({{\bi J}( {\bf x}_c) \atop \sqrt {\mu }{\bi I}}\right) {\bf d}-\left ( {[{\bf y}-{\bi M}\left ({\bf x}_c\right)] \atop {\bi O}}\right) \right \| ^2. \eqno (8.1.4.5)]$ This problem can be solved by saving the QR factorization for μ = 0 and updating it for various values of μ greater than 0. The actual computation of μ is accomplished by a modified Newton method applied to the constraint equation (see Dennis & Schnabel, 1983, for details). Having calculated the constrained value of d, we set $[{\bf x}_{+}={\bf x}_c+{\bf d}]$ . The algorithm is completed by specifying a procedure for updating the trust-region parameter, τ_c, after each step. This is done by comparing the actual value of $[S({\bf x}_{+})]$ with the predicted value based on the linearization. If there is good agreement between these values, τ is increased. If there is not good agreement, τ is left unchanged, and if $[S({\bf x}_{+}) \, \gt \, S({\bf x}_c)]$ , the step is rejected, and τ is reduced. Global convergence can be shown under reasonable conditions, and very good computational performance has been observed in practice.

8.1.4.3. Quasi-Newton, or secant, methods

| top | pdf |

The second class of methods for solving the general, nonlinear least-squares problem is based on the unconstrained minimization of the function S(x), as defined in (8.1.4.1). Newton's method for solving this problem is derived by constructing a quadratic approximation to S at the current trial point, $[{\bf x}_c]$ , giving $[S_c({\bf d})=S({\bf x}_c)+{\bf g}({\bf x}_c)^T{\bf d}+(1/2){\bf d}^T{\bi H}({\bf x}_c){\bf d}, \eqno (8.1.4.6)]$ from which the Newton step is obtained by minimizing [S_c] with respect to d. Here, $[{\bf g}]$ represents the gradient of S and $[{\bi H}]$ is the Hessian matrix, the p × p symmetric matrix of second partial derivatives, $[H_{jk}=\partial ^2S/\partial x_j\partial x_k]$ . Thus, d is calculated by solving the linear system $[{\bi H}({\bf x}_c){\bf d}=-{\bf g}({\bf x}_c). \eqno (8.1.4.7)]$ [Note: In the literature on optimization, the notation $[\nabla ^2S({\bf x})]$ is often used to denote the Hessian matrix of the function S(x). This should not be confused with the Laplacian operator.] While Newton's method is locally quadratically convergent, it suffers from well known drawbacks. First, it is not globally convergent without employing some form of line search or trust region to safeguard it. Second, it requires the computation of the Hessian of S in each iteration.

The Hessian, however, has the form $[{\bi H}({\bf x}_c)={\bi J}({\bf x}_c)^T{\bi WJ}({\bf x}_c)+{\bi B}({\bf x}_c), \eqno (8.1.4.8)]$ where B(x) is given by $[B({\bf x}_c)_{jk}=\textstyle\sum\limits_{i=1}^nw_i[y_i-M_i({\bf x}_c)]\partial M_i^2({\bf x}_c)/\partial x_j\partial x_k. \eqno (8.1.4.9)]$ The first term of the Hessian, which is dependent only on the Jacobian, is readily available, but B, even if it is available analytically, is, typically, expensive to compute. Furthermore, in some situations, such as in the vicinity of a more symmetric model, H(x_c) may not be positive definite.

In order to overcome these difficulties, various methods have been proposed to approximate H without calculating B. Because these methods are based on approximations to Newton's method, they are often referred to as quasi-Newton methods. Two popular versions use approximations to B that are known as the Davidon–Fletcher–Powell (DFP) update and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, with BFGS being apparently superior in practice. The basic idea of both procedures is to use values of the gradient to update an approximation to the Hessian in such a way as to preserve as much previous information as possible, while incorporating the new information obtained in the most recent step. From (8.1.4.6), the gradient of S_c(d) is $[{\bf g}({\bf x}_c+{\bf d})={\bf g}({\bf x}_c)+{\bi H}({\bf x}_c){\bf d}. \eqno (8.1.4.10)]$ If the true function is not quadratic, or if H is only approximately known, the gradient at the point $[{\bf x}_c+{\bf d}]$ , where d is the solution of (8.1.4.7), will not vanish. We make use of the value of $[{\bf g}({\bf x}_c+{\bf d})]$ to compute a correction to H to give a Hessian that would have predicted what was actually found. That is, find an update matrix, B′, such that $[\lbrack {\bi H}({\bf x}_c)+{\bi B}^{\prime }({\bf x}_c)]{\bf d}={\bf g}({\bf x}_c+{\bf d})-{\bf g}({\bf x}_c). \eqno (8.1.4.11)]$ This is known as the secant, or quasi-Newton, relation. An algorithm based on the BFGS update goes as follows: Given $[{\bf x}_c]$ and $[{\bi H}_c]$ (usually a scaled, identity matrix is used as the initial approximation, in which case the first step is in the steepest descent direction),

(1) solve H_cd = −g(x_c) for the direction d;
(2) compute α such that S(x_c+ αd) satisfies condition (8.1.4.3);
(3) set x_c = x_c + αd;
(4) update H_c by H_c + g(x_c)g(x_c)^T/d^Tg(x_c) + qq^T/αq^Td (see Nash & Sofer, 1995), where q = g(x_c + αd) − g(x_c);
(5) if not converged, go to (1), else stop.

If, in step (2), the additional condition is applied that d^Tg(x_c+ αd) = 0, it becomes an exact line search. This, however, is an unnecessarily severe condition, because it can be shown that, if d^T[g(x_c + αd) − g(x_c)] $[\gt 0]$ , the approximation to the Hessian remains positive definite. The two terms in this expression are values of dS/dα, and the condition states that S does not decrease more rapidly as α increases, which will always be true for some range of values of α $[\gt \, 0]$ . This algorithm is globally convergent; it will find a point at which the gradient vanishes, although that point may be a false minimum. It should be noted, however, that the procedure does not produce a final Hessian matrix whose inverse necessarily has any resemblance to a variance–covariance matrix. To get that it is necessary to compute $[{\bi J}(\widehat {{\bf x}})]$ and from it compute $[({\bi J}^T{\bi WJ})^{-1}]$ .

If a scaled, identity matrix is used as the initial approximation to H, no use is made of the fact that, at least in the vicinity of the minimum, a major part of the Hessian matrix is given by $[{\bi J}({\bf x}_c)^T{\bi WJ}({\bf x}_c)]$ . Thus, if there can be reasonable confidence in the general features of the model, it is useful to use $[{\bi J}({\bf x}_c)^T{\bi WJ}({\bf x}_c)]$ as the initial approximation to H, in which case the first step is in the Gauss–Newton direction. The line-search provision, however, guarantees convergence, and, when n and p are both large, as in many crystallographic applications of least squares, a quasi-Newton update gives an adequate approximation to the new Hessian with a great deal less computation.

Another procedure that makes use of the fact that the linear approximation gives a major part of the Hessian constructs an approximation of the form $[{\bi H}({\bf x})={\bi J}({\bf x})^T{\bi WJ}({\bf x})+{\bi B}({\bf x}), \eqno (8.1.4.12)]$ where B is generated by quasi-Newton methods. The quasi-Newton condition that must be satisfied in this case is $[\lbrack {\bi J}({\bf x}_{+})^T{\bi WJ}({\bf x}_{+})+{\bi B}({\bf x}_{+})]{\bf d}_c={\bf q}_c, \eqno (8.1.4.13)]$ where $[{\bf d}_c={\bf x}_{+}-{\bf x}_c]$ , and $[{\bf q}_c={\bi J}({\bf x}_{+})^T{\bi W}[{\bf y}-{\bi M}({\bf x}_{+})]-{\bi J}({\bf x}_c)^T{\bi W}[{\bf y}-{\bi M}({\bf x}_c)]. ]$ A formula based on the BFGS update (Nash & Sofer, 1995) is $[{\bi B}({\bf x}_{+})={\bi B}({\bf x}_c)-{\bi G}_c{\bf d}_c{\bf d}_c^T{\bi G}_c/{\bf d}_c^T{\bi G}{\bf d}_c+{\bf q}_c{\bf q}_c^T/{\bf q}_c^T{\bf d}_c, \eqno (8.1.4.14)]$ where $[{\bi G}_{c} = {\bi J}({\bf x}_{+})^T{\bi WJ}({\bf x}_{+})+{\bi B}({\bf x}_c)]$ . Since $[{\bi J}({\bf x}_{+})]$ and $[{\bi M}({\bf x}_{+})]$ must be computed anyway, this technique maximizes the use of information with little additional computing. The resulting approximation to the Hessian, however, may not be positive definite, and precautions must be taken to ensure convergence. In the vicinity of a correct solution, where the residuals are small, the addition of B is not likely to help much, and it can be dropped. Far from the solution, however, it can be very helpful. An implementation of this procedure has been described by Bunch, Gay & Welsch (1993); it appears to be at least as efficient as the trust region, Levenberg–Marquardt procedure, and is probably better when residuals are large.

In actual practice, it is not the Hessian matrix itself that is updated, but rather its Cholesky factor (Prince, 1994). This requires approximately the same number of operations and allows the solution of the linear system for computing d in a time that increases in proportion to [p^2] . This strategy also allows the use of the approximate Hessian in convergence checks with no significant computational overhead and no extra storage, as would be required for storing both the Hessian and its inverse.

8.1.4.4. Stopping rules

| top | pdf |

An iterative algorithm must contain some criterion for termination of the iteration process. This is a delicate part of all nonlinear optimization codes, and depends strongly on the scaling of the parameters. Although exceptions exist to almost all reasonable scaling rules, a basic principle is that a unit change in any variable should have approximately the same effect on the sum of squares. Thus, as discussed in Subsection 8.1.3.3 (equation 8.1.3.9), the ideal unit for each parameter is the standard uncertainty of its estimate, which can usually be adequately approximated by the reciprocal of the column norm of J. In modern codes, the user has the option of supplying a diagonal scaling matrix whose elements are the reciprocals of some estimate of a typical `significant' shift in the corresponding parameter.

In principle, the following conditions should hold when the convergence of a well scaled, least-squares procedure has reached its useful limit:

(1) $[S({\bf x}_{c})=S(\widehat {{\bf x}})]$ ;
(2) $[{\bi J}({\bf x}_{c})^{T}{\bi W}[{\bf y}-{\bi M}({\bf x}_{c})]]$ = O;
(3) $[{\bf x}_{c}=\widehat {{\bf x}}]$ .

The actual stopping rules must be chosen relative to the algorithm used (other conditions also exist) and the particular application. It is clear, however, that, since $[\widehat {{\bf x}}]$ is not known, the last two conditions cannot be used as stated. Also, these tests are dependent on the scaling of the problem, and the variables are not related to the sizes of the quantities involved. We present tests, therefore, that are relative error tests that take into account the scaling of the variables.

The general, relative error test is stated as follows. Two scalar quantities, a and b, are said to satisfy a relative error test with respect to a tolerance T if $[{|a-b| \over |a|} \leq T. \eqno (8.1.4.15)]$ Roughly speaking, if T is of the form $[10^{-q}]$ then a and b agree to q digits. Obviously, there is a problem with this test if [a=0] and there will be numerical difficulties if a is close to 0. Thus, in practice, (8.1.4.15) is replaced by $[ |a-b| \leq (|a|+1)T, \eqno (8.1.4.16)]$ which reduces to an absolute error test as $[a \rightarrow 0]$ . A careful examination may be required to set this tolerance correctly, but, typically, if one of the fast, stable algorithms is used, only a few more iterations are necessary to get six or eight digits if one or two are already known. Note also that the actual value depends on $[\varepsilon ]$ , the relative machine precision. It is fruitless to seek more digits of accuracy than are expressed in the machine representation.

A test based on condition (1) is often implemented by using the linear approximation to M or the quadratic approximation to S. Thus, using the quadratic approximation to S, we can compute the predicted reduction by $[\Delta_{{\rm pred}}=S({\bf x}_c)-{\bf d}^T{\bi J}^T{\bi W}[{\bf y}-{\bi M}({\bf x}_c)]. \eqno (8.1.4.17)]$ Similarly, the actual reduction is $[\Delta_{{\rm act}}=S({\bf x}_c)-S({\bf x}_{+}). \eqno (8.1.4.18)]$ The test then becomes $[\Delta_{{\rm pred}}\leq [1+S({\bf x}_c)]T]$ , $[\Delta_{{\rm act}}\leq]$ $[[1+S({\bf x}_c)]T]$ , and $[\Delta_{{\rm act}}\leq 2\Delta_{{\rm pred}}]$ . That is, we want both the predicted and actual reductions to be small and the actual reduction to agree reasonably well with the predicted reduction. A typical value for T should be 10⁻⁴, although the value again depends on $[\varepsilon]$ , and the user is cautioned not to make this tolerance too loose.

For a test on condition (2), we compute the cosine of the angle between the vector of residuals and the linear subspace spanned by the columns of J, $[\cos \zeta =\{{\bf d}^T{\bi J}^T{\bi W}[{\bf y}-{\bi M}({\bf x}_{+})]\}/[({\bf d}^T{\bi J}^T{\bi WJ}{\bf d})S({\bf x}_{+})]^{1/2}. \eqno (8.1.4.19)]$ The test is $[\cos \zeta \leq T]$ , where, again, T should be 10⁻⁴ or smaller.

Test 3 above is usually only present to prevent the process from continuing when almost nothing is happening. Clearly, we do not know $[\widehat {{\bf x}}]$ , thus the test is typically that corresponding elements of $[{\bf x}_{+}]$ and $[{\bf x}_c]$ satisfy (8.1.4.16), where T is chosen to be 10^−q. A recommended value of q is half the number of digits carried in the computation, e.g. q = 8 for standard 64-bit (double-precision or 16 digit) calculations. Sometimes, the relative error test is of the form $[|({\bf x}_{+})_j-({\bf x}_c)_j|/\sigma _j \leq T]$ , where σ_j is the standard uncertainty computed from the inverse Hessian in the last iteration. Although this test has some statistical validity, it is quite expensive and usually not worth the work involved to compute.

8.1.4.5. Recommendations

| top | pdf |

One situation in which the Gauss–Newton algorithm behaves particularly poorly is in the vicinity of a saddle point in parameter space, where the true Hessian matrix is not positive definite. This occurs in structure refinement where a symmetric model is refined to convergence and then is replaced by a less symmetric model. The hypersurface of S will have negative curvature in a finite sized region of the parameter space for the less-symmetric model, and it is essential to use a safeguarded algorithm, one that incorporates a line search or a trust region, in order to get out of that region.

On the basis of this discussion, we can draw the following conclusions:

(1) In cases where the fit is poor, owing to an incomplete model or in the initial stages of refinement, methods based on the quadratic approximation to S (quasi-Newton methods) often perform better. This is particularly important when the model is close to a more symmetric configuration. These methods are more expensive per iteration and generally require more storage, but their greater stability in such problems usually justifies the cost.
(2) With small residual problems, where the model is complete and close to the solution, a safeguarded Gauss–Newton method is preferred. The trust-region implementation (Levenberg–Marquardt algorithm) has been very successful in practice.
(3) The best advice is to pick a good implementation of either method and stay with it.

References

Bunch, D. S., Gay, D. M. & Welsch, R. E. (1993). Algorithm 717: subroutines for maximum likelihood and quasi-likelihood estimation of parameters in nonlinear regression models. ACM Trans. Math. Softw. 19, 109–130. Google Scholar

Dennis, J. E. & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice Hall.Google Scholar

Draper, N. & Smith, H. (1981). Applied regression analysis. New York: John Wiley.Google Scholar

Nash, S. & Sofer, A. (1995). Linear and nonlinear programming. New York: McGraw-Hill.Google Scholar

Prince, E. (1994). Mathematical techniques in crystallography and materials science, 2nd ed. Berlin: Springer. Google Scholar

International Tables for Crystallography (2006). Vol. C. ch. 8.1, pp. 682-685