International Tables for Crystallography Volume C: Mathematical, physical and chemical tables. Edited by E. Prince. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. C, ch. 8.1, pp. 678–688. https://doi.org/10.1107/97809553602060000609

Chapter 8.1. Least squares

Chapter 8.1 summarizes the mathematics of both the statistical and the numerical-analysis aspects of data fitting by linear and nonlinear least squares. It includes discussions of various algorithms for the solution of the least-squares problem and, for the nonlinear case, of the stability and convergence properties of different procedures. Finally, it includes information on the availability of existing high-quality computer software.

Keywords: conditioning; conjugate-gradient methods; Gauss–Newton algorithm; large-scale problems; least-squares calculations; Levenberg–Marquardt algorithm; linear algebra; non-linear least squares; normal equations; numerical methods; quasi-Newton methods; secant methods; sparse matrices; statistics; stopping rules; trust-region methods.
The process of arriving at a model for a crystal structure may usefully be considered to consist of two distinct stages. The first, which may be called determination, involves the use of chemical and physical intuition, direct methods, Fourier and Patterson methods, and other techniques to arrive at an approximate model for the structure that incorporates unit-cell dimensions, space group, chemical composition, and information with respect to the immediate environment of each atom. The second stage, which we shall call refinement, involves finding the values of adjustable parameters in the model that give the best fit between the predicted diffraction intensities and those observed in an experiment, in order to extract precise information about interatomic distances and bond angles, thermal motion, site occupancies, electron distribution, and so forth. Although there are several different criteria for the best fit to data, such as maximum likelihood and maximum entropy, one of the most commonly used is the method of least squares. This chapter discusses both numerical and statistical aspects of refinement by the method of least squares. Because both aspects make extensive use of linear algebra, we begin with a summary of definitions and fundamental operations in linear algebra (Stewart, 1973; Prince, 1994) and of basic definitions and concepts in mathematical statistics (Draper & Smith, 1981; Box, Hunter & Hunter, 1978). We then discuss the principles of linear and nonlinear least squares and conclude with an extensive discussion of numerical methods used in practical implementation of the technique.
A matrix is an ordered, rectangular array of numbers, real or complex. Matrices will be denoted by upper-case, bold italic letters, A. Their individual elements will be denoted by upper-case, italic letters with subscripts. Aij denotes the element in the ith row and the jth column of A. A matrix with only one row is a row vector; a matrix with only one column is a column vector. Vectors will be denoted by lower-case, bold roman letters, and their elements will be denoted by lower-case, italic letters with single subscripts. Scalar constants will usually be denoted by lower-case, Greek letters.
A matrix with the same number of rows as columns is square. If Aij = 0 for all i > j, A is upper triangular. If Aij = 0 for all i < j, A is lower triangular. If Aij = 0 for all i ≠ j, A is diagonal. If Aij = 0 for all i and j, A is null. A matrix, B, such that Bij = Aji for all i and j is the transpose of A, and is denoted by AT. Matrices with the same dimensions may be added and subtracted: (A + B)ij = Aij + Bij. A matrix may be multiplied by a scalar: (αA)ij = αAij. Multiplication of matrices is defined by (AB)ij = Σk AikBkj, where the sum runs from k = 1 to m, and m is the number of columns of A and the number of rows of B (which must be equal). Addition and multiplication of matrices obey the associative law: (A + B) + C = A + (B + C); (AB)C = A(BC). Multiplication of matrices obeys the distributive law: A(B + C) = AB + AC. Addition of matrices obeys the commutative law: A + B = B + A, but multiplication, except in certain (important) special cases, does not: AB ≠ BA. The transpose of a product is the product of the transposes of the factors in reverse order: (AB)T = BTAT.
The trace of a square matrix is the sum of its diagonal elements. The determinant of an n × n square matrix, A, denoted by |A|, is the sum of n! terms, each of which is a product of the diagonal elements of a matrix derived from A by permuting columns or rows (see Stewart, 1973). The rank of a matrix (not necessarily square) is the dimension of the largest square submatrix that can be formed from it, by selecting rows and columns, whose determinant is not equal to zero. A matrix has full column rank if its rank is equal to its number of columns. A square matrix whose diagonal elements are equal to one and whose off-diagonal elements are equal to zero is an identity matrix, denoted by I. If |A| ≠ 0, A is nonsingular, and there exists a matrix A−1, the inverse of A, such that AA−1 = A−1A = I. If |A| = 0, A is singular, and has no inverse. The adjoint, or conjugate transpose, of A is a matrix, A†, such that (A†)ij = A*ji, where the asterisk indicates complex conjugate. If A†A = I, A is unitary. If the elements of a unitary matrix are real, it is orthogonal. From this definition, if A is orthogonal, it follows that Σi Aij2 = 1 for all j, and Σi AijAik = 0 if j ≠ k. By analogy, two column vectors, x and y, are said to be orthogonal if xTy = 0.
For any square matrix, A, there exists a set of vectors, vi, such that Avi = λivi, where λi is a scalar. The values λi are the eigenvalues of A, and the vectors vi are the corresponding eigenvectors. If A† = A, A is Hermitian, and, if the elements are real, A = AT, so that A is symmetric. It can be shown (see, for example, Stewart, 1973) that, if A is Hermitian, all eigenvalues are real, and there exists a unitary matrix, T, such that D = T†AT is diagonal, with the elements of D equal to the eigenvalues of A, and the columns of T are the eigenvectors. An n × n symmetric matrix therefore has n mutually orthogonal eigenvectors. If the product xTAx is greater than (or equal to) zero for any non-null vector, x, A is positive (semi)definite. Because x may be, in particular, an eigenvector, all eigenvalues of a positive (semi)definite matrix are greater than (or equal to) zero. Any matrix of the form BTB is positive semidefinite, and, if B has full column rank, A = BTB is positive definite. If A is positive definite, there exists an upper triangular matrix, R, or, equivalently, a lower triangular matrix, L, with positive diagonal elements, such that RTR = LLT = A. R, or L, is called the Cholesky factor of A. The magnitude, length or Euclidean norm of a vector, x, denoted by ||x||, is defined by ||x|| = (xTx)1/2. The induced matrix norm of a matrix, B, denoted ||B||, is defined as the maximum value of ||Bx|| for ||x|| = 1. Because xTBTBx will have its maximum value for a fixed value of xTx when x is parallel to the eigenvector that corresponds to the largest eigenvalue of BTB, this definition implies that ||B|| is equal to the square root of the largest eigenvalue of BTB. The condition number of B is the square root of the ratio of the largest and smallest eigenvalues of BTB. (Other definitions of norms exist, with corresponding definitions of condition number. We shall not be concerned with any of these.)
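As a small numerical illustration of these definitions (a sketch assuming only the NumPy library and an arbitrarily chosen matrix B), the Cholesky factor, the Euclidean norm, the induced matrix norm and the condition number can be computed as follows:

```python
import numpy as np

# Arbitrary matrix B with full column rank (illustrative data only).
B = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 1.0]])

A = B.T @ B                        # A = B^T B is positive definite
R = np.linalg.cholesky(A).T        # upper triangular Cholesky factor: R^T R = A

x = np.array([3.0, 4.0])
norm_x = np.sqrt(x @ x)            # Euclidean norm ||x|| = (x^T x)^(1/2) = 5.0

eig = np.linalg.eigvalsh(A)        # eigenvalues of B^T B (all positive here)
norm_B = np.sqrt(eig.max())        # induced norm ||B||
cond_B = np.sqrt(eig.max() / eig.min())   # condition number of B

print(R)
print(norm_x, norm_B, cond_B)
```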
We shall make extensive use of the so-called QR decomposition, which is defined as follows: For any real n × p (n ≥ p) matrix, Z, there exists an n × n orthogonal matrix, Q, such that

QTZ = (R; O),

where R is a p × p upper triangular matrix, O denotes an (n − p) × p null matrix, and the notation (R; O) indicates that R is stacked above O. Thus, we have

Z = Q(R; O),   (8.1.1.2)

which is known as the QR decomposition of Z. If we partition Q as Q = (QZ, Q⊥), where QZ has dimensions n × p, and Q⊥ has dimensions n × (n − p), (8.1.1.2) becomes

Z = QZR,

which is known as the QR factorization. We shall make use of the following facts. First, R is nonsingular if and only if the columns of Z are linearly independent; second, the columns of QZ form an orthonormal basis for the range space of Z, that is, they span the same space as the columns of Z; and, third, the columns of Q⊥ form an orthonormal basis for the null space of ZT, that is, ZTQ⊥ = O.
There are two common procedures for computing the QR factorization. The first makes use of Householder transformations, which are defined by

H = I − 2xxT,

where xTx = 1. H is symmetric, and H2 = I, so that H is orthogonal. In three dimensions, H corresponds to a reflection in a mirror plane perpendicular to x, because of which Stewart (1973) has suggested the alternative term elementary reflector. A vector v is transformed by Hv into the vector ||v||e, where e represents a vector with e1 = 1, and ei = 0 for i ≠ 1, if x = (v − ||v||e)/||v − ||v||e||. The factorization procedure for an n × p matrix, A (Stewart, 1973; Anderson et al., 1992), takes as v in the first step the first column of A, and forms A1 = H1A, which has zeros in all elements of the first column below the diagonal. In the second step, v has a zero as the first element and is filled out by those elements of the second column of A1 on or below the diagonal. A2 = H2A1 then has zeros in all elements below the diagonal in the first two columns. This process is repeated (p − 2) more times, after which QT = Hp···H2H1, and R = Ap is upper triangular.
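The following sketch of the Householder procedure is illustrative only; it assumes NumPy and uses the numerically stable sign convention for the reflector rather than the literal construction x = (v − ||v||e)/||v − ||v||e|| quoted above. In practice a library routine, such as numpy.linalg.qr or the LAPACK routines of Anderson et al. (1992), would be used.

```python
import numpy as np

def householder_qr(A):
    """QR factorization of an n x p matrix by successive Householder reflections."""
    n, p = A.shape
    R = A.astype(float).copy()
    Q = np.eye(n)
    for k in range(p):
        v = R[k:, k].copy()
        norm_v = np.linalg.norm(v)
        if norm_v == 0.0:
            continue
        # u = v + sign(v1)||v||e: stable variant of x = (v - ||v||e)/||v - ||v||e||.
        u = v.copy()
        u[0] += norm_v if v[0] >= 0.0 else -norm_v
        x = u / np.linalg.norm(u)
        Hk = np.eye(n - k) - 2.0 * np.outer(x, x)   # H = I - 2 x x^T
        R[k:, :] = Hk @ R[k:, :]                    # zero column k below the diagonal
        Q[:, k:] = Q[:, k:] @ Hk                    # accumulate Q = H1 H2 ... Hp
    return Q, R                                     # A = Q R, with Q orthogonal

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
print(np.round(R, 10))     # zeros below the diagonal in the first p columns
```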
QR factorization by Householder transformations requires for efficiency that the entire n × p matrix be stored in memory, and requires of order np2 operations. A procedure that requires storage of only the upper triangle makes use of Givens rotations, which are 2 × 2 matrices of the form

G = (cos θ, sin θ; −sin θ, cos θ).

Multiplication of a two-row matrix, B, by G will put a zero in the element B21 if tan θ = B21/B11. The factorization of A involves reading, or computing, the rows of A one at a time. In the first step, matrix B1 consists of the first row of R and the current row of A, from which the first element is eliminated. In the second step, B2 consists of the second row of R and the (p − 1) non-zero elements of the second row of the transformed B1. After the first p rows have been treated, each additional row of A requires 2p(p + 1) multiplications to fill it with zeros. However, because the operation is easily vectorized, the time required may be a small proportion of the total computing time on a vector-oriented computer.
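A single Givens rotation can be illustrated as follows (a minimal NumPy sketch with arbitrary data; the two rows shown stand for one row of R and one incoming row of A):

```python
import numpy as np

def givens_eliminate(B):
    """Apply one Givens rotation so that element B[1, 0] of a two-row array becomes zero."""
    a, b = B[0, 0], B[1, 0]
    r = np.hypot(a, b)                       # (a^2 + b^2)^(1/2), computed safely
    c, s = (1.0, 0.0) if r == 0.0 else (a / r, b / r)
    G = np.array([[c, s],
                  [-s, c]])                  # rotation with tan(theta) = B21/B11
    return G @ B

B = np.array([[3.0, 1.0, 2.0],               # a row of R
              [4.0, 5.0, 6.0]])              # an incoming row of A (illustrative data)
print(givens_eliminate(B))                   # element (1, 0) is now zero
```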
A probability density function, which will be abbreviated p.d.f., is a function, Φ(x), such that the probability of finding the random variable x in the interval a ≤ x ≤ b is given by

p(a ≤ x ≤ b) = ∫[a, b] Φ(x) dx.

A p.d.f. has the properties Φ(x) ≥ 0 for all x, and ∫[−∞, +∞] Φ(x) dx = 1. A cumulative distribution function, which will be abbreviated c.d.f., is defined by

Ψ(x) = ∫[−∞, x] Φ(t) dt.

The properties of Φ(x) imply that Ψ(x) is a nondecreasing function, with Ψ(−∞) = 0 and Ψ(+∞) = 1, and Φ(x) = dΨ(x)/dx. The expected value of a function, f(x), of random variable x is defined by

⟨f(x)⟩ = ∫[−∞, +∞] f(x)Φ(x) dx.

If f(x) = xn, ⟨xn⟩ is the nth moment of Φ(x). The first moment, often denoted by μ, is the mean of Φ(x). The second moment about the mean, ⟨(x − μ)2⟩, usually denoted by σ2, is the variance of Φ(x). The positive square root of the variance is the standard deviation.
For a vector, x, of random variables, x1, x2, ..., xn, the joint probability density function, or joint p.d.f., is a function, ΦJ(x), such that the probability of finding each element xi in the interval ai ≤ xi ≤ bi is

p = ∫[a1, b1]···∫[an, bn] ΦJ(x) dxn···dx1.

The marginal p.d.f. of an element (or a subset of elements), xi, is a function, ΦM(xi), obtained by integrating ΦJ(x) over the entire domains of all the other elements of x. This is a p.d.f. for xi alone, irrespective of the values that may be found for any other element of x. For two random variables, x and y (either or both of which may be vectors), the conditional p.d.f. of x given y = y0 is defined by

ΦC(x|y0) = kΦJ(x, y0),

where k = 1/ΦM(y0) is a renormalizing factor. This is a p.d.f. for x when it is known that y = y0. If ΦC(x|y) = ΦM(x) for all y, or, equivalently, if ΦJ(x, y) = ΦM(x)ΦM(y), the random variables x and y are said to be statistically independent.
Moments may be defined for multivariate p.d.f.s in a manner analogous to the one-dimensional case. The mean is a vector defined by

μ = ⟨x⟩ = ∫ xΦJ(x) dx,

where the volume of integration is the entire domain of x. The variance–covariance matrix is defined by

V = ⟨(x − μ)(x − μ)T⟩ = ∫ (x − μ)(x − μ)TΦJ(x) dx.

The diagonal elements of V are the variances of the marginal p.d.f.s of the elements of x, that is, Vii = σi2. It can be shown that, if xi and xj are statistically independent, Vij = 0 when i ≠ j. If two vectors of random variables, x and y, are related by a linear transformation, x = By, the means of their joint p.d.f.s are related by μx = Bμy, and their variance–covariance matrices are related by Vx = BVyBT.
The method of least squares may be formulated as follows: Given a set of n observations, yi, that are measurements of quantities that can be described by differentiable model functions, Mi(x), where x is a vector of parameters, x1, x2, ..., xp, find the values of the parameters for which the sum

S(x) = Σi wi[yi − Mi(x)]2   (8.1.2.1)

is minimum. Here, wi represents a weight assigned to the ith observation. The values of the parameters that give the minimum value of S are called estimates of the parameters, and a function of the data that locates the minimum is an estimator. A necessary condition for S to be a minimum is for the gradient to vanish, which gives a set of simultaneous equations, the normal equations, of the form

∂S/∂xj = −2Σi wi[yi − Mi(x)]∂Mi(x)/∂xj = 0,  j = 1, 2, ..., p.   (8.1.2.2)

The model functions, Mi(x), are, in general, nonlinear, and there are no direct ways to solve these systems of equations. Iterative methods for solving them are discussed in Section 8.1.4. Much of the analysis of results, however, is based on the assumption that linear approximations to the model functions are good approximations in the vicinity of the minimum, and we shall therefore begin with a discussion of linear least squares.
To express linear relationships, it is convenient to use matrix notation. Let M(x) and y be column vectors whose ith elements are Mi(x) and yi. Similarly, let b be a vector and A be a matrix such that a linear approximation to the ith model function can be written

Mi(x) = bi + Σj Aijxj.   (8.1.2.3)

Equations (8.1.2.3) can be written, in matrix form,

M(x) = b + Ax,   (8.1.2.4)

and, for this linear model, (8.1.2.1) becomes

S = (y − b − Ax)TW(y − b − Ax),   (8.1.2.5)

where W is a diagonal matrix whose diagonal elements are Wii = wi. In this notation, the normal equations (8.1.2.2) can be written

ATWAx = ATW(y − b),   (8.1.2.6)

and their solution is

x̂ = (ATWA)−1ATW(y − b).   (8.1.2.7)

If wi > 0 for all i, and A has full column rank, then ATWA will be positive definite, and S will have a unique minimum at x = x̂. The matrix H = (ATWA)−1ATW is a p × n matrix that relates the n-dimensional observation space to the p-dimensional parameter space and is known as the least-squares estimator; because each element of x̂ is a linear function of the observations, it is a linear estimator. [Note that, in actual practice, the matrix H is not actually evaluated, except, possibly, in very small problems. Rather, the linear system ATWAx = ATW(y − b) is solved using the methods of Section 8.1.3.]
The least-squares estimator has some special properties in statistical analysis. Suppose that the elements of y are experimental observations drawn at random from populations whose means are given by the model, M(x), for some unknown x, which we wish to estimate. This may be written

⟨y⟩ = M(x) = b + Ax.   (8.1.2.8)

The expected value of the least-squares estimate is

⟨x̂⟩ = (ATWA)−1ATW⟨y − b⟩ = (ATWA)−1ATWAx = x.   (8.1.2.9)
If the expected value of an estimate is equal to the variable to be estimated, the estimator is said to be unbiased. Equation (8.1.2.9) shows that the least-squares estimator is an unbiased estimator for x, independent of W, provided only that y is an unbiased estimate of M(x), the matrix ATWA is nonsingular, and the elements of W are constants independent of y and M(x). Let Vx and Vy be the variance–covariance matrices for the joint p.d.f.s of the elements of x̂ and y, respectively. Then, Vx = HVyHT. Let G be the matrix (ATVy−1A)−1ATVy−1, so that x̂ = G(y − b) is the particular least-squares estimate for which W = Vy−1. Then, GVyGT = (ATVy−1A)−1. If Vy is positive definite, its lower triangular Cholesky factor, L, exists, so that LLT = Vy. [If Vy is diagonal, L is also diagonal, with Lii = σi, the standard deviation of yi.] It is readily verified that the matrix product (H − G)VyGT is null, so that HVyHT = GVyGT + (H − G)Vy(H − G)T, but the diagonal elements of the last term in this sum are the sums of squares of the elements of rows of (H − G)L, and are therefore greater than or equal to zero. Therefore, the diagonal elements of Vx, which are the variances of the marginal p.d.f.s of the elements of x̂, are minimum when W = Vy−1.
Thus, the least-squares estimator is unbiased for any positive-definite weight matrix, W, but the variances of the elements of the vector of estimated parameters are minimized if W = Vy−1. [Note also that Vx = (ATWA)−1 if, and only if, W = Vy−1.] For this reason, the least-squares estimator with weights proportional to the reciprocals of the variances of the observations is referred to as the best linear unbiased estimator for the parameters of a model describing those observations. (These specific results are included in a more general result known as the Gauss–Markov theorem.)
The analysis up to this point has assumed that the model is linear, that is, that the expected values of the observations can be expressed by ⟨y⟩ = b + Ax, where A is some matrix. In crystallography, of course, the model is highly nonlinear, and this assumption is not valid. The principles of linear least squares can be extended to nonlinear model functions by first finding, by numerical methods, a point in parameter space, x̂, at which the gradient vanishes and then expanding the model functions about that point in Taylor's series, retaining only the linear terms. Equation (8.1.2.4) then becomes

M(x) = M(x̂) + A(x − x̂),   (8.1.2.10)

where Aij = ∂Mi(x)/∂xj evaluated at x = x̂. Because we have already found the least-squares solution, the estimate of x − x̂ reduces to the null vector. It is important, however, not to confuse x̂, which is a convenient origin, with x, which is a random variable describable by a joint p.d.f. with mean x̂ and a variance–covariance matrix

Vx = (ATWA)−1ATWVyWA(ATWA)−1,

reducing to Vx = (ATVy−1A)−1 when W = Vy−1.
This variance–covariance matrix is the one appropriate to the linear approximation given in (8.1.2.10), and it is valid (and the estimate is unbiased) only to the extent that the approximation is a good one. A useful criterion for an adequate approximation (Fedorov, 1972) is, for each j and k,

|Σi wiσi ∂2Mi(x)/∂xj∂xk| ≪ [Σi wi(∂Mi(x)/∂xj)2]1/2[Σi wi(∂Mi(x)/∂xk)2]1/2,

where σi is the estimated standard deviation or standard uncertainty (Schwarzenbach, Abrahams, Flack, Prince & Wilson, 1995) of yi. This criterion states that the curvature of S(y, x) in a region whose size is of order σ in observation space is small; it ensures that the effect of second-derivative terms in the normal-equations matrix on the eigenvalues and eigenvectors of the matrix is negligible. [For a further discussion and some numerical tests of alternatives, see Donaldson & Schnabel (1986).]
The process of refinement can be viewed as the construction of a conditional p.d.f. of a set of model parameters, x, given a set of observations, y. An important expression for this p.d.f. is derived from two equivalent expressions for the joint p.d.f. of x and y:

ΦJ(x, y) = ΦC(x|y)ΦM(y) = ΦC(y|x)ΦM(x).

Provided ΦM(y) ≠ 0, the conditional p.d.f. we seek can be written

ΦC(x|y) = ΦC(y|x)ΦM(x)/ΦM(y).   (8.1.2.14)

Here, the factor 1/ΦM(y) is the factor that is required to normalize the p.d.f. ΦC(y|x) is the conditional probability of observing a set of values of y as a function of x. When the observations have already been made, however, this can also be considered a density function for x that measures the likelihood that those particular values of y would have been observed for various values of x. It is therefore frequently written ℓ(x|y), and (8.1.2.14) becomes

ΦC(x|y) = kℓ(x|y)ΦM(x),   (8.1.2.15)

where k is the normalizing constant. ΦM(x), the marginal p.d.f. of x in the absence of any additional information, incorporates all previously available information concerning x, and is known as the prior p.d.f., or, frequently, simply as the prior of x. Similarly, ΦC(x|y) is the posterior p.d.f., or the posterior, of x. The relation in (8.1.2.14) and (8.1.2.15) was first stated in the eighteenth century by Thomas Bayes, and it is therefore known as Bayes's theorem (Box & Tiao, 1973). Although its validity has never been in serious question, its application has divided statisticians into two vehemently disputing camps, one of which, the frequentists, considers that Bayesian methods give nonobjective results, while the other, the Bayesians, considers that only by careful construction of a `noninformative' prior can true objectivity be achieved (Berger & Wolpert, 1984).
Diffraction data, in general, contain no phase information, so the likelihood function for the structure factor, F, given a value of observed intensity, will have a value significantly different from zero in an annular region of the complex plane with a mean radius equal to |F|. Because this is insufficient information with which to determine a crystal structure, a prior p.d.f. is constructed in one (or some combination) of two ways. Either the prior knowledge that electron density is non-negative is used to construct a joint p.d.f. of amplitudes and phases, given amplitudes for all reflections and phases for a few of them (direct methods), or chemical knowledge and intuition are used to construct a trial structure from which structure factors can be calculated, and the phase of Fcalc is assigned to Fobs. Both of these procedures can be considered to be applications of Bayes's theorem. In fact, Fcalc for a refined structure can be considered a Bayesian estimate of F.
In this section, we consider in detail numerical methods for solving linear least-squares problems, that is, the situation where (8.1.2.4) and (8.1.2.5) apply exactly.
The linear least-squares problem can be viewed geometrically as the problem of finding the point in a p-dimensional subspace, defined as the set of points that can be reached by a linear combination of the columns of A, that is closest to a given point, y, in an n-dimensional observation space. Since this is equivalent to finding the orthogonal projection of point y into that subspace, it is not surprising that an orthogonal decomposition of A helps to solve the problem. For convenience in this discussion, let us remove the weight matrix from the problem by defining the standardized design matrix by

Z = UA,   (8.1.3.1)

where U is the upper triangular Cholesky factor of W, that is, UTU = W.
Consider the least-squares problem with the QR factorization of Z, as given in Subsection 8.1.1.1. For y′ = U(y − b), (8.1.2.5) becomes

S = (y′ − Zx)T(y′ − Zx),   (8.1.3.2)

which reduces to

S = ||QZTy′ − Rx||2 + ||Q⊥Ty′||2.   (8.1.3.3)

The second term in (8.1.3.3) is independent of x, and is therefore the sum of squared residuals at the minimum. The first term vanishes if

Rx = QZTy′,   (8.1.3.4)

which, because R is upper triangular, is easily solved for x. The QR decomposition of Z therefore leads naturally to the following algorithm for solving the linear least-squares problem:
(1) Compute the QR factorization of Z, giving QZ and R.
(2) Compute w = QZTy′.
(3) Solve the upper triangular system Rx̂ = w for x̂ by back substitution.
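A minimal sketch of this algorithm, assuming NumPy and SciPy and using synthetic data for Z and y′, is:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))          # standardized design matrix (synthetic data)
y1 = rng.normal(size=20)              # standardized observations y'

Qz, R = np.linalg.qr(Z)               # step (1): economy-size QR, Z = Qz R
w = Qz.T @ y1                         # step (2)
x_hat = solve_triangular(R, w)        # step (3): back substitution for R x = Qz^T y'

residual = y1 - Z @ x_hat
print(x_hat, float(residual @ residual))   # estimates and sum of squared residuals
```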
Let us now consider the relationship of the QR procedure for solving the linear least-squares problem to the classical method based on the normal equations. The normal equations can be derived by differentiating (8.1.3.2) and equating the result to a null vector. This yields

ZTZx = ZTy′.   (8.1.3.5)

The algorithm is therefore to compute the cross-product matrix, B = ZTZ, and the right-hand side, d = ZTy′, and to solve the resulting system of equations, Bx = d. This is usually accomplished by computing the Cholesky decomposition of B, that is, B = CTC, where C is upper triangular, and then solving the two triangular systems CTv = d and Cx = v. Because Z = QZR, equation (8.1.3.5) becomes

RTRx = RTQZTy′,

or

Rx = QZTy′.

It is clear that R is the Cholesky factor of ZTZ, although it is formed in a different way. This procedure requires of order np2/2 operations to form the cross product and of order p3/6 operations for the Cholesky decomposition. In some situations, the extra time to compute the QR factorization is justified because of greater stability, as will be discussed below. Most other quantities of statistical interest can be computed directly from the QR factorization.
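The normal-equations route can be sketched in the same setting (again with synthetic data; the SciPy routines cho_factor and cho_solve perform the Cholesky decomposition and the two triangular solves):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 3))          # standardized design matrix (synthetic data)
y1 = rng.normal(size=20)

B = Z.T @ Z                           # cross-product (normal-equations) matrix
d = Z.T @ y1                          # right-hand side
x_hat = cho_solve(cho_factor(B), d)   # Cholesky factorization plus two triangular solves

# The result agrees with an orthogonal-factorization solution of the same problem.
print(np.allclose(x_hat, np.linalg.lstsq(Z, y1, rcond=None)[0]))
```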
The condition number of Z, which is defined (Subsection 8.1.1.1) as the square root of the ratio of the largest to the smallest eigenvalue of ZTZ, is an indicator of the effect a small change in an element of Z will have on the elements of ZTZ and of (ZTZ)−1. A large value of the condition number means that small errors in computing an element of Z, owing possibly to truncation or roundoff in the computer, can introduce large errors into the elements of the inverse matrix. Also, when the condition number is large, the standard uncertainties of some estimated parameters will be large. A large condition number, as defined in this way, can result from either scaling or correlation or some combination of these. To illustrate this, consider two matrices Z for which

ZTZ = (1, 0; 0, ε)  and  ZTZ = (1, 1 − ε; 1 − ε, 1),

where ε represents machine precision, which can be defined as the smallest number in machine representation that, when added to 1, gives a result different from 1. By the conventional definition, both of these matrices have a condition number for Z of approximately ε−1/2. Because numbers of order ε can be perfectly well represented, however, the first one can be inverted without loss of precision, whereas an inverse for the second would be totally meaningless. It is good practice, therefore, to factor the design matrix, Z, into the form

Z = Z′S,

where S is a p × p diagonal matrix whose elements define some kind of `natural' unit appropriate to the parameter represented in each column of Z. The ideal natural unit would be the standard uncertainty of that parameter, but this is not available until after the calculation has been completed. If correlation is not too severe, suitable values for the elements of S, of the same order of magnitude as those derived from the standard uncertainty, are the column Euclidean norms, that is,

Sjj = ||zj||,   (8.1.3.9)

where zj denotes the jth column of Z. This scaling causes all diagonal elements of Z′TZ′ to be equal to one, and errors in the elements of Z will have roughly equal effects.
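The effect of column scaling can be illustrated with synthetic data (a NumPy sketch; the factor 10−6 simply mimics a parameter expressed in badly chosen natural units):

```python
import numpy as np

def condition_number(Z):
    e = np.linalg.eigvalsh(Z.T @ Z)
    return np.sqrt(e.max() / e.min())

rng = np.random.default_rng(2)
Z = rng.normal(size=(50, 3))
Z[:, 2] *= 1.0e-6                        # one column in badly chosen 'natural' units

S = np.diag(np.linalg.norm(Z, axis=0))   # diagonal matrix of column norms
Z_scaled = Z @ np.linalg.inv(S)          # columns of Z' now have unit Euclidean norm

print(condition_number(Z))               # large, purely because of scaling
print(condition_number(Z_scaled))        # modest: ill conditioning from scaling removed
```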
Ill conditioning that results from correlation, as in the second example above, is more difficult to deal with. It is an indication that some linear combination of parameters, some eigenvector of the normal equations matrix, is poorly determined by the available data. Use of the QR factorization of Z to compute the Cholesky factor of ZTZ may be advantageous, in spite of the additional computation time, because better numerical stability is obtained in marginal situations. As a practical matter, however, it is important to recognize that an ill conditioned matrix is a symptom of a flaw in the model or in the experimental design (or both). Use can be made of the fact that, although determining the entire set of eigenvalues and eigenvectors of a large matrix is computationally an inherently difficult problem, a relatively simple algorithm, known as a condition estimator (Anderson et al., 1992), can produce a good approximation to the eigenvector that corresponds to the smallest eigenvalue of a nearly singular matrix. This information can be used in either or both of two ways. First, without any fundamental modification to the model or the experiment, a simple, linear transformation of the parameters so that the problem eigenvector is one of the independent parameters, followed by rescaling, can resolve the numerical difficulties in computing the estimates. A common example is the situation where a phase transition results in the doubling of a unit cell, with pairs of atoms almost but not quite related by a lattice translation. A transformation that makes the estimated parameters the sums and differences of corresponding parameters in related pairs of atoms can make a dramatic improvement in the condition number. Alternatively, the problem eigenvector can be set to some value determined from theory or from some other experiment (see Section 8.3.1 ), or additional data can be collected that are selected to make that combination of parameters determinate.
Recall (equation 8.1.2.1) that the general, nonlinear problem can be stated in the form: find the minimum of

S(x) = [y − M(x)]TW[y − M(x)],   (8.1.4.1)

where x is a vector of p parameters, and M(x) represents a set of model functions that predict the values of observations, y. In this section, we discuss two useful ways of solving this problem and consider the relative merits of each. The first is based on iteratively linearizing the functions M(x) and approximating (8.1.4.1) by one of the form in (8.1.2.5). The second uses an unconstrained minimization algorithm, based on Newton's method, to minimize S(x).
Let xc be the current approximation to x̂, the solution to (8.1.4.1). We construct a linear approximation to M(x) in the vicinity of xc by expanding it in a Taylor series through the linear terms, obtaining

M(x) ≈ M(xc) + J(xc)(x − xc),

where J(x) is the Jacobian matrix, defined by Jij(x) = ∂Mi(x)/∂xj. A straightforward procedure, known as the Gauss–Newton algorithm, may be formally stated as follows:

(1) Compute the step d as the solution of the linear least-squares problem obtained by substituting this approximation into (8.1.4.1), that is, solve J(xc)TWJ(xc)d = J(xc)TW[y − M(xc)].
(2) Set x+ = xc + d.
(3) If the procedure has not converged, set xc = x+ and repeat from step (1).
The convergence rate of the Gauss–Newton algorithm depends on the size of the residual, that is, on S(x̂). If S(x̂) = 0, then the convergence rate is quadratic; if it is small, then the rate is linear; but if S(x̂) is large, then the procedure is not locally convergent at all. Fortunately, this procedure can be altered so that it is always locally convergent, and even globally convergent, that is, convergent to a relative minimum from any starting point. There are two possibilities. First, the procedure can be modified to include a line search. This is accomplished by computing the direction d as above and then choosing α such that S(xc + αd) is sufficiently smaller than S(xc). In order to guarantee convergence, one uses the test

S(xc + αd) ≤ S(xc) + γα∇S(xc)Td,

where, as actually implemented in modern codes, γ has values of the order of 10−4. [In theory, a slightly stronger condition is necessary, but this is usually avoided by other means. See Dennis & Schnabel (1983) for details.] While this improves the situation dramatically, it still suffers from the deficiency that it is very slow on some problems, and it is undefined if the rank of the Jacobian, J(xc), is less than p. J(x) usually has rank p near the solution, but it may be rank deficient far away. Also, it may be `close' to rank deficient, and give numerical difficulties (see Subsection 8.1.3.3).
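A sketch of the Gauss–Newton iteration with a step-halving line search, for an illustrative exponential model with synthetic data and unit weights (assuming NumPy; the model, starting values and tolerances are arbitrary choices, not prescriptions):

```python
import numpy as np

# Synthetic example: M_i(x) = x0 * exp(x1 * t_i), unit weights.
t = np.linspace(0.0, 1.0, 20)
rng = np.random.default_rng(3)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.normal(size=t.size)

def model(x):
    return x[0] * np.exp(x[1] * t)

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

def S(x):
    r = y - model(x)
    return float(r @ r)

x = np.array([1.0, 0.0])                          # starting approximation
for _ in range(50):
    r = y - model(x)
    J = jacobian(x)
    d = np.linalg.solve(J.T @ J, J.T @ r)         # Gauss-Newton direction
    slope = -2.0 * r @ (J @ d)                    # dS/d(alpha) at alpha = 0 (negative)
    alpha, gamma = 1.0, 1.0e-4
    while S(x + alpha * d) > S(x) + gamma * alpha * slope and alpha > 1.0e-12:
        alpha *= 0.5                              # step halving until sufficient decrease
    x = x + alpha * d
    if np.linalg.norm(alpha * d) < 1.0e-10:
        break

print(x)       # close to the generating values (2.0, -1.5)
```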
These remaining problems can be addressed by the utilization of a trust-region modification to the basic Gauss–Newton algorithm. For this procedure, we compute the step, d, as the solution to the linear least-squares problem subject to the constraint that ||d|| ≤ τc, where τc is the trust-region radius. Here, the linearized model is modified by admitting that the linearization is only valid within a limited region around the current approximation. It is clear that, if the trust region is sufficiently large, this constrained step will in fact be unconstrained, and the step will be the same as the Gauss–Newton step. If the constraint is active, however, the step has the form

d(μ) = [J(xc)TWJ(xc) + μI]−1J(xc)TW[y − M(xc)],   (8.1.4.4)

where μ is the Lagrange multiplier corresponding to the constraint (see Section 8.3.1), that is, μ is the value such that ||d(μ)|| = τc. Formula (8.1.4.4) is known as the Levenberg–Marquardt equation. It can be seen from this formula that the step direction is intermediate between the Gauss–Newton direction and the direction of steepest descent, for which reason it is frequently known as ``Marquardt's compromise'' (Draper & Smith, 1981).
Efficient numerical calculation of d for a given value of μ is accomplished by noting that (8.1.4.4) is equivalent to the linear least-squares problem: find the minimum of

[y − M(xc) − J(xc)d]TW[y − M(xc) − J(xc)d] + μdTd.

This problem can be solved by saving the QR factorization for μ = 0 and updating it for various values of μ greater than 0. The actual computation of μ is accomplished by a modified Newton method applied to the constraint equation (see Dennis & Schnabel, 1983, for details). Having calculated the constrained value of d, we set x+ = xc + d. The algorithm is completed by specifying a procedure for updating the trust-region parameter, τc, after each step. This is done by comparing the actual value of S(xc + d) with the predicted value based on the linearization. If there is good agreement between these values, τ is increased. If there is not good agreement, τ is left unchanged, and if S(xc + d) ≥ S(xc), the step is rejected, and τ is reduced. Global convergence can be shown under reasonable conditions, and very good computational performance has been observed in practice.
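Library implementations of these safeguarded methods are widely available. The sketch below, which assumes SciPy, fits the same illustrative exponential model with scipy.optimize.least_squares; method 'lm' calls a MINPACK implementation of the Levenberg–Marquardt algorithm, and method 'trf' uses a trust-region approach. These are related library algorithms, not the specific procedure described above.

```python
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 1.0, 20)
rng = np.random.default_rng(4)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.normal(size=t.size)

def residuals(x):
    # weighted residuals w_i^(1/2) [y_i - M_i(x)]; unit weights here
    return y - x[0] * np.exp(x[1] * t)

fit_lm = least_squares(residuals, x0=[1.0, 0.0], method='lm')    # Levenberg-Marquardt (MINPACK)
fit_tr = least_squares(residuals, x0=[1.0, 0.0], method='trf')   # trust-region method
print(fit_lm.x, fit_tr.x)
```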
The second class of methods for solving the general, nonlinear least-squares problem is based on the unconstrained minimization of the function S(x), as defined in (8.1.4.1). Newton's method for solving this problem is derived by constructing a quadratic approximation to S at the current trial point, xc, giving

Sc(d) = S(xc) + g(xc)Td + (1/2)dTH(xc)d,   (8.1.4.6)

from which the Newton step is obtained by minimizing Sc(d) with respect to d. Here, g(x) = ∇S(x) represents the gradient of S and H(x) is the Hessian matrix, the p × p symmetric matrix of second partial derivatives, Hjk(x) = ∂2S(x)/∂xj∂xk. Thus, d is calculated by solving the linear system

H(xc)d = −g(xc).   (8.1.4.7)

[Note: In the literature on optimization, the notation ∇2S(x) is often used to denote the Hessian matrix of the function S(x). This should not be confused with the Laplacian operator.] While Newton's method is locally quadratically convergent, it suffers from well known drawbacks. First, it is not globally convergent without employing some form of line search or trust region to safeguard it. Second, it requires the computation of the Hessian of S in each iteration.
The Hessian, however, has the form

H(x) = 2J(x)TWJ(x) + B(x),

where B(x) is given by

Bjk(x) = −2Σi wi[yi − Mi(x)]∂2Mi(x)/∂xj∂xk.

The first term of the Hessian, which is dependent only on the Jacobian, is readily available, but B, even if it is available analytically, is, typically, expensive to compute. Furthermore, in some situations, such as in the vicinity of a more symmetric model, H(xc) may not be positive definite.
In order to overcome these difficulties, various methods have been proposed to approximate H without calculating B. Because these methods are based on approximations to Newton's method, they are often referred to as quasi-Newton methods. Two popular versions use approximations to B that are known as the Davidon–Fletcher–Powell (DFP) update and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, with BFGS being apparently superior in practice. The basic idea of both procedures is to use values of the gradient to update an approximation to the Hessian in such a way as to preserve as much previous information as possible, while incorporating the new information obtained in the most recent step. From (8.1.4.6), the gradient of Sc(d) is

∇Sc(d) = g(xc) + H(xc)d.

If the true function is not quadratic, or if H is only approximately known, the gradient at the point x+ = xc + d, where d is the solution of (8.1.4.7), will not vanish. We make use of the value of g(x+) to compute a correction to H to give a Hessian that would have predicted what was actually found. That is, find an update matrix, B′, such that

(Hc + B′)(x+ − xc) = g(x+) − g(xc).

This is known as the secant, or quasi-Newton, relation. An algorithm based on the BFGS update goes as follows: Given xc and Hc (usually a scaled, identity matrix is used as the initial approximation, in which case the first step is in the steepest-descent direction),

(1) solve Hcd = −g(xc) for the direction d;
(2) find a value of α for which S(xc + αd) is sufficiently smaller than S(xc);
(3) set x+ = xc + αd, s = αd and q = g(x+) − g(xc);
(4) update the Hessian approximation, H+ = Hc + B′, using the BFGS formula

H+ = Hc − HcssTHc/(sTHcs) + qqT/(qTs),

which satisfies the secant relation H+s = q;
(5) if the procedure has not converged, set xc = x+ and Hc = H+, and repeat from step (1).
If, in step (2), the additional condition is applied that dTg(xc + αd) = 0, it becomes an exact line search. This, however, is an unnecessarily severe condition, because it can be shown that, if dT[g(xc + αd) − g(xc)] > 0, the approximation to the Hessian remains positive definite. The two terms in this expression are values of dS/dα, and the condition states that S does not decrease more rapidly as α increases, which will always be true for some range of values of α > 0. This algorithm is globally convergent; it will find a point at which the gradient vanishes, although that point may be a false minimum. It should be noted, however, that the procedure does not produce a final Hessian matrix whose inverse necessarily has any resemblance to a variance–covariance matrix. To get that it is necessary to compute J(x̂) and from it compute the matrix [J(x̂)TWJ(x̂)]−1.
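A quasi-Newton (BFGS) minimization of S can be sketched with scipy.optimize.minimize, which internally maintains an approximation to the inverse Hessian. The example below also illustrates the caution just stated: the variance–covariance matrix is formed from the Jacobian at the solution rather than from the quasi-Newton approximation. The model and data are the same synthetic example as above, used purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0.0, 1.0, 20)
rng = np.random.default_rng(5)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.normal(size=t.size)

def S(x):
    r = y - x[0] * np.exp(x[1] * t)
    return float(r @ r)

def grad_S(x):
    e = np.exp(x[1] * t)
    r = y - x[0] * e
    J = np.column_stack([e, x[0] * t * e])
    return -2.0 * J.T @ r                   # gradient of S

res = minimize(S, x0=[1.0, 0.0], jac=grad_S, method='BFGS')

# The final quasi-Newton inverse-Hessian approximation (res.hess_inv) should not be
# used as a variance-covariance matrix.  Instead, form J at the solution:
e = np.exp(res.x[1] * t)
J = np.column_stack([e, res.x[0] * t * e])
s2 = S(res.x) / (t.size - 2)                # variance of an observation of unit weight
V = s2 * np.linalg.inv(J.T @ J)             # estimated variance-covariance matrix
print(res.x, np.sqrt(np.diag(V)))
```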
If a scaled, identity matrix is used as the initial approximation to H, no use is made of the fact that, at least in the vicinity of the minimum, a major part of the Hessian matrix is given by 2J(x)TWJ(x). Thus, if there can be reasonable confidence in the general features of the model, it is useful to use 2J(x0)TWJ(x0), where x0 is the starting point, as the initial approximation to H, in which case the first step is in the Gauss–Newton direction. The line-search provision, however, guarantees convergence, and, when n and p are both large, as in many crystallographic applications of least squares, a quasi-Newton update gives an adequate approximation to the new Hessian with a great deal less computation.
Another procedure that makes use of the fact that the linear approximation gives a major part of the Hessian constructs an approximation of the form

H ≈ 2J(x)TWJ(x) + B,

where B is generated by quasi-Newton methods. The quasi-Newton condition that must be satisfied in this case is

B+s = q′,

where s = x+ − xc and q′ = g(x+) − g(xc) − 2J(x+)TWJ(x+)s. A formula based on the BFGS update (Nash & Sofer, 1995) is

B+ = B − BssTB/(sTBs) + q′q′T/(q′Ts),

where B is the current approximation to the second-derivative part. Since J(x+) and g(x+) must be computed anyway, this technique maximizes the use of information with little additional computing. The resulting approximation to the Hessian, however, may not be positive definite, and precautions must be taken to ensure convergence. In the vicinity of a correct solution, where the residuals are small, the addition of B is not likely to help much, and it can be dropped. Far from the solution, however, it can be very helpful. An implementation of this procedure has been described by Bunch, Gay & Welsch (1993); it appears to be at least as efficient as the trust-region, Levenberg–Marquardt procedure, and is probably better when residuals are large.
In actual practice, it is not the Hessian matrix itself that is updated, but rather its Cholesky factor (Prince, 1994). This requires approximately the same number of operations and allows the solution of the linear system for computing d in a time that increases in proportion to p2. This strategy also allows the use of the approximate Hessian in convergence checks with no significant computational overhead and no extra storage, as would be required for storing both the Hessian and its inverse.
An iterative algorithm must contain some criterion for termination of the iteration process. This is a delicate part of all nonlinear optimization codes, and depends strongly on the scaling of the parameters. Although exceptions exist to almost all reasonable scaling rules, a basic principle is that a unit change in any variable should have approximately the same effect on the sum of squares. Thus, as discussed in Subsection 8.1.3.3 (equation 8.1.3.9), the ideal unit for each parameter is the standard uncertainty of its estimate, which can usually be adequately approximated by the reciprocal of the column norm of J. In modern codes, the user has the option of supplying a diagonal scaling matrix whose elements are the reciprocals of some estimate of a typical `significant' shift in the corresponding parameter.
In principle, the following conditions should hold when the convergence of a well scaled, least-squares procedure has reached its useful limit:
(1) S(xc) should be close to S(x̂);
(2) the residual vector y − M(xc) should be close to the residual vector y − M(x̂);
(3) xc should be close to x̂.
The actual stopping rules must be chosen relative to the algorithm used (other conditions also exist) and the particular application. It is clear, however, that, since x̂ is not known, the last two conditions cannot be used as stated. Also, these tests are dependent on the scaling of the problem, and the tolerances are not related to the sizes of the quantities involved. We present tests, therefore, that are relative error tests that take into account the scaling of the variables.
The general, relative error test is stated as follows. Two scalar quantities, a and b, are said to satisfy a relative error test with respect to a tolerance T if

|a − b| ≤ T|a|.   (8.1.4.15)

Roughly speaking, if T is of the form 10−q, then a and b agree to q digits. Obviously, there is a problem with this test if a = 0, and there will be numerical difficulties if a is close to 0. Thus, in practice, (8.1.4.15) is replaced by

|a − b| ≤ T(|a| + 1),   (8.1.4.16)

which reduces to an absolute error test as |a| → 0. A careful examination may be required to set this tolerance correctly, but, typically, if one of the fast, stable algorithms is used, only a few more iterations are necessary to get six or eight digits if one or two are already known. Note also that the actual value of T depends on εM, the relative machine precision. It is fruitless to seek more digits of accuracy than are expressed in the machine representation.
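A sketch of this test, assuming the modified form given in (8.1.4.16):

```python
def relative_test(a, b, tol):
    """Test of the form |a - b| <= tol * (|a| + 1): a relative test for large |a|,
    effectively an absolute test as |a| approaches zero."""
    return abs(a - b) <= tol * (abs(a) + 1.0)

print(relative_test(1.23456789, 1.23456791, 1.0e-6))   # True: agreement to about 8 digits
print(relative_test(2.0e-14, 5.0e-14, 1.0e-6))         # True: absolute test near zero
print(relative_test(1.000, 1.001, 1.0e-6))             # False
```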
A test based on condition (1) is often implemented by using the linear approximation to M or the quadratic approximation to S. Thus, using the quadratic approximation to S, we can compute the predicted reduction by

Δpred = S(xc) − Sc(d) = −[g(xc)Td + (1/2)dTH(xc)d].

Similarly, the actual reduction is

Δact = S(xc) − S(xc + d).

The test then requires that the predicted reduction be small relative to S(xc), that the actual reduction be small relative to S(xc), and that the actual and predicted reductions satisfy the relative error test (8.1.4.16) with respect to each other. That is, we want both the predicted and actual reductions to be small and the actual reduction to agree reasonably well with the predicted reduction. A typical value for T should be 10−4, although the value again depends on εM, and the user is cautioned not to make this tolerance too loose.
For a test on condition (2), we compute the cosine of the angle between the vector of residuals and the linear subspace spanned by the columns of J,

cos δ = ||J(xc)d||W / ||y − M(xc)||W,

where ||v||W = (vTWv)1/2 and d is the Gauss–Newton step at xc. The test is cos δ ≤ T, where, again, T should be 10−4 or smaller.
Test 3 above is usually only present to prevent the process from continuing when almost nothing is happening. Clearly, we do not know x̂, thus the test is typically that corresponding elements of x+ and xc satisfy (8.1.4.16), where T is chosen to be 10−q. A recommended value of q is half the number of digits carried in the computation, e.g. q = 8 for standard 64-bit (double-precision or 16 digit) calculations. Sometimes, the relative error test is of the form |(x+)j − (xc)j| ≤ Tσj, where σj is the standard uncertainty computed from the inverse Hessian in the last iteration. Although this test has some statistical validity, it is quite expensive and usually not worth the work involved to compute.
One situation in which the Gauss–Newton algorithm behaves particularly poorly is in the vicinity of a saddle point in parameter space, where the true Hessian matrix is not positive definite. This occurs in structure refinement where a symmetric model is refined to convergence and then is replaced by a less symmetric model. The hypersurface of S will have negative curvature in a finite sized region of the parameter space for the less-symmetric model, and it is essential to use a safeguarded algorithm, one that incorporates a line search or a trust region, in order to get out of that region.
On the basis of this discussion, we can draw the following conclusions:
(1) When the residuals at the solution are small, methods based on the linearized model, such as the trust-region (Levenberg–Marquardt) modification of the Gauss–Newton algorithm, are usually the most efficient.
(2) When the residuals are large, or when the current model may be substantially incorrect, quasi-Newton methods that build up an approximation to the full Hessian are preferable.
(3) In either case, a safeguarded algorithm, incorporating a line search or a trust region, should be used.
Because the least-squares problems arising in crystallography are often very large, the methods we have discussed above are not always the most efficient. Some large problems have special structure that can be exploited to produce quite efficient algorithms. A particular special structure is sparsity, that is, the problems have Jacobian matrices that have a large fraction of their entries zero. Of course, not all large problems are sparse, so we shall also discuss approaches applicable to more general problems.
We shall first discuss large, sparse, linear least-squares problems (Heath, 1984), since these techniques form the basis for nonlinear extensions. As we proceed, we shall indicate how the procedures should be modified in order to handle nonlinear problems. Recall that the problem is to find the minimum of the quadratic form [y − Ax]TW[y − Ax], where y is a vector of observations, Ax represents a vector of the values of a set of linear model functions that predict the values of y, and W is a positive-definite weight matrix. Again, for convenience, we make the transformation y′ = Uy, where U is the upper triangular Cholesky factor of W, and Z = UA, so that the quadratic form becomes (y′ − Zx)T(y′ − Zx), and the minimum is the solution of the system of normal equations, ZTZx = ZTy′. Even if Z is sparse, it is easy to see that H = ZTZ need not be sparse, because if even one row of Z has all of its elements nonzero, all elements of H will be nonzero. Therefore, the direct use of the normal equations may preclude the efficient exploitation of sparsity. But suppose H is sparse. The next step in solving the normal equations is to compute the Cholesky decomposition of H, and it may turn out that the Cholesky factor is not sparse. For example, if H has the form

( x x x x x )
( x x 0 0 0 )
( x 0 x 0 0 )
( x 0 0 x 0 )
( x 0 0 0 x ),

where x represents a nonzero element, then the Cholesky factor, R, will not be sparse, but if H has the form

( x 0 0 0 x )
( 0 x 0 0 x )
( 0 0 x 0 x )
( 0 0 0 x x )
( x x x x x ),

then R has the form

( x 0 0 0 x )
( 0 x 0 0 x )
( 0 0 x 0 x )
( 0 0 0 x x )
( 0 0 0 0 x ).
These examples show that, although the sparsity of R is independent of the row ordering of Z, the column order can have a profound effect. Procedures exist that analyse Z and select a permutation of the columns that reduces the `fill' in R. An algorithm for using the normal equations is then as follows:
(1) Determine the sparsity structure of H = ZTZ and find a permutation of its rows and columns that will keep the Cholesky factor, R, sparse.
(2) Determine the sparsity structure of R symbolically, and set up a data structure in which to store its nonzero elements.
(3) Compute H = ZTZ and d = ZTy′ numerically, storing only the nonzero elements of H.
(4) Compute the Cholesky factorization of the permuted H and solve the two triangular systems RTv = d and Rx = v.
This algorithm is fast, and it will produce acceptable accuracy if the condition number of Z is not too large. If extension to the nonlinear case is considered, it should be kept in mind that the first two steps need only be done once, since the sparsity pattern of the Jacobian does not, in general, change from iteration to iteration.
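A sketch of the sparse normal-equations approach, assuming SciPy's sparse-matrix facilities and a synthetic Z with two nonzero elements per row (the fill-reducing ordering of steps (1) and (2) is here delegated to the library's sparse direct solver):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

# A sparse standardized design matrix Z with two nonzeros per row (synthetic data).
n, p = 1000, 50
rng = np.random.default_rng(6)
rows = np.concatenate([np.arange(n), np.arange(n)])
cols = np.concatenate([np.arange(n) % p, (np.arange(n) + 1) % p])
vals = rng.normal(size=2 * n)
Z = sp.csr_matrix((vals, (rows, cols)), shape=(n, p))
y1 = rng.normal(size=n)

B = (Z.T @ Z).tocsc()          # sparse normal-equations matrix
d = Z.T @ y1
x_hat = spsolve(B, d)          # sparse direct solve; a fill-reducing column ordering
                               # is applied internally by the SuperLU-based solver
print(x_hat[:5])
```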
The QR decomposition of matrices that may be kept in memory is most often performed by the use of Householder transformations (see Subsection 8.1.1.1). For sparse matrices, or for matrices that are too large to be held in memory, this technique has several drawbacks. First, the method works by inserting zeros in the columns of Z, working from left to right, but at each step it tends to fill in the columns to the right of the one currently being worked on, so that columns that are initially sparse cease to be so. Second, each Householder transformation needs to be applied to all of the remaining columns, so that the entire matrix must be retained in memory to make efficient use of this procedure.
The alternative procedure for obtaining the QR decomposition by the use of Givens rotations overcomes these problems if the entire upper triangular matrix, R, can be stored in memory. Since this only requires about p2/2 locations, it is usually possible. Also, it may happen that R has a sparse representation, so that even fewer locations will be needed. The algorithm based on Givens rotations is as follows:

(1) Set R = O, a p × p null matrix.
(2) Read, or compute, the next row of Z.
(3) Eliminate the elements of that row, from left to right, by applying Givens rotations between it and the corresponding rows of the stored R, updating R in the process.
(4) Repeat steps (2) and (3) until all n rows have been processed.
In order to specify how to use this algorithm to solve the linear least-squares problem, we must also state how to account for Q. We could accumulate Q or save enough information to generate it later, but this usually requires excessive storage. The better alternatives are either to apply the steps of Q to y′ as we proceed or to simply discard the information and solve RTRx = ZTy′. It should be noted that the order of rows can make a significant difference. For a favourable row ordering, the work to complete the QR decomposition may be only of order p2 operations, because each element below the main diagonal can be eliminated by one Givens rotation with no fill, whereas, for an unfavourable ordering, each Givens rotation fills an entire row, and the QR decomposition requires of order np2 operations.
A numerical procedure that is applicable to large-scale problems that may not be sparse is called the conjugate-gradient method. Conjugate-gradient methods were originally designed to solve the quadratic minimization problem: find the minimum of

S(x) = (1/2)xTHx − bTx,

where H is a symmetric, positive-definite matrix. The gradient of S is

g(x) = Hx − b,

and its Hessian matrix is H. Given an initial estimate, x0, the conjugate-gradient algorithm is
(1) Set d0 = −g(x0) and k = 0.
(2) Compute αk = −dkTg(xk)/(dkTHdk).
(3) Set xk+1 = xk + αkdk.
(4) Compute βk = g(xk+1)Tg(xk+1)/[g(xk)Tg(xk)].
(5) Set dk+1 = −g(xk+1) + βkdk, increase k by one, and return to step (2).
This algorithm finds the exact solution for the quadratic function in not more than p steps.
This algorithm cannot be used directly for the nonlinear case because it requires H to compute α, and the goal is to solve the problem without computing the Hessian. To accomplish this, the exact computation of α is replaced by an actual line search, and the termination after at most p steps is replaced by a convergence test. Thus, we obtain, for a given starting value, x0, and a general, nonquadratic function S:
(1) Set d0 = −∇S(x0) and k = 0.
(2) Find αk by a line search along dk, so that S(xk + αkdk) is sufficiently smaller than S(xk).
(3) Set xk+1 = xk + αkdk; if the convergence test is satisfied, stop.
(4) Compute βk = ∇S(xk+1)T∇S(xk+1)/[∇S(xk)T∇S(xk)].
(5) Set dk+1 = −∇S(xk+1) + βkdk, increase k by one, and return to step (2).
Note that, as promised, H is not needed. In practice, it has been observed that the line search need not be exact, but that periodic restarts in the steepest-descent direction are often helpful. This procedure often requires more iterations and function evaluations than methods that store approximate Hessians, but the cost per iteration is small. Thus, it is often the overall least-expensive method for large problems.
For the least-squares problem, recall that we are finding the minimum of

S(x) = (y′ − Zx)T(y′ − Zx),

for which

g(x) = −2ZT(y′ − Zx)  and  H = 2ZTZ.

By using these definitions in the conjugate-gradient algorithm, it is possible to formulate a specific algorithm for linear least squares that requires only the calculation of Z times a vector and ZT times a vector, and never requires the calculation or factorization of ZTZ.
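A sketch of such a matrix-free approach, assuming SciPy: the Hessian H = 2ZTZ is represented only through the products Zv and ZTu, and the conjugate-gradient routine scipy.sparse.linalg.cg is applied to the normal equations. (The library routine scipy.sparse.linalg.lsqr applies a mathematically equivalent iteration directly to Z and would normally be preferred.)

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

n, p = 2000, 100
rng = np.random.default_rng(7)
Z = sp.random(n, p, density=0.02, random_state=rng, format='csr')
Z = Z + sp.eye(n, p, format='csr')      # make full column rank virtually certain
y1 = rng.normal(size=n)

# Only products Z v and Z^T u are ever formed; Z^T Z is never built or factorized.
H = LinearOperator((p, p), matvec=lambda v: 2.0 * (Z.T @ (Z @ v)))
rhs = 2.0 * (Z.T @ y1)

x_hat, info = cg(H, rhs)
print(info, np.linalg.norm(Z.T @ (y1 - Z @ x_hat)))   # info == 0 indicates convergence
```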
In practice, such an algorithm will, due to roundoff error, sometimes require more than p iterations to reach a solution. A detailed examination of the performance of the procedure shows, however, that fewer than p iterations will be required if the eigenvalues of ZTZ are bunched, that is, if there are sets of multiple eigenvalues. Specifically, if the eigenvalues are bunched into k distinct sets, then the conjugate-gradient method will converge in k iterations. Thus, significant improvements can be made if the problem can be transformed to one with bunched eigenvalues. Such a transformation leads to the so-called preconditioned conjugate-gradient method. In order to analyse the situation, let C be a p × p matrix that transforms the variables, such that x′ = Cx. Then,

S = (y′ − ZC−1x′)T(y′ − ZC−1x′),

and the relevant matrix becomes (ZC−1)T(ZC−1). Therefore, C should be such that the system Cx = x′ is easy to solve, and such that (ZC−1)T(ZC−1) has bunched eigenvalues. The ideal choice would be C = R, where R is the upper triangular factor of the QR decomposition, since ZR−1 = QZ, and QZTQZ = I has all of its eigenvalues equal to one, and, since R is triangular, the system is easy to solve. If R were known, however, the problem would already be exactly solved, so this is not a useful alternative. Unfortunately, no universal best choice seems to exist, but one approach is to choose C as a sparse approximation to R, obtained by ignoring rows that cause too much fill in, or as a diagonal matrix whose elements are the Euclidean norms of the columns of Z. Bear in mind that, in the nonlinear case, an expensive computation to choose C in the first iteration may work very well in subsequent iterations with no further expense. One should be aware of the trade-off between the extra work per iteration of the preconditioned conjugate-gradient method versus the reduction in the number of iterations. This is especially important in nonlinear problems.
The solution of large least-squares problems is currently an active area of research, and we have certainly not given an exhaustive list of methods in this chapter. The choice of method or approach for any particular problem is dependent on many conditions, among them the size of the problem, the sparsity of the Jacobian, the storage available, and the relative costs of evaluating the model functions and of the linear algebra in each iteration.
It is often useful to consider the data for a least-squares problem to be in the form (ti, yi), i = 1, ..., n, where the ti are considered to be the independent variables and the yi the dependent variables. The implicit assumption in ordinary least squares is that the independent variables are known exactly. It sometimes occurs, however, that these independent variables also have errors associated with them that are significant with respect to the errors in the observations yi. In such cases, referred to as `errors in variables' or `measurement error models', the ordinary least-squares methodology is not appropriate and its use may give misleading results (see Fuller, 1987).
Let us define Mi(x; ti) to be the model functions that predict the yi. Observe that ordinary least squares minimizes the sum of the squares of the vertical distances from the observed points yi to the curve M(x; t). If ti has an error δi, and these errors are normally distributed, then the maximum-likelihood estimate of the parameters is found by minimizing the sum of the squares of the weighted orthogonal distances from the point (ti, yi) to the curve M(x; t). More precisely, the optimization problem to be solved is: find the minimum over x and δ of

Σi {wi[yi − Mi(x; ti + δi)]2 + viδi2},   (8.1.6.1)

where wi and vi are appropriately chosen weights. Problem (8.1.6.1) is called the orthogonal distance regression (ODR) problem. Problem (8.1.6.1) can be solved as a least-squares problem in the combined variables x, δ by the methods given above. This, however, is quite inefficient, since such a procedure would not exploit the special structure of the ODR problem. Few algorithms that exploit this structure exist; one has been given by Boggs, Byrd & Schnabel (1987), and the software, called ODRPACK, is by Boggs, Byrd, Donaldson & Schnabel (1989). The algorithm is based on the trust-region (Levenberg–Marquardt) method described above, but it exploits the special structure of (8.1.6.1) so that the cost of each iteration is no more expensive than the cost of a similar iteration of the corresponding ordinary least-squares problem. For a discussion of some of the statistical properties of the resulting estimates, including a procedure for the computation of the variance–covariance matrix, see Boggs & Rogers (1990).
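SciPy's scipy.odr module is a wrapper around ODRPACK, so the procedure described here can be sketched directly; the data, model and standard deviations below are synthetic and purely illustrative.

```python
import numpy as np
from scipy import odr

# Synthetic data with errors in both the independent variable t and the observation y.
rng = np.random.default_rng(8)
t_true = np.linspace(0.0, 4.0, 25)
t_obs = t_true + 0.05 * rng.normal(size=t_true.size)
y_obs = 2.0 * np.exp(-0.7 * t_true) + 0.02 * rng.normal(size=t_true.size)

def f(beta, t):
    return beta[0] * np.exp(beta[1] * t)

model = odr.Model(f)
# sx and sy supply the standard deviations from which the weights v_i and w_i are formed.
data = odr.RealData(t_obs, y_obs,
                    sx=np.full(t_obs.size, 0.05),
                    sy=np.full(y_obs.size, 0.02))
fit = odr.ODR(data, model, beta0=[1.0, -1.0]).run()

print(fit.beta)       # estimated parameters
print(fit.sd_beta)    # their standard uncertainties
```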
Giving even general recommendations on software is a difficult task for several reasons. Clearly, the selection of methods discussed in earlier sections contains implicitly some recommendations for approaches. There are, nevertheless, several reasons for avoiding more specific recommendations here.
The general recommendation is to analyse carefully the needs and resources in terms of these considerations, and to seek expert assistance whenever possible. As much as possible, avoid the temptation to write your own codes. Despite the fact that the quality of existing software is far from uniformly high, the benefits of utilizing high-quality software generally far outweigh the costs of finding, obtaining, and installing it.
Sources of information on software have improved significantly in the past several years. Nevertheless, the task of identifying software in terms of problems that can be solved; organizing, maintaining and updating such a list; and informing the user community still remains formidable.
A current, problem-oriented system that includes both a problem classification scheme and a network tool for obtaining documentation and source code (for software in the public domain) is the Guide to Available Mathematical Software (GAMS). This system is maintained by the National Institute of Standards and Technology (NIST) and is continually being updated as new material is received. It gives references to software in several software repositories; the URL is http://math.nist.gov/gams .
References
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S. & Sorensen, D. (1992). LAPACK users' guide, 2nd ed. Philadelphia: SIAM Publications.
Berger, J. O. & Wolpert, R. L. (1984). The likelihood principle. Hayward, CA: Institute of Mathematical Statistics.
Boggs, P. T., Byrd, R. H., Donaldson, J. R. & Schnabel, R. B. (1989). ODRPACK – software for weighted orthogonal distance regression. ACM Trans. Math. Softw. 15, 348–364.
Boggs, P. T., Byrd, R. H. & Schnabel, R. B. (1987). A stable and efficient algorithm for nonlinear orthogonal distance regression. SIAM J. Sci. Stat. Comput. 8, 1052–1078.
Boggs, P. T. & Rogers, J. E. (1990). Orthogonal distance regression. Contemporary mathematics: statistical analysis of measurement error models and applications. Providence, RI: AMS.
Box, G. E. P., Hunter, W. G. & Hunter, J. S. (1978). Statistics for experimenters: an introduction to design, data analysis and model building. New York: John Wiley.
Box, G. E. P. & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.
Bunch, D. S., Gay, D. M. & Welsch, R. E. (1993). Algorithm 717: subroutines for maximum likelihood and quasi-likelihood estimation of parameters in nonlinear regression models. ACM Trans. Math. Softw. 19, 109–130.
Dennis, J. E. & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice Hall.
Donaldson, J. R. & Schnabel, R. B. (1986). Computational experience with confidence regions and confidence intervals for nonlinear least squares. Computer science and statistics. Proceedings of the Seventeenth Symposium on the Interface, edited by D. M. Allen, pp. 83–91. New York: North-Holland.
Draper, N. & Smith, H. (1981). Applied regression analysis. New York: John Wiley.
Fedorov, V. V. (1972). Theory of optimal experiments, translated by W. J. Studden & E. M. Klimko. New York: Academic Press.
Fuller, W. A. (1987). Measurement error models. New York: John Wiley & Sons.
Heath, M. T. (1984). Numerical methods for large, sparse, linear least squares problems. SIAM J. Sci. Stat. Comput. 5, 497–513.
Nash, S. & Sofer, A. (1995). Linear and nonlinear programming. New York: McGraw-Hill.
Prince, E. (1994). Mathematical techniques in crystallography and materials science, 2nd ed. Berlin: Springer.
Schwarzenbach, D., Abrahams, S. C., Flack, H. D., Prince, E. & Wilson, A. J. C. (1995). Statistical descriptors in crystallography. II. Report of a Working Group on Expression of Uncertainty in Measurement. Acta Cryst. A51, 565–569.
Stewart, G. W. (1973). Introduction to matrix computations. New York: Academic Press.