International Tables for Crystallography
Volume C: Mathematical, physical and chemical tables
Edited by E. Prince

International Tables for Crystallography (2006). Vol. C. ch. 8.1, pp. 681-682

Section 8.1.3. Implementation of linear least squares

E. Prince^a and P. T. Boggs^b

^a NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and ^b Scientific Computing Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

8.1.3. Implementation of linear least squares

In this section, we consider in detail numerical methods for solving linear least-squares problems, that is, the situation where (8.1.2.4) and (8.1.2.5) apply exactly.

8.1.3.1. Use of the QR factorization

The linear least-squares problem can be viewed geometrically as the problem of finding the point in a p-dimensional subspace, defined as the set of points that can be reached by a linear combination of the columns of A, closest to a given point, y, in an n-dimensional observation space. Since this is equivalent to finding the orthogonal projection of point y into that subspace, it is not surprising that an orthogonal decomposition of A helps to solve the problem. For convenience in this discussion, let us remove the weight matrix from the problem by defining the standardized design matrix by [{\bi Z}={\bi UA}, \eqno (8.1.3.1)] where U is the upper triangular Cholesky factor of W.
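As a concrete illustration, the following is a minimal Python sketch of this standardization step. It is not part of the original text: the function name standardize, and the array names W, A, y and b (weight matrix, design matrix, observations and the constant part of the linear model), are assumptions introduced here.

```python
# Minimal sketch of the standardization in (8.1.3.1), under the assumptions
# stated above.  W is factored as W = U^T U with U upper triangular; the
# standardized observations y' = U(y - b) used in the next paragraph are
# formed at the same time.
import numpy as np
from scipy.linalg import cholesky

def standardize(W, A, y, b):
    U = cholesky(W, lower=False)   # upper triangular Cholesky factor of W
    Z = U @ A                      # standardized design matrix, Z = UA
    yp = U @ (y - b)               # standardized observations, y' = U(y - b)
    return Z, yp
```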

Consider the least-squares problem with the QR factorization of Z, as given in Subsection 8.1.1.1. For y′ = U(y − b), (8.1.2.5) becomes [\eqalignno{S &=({\bf y}^{\prime }-{\bi Z}{\bf x})^T({\bf y}^{\prime }-{\bi Z}{\bf x}) \cr &=[{\bi Q}^T({\bf y}^{\prime }-{\bi Z}{\bf x})]^T[{\bi Q}^T({\bf y}^{\prime }-{\bi Z}{\bf x})], & (8.1.3.2)}] which reduces to [S=({\bi Q}_{{\bi Z}}^T{\bf y}^{\prime }-{\bi R}{\bf x})^T({\bi Q}_{{\bi Z}}^T{\bf y}^{\prime }-{\bi R}{\bf x})+{\bf y}^{\prime T}{\bi Q}_{\bot }{\bi Q}_{\bot }^T{\bf y}^{\prime }. \eqno (8.1.3.3)] The second term in (8.1.3.3) is independent of x; because the first term can be reduced to zero, the second term is the sum of squared residuals at the minimum. The first term vanishes if [{\bi R}{\bf x}={\bi Q}_{{\bi Z}}^T{\bf y}^{\prime }, \eqno (8.1.3.4)] which, because R is upper triangular, is easily solved for x. The QR decomposition of Z therefore leads naturally to the following algorithm for solving the linear least-squares problem (a code sketch follows the list):

  • (1) compute the QR factorization of Z;

  • (2) compute [{\bi Q}_{{\bi Z}}^{T}{\bf y}^{\prime }];

  • (3) solve Rx = [{\bi Q}_{{\bi Z}}^{T}{\bf y}^{\prime }] for x;

  • (4) compute the residual sum of squares from [{\bf y}^{\prime T}{\bf y}^{\prime }-{\bf y}^{\prime T}{\bi Q}_{{\bi Z}}{\bi Q}_{{\bi Z}}^{T}{\bf y}^{\prime }];

  • (5) compute the variance–covariance matrix from [{\bi V}_{\bf x} ={\bi R}^{-1}({\bi R}^{-1})^{T}].
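A minimal Python sketch of these five steps is given below. It is an illustration only, assuming the standardized quantities Z and y′ (here called yp) are available as NumPy arrays; the function name qr_least_squares is introduced here rather than taken from the text.

```python
# Minimal sketch of the five-step QR algorithm above (assumed names only).
import numpy as np
from scipy.linalg import solve_triangular

def qr_least_squares(Z, yp):
    # (1) QR factorization of Z: Q_Z is n x p with orthonormal columns, R is p x p
    Qz, R = np.linalg.qr(Z, mode='reduced')
    # (2) compute Q_Z^T y'
    c = Qz.T @ yp
    # (3) solve the upper triangular system R x = Q_Z^T y'
    x = solve_triangular(R, c)
    # (4) residual sum of squares, y'^T y' - y'^T Q_Z Q_Z^T y'
    S = yp @ yp - c @ c
    # (5) variance-covariance matrix, V_x = R^{-1} (R^{-1})^T
    R_inv = solve_triangular(R, np.eye(R.shape[0]))
    Vx = R_inv @ R_inv.T
    return x, S, Vx
```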

8.1.3.2. The normal equations

Let us now consider the relationship of the QR procedure for solving the linear least-squares problem to the classical method based on the normal equations. The normal equations can be derived by differentiating (8.1.3.2) and equating the result to a null vector. This yields [{\bi Z}^{T}{\bi Z}{\bf x}={\bi Z}^{T}{\bf y}^{\prime }. \eqno (8.1.3.5)] The algorithm is therefore to compute the cross-product matrix, [{\bi B}={\bi Z}^{T}{\bi Z}], and the right-hand side, d = ZTy′, and to solve the resulting system of equations, Bx = d. This is usually accomplished by computing the Cholesky decomposition of B, that is [{\bi B}={\bi C}^{T}{\bi C}], where C is upper triangular, and then solving the two triangular systems [{\bi C}^{T}{\bf v}={\bf d}] and [{\bi C}{\bf x}={\bf v}]. Because [{\bi Z}={\bi Q}_{{\bi Z}}{\bi R}], equation (8.1.3.5) becomes [{\bi R}^{T}{\bi Q}_{{\bi Z}}^{T}{\bi Q}_{{\bi Z}}{\bi R}{\bf x}={\bi R}^{T}{\bi Q}_{{\bi Z}}^{T}{\bf y}^{\prime }, \eqno (8.1.3.6)] or [{\bi R}^{T}{\bi R}{\bf x}={\bi R}^{T}{\bi Q}_{{\bi Z}}^{T}{\bf y}^{\prime }. \eqno (8.1.3.7)] It is clear that R is the Cholesky factor of [{\bi Z}^{T}{\bi Z}], although it is formed in a different way. This procedure requires of order [(np^{2})/2] operations to form the product [{\bi Z}^{T}{\bi Z}] and [p^{3}/3] operations for the Cholesky decomposition. In some situations, the extra time to compute the QR factorization is justified because of greater stability, as will be discussed below. Most other quantities of statistical interest can be computed directly from the QR factorization.
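For comparison, here is a minimal Python sketch of the normal-equations route just described. The function and variable names are assumptions; note that NumPy's Cholesky routine returns the lower triangular factor, i.e. L = C^T in the notation above.

```python
# Minimal sketch of the normal-equations route: form B = Z^T Z and d = Z^T y',
# factor B = C^T C, then solve the triangular systems C^T v = d and C x = v.
import numpy as np
from scipy.linalg import solve_triangular

def normal_equations(Z, yp):
    B = Z.T @ Z                    # cross-product matrix (about n p^2 / 2 operations)
    d = Z.T @ yp                   # right-hand side
    L = np.linalg.cholesky(B)      # B = L L^T, so the upper factor is C = L^T
    v = solve_triangular(L, d, lower=True)      # C^T v = d
    x = solve_triangular(L.T, v, lower=False)   # C x = v
    return x
```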

8.1.3.3. Conditioning

The condition number of Z, which is defined (Subsection 8.1.1.1) as the square root of the ratio of the largest to the smallest eigenvalue of [{\bi Z}^T{\bi Z}], is an indicator of the effect a small change in an element of Z will have on the elements of [({\bi Z}^T{\bi Z})^{-1}] and of [\widehat {{\bf x}}]. A large value of the condition number means that small errors in computing an element of Z, owing possibly to truncation or roundoff in the computer, can introduce large errors into the elements of the inverse matrix. Also, when the condition number is large, the standard uncertainties of some estimated parameters will be large. A large condition number, as defined in this way, can result from either scaling or correlation or some combination of these. To illustrate this, consider the matrices [{\bi Z}^T{\bi Z}=\left (\matrix{ 2+\varepsilon &0 \cr 0 &\varepsilon }\right) ] and [{\bi Z}^T{\bi Z}=\left (\matrix{ 1 &1-\varepsilon \cr 1-\varepsilon &1}\right), ] where [\varepsilon] represents machine precision, which can be defined as the smallest number in machine representation that, when added to 1, gives a result different from 1. By the conventional definition, both of these matrices give essentially the same condition number for Z, approximately [(2/\varepsilon)^{1/2}]. Because numbers of order [\varepsilon] can be perfectly well represented, however, the first one can be inverted without loss of precision, whereas an inverse for the second would be totally meaningless. It is good practice, therefore, to factor the design matrix, Z, into the form [{\bi Z}={\bi TS}, \eqno (8.1.3.8)] where S is a p × p diagonal matrix whose elements define some kind of `natural' unit appropriate to the parameter represented in each column of Z. The ideal natural unit would be the standard uncertainty of that parameter, but this is not available until after the calculation has been completed. If correlation is not too severe, suitable values for the elements of S, of the same order of magnitude as those derived from the standard uncertainty, are the column Euclidean norms, that is [S_{jj}=\left \| {\bf z}_j\right \| =({\bf z}_j^T{\bf z}_j)^{1/2}, \eqno (8.1.3.9)] where [{\bf z}_j] denotes the jth column of Z. This scaling causes all diagonal elements of [{\bi T}^T{\bi T}] to be equal to one, so that errors in the elements of Z will have roughly equal effects.
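The column scaling of equations (8.1.3.8) and (8.1.3.9) can be sketched in a few lines of Python; the function name column_scale and the example data are assumptions introduced here, and np.linalg.cond returns the same 2-norm condition number discussed above (the square root of the eigenvalue ratio of ZᵀZ).

```python
# Minimal sketch of the scaling Z = T S with S_jj = ||z_j||, equation (8.1.3.9).
import numpy as np

def column_scale(Z):
    s = np.linalg.norm(Z, axis=0)   # column Euclidean norms, S_jj
    T = Z / s                       # T = Z S^{-1}; every column of T has unit norm
    return T, np.diag(s)

# Example (assumed data): columns of very different scale inflate cond(Z);
# rescaling removes the purely scale-related part of the ill-conditioning.
# Z = np.random.rand(100, 3) * np.array([1.0, 1e4, 1e-4])
# T, S = column_scale(Z)
# print(np.linalg.cond(Z), np.linalg.cond(T))
```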

Ill-conditioning that results from correlation, as in the second example above, is more difficult to deal with. It is an indication that some linear combination of parameters, some eigenvector of the normal-equations matrix, is poorly determined by the available data. Use of the QR factorization of Z to compute the Cholesky factor of [{\bi Z}^T{\bi Z}] may be advantageous, in spite of the additional computation time, because better numerical stability is obtained in marginal situations. As a practical matter, however, it is important to recognize that an ill-conditioned matrix is a symptom of a flaw in the model or in the experimental design (or both). Use can be made of the fact that, although determining the entire set of eigenvalues and eigenvectors of a large matrix is computationally an inherently difficult problem, a relatively simple algorithm, known as a condition estimator (Anderson et al., 1992), can produce a good approximation to the eigenvector that corresponds to the smallest eigenvalue of a nearly singular matrix. This information can be used in either or both of two ways. First, without any fundamental modification to the model or the experiment, a simple, linear transformation of the parameters so that the problem eigenvector is one of the independent parameters, followed by rescaling, can resolve the numerical difficulties in computing the estimates. A common example is the situation where a phase transition results in the doubling of a unit cell, with pairs of atoms almost but not quite related by a lattice translation. A transformation that makes the estimated parameters the sums and differences of corresponding parameters in related pairs of atoms can make a dramatic improvement in the condition number. Alternatively, the problem eigenvector can be set to some value determined from theory or from some other experiment (see Section 8.3.1), or additional data can be collected that are selected to make that combination of parameters determinate.
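As a rough illustration of what such a condition estimator provides, the sketch below uses a full SVD (more expensive than the LAPACK estimators cited, but simple) to extract the eigenvector of ZᵀZ belonging to the smallest eigenvalue, i.e. the most poorly determined combination of parameters; the function name is an assumption.

```python
# Minimal sketch: identify the worst-determined linear combination of parameters.
# The last right singular vector of Z is the eigenvector of Z^T Z with the
# smallest eigenvalue; a condition estimator approximates this far more cheaply.
import numpy as np

def worst_determined_combination(Z):
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    cond = sigma[0] / sigma[-1]   # condition number of Z
    v_min = Vt[-1]                # problem eigenvector (coefficients of the combination)
    return cond, v_min
```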

References

Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S. & Sorensen, D. (1992). LAPACK users' guide, 2nd ed. Philadelphia: SIAM Publications.
