International Tables for Crystallography
Volume C: Mathematical, physical and chemical tables
Edited by E. Prince

International Tables for Crystallography (2006). Vol. C, ch. 8.4, pp. 702–706.
https://doi.org/10.1107/97809553602060000612

Chapter 8.4. Statistical significance tests

E. Prince(a) and C. H. Spiegelman(b)

(a) NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and (b) Department of Statistics, Texas A&M University, College Station, TX 77843, USA

Chapter 8.4 introduces the χ2 distribution and shows how it can be used to assess whether a model produced by a least-squares fit is consistent with the data. The F distribution, the distribution of the ratio of two independent random variables that both have χ2 distributions, is derived. The F distribution and two others derived from it, Student's t distribution and Hamilton's R-factor ratio distribution, can be used to decide whether one model is significantly better than another. The projection matrix is defined and the concept of leverage is introduced. The projection matrix can be used to determine which data points have the most influence in determining particular refined parameters.

Keywords: chi-squared distributions; influential data points; statistical significance tests; statistical validity.

In Chapter 8.1, we discussed the method of least squares and procedures for estimating the values of the adjustable parameters of a model that predicts the mean of a population from which experimental observations are drawn at random. Any model, however, will have some set of parameter values that gives the best least-squares fit. We must now address the question of whether that best fit is adequate, that is, whether it is plausible, given the precision of the data, to accept the hypothesis that the model really is a correct representation of the phenomena that have been measured in the collection of the data. In this chapter, we discuss the probability density function for the sum of squared residuals if the individual residuals are drawn from a normal distribution, the χ2 distribution, and the conditions under which this p.d.f. may be assumed to approximate a practical case. Next, we discuss the F distribution, which is the distribution of the ratio of two independent random variables, each of which has a χ2 distribution, and its use in comparing the fits of constrained and unconstrained versions of a model. We also discuss a test that is useful for a more general comparison of models. Finally, we discuss the variation among data points of their effectiveness in improving the precision of parameter estimates and the application of this analysis to the optimum design of experiments.

8.4.1. The χ2 distribution


We have seen [equation (8.1.2.1)] that the least-squares estimate is derived by finding the minimum value of a sum of terms of the form [ R_i=w_i[y_i-M_i({\bf x})]^2, \eqno (8.4.1.1)]and, further, that the precision of the estimate is optimized if the weight, [w_i], is the reciprocal of the variance of the population from which the observation is drawn, [w_i=1/\sigma _i^2]. Using this relation, (8.4.1.1) can be written [ R_i=\left\{\left[y_i-M_i({\bf x})\right]/\sigma _i\right\}^2. \eqno (8.4.1.2)]Each term is the square of a difference between observed and calculated values, expressed as a fraction of the standard uncertainty of the observed value. But, by definition, [ \sigma _i^2=\left \langle [y_i-M_i({\bf x})]^2\right \rangle, \eqno (8.4.1.3)]where x has its unknown `correct' value, so that 〈R_i〉 = 1, and the expected value of the sum of n such terms is n. It can be shown (Draper & Smith, 1981) that each parameter estimated reduces this expected sum by one, so that, for p estimated parameters, [\left \langle S\right \rangle =\left \langle \textstyle\sum\limits _{i=1}^n\left \{ \left [y_i-M_i\left (\widehat {{\bf x}}\right) \right]/\sigma _i\right \} ^2\right \rangle =n-p, \eqno (8.4.1.4)]where [\widehat {{\bf x}}] is the least-squares estimate. The standard uncertainty of an observation of unit weight, also referred to as the goodness-of-fit parameter, is defined by [ G=\left [{S \over n-p}\right] ^{1/2}=\left [\displaystyle {\left \{ \sum _{i=1}^nw_i\left [y_i-M_i(\widehat {{\bf x}})\right] ^2\right \} \over n-p}\right] ^{1/2}. \eqno (8.4.1.5)]From (8.4.1.4), it follows that 〈G〉 = 1 for a correct model with weights assigned in accordance with (8.4.1.2).
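
As a concrete illustration of equation (8.4.1.5) (an added sketch, not part of the original text), the goodness-of-fit parameter G can be computed directly from arrays of observed values, calculated values and standard uncertainties; the arrays and the parameter count p below are hypothetical.

```python
import numpy as np

def goodness_of_fit(y_obs, y_calc, sigma, p):
    """Standard uncertainty of an observation of unit weight, eq. (8.4.1.5).

    y_obs, y_calc and sigma are hypothetical arrays of observed values,
    model values and standard uncertainties; p is the number of refined
    parameters.  Weights are taken as w_i = 1/sigma_i**2.
    """
    y_obs, y_calc, sigma = map(np.asarray, (y_obs, y_calc, sigma))
    n = y_obs.size
    s = np.sum(((y_obs - y_calc) / sigma) ** 2)   # weighted sum of squares S
    return np.sqrt(s / (n - p))                   # G = [S/(n - p)]**(1/2)

# Example with made-up numbers: G close to one suggests the model is
# consistent with the data and the assigned uncertainties.
rng = np.random.default_rng(0)
y_calc = np.linspace(1.0, 10.0, 50)
sigma = np.full(50, 0.1)
y_obs = y_calc + rng.normal(0.0, 0.1, 50)
print(goodness_of_fit(y_obs, y_calc, sigma, p=3))
```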

A value of G that is close to one, if the weights have been assigned by [w_i=1/\sigma _i^2], is an indicator that the model is consistent with the data. It should be noted that it is not necessarily an indicator that the model is `correct', because it does not rule out the existence of an alternative model that fits the data as well or better. An assessment of the adequacy of the fit of a given model depends, however, on what is meant by `close to one', which depends in turn on the spread of a probability density function for G. We saw in Chapter 8.1 that least squares with this weighting scheme would give the best, linear, unbiased estimate of the model parameters, with no restrictions on the p.d.f.s of the populations from which the observations are drawn except for the implicit assumption that the variances of these p.d.f.s are finite. To construct a p.d.f. for G, however, it is necessary to make an assumption about the shapes of the p.d.f.s for the observations. The usual assumption is that these p.d.f.s can be described by the normal p.d.f., [ \Phi _N(x,\mu, \sigma)={1 \over \sqrt {2\pi }\sigma }\exp \left [-{\left (x-\mu \right) ^2 \over 2\sigma ^2}\right] . \eqno (8.4.1.6)]The justification for this assumption comes from the central-limit theorem, which states that, under rather broad conditions, the p.d.f. of the arithmetic mean of n observations drawn from a population with mean μ and variance σ2 tends, for large n, to a normal distribution with mean μ and variance [\sigma ^2/n]. [For a discussion of the central-limit theorem, see Cramér (1951).]

If we make the assumption of a normal distribution of errors and make the substitution z = (x − μ)/σ, (8.4.1.6) becomes [ \Phi _N(z,0,1)= {1\over\sqrt {2\pi }}\exp \left (- {{z^2}\over 2}\right). \eqno (8.4.1.7)]The probability that [z^2] will be less than χ2 is equal to the probability that z will lie in the interval [-\chi \leq z\leq \chi ], or [\Psi (\chi ^2)=\textstyle\int\limits_0^{\chi ^2}\Phi (z^2){\,{\rm d}}z^2=\textstyle\int\limits _{-\chi }^{+\chi }\Phi (z){\,{\rm d}}z. \eqno (8.4.1.8)]Letting [t=z^2/2] and substituting in (8.4.1.7), this becomes [ \Psi (\chi ^2) = { 1 \over \sqrt {\pi }} \int\limits_0^{\chi ^2/2}t^{-1/2} \exp (-t){\,{\rm d}}t. \eqno (8.4.1.9)][\Phi (\chi ^2)={\,{\rm d}}\Psi (\chi ^2)/{{\rm d}}\chi ^2], so that [ \matrix{ \Phi \left (\chi ^2\right) = \left (2\pi \chi ^2\right) ^{-1/2}\exp \left (-\chi ^2/2\right), \qquad &\chi ^2 \gt 0, \cr \Phi \left (\chi ^2\right) = 0, \hfill &\chi ^2\leq 0.} \eqno (8.4.1.10)]The joint p.d.f. of the squares of two random variables, [z_1] and [z_2], drawn independently from the same population with a normal p.d.f. is [ \Phi _J\left (z_1^2,z_2^2\right) = {1\over 2\pi z_1z_2}\exp \left [-{z_1^2+z_2^2 \over 2}\right] , \eqno (8.4.1.11)]and the p.d.f. of the sum, [s^2], of these two terms is the integral over the joint p.d.f. of all pairs of [z_1^2] and [z_2^2] that add up to [s^2]: [ \Phi (s^2)={1\over 2\pi }\exp \left (- {s^2\over 2}\right) \int\limits_0^{s^2}\left [z_1^2(s^2-z_1^2)\right]^{-1/2} {\,{\rm d}}z_1^2. \eqno (8.4.1.12)]This integral can be evaluated by use of the gamma and beta functions. The gamma function is defined for positive real x by [ \Gamma (x)=\textstyle\int\limits_0^\infty t^{x-1}\exp (-t){\,{\rm d}}t. \eqno (8.4.1.13)]Although this function is continuous for all [x\gt 0], its value is of interest in the context of this analysis only for x equal to positive, integral multiples of 1/2. It can be shown that Γ(1/2) = [\sqrt {\pi }], Γ(1) = 1, and Γ(x + 1) = xΓ(x). It follows that, for a positive integer, n, Γ(n) = (n − 1)!, and that Γ(3/2) = [\sqrt {\pi }/2], Γ(5/2) = [3\sqrt {\pi }/4], etc. The beta function is defined by [ B(x,y)=\textstyle\int\limits_0^1t^{x-1}(1-t)^{y-1}{\,{\rm d}}t. \eqno (8.4.1.14)]It can be shown (Prince, 1994) that [B(x,y)=\Gamma (x)\Gamma (y)/\Gamma (x+y)]. Making the substitution [t=z_1^2/s^2], (8.4.1.12) becomes [\eqalignno{ \Phi (s^2) &= {1\over2\pi }\exp \left (- {s^2\over 2}\right) \int _0^1\; \left [t(1-t)\right] ^{-1/2}{\,{\rm d}}t \cr &= {1\over 2\pi }\exp \left (- {s^2 \over 2}\right) B(1/2,1/2) \cr &= {\textstyle {1\over2}}\exp \left (- {s^2 \over 2}\right), \qquad s^2\geq 0.\qquad & (8.4.1.15)}]By a similar procedure, it can be shown that, if χ2 is the sum of ν terms, [z_1^2], [z_2^2], [\ldots ], [z_\nu^2], where all are drawn independently from a population with the p.d.f. given in (8.4.1.10), χ2 has the p.d.f. [\matrix{ \Phi \left (\chi ^2,\nu \right) = \displaystyle {\left (\chi ^2\right) ^{\nu /2-1} \over 2^{\nu /2}\Gamma (\nu /2) }\exp \left (- {\chi ^2 \over 2}\right), \qquad &\chi ^2\gt 0, \cr \Phi \left (\chi ^2,\nu \right) = 0, \hfill &\chi ^2\leq 0. & (8.4.1.16)}]The parameter ν is known as the number of degrees of freedom, but this use of that term must not be confused with the conventional use in physics and chemistry. The p.d.f. in (8.4.1.16) is the chi-squared distribution with ν degrees of freedom. Table 8.4.1.1 gives the values of χ2/ν for which the cumulative distribution function (c.d.f.) Ψ(χ2, ν) has various values for various choices of ν. This table is provided to enable verification of computer codes that may be used to generate more extensive tables. It was generated using a program included in the statistical library DATAPAC (Filliben, unpublished). Fortran code for this program appears in Prince (1994).

Table 8.4.1.1. Values of χ2/ν for which the c.d.f. Ψ(χ2, ν) has the values given in the column headings, for various values of ν

  ν      0.5     0.9     0.95    0.99    0.995
  1     0.4549  2.7055  3.8415  6.6349  7.8795
  2     0.6931  2.3026  2.9957  4.6052  5.2983
  3     0.7887  2.0838  2.6049  3.7816  4.2794
  4     0.8392  1.9449  2.3719  3.3192  3.7151
  6     0.8914  1.7741  2.0986  2.8020  3.0913
  8     0.9180  1.6702  1.9384  2.5113  2.7444
 10     0.9342  1.5987  1.8307  2.3209  2.5188
 15     0.9559  1.4871  1.6664  2.0385  2.1868
 20     0.9669  1.4206  1.5705  1.8783  1.9999
 25     0.9735  1.3753  1.5061  1.7726  1.8771
 30     0.9779  1.3419  1.4591  1.6964  1.7891
 40     0.9834  1.2951  1.3940  1.5923  1.6692
 50     0.9867  1.2633  1.3501  1.5231  1.5898
 60     0.9889  1.2400  1.3180  1.4730  1.5325
 80     0.9917  1.2072  1.2735  1.4041  1.4540
100     0.9933  1.1850  1.2434  1.3581  1.4017
120     0.9945  1.1686  1.2214  1.3246  1.3638
140     0.9952  1.1559  1.2044  1.2989  1.3346
160     0.9958  1.1457  1.1907  1.2783  1.3114
200     0.9967  1.1301  1.1700  1.2472  1.2763
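
Since the table is intended for checking computer codes, a short Python sketch (an addition for illustration; SciPy's chi2.ppf is assumed available as the inverse c.d.f.) can reproduce its entries:

```python
from scipy.stats import chi2

# Reproduce entries of Table 8.4.1.1: values of chi**2/nu at given c.d.f. levels.
levels = [0.5, 0.9, 0.95, 0.99, 0.995]
for nu in (1, 2, 3, 4, 6, 8, 10, 20, 50, 100, 200):
    row = [chi2.ppf(p, nu) / nu for p in levels]
    print(f"{nu:4d}  " + "  ".join(f"{x:6.4f}" for x in row))
# For example, nu = 1 at the 0.95 level gives 3.8415, matching the table.
```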

The quantity (n − p)G² is the sum of n terms that have mean value (n − p)/n. Because the process of determining the least-squares fit establishes p relations among them, however, only (n − p) of the terms are independent. The number of degrees of freedom is therefore ν = (n − p), and, if the model is correct, and the terms have been properly weighted, χ2 = (n − p)G² has the chi-squared distribution with (n − p) degrees of freedom. In crystallography, the number of degrees of freedom tends to be large, and the p.d.f. for G correspondingly sharp, so that even rather small deviations from G² = 1 should cause one or both of the hypotheses of a correct model and appropriate weights to be rejected. It is common practice to assume that the model is correct, and that the weights have correct relative values, that is, that they have been assigned by [w_i=k/\sigma _i^2], where k is a number different from, usually greater than, one. G² is then taken to be an estimate of k, and all elements of (A^TWA)^{−1} (Section 8.1.2) are multiplied by G² to get an estimated variance–covariance matrix. The range of validity of this procedure is limited at best. It is discussed further in Chapter 8.5.
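
A minimal sketch of that common practice (illustration only; the design matrix A, weight matrix W and goodness of fit G are hypothetical inputs) is:

```python
import numpy as np

def scaled_covariance(A, W, G):
    """Multiply (A^T W A)^{-1} by G**2, the common (if limited) practice
    described above when the weights are believed to be correct only in a
    relative sense.  A is the n x p design matrix, W the n x n weight
    matrix and G the goodness-of-fit parameter; all are hypothetical."""
    cov = np.linalg.inv(A.T @ W @ A)
    return G ** 2 * cov
```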

8.4.2. The F distribution


Consider an unconstrained model with p parameters and a constrained one with q parameters, where [q\lt p]. We wish to decide whether the constrained model represents an adequate fit to the data, or if the additional parameters in the unconstrained model provide, in some important sense, a better fit to the data. Provided the (p − q) additional columns of the design matrix, A, are linearly independent of the previous q columns, the sum of squared residuals must be reduced by some finite amount by adjusting the additional parameters, but we must decide whether this improved fit would have occurred purely by chance, or whether it represents additional information.

Let [s_c^2] and [s_u^2] be the weighted sums of squared residuals for the constrained and unconstrained models, respectively. If the constrained and unconstrained models are equally good representations of the data, and the weights have been assigned by [w_i=1/\sigma _i^2], the expected values of the sums of squares are [\langle s_c^2\rangle = (n-q)] and [\langle s_u^2\rangle =(n-p)], and, further, they should be distributed as χ2 with (n − q) and (n − p) degrees of freedom, respectively. Also, [\langle s_c^2-s_u^2\rangle = (p-q)], and [(s_c^2-s_u^2)] is distributed as χ2 with (p − q) degrees of freedom. [s_c^2] and [s_u^2] are not independent, but [(s_c^2-s_u^2)] is the squared magnitude of a vector in a (p − q)-dimensional subspace that is orthogonal to the (n − p)-dimensional space of [s_u^2]. Therefore, [s_u^2] and [(s_c^2-s_u^2)] are independent random variables, each with a χ2 distribution. Let [\chi _1^2=(s_c^2-s_u^2)], [\chi _2^2=s_u^2], ν1 = p − q, and ν2 = n − p. The ratio F = [(\chi _1^2/\nu _1)/(\chi _2^2/\nu _2)] should have a value close to one, even if the weights have relative rather than absolute values, but we need a measure of how far away from one this ratio can be before we must reject the hypothesis that the two models are equally good representations of the data. The conditional p.d.f. for F, given a value of [\chi _2^2], is [ \Phi _C\left (F|\chi _2^2\right) = {\left [\left (\nu _1/\nu _2\right) \chi _2^2\right] ^{\nu _1/2}F^{\nu _1/2-1} \over 2^{\nu _1/2}\Gamma (\nu _1/2) }\exp \left [-(\nu _1/\nu _2) \chi _2^2F/2\right] , \eqno (8.4.2.1)]and the marginal p.d.f. for [\chi _2^2] is [ \Phi _M\left (\chi _2^2\right) = {\left (\chi _2^2\right) ^{\nu _2/2-1} \over 2^{\nu _2/2}\Gamma(\nu _2/2) }\exp \left (-\chi _2^2/2\right). \eqno (8.4.2.2)]The marginal p.d.f. for F is obtained by integration of the joint p.d.f., [ \Phi (F)=\textstyle\int\limits_0^\infty \Phi _C\left (F|\chi _2^2\right) \!\Phi _M\!\left(\chi _2^2\right) {\,{\rm d}}\chi _2^2, \eqno (8.4.2.3)]yielding the result [ \Phi (F,\nu _1,\nu _2)= {\left (\nu _1/\nu _2\right)^{\nu _1/2} F^{\nu _1/2-1} \over B\left (\nu _1/2,\nu _2/2\right) \left [1+\left (\nu _1/\nu _2\right) F\right] ^{\left (\nu _1+\nu _2\right) /2}}. \eqno (8.4.2.4)]This p.d.f. is known as the F distribution with [\nu _1] and [\nu _2] degrees of freedom. Table 8.4.2.1 gives the values of F for which the c.d.f. [\Psi (F,\nu _1,\nu _2)] is equal to 0.95 for various choices of ν1 and ν2. Fortran code for the program from which the table was generated appears in Prince (1994).

Table 8.4.2.1. Values of the F ratio for which the c.d.f. Ψ(F, ν1, ν2) has the value 0.95, for various choices of ν1 and ν2

 ν2\ν1     1       2       4       8       15
   10    4.9646  4.1028  3.4781  3.0717  2.8450
   20    4.3512  3.4928  2.8661  2.4471  2.2033
   30    4.1709  3.3158  2.6896  2.2662  2.0148
   40    4.0847  3.2317  2.6060  2.1802  1.9245
   50    4.0343  3.1826  2.5572  2.1299  1.8714
   60    4.0012  3.1504  2.5252  2.0970  1.8364
   80    3.9604  3.1108  2.4859  2.0564  1.7932
  100    3.9361  3.0873  2.4626  2.0323  1.7675
  120    3.9201  3.0718  2.4472  2.0164  1.7505
  150    3.9042  3.0564  2.4320  2.0006  1.7335
  200    3.8884  3.0411  2.4168  1.9849  1.7167
  300    3.8726  3.0259  2.4017  1.9693  1.6998
  400    3.8648  3.0183  2.3943  1.9616  1.6914
  600    3.8570  3.0107  2.3868  1.9538  1.6831
 1000    3.8508  3.0047  2.3808  1.9477  1.6764
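
As with Table 8.4.1.1, these entries can be checked with a few lines of Python (an added illustration; scipy.stats.f.ppf is assumed to provide the inverse of Ψ(F, ν1, ν2)):

```python
from scipy.stats import f

# Reproduce a few rows of Table 8.4.2.1: F values for which the c.d.f. is 0.95.
for nu2 in (10, 60, 1000):
    row = [f.ppf(0.95, nu1, nu2) for nu1 in (1, 2, 4, 8, 15)]
    print(f"nu2={nu2:5d}: " + "  ".join(f"{x:6.4f}" for x in row))
# f.ppf(0.95, 1, 10) = 4.9646, in agreement with the first entry of the table.
```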

The cumulative distribution function [\Psi (F,\nu _1,\nu _2)] gives the probability that the F ratio will be less than some value by chance if the models are equally consistent with the data. It is therefore a necessary, but not sufficient, condition for concluding that the unconstrained model gives a significantly better fit to the data that [\Psi (F,\nu _1,\nu _2)] be greater than 1 − α, where α is the desired level of significance. For example, if [\Psi (F,\nu _1,\nu _2)] = 0.95, the probability is only 0.05 that a value of F this large or greater would have been observed if the two models were equally good representations of the data.

Hamilton (1964) observed that the F ratio could be expressed in terms of the crystallographic weighted R index, which is defined, for refinement on |F| (and similarly for refinement on |F|2), by [R_w=\left [\mathop {\textstyle \sum }w_i(|{\rm F}_o|_i-|{\rm F}_c|_i)^2\big/\mathop {\textstyle \sum }w_i|{\rm F}_o|_i^2\right] ^{1/2}. \eqno (8.4.2.5)]

Denoting by [R_c] and [R_u] the weighted R indices for the constrained and unconstrained models, respectively, [ F=(\nu _2/\nu _1)[(R_c/R_u)^2 - 1], \eqno (8.4.2.6)]and a c.d.f. for [R_c/R_u] can be readily derived from this relation. A significance test based on [R_c/R_u] is known as Hamilton's R-ratio test; it is entirely equivalent to a test on the F ratio.
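
A sketch of how the R-ratio test might be applied (an illustration added here, not part of the original text; the R values, parameter counts and number of data below are invented) converts the ratio to an F value with equation (8.4.2.6) and compares its c.d.f. with the chosen significance level:

```python
from scipy.stats import f

def hamilton_test(R_c, R_u, p, q, n, alpha=0.05):
    """Hamilton R-ratio test via the equivalent F ratio, eq. (8.4.2.6).

    R_c, R_u: weighted R indices of the constrained and unconstrained
    models; p, q: their numbers of parameters (q < p); n: number of data.
    Returns the F value and whether the unconstrained model fits
    significantly better at level alpha."""
    nu1, nu2 = p - q, n - p
    F = (nu2 / nu1) * ((R_c / R_u) ** 2 - 1.0)
    significant = f.cdf(F, nu1, nu2) > 1.0 - alpha
    return F, significant

# Hypothetical example: 2000 reflections, 80 parameters versus 85.
print(hamilton_test(R_c=0.0450, R_u=0.0442, p=85, q=80, n=2000))
```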

8.4.3. Comparison of different models


Tests based on F or the R ratio have several limitations. An important one is that they are applicable only when the parameters of one model form a subset of the parameters of the other. Also, the F test makes no distinction between improvement in fit as a result of small improvements throughout the entire data set and a large improvement in a small number of critically sensitive data points. A test that can be used for comparing arbitrary pairs of models, and that focuses attention on those data points that are most sensitive to differences in the models, was introduced by Williams & Kloot (1953; see also Himmelblau, 1970; Prince, 1982).

Consider a set of observations, [y_{0i}], and two models that predict values for these observations, [y_{1i}] and [y_{2i}], respectively. We determine the slope of the regression line [z=\lambda x], where [z_i=\left [y_{0i}-(1/2)\left (y_{1i}+y_{2i}\right) \right] /\sigma _i], and [x_i=\left (y_{1i}-y_{2i}\right) /\sigma _i]. Suppose model 1 is a perfect fit to the data, which have been measured with great precision, so that [y_{0i}=y_{1i}] for all i. Under these conditions, λ = +1/2. Similarly, if model 2 is a perfect fit, λ = −1/2. Real experimental data, of course, are subject to random error, and |λ| in general would be expected to be less than 1/2. A least-squares estimate of λ is [ \widehat {\lambda }= { \sum\limits _{i=1}^nz_i\,x_i\; \over \sum\limits_{i=1}^nx_i^2}, \eqno (8.4.3.1)]and it has an estimated variance [ \widehat {\sigma }\,_\lambda ^2= { \sum\limits_{i=1}^n\, z_i^2-\widehat {\lambda }\,^2\sum\limits _{i=1}^n\,x _i^2 \over \left (n-1\right) \sum\limits_{i=1}^n\, x_i^2}. \eqno (8.4.3.2)]The hypothesis that the two models give equally good fits to the data can be tested by considering [\widehat {\lambda }] to be an unconstrained, one-parameter fit that is to be compared with a constrained, zero-parameter fit for which λ = 0. A p.d.f. for making this comparison can be derived from an F distribution with ν1 = 1 and ν2 = ν = (n − 1): [ \Phi (F,1,\nu)= {\Gamma [(\nu +1) /2] \over\sqrt {\pi \nu F}\,\Gamma (\nu /2) (1+F/\nu ) ^{(\nu +1) /2}}. \eqno (8.4.3.3)]If we let [|t|=\sqrt {F}], and use [ \textstyle\int\limits_0^{F_0}\Phi (F,1,\nu){\,{\rm d}}F=\textstyle\int\limits_{-t_0}^{+t_0}\Phi (t,\nu){\,{\rm d}}t, \eqno (8.4.3.4)]we can derive a p.d.f. for t, which is [ \Phi (t,\nu)= {\Gamma [ (\nu +1) /2] \over \sqrt {\pi \nu }\Gamma (\nu /2) [1+t^2/\nu] ^{ (\nu +1) /2}}. \eqno (8.4.3.5)]This p.d.f. is known as Student's t distribution with ν degrees of freedom. Setting [t=\widehat {\lambda }/\widehat {\sigma }_\lambda ], the c.d.f. Ψ(t, ν) can be used to test the alternative hypotheses λ = 0 and λ = ±1/2. Table 8.4.3.1 gives the values of t for which the c.d.f. Ψ(t, ν) has various values for various values of ν. Fortran code for the program from which this table was generated appears in Prince (1994).

Table 8.4.3.1. Values of t for which the c.d.f. Ψ(t, ν) has the values given in the column headings, for various values of ν

  ν      0.75    0.90    0.95     0.99     0.995
  1     1.0000  3.0777  6.3138  31.8206  63.6570
  2     0.8165  1.8856  2.9200   6.9646   9.9249
  3     0.7649  1.6377  2.3534   4.5407   5.8409
  4     0.7407  1.5332  2.1319   3.7469   4.6041
  6     0.7176  1.4398  1.9432   3.1427   3.7074
  8     0.7064  1.3968  1.8596   2.8965   3.3554
 10     0.6998  1.3722  1.8125   2.7638   3.1693
 12     0.6955  1.3562  1.7823   2.6810   3.0546
 14     0.6924  1.3450  1.7613   2.6245   2.9769
 16     0.6901  1.3368  1.7459   2.5835   2.9208
 20     0.6870  1.3253  1.7247   2.5280   2.8453
 25     0.6844  1.3164  1.7081   2.4851   2.7874
 30     0.6828  1.3104  1.6973   2.4573   2.7500
 35     0.6816  1.3062  1.6896   2.4377   2.7238
 40     0.6807  1.3031  1.6839   2.4233   2.7045
 50     0.6794  1.2987  1.6759   2.4033   2.6778
 60     0.6786  1.2958  1.6707   2.3901   2.6603
 80     0.6776  1.2922  1.6641   2.3739   2.6387
100     0.6770  1.2901  1.6602   2.3642   2.6259
120     0.6765  1.2886  1.6577   2.3578   2.6174
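
The Williams–Kloot comparison described above can be coded in a few lines. The following is an illustrative sketch (not from the original text; the observation and model arrays passed in are hypothetical) of equations (8.4.3.1) and (8.4.3.2) together with the t test:

```python
import numpy as np
from scipy.stats import t as student_t

def williams_kloot(y_obs, y_model1, y_model2, sigma):
    """Williams-Kloot comparison of two models, eqs. (8.4.3.1)-(8.4.3.2).

    Returns lambda_hat, its standard uncertainty and the two-sided
    probability of |t| this large arising by chance if the two models fit
    equally well (lambda = 0).  lambda near +1/2 favours model 1,
    near -1/2 favours model 2."""
    y_obs, y1, y2, sigma = map(np.asarray, (y_obs, y_model1, y_model2, sigma))
    z = (y_obs - 0.5 * (y1 + y2)) / sigma
    x = (y1 - y2) / sigma
    lam = np.sum(z * x) / np.sum(x ** 2)                          # eq. (8.4.3.1)
    n = y_obs.size
    var = (np.sum(z ** 2) - lam ** 2 * np.sum(x ** 2)) / ((n - 1) * np.sum(x ** 2))
    sig_lam = np.sqrt(var)                                        # eq. (8.4.3.2)
    t_val = lam / sig_lam
    p_value = 2.0 * (1.0 - student_t.cdf(abs(t_val), n - 1))
    return lam, sig_lam, p_value
```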

Again, it must be understood that the results of these statistical comparisons do not imply that either model is a correct one. A statistical indication of a good fit says only that, given the model, the experimenter should not be surprised at having observed the data values that were observed. It says nothing about whether the model is plausible in terms of compatibility with the laws of physics and chemistry. Nor does it rule out the existence of other models that describe the data as well as or better than any of the models tested.

8.4.4. Influence of individual data points


When the method of least squares, or any variant of it, is used to refine a crystal structure, it is implicitly assumed that a model with adjustable parameters makes an unbiased prediction of the experimental observations for some (a priori unknown) set of values of those parameters. The existence of any reflection whose observed intensity is inconsistent with this assumption, that is, that it differs from the predicted value by an amount that cannot be reconciled with the precision of the measurement, must cause the model to be rejected, or at least modified. In making precise estimates of the values of the unknown parameters, however, different reflections do not all carry the same amount of information (Shoemaker, 1968; Prince & Nicholson, 1985). For an obvious example, consider a space-group systematic absence. Except for possible effects of multiple diffraction or twinning, any observed intensity at a position corresponding to a systematic absence is proof that the screw axis or glide plane is not present. If no intensity is observed for any such reflection, however, any parameter values that conform to the space group are equally acceptable. It is to be expected, on the other hand, that some intensities will be extremely sensitive to small changes in some parameter, and that careful measurement of those intensities will lead to correspondingly precise estimates of the parameter values. For the purpose of precise structure refinement, it is useful to be able to identify the influential reflections.

Consider a vector of observations, y, and a model M(x). The elements of y define an n-dimensional space, and the model values, M_i(x), define a p-dimensional subspace within it. The least-squares solution [equation (8.1.2.7)], [ \widehat {{\bf x}}=({\bi A}^T{\bi WA})^{-1}{\bi A}^T{\bi W}({\bf y}-{\bf y}_0), \eqno (8.4.4.1)]is such that [\widehat {{\bf y}}={\bi M}(\widehat {{\bf x}})] is the closest point to y that corresponds to some possible value of x. In (8.4.4.1), W = V^{−1} is the inverse of the variance–covariance matrix for the joint p.d.f. of the elements of y, and [{\bf y}_0={\bi M}({\bf x}_0)] is a point in the p-dimensional subspace close enough to [{\bi M}(\widehat {{\bf x}})] so that the linear approximation [ {\bi M}({\bf x})={\bf y}_0+{\bi A}({\bf x}-{\bf x}_0) \eqno (8.4.4.2)][where [A_{ij}=\partial M_i({\bf x})/\partial x_j]] is a good one. Let R be the Cholesky factor of W, so that [{\bi W}={\bi R}{^T}{\bi R}], and let Z = RA, [{\bf y}^{\prime }={\bf y}-{\bf y}_0], and [\widehat {{\bf y}}^{\prime }=\widehat {{\bf y}}-{\bf y}_0]. The least-squares estimate may then be written [ \widehat {{\bf x}}={\bf x}_0+({\bi Z}^T{\bi Z})^{-1}{\bi Z}^T{\bf y}^{\prime } \eqno (8.4.4.3)]and [ \widehat {{\bf y}}^{\prime }={\bi Z}(\widehat {{\bf x}}-{\bf x}_0)={\bi Z}({\bi Z}^T{\bi Z})^{-1}{\bi Z}^T{\bf y}^{\prime }. \eqno (8.4.4.4)]Thus, the matrix P = Z(Z^TZ)^{−1}Z^T, the projection matrix, is a linear relation between the observed data values and the corresponding calculated values. (Because [\widehat {{\bf y}}\,^{\prime }={\bi P}{\bf y}'], the matrix P is frequently referred to in the statistical literature as the hat matrix.) P² = Z(Z^TZ)^{−1}Z^TZ(Z^TZ)^{−1}Z^T = Z(Z^TZ)^{−1}Z^T = P, so that P is idempotent. P is an n × n positive semidefinite matrix with rank p, and its eigenvalues are either 1 (p times) or 0 (n − p times). Its diagonal elements lie in the range [0\leq P_{ii}\leq 1], and the trace of P is p, so that the average value of [P_{ii}] is p/n. Furthermore, [ P_{ii}=\textstyle\sum\limits_{j=1}^nP_{ij}^2. \eqno (8.4.4.5)]A diagonal element of P is a measure of the influence that an observation has on its own calculated value. If [P_{ii}] is close to one, the model is forced to fit the ith data point, which puts a constraint on the value of the corresponding function of the parameters. A very small value of [P_{ii}], because of (8.4.4.5), implies that all elements of that row must be small, and that observation has little influence on its own or any other calculated value. Because it is a measure of influence on the fit, [P_{ii}] is sometimes referred to as the leverage of the ith observation. Note that, because [({\bi Z}^T{\bi Z})^{-1}={\bi V}_{{\bf x}}], the variance–covariance matrix for the elements of [\widehat {{\bf x}}], [{\bi P}] is the variance–covariance matrix for [\widehat {{\bf y}}], whose elements are functions of the elements of [\widehat {{\bf x}}]. A large value of [P_{ii}] means that [\widehat {y}_i] is poorly defined by the elements of [\widehat {{\bf x}}], which implies in turn that some elements of [\widehat {{\bf x}}] must be precisely defined by a precise measurement of [y_i^{\prime }].
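
The leverages P_ii are straightforward to compute numerically. The sketch below is an added illustration (A and V are hypothetical inputs); it forms Z = RA from the Cholesky factor of W = V^{−1} and evaluates the diagonal of P without building the full n × n matrix:

```python
import numpy as np

def leverages(A, V):
    """Diagonal elements P_ii of the projection (hat) matrix.

    A: n x p design matrix of derivatives dM_i/dx_j;
    V: n x n variance-covariance matrix of the observations.
    W = V^{-1} = R^T R (Cholesky), Z = R A, P = Z (Z^T Z)^{-1} Z^T."""
    W = np.linalg.inv(V)
    # np.linalg.cholesky returns lower-triangular L with W = L L^T,
    # so R = L^T satisfies W = R^T R.
    R = np.linalg.cholesky(W).T
    Z = R @ A
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    # diag(P)_i = sum_jk Z[i,j] (Z^T Z)^{-1}[j,k] Z[i,k]
    return np.einsum('ij,jk,ik->i', Z, ZtZ_inv, Z)

# The leverages lie in [0, 1] and sum to p; observations with P_ii near
# one dominate the fit.
```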

It is apparent that, in a real experiment, there will be appreciable variation among observations in their leverage. It can be shown (Fedorov, 1972; Prince & Nicholson, 1985) that the observations with the greatest leverage also have the largest effect on the volume of the p-dimensional confidence region for the parameter estimates. Because this volume is a rather gross measure, however, it is useful to have a measure of the influence of individual observations on individual parameters. Let [{\bi V}_n] be the variance–covariance matrix for a refinement including n observations, and let z be a row vector whose elements are [z_j=[\partial M({\bf x})/\partial x_j]/\sigma] for an additional observation. [{\bi V}_{n+1}], the variance–covariance matrix with the additional observation included, is, by definition, [ {\bi V}_{n+1}=({\bi Z}^T{\bi Z}+{\bf z}^T{\bf z})^{-1}, \eqno (8.4.4.6)]which, in the linear approximation, can be shown to be [ {\bi V}_{n+1}={\bi V}_n-{\bi V}_n{\bf z}^T{\bf z}{\bi V}_n/(1+{\bf z}{\bi V}_n{\bf z}^T). \eqno (8.4.4.7)]The diagonal elements of the rank-one matrix [{\bi D}={\bi V}_n{\bf z}^T{\bf z}{\bi V}_n/(1+{\bf z}{\bi V}_n{\bf z}^T)] are therefore the amounts by which the variances of the estimates of individual parameters will be reduced by inclusion of the additional observation.
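
A small numerical sketch of this rank-one update (an added illustration; V_n and the derivative row z below are invented) shows how equation (8.4.4.7) gives the variance reduction for each parameter:

```python
import numpy as np

def variance_reduction(V_n, z):
    """Rank-one matrix D of eq. (8.4.4.7): D_jj is the reduction in the
    variance of parameter j on adding one observation whose weighted
    derivative row is z, with z_j = (dM/dx_j)/sigma.  Both inputs are
    hypothetical illustrations."""
    z = np.atleast_2d(z)                                    # row vector (1, p)
    denom = (z @ V_n @ z.T).item()                          # scalar z V_n z^T
    D = V_n @ z.T @ z @ V_n / (1.0 + denom)
    return np.diag(D)

# Example: the added observation is most informative about parameter 0,
# whose variance drops from 0.04 to about 0.02.
V_n = np.diag([0.04, 0.01, 0.09])
z = np.array([5.0, 0.5, 0.0])
print(variance_reduction(V_n, z))
```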

This result depends on the elements of Z and z not changing significantly in the (presumably small) shift from [\widehat {{\bf x}}_n] to [\widehat {{\bf x}}_{n+1}]. That this condition is satisfied may be verified by the following procedure. Find an approximation to [\widehat {{\bf x}}_{n+1}] by a line search along the line [{\bf x}=\widehat {{\bf x}}_n+\alpha {\bi V}_{n+1}{\bf z}^Ty_{n+1}^{\prime }], and then evaluate B, a quasi-Newton update such as the BFGS update (Subsection 8.1.4.3), at that point. If α = 1, and the gradient of the sum of squares vanishes, then the linear approximation is exact, and B is null. If [ \left | B_{ij}\right | \ll \left [\left ({\bi Z}^T{\bi Z}+{\bf z}^T{\bf z}\right) _{ii}\left ({\bi Z}^T{\bi Z}+{\bf z}^T{\bf z}\right) _{jj}\right] ^{1/2} \eqno (8.4.4.8)]for all i and j, then (8.4.4.7) can be expected to be an excellent approximation for a nonlinear model.

References

Cramér, H. (1951). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Draper, N. & Smith, H. (1981). Applied regression analysis. New York: John Wiley.
Fedorov, V. V. (1972). Theory of optimal experiments, translated by W. J. Studden & E. M. Klimko. New York: Academic Press.
Hamilton, W. C. (1964). Statistics in physical science: estimation, hypothesis testing and least squares. New York: Ronald Press.
Himmelblau, D. M. (1970). Process analysis by statistical methods. New York: John Wiley.
Prince, E. (1982). Comparison of the fits of two models to the same data set. Acta Cryst. B38, 1099–1100.
Prince, E. (1994). Mathematical techniques in crystallography and materials science, 2nd ed. Berlin: Springer-Verlag.
Prince, E. & Nicholson, W. L. (1985). Influence of individual reflections on the precision of parameter estimates in least squares refinement. Structure and statistics in crystallography, edited by A. J. C. Wilson, pp. 183–195. Guilderland, NY: Adenine Press.
Shoemaker, D. P. (1968). Optimization of counting time in computer controlled X-ray and neutron single-crystal diffractometry. Acta Cryst. A24, 136–142.
Williams, E. J. & Kloot, N. H. (1953). Interpolation in a series of correlated observations. Aust. J. Appl. Sci. 4, 1–17.







































