International Tables for Crystallography Volume C: Mathematical, physical and chemical tables. Edited by E. Prince. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. C, Chapter 8.2, pp. 689–692.
https://doi.org/10.1107/97809553602060000610

Chapter 8.2. Other refinement methods

Least squares is a powerful data-fitting method when the distribution of statistical fluctuations in the data is approximately normal, or Gaussian, but it can perform poorly if the distribution function has longer tails than a Gaussian distribution. Chapter 8.2 discusses several procedures that work better than least squares if the normality condition is not satisfied. Maximum-likelihood methods, which are identical to least squares for a normal distribution, can be designed to be optimum for any distribution. Other methods are robust, because they work well over a broad range of distributions, and resistant, because they are insensitive to the presence in the data of points that disagree with the model. Maximum-entropy methods are particularly useful when there are insufficient data.

Keywords: entropy maximization; maximum-likelihood methods; robust/resistant methods.
Chapter 8.1 discusses structure refinement by the method of least squares, which has a long history of successful use in data fitting and statistical analysis of results. It is an excellent technique to use in a wide range of practical problems, it is easy to implement, and it usually gives results that are straightforward and unambiguous. If a set of observations, y_{i}, is an unbiased estimate of the values of model functions, M_{i}(x), a properly weighted least-squares estimate is the best, linear, unbiased estimate of the parameters, x, provided the variances of the p.d.f.s of the populations from which the observations are drawn are finite. This assumes, however, that the model is correct and complete, an assumption whose validity may not be easily justified. Furthermore, least squares tends to perform poorly when the distribution of errors in the observations has longer tails than a normal, or Gaussian, distribution. For these reasons, a number of other procedures have been developed that attempt to retain the strengths of least squares but are less sensitive to departures from the ideal conditions that have been implicitly assumed. In this chapter, we discuss several of these methods. Two of them, maximum-likelihood methods and robust/resistant methods, are closely related to least squares. A third uses a function that is mathematically related to the entropy function of thermodynamics and statistical mechanics, and is therefore referred to as the maximum-entropy method. For a discussion of the particular application of least squares to structure refinement with powder data that has become known as the Rietveld method (Rietveld, 1969), see Chapter 8.6.
In Chapter 8.1, structure refinement is presented as finding the answer to the question, `given a set of observations drawn randomly from populations whose means are given by a model, M(x), for some set of unknown parameters, x, how can we best determine the means, variances and covariances of a joint probability density function that describes the probabilities that the true values of the elements of x lie in certain ranges?'. For a broad class of density functions for the observations, the linear estimate that is unbiased and has minimum variances for all parameters is given by the properly weighted method of least squares. The problem can also be stated in the slightly different manner, `given a model and a set of observations, what is the likelihood of observing those particular values, and for what values of the parameters of the model is that likelihood a maximum?'. This set of parameters is the maximum-likelihood estimate.
Suppose the ith observation, y_{i}, is drawn from a population whose p.d.f. is Φ_{i}(Δ_{i}), where Δ_{i} = [y_{i} − M_{i}(x)]/s_{i}, x is the set of `true' values of the parameters, and s_{i} is a measure of scale appropriate to that observation. If the observations are independent, their joint p.d.f. is the product of the individual, marginal p.d.f.s: Φ(Δ) = ∏_{i=1}^{n} Φ_{i}(Δ_{i}). (8.2.1.1) The function Φ_{i}(Δ_{i}) can also be viewed as a conditional p.d.f. for y_{i} given M_{i}(x), or, equivalently, as a likelihood function for x given an observed value of y_{i}, in which case it is written ℓ(x|y_{i}). Because a value actually observed logically must have a finite, positive likelihood, the density function in (8.2.1.1) and its logarithm will be maximum for the same values of x: ln ℓ(x|y) = Σ_{i=1}^{n} ln ℓ(x|y_{i}). (8.2.1.2) In the particular case where the error distribution is normal, so that Φ_{i}(Δ_{i}) = (2π)^{−1/2} exp(−Δ_{i}^{2}/2), and s_{i} = σ_{i}, the standard uncertainty of the ith observation, is known, then ln ℓ(x|y) = −(n/2) ln(2π) − (1/2) Σ_{i=1}^{n} [(y_{i} − M_{i}(x))/σ_{i}]^{2}, and the logarithm of the likelihood function is maximum when Σ_{i=1}^{n} [(y_{i} − M_{i}(x))/σ_{i}]^{2} is minimum, and the maximum-likelihood estimate and the least-squares estimate are identical.
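As an illustration of this equivalence, here is a minimal numerical sketch, with made-up observations and a hypothetical one-parameter model M_{i}(x) = x (a constant), showing that maximizing the normal log-likelihood and minimizing the weighted sum of squares select the same estimate:

```python
import math

# Made-up observations y_i and standard uncertainties sigma_i for the
# hypothetical constant model M_i(x) = x.
y     = [9.8, 10.3, 10.1, 9.9]
sigma = [0.2, 0.4, 0.1, 0.3]

def log_likelihood(x):
    """ln l(x|y) = sum_i ln[(2*pi)^(-1/2) exp(-Delta_i^2/2)], Delta_i = (y_i - x)/sigma_i."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * ((yi - x) / si) ** 2
               for yi, si in zip(y, sigma))

# Weighted least squares for this model has the closed form
# x = sum(w_i y_i) / sum(w_i) with w_i = 1/sigma_i^2.
w = [1.0 / si ** 2 for si in sigma]
x_ls = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# Maximize the log-likelihood numerically on a fine grid from 9.5 to 10.5.
grid = [9.5 + 0.0001 * k for k in range(10001)]
x_ml = max(grid, key=log_likelihood)

assert abs(x_ml - x_ls) < 1e-3   # the two estimates agree
```

The grid search stands in for a proper maximizer purely for transparency; for a normal error distribution the log-likelihood is, up to a constant, the negative of the weighted sum of squares, so the two criteria must pick the same point.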
For an error distribution that is not normal, the maximum-likelihood estimate will be different from the least-squares estimate, but it will, in general, involve finding a set of parameters for which a sum of terms like those in (8.2.1.2) is a maximum (or the sum of the negatives of such terms is a minimum). It can thus be expressed in the general form: find the minimum of the sum S = Σ_{i=1}^{n} ρ(Δ_{i}), where ρ is defined by ρ(x) = −ln [Φ(x)], and Φ(x) is the p.d.f. of the error distribution appropriate to the observations. If ρ(x) = x^{2}/2, the method is least squares. If the error distribution is the Cauchy distribution, Φ(x) = {π(1 + x^{2})}^{−1}, then ρ(x) = ln(1 + x^{2}) + ln π, which increases much more slowly than x^{2} as x increases, causing large deviations to have much less influence than they do in least squares.
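The effect of the slowly increasing Cauchy ρ can be seen in a small sketch. The data are made up (a tight cluster plus one wild point), and the grid search is only an illustrative minimizer, not part of any published procedure:

```python
import math

# rho(x) = -ln Phi(x) for two error distributions: the normal gives
# rho = x^2/2 (least squares, up to a constant); the Cauchy gives
# rho = ln(1 + x^2), which grows much more slowly for large deviations.
def rho_ls(x):      return 0.5 * x * x
def rho_cauchy(x):  return math.log(1.0 + x * x)

# Made-up data: four points near zero plus one wild outlier.
y = [0.1, -0.2, 0.0, 0.2, 50.0]

mean = sum(y) / len(y)            # the least-squares location estimate
grid = [-1.0 + 0.001 * k for k in range(2001)]
x_cauchy = min(grid, key=lambda x: sum(rho_cauchy(yi - x) for yi in y))

# The outlier drags the least-squares estimate far from the cluster,
# while the Cauchy maximum-likelihood estimate stays near it.
assert mean > 5.0
assert abs(x_cauchy) < 0.5
```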
Although there is no need for ρ(x) to be a symmetric function of x (the error distribution can be skewed), it may be assumed to have a minimum at x = 0, so that dρ(x)/dx = 0 there. A series expansion about the origin therefore begins with the quadratic term, and the sum to be minimized may be written Σ_{i=1}^{n} ρ(Δ_{i}) = Σ_{i=1}^{n} w(Δ_{i})Δ_{i}^{2}, (8.2.1.5) where w(Δ) = ρ(Δ)/Δ^{2}. This procedure is thus equivalent to a variant of least squares in which the weights are functions of the deviation.
Properly weighted least squares gives the best linear estimate for a very broad range of distributions of random errors in the data and the maximum-likelihood estimate if that error distribution is normal or Gaussian. But the best linear estimator may nevertheless not be a very good one, and the error distribution may not be well known. It is therefore important to address the question of how good an estimation procedure may be when the conditions for which it is designed are not satisfied. Refinement procedures may be classified according to the extent that they possess two properties known as robustness and resistance. A procedure is said to be robust if it works well for a broad range of error distributions and resistant if its results are not strongly affected by fluctuations in any small subset of the data. Because least squares is a linear estimator, the influence of any single data point on the parameter estimates increases without limit as the difference between the observation and the model increases. It therefore works poorly if the actual error distribution contains large deviations with a frequency that substantially exceeds that expected from a normal distribution. Further, it has the undesirable property that it will make the fit of a few wildly discrepant data points better by making the fit of many points a little worse. Least squares is therefore neither robust nor resistant.
Tukey (1974) has listed a number of properties a procedure should have in order to be robust and resistant. Because least squares works well when the error distribution is normal, the procedure should behave like least squares for small deviations whose distribution is similar to the normal distribution. It should de-emphasize large differences between the model and the data, and it should connect these extremes smoothly. A procedure suggested by Tukey was applied to crystal structure refinement by Nicholson, Prince, Buchanan & Tucker (1982). It corresponds to a fitting function ρ(Δ) [equation (8.2.1.5)] of the form ρ(Δ) = (Δ^{2}/2)(1 − Δ^{2} + Δ^{4}/3) for |Δ| ≤ 1, and ρ(Δ) = 1/6 for |Δ| > 1, (8.2.2.1) where Δ_{i} = [y_{i} − M_{i}(x)]/s, and s is a resistant measure of scale.
In order to see what is meant by a resistant measure, consider a large sample of observations, y_{i}, with a normal distribution. The sample mean, ⟨y⟩ = (1/n) Σ_{i=1}^{n} y_{i}, is an unbiased estimate of the population mean. Contamination of the sample by a small number of observations containing large, systematic errors, however, would have a large effect on the estimate. The median value of y_{i} is also an unbiased estimate of the population mean, but it is virtually unaffected by a few contaminating points. Similarly, the sample variance, [1/(n − 1)] Σ_{i=1}^{n} (y_{i} − ⟨y⟩)^{2}, is an unbiased estimate of the population variance, but, again, it is strongly affected by a few discrepant points, whereas (IQR/1.349)^{2}, where IQR is the interquartile range, the difference between the first and third quartile observations, is an estimate of the population variance that is almost unaffected by a small number of discrepant points. (For a normal distribution, IQR = 1.349σ.) The median and the interquartile range are thus resistant quantities that can be used to estimate the mean and variance of a population distribution when the sample may contain points that do not belong to the population. A value of the scale parameter, s, for use in evaluating the quantities in (8.2.2.1), that has proved to be useful is a fixed multiple of the median value of |y_{i} − M_{i}(x)|, the median absolute deviation, or MAD.
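A short sketch, using an artificially contaminated, made-up sample, shows how the median and a MAD-based scale estimate resist contamination that badly distorts the mean (the factor 0.6745 is the standard normal-consistency constant, MAD ≈ 0.6745σ for a normal population):

```python
# Resistant measures of location and scale on a contaminated sample:
# 98 well behaved points near 10 plus two wild ones.
sample = [10.0 + 0.01 * k for k in range(-49, 49)] + [95.0, 103.0]

def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

mean = sum(sample) / len(sample)
med  = median(sample)

# MAD = median absolute deviation from the median; for a normal
# population MAD ~= 0.6745*sigma, so MAD/0.6745 estimates sigma.
mad = median([abs(x - med) for x in sample])
s   = mad / 0.6745

assert abs(med - 10.0) < 0.1   # median barely moved by the contamination
assert mean > 11.0             # mean dragged well away from 10
assert s < 1.0                 # resistant scale stays near the bulk's spread
```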
Implementation of a procedure based on the function given in (8.2.2.1) involves modification of the weights used in each cycle according to w_{i}′ = w_{i}(1 − Δ_{i}^{2})^{2} for |Δ_{i}| ≤ 1, and w_{i}′ = 0 for |Δ_{i}| > 1. Because of this weight modification, the procedure is sometimes referred to as `iteratively reweighted least squares'. It should be recognized, however, that the function that is minimized is more complex than a sum of squares. In a strict application of the Gauss–Newton algorithm (see Section 8.1.3) to the minimization of this function, each term in the summations to form the normal-equations matrix contains a factor ω(Δ_{i}), where ω(Δ) = d^{2}ρ/dΔ^{2} = 1 − 6Δ^{2} + 5Δ^{4}. This factor actually gives some data points a negative effective `weight', because the sum is actually reduced by making the fit worse. The inverse of this normal-equations matrix is not an estimate of the variance–covariance matrix; for that the unmodified weights, equal to 1/σ_{i}^{2}, must be used, but, because the more discrepant points have been deliberately downweighted relative to the ideal weights, the variances are, in general, underestimated. A recommended procedure (Huber, 1973; Nicholson et al., 1982) is to calculate the normal-equations matrix using the unmodified weights, invert that matrix, and premultiply by an estimate of the variance of the residuals (Section 8.4.1) using modified weights and (n − p) degrees of freedom. Huber showed that this estimate is biased low, and suggests multiplication by a number, c, greater than one, given by c = 1 + (p/n)[var(ω)/⟨ω⟩^{2}], where ⟨ω⟩ is the mean value, and var(ω) the variance, of ω(Δ) over the entire data set. The conditions under which this expression is derived are not well satisfied in the crystallographic problem, but, if n/p is large and ⟨ω⟩ is not too much less than one, the value of c will be close to 1. ⟨ω⟩ plays the role of a `variance efficiency factor'. That is, the variances are approximately those that would be achieved with a least-squares fit to a data set with normally distributed errors that contained n⟨ω⟩ data points.
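The reweighting cycle can be sketched as follows. The data, the use of a constant as the model, and the factor of six applied to the MAD as the resistant scale are all illustrative assumptions, not prescriptions from the text; the modifier (1 − Δ^{2})^{2} is the form consistent with ω(Δ) = d^{2}ρ/dΔ^{2} = 1 − 6Δ^{2} + 5Δ^{4}:

```python
def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

# Iteratively reweighted least squares with the modifier
# w_i' = w_i * (1 - Delta_i^2)^2 for |Delta_i| <= 1, else 0,
# fitting the simplest possible model, a constant, to made-up data.
y = [10.2, 9.8, 10.0, 10.1, 9.9, 25.0]   # one gross outlier
w = [1.0] * len(y)                       # ideal weights (sigma_i = 1)

x = sum(y) / len(y)                      # start from the unweighted mean
for _ in range(20):
    resid = [yi - x for yi in y]
    # Resistant scale: a multiple of the MAD (the factor 6 is assumed here).
    s = 6.0 * median([abs(r) for r in resid]) or 1.0
    wmod = [wi * (1 - (r / s) ** 2) ** 2 if abs(r / s) <= 1 else 0.0
            for wi, r in zip(w, resid)]
    x = sum(wi * yi for wi, yi in zip(wmod, y)) / sum(wmod)

assert abs(x - 10.0) < 0.2   # the outlier is effectively rejected
```

After the first cycle the outlier's deviation exceeds the cutoff and its modified weight drops to zero, so the estimate settles on the cluster.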
Robust/resistant methods have been discussed in detail by Huber (1981), Belsley, Kuh & Welsch (1980), and Hoaglin, Mosteller & Tukey (1983). An analysis by Wilson (1976) shows that a fitting procedure gives unbiased estimates if a condition, equation (8.2.2.6), is satisfied, in which sums over the data set of terms involving y_{i}^{o} and y_{i}^{c}, the observed and calculated values of y_{i}, respectively, appear on the two sides. Least squares is the case where all terms on both sides of the equation are equal to zero; the weights are fixed. In maximum-likelihood estimation or robust/resistant estimation, the effective weights are functions of the deviation, causing possible introduction of bias. Equation (8.2.2.6), however, suggests that the estimates will still be unbiased if the sums on both sides are zero, which will be the case if the error distribution and the weight-modification function are both symmetric about Δ = 0.
Note that the fact that two different weighting schemes applied to the same data lead to different values for the estimate does not necessarily imply that either value is biased. As long as the observations represent unbiased estimates of the values of the model functions, any weighting scheme gives unbiased estimates of the model parameters, although some weighting schemes will cause those estimates to be more precise than others will. Bias can be introduced if a procedure systematically causes fluctuations in one direction to be weighted more heavily than fluctuations in the other. For example, in the Rietveld method (Chapter 8.6), the observations are counts of quanta, which are subject to fluctuation according to the Poisson distribution, in which the probability of observing k counts per unit time is given by Φ(k) = λ^{k} exp(−λ)/k!. The mean and the variance of this p.d.f. are both equal to λ, so that the ideal weighting would be w_{i} = 1/λ_{i}. However, λ_{i} is not known a priori, and must be estimated. The usual procedure is to take 1/y_{i} as an estimate of 1/λ_{i}, but this is an unbiased estimate only asymptotically for large k (Box & Tiao, 1973), and, furthermore, it causes observations that have negative, random errors to be weighted more heavily than observations that have positive ones. This correlation can be removed by using, after a preliminary cycle of refinement, 1/M_{i}(x) as an estimate of 1/λ_{i}. This might seem to have the effect of making the weights dependent on the calculated values, so that the right-hand side of (8.2.2.6) is no longer zero, but this applies only if the weights are changed during the refinement. There is thus no conflict with the result in equation (8.1.2.9). In practice, in any case, many other sources of uncertainty are much more important than any possible bias that could be introduced by this effect.
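A small sketch of the bias introduced by weighting with 1/y_{i}: for a constant model, setting the derivative of the weighted sum of squares to zero with w_{i} = 1/y_{i} reduces the estimate of λ to the harmonic mean of the counts, which always falls below the arithmetic mean (the counts are made up):

```python
# Weighted least-squares estimate of a constant Poisson rate lambda.
# With w_i = 1/y_i, the normal equation sum_i w_i (y_i - lam) = 0 gives
# lam = n / sum(1/y_i), the harmonic mean of the counts.
y = [90, 100, 110, 95, 105]   # made-up counts

lam_obs_weights = len(y) / sum(1.0 / yi for yi in y)   # w_i = 1/y_i
lam_mean = sum(y) / len(y)                             # equal (ideal) weights

# Weighting by 1/y_i favours observations with negative random errors,
# so the estimate falls below the plain mean (AM-HM inequality).
assert lam_obs_weights < lam_mean
```

Re-weighting with 1/M_{i}(x) after a preliminary cycle makes the weights the same for all observations of a given calculated count, removing this downward pull.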
Entropy maximization, like least squares, is of interest primarily as a framework within which to find or adjust parameters of a model. Rationalization of the name `entropy maximization' by analogy to thermodynamics is controversial, but there is a formal proof (Shore & Johnson, 1980; Johnson & Shore, 1983) supporting entropy maximization as the unique method of inference that satisfies basic consistency requirements (Livesey & Skilling, 1985). The proof consists of discovering the consequences of four consistency axioms, which may be stated informally as follows:
(1) Uniqueness: if the same problem is solved twice with the same information, the same answer should be found both times.
(2) Invariance: the answer should not depend on the choice of coordinate system.
(3) System independence: it should not matter whether independent information about independent systems is accounted for separately or jointly.
(4) Subset independence: it should not matter whether an independent subset of the states of a system is treated in terms of conditional proportions or in terms of the full set.
The term `entropy' is used in this chapter as a name only, the name for variation functions that include the form −Σ_{i} p_{i} ln p_{i}, where p_{i} may represent probability or, more generally, a positive proportion. Any positive measure, either observed or derived, of the relative apportionment of a characteristic quantity among observations can serve as the proportion.
The method of entropy maximization may be formulated as follows: given a set of n observations, y_{i}, that are measurements of quantities that can be described by model functions, M_{i}(x), where x is a vector of parameters, find the prior, positive proportions, q_{i}, and the values of the parameters for which the positive proportions p_{i} make the sum S = −Σ_{i=1}^{n} p_{i} ln(p_{i}/q_{i}), (8.2.3.1) where p_{i} = M_{i}(x)/Σ_{j=1}^{n} M_{j}(x) and q_{i} = y_{i}/Σ_{j=1}^{n} y_{j}, a maximum. S is called the Shannon–Jaynes entropy. For some applications (Collins, 1982), it is desirable to include in the variation function additional terms or restraints that give S the form S = −Σ_{i=1}^{n} p_{i} ln(p_{i}/q_{i}) + Σ_{i} λ_{i}C_{i}(x), where the λs are undetermined multipliers and the C_{i} are functions expressing the restraints, but we shall discuss here only applications where λ_{i} = 0 for all i, and an unrestrained entropy is maximized. A necessary condition for S to be a maximum is for the gradient to vanish. Using ∂S/∂x_{k} = −Σ_{i=1}^{n} [ln(p_{i}/q_{i}) + 1] ∂p_{i}/∂x_{k}, together with Σ_{i=1}^{n} ∂p_{i}/∂x_{k} = 0, straightforward algebraic manipulation gives equations of the form Σ_{i=1}^{n} ln(p_{i}/q_{i}) ∂p_{i}/∂x_{k} = 0. (8.2.3.5) It should be noted that, although the entropy function should, in principle, have a unique stationary point corresponding to the global maximum, there are occasional circumstances, particularly with restrained problems where the undetermined multipliers are not all zero, where it may be necessary to verify that a stationary solution actually maximizes entropy.
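The central property of the Shannon–Jaynes entropy, that S ≤ 0 with S = 0 only when the proportions match the prior (Gibbs' inequality), can be checked directly. The proportions below are illustrative:

```python
import math

def shannon_jaynes_entropy(p, q):
    """S = -sum_i p_i ln(p_i/q_i) for positive proportions summing to 1."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

q = [0.25, 0.25, 0.25, 0.25]           # prior proportions (made up)
p_match    = [0.25, 0.25, 0.25, 0.25]  # proportions equal to the prior
p_mismatch = [0.40, 0.30, 0.20, 0.10]

# S is never positive, and is zero only when p_i = q_i for all i,
# so maximizing S pulls the model proportions towards the prior.
assert abs(shannon_jaynes_entropy(p_match, q)) < 1e-12
assert shannon_jaynes_entropy(p_mismatch, q) < 0.0
```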
For an example of the application of the maximum-entropy method, consider (Collins, 1984) a collection of diffraction intensities in which various subsets have been measured under different conditions, such as on different films or with different crystals. All systematic corrections have been made, but it is necessary to put the different subsets onto a common scale. Assume that every subset has measurements in common with some other subset, and that no collection of subsets is isolated from the others. Let the measurement of intensity j in subset i be y_{ij}, and let the scale factor that puts intensity I_{j} on the scale of subset i be k_{i}, so that the model functions are M_{ij} = k_{i}I_{j}. Equation (8.2.3.1) becomes a double sum over i and j, in which the term is zero if I_{j} does not appear in subset i. Because the k_{i} and the I_{j} are parameters of the model, equations (8.2.3.5) become the conditions ∂S/∂k_{i} = 0 and ∂S/∂I_{j} = 0. These simplify to a set of coupled equations, (8.2.3.8), in which, apart from a constant, Q, common to all of the equations, ln k_{i} is a weighted mean of ln(y_{ij}/I_{j}) over the intensities appearing in subset i, and ln I_{j} is a weighted mean of ln(y_{ij}/k_{i}) over the subsets in which intensity j appears. Equations (8.2.3.8) may be solved iteratively, starting with suitable initial approximations to the k_{i} and I_{j} and Q = 0.
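A deliberately simplified sketch of this kind of inter-subset scaling: it alternates plain (unweighted) geometric means of y_{ij}/I_{j} and y_{ij}/k_{i}, standing in for the full equations (8.2.3.8), and uses exactly consistent made-up data, so it illustrates only the alternating iteration, not the published method:

```python
import math

# y[i][j] holds the measurement of intensity j in subset i, or None if
# intensity j was not measured in that subset. The two subsets overlap
# on intensities 1 and 2; subset 1 is on a scale 2.1x that of subset 0.
y = [
    [200.0, 100.0, 50.0, None],   # subset 0
    [None, 210.0, 105.0, 42.0],   # subset 1
]
ni, nj = 2, 4
k = [1.0] * ni   # scale factors
I = [1.0] * nj   # intensities on the common scale

def gmean(v):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(x) for x in v) / len(v))

for _ in range(50):
    for j in range(nj):
        I[j] = gmean([y[i][j] / k[i] for i in range(ni) if y[i][j] is not None])
    for i in range(ni):
        k[i] = gmean([y[i][j] / I[j] for j in range(nj) if y[i][j] is not None])

# Only ratios are determined: either set may be multiplied by an
# arbitrary positive constant, so test the ratio of the scale factors.
assert abs(k[1] / k[0] - 2.1) < 0.01
```

The update for ln k_{i} is the mean of ln(y_{ij}/I_{j}) over subset i, and similarly for ln I_{j}; the indeterminacy of the overall scale noted in the text shows up here as the freedom to rescale k and I jointly.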
The standard uncertainties of the scale factors and intensities are not used in the solution of equations (8.2.3.8), and must be computed separately. They may be estimated on a fractional basis from the variances of the estimated population means for a scale factor and for an intensity, respectively. The maximum-entropy scale factors and scaled intensities are relative, and either set may be multiplied by an arbitrary, positive constant without affecting the solution.
For another example, consider the maximum-entropy fit of a linear function to a set of independently distributed variables. Let y_{i} represent an observation drawn from a population with mean a + bx_{i} and finite variance σ_{i}^{2}; we wish to find the maximum-entropy estimates of a and b. Assume that the mismatch between the observation and the model is normally distributed, so that its probability density is the positive proportion p_{i} ∝ exp(−Δ_{i}^{2}/2), where Δ_{i} = (y_{i} − a − bx_{i})/σ_{i}. The prior proportion is given by the uniform distribution, q_{i} = 1/n. Requiring the gradient of S with respect to a and b to vanish, equations (8.2.3.5) become a pair of simultaneous equations, (8.2.3.12), that resemble the normal equations of a weighted least-squares fit for a and b, apart from additional sums over j on their right-hand sides; the factor w_{i} multiplying each deviation may be interpreted as a weight. Equations (8.2.3.12) may be solved iteratively, starting with the approximations that the sums over j on the right-hand side are zero and w_{i} = 1 for all i, that is, using the solutions to the corresponding, unweighted least-squares problem. Resetting a and b after each iteration by only half the indicated amount defeats a tendency towards oscillation. Approximate standard uncertainties for the parameters, a and b, may be computed by conventional means after setting to zero the sums over j on the right-hand side of equations (8.2.3.12). (See, however, a discussion of computing variance–covariance matrices in Section 8.1.2.) Note that w_{i} is small for both small and large values of |Δ_{i}|. Thus, in contrast to the robust/resistant methods (Section 8.2.2), which de-emphasize only the large differences, this method downweights both the small and the large differences and adjusts the parameters on the basis of the moderate-size mismatches between model and data. The procedure used in this two-dimensional, linear model can be extended to linear models, and linear approximations to nonlinear models, in any number of dimensions using methods discussed in Chapter 8.1.
The maximum-entropy method has been described (Jaynes, 1979) as being `maximally noncommittal with respect to all other matters; it is as uniform (by the criterion of the Shannon information measure) as it can be without violating the given constraint[s]'. Least squares, because it gives minimum-variance estimates of the parameters of a model, and therefore of all functions of the model, including the predicted values of any additional data points, might be similarly described as `maximally committal' with regard to the collection of more data. Least squares and maximum entropy can therefore be viewed as the extremes of a range of methods, classified according to the degree of a priori confidence in the correctness of the model, with the robust/resistant methods lying somewhere in between (although generally closer to least squares). Maximum-entropy methods can be used when it is desirable to avoid prejudice in favour of a model because of doubt as to the model's correctness.
References
Belsley, D. A., Kuh, E. & Welsch, R. E. (1980). Regression diagnostics. New York: John Wiley.
Box, G. E. P. & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.
Collins, D. M. (1982). Electron density images from imperfect data by iterative entropy maximization. Nature (London), 298, 49–51.
Collins, D. M. (1984). Scaling by entropy maximization. Acta Cryst. A40, 705–708.
Hoaglin, D. C., Mosteller, F. & Tukey, J. W. (1983). Understanding robust and exploratory data analysis. New York: John Wiley.
Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1, 799–821.
Huber, P. J. (1981). Robust statistics. New York: John Wiley.
Jaynes, E. T. (1979). Where do we stand on maximum entropy? The maximum entropy formalism, edited by R. D. Levine & M. Tribus, pp. 44–49. Cambridge, MA: Massachusetts Institute of Technology.
Johnson, R. W. & Shore, J. E. (1983). Comments on and correction to `Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy'. IEEE Trans. Inf. Theory, IT-29, 942–943.
Livesey, A. K. & Skilling, J. (1985). Maximum entropy theory. Acta Cryst. A41, 113–122.
Nicholson, W. L., Prince, E., Buchanan, J. & Tucker, P. (1982). A robust/resistant technique for crystal structure refinement. Crystallographic statistics: progress and problems, edited by S. Ramaseshan, M. F. Richardson & A. J. C. Wilson, pp. 220–263. Bangalore: Indian Academy of Sciences.
Rietveld, H. M. (1969). A profile refinement method for nuclear and magnetic structures. J. Appl. Cryst. 2, 65–71.
Shore, J. E. & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory, IT-26, 26–37.
Tukey, J. W. (1974). Introduction to today's data analysis. Critical evaluation of chemical and physical structural information, edited by D. R. Lide & M. A. Paul, pp. 3–14. Washington: National Academy of Sciences.
Wilson, A. J. C. (1976). Statistical bias in least-squares refinement. Acta Cryst. A32, 994–996.