Model phases: probabilities, bias and maps

Read, R. J.

doi:10.1107/97809553602060000688

International Tables for Crystallography, Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F. ch. 15.2, pp. 325-331
https://doi.org/10.1107/97809553602060000688

Chapter 15.2. Model phases: probabilities, bias and maps

R. J. Read^a ^*

^a Department of Haematology, University of Cambridge, Wellcome Trust Centre for Molecular Mechanisms in Disease, CIMR, Wellcome Trust/MRC Building, Hills Road, Cambridge CB2 2XY, England
Correspondence e-mail: rjr27@cam.ac.uk

The optimal use of model phase information requires an estimate of its reliability, specifically the probability that various values of the phase angle are true. This chapter covers the importance of phase in model bias; structure-factor probability relationships; figure-of-merit weighting for model phases; map coefficients to reduce model bias; difference-map coefficients; refinement bias; and maximum-likelihood structure refinement.

Keywords: Parseval's theorem; Sim distribution; Wilson distribution; coordinate errors; figure-of-merit weighting for model phases; maximum likelihood; model bias; model phases; phase combination; refinement; structure-factor probability distributions.

15.2.1. Introduction

The intensities of X-ray diffraction spots measured from a crystal give us only the amplitudes of the diffracted waves. To reconstruct a map of the electron density in the crystal, the unmeasured phase information is also required. In fact, the phases are much more important to the appearance of the map than the measured amplitudes. When phases are supplied by an atomic model, therefore, some degree of model bias is inevitable.

The optimal use of model phase information requires an estimate of its reliability, specifically the probability that various values of the phase angle are true. Such a probability distribution can be derived, starting first with the relationship between the structure factor (amplitude and phase) of the model and that of the true crystal structure. The phase probability distribution can then be obtained from this and used, for instance, to provide a figure-of-merit weighting that minimizes the r.m.s. error from the true electron density.

Even with figure-of-merit weighting, model-phased electron density is biased towards the model. The systematic bias component of model-phased map coefficients can be predicted, allowing the derivation of map coefficients that give electron-density maps with reduced model bias. With the help of a few simple assumptions, a correction for bias can also be made when different sources of phase information are combined.

Finally, the refinement of a model against the observed amplitudes allows a certain amount of overfitting of the data, which leads to an extra `refinement bias'. Fortunately, the use of appropriate refinement strategies, including maximum-likelihood targets, can reduce the severity of this problem.

15.2.2. Model bias: importance of phase

Dramatic illustrations of the importance of the phase have been published. For instance, Ramachandran & Srinivasan (1961) calculated an electron-density map using phases from one structure and amplitudes from another. In this map there are peaks at the positions of the atoms in the structure that contributed the phase information, but not in the structure that contributed the amplitudes. Similar calculations with two-dimensional Fourier transforms of photographs (Oppenheim & Lim, 1981; Read, 1997) show that the phases of one completely overwhelm the amplitudes of the other.

These examples, though dramatic, are not completely representative of the normal situation, where the structure contributing the phases is partially or even nearly correct. Nonetheless, model phases always contribute bias, so that the resulting map tends to bear too close a resemblance to the model.

15.2.2.1. Parseval's theorem

The importance of the phase can be understood most easily in terms of Parseval's theorem, a result that is important to the understanding of many aspects of the Fourier transform and its use in crystallography. Parseval's theorem states that the mean-square value of the variable on one side of a Fourier transform is proportional to the mean-square value of the variable on the other side. Since the Fourier transform is additive, Parseval's theorem also applies to sums or differences.

If $[\rho_{1}]$ and $[\rho_{2}]$ are, for instance, the true electron density and the electron density of the model, respectively, Parseval's theorem tells us that the r.m.s. error in the electron density is proportional to the r.m.s. error in the structure factor. (The structure-factor error is a vector error in the complex plane.) $[\eqalignno{\langle \rho^{2}\rangle &= (1/V^{2}) \textstyle\sum\limits_{{\rm all} \ {\bf h}}\displaystyle |{\bf F(h)}|^{2}, &\cr \Big\langle (\rho_{1} - \rho_{2})^{2}\Big\rangle &= (1/V^{2}) \textstyle\sum\limits_{{\rm all} \ {\bf h}}\displaystyle |{\bf F}_{1} {\bf (h)} - {\bf F}_{2} {\bf (h)}|^{2}. &\cr}]$
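
The relationship can be checked numerically on a small one-dimensional example. The sketch below is only an illustration under stated assumptions: the density is sampled on a grid of `n` points, so `n` plays the role of V, and the names `dft`, `rho_true` and `rho_model` are hypothetical.

```python
import cmath
import math
import random


def dft(rho):
    """Structure factors F(h) = sum_j rho_j exp(2*pi*i*h*j/n) on a 1D grid."""
    n = len(rho)
    return [
        sum(r * cmath.exp(2j * math.pi * h * j / n) for j, r in enumerate(rho))
        for h in range(n)
    ]


rng = random.Random(0)
n = 64  # number of grid points; stands in for the cell volume V
rho_true = [rng.random() for _ in range(n)]
rho_model = [r + rng.gauss(0.0, 0.1) for r in rho_true]

f_true = dft(rho_true)
f_model = dft(rho_model)

# Left-hand side: mean-square difference between the two densities.
lhs = sum((a - b) ** 2 for a, b in zip(rho_true, rho_model)) / n
# Right-hand side: (1/V^2) * sum over all h of |F1(h) - F2(h)|^2, with V -> n.
rhs = sum(abs(a - b) ** 2 for a, b in zip(f_true, f_model)) / n ** 2

print(lhs, rhs)  # the two values agree to rounding error
```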

This understanding of error in electron-density maps explains why the phase is much more important than the amplitude in determining the appearance of an electron-density map. As illustrated in Fig. 15.2.2.1, a random choice of phase (from a uniform distribution of all possible phases) will generally give a larger error in the complex plane than a random choice of amplitude [from a Wilson (1949) distribution of amplitudes].

Figure 15.2.2.1

Schematic illustration of the relative errors introduced by a random choice of phase or a random choice of amplitude. The example has been constructed to represent the r.m.s. errors introduced by randomization (computed by averages over the Wilson distribution). Phase randomization will introduce r.m.s. errors of $[(2)^{1/2}]$ ($[\simeq 1.41]$) times the r.m.s. structure-factor amplitude $[|{\bf F}|]$. By comparison, map coefficients weighted by figures of merit of zero would have r.m.s. errors equal to the r.m.s. $[|{\bf F}|]$, so a featureless map would be more accurate than a random-phase map. Amplitude randomization will introduce r.m.s. errors of $[\left[(4 - \pi)/2\right]^{1/2}]$ ($[\simeq 0.66]$) times the r.m.s. $[|{\bf F}|]$, so a map computed with random amplitudes will be closer to the true map than a featureless map.
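
The two ratios quoted in the caption can be reproduced approximately by simulation. This is a minimal sketch, assuming acentric Wilson statistics with $[\Sigma_{N} = 1]$; the function and variable names are hypothetical.

```python
import cmath
import math
import random

rng = random.Random(1)
n_refl = 20000


def wilson_structure_factor():
    # Acentric Wilson statistics: real and imaginary parts are independent
    # Gaussians, each with variance Sigma_N / 2 (Sigma_N = 1 here).
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def rms(values):
    return math.sqrt(sum(abs(v) ** 2 for v in values) / len(values))


true_f = [wilson_structure_factor() for _ in range(n_refl)]

# Keep each amplitude but draw its phase uniformly from 0 to 2*pi.
random_phase = [
    abs(f) * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi)) for f in true_f
]
# Keep each phase but replace the amplitude by an independent Wilson variate.
random_amp = [abs(wilson_structure_factor()) * f / abs(f) for f in true_f]

rms_f = rms(true_f)
phase_ratio = rms([a - b for a, b in zip(random_phase, true_f)]) / rms_f
amp_ratio = rms([a - b for a, b in zip(random_amp, true_f)]) / rms_f

print(phase_ratio, amp_ratio)  # close to sqrt(2) and sqrt((4 - pi)/2)
```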

15.2.3. Structure-factor probability relationships

To use model phase information optimally, the probability distribution for the true phase (or, equivalently, the distribution of the error in the model phase) needs to be known. Such a distribution can be derived by first working out the probability distribution for the true structure factor (or the distribution of the vector difference between the model and true structure factors). Then the phase probability distribution is obtained by fixing the known value of the structure-factor amplitude and renormalizing.

A number of related structure-factor distributions have been derived, differing in the amount of information available about the structure and in the assumed form of errors in the model. These range from the Wilson distribution, which applies when none of the atomic positions is known, to a distribution that applies when there are a variety of sources of error in an atomic model.

15.2.3.1. Wilson and Sim structure-factor distributions in P1

For the Wilson distribution (Wilson, 1949), it is assumed that the atoms in a crystal structure in space group P1 are scattered randomly and independently through the unit cell. In fact, it is sufficient to make the much less restrictive assumption that the atoms are placed randomly with respect to the Bragg planes defined by the Miller indices. The assumption of independence is somewhat more problematic, since there are restrictions on the distances between atoms, large volumes of protein crystals are occupied by disordered solvent and many protein crystals display noncrystallographic symmetry; as discussed elsewhere (Vellieux & Read, 1997), the resulting relationships among structure factors are exploited implicitly in averaging and solvent-flattening procedures. The higher-order relationships among structure factors are used explicitly in direct methods for solving small-molecule structures and are being developed for use in protein structures (Bricogne, 1993). For the purposes of simpler relationships between the calculated and true structure factors for a single hkl, however, the lack of complete independence does not seem to create serious problems.

When atoms are placed randomly relative to the Bragg planes, the contribution of each atom to the structure factor will have a phase varying randomly from 0 to 2π. The overall structure factor can then be considered to be the result of a random walk in the complex plane, which can be treated as an application of the central limit theorem. The structure factor is the sum of the independent atomic scattering contributions, each of which has a probability distribution defined as a circle in the complex plane centred on the origin, with a radius of $[f_{j}]$. The centroid of this atomic distribution is at the origin, and the variance for each of the real and imaginary parts is $[{1 \over 2} f_{j}^{2}]$. The probability distribution of the structure factor that is the sum of these contributions is a two-dimensional Gaussian, the product of the one-dimensional Gaussians for the real and imaginary parts. Because the variances are equal in the real and imaginary directions, it can be simplified, as shown below, and expressed in terms of a single distribution parameter, $[\Sigma_{N}]$. $[\eqalign{{\bf F} &= \textstyle\sum\limits_{j = 1}^{N}\displaystyle f_{j} \exp (2 \pi i {\bf h} \cdot {\bf x}_{j}) = A + iB\hbox{;}\quad \langle A\rangle = \langle B\rangle = 0\hbox{;}\hfill \cr \sigma^{2} (A) &= \sigma^{2} (B) = \textstyle{1 \over 2}\displaystyle \textstyle\sum\limits_{j = 1}^{N}\displaystyle f_{j}^{2} = {\textstyle{1 \over 2}} \Sigma_{N}, \hbox{ so} \hfill\cr p(A) &= [1/(\pi \Sigma_{N})^{1/2}] \exp \left(-A^{2}/\Sigma_{N}\right),\hfill\cr p(B) &= [1/(\pi \Sigma_{N})^{1/2}] \exp \left(-B^{2}/\Sigma_{N}\right), \hfill\cr p({\bf F}) &= p(A, B) = (1/\pi \Sigma_{N}) \exp \left(-|{\bf F}|^{2}/\Sigma_{N}\right). \hfill\cr}]$
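
The random-walk argument can be illustrated with a short simulation. This is a sketch only: the list of scattering factors (60 carbon-like, 20 nitrogen-like and 20 oxygen-like point atoms at zero scattering angle) is a hypothetical example, not data from the chapter.

```python
import cmath
import math
import random

rng = random.Random(2)

# Hypothetical scattering factors f_j for 100 atoms.
scattering_factors = [6.0] * 60 + [7.0] * 20 + [8.0] * 20
sigma_n = sum(f * f for f in scattering_factors)  # Sigma_N = sum of f_j^2


def random_walk_structure_factor():
    # Each atom contributes a vector of length f_j with a uniformly random phase.
    return sum(
        f * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
        for f in scattering_factors
    )


samples = [random_walk_structure_factor() for _ in range(2000)]
var_a = sum(s.real ** 2 for s in samples) / len(samples)
var_b = sum(s.imag ** 2 for s in samples) / len(samples)
mean_intensity = sum(abs(s) ** 2 for s in samples) / len(samples)

# Expect var_a and var_b near Sigma_N / 2, and <|F|^2> near Sigma_N.
print(var_a / sigma_n, var_b / sigma_n, mean_intensity / sigma_n)
```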

The Sim distribution (Sim, 1959), which is relevant when the positions of some of the atoms are known, has a very similar basis, except that the structure factor is now considered to arise from a random walk starting from the position of the structure factor corresponding to the known part, $[{\bf F}_{P}]$. Atoms with known positions do not contribute to the variance, while each of the atoms with unknown positions (the `Q' atoms) contributes $[{1 \over 2} f_{j}^{2}]$ to each of the real and imaginary parts, as in the Wilson distribution. The distribution parameter in this case is referred to as $[\Sigma_{Q}]$. The Sim distribution is a conditional probability distribution, depending on the value of $[{\bf F}_{P}]$, $[p({\bf F}\hbox{;}\ {\bf F}_{P}) = (1/\pi \Sigma_{Q}) \exp \left(-|{\bf F} - {\bf F}_{P}|^{2}/\Sigma_{Q}\right).]$

The Wilson (1949) and Woolfson (1956) distributions for space group $[P\bar{1}]$ are obtained similarly, except that the random walks are along a line and the resulting Gaussian distributions are one-dimensional. (The Woolfson distribution is the centric equivalent of the Sim distribution.) For more complicated space groups, it is reasonable to assume that acentric reflections follow the P1 distribution and that centric reflections follow the $[P\bar{1}]$ distribution. However, for any zone of the reciprocal lattice in which symmetry-related atoms are constrained to scatter in phase, the variances must be multiplied by the expected intensity factor, ɛ, for the zone, because the symmetry-related contributions are no longer independent.

15.2.3.2. Probability distributions for variable coordinate errors

In the Sim distribution, an atom is considered to be either exactly known or completely unknown in its position. These are extreme cases, since there will normally be varying degrees of uncertainty in the positions of various atoms in a model. The treatment can be generalized by allowing a probability distribution of coordinate errors for each atom. In this case, the centroid for the individual atomic contribution to the structure factor will no longer be obtained by multiplying by either zero or one. Averaged over the circle corresponding to possible phase errors, the centroid will generally be reduced in magnitude, as illustrated in Fig. 15.2.3.1. In fact, averaging to obtain the centroid is equivalent to weighting the atomic scattering contribution by the Fourier transform of the coordinate-error probability distribution, $[d_{j}]$. By the convolution theorem, this in turn is equivalent to convoluting the atomic density with the coordinate-error distribution. Intuitively, the atom is smeared over all of its possible positions. The weighting factor, $[d_{j}]$, is thus analogous to the thermal-motion term in the structure-factor expression.

Figure 15.2.3.1

Centroid of the structure-factor contribution from a single atom. The probability of a phase for the contribution is indicated by the thickness of the line.

The variances for the individual atomic contributions will differ in magnitude, but if there are a sufficient number of independent sources of error, we can invoke the central limit theorem again and assume that the probability distribution for the structure factor will be a Gaussian centred on $[\textstyle\sum d_{j}\; f_{j} \exp \left(2 \pi i {\bf h} \cdot {\bf x}_{j}\right)]$. If the coordinate-error distribution is Gaussian, and if each atom in the model is subject to the same errors, the resulting structure-factor probability distribution is the Luzzati (1952) distribution. In this special case, $[d_{j} = D]$ for all atoms, where D is the Fourier transform of a Gaussian and behaves like the application of an overall B factor.
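
For the Gaussian special case, the form of D can be written out explicitly. The parametrization below is an assumption made for illustration and is not spelled out in the text: the coordinate error is taken to be isotropic with variance $\sigma_x^2$ along each Cartesian axis, $\mathbf{s}$ is the reciprocal-space vector of the reflection in Cartesian form, and $s = |\mathbf{s}| = 2\sin\theta/\lambda = 1/d$.

```latex
% Assumptions (not from the text): isotropic Gaussian coordinate error with
% variance \sigma_x^2 per Cartesian axis; s = 2\sin\theta/\lambda = 1/d.
\begin{align*}
p(\Delta\mathbf{x}) &= (2\pi\sigma_x^{2})^{-3/2}
   \exp\left(-|\Delta\mathbf{x}|^{2}/2\sigma_x^{2}\right), \\
D(s) &= \int p(\Delta\mathbf{x})
   \exp\left(2\pi i\,\mathbf{s}\cdot\Delta\mathbf{x}\right)\,
   \mathrm{d}^{3}\Delta\mathbf{x}
   = \exp\left(-2\pi^{2}\sigma_x^{2}s^{2}\right), \\
D(s) &= \exp\left(-B_{\rm err}\,s^{2}/4\right)
   \quad\text{with}\quad B_{\rm err} = 8\pi^{2}\sigma_x^{2}.
\end{align*}
```

The last line makes the analogy with an overall B factor explicit: under these assumptions a coordinate error of $\sigma_x$ per axis has the same effect on the centroid as adding $8\pi^{2}\sigma_x^{2}$ to every atomic B factor.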

15.2.3.3. General treatment of the structure-factor distribution

The Wilson, Sim, Luzzati and variable-error distributions have very similar forms, because they are all Gaussians arising from the application of the central limit theorem. The central limit theorem is valid under many circumstances; even when there are errors in position, scattering factor and B factor, as well as missing atoms, a similar distribution still applies. As long as these sources of error are independent, the true structure factor will have a Gaussian distribution centred on $[D{\bf F}_{C}]$ (Fig. 15.2.3.2), where D now includes effects of all sources of error, as well as compensating for errors in the overall scale and B factor (Read, 1990). $[p({\bf F}\hbox {;}\ {\bf F}_{C}) = (1/\pi \varepsilon \sigma_{\Delta}^{2}) \exp \left(-|{\bf F} - D{\bf F}_{C}|^{2}/\varepsilon \sigma_{\Delta}^{2}\right)]$ in the acentric case, where $[\sigma_{\Delta}^{2} = \Sigma_{N} - D^{2}\Sigma_{P}]$, ɛ is the expected intensity factor and $[\Sigma_{P}]$ is the Wilson distribution parameter for the model.

Figure 15.2.3.2

Schematic illustration of the general structure-factor distribution, relevant in the case of any set of independent random errors in the atomic model.

For centric reflections, the scattering differences are distributed along a line, so the probability distribution is a one-dimensional Gaussian. $[p({\bf F}\hbox{;}\ {\bf F}_{C}) = [1/(2 \pi \varepsilon \sigma_{\Delta}^{2})^{1/2} ]\exp \left(-|{\bf F} - D{\bf F}_{C}|^{2}/2 \varepsilon \sigma_{\Delta}^{2}\right).]$
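
The acentric and centric densities can be written as two small functions. The sketch below uses hypothetical names and illustrative parameter values, and checks numerically that each density integrates to one (over the complex plane and along a line, respectively).

```python
import math


def sigma_delta_sq(sigma_n, sigma_p, d):
    """sigma_Delta^2 = Sigma_N - D^2 * Sigma_P."""
    return sigma_n - d * d * sigma_p


def p_acentric(f_true, f_calc, d, var, eps=1.0):
    """Two-dimensional Gaussian density for a complex true structure factor."""
    v = eps * var
    return math.exp(-abs(f_true - d * f_calc) ** 2 / v) / (math.pi * v)


def p_centric(f_true, f_calc, d, var, eps=1.0):
    """One-dimensional Gaussian density; both structure factors are given as
    signed real values along the common (restricted) phase direction."""
    v = eps * var
    return math.exp(-((f_true - d * f_calc) ** 2) / (2.0 * v)) / math.sqrt(
        2.0 * math.pi * v
    )


# Illustrative values only.
d = 0.9
var = sigma_delta_sq(sigma_n=100.0, sigma_p=80.0, d=d)

step = 0.25
grid = [i * step for i in range(-160, 161)]  # -40 ... 40

acentric_total = step * step * sum(
    p_acentric(complex(a, b), complex(6.0, 3.0), d, var)
    for a in grid
    for b in grid
)
centric_total = step * sum(p_centric(x, 7.0, d, var) for x in grid)

print(acentric_total, centric_total)  # both close to 1
```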

15.2.3.4. Estimating $[\sigma_{A}]$

Srinivasan (1966) showed that the Sim and Luzzati distributions could be combined into a single distribution that had a particularly elegant form when expressed in terms of normalized structure factors, or E values. This functional form still applies to the general distribution that reflects a variety of sources of error; the only difference is the interpretation placed on the parameters (Read, 1990). If F and $[{\bf F}_{C}]$ are replaced by the corresponding E values, a parameter $[\sigma_{A}]$ plays the role of D, and $[\sigma_{\Delta}^{2}]$ reduces to ($[1 - \sigma_{A}^{2}]$). [The parameter $[\sigma_{A}]$ is equivalent to D after correction for model completeness; $[\sigma_{A} = D(\Sigma_{P}/\Sigma_{N})^{1/2}]$.] When the structure factors are normalized, overall scale and B-factor effects are also eliminated. The parameter $[\sigma_{A}]$ that characterizes this probability distribution varies as a function of resolution. It must be deduced from the amplitudes $[|{\bf F}_{O}|]$ and $[|{\bf F}_{C}|]$, since the phase (thus the phase difference) is unknown.

A general approach to estimating parameters for probability distributions is to maximize a likelihood function. The likelihood function is the overall joint probability of making the entire set of observations, which is a function of the desired parameters. The parameters that maximize the probability of making the set of observations are the most consistent with the data. The idea of using maximum likelihood to estimate model phase errors was introduced by Lunin & Urzhumtsev (1984), who gave a treatment that was valid for space group P1. In a more general treatment that applies to higher-symmetry space groups, allowance is made for the statistical effects of crystal symmetry (centric zones and differing expected intensity factors) (Read, 1986).

The $[\sigma_{A}]$ values are estimated by maximizing the joint probability of making the set of observations of $[|{\bf F}_{O}|]$. If the structure factors are all assumed to be independent, the joint probability distribution is the product of all the individual distributions. The assumption of independence is not completely justified in theory, but the results are fairly accurate in practice. $[L = \textstyle\prod\limits_{\bf h}p(|{\bf F}_{O}|\hbox{;} \ |{\bf F}_{C}|).]$ The required probability distribution, $[p(|{\bf F}_{O}|\hbox{;} \ |{\bf F}_{C}|)]$, is derived from $[p({\bf F}\hbox {;}\ {\bf F}_{C})]$ by integrating over all possible phase differences and neglecting the errors in $[|{\bf F}_{O}|]$ as a measure of $[|{\bf F}|]$. The form of this distribution, which is given in other publications (Read, 1986, 1990), differs for centric and acentric reflections. (It is important to note that although the distributions for structure factors are Gaussian, the distributions for amplitudes obtained by integrating out the phase are not.) It is more convenient to deal with a sum than a product, so the log likelihood function is maximized instead. In the program SIGMAA, reciprocal space is divided into spherical shells, and a value of the parameter $[\sigma_{A}]$ is refined for each resolution shell. Details of the algorithm are given elsewhere (Read, 1986).
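
The estimation can be sketched for a single resolution shell of acentric reflections. This is not the SIGMAA algorithm: it is a minimal illustration that assumes normalized amplitudes, uses the acentric amplitude distribution obtained by integrating the Gaussian above over the phase (a Rice distribution, written here in E values), and replaces the refinement by a simple grid search. All names and the simulated data are hypothetical.

```python
import math
import random


def log_i0(x):
    """Logarithm of the modified Bessel function I0(x), for x >= 0."""
    if x < 20.0:
        # Power series: I0(x) = sum_k (x^2/4)^k / (k!)^2.
        q = 0.25 * x * x
        term = total = 1.0
        k = 0
        while term > 1e-17 * total:
            k += 1
            term *= q / (k * k)
            total += term
        return math.log(total)
    # Asymptotic expansion for large x.
    y = 1.0 / (8.0 * x)
    series = 1.0 + y * (1.0 + y * (4.5 + y * 37.5))
    return x - 0.5 * math.log(2.0 * math.pi * x) + math.log(series)


def log_likelihood(sigma_a, e_obs, e_calc):
    """Sum over reflections of log p(|E_O|; |E_C|), acentric case only."""
    v = 1.0 - sigma_a ** 2
    total = 0.0
    for eo, ec in zip(e_obs, e_calc):
        total += (
            math.log(2.0 * eo / v)
            - (eo * eo + sigma_a ** 2 * ec * ec) / v
            + log_i0(2.0 * sigma_a * eo * ec / v)
        )
    return total


def estimate_sigma_a(e_obs, e_calc):
    """Grid search over sigma_A in steps of 0.01 (a stand-in for refinement)."""
    grid = [0.01 * k for k in range(1, 100)]
    return max(grid, key=lambda s: log_likelihood(s, e_obs, e_calc))


def complex_normal(rng, variance):
    s = math.sqrt(0.5 * variance)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


# Simulated shell of 1000 acentric reflections with a known sigma_A.
rng = random.Random(3)
true_sigma_a = 0.7
e_obs, e_calc = [], []
for _ in range(1000):
    ec = complex_normal(rng, 1.0)
    e = true_sigma_a * ec + complex_normal(rng, 1.0 - true_sigma_a ** 2)
    e_calc.append(abs(ec))
    e_obs.append(abs(e))

sigma_a_hat = estimate_sigma_a(e_obs, e_calc)
print(sigma_a_hat)  # should lie near the simulated value of 0.7
```

Only amplitudes enter the likelihood, which mirrors the point made above that the phase difference is unknown. A treatment of real data would also need the centric form and the expected intensity factor.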

The resolution shells must be thick enough to contain several hundred to a thousand reflections each, in order to provide $[\sigma_{A}]$ estimates with a sufficiently small statistical error. A larger number of shells (fewer reflections per shell) can be used for refined structures, since estimates of $[\sigma_{A}]$ become more precise as the true value approaches 1. If there are sufficient reflections per shell, the estimates will vary smoothly with resolution. As discussed below, the smooth variation with resolution can also be exploited through a restraint that allows $[\sigma_{A}]$ values to be estimated from fewer reflections.

15.2.4. Figure-of-merit weighting for model phases

Blow & Crick (1959) and Sim (1959) showed that the electron-density map with the least r.m.s. error is calculated from centroid structure factors. This conclusion follows from Parseval's theorem, because the centroid structure factor (its probability-weighted average value or expected value) minimizes the r.m.s. error of the structure factor. Since the structure-factor distribution $[p({\bf F}\hbox{;}\ {\bf F}_{C})]$ is symmetrical about $[{\bf F}_{C}]$, the expected value of F will have the same phase as $[{\bf F}_{C}]$, but the averaging around the phase circle will reduce its magnitude if there is any uncertainty in the phase value (Fig. 15.2.4.1). We treat the reduction in magnitude by applying a weighting factor called the figure of merit, m, which is equivalent to the expected value of the cosine of the phase error.
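
The figure of merit can be computed directly from this definition. Fixing the amplitude at $[|{\bf F}_{O}|]$ in the acentric Gaussian of Section 15.2.3.3 and renormalizing gives a phase-error distribution proportional to exp(X cos Δα), with X = 2D|F_O||F_C|/(εσ_Δ²); in the centric case only two phase choices remain. The sketch below evaluates the expected cosine by simple quadrature; the function names are hypothetical.

```python
import math


def phase_concentration(f_obs, f_calc, d, var, eps=1.0):
    """X = 2 D |F_O| |F_C| / (eps * sigma_Delta^2)."""
    return 2.0 * d * f_obs * f_calc / (eps * var)


def fom_acentric(x, n_phi=720):
    """Expected cos(phase error) when p(error) is proportional to
    exp(x * cos(error)), evaluated by midpoint quadrature on the circle."""
    num = den = 0.0
    for k in range(n_phi):
        c = math.cos(2.0 * math.pi * (k + 0.5) / n_phi)
        w = math.exp(x * (c - 1.0))  # scaled by exp(-x) to avoid overflow
        num += w * c
        den += w
    return num / den


def fom_centric(x):
    """Two possible phases with weights exp(+x/2) and exp(-x/2)."""
    return math.tanh(0.5 * x)


x = phase_concentration(f_obs=10.0, f_calc=8.0, d=0.9, var=35.2)
print(x, fom_acentric(x), fom_centric(x))
```

For a given X, the centric figure of merit exceeds the acentric one, because the centric phase is restricted to two values.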

Figure 15.2.4.1

Figure-of-merit weighted model-phased structure factor, obtained as the probability-weighted average over all possible phases.

15.2.5. Map coefficients to reduce model bias

15.2.5.1. Model bias in figure-of-merit weighted maps

A figure-of-merit weighted map, calculated with coefficients $[m|{\bf F}_{O}|\exp(i\alpha_{C})]$, has the least r.m.s. error from the true map. According to the normal statistical (minimum variance) criteria, then, it is the best map. However, such a map will suffer from model bias; if its purpose is to allow the detection and repair of errors in the model, this is a serious qualitative defect. Fortunately, it is possible to predict the systematic errors leading to model bias and to make some correction for them.

Main (1979) dealt with this problem in the case of a perfect partial structure. Since the relationships among structure factors are the same in the general case of a partial structure with various errors, once $[D{\bf F}_{C}]$ is substituted for $[{\bf F}_{C}]$, all that is required to apply Main's results more generally is a change of variables (Read, 1986, 1990).

In Main's approach, the cosine law is used to introduce the cosine of the phase error, which is converted into a figure of merit by taking expected values. Some manipulations allow us to solve for the figure-of-merit weighted map coefficient, which is approximated as a linear combination of the true structure factor and the model structure factor (Main, 1979; Read, 1986). Finally, we can solve for an approximation to the true structure factor, giving map coefficients from which the systematic model bias component has been removed. $[\eqalignno{&m|{\bf F}_{O}|\exp(i\alpha_{C}) = {\bf F}/2 + D{\bf F}_{C}/2 + \hbox{ noise terms},\cr &{\bf F} \simeq (2m|{\bf F}_{O}| - D|{\bf F}_{C}|)\exp(i\alpha_{C}).\cr}]$

A similar analysis for centric structure factors shows that there is no systematic model bias in figure-of-merit weighted map coefficients, so no bias correction is needed in the centric case.
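
The two cases can be collected into one small function. This is a sketch with hypothetical names; it takes m and D as given inputs rather than estimating them.

```python
import cmath


def bias_reduced_coefficient(f_obs, f_calc, m, d, centric=False):
    """Map coefficient carrying the model phase alpha_C.

    f_obs  : observed amplitude |F_O|
    f_calc : complex model structure factor F_C
    m, d   : figure of merit and D for this reflection
    """
    alpha_c = cmath.phase(f_calc)
    if centric:
        # No systematic bias in the centric case: plain m|F_O|.
        amplitude = m * f_obs
    else:
        # Acentric case: 2m|F_O| - D|F_C| (may be negative, which simply
        # reverses the direction of the coefficient).
        amplitude = 2.0 * m * f_obs - d * abs(f_calc)
    return cmath.rect(amplitude, alpha_c)


print(bias_reduced_coefficient(100.0, complex(80.0, 0.0), m=0.8, d=0.9))
print(bias_reduced_coefficient(100.0, complex(80.0, 0.0), m=0.8, d=0.9, centric=True))
```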

15.2.5.2. Model bias in combined phase maps

When model phase information is combined with, for instance, multiple isomorphous replacement (MIR) phase information, there will still be model bias in the acentric map coefficients, to the extent that the model influences the final phases. However, it is inappropriate to continue using the same map coefficients to reduce model bias, because some phases could be determined almost completely by the MIR phase information. It makes much more sense to have map coefficients that reduce to the coefficients appropriate for either model or MIR phases, in extreme cases where there is only one source of phase information, and that vary smoothly between those extremes.

Map coefficients that satisfy these criteria (even if they are not rigorously derived) are implemented in the program SIGMAA. The resulting maps are reasonably successful in reducing model bias. Two assumptions are made: (1) the model bias component in the figure-of-merit weighted map coefficient, $[m_{\rm com}|{\bf F}_{O}|\exp(i\alpha_{\rm com})]$, is proportional to the influence that the model phase has had on the combined phase; and (2) the relative influence of a source of phase information can be measured by the information content, H (Guiasu, 1977), of the phase probability distribution. The first assumption corresponds to the idea that the figure-of-merit weighted map coefficient is a linear combination of the MIR and model phase cases. $[\!\matrix{\hbox{MIR:} \hfill& m_{\rm MIR} | {\bf F}_{O} | \exp (i\alpha_{\rm MIR}) \hfill& \simeq {\bf F} \hfill\cr \hbox{Model:} \hfill& m_{C} | {\bf F}_{O} | \exp (i\alpha_{C}) \hfill& \simeq {\bf F}/2 + D{\bf F}_{C}/2 \hfill\cr \hbox{Combined:} \hfill& m_{\rm com} | {\bf F}_{O} | \exp (i\alpha_{\rm com}) \hfill& \simeq [1 - (w/2)] {\bf F} + (w/2) D{\bf F}_{C}, \hfill\cr}]$ where $[w = H_{C} / (H_{C} + H_{\rm MIR})]$ and $[H = \int\limits_{0}^{2\pi} p(\alpha) \ln {p(\alpha) \over p_{0} (\alpha)} \kern2pt\hbox{d} \alpha\hbox{;} \quad\ p_{0} (\alpha) = {1 \over 2\pi}.]$

Solving for an approximation to the true F gives the following expression, which can be seen to reduce appropriately when w is 0 (no model influence) or 1 (no MIR influence): $[{\bf F} \simeq {2m_{\rm com}|{\bf F}_{O}| \exp(i\alpha_{\rm com}) - wD{\bf F}_{C} \over 2 - w}.]$
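
The weighting scheme can be sketched as follows. This is an illustration under stated assumptions, not the SIGMAA code: the phase probability distributions are assumed to be tabulated at equally spaced phases, the integral for H is replaced by a sum, and all names are hypothetical. The function is undefined when both information contents are zero.

```python
import cmath
import math


def information(p):
    """H = integral of p ln(p / p0) over the phase circle, p0 = 1/(2 pi),
    for a distribution tabulated at equally spaced phases (any scale)."""
    step = 2.0 * math.pi / len(p)
    norm = sum(p) * step
    h = 0.0
    for value in p:
        q = value / norm
        if q > 0.0:
            h += q * math.log(2.0 * math.pi * q) * step
    return h


def combined_coefficient(f_obs, m_com, alpha_com, d, f_calc, h_model, h_mir):
    """(2 m_com |F_O| exp(i alpha_com) - w D F_C) / (2 - w)."""
    w = h_model / (h_model + h_mir)
    best = cmath.rect(m_com * f_obs, alpha_com)
    return (2.0 * best - w * d * f_calc) / (2.0 - w)


phases = [2.0 * math.pi * (k + 0.5) / 360 for k in range(360)]
uniform = [1.0 for _ in phases]
peaked = [math.exp(2.0 * math.cos(a)) for a in phases]

h_uniform = information(uniform)   # no phase information
h_peaked = information(peaked)     # unimodal distribution, X = 2

f_model_only = combined_coefficient(100.0, 0.8, 0.0, 0.9, 80 + 0j, h_peaked, 0.0)
f_mir_only = combined_coefficient(100.0, 0.8, 0.0, 0.9, 80 + 0j, 0.0, h_peaked)
print(h_uniform, h_peaked, f_model_only, f_mir_only)
```

With only model information (w = 1) the result is the acentric bias-reduced coefficient of the previous section, and with only experimental information (w = 0) it is the plain figure-of-merit weighted coefficient.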

15.2.6. Estimation of overall coordinate error

In principle, since the distribution of observed and calculated amplitudes is determined largely by the coordinate errors of the model, one can determine whether a particular coordinate-error distribution is consistent with the amplitudes. Unfortunately, it turns out that the coordinate errors cannot be deduced unambiguously, because many distributions of coordinate errors are consistent with a particular distribution of amplitudes (Read, 1990).

If the simplifying assumption is made that all the atoms are subject to a single error distribution, then the parameter D (and thus the related parameter $[\sigma_{A}]$) varies with resolution as the Fourier transform of the error distribution, as discussed above. Two related methods to estimate overall coordinate error are based on the even more specific assumption that the coordinate-error distribution is Gaussian: the Luzzati plot (Luzzati, 1952) and the $[\sigma_{A}]$ plot (Read, 1986). Unfortunately, the central assumption is not justified; atoms that scatter more strongly (heavier atoms or atoms with lower B factors) tend to have smaller coordinate errors than weakly scattering atoms. The proportion of the structure factor contributed by well ordered atoms increases at high resolution, so that the structure factors agree better at high resolution than if there were a single error distribution.
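
Under the Gaussian, single-distribution assumption (with the caveats just given), the logarithm of $[\sigma_{A}]$ is linear in 1/d², and a straight-line fit yields an overall error estimate. The sketch below assumes an isotropic Gaussian error with variance σ_x² per axis, so that ln σ_A = ½ ln(Σ_P/Σ_N) − 2π²σ_x²/d²; the function name and the synthetic data are hypothetical.

```python
import math


def fit_sigma_a_plot(d_spacings, sigma_a_values):
    """Least-squares line through ln(sigma_A) versus 1/d^2.

    Returns (rms_coordinate_error, completeness), where the r.m.s. error is
    the three-dimensional value sqrt(3) * sigma_x and completeness is the
    estimate of Sigma_P / Sigma_N from the intercept.
    """
    xs = [1.0 / (d * d) for d in d_spacings]
    ys = [math.log(s) for s in sigma_a_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    var_x = max(0.0, -slope / (2.0 * math.pi ** 2))
    return math.sqrt(3.0 * var_x), math.exp(2.0 * intercept)


# Synthetic shells generated from the same model: 0.3 A r.m.s. error and
# a model accounting for 90% of the scattering.
shells = [6.0, 4.0, 3.2, 2.8, 2.5, 2.3, 2.1, 2.0]
true_var_x = 0.3 ** 2 / 3.0
sigma_a = [
    math.sqrt(0.9) * math.exp(-2.0 * math.pi ** 2 * true_var_x / (d * d))
    for d in shells
]

rms_error, completeness = fit_sigma_a_plot(shells, sigma_a)
print(rms_error, completeness)
```

The fit recovers the input values only because the synthetic data obey the assumed model exactly; for real structures the estimate is biased for the reasons discussed in this section.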

It is often stated, optimistically, that the Luzzati plot provides an upper bound to the coordinate error, because the observation errors in $[|{\bf F}_{O}|]$ have been ignored. This is misleading, because there are other effects that cause the Luzzati and $[\sigma_{A}]$ plots to give underestimates (Read, 1990). Chief among these are the correlation of errors and scattering power and the overfitting of the amplitudes in structure refinement (discussed below). These estimates of overall coordinate error should not be interpreted too literally; at best, they provide a comparative measure.

15.2.7. Difference-map coefficients

The computer program SIGMAA (Read, 1986) has been developed to implement the results described here. Apart from the two types of map coefficient discussed above, two types of difference-map coefficient can also be produced:

(1) Model-phased difference map: $[(m|{\bf F}_{O}| - D|{\bf F}_{C}|) \exp(i\alpha_{C})]$;
(2) General difference map: $[m_{\rm com} |{\bf F}_{O}| \exp(i\alpha_{\rm com}) - D{\bf F}_{C}]$.

The general difference map, it should be noted, uses a vector difference between the figure-of-merit weighted combined phase coefficient (the `best' estimate of the true structure factor) and the calculated structure factor. When additional phase information is available, it should provide a clearer picture of the errors in the model.
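
The two difference-map coefficients can be sketched as below, with hypothetical function names. When the combined phase and figure of merit coincide with the model values, the general form reduces to the model-phased form.

```python
import cmath


def model_phased_difference(f_obs, f_calc, m, d):
    """(m|F_O| - D|F_C|) exp(i alpha_C): a scalar difference on the model phase."""
    return cmath.rect(m * f_obs - d * abs(f_calc), cmath.phase(f_calc))


def general_difference(f_obs, m_com, alpha_com, f_calc, d):
    """m_com|F_O| exp(i alpha_com) - D F_C: a vector difference."""
    return cmath.rect(m_com * f_obs, alpha_com) - d * f_calc


f_calc = cmath.rect(80.0, 0.6)
same = general_difference(100.0, 0.8, 0.6, f_calc, 0.9)
model = model_phased_difference(100.0, f_calc, 0.8, 0.9)
shifted = general_difference(100.0, 0.8, 1.0, f_calc, 0.9)
print(model, same, shifted)
```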

15.2.8. Refinement bias

The structure-factor probabilities discussed above depend on the atoms having independent errors (or at least a sufficient number of groups of atoms having independent errors). Unfortunately, this assumption breaks down when a structure is refined against the observed diffraction data. Few protein crystals diffract to sufficiently high resolution to provide a large number of observations for every refinable parameter. The refinement problem is, therefore, not sufficiently overdetermined, so it is possible to overfit the data. If there is an error in the model that is outside the range of convergence of the refinement method, it is possible to introduce compensating errors in the rest of the structure to give a better, and misleading, agreement in the amplitudes. As a result, the phase accuracy (hence the weighting factors m and D) is overestimated, and model bias is poorly removed. Because simulated annealing is a more effective minimizer than gradient methods (Brünger et al., 1987), it is also more effective at locating local minima, so structures refined by simulated annealing probably tend to suffer more severely from refinement bias.

There is another interpretation to the problem of refinement bias. As Silva & Rossmann (1985) point out, minimizing the r.m.s. difference between the amplitudes $[|{\bf F}_{O}|]$ and $[|{\bf F}_{C}|]$ is equivalent (by Parseval's theorem) to minimizing the difference between the model electron density and the density corresponding to the map coefficients $[|{\bf F}_{O}|\exp(i\alpha_{C})]$; a lower residual is obtained either by making the model look more like the true structure, or by making the model-phased map look more like the model through the introduction of systematic phase errors.

A number of strategies are available to reduce the degree or impact of refinement bias. The overestimation of phase accuracy has been overcome in a new version of SIGMAA that is under development (Read, unpublished). Cross-validation data, which are normally used to compute $[R_{\rm free}]$ as an unbiased indicator of refinement progress (Brünger, 1992), are used to obtain unbiased $[\sigma_{A}]$ estimates. Because of the high statistical error of $[\sigma_{A}]$ estimates computed from small numbers of reflections, reliable values can only be obtained by exploiting the smoothness of the $[\sigma_{A}]$ curve as a function of resolution. This can be achieved either by fitting a functional form or by adding a penalty to points that deviate from the line connecting their neighbours. Lunin & Skovoroda (1995) have independently proposed the use of cross-validation data for this purpose, but as their algorithm is equivalent to the conventional SIGMAA algorithm, it will suffer severely from statistical error.
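
One possible form of the penalty on points that deviate from the line connecting their neighbours is sketched below. The quadratic form, the weight and the assumption of equally spaced shells are illustrative choices, not a description of the SIGMAA implementation.

```python
def smoothness_penalty(sigma_a_shells, weight=1.0):
    """Sum of squared deviations of each interior shell value from the
    midpoint of its two neighbours (shells assumed equally spaced)."""
    s = sigma_a_shells
    return weight * sum(
        (mid - 0.5 * (left + right)) ** 2
        for left, mid, right in zip(s, s[1:], s[2:])
    )


def restrained_target(neg_log_likelihood, sigma_a_shells, weight=1.0):
    """Quantity to minimize: data term from the cross-validation reflections
    plus the smoothness restraint."""
    return neg_log_likelihood + smoothness_penalty(sigma_a_shells, weight)


smooth = [0.90, 0.85, 0.80, 0.75, 0.70]
noisy = [0.90, 0.70, 0.88, 0.60, 0.70]
print(smoothness_penalty(smooth), smoothness_penalty(noisy))
```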

The degree of refinement bias can be reduced by placing less weight on the agreement of structure-factor amplitudes. Anecdotal evidence suggests that the problem is less serious, in structures refined using X-PLOR (Brünger et al., 1987), when the Engh & Huber (1991) parameter set is used for the energy terms. In this new parameter set, the deviations from standard geometry are much more strictly restrained, so in effect the pressure on the agreement of structure-factor amplitudes is reduced. The use of maximum-likelihood targets for refinement (discussed below) also helps to reduce overfitting.

If errors are suspected in certain parts of the structure, `omit refinement' (in which the questionable parts are omitted from the model) can be a very effective way to eliminate refinement bias in those regions (James et al., 1980; Hodel et al., 1992).

If MIR or MAD (multiwavelength anomalous dispersion) phases are available, combined phase maps tend to suffer less from refinement bias, depending on the extent to which the experimental phases influence the combined phases. Finally, it is always a good idea to refer occasionally to the original MIR or MAD map, which cannot suffer at all from model bias or refinement bias.

15.2.9. Maximum-likelihood structure refinement

In the past, conventional structure refinement was based on a least-squares target, which would be justified if the observed and calculated structure-factor amplitudes were related by a Gaussian probability distribution. Unfortunately, the relationship between $[|{\bf F}_{O}|]$ and $[|{\bf F}_{C}|]$ is not Gaussian, and the distribution for $[|{\bf F}_{O}|]$ is not even centred on $[|{\bf F}_{C}|]$. Because of this, it was suggested (Read, 1990; Bricogne, 1991) that a maximum-likelihood target should be used instead, and that it should be based on probability distributions such as those described above.
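
The point that the distribution of the observed amplitude is not centred on the calculated amplitude can be checked numerically. The sketch below assumes the acentric (Rice-type) distribution for normalized amplitudes, parameterized by $[\sigma_{A}]$, and computes the expected observed amplitude for a given calculated amplitude, together with a minus-log-likelihood of the kind a maximum-likelihood target would minimize. The function names are illustrative only, measurement errors and centric reflections are ignored, and the Bessel-function series is adequate only for moderate arguments.

```python
import math


def bessel_i0(x):
    """Modified Bessel function I0 from its power series (moderate x only)."""
    q = (x / 2.0) ** 2
    term = 1.0
    total = 1.0
    k = 1
    while term > 1e-16 * total:
        term *= q / (k * k)
        total += term
        k += 1
    return total


def acentric_pdf(e_obs, e_calc, sigma_a):
    """Assumed acentric p(|E_O|; |E_C|), a Rice distribution with variance 1 - sigma_A^2."""
    var = 1.0 - sigma_a ** 2
    return (2.0 * e_obs / var
            * math.exp(-(e_obs ** 2 + (sigma_a * e_calc) ** 2) / var)
            * bessel_i0(2.0 * sigma_a * e_obs * e_calc / var))


def integrate(f, lo, hi, steps):
    """Composite trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


def expected_amplitude(e_calc, sigma_a, upper=8.0, steps=4000):
    """Mean of |E_O| under the assumed distribution, by numerical integration."""
    return integrate(lambda e: e * acentric_pdf(e, e_calc, sigma_a), 0.0, upper, steps)


def neg_log_likelihood(e_obs_list, e_calc_list, sigma_a):
    """Minus log-likelihood summed over reflections (amplitudes must be positive)."""
    return -sum(math.log(acentric_pdf(eo, ec, sigma_a))
                for eo, ec in zip(e_obs_list, e_calc_list))


if __name__ == "__main__":
    e_calc, sigma_a = 1.0, 0.5
    # The mean lies between sigma_A * |E_C| and |E_C|, equal to neither.
    print(expected_amplitude(e_calc, sigma_a))
```

A least-squares target implicitly treats the calculated amplitude as the centre of a Gaussian; under the distribution assumed here, the most probable and mean observed amplitudes both depend on $[\sigma_{A}]$, which is why the two targets drive the model differently when the model is poor.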

Three implementations of maximum-likelihood structure refinement have now been reported (Pannu & Read, 1996; Murshudov et al., 1997; Bricogne & Irwin, 1996). As expected, there is a decrease in refinement bias, as the calculated structure-factor amplitudes will not be forced to be equal to the observed amplitudes. Maximum-likelihood targets have been shown to work much better than least-squares targets, particularly when the starting models are poor.

Prior phase information can also be incorporated into a maximum-likelihood target (Pannu et al., 1998). Tests show that even weak phase information can have a dramatic effect on the success of refinement, and that the amount of overfitting is even further reduced (Pannu et al., 1998).

Acknowledgements

This chapter is a revised version of a contribution to Methods in Enzymology (Read, 1997).

References

Blow, D. M. & Crick, F. H. C. (1959). The treatment of errors in the isomorphous replacement method. Acta Cryst. 12, 794–802.

Bricogne, G. (1991). A multisolution method of phase determination by combined maximization of entropy and likelihood. III. Extension to powder diffraction data. Acta Cryst. A47, 803–829.

Bricogne, G. (1993). Direct phase determination by entropy maximization and likelihood ranking: status report and perspectives. Acta Cryst. D49, 37–60.

Bricogne, G. & Irwin, J. (1996). In Proceedings of the CCP4 study weekend. Macromolecular refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.

Brünger, A. T. (1992). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–474.

Brünger, A. T., Kuriyan, J. & Karplus, M. (1987). Crystallographic R factor refinement by molecular dynamics. Science, 235, 458–460.

Engh, R. A. & Huber, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Cryst. A47, 392–400.

Guiasu, S. (1977). Information theory with applications. London: McGraw-Hill.

Hodel, A., Kim, S.-H. & Brünger, A. T. (1992). Model bias in macromolecular crystal structures. Acta Cryst. A48, 851–858.

James, M. N. G., Sielecki, A. R., Brayer, G. D., Delbaere, L. T. J. & Bauer, C.-A. (1980). Structures of product and inhibitor complexes of Streptomyces griseus protease A at 1.8 Å resolution – a model for serine protease catalysis. J. Mol. Biol. 144, 43–88.

Lunin, V. Yu. & Skovoroda, T. P. (1995). R-free likelihood-based estimates of errors for phases calculated from atomic models. Acta Cryst. A51, 880–887.

Lunin, V. Yu. & Urzhumtsev, A. G. (1984). Improvement of protein phases by coarse model modification. Acta Cryst. A40, 269–277.

Luzzati, V. (1952). Traitement statistique des erreurs dans la determination des structures cristallines. Acta Cryst. 5, 802–810.

Main, P. (1979). A theoretical comparison of the β, γ′ and 2F_o − F_c syntheses. Acta Cryst. A35, 779–785.

Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Refinement of macromolecular structures by the maximum-likelihood method. Acta Cryst. D53, 240–255.

Oppenheim, A. V. & Lim, J. S. (1981). The importance of phase in signals. Proc. IEEE, 69, 529–541.

Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Incorporation of prior phase information strengthens maximum-likelihood structure refinement. Acta Cryst. D54, 1285–1294.

Pannu, N. S. & Read, R. J. (1996). Improved structure refinement through maximum likelihood. Acta Cryst. A52, 659–668.

Ramachandran, G. N. & Srinivasan, R. (1961). An apparent paradox in crystal structure analysis. Nature (London), 190, 159–161.

Read, R. J. (1986). Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149.

Read, R. J. (1990). Structure-factor probabilities for related structures. Acta Cryst. A46, 900–912.

Read, R. J. (1997). Model phases: probabilities and bias. Methods Enzymol. 277, 110–128.

Silva, A. M. & Rossmann, M. G. (1985). The refinement of southern bean mosaic virus in reciprocal space. Acta Cryst. B41, 147–157.

Sim, G. A. (1959). The distribution of phase angles for structures containing heavy atoms. II. A modification of the normal heavy-atom method for non-centrosymmetrical structures. Acta Cryst. 12, 813–815.

Srinivasan, R. (1966). Weighting functions for use in the early stages of structure analysis when a part of the structure is known. Acta Cryst. 20, 143–144.

Vellieux, F. M. D. & Read, R. J. (1997). Non-crystallographic symmetry averaging in phase refinement and extension. Methods Enzymol. 277, 18–53.

Wilson, A. J. C. (1949). The probability distribution of X-ray intensities. Acta Cryst. 2, 318–321.

Woolfson, M. M. (1956). An improvement of the `heavy-atom' method of solving crystal structures. Acta Cryst. 9, 804–810.
