Structure-factor probability relationships

Read, R. J.

doi:10.1107/97809553602060000688

International Tables for Crystallography, Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold


International Tables for Crystallography (2006). Vol. F. ch. 15.2, pp. 325-327

Section 15.2.3. Structure-factor probability relationships

R. J. Read^a ^*

^a Department of Haematology, University of Cambridge, Wellcome Trust Centre for Molecular Mechanisms in Disease, CIMR, Wellcome Trust/MRC Building, Hills Road, Cambridge CB2 2XY, England
Correspondence e-mail: rjr27@cam.ac.uk

15.2.3. Structure-factor probability relationships


To use model phase information optimally, the probability distribution for the true phase (or, equivalently, the distribution of the error in the model phase) needs to be known. Such a distribution can be derived by first working out the probability distribution for the true structure factor (or the distribution of the vector difference between the model and true structure factors). Then the phase probability distribution is obtained by fixing the known value of the structure-factor amplitude and renormalizing.

A number of related structure-factor distributions have been derived, differing in the amount of information available about the structure and in the assumed form of errors in the model. These range from the Wilson distribution, which applies when none of the atomic positions is known, to a distribution that applies when there are a variety of sources of error in an atomic model.

15.2.3.1. Wilson and Sim structure-factor distributions in P1


For the Wilson distribution (Wilson, 1949), it is assumed that the atoms in a crystal structure in space group P1 are scattered randomly and independently through the unit cell. In fact, it is sufficient to make the much less restrictive assumption that the atoms are placed randomly with respect to the Bragg planes defined by the Miller indices. The assumption of independence is somewhat more problematic, since there are restrictions on the distances between atoms, large volumes of protein crystals are occupied by disordered solvent and many protein crystals display noncrystallographic symmetry; as discussed elsewhere (Vellieux & Read, 1997), the resulting relationships among structure factors are exploited implicitly in averaging and solvent-flattening procedures. The higher-order relationships among structure factors are used explicitly in direct methods for solving small-molecule structures and are being developed for use in protein structures (Bricogne, 1993). For the purposes of simpler relationships between the calculated and true structure factors for a single hkl, however, the lack of complete independence does not seem to create serious problems.

When atoms are placed randomly relative to the Bragg planes, the contribution of each atom to the structure factor will have a phase varying randomly from 0 to 2π. The overall structure factor can then be considered to be the result of a random walk in the complex plane, which can be treated as an application of the central limit theorem. The structure factor is the sum of the independent atomic scattering contributions, each of which has a probability distribution defined as a circle in the complex plane centred on the origin, with a radius of $[f_{j}]$. The centroid of this atomic distribution is at the origin, and the variance for each of the real and imaginary parts is $[{1 \over 2} f_{j}^{2}]$. The probability distribution of the structure factor that is the sum of these contributions is a two-dimensional Gaussian, the product of the one-dimensional Gaussians for the real and imaginary parts. Because the variances are equal in the real and imaginary directions, it can be simplified, as shown below, and expressed in terms of a single distribution parameter, $[\Sigma_{N}]$. $[\eqalign{{\bf F} &= \textstyle\sum\limits_{j = 1}^{N}\displaystyle f_{j} \exp (2 \pi i {\bf h} \cdot {\bf x}_{j}) = A + iB\hbox{;}\quad \langle A\rangle = \langle B\rangle = 0\hbox{;}\hfill \cr \sigma^{2} (A) &= \sigma^{2} (B) = \textstyle{1 \over 2}\displaystyle \textstyle\sum\limits_{j = 1}^{N}\displaystyle f_{j}^{2} = {\textstyle{1 \over 2}} \Sigma_{N}, \hbox{so} \hfill\cr p(A) &= [1/(\pi \Sigma_{N})^{1/2}] \exp \left(-A^{2}/\Sigma_{N}\right),\hfill\cr p(B) &= [1/(\pi \Sigma_{N})^{1/2}] \exp \left(-B^{2}/\Sigma_{N}\right), \hfill\cr p({\bf F}) &= p(A, B) = (1/\pi \Sigma_{N}) \exp \left(-|{\bf F}|^{2}/\Sigma_{N}\right). \hfill\cr}]$
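The random-walk argument can be checked numerically. The following is a minimal sketch, not part of the original treatment: the atom list `f_atoms` (constant scattering factors for hypothetical C-, N- and O-like atoms) and the number of trials are assumptions chosen only for illustration. It draws a uniformly random phase for each atom and compares the sample variances of A and B with the predicted value of one half of Σ_N.

```python
import cmath
import math
import random


def random_walk_structure_factor(scattering_factors, rng):
    """Sum atomic contributions f_j * exp(i * phase_j) with uniformly random phases."""
    return sum(f * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
               for f in scattering_factors)


rng = random.Random(0)                            # fixed seed for reproducibility
f_atoms = [6.0] * 60 + [7.0] * 20 + [8.0] * 20    # hypothetical atoms (assumption)
sigma_n = sum(f * f for f in f_atoms)             # Sigma_N = sum of f_j^2

trials = [random_walk_structure_factor(f_atoms, rng) for _ in range(5000)]

# <A> = <B> = 0, so the mean squares estimate the variances directly.
var_a = sum(F.real ** 2 for F in trials) / len(trials)
var_b = sum(F.imag ** 2 for F in trials) / len(trials)
mean_intensity = sum(abs(F) ** 2 for F in trials) / len(trials)

print("Sigma_N / 2     :", sigma_n / 2.0)
print("sample var(A)   :", var_a)
print("sample var(B)   :", var_b)
print("sample <|F|^2>  :", mean_intensity, "(expected Sigma_N =", sigma_n, ")")
```

With a few thousand trials the sample variances should agree with Σ_N/2 to within a few per cent; the agreement is statistical, so the exact figures depend on the seed.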

The Sim distribution (Sim, 1959), which is relevant when the positions of some of the atoms are known, has a very similar basis, except that the structure factor is now considered to arise from a random walk starting from the position of the structure factor corresponding to the known part, $[{\bf F}_{P}]$. Atoms with known positions do not contribute to the variance, while each of the atoms with unknown positions (the 'Q' atoms) contributes $[{1 \over 2} f_{j}^{2}]$ to each of the real and imaginary parts, as in the Wilson distribution. The distribution parameter in this case is referred to as $[\Sigma_{Q}]$. The Sim distribution is a conditional probability distribution, depending on the value of $[{\bf F}_{P}]$: $[p({\bf F}\hbox{;}\ {\bf F}_{P}) = (1/\pi \Sigma_{Q}) \exp \left(-|{\bf F} - {\bf F}_{P}|^{2}/\Sigma_{Q}\right).]$
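The same kind of simulation illustrates the Sim distribution. In this sketch the split into 30 known and 70 unknown atoms, their scattering factors and the function name `sim_pdf` are all hypothetical. The known atoms are given fixed phases, the Q atoms random ones, and the sample is compared with a Gaussian centred on the known part with variance Σ_Q/2 in each direction.

```python
import cmath
import math
import random


def sim_pdf(F, F_P, sigma_q):
    """Sim distribution: two-dimensional Gaussian centred on the known part F_P."""
    return math.exp(-abs(F - F_P) ** 2 / sigma_q) / (math.pi * sigma_q)


def random_phase_sum(scattering_factors, rng):
    """Contribution of atoms whose positions are unknown (random phases)."""
    return sum(f * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
               for f in scattering_factors)


rng = random.Random(1)
f_known = [8.0] * 30      # hypothetical atoms with known positions
f_unknown = [6.0] * 70    # hypothetical 'Q' atoms with unknown positions

# The known part is computed once: its phases are fixed by the known coordinates.
F_P = random_phase_sum(f_known, rng)
sigma_q = sum(f * f for f in f_unknown)   # only the Q atoms contribute variance

samples = [F_P + random_phase_sum(f_unknown, rng) for _ in range(4000)]
mean_F = sum(samples) / len(samples)
var_real = sum((F.real - F_P.real) ** 2 for F in samples) / len(samples)
var_imag = sum((F.imag - F_P.imag) ** 2 for F in samples) / len(samples)

print("F_P             :", F_P)
print("sample mean of F:", mean_F)
print("Sigma_Q / 2     :", sigma_q / 2.0)
print("sample variances:", var_real, var_imag)
print("peak density    :", sim_pdf(F_P, F_P, sigma_q))
```

The sample mean should lie close to the known part and both variances close to Σ_Q/2, which is the behaviour the conditional Gaussian describes.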

The Wilson (1949) and Woolfson (1956) distributions for space group $[P\bar{1}]$ are obtained similarly, except that the random walks are along a line and the resulting Gaussian distributions are one-dimensional. (The Woolfson distribution is the centric equivalent of the Sim distribution.) For more complicated space groups, it is reasonable to assume that acentric reflections follow the P1 distribution and that centric reflections follow the $[P\bar{1}]$ distribution. However, for any zone of the reciprocal lattice in which symmetry-related atoms are constrained to scatter in phase, the variances must be multiplied by the expected intensity factor, ɛ, for the zone, because the symmetry-related contributions are no longer independent.

15.2.3.2. Probability distributions for variable coordinate errors


In the Sim distribution, an atom is considered to be either exactly known or completely unknown in its position. These are extreme cases, since there will normally be varying degrees of uncertainty in the positions of various atoms in a model. The treatment can be generalized by allowing a probability distribution of coordinate errors for each atom. In this case, the centroid for the individual atomic contribution to the structure factor will no longer be obtained by multiplying by either zero or one. Averaged over the circle corresponding to possible phase errors, the centroid will generally be reduced in magnitude, as illustrated in Fig. 15.2.3.1. In fact, averaging to obtain the centroid is equivalent to weighting the atomic scattering contribution by the Fourier transform of the coordinate-error probability distribution, $[d_{j}]$ . By the convolution theorem, this in turn is equivalent to convoluting the atomic density with the coordinate-error distribution. Intuitively, the atom is smeared over all of its possible positions. The weighting factor, $[d_{j}]$ , is thus analogous to the thermal-motion term in the structure-factor expression.

Figure 15.2.3.1. Centroid of the structure-factor contribution from a single atom. The probability of a phase for the contribution is indicated by the thickness of the line.

The variances for the individual atomic contributions will differ in magnitude, but if there are a sufficient number of independent sources of error, we can invoke the central limit theorem again and assume that the probability distribution for the structure factor will be a Gaussian centred on $[\textstyle\sum d_{j}\; f_{j} \exp \left(2 \pi i {\bf h} \cdot {\bf x}_{j}\right)]$. If the coordinate-error distribution is Gaussian, and if each atom in the model is subject to the same errors, the resulting structure-factor probability distribution is the Luzzati (1952) distribution. In this special case, $[d_{j} = D]$ for all atoms, where D is the Fourier transform of a Gaussian and behaves like the application of an overall B factor.
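The weighting of an atomic contribution by the Fourier transform of its coordinate-error distribution can be illustrated for the Gaussian case. This sketch makes its own assumptions: the error is described by a standard deviation `sigma_x` of the displacement component along the scattering vector, `s` is the reciprocal resolution 1/d, and the values 0.3 Å and 3 Å are arbitrary. Under those assumptions the weight is exp(-2π²σ_x²s²), which has the form of a B-factor term with B = 8π²σ_x². The code compares this with the centroid obtained by averaging the phase factor over sampled errors.

```python
import cmath
import math
import random


def gaussian_error_weight(sigma_x, s):
    """Fourier transform of a Gaussian coordinate-error distribution.

    sigma_x: standard deviation of the error along the scattering vector (angstrom)
    s:       reciprocal resolution 1/d (inverse angstrom)
    """
    return math.exp(-2.0 * math.pi ** 2 * sigma_x ** 2 * s ** 2)


rng = random.Random(2)
sigma_x = 0.3        # assumed coordinate error, angstrom
s = 1.0 / 3.0        # assumed resolution of 3 angstrom

# Centroid of exp(2*pi*i*s*delta) averaged over Gaussian displacements delta.
n_samples = 20000
centroid = sum(cmath.exp(2j * math.pi * s * rng.gauss(0.0, sigma_x))
               for _ in range(n_samples)) / n_samples

d_analytic = gaussian_error_weight(sigma_x, s)
b_equivalent = 8.0 * math.pi ** 2 * sigma_x ** 2      # equivalent B factor
d_from_b = math.exp(-b_equivalent * s ** 2 / 4.0)     # same weight as a B-factor term

print("sampled centroid :", centroid)
print("analytic weight  :", d_analytic)
print("B-factor form    :", d_from_b)
```

The sampled centroid is real to within sampling noise and shorter than the unit phase factor, which is the reduction in magnitude described for the centroid of a single atomic contribution.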

15.2.3.3. General treatment of the structure-factor distribution


The Wilson, Sim, Luzzati and variable-error distributions have very similar forms, because they are all Gaussians arising from the application of the central limit theorem. The central limit theorem is valid under many circumstances; even when there are errors in position, scattering factor and B factor, as well as missing atoms, a similar distribution still applies. As long as these sources of error are independent, the true structure factor will have a Gaussian distribution centred on $[D{\bf F}_{C}]$ (Fig. 15.2.3.2), where D now includes effects of all sources of error, as well as compensating for errors in the overall scale and B factor (Read, 1990). In the acentric case, $[p({\bf F}\hbox {;}\ {\bf F}_{C}) = (1/\pi \varepsilon \sigma_{\Delta}^{2}) \exp \left(-|{\bf F} - D{\bf F}_{C}|^{2}/\varepsilon \sigma_{\Delta}^{2}\right),]$ where $[\sigma_{\Delta}^{2} = \Sigma_{N} - D^{2}\Sigma_{P}]$, ɛ is the expected intensity factor and $[\Sigma_{P}]$ is the Wilson distribution parameter for the model.
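The step from a structure-factor distribution to a phase distribution, fixing the known amplitude and renormalizing, can be carried out numerically. The sketch below is illustrative only: the function names and the values chosen for the calculated structure factor, D, the variance parameter and the observed amplitude are assumptions. It evaluates the acentric Gaussian on the circle of radius equal to the observed amplitude and normalizes the result over phase.

```python
import cmath
import math


def acentric_pdf(F, F_C, D, sigma_delta_sq, eps=1.0):
    """Two-dimensional Gaussian for the true structure factor, centred on D * F_C."""
    var = eps * sigma_delta_sq
    return math.exp(-abs(F - D * F_C) ** 2 / var) / (math.pi * var)


def phase_distribution(amplitude, F_C, D, sigma_delta_sq, eps=1.0, n_steps=360):
    """Sample the Gaussian on the circle |F| = amplitude and renormalize over phase."""
    step = 2.0 * math.pi / n_steps
    phases = [k * step for k in range(n_steps)]
    density = [acentric_pdf(cmath.rect(amplitude, phi), F_C, D, sigma_delta_sq, eps)
               for phi in phases]
    norm = sum(density) * step            # numerical integral over the circle
    return phases, [d / norm for d in density], step


# Hypothetical values: |F_C| = 100 with phase 60 degrees, observed amplitude 90.
F_C = cmath.rect(100.0, math.radians(60.0))
phases, p_phase, step = phase_distribution(90.0, F_C, D=0.8, sigma_delta_sq=3000.0)

best_index = max(range(len(phases)), key=p_phase.__getitem__)
best_phase = phases[best_index]
# Mean cosine of the difference between the true and calculated phases.
mean_cos = sum(p * math.cos(phi - cmath.phase(F_C))
               for phi, p in zip(phases, p_phase)) * step

print("most probable phase (deg):", math.degrees(best_phase))
print("mean cosine of phase error:", mean_cos)
```

The most probable phase coincides with the calculated phase, and the distribution sharpens as the variance parameter decreases or as the amplitudes grow.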

Figure 15.2.3.2. Schematic illustration of the general structure-factor distribution, relevant in the case of any set of independent random errors in the atomic model.

For centric reflections, the scattering differences are distributed along a line, so the probability distribution is a one-dimensional Gaussian. $[p({\bf F}\hbox{;}\ {\bf F}_{C}) = [1/(2 \pi \varepsilon \sigma_{\Delta}^{2})^{1/2} ]\exp \left(-|{\bf F} - D{\bf F}_{C}|^{2}/2 \varepsilon \sigma_{\Delta}^{2}\right).]$
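The acentric and centric forms differ by factors of two in both the exponent and the normalization, which is easy to get wrong when coding them. As a hedged check, the sketch below implements both (treating a centric structure factor as a signed real number along its line, an assumption made for simplicity) and integrates them numerically. The parameter values are arbitrary. Each density should integrate to one, and each should give a mean-square deviation from the centre equal to the expected intensity factor times the variance parameter.

```python
import math


def acentric_pdf(F, F_C, D, sigma_delta_sq, eps=1.0):
    """Acentric case: two-dimensional Gaussian in the complex plane."""
    var = eps * sigma_delta_sq
    return math.exp(-abs(F - D * F_C) ** 2 / var) / (math.pi * var)


def centric_pdf(F, F_C, D, sigma_delta_sq, eps=1.0):
    """Centric case: one-dimensional Gaussian; F and F_C are signed real values."""
    var = eps * sigma_delta_sq
    return math.exp(-(F - D * F_C) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


D, sigma_delta_sq, eps = 0.8, 2000.0, 2.0    # arbitrary illustrative values
var = eps * sigma_delta_sq
width = math.sqrt(var)

# Centric: integrate along the line over +/- 8 standard deviations.
F_C_centric = 50.0
h = width / 50.0
centric_total = 0.0
centric_msd = 0.0
for k in range(-400, 401):
    dx = k * h
    w = centric_pdf(D * F_C_centric + dx, F_C_centric, D, sigma_delta_sq, eps) * h
    centric_total += w
    centric_msd += dx * dx * w

# Acentric: integrate over a square grid in the complex plane.
F_C_acentric = complex(30.0, 40.0)
g = width / 20.0
offsets = [k * g for k in range(-120, 121)]
acentric_total = 0.0
acentric_msd = 0.0
for dx in offsets:
    for dy in offsets:
        F = D * F_C_acentric + complex(dx, dy)
        w = acentric_pdf(F, F_C_acentric, D, sigma_delta_sq, eps) * g * g
        acentric_total += w
        acentric_msd += (dx * dx + dy * dy) * w

print("centric  integral, msd / var:", centric_total, centric_msd / var)
print("acentric integral, msd / var:", acentric_total, acentric_msd / var)
```

Both mean-square deviations equal the same quantity: in the acentric case it is shared between the real and imaginary directions, in the centric case it lies entirely along the line.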

15.2.3.4. Estimating $[\sigma_{A}]$


Srinivasan (1966) showed that the Sim and Luzzati distributions could be combined into a single distribution that had a particularly elegant form when expressed in terms of normalized structure factors, or E values. This functional form still applies to the general distribution that reflects a variety of sources of error; the only difference is the interpretation placed on the parameters (Read, 1990). If F and $[{\bf F}_{C}]$ are replaced by the corresponding E values, a parameter $[\sigma_{A}]$ plays the role of D, and $[\sigma_{\Delta}^{2}]$ reduces to ($[1 - \sigma_{A}^{2}]$). [The parameter $[\sigma_{A}]$ is equivalent to D after correction for model completeness; $[\sigma_{A} = D(\Sigma_{P}/\Sigma_{N})^{1/2}.]$] When the structure factors are normalized, overall scale and B-factor effects are also eliminated. The parameter $[\sigma_{A}]$ that characterizes this probability distribution varies as a function of resolution. It must be deduced from the amplitudes $[|{\bf F}_{O}|]$ and $[|{\bf F}_{C}|]$, since the phase (thus the phase difference) is unknown.
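The relationship between the two parametrizations is simple arithmetic, sketched here with hypothetical values of D and of the two Wilson distribution parameters (the function names are also illustrative). Dividing the variance parameter of the unnormalized distribution by Σ_N should reproduce the normalized variance, one minus the square of σ_A.

```python
import math


def sigma_a_from_d(D, sigma_p, sigma_n):
    """sigma_A = D * (Sigma_P / Sigma_N)^(1/2)."""
    return D * math.sqrt(sigma_p / sigma_n)


def normalized_variance(D, sigma_p, sigma_n):
    """sigma_Delta^2 = Sigma_N - D^2 * Sigma_P, expressed as a fraction of Sigma_N."""
    sigma_delta_sq = sigma_n - D ** 2 * sigma_p
    return sigma_delta_sq / sigma_n


# Hypothetical shell: the model accounts for three quarters of the scattering.
D, sigma_p, sigma_n = 0.8, 3000.0, 4000.0

sigma_a = sigma_a_from_d(D, sigma_p, sigma_n)
print("sigma_A                 :", sigma_a)
print("1 - sigma_A^2           :", 1.0 - sigma_a ** 2)
print("sigma_Delta^2 / Sigma_N :", normalized_variance(D, sigma_p, sigma_n))
```

Because Σ_P cannot exceed Σ_N for an incomplete model, σ_A is smaller than D whenever atoms are missing, which is the sense in which it is corrected for model completeness.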

A general approach to estimating parameters for probability distributions is to maximize a likelihood function. The likelihood function is the overall joint probability of making the entire set of observations, which is a function of the desired parameters. The parameters that maximize the probability of making the set of observations are the most consistent with the data. The idea of using maximum likelihood to estimate model phase errors was introduced by Lunin & Urzhumtsev (1984), who gave a treatment that was valid for space group P1. In a more general treatment that applies to higher-symmetry space groups, allowance is made for the statistical effects of crystal symmetry (centric zones and differing expected intensity factors) (Read, 1986).

The $[\sigma_{A}]$ values are estimated by maximizing the joint probability of making the set of observations of $[|{\bf F}_{O}|]$. If the structure factors are all assumed to be independent, the joint probability distribution is the product of all the individual distributions. The assumption of independence is not completely justified in theory, but the results are fairly accurate in practice. $[L = \textstyle\prod\limits_{\bf h}p(|{\bf F}_{O}|\hbox{;} \ |{\bf F}_{C}|).]$ The required probability distribution, $[p(|{\bf F}_{O}|\hbox{;} \ |{\bf F}_{C}|)]$, is derived from $[p({\bf F}\hbox {;}\ {\bf F}_{C})]$ by integrating over all possible phase differences and neglecting the errors in $[|{\bf F}_{O}|]$ as a measure of $[|{\bf F}|]$. The form of this distribution, which is given in other publications (Read, 1986, 1990), differs for centric and acentric reflections. (It is important to note that although the distributions for structure factors are Gaussian, the distributions for amplitudes obtained by integrating out the phase are not.) It is more convenient to deal with a sum than a product, so the log likelihood function is maximized instead. In the program SIGMAA, reciprocal space is divided into spherical shells, and a value of the parameter $[\sigma_{A}]$ is refined for each resolution shell. Details of the algorithm are given elsewhere (Read, 1986).
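A toy version of the estimation for a single resolution shell is sketched below. It is not the SIGMAA algorithm: a coarse grid search replaces refinement, only acentric reflections with expected intensity factor 1 are treated, and the data are simulated E values rather than measurements. The amplitude distribution used here is the one obtained by integrating the acentric Gaussian (in E values, with variance one minus the square of σ_A) over the phase difference, which gives a Rice-type form containing the modified Bessel function I0; it is assumed, not verified here, to correspond to the acentric form in the cited publications. All function names are hypothetical.

```python
import math
import random


def log_bessel_i0(x):
    """log(I0(x)) for x >= 0: power series for small x, asymptotic expansion otherwise."""
    if x < 15.0:
        q = 0.25 * x * x
        term = total = 1.0
        k = 0
        while term > 1e-16 * total:
            k += 1
            term *= q / (k * k)
            total += term
        return math.log(total)
    inv = 1.0 / x
    series = 1.0 + inv * (0.125 + inv * (0.0703125 + inv * 0.0732421875))
    return x - 0.5 * math.log(2.0 * math.pi * x) + math.log(series)


def log_likelihood(sigma_a, pairs):
    """Acentric log likelihood; terms that do not depend on sigma_A are omitted."""
    v = 1.0 - sigma_a ** 2
    total = 0.0
    for e_o, e_c in pairs:
        total += (-math.log(v)
                  - (e_o ** 2 + (sigma_a * e_c) ** 2) / v
                  + log_bessel_i0(2.0 * sigma_a * e_o * e_c / v))
    return total


def complex_normal(rng):
    """Complex Gaussian variate with mean-square modulus 1."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def simulate_shell(sigma_a, n_reflections, rng):
    """Simulated (|E_O|, |E_C|) pairs obeying the general distribution in E values."""
    pairs = []
    for _ in range(n_reflections):
        e_c = complex_normal(rng)
        e_true = sigma_a * e_c + math.sqrt(1.0 - sigma_a ** 2) * complex_normal(rng)
        pairs.append((abs(e_true), abs(e_c)))
    return pairs


def estimate_sigma_a(pairs):
    """Grid search over sigma_A in (0, 1); a stand-in for proper refinement."""
    grid = [k / 50.0 for k in range(1, 50)]
    return max(grid, key=lambda s: log_likelihood(s, pairs))


rng = random.Random(3)
shell = simulate_shell(0.7, 2000, rng)       # true sigma_A = 0.7 (assumed)
sigma_a_estimate = estimate_sigma_a(shell)
print("estimated sigma_A:", sigma_a_estimate)
```

Working with the logarithm of I0 avoids overflow when σ_A approaches 1 and the Bessel argument becomes large. Repeating the simulation with fewer reflections per shell shows the scatter in the estimate growing.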

The resolution shells must be thick enough to contain several hundred to a thousand reflections each, in order to provide $[\sigma_{A}]$ estimates with a sufficiently small statistical error. A larger number of shells (fewer reflections per shell) can be used for refined structures, since estimates of $[\sigma_{A}]$ become more precise as the true value approaches 1. If there are sufficient reflections per shell, the estimates will vary smoothly with resolution. As discussed below, the smooth variation with resolution can also be exploited through a restraint that allows $[\sigma_{A}]$ values to be estimated from fewer reflections.

References

Bricogne, G. (1993). Direct phase determination by entropy maximization and likelihood ranking: status report and perspectives. Acta Cryst. D49, 37–60.

Lunin, V. Yu. & Urzhumtsev, A. G. (1984). Improvement of protein phases by coarse model modification. Acta Cryst. A40, 269–277.

Luzzati, V. (1952). Traitement statistique des erreurs dans la determination des structures cristallines. Acta Cryst. 5, 802–810.

Read, R. J. (1986). Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149.

Read, R. J. (1990). Structure-factor probabilities for related structures. Acta Cryst. A46, 900–912.

Sim, G. A. (1959). The distribution of phase angles for structures containing heavy atoms. II. A modification of the normal heavy-atom method for non-centrosymmetrical structures. Acta Cryst. 12, 813–815.

Srinivasan, R. (1966). Weighting functions for use in the early stages of structure analysis when a part of the structure is known. Acta Cryst. 20, 143–144.

Vellieux, F. M. D. & Read, R. J. (1997). Non-crystallographic symmetry averaging in phase refinement and extension. Methods Enzymol. 277, 18–53.

Wilson, A. J. C. (1949). The probability distribution of X-ray intensities. Acta Cryst. 2, 318–321.

Woolfson, M. M. (1956). An improvement of the 'heavy-atom' method of solving crystal structures. Acta Cryst. 9, 804–810.
