International
Tables for
Crystallography
Volume F
Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F, ch. 18.4, pp. 396–398.

Section 18.4.4. Computational options and tactics

Z. Dauter,a* G. N. Murshudovb and K. S. Wilsonc

aNational Cancer Institute, Brookhaven National Laboratory, Building 725A-X9, Upton, NY 11973, USA, bStructural Biology Laboratory, Department of Chemistry, University of York, York YO10 5DD, England, and CLRC, Daresbury Laboratory, Daresbury, Warrington, WA4 4AD, England, and cStructural Biology Laboratory, Department of Chemistry, University of York, York YO10 5DD, England
Correspondence e-mail: dauter@bnl.gov

18.4.4. Computational options and tactics

18.4.4.1. Use of F or [F^{2}]

The X-ray experiment provides two-dimensional diffraction images. These are reduced to integrated but unscaled data, then to Bragg reflection intensities and finally to structure-factor amplitudes. Each transformation rests on assumptions, and the results depend on their validity: invalid assumptions bias the resulting data toward themselves. Ideally, refinement (or estimation of parameters) should be carried out against data that are as close as possible to the experimental observations, eliminating at least some of the invalid assumptions. Taken to the extreme, refinement would use the images themselves as the observations, but this poses severe problems, both in the sheer quantity of data and in the lack of an appropriate statistical model.

Alternatively, the transformation of data can be improved by revising the assumptions. The intensities are closer to the real experiment than are the structure-factor amplitudes, and use of intensities would reduce the bias. However, there are some difficulties in the implementation of intensity-based likelihood refinement (Pannu & Read, 1996).

A Gaussian approximation to the intensity-based likelihood (Murshudov et al., 1997) would avoid these difficulties, since a Gaussian error distribution can be assumed for the intensities but not for the amplitudes. However, errors in intensities may not result from counting statistics alone: there may be additional contributions from factors such as crystal disorder and motion of the molecules in the lattice during data collection.
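
As an illustration (a hedged sketch, not necessarily the exact formulation of Murshudov et al., 1997), the Rice-type amplitude distribution is replaced for each reflection by a normal distribution in intensity, with the experimental variance augmented by a model-error term:

```latex
% Sketch of a Gaussian approximation to the intensity-based likelihood
% for a single reflection.  The model-error variance \sigma_{\rm mod}^{2},
% absorbing disorder and lattice-motion effects, is an illustrative
% assumption, not a term taken from the original paper.
\[
  -\ln L\bigl(I_{\rm obs}\mid I_{\rm calc}\bigr)\;\approx\;
  \frac{\bigl(I_{\rm obs}-\langle I_{\rm calc}\rangle\bigr)^{2}}
       {2\bigl(\sigma_{I}^{2}+\sigma_{\rm mod}^{2}\bigr)}
  \;+\;\tfrac{1}{2}\,
  \ln\!\bigl[2\pi\bigl(\sigma_{I}^{2}+\sigma_{\rm mod}^{2}\bigr)\bigr].
\]
```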

Nevertheless, the problem of how to treat weak reflections remains. Some of the measured intensities will be negative, as a result of statistical errors of observation, and the proportion of such measurements will be relatively large for weakly diffracting macromolecular structures, especially at atomic resolution. For intensity-based likelihood, this is less important than for the amplitude-based approach. French & Wilson (1978) have given a Bayesian approach for the derivation of structure-factor amplitudes from intensities using Wilson's distribution (Wilson, 1942) as a prior, but there is room for improvement in this approach. Firstly, the assumed Wilson distribution could be upgraded using the scaling techniques suggested by Cowtan & Main (1998) and Blessing (1997), and secondly, information about effects such as pseudosymmetry could be exploited.
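
The flavour of the French & Wilson approach can be conveyed by a short numerical sketch. The Gaussian error model for the measured intensity, the grid-based integration and the function name are illustrative assumptions; only the acentric Wilson prior is taken from the cited papers:

```python
# Hedged sketch of a French & Wilson (1978)-style Bayesian estimate of the
# structure-factor amplitude from a weak (possibly negative) measured
# intensity, for a single acentric reflection with Wilson prior
# p(F) = (2F/Sigma) exp(-F^2/Sigma).
import numpy as np

def french_wilson_acentric(i_obs, sigma_i, big_sigma, n_grid=4000):
    """Posterior mean <F>, assuming I_obs ~ N(F^2, sigma_i^2)."""
    f_max = np.sqrt(max(i_obs, 0.0) + 10.0 * sigma_i)   # generous grid limit
    f = np.linspace(0.0, f_max, n_grid)
    prior = (2.0 * f / big_sigma) * np.exp(-f**2 / big_sigma)    # Wilson (1942)
    likelihood = np.exp(-0.5 * ((i_obs - f**2) / sigma_i) ** 2)  # Gaussian error
    posterior = prior * likelihood
    return np.trapz(f * posterior, f) / np.trapz(posterior, f)

# Even a negative measured intensity yields a sensible positive estimate:
print(french_wilson_acentric(i_obs=-5.0, sigma_i=10.0, big_sigma=100.0))
```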

Another argument for the use of intensities rather than amplitudes concerns least squares: in amplitude-based refinement, the derivative of the residual with respect to [F_{\rm calc}] is singular when [F_{\rm calc}] is equal to zero (Schwarzenbach et al., 1995). This is not the case for intensity-based least squares. In maximum-likelihood approaches, this problem does not arise (Pannu & Read, 1996; Murshudov et al., 1997).
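
The singularity is easy to exhibit. Writing the calculated structure factor in terms of its real and imaginary components:

```latex
% With F_c = A + iB, the amplitude derivative is singular at the origin,
% whereas the intensity derivative is smooth everywhere.
\[
  |F_{c}| = \bigl(A^{2}+B^{2}\bigr)^{1/2}, \qquad
  \frac{\partial |F_{c}|}{\partial A} = \frac{A}{|F_{c}|}
  \quad\text{(undefined as } |F_{c}|\rightarrow 0\text{)},
\]
\[
  |F_{c}|^{2} = A^{2}+B^{2}, \qquad
  \frac{\partial |F_{c}|^{2}}{\partial A} = 2A
  \quad\text{(smooth everywhere).}
\]
```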

Finally, while there may be some advantages in refining against [F^{2}], Fourier syntheses always require structure-factor amplitudes.

18.4.4.2. Restraints and/or constraints on coordinates and ADPs

Even for small-molecule structures, disordered regions of the unit cell require the imposition of stereochemical restraints or constraints if the chemical integrity is to be preserved and the ADPs are to be realistic. The restraints are comparable to those used for proteins at lower resolution, and this makes sense: poorly ordered regions with high ADPs contribute little to the high-angle diffraction terms, so their parameters are defined only by the lower-angle amplitudes.

Thus, even for a macromolecule for which the crystals diffract to atomic resolution, there will be regions possessing substantial thermal or static disorder, and restraints on the positional parameters and ADPs are essential for these parts. Their effect on the ordered regions will be minimal, as the X-ray terms will dominate the refinement, provided the relative weighting of X-ray and geometric contributions is appropriate.

Another justification for the use of restraints is that refinement can be considered a Bayesian estimation. From this point of view, all available and usable prior knowledge should be exploited, as it should not harm the parameter estimation during refinement. Bayesian estimation shows asymptotic behaviour (Box & Tiao, 1973), i.e. as the number of observations becomes large, the experimental data override the prior knowledge. In this sense, the purpose of the experiment is to enhance our knowledge about the molecule, and the procedure should be cumulative: the result of an old experiment should serve as prior knowledge for the design and treatment of new experiments (Box & Tiao, 1973; Stuart et al., 1999; O'Hagan, 1994). However, there are problems in using restraints. For example, the probability distributions expressing the degree of belief in the restraints are not well characterized. The use of a Gaussian approximation to the distributions of distances, angles and other geometric properties has not been justified: firstly, the distribution of geometric parameters depends strongly on the ADPs, and secondly, different geometric parameters are correlated. This problem should be the subject of further investigation.
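
A minimal sketch of this Bayesian view of restrained refinement, assuming Gaussian forms throughout; the function, variable names and the weight w_xray are illustrative and do not correspond to any particular refinement program:

```python
# Restrained refinement target viewed as a negative log posterior:
# an X-ray (likelihood) term plus stereochemical restraint (prior) terms.
import numpy as np

def restrained_target(f_obs, f_calc, sigma_f, d_model, d_ideal, sigma_d,
                      w_xray=1.0):
    # -log likelihood: a simple weighted least-squares amplitude residual.
    xray = np.sum(((f_obs - np.abs(f_calc)) / sigma_f) ** 2)
    # -log prior: e.g. bond distances pulled toward their ideal values,
    # with standard uncertainties sigma_d.
    geom = np.sum(((d_model - d_ideal) / sigma_d) ** 2)
    return w_xray * xray + geom

# With abundant high-resolution data, the X-ray term dominates for the
# ordered regions while the restraints keep poorly ordered regions
# chemically sensible -- the asymptotic behaviour described above.
```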

18.4.4.3. Partial occupancy

It may be necessary to refine one additional parameter, the occupancy factor of an atomic site, for structures possessing regions that are spatially or temporally disordered, with some atoms lying in more than one discrete site. The sum of the occupancies for alternative individual sites of a protein atom must be 1.0.

For macromolecules, the occupancy factor is important in several situations, including the following:

  • (1) when a protein or ligand atom is present in all molecules in the lattice, but can lie in more than one position due to alternative conformations;

  • (2) for the solvent region, where there may be overlapping and mutually exclusive solvent networks;

  • (3) when ligand-binding sites are only partially occupied due to weak binding constants, and the structures represent a mixture of native enzyme with associated solvent and the complex structure;

  • (4) when there is a mixture of protein residues in the crystal, due to inhomogeneity of the sample arising from polymorphism, a mixture of mutant and wild-type protein or other causes.

Unfortunately, the occupancy parameter is highly correlated with the ADP, and it is difficult to model these two parameters at resolutions less than atomic. Even at atomic resolution, it can prove difficult to refine the occupancy satisfactorily with statistical certainty.
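
One common way to impose the sum-to-one constraint exactly is to refine a single unconstrained parameter and derive both occupancies from it; the logistic transform below is an illustrative choice. Note that this removes one parameter but not the correlation of the remaining occupancy parameter with the ADPs:

```python
# Hedged sketch: occupancies of a two-conformer atom derived from one
# refinable parameter x, so that occ_A + occ_B = 1 holds by construction.
import math

def occupancies_from_x(x):
    """Map an unconstrained parameter x to (occ_A, occ_B)."""
    occ_a = 1.0 / (1.0 + math.exp(-x))   # logistic transform, in (0, 1)
    occ_b = 1.0 - occ_a                  # constraint satisfied exactly
    return occ_a, occ_b

print(occupancies_from_x(0.5))   # e.g. (0.622..., 0.377...)
```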

18.4.4.4. Validation of extra parameters during the refinement process

The introduction of additional parameters into the model always results in a reduction in the least-squares or maximum-likelihood residual – in crystallographic terms, the R factor. However, the statistical significance of this reduction is not always clear, since the extra parameters simultaneously reduce the observation-to-parameter ratio. It is therefore important to validate the significance of introducing further parameters into the model on a statistical basis. Early attempts to derive such an objective tool were made by Hamilton (1965). Unfortunately, they proved cumbersome in practice for large structures and did not provide the required objectivity.
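
For reference, Hamilton's test compares the ratio of the (weighted) R factors of the restricted and unrestricted models with a critical value derived from the F distribution. The following is a hedged paraphrase, with b the number of parameters fixed by the hypothesis, n the number of observations and m the number of parameters of the unrestricted model:

```latex
% Hamilton (1965) R-factor ratio test (sketch): reject the restriction at
% significance level \alpha if the R-factor ratio exceeds the bound below.
\[
  \mathcal{R} \;=\; \frac{R_{\rm restricted}}{R_{\rm unrestricted}}
  \;>\;
  \mathcal{R}_{b,\,n-m,\,\alpha}
  \;=\;
  \left[\frac{b}{n-m}\,F_{b,\,n-m;\,\alpha} \;+\; 1\right]^{1/2}.
\]
```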

Direct application of the Hamilton test is especially problematic for macromolecules because of the use of restraints. Attempts have been made to overcome these problems, either by a direct extension of the Hamilton test itself (Bacchi et al., 1996) or by a combination of self and cross validation (Tickle et al., 1998).

Brünger (1992a) introduced the concept of statistical cross validation to evaluate the significance of introducing extra features into the atomic model. For this, a small and randomly distributed subset of the experimental observations is excluded from the refinement procedure, and the residual against this subset of reflections is termed [R_{\rm free}]. It is generally sufficient to include about 1000 reflections in the [R_{\rm free}] subset; further increase in this number provides little, if any, statistical advantage but diminishes the power of the minimization procedure. For atomic resolution structures, cross validation is important in establishing whether the introduction of an additional type of feature to the model (with its associated increase in parameters) is justified. There are two limitations to this. Firstly, if [R_{\rm free}] shows zero or only a minimal decrease relative to the decrease in the R factor, the significance remains unclear. Secondly, the introduction of individual features, for example the partial occupancy of five water molecules, can produce only a very small change in [R_{\rm free}], which will be impossible to substantiate. To recapitulate, at atomic resolution the prime use of cross validation is in establishing protocols with regard to extended sets of parameter types. The sets thus defined will depend on the quality of the data.
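
The bookkeeping behind [R_{\rm free}] is simple enough to sketch; the array names and the 1000-reflection target are illustrative, and the conventional unweighted R-factor definition is used:

```python
# Hedged sketch of R_free cross validation (Bruenger, 1992a): flag a small
# random subset of reflections, exclude it from refinement and monitor the
# residual on both subsets.
import numpy as np

rng = np.random.default_rng(seed=42)

def split_free_set(n_refl, n_free=1000):
    """Boolean mask marking the free (cross-validation) reflections."""
    free = np.zeros(n_refl, dtype=bool)
    picks = rng.choice(n_refl, size=min(n_free, n_refl), replace=False)
    free[picks] = True
    return free

def r_factor(f_obs, f_calc, mask):
    """Conventional R = sum||Fo|-|Fc|| / sum|Fo| over the masked subset."""
    return np.sum(np.abs(f_obs[mask] - np.abs(f_calc[mask]))) / np.sum(f_obs[mask])

# Usage: refine against the working set only (~free), then monitor
#   r_work = r_factor(f_obs, f_calc, ~free)
#   r_free = r_factor(f_obs, f_calc, free)
```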

In the final analysis, validation of individual features depends on the electron density, and Fourier maps must be judiciously inspected. Nevertheless, this remains a somewhat subjective approach and is in practice intractable for extensive sets of parameters, such as the occupancies and ADPs of all solvent sites. For the latter, automated procedures, which are presently being developed, are an absolute necessity, but they may not be optimal in the final stages of structure analysis, and visual inspection of the model and density is often needed.

The problems of limited data and reparameterization of the model remain. At high resolution, reparameterization means keeping the same number of atoms but changing the number of parameters to increase their statistical significance, for example switching between anisotropic and isotropic atomic models. In contrast, reparameterization at low resolution usually involves a reduction in the number of atoms. This is not an ideal procedure, as real chemical entities of the model are sacrificed; the omission will inevitably produce disagreement between experiment and model, which in turn affects the precision of the other parameters. It would be more appropriate to reduce the number of parameters without sacrificing the number of atoms, for example by describing the model in torsion-angle space. Water poses a particular problem: at low as well as at high resolution, not all water molecules can be described as discrete atoms. Algorithms are needed to describe them as a continuous model with only a few parameters; in the simplest such model, the solvent is described as a constant electron density.
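
The observation-to-parameter bookkeeping that guides such choices can be sketched as follows. The counts of four parameters per isotropically and nine per anisotropically modelled atom (x, y, z plus B, or plus six U_ij) are standard; occupancies and restraints are ignored for simplicity:

```python
# Hedged sketch: data-to-parameter ratio for isotropic versus anisotropic
# atomic models, one input to the reparameterization decision.
def obs_per_param(n_reflections, n_atoms, anisotropic=False):
    params_per_atom = 9 if anisotropic else 4   # x, y, z + B, or + six U_ij
    return n_reflections / (n_atoms * params_per_atom)

# e.g. 100 000 unique reflections and 2 000 atoms:
print(obs_per_param(100_000, 2_000, anisotropic=False))  # 12.5
print(obs_per_param(100_000, 2_000, anisotropic=True))   # ~5.6
```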

18.4.4.5. Practical strategies

It is not reasonable to give absolute rules for the refinement of atomic resolution structures at this time, as the field is relatively new and developing rapidly. Pioneering work was carried out by Teeter et al. (1993) on crambin, based on data recorded from this small and highly stable protein with a conventional diffractometer. Studies on perhaps more representative proteins are those on ribonuclease Sa at 1.1 Å (Sevcik et al., 1996) and on triclinic lysozyme at 0.9 Å resolution (Walsh et al., 1998). These studies used data from a synchrotron source with an imaging-plate detector, at room temperature for the ribonuclease and at 100 K for the lysozyme. The strategy involved the application of conventional restrained least-squares or maximum-likelihood techniques in the early stages of refinement, followed by a switch to SHELXL to introduce a full anisotropic model. A series of other papers following similar protocols has appeared in the literature, reflecting the fact that, until recently, only SHELXL was generally available for refining macromolecular structures with anisotropic models and appropriate stereochemical restraints. Programs such as REFMAC have now been extended to allow anisotropic models. As they use fast Fourier transforms for the structure-factor calculations, their speed advantage means that REFMAC and comparable programs are likely to be used extensively in this area in the future, even if SHELXL is retained in the final step to extract error estimates.

References

Bacchi, A., Lamzin, V. S. & Wilson, K. S. (1996). A self-validation technique for protein structure refinement: the extended Hamilton test. Acta Cryst. D52, 641–646.
Blessing, R. H. (1997). LOCSCL: a program to statistically optimize local scaling of single-isomorphous-replacement and single-wavelength-anomalous-scattering data. J. Appl. Cryst. 30, 176–177.
Box, G. E. P. & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, Massachusetts/California/London: Addison-Wesley.
Brünger, A. T. (1992a). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–475.
Cowtan, K. D. & Main, P. (1998). Miscellaneous algorithms for density modification. Acta Cryst. D54, 487–493.
French, S. & Wilson, K. S. (1978). On the treatment of negative intensity observations. Acta Cryst. A34, 517–525.
Hamilton, W. C. (1965). Significance tests on the crystallographic R factor. Acta Cryst. 18, 502–510.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Refinement of macromolecular structures by the maximum-likelihood method. Acta Cryst. D53, 240–255.
O'Hagan, A. (1994). Kendall's advanced theory of statistics: Bayesian inference, Vol. 2B. Cambridge: Arnold, Hodder Headline and Cambridge University Press.
Pannu, N. S. & Read, R. J. (1996). Improved structure refinement through maximum likelihood. Acta Cryst. A52, 659–668.
Schwarzenbach, D., Abrahams, S. C., Flack, H. D., Prince, E. & Wilson, A. J. C. (1995). Statistical descriptors in crystallography. II. Report of a working group on expression of uncertainty in measurement. Acta Cryst. A51, 565–569.
Sevcik, J., Dauter, Z., Lamzin, V. S. & Wilson, K. S. (1996). Ribonuclease from Streptomyces aureofaciens at atomic resolution. Acta Cryst. D52, 327–344.
Stuart, A., Ord, K. J. & Arnold, S. (1999). Kendall's advanced theory of statistics: classical inference and the linear model, Vol. 2A. London/Sydney/Auckland: Arnold, Hodder Headline.
Teeter, M. M., Roe, S. M. & Heo, N. H. (1993). Atomic resolution (0.83 Å) crystal structure of the hydrophobic protein crambin at 130 K. J. Mol. Biol. 230, 292–311.
Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998). Rfree and the Rfree ratio. Part I: Derivation of expected values of cross-validation residuals used in macromolecular least-squares refinement. Acta Cryst. D54, 547–557.
Walsh, M. A., Schneider, T. R., Sieker, L. C., Dauter, Z., Lamzin, V. S. & Wilson, K. S. (1998). Refinement of triclinic hen egg-white lysozyme at atomic resolution. Acta Cryst. D54, 522–546.
Wilson, A. J. C. (1942). Determination of absolute from relative X-ray intensity data. Nature (London), 150, 151–152.