International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. F. ch. 18.2, p. 375
Section 18.2.2. Cross validation
aThe Howard Hughes Medical Institute, and Departments of Molecular and Cellular Physiology, Neurology and Neurological Sciences, and Stanford Synchrotron Radiation Laboratory, Stanford University, 1201 Welch Road, MSLS P210, Stanford, CA 94305-5489, USA, bThe Howard Hughes Medical Institute and Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06511, USA, and cDepartment of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06511, USA
Cross validation (Brünger, 1992) plays a fundamental role in the maximum-likelihood target functions described below. A few remarks about this method are therefore warranted (for reviews see Kleywegt & Brünger, 1996; Brünger, 1997). For cross validation, the diffraction data are divided into two sets: a large working set (usually comprising 90% of the data) and a complementary test set (comprising the remaining 10%). The diffraction data in the working set are used in the normal crystallographic refinement process, whereas the test data are not. The cross-validated (or `free') R value computed with the test-set data is a more faithful indicator of model quality. It provides a more objective guide during the model building and refinement process than the conventional R value. It also ensures that introduction of additional parameters (e.g. water molecules, relaxation of noncrystallographic symmetry restraints, or multi-conformer models) improves the quality of the model, rather than increasing overfitting.
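The partitioning described above can be illustrated with a minimal Python sketch. It assumes structure-factor amplitudes are held in plain lists and uses the standard definition R = Σ||Fobs| − k|Fcalc|| / Σ|Fobs|. The function names, the random selection of test reflections and the fixed scale factor k are illustrative assumptions, not the procedure of any particular refinement program.

```python
import random


def split_reflections(n_reflections, test_fraction=0.10, seed=0):
    """Return one boolean per reflection; True marks a test-set reflection.

    Random selection is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_test = round(n_reflections * test_fraction)
    test_indices = set(rng.sample(range(n_reflections), n_test))
    return [i in test_indices for i in range(n_reflections)]


def r_value(f_obs, f_calc, scale=1.0):
    """R = sum(| |Fobs| - k |Fcalc| |) / sum(|Fobs|), with k = scale."""
    numerator = sum(abs(abs(fo) - scale * abs(fc))
                    for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def r_work_and_r_free(f_obs, f_calc, is_test, scale=1.0):
    """Compute the conventional (working-set) R and the free (test-set) R."""
    work_obs = [fo for fo, t in zip(f_obs, is_test) if not t]
    work_calc = [fc for fc, t in zip(f_calc, is_test) if not t]
    test_obs = [fo for fo, t in zip(f_obs, is_test) if t]
    test_calc = [fc for fc, t in zip(f_calc, is_test) if t]
    return (r_value(work_obs, work_calc, scale),
            r_value(test_obs, test_calc, scale))


# Toy amplitudes (made-up numbers, for illustration only).
f_obs = [120.0, 95.0, 80.0, 64.0, 50.0, 41.0, 33.0, 27.0, 18.0, 12.0]
f_calc = [115.0, 99.0, 76.0, 66.0, 47.0, 43.0, 30.0, 29.0, 20.0, 10.0]
is_test = split_reflections(len(f_obs), test_fraction=0.10, seed=0)
r_work, r_free = r_work_and_r_free(f_obs, f_calc, is_test)
```

In this sketch the test-set flags are generated once and reused unchanged; the test reflections are only meaningful as a check if they are never used in refinement.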
Since the conventional R value shows little correlation with the accuracy of a model, coordinate-error estimates derived from the Luzzati (1952) or σA (Read, 1986) methods are unrealistically low. Kleywegt & Brünger (1996) showed that more reliable coordinate errors can be obtained by cross validation of the Luzzati or σA coordinate-error estimates. An example is shown in Fig. 18.2.2.1 using the crystal structure and diffraction data of penicillopepsin (Hsu et al., 1977). At 1.8 Å resolution, the model has an estimated coordinate error of ~0.2 Å as assessed by multiple independent refinements. As the resolution of the diffraction data is artificially truncated and the model re-refined, the coordinate error (assessed by the atomic root-mean-square difference to the refined model at 1.8 Å resolution) increases monotonically. The conventional R value nevertheless improves as the resolution decreases, even as the quality of the model worsens. Consequently, coordinate-error estimates based on the working data do not display the correct behaviour either: the error estimates are approximately constant, regardless of the resolution and actual coordinate error of the models. However, when cross validation is used (i.e. the test reflections are used to compute the estimated coordinate errors), the results are much better: the cross-validated errors are close to the actual coordinate error, and they show the correct trend as a function of resolution (Fig. 18.2.2.1).
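The 'actual' coordinate error in this comparison is an atomic root-mean-square difference between a re-refined model and the reference model refined at 1.8 Å. A minimal Python sketch of that quantity follows. It assumes both models list the same atoms in the same order and share the same crystal frame, so no superposition is applied; the function name and the toy coordinates are hypothetical.

```python
import math


def rms_coordinate_difference(coords_a, coords_b):
    """Root-mean-square distance between matched atoms of two models.

    coords_a, coords_b: sequences of (x, y, z) tuples in angstroms,
    with identical atom order (an assumption of this sketch).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("models must contain the same number of atoms")
    if not coords_a:
        raise ValueError("models must contain at least one atom")
    sum_sq = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        sum_sq += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(sum_sq / len(coords_a))


# Toy coordinates (made-up numbers, for illustration only).
reference_model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
truncated_model = [(0.1, 0.0, 0.0), (1.5, 0.2, 0.0), (1.4, 1.5, 0.1)]
actual_error = rms_coordinate_difference(reference_model, truncated_model)
```

Comparing this value with the Luzzati or σA estimate at each resolution cutoff is the kind of check the figure summarizes; computing those estimates themselves is beyond this sketch.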
References
Brünger, A. T. (1992). The free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–474.
Brünger, A. T. (1997). Free R value: cross-validation in crystallography. Methods Enzymol. 277, 366–396.
Hsu, I. N., Delbaere, L. T. J., James, M. N. G. & Hoffman, T. (1977). Penicillopepsin from Penicillium janthinellum crystal structure at 2.8 Å and sequence homology with porcine pepsin. Nature (London), 266, 140–145.
Kleywegt, G. J. & Brünger, A. T. (1996). Cross-validation in crystallography: practice and applications. Structure, 4, 897–904.
Luzzati, V. (1952). Traitement statistique des erreurs dans la détermination des structures cristallines. Acta Cryst. 5, 802–810.
Read, R. J. (1986). Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149.