Detecting outliers

Kleywegt, G. J.

doi:10.1107/97809553602060000707

International Tables for Crystallography
Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F. ch. 21.1, pp. 498-499

G. J. Kleywegt^a ^*

^aDepartment of Cell and Molecular Biology, Uppsala University, Biomedical Centre, Box 596, SE-751 24 Uppsala, Sweden
Correspondence e-mail: gerard@xray.bmc.uu.se

21.1.3. Detecting outliers

21.1.3.1. Classes of quality indicators

Many statistics, methods and programs were developed in the 1990s to help identify errors in protein models. These methods generally fall into two classes: one in which only coordinates and B factors are considered (such methods often entail comparison of a model to information derived from structural databases) and another in which both the model and the crystallographic data are taken into account. Alternatively, one can distinguish between methods that essentially measure how well the refinement program has succeeded in imposing restraints (e.g. deviations from ideal geometry, conventional R value) and those that assess aspects of the model that are 'orthogonal' to the information used in refinement (e.g. free R value, patterns of non-bonded interactions, conformational torsion-angle distributions). An additional distinction can be made between methods that provide overall (global) statistics for a model (such methods are suitable for monitoring the progress of the refinement and rebuilding process) and those that provide information at the level of residues or atoms (such methods are more useful for detecting local problems in a model). It is important to realise that almost all coordinate-based validation methods detect outliers (i.e. atoms or residues with unusual properties): to assess whether an outlier arises from an error in the model or whether it is a genuine, but unusual, feature of the structure, one must inspect the (preferably unbiased) electron-density maps (Jones et al., 1996)!

In this section, some quality indicators will be discussed that have been found to be particularly useful in daily protein crystallographic practice for the purpose of detecting problems in intermediate models. Section 21.1.7 provides a more extensive discussion of many of the quality criteria that are or have been used by macromolecular crystallographers.

21.1.3.2. Local statistics

From a practical point of view, local statistics are the most useful quality indicators for the crystallographer who is about to rebuild a model. Examples of useful local indicators are:

(1) The real-space fit (Jones et al., 1991; Chapman, 1995; Jones & Kjeldgaard, 1997; Vaguine et al., 1999), expressed as an R value or as a correlation coefficient between 'observed' and calculated density. This property can be calculated for any subset of atoms, e.g. for an entire residue, for main-chain atoms or for side-chain atoms. It is best to use a map that is biased by the model as little as possible [e.g. a σ_A-weighted map (Read, 1986), an NCS-averaged map (Kleywegt & Read, 1997) or an omit map (Bhat & Cohen, 1984; Hodel et al., 1992)]. In practice, the real-space fit is strongly correlated with the atomic temperature factors, even though these are not used in the calculations.
(2) The Ramachandran plot (Ramakrishnan & Ramachandran, 1965; Kleywegt & Jones, 1996b). Residues with unusual main-chain φ, ψ torsion-angle combinations that do not have unequivocally clear electron density are almost always in error. However, one should keep in mind that the error may have its origin in (one of) the neighbouring residues. For instance, if the peptide O atom of a residue is pointing in the wrong direction, the φ value for the next residue may be off by 150–180° (Kleywegt, 1996; Kleywegt & Jones, 1998).
(3) The pep-flip value (Jones et al., 1991; Kleywegt & Jones, 1998). This statistic measures the r.m.s. distance between the peptide O atom of a residue and its counterparts found in a database of well refined high-resolution structures that occur in parts of those structures with a similar local C^α backbone conformation. If the pep-flip value is large (e.g. >2.5 Å), the residue is termed an outlier, but whether it is an error can only be determined by inspecting the local density.
(4) The rotamer side-chain fit value (Jones et al., 1991; Kleywegt & Jones, 1998). This statistic measures the r.m.s. distance between the side-chain atoms of a residue and those in the most similar rotamer conformation for that residue type. A value greater than ∼1.0–1.5 Å signals an outlier. In many cases (particularly, but not exclusively, at low resolution), a non-rotamer side chain can easily be replaced by a rotamer conformation, perhaps in conjunction with a slight rigid-body movement of the entire residue or with some adjustment of the side-chain torsion angles (Zou & Mowbray, 1994; Kleywegt & Jones, 1997).
(5) Hydrogen-bonding analysis. The correct orientation of histidine, asparagine and glutamine side chains cannot usually be inferred from electron density alone. Inexperienced crystallographers can benefit from suggestions based on the analysis of hydrogen-bonding networks (Hooft et al., 1996b), although every case should be examined critically (e.g. the program does not know about solvent molecules that have not yet been added to the model or that cannot be placed because of the limitations of the data; in addition, sometimes an amino group may be interacting with an aromatic side chain).
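
The real-space fit of item (1) can be illustrated with a short sketch. It assumes that the 'observed' and calculated densities have already been sampled at the same grid points around the chosen atoms and placed on a common scale. The R-value form used here, Σ|ρobs − ρcalc| / Σ|ρobs + ρcalc|, is the one commonly quoted for real-space fits; the function names and the density values are hypothetical and are not taken from any of the programs cited above.

```python
import math

def real_space_r(rho_obs, rho_calc):
    """Real-space R value: sum|obs - calc| / sum|obs + calc| over grid points."""
    num = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    den = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    return num / den

def real_space_cc(rho_obs, rho_calc):
    """Pearson correlation coefficient between two density samples."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    if var_o == 0.0 or var_c == 0.0:
        raise ValueError("correlation is undefined for constant density")
    return cov / math.sqrt(var_o * var_c)

# Hypothetical density samples at the grid points around one residue.
obs = [0.9, 1.4, 2.1, 1.2, 0.3]
calc = [1.0, 1.5, 2.0, 1.0, 0.5]
print(f"real-space R  = {real_space_r(obs, calc):.3f}")
print(f"real-space CC = {real_space_cc(obs, calc):.3f}")
```

The same two functions can be applied to any subset of grid points, for example those around main-chain atoms only or side-chain atoms only, which is how a per-residue profile would be built up.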

In addition to these criteria, residues with other unusual features should be examined in the electron-density maps for the crystallographer to be able to decide whether they are in error. Such features may pertain to unusual temperature factors, unusual occupancies, unusual bond lengths or angles, unusual torsion angles or deviations from planarity (e.g. for the peptide plane), unusual chirality (e.g. for the C^α atom of every residue type except glycine), unusual differences in the temperature factors of chemically bonded atoms, unusual packing environments (Vriend & Sander, 1993), very short distances between non-bonded atoms (including symmetry mates), large positional shifts during refinement, unusual deviations from noncrystallographic symmetry (Kleywegt & Jones, 1995b; Kleywegt, 1996) etc.
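
As one example of such a check, the sketch below computes a torsion angle from four atomic positions and reports how far a peptide ω angle deviates from planarity (0° for cis, 180° for trans). It is a minimal illustration only: the coordinates are made up, the helper names are hypothetical, and the 20° cut-off is an assumed value chosen for the example, not one given in this chapter. A flagged residue is an outlier to be inspected in the density, not necessarily an error.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_deg(p0, p1, p2, p3):
    """Torsion angle p0-p1-p2-p3 in degrees, in the range (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def planarity_deviation(omega_deg):
    """Angular distance of omega from the nearest planar value (0 or 180)."""
    w = abs(omega_deg) % 180.0
    return min(w, 180.0 - w)

# Made-up coordinates (in Å) for CA(i), C(i), N(i+1), CA(i+1).
ca_i, c_i, n_next, ca_next = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.3, 0.0), (-1.0, 1.5, 0.6)
omega = torsion_deg(ca_i, c_i, n_next, ca_next)
CUTOFF_DEG = 20.0  # assumed threshold, for illustration only
if planarity_deviation(omega) > CUTOFF_DEG:
    print(f"omega = {omega:.1f} deg: non-planar peptide, inspect the density")
else:
    print(f"omega = {omega:.1f} deg: within the assumed tolerance")
```

The same `torsion_deg` helper could be used for the main-chain φ and ψ angles of a Ramachandran analysis, given the appropriate four atoms.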

21.1.3.3. Global statistics

The crystallographic R value used to be the major global quality indicator until it was realised that it can easily be fooled, especially at low resolution (Brändén & Jones, 1990; Jones et al., 1991; Brünger, 1992a; Kleywegt & Jones, 1995b). The free R value, introduced by Brünger (1992a, 1993), has been shown to be much more reliable and harder to manipulate (Kleywegt & Brünger, 1996; Brünger, 1997). It is excellently suited for monitoring the progress of refinement, for detecting major problems with model or data and for helping reduce over-fitting of the data (which occurs if many more parameters are refined in a model than is warranted by the information content of the crystallographic data). Moreover, the free R value can be used to estimate the coordinate error of the final model (Kleywegt et al., 1994; Kleywegt & Brünger, 1996; Brünger, 1997; Cruickshank, 1999).
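
The relationship between the two statistics can be sketched as follows. Both use the same formula, R = Σ|Fobs − k·Fcalc| / Σ Fobs, but the conventional R value is summed over the reflections used in refinement, whereas the free R value is summed over a test set that was excluded from refinement. The sketch assumes a single linear scale factor k derived from the working set, which is a simplification of what refinement programs do; the amplitudes and function names are hypothetical.

```python
def scale_factor(pairs):
    """Scale k that makes sum(k * Fcalc) equal to sum(Fobs)."""
    return sum(o for o, _ in pairs) / sum(c for _, c in pairs)

def r_value(pairs, k):
    """R = sum|Fobs - k*Fcalc| / sum(Fobs) over the given reflections."""
    return sum(abs(o - k * c) for o, c in pairs) / sum(o for o, _ in pairs)

def work_and_free_r(reflections):
    """reflections: iterable of (f_obs, f_calc, in_test_set) tuples."""
    work = [(o, c) for o, c, in_test in reflections if not in_test]
    free = [(o, c) for o, c, in_test in reflections if in_test]
    k = scale_factor(work)  # simplification: scale taken from the working set
    return r_value(work, k), r_value(free, k)

# Hypothetical structure-factor amplitudes; True marks test-set reflections.
reflections = [
    (120.0, 110.0, False),
    (85.0, 90.0, False),
    (40.0, 47.0, False),
    (63.0, 58.0, False),
    (95.0, 70.0, True),
    (30.0, 41.0, True),
]
r_work, r_free = work_and_free_r(reflections)
print(f"R = {r_work:.3f}, free R = {r_free:.3f}")
```

Because the test-set reflections never contribute to the refinement target, adding unwarranted parameters tends to lower the conventional R value without a matching drop in the free R value, which is the over-fitting signal described above.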

In addition, the average or r.m.s. values for many of the local statistics, their minimum or maximum values or the percentage of outliers can be quoted and used to obtain an impression of the overall quality of the model and the overall fit of the model to the data.
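
A minimal sketch of such a summary is given below, using per-residue rotamer side-chain fit values as the example. The per-residue values are invented for illustration, the function name is hypothetical, and the 1.5 Å cut-off is taken from the upper end of the range quoted in Section 21.1.3.2.

```python
import math

def summarise(values, threshold):
    """Global summary of a per-residue statistic; outliers exceed threshold."""
    n = len(values)
    n_outliers = sum(1 for v in values if v > threshold)
    return {
        "mean": sum(values) / n,
        "rms": math.sqrt(sum(v * v for v in values) / n),
        "min": min(values),
        "max": max(values),
        "percent_outliers": 100.0 * n_outliers / n,
    }

# Invented rotamer side-chain fit values (Å), one per residue.
rotamer_fit = [0.2, 0.4, 1.8, 0.3, 0.9, 2.4, 0.5, 0.1]
for name, value in summarise(rotamer_fit, threshold=1.5).items():
    print(f"{name:>16s}: {value:.2f}")
```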

References

Bhat, T. N. & Cohen, G. H. (1984). OMITMAP: an electron density map suitable for the examination of errors in a macromolecular model. J. Appl. Cryst. 17, 244–248.

Brändén, C.-I. & Jones, T. A. (1990). Between objectivity and subjectivity. Nature (London), 343, 687–689.

Brünger, A. T. (1992a). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–475.

Brünger, A. T. (1993). Assessment of phase accuracy by cross validation: the free R value. Methods and applications. Acta Cryst. D49, 24–36.

Brünger, A. T. (1997). The free R value: a more objective statistic for crystallography. Methods Enzymol. 277, 366–396.

Chapman, M. S. (1995). Restrained real-space macromolecular atomic refinement using a new resolution-dependent electron-density function. Acta Cryst. A51, 69–80.

Cruickshank, D. W. J. (1999). Remarks about protein structure precision. Acta Cryst. D55, 583–601.

Hodel, A., Kim, S.-H. & Brünger, A. T. (1992). Model bias in macromolecular crystal structures. Acta Cryst. A48, 851–858.

Hooft, R. W. W., Sander, C. & Vriend, G. (1996b). Positioning hydrogen atoms by optimizing hydrogen-bond networks in protein structures. Proteins Struct. Funct. Genet. 26, 363–376.

Jones, T. A. & Kjeldgaard, M. (1997). Electron density map interpretation. Methods Enzymol. 277, 173–208.

Jones, T. A., Kleywegt, G. J. & Brünger, A. T. (1996). Storing diffraction data. Nature (London), 381, 18–19.

Jones, T. A., Zou, J.-Y., Cowan, S. W. & Kjeldgaard, M. (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A47, 110–119.

Kleywegt, G. J. (1996). Use of non-crystallographic symmetry in protein structure refinement. Acta Cryst. D52, 842–857.

Kleywegt, G. J., Bergfors, T., Senn, H., Le Motte, P., Gsell, B., Shudo, K. & Jones, T. A. (1994). Crystal structures of cellular retinoic acid binding proteins I and II in complex with all-trans-retinoic acid and a synthetic retinoid. Structure, 2, 1241–1258.

Kleywegt, G. J. & Brünger, A. T. (1996). Checking your imagination: applications of the free R value. Structure, 4, 897–904.

Kleywegt, G. J. & Jones, T. A. (1995b). Where freedom is given, liberties are taken. Structure, 3, 535–540.

Kleywegt, G. J. & Jones, T. A. (1996b). Phi/Psi-chology: Ramachandran revisited. Structure, 4, 1395–1400.

Kleywegt, G. J. & Jones, T. A. (1997). Model-building and refinement practice. Methods Enzymol. 277, 208–230.

Kleywegt, G. J. & Jones, T. A. (1998). Databases in protein crystallography. Acta Cryst. D54, 1119–1131.

Kleywegt, G. J. & Read, R. J. (1997). Not your average density. Structure, 5, 1557–1569.

Ramakrishnan, C. & Ramachandran, G. N. (1965). Stereochemical criteria for polypeptide and protein chain conformations. II. Allowed conformations for a pair of peptide units. Biophys. J. 5, 909–933.

Read, R. J. (1986). Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149.

Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). SFCHECK: a unified set of procedures for evaluating the quality of macromolecular structure-factor data and their agreement with the atomic model. Acta Cryst. D55, 191–205.

Vriend, G. & Sander, C. (1993). Quality control of protein models: directional atomic contact analysis. J. Appl. Cryst. 26, 47–60.

Zou, J. Y. & Mowbray, S. L. (1994). An evaluation of the use of databases in protein structure refinement. Acta Cryst. D50, 237–249.
