International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. F, ch. 21.3, pp. 520–530
https://doi.org/10.1107/97809553602060000709
Chapter 21.3. Detection of errors in protein models
(a) UCLA–DOE Laboratory of Structural Biology and Molecular Medicine, UCLA, Box 951570, Los Angeles, CA 90095-1570, USA; (b) UCLA–DOE Laboratory of Structural Biology and Molecular Medicine, Department of Chemistry & Biochemistry, Molecular Biology Institute and Department of Biological Chemistry, UCLA, Los Angeles, CA 90095-1570, USA; and (c) UCLA–DOE Laboratory of Structural Biology and Molecular Medicine, Department of Chemistry & Biochemistry and Molecular Biology Institute, UCLA, Los Angeles, CA 90095-1569, USA
The detection of errors in protein models is discussed. Programs used in this field, including PROCHECK, WHAT IF, VERIFY3D and ERRAT, are described and the types of errors that they detect are outlined. Examples of the detection of errors in structures are provided, with a focus on the programs VERIFY3D and ERRAT.
Keywords: ERRAT; PROCHECK; VERIFY3D; WHAT IF; errors; structure validation.
The discovery of major errors in several protein structural models determined by X-ray crystallography has focused attention on methods of detecting and minimizing such errors. There are several sources of error in the determination of a protein structure. Errors enter not only in the collection of the experimental data, but especially in their interpretation. Limited diffraction resolution and poor phases frequently lead to electron-density maps that are difficult to interpret. As a result, preliminary protein models built into ambiguous maps often contain errors of various types. The different types of errors can be arranged in decreasing order of severity, as follows: mistracing of the protein chain due to uncertainty in backbone connectivity, misalignment or misregistration of residues, and misplacement of side-chain and backbone atoms. It is critical to be able to identify these problematic regions of a model so they can be given special attention during the iterative process of model building and atomic refinement.
During atomic refinement, the atomic coordinates of the macromolecule are adjusted to minimize an error function of two terms. The first term contains the discrepancies between the observed diffraction data and structure factors calculated from the model. The second term describes the deviations from ideal geometry, such as deviations in bond lengths, bond angles, planarity and other specific features. When refinement is complete, the residual errors in the separate terms are reported, with the discrepancies in the diffraction data embodied in the R value. These error values are usually taken as the first indicators of structure quality.
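The two-term error function described above can be sketched in a few lines. This is only an illustration under stated assumptions: the least-squares form, the relative `weight` and the function name `refinement_target` are not taken from this chapter, and real refinement programs use their own (often more elaborate) targets.

```python
def refinement_target(f_obs, f_calc, geometry, weight=1.0):
    """Return (total, xray_term, geometry_term) for a two-term error function.

    f_obs, f_calc: observed and calculated structure-factor amplitudes.
    geometry: list of (model_value, ideal_value, sigma) tuples, for example
              one per restrained bond length or bond angle.
    weight: assumed relative weight of the diffraction term.
    """
    # Term 1: discrepancy between observed and calculated amplitudes.
    xray_term = sum((fo - fc) ** 2 for fo, fc in zip(f_obs, f_calc))
    # Term 2: deviations from ideal geometry, each scaled by its tolerance.
    geometry_term = sum(((value - ideal) / sigma) ** 2
                        for value, ideal, sigma in geometry)
    return weight * xray_term + geometry_term, xray_term, geometry_term


if __name__ == "__main__":
    # One reflection off by 2 units and one bond length off by one sigma.
    print(refinement_target([10.0, 20.0], [8.0, 20.0], [(1.54, 1.52, 0.02)]))
```

Returning the two terms separately mirrors the practice, noted above, of reporting the residual errors in the separate terms once refinement is complete.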
Beyond criteria that are explicitly minimized during refinement, other structural properties may be devised and evaluated. Some properties that have been investigated include the distribution of non-polar and polar residues both on the surface and in the interior of the protein, and preferred environments for different atom types and residues. These measures use the empirical knowledge gathered in the Protein Data Bank (PDB) to assess how `normal' or `abnormal' a given model is. The measures are also useful in cases in which the experimental diffraction data are not available (e.g. when assessing structures already in the data bank). Several programs that validate protein structural models on the basis of various structural properties are available. Among them are PROCHECK (Laskowski et al., 1993), WHAT IF (Vriend, 1990; Vriend & Sander, 1993), ERRAT (Colovos & Yeates, 1993), and VERIFY3D (Lüthy et al., 1992; Bowie et al., 1991). The various programs have the same objectives, but differ in many important respects. The approaches differ with regard to the scale of the analysis (e.g. atom-based versus amino-acid based), the level of detail in the program output, and the degree to which the evaluated properties are independent of the refinement function.
Any property that has been constrained or heavily restrained during refinement of the atomic model, and any property that has been closely monitored during rebuilding, cannot be used as the sole criterion to assess or `prove' the quality of the model. The reason is that if the atomic model is adjusted to optimize a particular property, that property no longer gives an unbiased measure of model accuracy. For example, most refinement programs operate by adjusting atomic positions to minimize the difference between observed and calculated structure-factor amplitudes, known as the R factor or R value. Since the R value is the target of the optimization procedure, it does not provide an independent measure of quality. As a result, the ordinary R value can be misleading. A much more reliable measure is the free R value (Brünger, 1992), which is calculated from a randomly selected subset of the diffraction data that are excluded from the atomic refinement calculations. The importance of using the free R value to monitor refinement is now widely accepted.
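As a minimal sketch of the distinction drawn above, the following code computes the conventional R value, R = Σ||Fo| − |Fc|| / Σ|Fo|, separately over a working set and over a randomly selected test set. The function names, the 5% test fraction and the fixed random seed are assumptions made for illustration; the chapter does not specify them.

```python
import random


def r_value(f_obs, f_calc):
    """R = sum(||Fo| - |Fc||) / sum(|Fo|) over the supplied reflections."""
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def split_free_set(n_reflections, free_fraction=0.05, seed=0):
    """Randomly flag reflection indices for the test (free) set.

    free_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_free = max(1, round(n_reflections * free_fraction))
    free = set(rng.sample(range(n_reflections), n_free))
    work = [i for i in range(n_reflections) if i not in free]
    return work, sorted(free)


def r_work_and_free(f_obs, f_calc, work, free):
    """R over the reflections used in refinement, and over those held out."""
    def pick(data, indices):
        return [data[i] for i in indices]

    return (r_value(pick(f_obs, work), pick(f_calc, work)),
            r_value(pick(f_obs, free), pick(f_calc, free)))
```

The point of the split is that the test-set reflections must never enter the refinement target; only then does the second number remain an unbiased check.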
Likewise, independent criteria must be employed to judge protein models themselves, aside from the diffraction data. Typical atomic refinement protocols tightly restrain the obvious stereochemical terms, such as bond lengths, angles and planarity. Therefore, low deviation from ideal geometry cannot be presented as proof of the quality of the structure. Independent criteria must be based on higher-level geometric considerations. Several programs that include such evaluations are described here.
Criteria that are useful for assessing the validity of protein models are those that are not directly restrained during the process of refinement. The following three properties of protein models are of this type: (1) the main-chain dihedral angles (φ and ψ); (2) the non-bonded interactions of protein atoms with other protein atoms and with the solvent; and (3) the packing of atoms within the structure. Each of these properties of a proposed model can be compared for consistency with the same property observed in a database of trustworthy structures. To the extent that the property deviates from the values observed for the proteins of the database, the proposed model is suspect. Some of these properties can be computed for each segment of a protein or for local regions in three-dimensional (3D) space. In this way, inaccurate regions within a proposed model can be identified.
21.3.3. Algorithms for the detection of errors in protein models and the types of errors they detect
The PROCHECK (Laskowski et al., 1993) suite of programs compares the stereochemistry of a proposed protein model to stereochemical features of known structures. The program provides an assessment of the overall quality of the model by comparing the model with well refined structures of the same resolution, and also highlights regions that may need further adjustment. The output of PROCHECK comprises a number of plots, together with detailed residue-by-residue listings of secondary-structure assignment, non-bonded interactions between different pairs of residues, main-chain bond lengths and bond angles, and peptide-bond planarity.
The program also displays main-chain dihedral angles (φ and ψ) as a two-dimensional Ramachandran (Ramachandran & Sasisekharan, 1968) plot. The Ramachandran plot classifies each residue in one of three categories: `allowed' conformations; `partially allowed' conformations, which give rise to modestly unfavourable repulsion between non-bonded atoms, and which might be overcome by attractive effects such as hydrogen bonds; and `disallowed' interactions which give highly unfavourable non-bonded interatomic distances. The Ramachandran plot can identify unacceptable clusters of φ–ψ angles, revealing possible errors made during model building and refinement. As opposed to covalent bond angles and bond lengths, the main-chain dihedral angles are not usually restrained during X-ray refinement and therefore can be used to validate the structural model independently. In practice, the Ramachandran plot is one of the simplest, most sensitive tools for assessing the quality of a protein model.
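The quantity plotted on each axis of a Ramachandran plot is a torsion angle defined by four consecutive backbone atoms: φ uses C(i−1), N(i), Cα(i), C(i) and ψ uses N(i), Cα(i), C(i), N(i+1). The sketch below computes such an angle from Cartesian coordinates with the standard atan2 formula. It is not PROCHECK code; the function names are hypothetical, and the boundaries of the allowed and disallowed regions are deliberately left out because they are not given in this chapter.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond.

    0 degrees corresponds to the cis (eclipsed) arrangement of p0 and p3,
    180 degrees to the trans arrangement.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane p1, p2, p3
    length_b2 = math.sqrt(_dot(b2, b2))
    y = length_b2 * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """phi and psi of one residue from its five defining backbone atoms."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)
```

Computing φ and ψ for every residue and plotting one against the other gives the raw material of the plot; the classification into regions then depends on the reference distribution used by the validation program.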
The PROCHECK suite is generally useful for assessing the quality of protein structures in various stages of completion. The Ramachandran analysis is especially informative. However, it is possible, at least in principle, to devise an incorrect model with fully acceptable main-chain and side-chain stereochemistry, so other methods must also be used to assess protein models.
The molecular modelling and drug design program WHAT IF (Vriend, 1990) performs a large number of geometrical checks, comparing a proposed protein model to a set of canonical distances and angles. These parameters include bond lengths and bond angles, side-chain planarity, torsion angles, interatomic distances, unusual backbone conformations and the Ramachandran plot. New additions (Vriend & Sander, 1993) include a `quality factor', and a number of checks for clashes between symmetry-related molecules. Starting from the hypothesis that atom–atom interactions are the primary determinant of protein folding, the program tests a protein model for proper packing by calculating a contact quality index. Each contact is characterized by its fragment type (80 types from the 20 residues), the atom type and the three-dimensional location of the atom relative to the local frame of the fragment. Sets of database-derived distributions are compared with the actual distribution in the protein model being tested. A good agreement with the database distribution produces a high contact quality index. A low packing score can indicate any of: poor packing, misthreading of the sequence, bad crystal contacts, bad contacts with a co-factor, or proximity to a vacant active site. The contact analysis available in WHAT IF can be used as an independent quality indicator during crystallographic refinement, or during the process of protein modelling and design.
The program VERIFY3D (Lüthy et al., 1992; Bowie et al., 1991) measures the compatibility of a protein model with its own amino-acid sequence. Each residue position in the 3D model is characterized by its environment and is represented by a row of 20 numbers in a `3D profile'. These numbers are the statistical preferences, called 3D–1D scores, of each of the 20 amino acids for this environment. Environments of residues are defined by three parameters: the area of the residue buried in the protein and inaccessible to solvent, the fraction of the side-chain area that is covered by polar atoms (O and N), and the local secondary structure. The 3D profile score, S, for the compatibility of the sequence with the model is the sum, over all residue positions, of the 3D–1D scores for the amino-acid sequence of the protein. The compatibility of segments of the sequence with their 3D structures can be assessed by plotting, against sequence number, the average 3D–1D score in a window of 21 residues. The 3D profile method rests on the observation that soluble proteins bury many hydrophobic side chains and not many polar residues.
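The two summaries described above, the total profile score S and the 21-residue windowed average, can be sketched as follows. The per-residue 3D–1D scores are taken as given input here, since the scoring table itself is not reproduced in this chapter. Truncating the window at the chain termini is an assumption of this sketch, not a documented VERIFY3D behaviour, and the function names are hypothetical.

```python
def profile_score(scores_3d1d):
    """Total 3D profile score S: the sum of the per-residue 3D-1D scores."""
    return sum(scores_3d1d)


def windowed_average(scores_3d1d, window=21):
    """Average 3D-1D score in a sliding window centred on each residue.

    Assumption: near the termini the window is truncated to the residues
    that exist, so every position still receives a value.
    """
    half = window // 2
    n = len(scores_3d1d)
    averages = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = scores_3d1d[lo:hi]
        averages.append(sum(segment) / len(segment))
    return averages


if __name__ == "__main__":
    # Toy scores; real values would come from the 3D profile table.
    toy = [0.4, 0.5, -0.2, -0.3, 0.6, 0.7, 0.1]
    print(profile_score(toy))
    print(windowed_average(toy, window=3))
```

Plotting the output of `windowed_average` against residue number gives a curve of the kind discussed in the examples later in this chapter.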
Three applications for 3D profiles exist. The first is to assess the validity of protein models (Lüthy et al., 1992). For 3D protein models known to be correct, the 3D profile score, S, for the compatibility of the amino-acid sequence with the environments formed by the model is high. In contrast, S for the compatibility with its sequence of a totally or partially wrong 3D protein model is generally low. Therefore, models that are largely incorrect or models that contain a small number of improperly built segments can be detected by low-scoring regions in the 3D profile. However, not all faulty regions are always evident directly from the profile, particularly if the misbuilt regions are at the termini, where they are obscured by the windowing procedure. The second application is to assess which is the stable oligomeric state of the folded protein, by comparing the accessibility (buried or exposed) of amino-acid side chains in the monomeric and oligomeric state (Eisenberg et al., 1992). The third application is to identify other protein sequences which are folded in the same general pattern as the structure from which the profile was prepared (Bowie et al., 1991). Predicting a protein structure from sequence requires a link between 3D structure and 1D sequence. The program VERIFY3D provides this link by reducing a 3D structure to a 1D string of environmental classes. Therefore the method can be used to evaluate any protein model or to measure the compatibility of any protein structure with its amino-acid sequence.
The program ERRAT (Colovos & Yeates, 1993) analyses the relative frequencies of noncovalent interactions between atoms of various types. It can be viewed as an extension of the earlier 3D profile approach from the residue level to the atom level. Three types of atoms are considered (C, N and O), and consequently six types of interactions are possible (CC, CN, CO, NN, NO and OO).
ERRAT operates under the hypothesis that different atom types will be distributed non-randomly with respect to each other in proteins due to complex geometric and energetic considerations, and that structural errors will lead to detectable anomalies in the pattern of interactions. Assessment of the non-bonded interactions is subject to the following restrictions: the distance between the two atoms in space is less than some preset limit, typically 3.5 Å, and the atoms within the same residue or those that are covalently bonded to each other are not considered. For each nine-residue segment of sequence, the non-bonded contacts to other atoms in the protein are tallied by atomic interaction type and the result is divided by the total number of interactions. This gives a list, or six-dimensional vector, of fractional interaction frequencies that add up to unity. In this way, each nine-residue fragment generates one point in a five-dimensional space; only five of the six fractional values are independent. A large number of observations were extracted from reliable high-resolution structures and used to establish a multivariate five-dimensional normal distribution for accurate protein structures. This distribution is used to evaluate the probability that a given set of interactions from a protein model in question is correct. Since the ERRAT evaluation is based on a normal distribution calibrated on a reliable database, it is straightforward to estimate the likelihood that each region of a candidate protein model is incorrect. This method provides an unbiased and statistically sound tool for identifying incorrectly built regions in protein models.
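The tallying step described above can be sketched as follows. This is not the ERRAT source code. The atom representation, the function names and the optional `bonded` set of covalently bonded atom-index pairs are assumptions of this illustration, as is the choice to count each atom pair once. The mean vector and inverse covariance matrix needed for the final evaluation must come from a reference database and are not given in this chapter, so they are left as caller-supplied inputs.

```python
import math
from itertools import combinations_with_replacement

ELEMENTS = ("C", "N", "O")
# The six interaction types: CC, CN, CO, NN, NO, OO.
PAIR_TYPES = ["".join(p) for p in combinations_with_replacement(ELEMENTS, 2)]


def interaction_fractions(atoms, first_residue, window=9, cutoff=3.5,
                          bonded=frozenset()):
    """Fractional frequencies of the six interaction types for one window.

    atoms: list of (residue_number, element, (x, y, z)).
    bonded: set of frozenset({i, j}) atom-index pairs to exclude as
            covalently bonded (hypothetical input format).
    Returns a list of six fractions summing to 1, or None if no contacts.
    """
    counts = dict.fromkeys(PAIR_TYPES, 0)
    window_residues = range(first_residue, first_residue + window)
    seen = set()
    for i, (res_i, el_i, xyz_i) in enumerate(atoms):
        if res_i not in window_residues or el_i not in ELEMENTS:
            continue
        for j, (res_j, el_j, xyz_j) in enumerate(atoms):
            # Skip atoms in the same residue and atom types not considered.
            if res_j == res_i or el_j not in ELEMENTS:
                continue
            pair = frozenset((i, j))
            if pair in bonded or pair in seen:
                continue
            if math.dist(xyz_i, xyz_j) < cutoff:
                seen.add(pair)
                counts["".join(sorted(el_i + el_j))] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return [counts[t] / total for t in PAIR_TYPES]


def mahalanobis_squared(point, mean, inv_cov):
    """Squared distance of a point from the centre of a multivariate normal."""
    d = [p - m for p, m in zip(point, mean)]
    size = len(d)
    return sum(d[r] * inv_cov[r][c] * d[c]
               for r in range(size) for c in range(size))
```

Because the six fractions sum to unity, only the first five would be passed to `mahalanobis_squared`. For a five-dimensional normal distribution, the squared distance follows a chi-squared distribution with five degrees of freedom, which is one way a confidence limit could be set; whether ERRAT uses exactly this statistic is not stated here.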
Regardless of the specific approach or the specific criteria for validating structural models, a reliable reference database has to be chosen by careful selection of known structures. Suitable criteria to consider when selecting a database are: protein structures determined to resolutions of 2.5 Å or better, R factors less than 25%, and good geometry, particularly of the dihedral angles of the protein backbone. In addition, the database should include examples from many diverse classes of structures and at the same time avoid multiple identical structures.
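The selection criteria listed above translate naturally into a filter. The sketch below applies the resolution and R-factor thresholds from the text and removes duplicates. The `Entry` record, its field names and the use of identical sequences as the test for `identical structures' are assumptions of this illustration; the check on backbone dihedral-angle quality is omitted because it needs a separate validation step.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical summary record for one candidate structure."""
    identifier: str
    resolution: float   # in angstroms
    r_factor: float     # as a fraction, e.g. 0.19 for 19%
    sequence: str


def select_reference_set(entries, max_resolution=2.5, max_r_factor=0.25):
    """Keep well determined, non-duplicated structures.

    Entries are visited from best to worst resolution, so when two entries
    share a sequence the better-resolved one is retained.
    """
    selected = []
    seen_sequences = set()
    for entry in sorted(entries, key=lambda e: e.resolution):
        if entry.resolution > max_resolution:
            continue
        if entry.r_factor >= max_r_factor:
            continue
        if entry.sequence in seen_sequences:
            continue
        seen_sequences.add(entry.sequence)
        selected.append(entry)
    return selected
```

A real reference set would also need the diversity across structural classes mentioned above, which a simple per-entry filter of this kind cannot guarantee.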
Several examples are presented of errors in structural models determined by X-ray crystallography that can be detected using validation methods. One is that of the small subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), which was traced essentially backwards from a poor electron-density map (Chapman et al., 1988). The program ERRAT finds that approximately 40% of the residues in this mistraced model are outside the 95% confidence limit (Fig. 21.3.5.1a). This limit is the error value above which a given region can be judged to be erroneous with 95% certainty, so a reliable model should exceed this value over less than 5% of its length. The final model of RuBisCO (Curmi et al., 1992) shows only 2% of the residues outside ERRAT's 95% confidence limit. Similarly, the 3D profile calculated from VERIFY3D for the erroneous model (Fig. 21.3.5.1b) gives a total score of 15 when matched to the sequence of the small subunit of RuBisCO. This score is well below the expected value of 58 for the correct structure of this length. Indeed, the 3D profile of the correct model (Curmi et al., 1992) (Fig. 21.3.5.1b) of RuBisCO has a score of 55. PROCHECK and WHAT IF also identify stereochemistry problems in the original model, including deviant bond angles and bond lengths, many residues in the disallowed Ramachandran regions (Fig. 21.3.5.1c), bad peptide-bond planarity, and bad non-bonded interactions. In contrast, most amino-acid residues of the correct RuBisCO model are in the allowed regions of the Ramachandran plot (Fig. 21.3.5.1d) with good overall geometry.
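The summary statistic quoted above, the percentage of a model lying outside the 95% confidence limit, is simple to compute once a per-position error value and a limit are available. The sketch below assumes both are supplied by the caller; the function names are hypothetical, and the numerical limit in the usage example is a placeholder rather than a value taken from ERRAT.

```python
def percent_outside_limit(error_values, limit):
    """Percentage of positions whose error value exceeds the limit."""
    if not error_values:
        raise ValueError("no error values supplied")
    outside = sum(1 for value in error_values if value > limit)
    return 100.0 * outside / len(error_values)


def flagged_segments(error_values, limit, first_residue=1):
    """Group consecutive positions above the limit into (start, end) ranges."""
    segments = []
    start = None
    for offset, value in enumerate(error_values):
        residue = first_residue + offset
        if value > limit:
            if start is None:
                start = residue
        elif start is not None:
            segments.append((start, residue - 1))
            start = None
    if start is not None:
        segments.append((start, first_residue + len(error_values) - 1))
    return segments


if __name__ == "__main__":
    values = [1.0, 12.0, 13.0, 3.0, 15.0]   # toy error values
    limit = 11.0                             # placeholder limit
    print(percent_outside_limit(values, limit))
    print(flagged_segments(values, limit))
```

By the reasoning given above, a reliable model would be expected to return less than 5 from `percent_outside_limit`, and the ranges from `flagged_segments` indicate where rebuilding effort might be concentrated.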
The archive of obsolete PDB entries maintained by the San Diego Supercomputer group (http://pdbobs.sdsc.edu) includes old versions of protein structures that have been withdrawn and/or replaced by the depositor with a newer version. One example is that of a protein (3xia.coor) originally solved to 3 Å in the wrong space group and later to 1.8 Å in the correct space group (1xya.coor). The ERRAT program reveals problems in the original model, with 45% of the residues outside the 95% confidence limit (Fig. 21.3.5.2a). The more recent model has only 1.5% of the residues outside the 95% confidence limit. The problem in the original model is also illustrated by the VERIFY3D plot (Fig. 21.3.5.2b) for which the average score is often below the value of 0.1 and dips below zero at four points. In contrast, the VERIFY3D plot of the revised model shows no dips below zero. Poor stereochemistry is also apparent in the Ramachandran plot of the original model (Fig. 21.3.5.2c). Only 38% of the backbone dihedral angles lie in the most favoured regions, compared to 93.8% in the revised model (Fig. 21.3.5.2d).
The potential usefulness of error-detecting programs during model building is suggested by stages in the crystal structure determination of triacylglycerol lipase from Pseudomonas cepacia (Kim et al., 1997), which was solved by MIR. The authors kindly provided us with ten different models (assigned as stage number 1–10) along the course of model building and refinement. Regions where Cα positions shifted between initial and final models correlated with regions where the error functions improved. For example, the program ERRAT points at specific regions (e.g. 18–35 and 135–165) originally assigned as polyalanine. When at the next stage of refinement these were changed to the actual amino-acid sequence, these regions behaved normally (Fig. 21.3.5.3a). This illustrates that ERRAT is able to illuminate problem areas in a structure.
VERIFY3D is sensitive to unusual environments in proteins. An illustration is offered by the structures of lipases, with and without their inhibitors. There are two general conformations known as `closed' and `open'. In the so-called `closed' structure, the catalytic triad is buried underneath a helical segment, called a `lid' (Brzozowski et al., 1991), so that hydrophobic residues tend to be buried as observed in a `normal' 3D profile. In the `open' conformation, the lipid binding site becomes accessible to the solvent, and hydrophobic surfaces (residues 140–150 and 230–250) are exposed by the movement of the `lid'. These hydrophobic exposed regions are strikingly shown in the 3D profile of the `open' structure (Fig. 21.3.5.3b), which clearly reveals the two problematic regions (140–150 and 230–250) with profile scores below zero. The exposed hydrophobic residues 140–150 from one molecule make van der Waals interactions with hydrophobic residues 230–250 from a symmetry-related molecule (Kim et al., 1997). These interactions are revealed as higher scores in those regions when inspecting the 3D profiles of the two symmetry-related molecules.
Another example of an unusual environment is that of diphtheria toxin (DT), which exists as a monomer as well as a dimer. Monomeric DT is a Y-shaped molecule with three domains known as the catalytic (C), transmembrane (T) and receptor-binding (R) domains. Crystal structures have been determined for both the `closed' monomeric form and for a domain-swapped dimeric form (Bennett et al., 1994). Upon dimerization, a massive conformational rearrangement occurs and the entire R domain from each monomer of the dimer is interchanged with the other monomer. This involves breaking the noncovalent interactions between the R domain and the C and T domains and rotating the R domain by 180° with atomic movements up to 65 Å to produce the `open' conformation. After rearrangement, each R domain reforms the same noncovalent interactions as it had in the monomer, but with the C and T domains of the other monomer. The existence of both open and closed forms of DT requires that large conformational changes occur in residues 379–387 (the hinge loop). The 3D profile of the `open' form (Fig. 21.3.5.4a) shows low scores for these residues compared to the closed monomer or dimer (Fig. 21.3.5.4b). The higher scores of the closed monomer are consistent with the greater stability of the monomer in the closed rather than the open conformation.
The past two decades have seen a surge of development in the experimental techniques of crystal structure determination. As a consequence, many structures originally solved at low resolution were later determined at higher resolution, often starting with improved phases. The archive of obsolete PDB entries maintained by the San Diego Supercomputer group (http://pdbobs.sdsc.edu) served as a benchmark for evaluating the ERRAT program. For testing, 17 pairs of protein models were selected. Each pair comprised an obsolete entry and the revised model that replaced it. Using ERRAT, the overall quality of each model was expressed as a single number according to the fraction of the structure falling below the 95% confidence limit for rejection. The overall scores are significantly better for the revised structures, most of which were analysed at improved resolution (Fig. 21.3.5.5a). This result further demonstrates the utility of ERRAT for monitoring the model-building process. Furthermore, a strong correlation is found between the percentage of residues within the 95% confidence limit given by ERRAT and the percentage of residues in the most favoured regions of the Ramachandran plot of PROCHECK (Fig. 21.3.5.5b). In general, the problematic regions detected by the two programs agree with each other.
In order to ensure the quality of the growing protein structure databases, models must be evaluated carefully during and after the structure determination process. Model evaluation can incorporate two types of measures: agreement between the model and the experimental diffraction data, and agreement between the model and the database of known structures. The latter types use the atomic coordinates of the final model, but do not rely on the diffraction data. In recent years, powerful methods of this type have been developed.
The most informative and reliable model-evaluation criteria are those that measure properties not optimized as part of the automatic refinement procedure. The free R value has become important for monitoring the progress of atomic refinement for the same reason: it is based on reflections not included in refinement. We have focused here on two programs, VERIFY3D and ERRAT, which both evaluate high-level geometric properties not optimized during atomic refinement. Each offers the convenience of a single score over a sliding window along the protein sequence. Because VERIFY3D operates on the level of amino-acid residues, it is sensitive to errors on that scale, particularly those that affect the distribution of polar and nonpolar residues. ERRAT operates on the atomic level and has proven to be particularly useful for pinpointing local regions of protein models that require further adjustments. When used in combination, these methods and others can help crystallographers produce more accurate structural models of proteins.
The programs ERRAT and VERIFY3D are available on the World Wide Web for non-commercial applications. The URL for VERIFY3D is http://nihserver.mbi.ucla.edu/Verify_3D/ and the URL for ERRAT is http://nihserver.mbi.ucla.edu/ERRATv2/. VERIFY3D and ERRAT expect a coordinate file in PDB format. The programs return plots of the type shown in this chapter.
Acknowledgements
We thank Dr Kyeong Kyu Kim and Dr Se Won Suh of the Department of Chemistry, Seoul National University, Korea, for the models of triacylglycerol lipase. This work was supported by grants NIH GM 31299, DOE DE-FG03-87ER60615 and NSF MCB 9420769.
References
Bennett, M. J., Choe, S. & Eisenberg, D. (1994). Domain swapping: entangling alliances between proteins. Proc. Natl Acad. Sci. USA, 91, 3127–3131.
Bowie, J. U., Lüthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170.
Brünger, A. T. (1992). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–475.
Brzozowski, A. M., Derewenda, U., Derewenda, Z. S., Dodson, G. G., Lawson, D. M., Turkenburg, J. P., Bjorkling, F., Huge-Jensen, B., Patkar, S. A. & Thim, L. (1991). A model for interfacial activation in lipases from the structure of fungal lipase-inhibitor complex. Nature (London), 351, 491–494.
Chapman, M. S., Suh, S. W., Curmi, P. M., Cascio, D., Smith, W. W. & Eisenberg, D. (1988). Tertiary structure of plant RuBisCO: domains and their contacts. Science, 241, 71–74.
Colovos, C. & Yeates, T. O. (1993). Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci. 2, 1511–1519.
Curmi, P. M. G., Cascio, D., Sweet, R. M., Eisenberg, D. & Schreuder, H. (1992). Crystal structure of the unactivated form of ribulose-1,5 bisphosphate carboxylase/oxygenase from tobacco refined at 2.0 Å resolution. J. Biol. Chem. 267, 16980–16989.
Eisenberg, D., Bowie, J. U., Lüthy, R. & Choe, S. (1992). Three-dimensional profiles for analyzing protein sequence-structure relationships. Faraday Discuss. Chem. Soc. 93, 25–34.
Kim, K. K., Song, H. K., Shin, D. H., Hwang, K. Y. & Suh, S. W. (1997). The crystal structure of a triacylglycerol lipase from Pseudomonas cepacia reveals a highly open conformation in the absence of a bound inhibitor. Structure, 5, 173–185.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
Lüthy, R., Bowie, J. U. & Eisenberg, D. (1992). Assessment of protein models with three-dimensional profiles. Nature (London), 356, 83–85.
Ramachandran, G. N. & Sasisekharan, V. (1968). Conformation of polypeptides and proteins. Adv. Protein Chem. 23, 283–438.
Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. J. Mol. Graphics, 8, 52–56.
Vriend, G. & Sander, C. (1993). Quality control of protein models: directional atomic contact analysis. J. Appl. Cryst. 26, 47–60.