International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. F. ch. 21.1, pp. 500-501
## Section 21.1.7.1. Data quality

Although many quality and validation criteria have been developed for assessing coordinate sets of protein models, comparatively few criteria are available for assessing the quality of the crystallographic data.

Possibly the most common mistake in papers describing protein crystal structures is an incorrectly quoted formula for the merging *R* value (calculated during data reduction),

$$R_{\mathrm{merge}} = \frac{\sum_{h}\sum_{i} |I_{h,i} - \langle I_{h}\rangle|}{\sum_{h}\sum_{i} I_{h,i}},$$

where the outer sum (*h*) is over the unique reflections (in most implementations, only those reflections that have been measured more than once are included in the summations) and the inner sum (*i*) is over the set of independent observations of each unique reflection (Drenth, 1994). Here *I*_{h,i} is the *i*th observation of reflection *h* and <*I*_{h}> is the mean of all observations of that reflection. This statistic is supposed to reflect the spread of multiple observations of the intensity of the unique reflections (where the multiple observations may derive from symmetry-related reflections, different images or different crystals). Unfortunately, *R*_{merge} is a very poor statistic, since its value increases with increasing redundancy (Weiss & Hilgenfeld, 1997; Diederichs & Karplus, 1997), even though the signal-to-noise ratio of the average intensities will be higher as more observations are included (in theory, an *N*-fold increase of the number of independent observations should improve the signal-to-noise ratio by a factor of *N*^{1/2}). At high redundancy, the value of *R*_{merge} is directly related to the average signal-to-noise ratio (Weiss & Hilgenfeld, 1997): *R*_{merge} ≃ 0.8/<*I*/σ(*I*)>.
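The double summation described above can be sketched in Python. This is a minimal illustration, not the implementation of any data-reduction program: it assumes the standard definition (sum of absolute deviations from the mean divided by the sum of intensities), assumes the indices have already been reduced to the asymmetric unit, and skips singly measured reflections. The function name `r_merge` and the example data are hypothetical.

```python
from collections import defaultdict


def r_merge(observations):
    """Merging R value for an iterable of (hkl, intensity) pairs.

    Assumption: `hkl` is already reduced to the asymmetric unit, so that
    symmetry-related observations share the same key.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    numerator = 0.0
    denominator = 0.0
    for intensities in groups.values():
        if len(intensities) < 2:
            # Singly measured reflections carry no information on spread.
            continue
        mean = sum(intensities) / len(intensities)
        numerator += sum(abs(value - mean) for value in intensities)
        denominator += sum(intensities)

    if denominator == 0.0:
        raise ValueError("no multiply measured reflections to merge")
    return numerator / denominator


# Made-up observations: two reflections measured repeatedly, one measured once.
EXAMPLE_OBSERVATIONS = [
    ((1, 0, 0), 100.0), ((1, 0, 0), 110.0),
    ((2, 0, 0), 50.0), ((2, 0, 0), 40.0), ((2, 0, 0), 45.0),
    ((3, 0, 0), 10.0),
]

print(f"R_merge = {r_merge(EXAMPLE_OBSERVATIONS):.4f}")
```

For the example data the numerator is 10 + 10 = 20 and the denominator is 210 + 135 = 345; the reflection measured only once contributes to neither sum.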

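Diederichs & Karplus (1997) have suggested a number of alternative measures that lack most of the drawbacks of *R*_{merge}. Their statistic *R*_{meas} is similar to *R*_{merge}, but includes a correction for redundancy (*m*):

$$R_{\mathrm{meas}} = \frac{\sum_{h} [m/(m-1)]^{1/2} \sum_{i} |I_{h,i} - \langle I_{h}\rangle|}{\sum_{h}\sum_{i} I_{h,i}},$$

where *m* is the number of observations of reflection *h*, *I*_{h,i} is its *i*th observation and <*I*_{h}> is the mean of those observations. Another statistic, the pooled coefficient of variation (PCV), is defined as

$$\mathrm{PCV} = \frac{\sum_{h} \{[1/(m-1)] \sum_{i} (I_{h,i} - \langle I_{h}\rangle)^{2}\}^{1/2}}{\sum_{h} \langle I_{h}\rangle}.$$

Since PCV = 1/<*I*/σ(*I*)>, this quantity also provides an indication as to whether the standard deviations σ(*I*) have been estimated appropriately. Finally, the statistic *R*_{mrgd-F}, used for assessing the quality of the reduced data, enables a direct comparison of this merging *R* value with the refinement residuals *R* and *R*_{free}.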
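As a rough sketch of how the redundancy-corrected statistics might be computed, the Python below evaluates *R*_{meas} and PCV from grouped observations. It assumes the definitions of Diederichs & Karplus (1997) as commonly quoted: a factor [*m*/(*m* − 1)]^{1/2} applied per reflection for *R*_{meas}, and the sample standard deviation of each reflection divided by the sum of mean intensities for PCV. The function names and example data are hypothetical, and singly measured reflections are skipped.

```python
import math
from collections import defaultdict


def _multiply_measured(observations):
    """Group (hkl, intensity) pairs and keep reflections with m >= 2."""
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)
    return [values for values in groups.values() if len(values) >= 2]


def r_meas(observations):
    """Redundancy-corrected merging R value (assumed standard definition)."""
    numerator = 0.0
    denominator = 0.0
    for values in _multiply_measured(observations):
        m = len(values)
        mean = sum(values) / m
        numerator += math.sqrt(m / (m - 1)) * sum(abs(v - mean) for v in values)
        denominator += sum(values)
    return numerator / denominator


def pcv(observations):
    """Pooled coefficient of variation (assumed standard definition)."""
    numerator = 0.0
    denominator = 0.0
    for values in _multiply_measured(observations):
        m = len(values)
        mean = sum(values) / m
        numerator += math.sqrt(sum((v - mean) ** 2 for v in values) / (m - 1))
        denominator += mean
    return numerator / denominator


# Made-up observations, already reduced to the asymmetric unit.
EXAMPLE_OBSERVATIONS = [
    ((1, 0, 0), 100.0), ((1, 0, 0), 110.0),
    ((2, 0, 0), 50.0), ((2, 0, 0), 40.0), ((2, 0, 0), 45.0),
]

print(f"R_meas = {r_meas(EXAMPLE_OBSERVATIONS):.4f}")
print(f"PCV    = {pcv(EXAMPLE_OBSERVATIONS):.4f}")
```

Because the correction factor is larger for reflections with few observations, *R*_{meas} is always at least as large as the uncorrected merging *R* value computed from the same data.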

Ideally, merging statistics should be quoted for all resolution shells (which should not be too broad), as well as for the entire data set. However, as a minimum, the values for the two extreme (low- and high-resolution) shells and for the entire data set should be reported.

Data completeness can be assessed by calculating what fraction of the unique reflections within a range of Bragg spacings that could in theory be observed has actually been measured. Ideally, completeness should be quoted for all resolution shells (which should not be too broad), as well as for the entire data set. However, as a minimum, the values for the two extreme (low- and high-resolution) shells and for the entire data set should be reported.
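A shell-by-shell completeness table could be produced along the following lines. The sketch assumes that the Bragg spacings of all theoretically observable unique reflections are already available (generating them from the unit cell and space group is outside its scope); the function name, the shell boundaries and the data are hypothetical.

```python
def completeness_by_shell(possible_d, observed_d, shell_edges):
    """Return (d_max, d_min, n_possible, n_observed, fraction) per shell.

    possible_d / observed_d: Bragg spacings (in Å) of the unique reflections
    that could in theory be observed and of those actually measured.
    shell_edges: shell boundaries in descending order, e.g. [50.0, 4.0, 3.0, 2.0].
    A reflection belongs to a shell if d_min <= d < d_max.
    """
    rows = []
    for d_max, d_min in zip(shell_edges[:-1], shell_edges[1:]):
        n_possible = sum(1 for d in possible_d if d_min <= d < d_max)
        n_observed = sum(1 for d in observed_d if d_min <= d < d_max)
        # An empty shell has no defined completeness.
        fraction = n_observed / n_possible if n_possible else None
        rows.append((d_max, d_min, n_possible, n_observed, fraction))
    return rows


# Made-up spacings for eight possible unique reflections, five of them measured.
POSSIBLE_D = [10.0, 5.0, 3.5, 3.2, 2.5, 2.2, 2.1, 2.0]
OBSERVED_D = [10.0, 5.0, 3.5, 2.5, 2.2]

SHELLS = completeness_by_shell(POSSIBLE_D, OBSERVED_D, [50.0, 4.0, 3.0, 2.0])
OVERALL = completeness_by_shell(POSSIBLE_D, OBSERVED_D, [50.0, 2.0])[0]

for d_max, d_min, n_possible, n_observed, fraction in SHELLS + [OVERALL]:
    print(f"{d_max:5.1f}-{d_min:3.1f} Å  {n_observed}/{n_possible}  {fraction:.3f}")
```

Calling the same function with only the outermost boundaries gives the value for the entire data set, so the per-shell and overall figures are computed consistently.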

Redundancy is defined as the number of independent observations (after merging of partial reflections) per unique reflection in the final merged and symmetry-reduced data set. Ideally, average redundancy should be quoted for all resolution shells (which should not be too broad), as well as for the entire data set. However, as a minimum, the values for the two extreme (low- and high-resolution) shells and for the entire data set should be reported.
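In code, the definition above amounts to dividing the number of independent observations by the number of unique reflections. The short Python sketch below assumes that partial reflections have already been merged and that the indices have been reduced to the asymmetric unit; the function name and data are hypothetical.

```python
from collections import Counter


def average_redundancy(reduced_hkls):
    """Mean number of independent observations per unique reflection.

    reduced_hkls: one (h, k, l) tuple per independent observation, already
    mapped to the asymmetric unit (an assumption of this sketch).
    """
    counts = Counter(reduced_hkls)
    if not counts:
        raise ValueError("no observations supplied")
    return sum(counts.values()) / len(counts)


# Made-up data: six observations of three unique reflections.
EXAMPLE_HKLS = [(1, 0, 0)] * 2 + [(2, 0, 0)] * 3 + [(3, 0, 0)]

print(f"average redundancy = {average_redundancy(EXAMPLE_HKLS):.2f}")
```

Unlike the merging statistics, this count includes reflections that were measured only once.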

The average strength or significance of the observed intensities can be expressed in different ways. Values that are often quoted include the percentage of reflections for which *I*/σ(*I*) exceeds a certain value (usually 3.0) and the average value of *I*/σ(*I*). Ideally, these numbers should be quoted for all resolution shells (which should not be too broad), as well as for the entire data set. However, as a minimum, the values for the two extreme (low- and high-resolution) shells and for the entire data set should be reported.
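The two commonly quoted values could be computed as sketched below. The sketch takes <*I*/σ(*I*)> to be the average of the per-reflection ratios (rather than the ratio of the averages), which is an assumption about the convention; the default threshold of 3.0 follows the text, while the function name and data are hypothetical.

```python
def significance_summary(merged, threshold=3.0):
    """Summarise intensity significance for (intensity, sigma) pairs.

    Returns the mean of I/sigma(I) over all reflections and the percentage
    of reflections whose I/sigma(I) exceeds `threshold`.
    """
    ratios = [intensity / sigma for intensity, sigma in merged]
    if not ratios:
        raise ValueError("no reflections supplied")
    n_significant = sum(1 for ratio in ratios if ratio > threshold)
    return {
        "mean_i_over_sigma": sum(ratios) / len(ratios),
        "percent_above_threshold": 100.0 * n_significant / len(ratios),
    }


# Made-up merged reflections as (I, sigma(I)) pairs.
EXAMPLE_MERGED = [(100.0, 10.0), (50.0, 10.0), (20.0, 10.0), (10.0, 10.0)]

SUMMARY = significance_summary(EXAMPLE_MERGED)
print(SUMMARY)
```

Applying the function to the reflections of a single resolution shell yields the per-shell values.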

The nominal resolution limits of a data set are chosen by the crystallographer, usually at the data-processing stage, and ought to reflect the range of Bragg spacings for which useful intensity data have been collected. Unfortunately, owing to the subjective nature of this process, resolution limits cannot be compared meaningfully between data sets processed by different crystallographers. Careful crystallographers will take factors such as shell completeness, redundancy and <*I*/σ(*I*)> into account, whereas others may simply look up the minimum and maximum Bragg spacing of all observed reflections. Bart Hazes (personal communication) has suggested defining the effective resolution of a data set as that resolution at which the number of observed reflections would constitute a 100% complete data set. Alternatively, Vaguine *et al.* (1999) define the effective (or optical) resolution as the expected minimum distance between two resolved peaks in the electron-density map and calculate this quantity as 2Δ_{P}/2^{1/2}, where Δ_{P} is the width of the origin Patterson peak. One day, hopefully, the term 'resolution' will be replaced by an estimate of the information content of data sets. Randy Read (personal communication) has carried out preliminary work along these lines.
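One possible reading of the completeness-based effective resolution can be sketched in Python. The sketch assumes that the number of unique reflections out to a Bragg spacing *d* scales as 1/*d*^3 and ignores any low-resolution cutoff, so that a data set with overall completeness *C* at nominal limit *d*_{min} contains as many reflections as a complete data set at *d*_{min} *C*^{-1/3}. This formula is an approximation introduced here for illustration, not a published definition, and the function name is hypothetical.

```python
def effective_resolution(d_min, completeness):
    """Completeness-based effective resolution (approximate).

    d_min: nominal high-resolution limit in Å.
    completeness: overall completeness as a fraction in (0, 1].
    Assumes the reflection count within a resolution sphere scales as 1/d^3.
    """
    if d_min <= 0.0:
        raise ValueError("d_min must be positive")
    if not 0.0 < completeness <= 1.0:
        raise ValueError("completeness must be a fraction in (0, 1]")
    return d_min * completeness ** (-1.0 / 3.0)


for fraction in (1.0, 0.9, 0.7, 0.5):
    d_eff = effective_resolution(2.0, fraction)
    print(f"nominal 2.00 Å, completeness {fraction:.0%}: effective {d_eff:.2f} Å")
```

Under this assumption a complete data set retains its nominal limit, while an incomplete one is assigned a larger (poorer) effective Bragg spacing.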

The accuracy of unit-cell parameters has been shown to be grossly overestimated for small-molecule crystal structures (Taylor & Kennard, 1986). Not intimidated by this observation, some macromolecular crystallographers routinely quote unit-cell axes of 100–200 Å with a precision of 0.01 Å. An analysis of several high-resolution protein crystal structures has revealed that surprisingly large errors in the unit-cell parameters appear to be quite common (at least if synchrotron sources are used for data collection; EU 3-D Validation Network, 1998). Such errors can be detected *a posteriori* by checking if the bond lengths in a model show any systematic, perhaps direction-dependent, variations from their target values.

From the symmetry of the diffraction pattern, the point-group symmetry of the crystal lattice can usually be derived. It is important to merge the data in the point group with the highest possible symmetry (usually assessed using merging statistics) in order to minimize the chance of making an incorrect space-group assignment (Marsh, 1995, 1997; Kleywegt *et al.*, 1996). Once the first data set has been processed, it is always useful to compute a self-rotation function. A non-origin peak of comparable strength to the origin peak will indicate that the true space group has higher symmetry. [Similarly, a self-Patterson function can be calculated at this stage to detect any purely translational NCS (Kleywegt & Read, 1997).] Once the final model is available, a search for possibly missed higher symmetry can be carried out, *e.g.* using the method developed by Hooft *et al.* (1994).

Sometimes crystallographic symmetry breaks down (pseudo-symmetry): an apparent higher symmetry at low resolution does not hold at higher resolution. In some cases, this is a consequence of the chemistry of the system studied (*e.g.* an asymmetric ligand bound by a symmetric protein dimer). In other cases, it may go undetected and complicate space-group determination and solution and refinement of the structure.

When it comes to space-group determination, many of the lessons learned by small-molecule crystallographers also apply to macromolecular crystallography (Marsh, 1995; Watkin, 1996).

### References

EU 3-D Validation Network (1998). *Who checks the checkers? Four validation tools applied to eight atomic resolution structures.* *J. Mol. Biol.* **276**, 417–436.

Diederichs, K. & Karplus, P. A. (1997). *Improved R-factors for diffraction data analysis in macromolecular crystallography.* *Nature Struct. Biol.* **4**, 269–275.

Drenth, J. (1994). *Principles of protein X-ray crystallography.* New York: Springer–Verlag.

Hooft, R. W. W., Sander, C. & Vriend, G. (1994). *Reconstruction of symmetry-related molecules from Protein Data Bank (PDB) files.* *J. Appl. Cryst.* **27**, 1006–1009.

Kleywegt, G. J., Hoier, H. & Jones, T. A. (1996). *A re-evaluation of the crystal structure of chloromuconate cycloisomerase.* *Acta Cryst.* D**52**, 858–863.

Kleywegt, G. J. & Read, R. J. (1997). *Not your average density.* *Structure*, **5**, 1557–1569.

Marsh, R. E. (1995). *Some thoughts on choosing the correct space group.* *Acta Cryst.* B**51**, 897–907.

Marsh, R. E. (1997). *The perils of* Cc *revisited.* *Acta Cryst.* B**53**, 317–322.

Taylor, R. & Kennard, O. (1986). *Accuracy of crystal structure error estimates.* *Acta Cryst.* B**42**, 112–120.

Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). *SFCHECK: a unified set of procedures for evaluating the quality of macromolecular structure-factor data and their agreement with the atomic model.* *Acta Cryst.* D**55**, 191–205.

Watkin, D. (1996). *Pseudo symmetry.* In *Proceedings of the CCP4 study weekend. Macromolecular refinement*, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 171–184. Warrington: Daresbury Laboratory.

Weiss, M. S. & Hilgenfeld, R. (1997). *On the use of the merging R factor as a quality indicator for X-ray data.* *J. Appl. Cryst.* **30**, 203–205.