International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. F, ch. 18.5, pp. 403-418
https://doi.org/10.1107/97809553602060000697
Chapter 18.5. Coordinate uncertainty
Chemistry Department, UMIST, Manchester M60 1QD, England
Full-matrix least-squares is taken as the basis for an examination of protein-structure precision. A two-atom model is used to compare the precisions of unrestrained and restrained refinements. In this model, restrained refinement determines a bond length which is the weighted mean of the unrestrained diffraction-only length and the geometric-dictionary length. As a protein example, data with 0.94 Å resolution for concanavalin A are used in unrestrained and restrained full-matrix inversions to provide e.s.d.'s σ(r) for positions and σ(l) for bond lengths. σ(r) is as small as 0.01 Å for atoms with low Debye B values but increases strongly with B. The results emphasize the distinction between unrestrained and restrained refinements and also between σ(r) and σ(l). An unrestrained full-matrix inversion for an immunoglobulin with 1.7 Å data is also discussed. Several approximate methods are examined critically. These include Luzzati plots and the diffraction-component precision index (DPI). The DPI estimate of $\sigma(r, B_{\mathrm{avg}})$ is given by a simple formula, which uses $R$ or $R_{\mathrm{free}}$ and is based on a very rough approximation to the least-squares method. Examples show its usefulness as a precision comparator for high- and low-resolution structures.
Keywords: R factors; Rfree; accuracy; atomic displacement parameters; block-matrix approximation; concanavalin A; coordinate uncertainty; DPI; diffraction-component precision index; errors; free R factor; full-matrix inversion; goodness of fit; least-squares methods; low-resolution structures; Luzzati plot; modified Fourier method for estimating coordinate uncertainty; normal equations; position error; precision; refinement; residual function; restrained full-matrix inversion for concanavalin A; restrained refinement; restraints; temperature factors; unrestrained full-matrix inversion; weighting.
Even in 1967 when the first few protein structures had been solved, it would have been hard to imagine a time when the best protein structures would be determined with a precision approaching that of small molecules. That time was reached during the 1990s. Consequently, the methods for the assessment of the precision of small molecules can be extended to good-quality protein structures.
The key idea is simply stated. At the conclusion and full convergence of a least-squares or equivalent refinement, the estimated variances and covariances of the parameters may be obtained through the inversion of the least-squares full matrix.
The inversion of the full matrix for a large protein is a gigantic computational task, but it is being accomplished in a rising number of cases. Alternatively, approximations may be sought. Often these can be no more than rough order-of-magnitude estimates. Some of these approximations are considered below.
Caveat.
Quite apart from their large numbers of atoms, protein structures show features differing from those of well ordered small-molecule structures. Protein crystals contain large amounts of solvent, much of it not well ordered. Parts of the protein chain may be floppy or disordered. All natural protein crystals are noncentrosymmetric, hence the simplifications of error assessment for centrosymmetric structures are inapplicable. The effects of incomplete modelling of disorder on phase angles, and thus on parameter errors, are not addressed explicitly in the following analysis. Nor does this analysis address the quite different problem of possible gross errors or misplacements in a structure, other than by their indication through high B values or high coordinate standard uncertainties. These various difficulties are, of course, reflected in the values of the residuals Δ used in the precision estimates.
On the problems of structure validation see Part 21 of this volume and Dodson (1998).
Some structure determinations do make a first-order correction for the effects of disordered solvent on phase angles by application of Babinet's principle of complementarity (Langridge et al., 1960; Moews & Kretsinger, 1975; Tronrud, 1997). Babinet's principle follows from the fact that if $\rho(\mathbf{x})$ is constant throughout the cell, then $F(\mathbf{h}) = 0$, except for F(0). Consequently, if the cell is divided into two regions C and D, $F_C(\mathbf{h}) = -F_D(\mathbf{h})$. Thus if D is a region of disordered solvent, $F_D(\mathbf{h})$ can be estimated from $-F_C(\mathbf{h})$. A first approximation to a disordered model may be obtained by placing negative point-atoms with very high Debye B values at all the ordered sites in region C. This procedure provides some correction for very low resolution planes. Alternatively, corrections are sometimes made by a mask bulk solvent model (Jiang & Brünger, 1994).
The application of restraints in protein refinement does not affect the key idea about the method of error estimation. A simple model for restrained refinement is analysed in Section 18.5.3, and the effect of restraints is discussed in Section 18.5.4 and later.
Much of the material in this chapter is drawn from a Topical Review published in Acta Crystallographica, Section D (Cruickshank, 1999).
Protein structures exhibiting noncrystallographic symmetry are not considered in this chapter.
A distinction should be made between the terms accuracy and precision. A single measurement of the magnitude of a quantity differs by error from its unknown true value λ. In statistical theory (Cruickshank, 1959), the fundamental supposition made about errors is that, for a given experimental procedure, the possible results of an experiment define the probability density function f(x) of a random variable. Both the true value λ and the probability density f(x) are unknown. The problem of assessing the accuracy of a measurement is thus the double problem of estimating f(x) and of assuming a relation between f(x) and λ.
Precision relates to the function f(x) and its spread.
The problem of what relationship to assume between f(x) and the true value λ is more subtle, involving particularly the question of systematic errors. The usual procedure, after correcting for known systematic errors, is to suppose that some typical property of f(x), often the mean, is the value of λ. No repetition of the same experiment will ever reveal the systematic errors, so statistical estimates of precision take into account only random errors. Empirically, systematic errors can be detected only by remeasuring the quantity with a different technique.
Care is needed in reading older papers. The word accuracy was sometimes intended to cover both random and systematic errors, or it may cover only random errors in the above sense of precision (known systematic errors having been corrected).
In recent years, the well established term estimated standard deviation (e.s.d.) has been replaced by the term standard uncertainty (s.u.). (See Section 18.5.2.3 on statistical descriptors.)
It is useful to begin with a reminder that the Debye $B = 8\pi^2\langle u^2\rangle$, where u is the atomic displacement parameter. If B = 80 Å², the r.m.s. amplitude is 1.01 Å. The centroid of an atom with such a B is unlikely to be precisely determined. For B = 40 Å², the 0.71 Å r.m.s. amplitude of an atom is approximately half a C–N bond length. For B = 20 Å², the amplitude is 0.50 Å. Even for B = 5 Å², the amplitude is 0.25 Å. The size of the atomic displacement amplitudes should always be borne in mind when considering the precision of the position of the centroid of an atom.
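The amplitudes quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the standard relation $B = 8\pi^2\langle u^2\rangle$; the function name `rms_amplitude` is illustrative only.

```python
import math

def rms_amplitude(b_debye):
    """R.m.s. displacement amplitude (Å) for a Debye B value (Å^2),
    assuming B = 8 * pi^2 * <u^2>."""
    return math.sqrt(b_debye / (8.0 * math.pi ** 2))

# The four B values discussed in the text, rounded to two decimals.
amplitudes = {b: round(rms_amplitude(b), 2) for b in (80, 40, 20, 5)}
print(amplitudes)
```

Because the amplitude scales as the square root of B, a fourfold change in B changes the amplitude by a factor of two.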
Scattering power depends on $\exp[-2B(\sin\theta/\lambda)^2] = \exp[-B/(2d^2)]$. For B = 20 Å² and d = 4, 2 or 1 Å, this factor is 0.54, 0.08 or 0.0001. For d = 2 Å and B = 5, 20 or 80 Å², the factor is again 0.54, 0.08 or 0.0001. The scattering power of an atom thus depends very strongly on B and on the resolution d. Scattering at high resolution (low d) is dominated by atoms with low B.
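As a rough numerical check of these factors, the sketch below evaluates the attenuation $\exp[-B/(2d^2)]$. The form of the factor is an assumption chosen to be consistent with the quoted values 0.54 and 0.08, and the function name `attenuation` is hypothetical.

```python
import math

def attenuation(b_debye, d_spacing):
    """Attenuation factor exp[-B / (2 d^2)] for B in Å^2 and d in Å (assumed form)."""
    return math.exp(-b_debye / (2.0 * d_spacing ** 2))

# Fixed B = 20 Å^2, varying resolution.
by_resolution = [attenuation(20.0, d) for d in (4.0, 2.0, 1.0)]
# Fixed d = 2 Å, varying B.
by_b_value = [attenuation(b, 2.0) for b in (5.0, 20.0, 80.0)]
print(by_resolution)
print(by_b_value)
```

The two lists coincide because the factor depends only on the ratio $B/d^2$. The last entry evaluates to about 4.5 × 10⁻⁵, of the same order as the 0.0001 quoted in the text.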
An immediate consequence of the strong dependence of scattering power on B is that the standard uncertainties of atomic coordinates also depend very strongly on B, especially between atoms of different B within the same structure.
[An IUCr Subcommittee on Atomic Displacement Parameter Nomenclature (Trueblood et al., 1996) has recommended that the phrase `temperature factor', though widely used in the past, should be avoided on account of several ambiguities in its meaning and usage. The Subcommittee also discourages the use of B and the anisotropic tensor B in favour of $\langle u^2\rangle$ and U, on the grounds that the latter have a more direct physical significance. The present author concurs (Cruickshank, 1956, 1965). However, as the use of B or $B_{\mathrm{eq}}$ is currently so widespread in biomolecular crystallography, this chapter has been written in terms of B.]
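In the unrestrained least-squares method, the residual
$$R = \sum_{hkl} w(hkl)\,\Delta^2(hkl)$$
is minimized, where Δ is either $|F_o| - |F_c|$ for $R_1$ or $|F_o|^2 - |F_c|^2$ for $R_2$, and w(hkl) is chosen appropriately. The summation is over crystallographically independent planes.
When R is a minimum with respect to the parameter $u_j$, $\partial R/\partial u_j = 0$, i.e.,
$$\sum_{hkl} w\,\Delta\,\frac{\partial\Delta}{\partial u_j} = 0. \tag{18.5.2.2}$$
For $R_1$, $\partial\Delta/\partial u_j = -\partial|F_c|/\partial u_j$; for $R_2$, $\partial\Delta/\partial u_j = -\partial|F_c|^2/\partial u_j$. The n parameters have to be varied until the n conditions (18.5.2.2) are satisfied. For a trial set of the $u_j$ close to the correct values, we may expand Δ as a function of the parameters by a Taylor series to the first order. Thus for $R_1$,
$$\Delta(\mathbf{u} + \mathbf{e}) = \Delta(\mathbf{u}) - \sum_i e_i\,\frac{\partial|F_c|}{\partial u_i}, \tag{18.5.2.3}$$
where $e_i$ is a small change in the parameter $u_i$, and u and e represent the whole sets of parameters and changes. The minus sign occurs before the summation, since $\Delta = |F_o| - |F_c|$, and the changes in $|F_c|$ are being considered.
Substituting (18.5.2.3) in (18.5.2.2), we get the normal equations for $R_1$,
$$\sum_i e_i \left[\sum_{hkl} w\,\frac{\partial|F_c|}{\partial u_i}\,\frac{\partial|F_c|}{\partial u_j}\right] = \sum_{hkl} w\,\Delta\,\frac{\partial|F_c|}{\partial u_j}.$$
There are n of these equations for $j = 1, \ldots, n$ to determine the n unknown $e_i$.
For $R_2$ the normal equations are
$$\sum_i e_i \left[\sum_{hkl} w\,\frac{\partial|F_c|^2}{\partial u_i}\,\frac{\partial|F_c|^2}{\partial u_j}\right] = \sum_{hkl} w\,\Delta\,\frac{\partial|F_c|^2}{\partial u_j}.$$
Both forms of the normal equations can be abbreviated to
$$\sum_i e_i\,a_{ij} = b_j.$$
For the values of $\partial|F_c|/\partial u_j$ for common parameters see, e.g., Cruickshank (1970).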
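Some important points in the derivation of the standard uncertainties of the refined parameters can be most easily understood if we suppose that the normal matrix $a_{ij}$ can be approximated by its diagonal elements. Each parameter is then determined by a single equation of the form
$$e_i \sum w\,g_i^2 = \sum w\,g_i\,\Delta,$$
where $g_i = \partial|F_c|/\partial u_i$ or $\partial|F_c|^2/\partial u_i$. Hence
$$e_i = \frac{\sum w\,g_i\,\Delta}{\sum w\,g_i^2}.$$
At the conclusion of the refinement, when R is a minimum, the variance (square of the s.u.) of the parameter $u_i$ due to uncertainties in the Δ's is
$$\sigma_i^2 = \frac{\sum w^2 g_i^2\,\sigma^2(\Delta)}{\left(\sum w\,g_i^2\right)^2}.$$
If the weights have been chosen as $w = 1/\sigma^2(|F_o|)$ or $1/\sigma^2(|F_o|^2)$, this simplifies to
$$\sigma_i^2 = \frac{1}{\sum w\,g_i^2} = \frac{1}{a_{ii}}, \tag{18.5.2.10}$$
which is appropriate for absolute weights. Equation (18.5.2.10) provides an s.u. for a parameter relative to the s.u.'s $\sigma(|F_o|)$ or $\sigma(|F_o|^2)$ of the observations.
In general, with the full matrix $a_{ij}$ in the normal equations,
$$\sigma_i^2 = (a^{-1})_{ii},$$
where $(a^{-1})_{ii}$ is an element of the matrix inverse to $a_{ij}$. The covariance of the parameters $u_i$ and $u_j$ is
$$\operatorname{cov}(i, j) = (a^{-1})_{ij}.$$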
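The difference between the diagonal approximation and the full-matrix result can be seen in a toy two-parameter problem. The sketch below is illustrative only: the derivatives, the observation s.u. and all function names are hypothetical. It assumes a normal matrix $a_{ij} = \sum w\,g_i g_j$ with absolute weights $w = 1/\sigma^2$.

```python
import math

def normal_matrix(derivatives, weights):
    """Accumulate a_ij = sum over observations of w * g_i * g_j."""
    n = len(derivatives[0])
    a = [[0.0] * n for _ in range(n)]
    for g, w in zip(derivatives, weights):
        for i in range(n):
            for j in range(n):
                a[i][j] += w * g[i] * g[j]
    return a

def invert_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Hypothetical data: five observations, two parameters, derivatives g = (1, t).
derivatives = [(1.0, t) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
sigma_obs = 0.1                                      # assumed s.u. of each observation
weights = [1.0 / sigma_obs ** 2] * len(derivatives)  # absolute weights

a = normal_matrix(derivatives, weights)
inverse = invert_2x2(a)

su_full = [math.sqrt(inverse[i][i]) for i in range(2)]  # from the inverse matrix
su_diag = [math.sqrt(1.0 / a[i][i]) for i in range(2)]  # diagonal approximation
cov_01 = inverse[0][1]                                  # covariance of the two parameters
print(su_full, su_diag, cov_01)
```

In this example the two parameters are correlated, so the diagonal approximation gives smaller s.u.'s than the full inverse. For a positive-definite normal matrix the diagonal estimate can never exceed the full-matrix one.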
In the early stages of refinement, artificial weights may be chosen to accelerate refinement. In the final stages, the weights must be related to the precision of the structure factors if parameter variances are being sought. There are two distinct ways, covering two ranges of error, in which this may be done.
In recent years, there have been developments and changes in statistical nomenclature and usage. Many aspects are summarized in the reports of the IUCr Subcommittee on Statistical Descriptors in Crystallography (Schwarzenbach et al., 1989, 1995). In the second report, inter alia, the Subcommittee emphasizes the terms uncertainty and standard uncertainty (s.u.). The latter is a replacement for the older term estimated standard deviation (e.s.d.). The Subcommittee classifies uncertainty components in two categories, based on their method of evaluation: type A, estimated by the statistical analysis of a series of observations, and type B, estimated otherwise. As an example of the latter, a type B component could allow for doubts concerning the estimated shape and dimensions of the diffracting crystal and the subsequent corrections made for absorption.
The square root S of the expression $S^2$, (18.5.2.12) above, is called the goodness of fit when the weights are the reciprocals of the absolute variances of the observations.
One recommendation in the second report does call for comment here. While agreeing that formulae like (18.5.2.13) lead to conservative estimates of parameter variances, the report suggests that this practice is based on the questionable assumption that the variances of the observations by which the weights are assigned are relatively correct but uniformly underestimated. When the goodness of fit $S > 1$, then either the weights or the model or both are suspect.
Comment is needed. The account in Section 18.5.2.2 describes two distinct ways of estimating parameter variances, covering two ranges of error. The weights envisaged in the reports (based on variances of type A and/or of type B) are of the class described for method (1). They are not the weights to be used in method (2) (though they may be a component in such weights). Method (2) implicitly assumes from the outset that there are experimental errors, some covered and others not covered by method (1), and that there are imperfections in the calculated model (as is obviously true for proteins). Method (2) avoids exploring the relative proportions and details of these error sources and aims to provide a realistic estimate of parameter uncertainties which can be used in external comparisons. It can be formally objected that method (2) does not conform to the criteria of random-variable theory, since clearly the Δ's are partially correlated through the remaining model errors and some systematic experimental errors. But it is a useful procedure. Method (1) on its own would present an optimistic view of the reliability of the overall investigation, the degree of optimism being indicated by the inverse of the goodness of fit (18.5.2.12). In method (2), if the weights are on an arbitrary scale, then $S^2$ can have an arbitrary value.
For an advanced-level treatment of many aspects of the refinement of structural parameters, see Part 8 of International Tables for Crystallography, Volume C (2004). The detection and treatment of systematic error are discussed in Chapter 8.5 therein.
Protein structures are often refined by a restrained refinement program such as PROLSQ (Hendrickson & Konnert, 1980). Here, a function of the type
$$R' = \sum_{hkl} w(hkl)\,\Delta^2(hkl) + \sum w_{\mathrm{geom}}\,(\Delta Q)^2 \tag{18.5.3.1}$$
is minimized, where Q denotes a geometrical restraint such as a bond length and ΔQ is its discrepancy from the dictionary value. Formally, all one is doing is extending the list of observations. One is adding to the protein diffraction data geometrical data from a stereochemical dictionary such as that of Engh & Huber (1991). A chain C–N bond length may be known from the dictionary with much greater precision, say 0.02 Å, than from an unrestrained diffraction-data-only protein refinement.
In a high-resolution unrestrained refinement of a small molecule, the standard uncertainty (s.u.) of a bond length A–B is often well approximated by
$$\sigma(l) = (\sigma_A^2 + \sigma_B^2)^{1/2}.$$
However, in a protein determination $\sigma(l)$ is often much smaller than either $\sigma_A$ or $\sigma_B$ because of the excellent information from the stereochemical dictionary, which correlates the positions of A and B.
Laying aside computational size and complexity, the protein precision problem is straightforward in principle. When a restrained refinement has converged to an acceptable structure and the shifts in successive rounds have become negligible, invert the full matrix. The inverse matrix immediately yields estimates of the variances and covariances of all parameters.
The dimensions of the matrix are the same whether or not the refinement is restrained. The full matrix will be rather sparse, but not nearly as sparse as in a small-molecule refinement. For the purposes of Section 18.5.3, it is irrelevant whether the residual for the diffraction data is based on $|F|$ or $|F|^2$. On the relative weighting of the diffraction and restraint terms, see Section 18.5.3.3.
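Some aspects of restrained refinement are easily understood by considering a one-dimensional protein consisting of two like atoms in the asymmetric unit, with coordinates $x_1$ and $x_2$ relative to a fixed origin and bond length $l = x_2 - x_1$. In the refinement, the normal equations are of the type $\mathbf{N}\mathbf{x} = \mathbf{e}$. For two non-overlapping like atoms, the diffraction data will yield a normal matrix
$$\mathbf{N}_{\mathrm{diff}} = \begin{pmatrix} a & 0 \\ 0 & a \end{pmatrix}$$
with inverse
$$\mathbf{N}_{\mathrm{diff}}^{-1} = \begin{pmatrix} 1/a & 0 \\ 0 & 1/a \end{pmatrix},$$
where $a = \sum w\,(\partial|F_c|/\partial x_i)^2$.
A geometric restraint on the length will yield a normal matrix
$$\mathbf{N}_{\mathrm{geom}} = \begin{pmatrix} b & -b \\ -b & b \end{pmatrix}$$
with no inverse, since its determinant is zero, where $b = w_{\mathrm{geom}}(\partial l/\partial x_i)^2$. Note $(\partial l/\partial x_i)^2 = 1$, so that $b = w_{\mathrm{geom}} = 1/\sigma^2_{\mathrm{geom}}(l)$, where $\sigma^2_{\mathrm{geom}}(l)$ is the variance assigned to the length in the stereochemical dictionary.
Combining the diffraction data and the restraint, the normal matrix becomes
$$\begin{pmatrix} a+b & -b \\ -b & a+b \end{pmatrix}$$
with inverse
$$\frac{1}{a(a+2b)}\begin{pmatrix} a+b & b \\ b & a+b \end{pmatrix}.$$
For the diffraction data alone, the variance of $x_i$ is
$$\sigma^2_{\mathrm{diff}}(x_i) = 1/a.$$
For the diffraction data plus restraint, the variance of $x_i$ is
$$\sigma^2_{\mathrm{res}}(x_i) = \frac{a+b}{a(a+2b)}. \tag{18.5.3.12}$$
Note that though the restraint says nothing about the position of $x_i$, the variance of $x_i$ has been reduced because of the coupling to the position of the other atom. In the limit when $a \ll b$, $\sigma^2_{\mathrm{res}}(x_i)$ is only half $\sigma^2_{\mathrm{diff}}(x_i)$.
The general formula for the variance of the length is
$$\sigma^2(l) = \sigma^2(x_1) + \sigma^2(x_2) - 2\operatorname{cov}(x_1, x_2).$$
For the diffraction data alone, this gives $\sigma^2_{\mathrm{diff}}(l) = 2/a = 2\sigma^2_{\mathrm{diff}}(x_i)$, as expected. For the diffraction data plus restraint,
$$\sigma^2_{\mathrm{res}}(l) = \frac{2}{a+2b}. \tag{18.5.3.15}$$
For small a, $\sigma^2_{\mathrm{res}}(l) \rightarrow 1/b = \sigma^2_{\mathrm{geom}}(l)$, as expected. The variance of the restrained length, (18.5.3.15), can be re-expressed as
$$\frac{1}{\sigma^2_{\mathrm{res}}(l)} = \frac{1}{\sigma^2_{\mathrm{diff}}(l)} + \frac{1}{\sigma^2_{\mathrm{geom}}(l)}. \tag{18.5.3.16}$$
For the two-atom protein, it can be proved directly, as one would expect from (18.5.3.16), that restrained refinement determines a length which is the weighted mean of the diffraction-only length and the geometric dictionary length.
The centroid has coordinate $c = (x_1 + x_2)/2$. It is easily found that $\sigma^2_{\mathrm{res}}(c) = \sigma^2_{\mathrm{diff}}(c) = 1/(2a)$. Thus, as expected, the restraint says nothing about the position of the molecule in the cell.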
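The two-atom model is small enough to verify numerically. The sketch below assumes a combined normal matrix with diagonal elements a + b and off-diagonal elements −b, where a is the diffraction term and b the restraint term. It inverts that matrix explicitly and derives the variances of a coordinate, the length and the centroid. The input s.u.'s and function names are hypothetical.

```python
import math

def invert_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix; raises if the matrix is singular."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0.0:
        raise ZeroDivisionError("singular normal matrix")
    return [[s / det, -q / det], [-r / det, p / det]]

def two_atom_variances(a, b):
    """Variances for the 1D two-atom model with diffraction term a and restraint term b."""
    inv = invert_2x2([[a + b, -b], [-b, a + b]])
    var_x = inv[0][0]
    cov = inv[0][1]
    var_l = inv[0][0] + inv[1][1] - 2.0 * cov           # l = x2 - x1
    var_c = (inv[0][0] + inv[1][1] + 2.0 * cov) / 4.0   # c = (x1 + x2) / 2
    return var_x, cov, var_l, var_c

a = 1.0 / 0.05 ** 2   # hypothetical diffraction-only s.u. of 0.05 Å per coordinate
b = 1.0 / 0.02 ** 2   # hypothetical dictionary s.u. of 0.02 Å on the length
var_x, cov, var_l, var_c = two_atom_variances(a, b)

# The restraint alone cannot fix the positions: its matrix is singular.
try:
    invert_2x2([[b, -b], [-b, b]])
    restraint_only_is_singular = False
except ZeroDivisionError:
    restraint_only_is_singular = True

print(math.sqrt(var_x), math.sqrt(var_l), math.sqrt(var_c))
```

The positive covariance between the two coordinates is what makes the length better determined than either position, while leaving the centroid variance at its diffraction-only value.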
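For numerical illustrations of the s.u.'s in restrained refinement, suppose the stereochemical length restraint has $\sigma_{\mathrm{geom}}(l) = 0.02$ Å. Equation (18.5.3.16) gives the length s.u. $\sigma_{\mathrm{res}}(l)$ in restrained refinement. If the diffraction-only $\sigma_{\mathrm{diff}}(l) =$ […] Å, the restrained $\sigma_{\mathrm{res}}(l)$ is 0.012 Å. If $\sigma_{\mathrm{diff}}(l) =$ […] Å, $\sigma_{\mathrm{res}}(l)$ is 0.019 Å. However large $\sigma_{\mathrm{diff}}(l)$, $\sigma_{\mathrm{res}}(l)$ never exceeds 0.02 Å.
Equation (18.5.3.12) gives the position s.u. $\sigma_{\mathrm{res}}(x_i)$ in restrained refinement. If the diffraction-only $\sigma_{\mathrm{diff}}(x_i) =$ […] Å, the restrained $\sigma_{\mathrm{res}}(x_i)$ is 0.009 Å. If $\sigma_{\mathrm{diff}}(x_i) =$ […] Å, $\sigma_{\mathrm{res}}(x_i) =$ […] Å. For large $\sigma_{\mathrm{diff}}(x_i)$, $\sigma_{\mathrm{res}}(x_i)$ tends to $\sigma_{\mathrm{diff}}(x_i)/2^{1/2}$ as the strong restraint couples the two atoms together. For very small $\sigma_{\mathrm{diff}}(x_i)$, the relatively weak restraint has no effect.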
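Numbers of this kind are easy to generate. The sketch below assumes the two-atom-model results in the forms $1/\sigma^2_{\mathrm{res}}(l) = 1/\sigma^2_{\mathrm{diff}}(l) + 1/\sigma^2_{\mathrm{geom}}(l)$ and $\sigma^2_{\mathrm{res}}(x) = (a+b)/[a(a+2b)]$ with $a = 1/\sigma^2_{\mathrm{diff}}(x)$ and $b = 1/\sigma^2_{\mathrm{geom}}(l)$. The diffraction-only inputs in the loop are illustrative choices, not values taken from the text.

```python
import math

SIGMA_GEOM_L = 0.02  # Å, assumed dictionary s.u. of the bond length

def restrained_length_su(sigma_diff_l, sigma_geom_l=SIGMA_GEOM_L):
    """Restrained length s.u. from the inverse-variance sum (two-atom model)."""
    return 1.0 / math.sqrt(1.0 / sigma_diff_l ** 2 + 1.0 / sigma_geom_l ** 2)

def restrained_position_su(sigma_diff_x, sigma_geom_l=SIGMA_GEOM_L):
    """Restrained coordinate s.u. from (a + b) / [a (a + 2b)] (two-atom model)."""
    a = 1.0 / sigma_diff_x ** 2
    b = 1.0 / sigma_geom_l ** 2
    return math.sqrt((a + b) / (a * (a + 2.0 * b)))

# Hypothetical diffraction-only s.u.'s in Å, spanning weak to strong restraint influence.
for sigma_diff in (0.005, 0.01, 0.05, 0.20):
    print(sigma_diff,
          round(restrained_length_su(sigma_diff), 4),
          round(restrained_position_su(sigma_diff), 4))
```

With these assumptions, a diffraction-only coordinate s.u. of 0.01 Å gives a restrained value of about 0.009 Å, and a diffraction-only length s.u. of 0.0080 Å gives about 0.0074 Å, the pair quoted for concanavalin A later in the chapter.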
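When only relative diffraction weights are known, as in equation (18.5.2.13), it has been common (Rollett, 1970) to scale the geometric restraint terms against the diffraction terms by replacing the restraint weights $w_{\mathrm{geom}} = 1/\sigma^2_{\mathrm{geom}}$ by $S^2/\sigma^2_{\mathrm{geom}}$, where $S^2 = \sum w\,\Delta^2/(n_{\mathrm{obs}} - n_{\mathrm{params}})$. However, this scheme cannot be used for low-resolution structures if $n_{\mathrm{obs}} < n_{\mathrm{params}}$.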
The treatment by Tickle et al. (1998a) shows that the reduction $n_{\mathrm{params}}$ in the number of degrees of freedom has to be distributed among all the data, both diffraction observations and restraints. Since the geometric restraint weights are on an absolute scale (Å⁻²), they propose that the (absolute) scale of the diffraction weights should be determined by adjustment until the restrained residual R′ (18.5.3.1) is equal to its expected value $(n_{\mathrm{obs}} + n_{\mathrm{geom}} - n_{\mathrm{params}})$, where $n_{\mathrm{geom}}$ is the number of restraints.
For a method of determining the scale of the diffraction weights based on $R_{\mathrm{free}}$, see Brünger (1993).
The geometric restraint weights were classified by the IUCr Subcommittee (Schwarzenbach et al., 1995) as derived from observations supplementary to the diffraction data, with uncertainties of type B (Section 18.5.2.3).
G. M. Sheldrick extended his SHELXL96 program (Sheldrick & Schneider, 1997) to provide extra information about protein precision through the inversion of least-squares full matrices. His programs have been used by Deacon et al. (1997) for the high-resolution refinement of native concanavalin A with 237 residues, using data at 110 K to 0.94 Å refined anisotropically. After the convergence and completion of full-matrix restrained refinement for the structure, the unrestrained full matrix (coordinates only) was computed and then inverted in a massive calculation. This led to s.u.'s $\sigma(x)$, $\sigma(y)$, $\sigma(z)$ and $\sigma(r)$ for all atoms, and to $\sigma(l)$ and $\sigma(\theta)$ for all bond lengths and angles. $\sigma(r)$ is defined as $[\sigma^2(x) + \sigma^2(y) + \sigma^2(z)]^{1/2}$. For concanavalin A the restrained full matrix was also inverted, thus allowing the comparison of restrained and unrestrained s.u.'s.
The results for concanavalin A from the inversion of the coordinate matrices of order 6402 (= 2134 × 3) are plotted in Figs. 18.5.4.1 and 18.5.4.2. Fig. 18.5.4.1 shows $\sigma(r)$ versus $B$ for the fully occupied atoms of the protein (a few atoms with B > 60 Å² are off-scale). The points are colour-coded black for carbon, blue for nitrogen and red for oxygen. Fig. 18.5.4.1(a) shows the restrained results, and Fig. 18.5.4.1(b) shows the unrestrained diffraction-data-only results. Superposed on both sets of data points are least-squares quadratic fits determined with weights […]. At high B, the unrestrained $\sigma_{\mathrm{diff}}(r)$ can be at least double the restrained $\sigma_{\mathrm{res}}(r)$, e.g., for carbon at B = 50 Å², the unrestrained $\sigma_{\mathrm{diff}}(r)$ is about 0.25 Å, whereas the restrained $\sigma_{\mathrm{res}}(r)$ is about 0.11 Å. For B < 10 Å², both $\sigma(r)$'s fall below 0.02 Å and are around 0.01 Å at B = 6 Å².
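Figure 18.5.4.1. Plots of σ(r) versus B for concanavalin A: (a) restrained, (b) unrestrained.
Figure 18.5.4.2. Plots of σ(l) versus average B for the bond lengths in concanavalin A: (a) restrained, (b) unrestrained.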
For B < 10 Å², the better precision of oxygen as compared with nitrogen, and of nitrogen as compared with carbon, can be clearly seen. At the lowest B, the unrestrained $\sigma_{\mathrm{diff}}(r)$ in Fig. 18.5.4.1(b) are almost as small as the restrained $\sigma_{\mathrm{res}}(r)$ in Fig. 18.5.4.1(a). [The quadratic fits of the restrained results in Fig. 18.5.4.1(a) are evidently slightly imperfect in making $\sigma_{\mathrm{res}}(r)$ tend almost to 0 as B tends to 0.]
Fig. 18.5.4.2 shows $\sigma(l)$ versus average $B$ for the bond lengths in the protein. The points are colour-coded black for C–C, blue for C–N and red for C–O. The restrained and unrestrained distributions are very different for high B. The restrained distribution in Fig. 18.5.4.2(a) tends to about 0.02 Å, which is the standard uncertainty of the applied restraint for 1–2 bond lengths, whereas the unrestrained distribution in Fig. 18.5.4.2(b) goes off the scale of the diagram. But for B < 10 Å², both distributions fall to around 0.01 Å.
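The differences between the restrained and unrestrained $\sigma(r)$ and $\sigma(l)$ can be understood through the two-atom model for restrained refinement described in Section 18.5.3. For that model, the equation
$$\frac{1}{\sigma^2_{\mathrm{res}}(l)} = \frac{1}{\sigma^2_{\mathrm{diff}}(l)} + \frac{1}{\sigma^2_{\mathrm{geom}}(l)} \tag{18.5.3.16}$$
relates the bond-length s.u. in the restrained refinement, $\sigma_{\mathrm{res}}(l)$, to the $\sigma_{\mathrm{diff}}(l)$ of the unrestrained refinement and the s.u. $\sigma_{\mathrm{geom}}(l)$ assigned to the length in the stereochemical dictionary. In the refinements, $\sigma_{\mathrm{geom}}(l)$ was 0.02 Å for all bond lengths. When this is combined in (18.5.3.16) with the unrestrained $\sigma_{\mathrm{diff}}(l)$ of any bond, the predicted restrained $\sigma_{\mathrm{res}}(l)$ is close to that found in the restrained full matrix.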
It can be seen from Fig. 18.5.4.2(b) that many bond lengths with average B < 10 Å² have $\sigma_{\mathrm{diff}}(l) < 0.02$ Å. For these bonds the diffraction data have greater weight than the stereochemical dictionary. Some bonds have $\sigma_{\mathrm{diff}}(l)$ as low as 0.0080 Å, with $\sigma_{\mathrm{res}}(l)$ around 0.0074 Å. This situation is one consequence of the availability of diffraction data to the high resolution of 0.94 Å. For large $\sigma_{\mathrm{diff}}(l)$ (i.e., high B), equation (18.5.3.16) predicts that $\sigma_{\mathrm{res}}(l) \rightarrow 0.02$ Å, as is found in Fig. 18.5.4.2(a).
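In an isotropic approximation, $\sigma(r) = 3^{1/2}\sigma(x)$. Equation (18.5.3.12) of the two-atom model can be recast to give
$$\sigma^2_{\mathrm{res}}(r) = \sigma^2_{\mathrm{diff}}(r)\,\frac{\sigma^2_{\mathrm{diff}}(r) + 3\sigma^2_{\mathrm{geom}}(l)}{2\sigma^2_{\mathrm{diff}}(r) + 3\sigma^2_{\mathrm{geom}}(l)}. \tag{18.5.4.1}$$
For low B, say […] in concanavalin, (18.5.4.1) gives quite good predictions of $\sigma_{\mathrm{res}}(r)$ from $\sigma_{\mathrm{diff}}(r)$. For instance, for a carbon atom with B = 15 Å², the quadratic curve for carbon in Fig. 18.5.4.1(b) shows $\sigma_{\mathrm{diff}}(r) =$ […] Å, and Fig. 18.5.4.1(a) shows $\sigma_{\mathrm{res}}(r) =$ […] Å. If $\sigma_{\mathrm{diff}}(r) =$ […] Å is used with (18.5.4.1), the resulting prediction for $\sigma_{\mathrm{res}}(r)$ is 0.028 Å.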
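However, for high B, say B = 50 Å², the quadratic curve for carbon in Fig. 18.5.4.1(b) shows $\sigma_{\mathrm{diff}}(r) = 0.25$ Å, and Fig. 18.5.4.1(a) shows $\sigma_{\mathrm{res}}(r) = 0.11$ Å, whereas (18.5.4.1) leads to the poor estimate $\sigma_{\mathrm{res}}(r) = 0.18$ Å.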
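The high-B comparison can be reproduced with a short calculation. The sketch below assumes that the two-atom result $(a+b)/[a(a+2b)]$ is carried into three dimensions through the isotropic relation $\sigma(r) = 3^{1/2}\sigma(x)$. The function name is hypothetical, and the inputs 0.25 Å and 0.02 Å are the carbon value at B = 50 Å² and the bond restraint quoted in the text.

```python
import math

def predicted_restrained_su_r(sigma_diff_r, sigma_geom_l):
    """Two-atom-model prediction of the restrained position s.u. (assumed recast form)."""
    d2 = sigma_diff_r ** 2
    g2 = 3.0 * sigma_geom_l ** 2
    return sigma_diff_r * math.sqrt((d2 + g2) / (2.0 * d2 + g2))

# Carbon at B = 50 Å^2 in concanavalin A: unrestrained s.u. about 0.25 Å.
high_b_prediction = predicted_restrained_su_r(0.25, 0.02)
print(round(high_b_prediction, 3))
```

The model prediction is close to its limiting value of $0.25/2^{1/2}$ Å and well above the 0.11 Å read from the restrained full matrix, which illustrates why a single-restraint model is too weak for high-B atoms.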
Thus at high B, equation (18.5.4.1) from the two-atom model does not give a good description of the relationship between the restrained and unrestrained $\sigma(r)$. The reason is obvious. Most atoms are linked by 1–2 bond restraints to two or three other atoms. Even a carbonyl oxygen atom linked to its carbon atom by a 0.02 Å restraint is also subject to 0.04 Å 1–3 restraints to chain Cα and N atoms. Consequently, for a high-B atom, when the restraints are applied it is coupled to several other atoms in a group, and its $\sigma_{\mathrm{res}}(r)$ is lower, compared with the diffraction-data-only $\sigma_{\mathrm{diff}}(r)$, by a greater amount than would be expected from the two-atom model.
Sheldrick has provided the results of the unrestrained lower-resolution refinement of a single-chain immunoglobulin mutant (T39K) with 218 amino-acid residues, with data to 1.70 Å refined isotropically (Usón et al., 1999). Fig. 18.5.4.3 shows $\sigma_{\mathrm{diff}}(r)$ versus $B$ for the fully occupied protein atoms. Superposed on the data points are least-squares quadratic fits. In a first very rough approximation suggested later by equation (18.5.6.3), the dependence on atom type is controlled by $1/Z$, the reciprocal of the atomic number. Sheldrick found that a $1/Z$ dependence produced too little difference between C, N and O. The proportionalities between the quadratics for $\sigma(r)$ in Figs. 18.5.4.1 and 18.5.4.3 are based on the reciprocals of the scattering factors at […], symbolized by […]. For C, N and O, these are 2.494, 3.219 and 4.089, respectively. For potential use in later work, the least-squares fits to the $\sigma(r)$ in Å are recorded here as […] for the immunoglobulin (unrestrained), concanavalin A (unrestrained) and concanavalin A (restrained), respectively.
Figure 18.5.4.3. Plot of σ(r) versus B for the immunoglobulin (unrestrained).
As might be expected from the lower resolution, the lowest $\sigma(r)$'s in the immunoglobulin are about six times the lowest $\sigma(r)$'s in concanavalin. But at B = 50 Å², the immunoglobulin curve for carbon gives $\sigma_{\mathrm{diff}}(r) =$ […] Å, which is only 50% larger than the concanavalin value of 0.25 Å.
Fig. 18.5.4.4 shows $\sigma_{\mathrm{diff}}(l)$ versus average $B$ for the immunoglobulin. Note that the lowest immunoglobulin unrestrained $\sigma_{\mathrm{diff}}(l)$ is about 0.06 Å, which is three times the 0.02 Å $\sigma_{\mathrm{geom}}(l)$ bond restraint.
Geometric restraint dictionaries typically use bond-length weights based on $\sigma_{\mathrm{geom}}(l)$ of around 0.02 or 0.03 Å. Tables 18.5.7.1–18.5.7.3 show that even 1.5 Å studies have diffraction-only errors of 0.08 Å and upwards. Only for resolutions of 1.0 Å or so are the diffraction-only errors comparable with the dictionary weights. Of course, the dictionary offers no values for many of the configurational parameters of the protein structure, including the centroid and molecular orientation.
The opening contention of this chapter in Section 18.5.1.1 is that the variances and covariances of the structural parameters of proteins can be found from the inverse of the least-squares normal matrix. But there is a caveat, chiefly that explicit account would not be taken of disorder of the solvent or of parts of the protein. Corrections by Babinet's principle of complementarity or by mask bulk solvent models are only first-order approximations. The consequences of such disorder problems, which make the variation of calculated structure factors nonlinear over the range of interest, may in future be better handled by maximum-likelihood methods (e.g. Read, 1990; Bricogne, 1993; Bricogne & Irwin, 1996; Murshudov et al., 1997). Pannu & Read (1996) have shown how the maximum-likelihood method can be cast computationally into a form akin to least-squares calculations. Full-matrix precision estimates along the lines of the present chapter are probably somewhat low.
It should also be noted that full-matrix estimates of coordinate precision are most reliably derived from matrices involving both coordinates and atomic displacement parameters. This is particularly important for lower-resolution analyses, in which atomic images overlap. The work on the high-resolution analysis of concanavalin A described in Section 18.5.4.1 was based on the very large coordinate matrix, of order 6402. The omission, because of computer limitations, of the anisotropic displacement parameters from the full matrix will have caused the coordinate s.u.'s of atoms with high $B$ to be underestimated.
Much information about the quality of a molecular model can be obtained from the eigenvalues and eigenvectors of the normal matrix (Cowtan & Ten Eyck, 2000).
The full-matrix inversions described in the previous section require massive calculations. The length of the calculations is more a matter of the order of the matrix, i.e., the number of parameters, than of the number of observations. When restraints are applied, it is the diffraction-cum-restraints full matrix which should be inverted.
With the increasing power of computers and more efficient algorithms (e.g. Tronrud, 1999; Murshudov et al., 1999), a final full matrix should be computed and inverted much more regularly, and not just for high-resolution analyses. Low-resolution analyses have a need, beyond the indications given by B values, to identify through $\sigma(r)$ estimates their regions of tolerable and less tolerable precision.
If full-matrix calculations are impractical, partial schemes can be suggested. Watenpaugh et al. (1973), in a study of rubredoxin at 1.5 Å resolution, effectively inverted the diffraction full matrix in 200-parameter blocks to obtain individual s.u.'s. A similar scheme for restrained refinements could also use overlapping large blocks. A minimal block scheme in refinements of any resolution is to calculate blocks for each residue and for the block interactions between successive residues. The inversion process could then use the matrices in running groups of three successive residues, taking only the inverted elements for the central residue as the estimates of its variances and covariances.
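The running three-residue scheme can be sketched on a toy problem. Everything below is hypothetical: the normal matrix is an artificial well conditioned band matrix with two parameters per residue and coupling only between neighbouring parameters, and the helper names are illustrative. The code inverts each window of up to three successive residues and keeps only the variances of the central residue, then compares them with the full inverse.

```python
def invert(m):
    """Gauss-Jordan inverse with partial pivoting (pure Python, small matrices only)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def toy_normal_matrix(n_res, per_res=2):
    """Hypothetical band matrix: 4 on the diagonal, -1 between adjacent parameters."""
    n = n_res * per_res
    return [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]

def windowed_variances(a, n_res, per_res=2):
    """Variances from running windows of three residues, central residue only."""
    variances = []
    for k in range(n_res):
        lo = max(k - 1, 0) * per_res          # window is clipped at the chain ends
        hi = min(k + 2, n_res) * per_res
        inv = invert([row[lo:hi] for row in a[lo:hi]])
        start = k * per_res - lo
        variances.extend(inv[i][i] for i in range(start, start + per_res))
    return variances

n_res = 5
a = toy_normal_matrix(n_res)
full_inverse = invert(a)
full_var = [full_inverse[i][i] for i in range(len(a))]
block_var = windowed_variances(a, n_res)
print(max(abs(f - b) / f for f, b in zip(full_var, block_var)))
```

For this weakly coupled toy matrix the windowed variances agree with the full-matrix values to better than 0.1%. They are never larger than the full-matrix values, so a block scheme of this kind tends to give slightly optimistic s.u.'s, and the discrepancy would grow with stronger long-range correlations.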
For low-resolution analyses with very large numbers of atoms, it might be sufficient to gain a general idea of the behaviour of $\sigma(r)$ as a function of B by computing a limited number of blocks for representative or critical groups of residues. The parameters used in the blocks should include the B's, since atomic images overlap at low resolution, thus correlating the position of one atom with the displacement parameters of its neighbours.
In the simplest form of the Fourier-map approach to centrosymmetric high-resolution structures, atomic positions are given by the maxima of the observed electron density. The uncertainty of such a position may be estimated as the uncertainty in the slope function (first derivative) divided by the curvature (second derivative) at the peak (Cruickshank, 1949a), i.e.,
$$\sigma(x) = \sigma(\text{slope})/(\text{curvature}). \tag{18.5.5.1}$$
However, atomic positions are affected by finite-series and peak-overlapping effects.
Hence, more generally, atomic positions may be determined by the requirement that the slope of the difference map at the position of atom r should be zero, or equivalently that the slopes at atom r of the observed and calculated electron densities should be equal. As a criterion this becomes the basis of the modified Fourier method (Cruickshank, 1952, 1959, 1999; Bricogne, 2001, Section 1.3.4.4.7.5), which, like the least-squares method, is applicable whether or not the atomic peaks are resolved and is applicable to noncentrosymmetric structures. For refinement, a set of n simultaneous linear equations is involved, analogous to the normal equations of least squares. Their right-hand sides are the slopes of the difference map at the trial atomic positions.
The diagonal elements of the matrix, for coordinate $x_r$ of an atom with Debye B value $B_r$, are approximately equal to […] (18.5.5.2), where […] or 2 for acentric or centric reflections. The summation is over all independent planes and their symmetry equivalents. Strictly speaking, (18.5.5.2) is a curvature only for centrosymmetric structures.
In the modified Fourier method, […] (18.5.5.3). This is simply an estimate of the r.m.s. uncertainty at a general position (Cruickshank & Rollett, 1953) in the slope of the difference map, i.e., the r.m.s. uncertainty on the right-hand side of the modified Fourier method. The s.u. $\sigma(x)$ is then given by (18.5.5.1), using (18.5.5.3) and (18.5.5.2).
An extreme example of an apparently successful gross approximation to protein precision is represented by Daopin et al.'s (1994) treatment of two independent determinations (at 1.8 and 1.95 Å) of the structure of TGF-β2. They reported that the modified Fourier-map formulae given in Section 18.5.5.2 yielded a quite good description of the B dependence of the positional differences between the two independent determinations. However, there is a formal difficulty about this application. Equation (18.5.5.1) derives from a diffraction-data-only approach, whereas the two structures were determined from restrained refinements. Even though the TNT restraint parameters and weights may have been the same in both refinements, it is slightly surprising that (18.5.5.1) should have worked well.
Equation (18.5.2.1) requires the summation of various series over all (hkl) observations; such calculations are not customarily provided in protein programs. However, due to the fundamental similarities between Fourier and least-squares methods demonstrated by Cochran (1948), Cruickshank (1949b, 1952, 1959), and Cruickshank & Robertson (1953), closely similar estimates of the precision of individual atoms can be obtained from the reciprocal of the diagonal elements of the diffraction-data-only least-squares matrix. These elements will often have been calculated already within the protein refinement programs, but possibly never output. Such estimates could be routinely available.
Between approximations using largish blocks and those using only the reciprocals of diagonal terms, a whole variety of intermediate approximations involving some off-diagonal terms could be envisaged.
Whatever method is used to estimate uncertainties, it is essential to distinguish between coordinate uncertainty, e.g., $\sigma(x)$, and position uncertainty $\sigma(r)$.
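The distinction can be made concrete with a minimal sketch. It assumes uncorrelated coordinate errors along orthogonal axes, so that the position variance is the sum of the three coordinate variances; the function names are hypothetical and are not taken from any refinement program.

```python
import math


def position_uncertainty(sigma_x, sigma_y, sigma_z):
    """Position s.u. sigma(r) in Å from three coordinate s.u.'s in Å.

    Assumption: the three coordinate errors are uncorrelated and refer
    to orthogonal axes, so their variances simply add.
    """
    return math.sqrt(sigma_x ** 2 + sigma_y ** 2 + sigma_z ** 2)


def position_uncertainty_isotropic(sigma_x):
    """Special case sigma(x) = sigma(y) = sigma(z): sigma(r) = sqrt(3) * sigma(x)."""
    return math.sqrt(3.0) * sigma_x


# Illustrative value only: a coordinate s.u. of 0.02 Å in each direction.
print(round(position_uncertainty_isotropic(0.02), 4))
```

A quoted uncertainty therefore differs by a factor of about 1.7 depending on whether it refers to a single coordinate or to the position as a whole, which is why the two must not be confused when comparing structures.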
The remainder of this chapter discusses two rough-and-ready indicators of structure precision: the diffraction-component precision index (DPI) and Luzzati plots.
From general statistical theory, one would expect the s.u. of an atomic coordinate determined from the diffraction data alone to show dependence on four factors:
$$\sigma(x) \propto (R)\,[n_{\rm atoms}/(n_{\rm obs} - n_{\rm params})]^{1/2}\,(1/s_{\rm rms}). \qquad (18.5.6.1)$$
Here, $R$ is some measure of the precision of the data; $n_{\rm atoms}$ is the recognition that the information content of the data has to be shared out; $n_{\rm obs}$ is the number of independent data, but to achieve the correct number of degrees of freedom this must be reduced by $n_{\rm params}$, the number of parameters determined; and $1/s_{\rm rms}$ is a more specialized factor arising from the sensitivity $|\partial F/\partial x|$ of the data to the parameter x. Here $s_{\rm rms}$ is the r.m.s. reciprocal radius of the data. Any statistical error estimate must show some correspondence to these four factors.
Cruickshank (1960) offered a simple order-of-magnitude formula for $\sigma(x)$ in small molecules. It was intended for use in experimental design: how many data of what precision are needed to achieve a given precision in the results? The formula, derived from a very rough estimate of a least-squares diagonal element in non-centrosymmetric space groups, was
$$\sigma(x_i) = (1/2)\,(N_i/p)^{1/2}\,[R/s_{\rm rms}]. \qquad (18.5.6.2)$$
Here $p = n_{\rm obs} - n_{\rm params}$, R is the usual residual $\sum|\Delta F|/\sum|F|$ and $N_i$ is the number of atoms of type i needed to give scattering power at $s_{\rm rms}$ equal to that of the asymmetric unit of the structure, i.e., $\sum_j f_j^2 \equiv N_i f_i^2$. [The formula has also proved very useful in a systematic study of coordinate precision in the many thousands of small-molecule structure analyses recorded in the Cambridge Structural Database (Allen et al., 1995a,b).]
For small molecules, the above definition of $N_i$ allowed the treatment of different types of atom with not-too-different B's. However, it is not suitable for individual atoms in proteins where there is a very large range of B values and some atoms have B's so large as to possess negligible scattering power at $s_{\rm rms}$.
Often, as in isotropic refinement, $n_{\rm params} \simeq 4N_{\rm atoms}$, where $N_{\rm atoms}$ is the total number of atoms in the asymmetric unit. For fully anisotropic refinement, $n_{\rm params} \simeq 9N_{\rm atoms}$.
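The parameter count and the resulting degrees of freedom can be sketched as follows. The sketch assumes four parameters per atom for isotropic refinement (x, y, z and one B) and nine per atom for fully anisotropic refinement (x, y, z and six displacement terms), and it ignores occupancies, scale factors and any other global parameters. The function names are hypothetical.

```python
def count_parameters(n_atoms, anisotropic=False):
    """Approximate number of refined parameters for n_atoms atoms.

    Assumption: 4 parameters per atom (isotropic) or 9 per atom
    (fully anisotropic); global parameters are neglected.
    """
    per_atom = 9 if anisotropic else 4
    return per_atom * n_atoms


def degrees_of_freedom(n_obs, n_atoms, anisotropic=False):
    """p = n_obs - n_params; may be zero or negative at low resolution."""
    return n_obs - count_parameters(n_atoms, anisotropic)


# Illustrative numbers only: 2000 atoms and 30000 independent reflections.
print(degrees_of_freedom(30000, 2000))                    # isotropic model
print(degrees_of_freedom(30000, 2000, anisotropic=True))  # anisotropic model
```

With these illustrative numbers the same data set leaves 22000 degrees of freedom for an isotropic model but only 12000 for an anisotropic one, which shows how quickly p shrinks as the model becomes more elaborate.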
A first very rough extension of (18.5.6.2) for application in proteins to an atom with $B = B_i$ is
$$\sigma(x, B_i) = k\,(N_i/p)^{1/2}\,[g(B_i)/g(B_{\rm avg})]\,C^{-1/3}\,R\,d_{\min}, \qquad (18.5.6.3)$$
where k is about 1.0, $B_{\rm avg}$ is the average B for fully occupied sites and C is the fractional completeness of the data to $d_{\min}$. In deriving (18.5.6.3) from (18.5.6.2), $1/s_{\rm rms}$ has been replaced by $1.3d_{\min}$, and the factor 0.65 has been increased to 1.0 as a measure of caution in the replacement of a full matrix by a diagonal approximation. $g(B) = 1 + a_1B + a_2B^2$ is an empirical function to allow for the dependence of $\sigma(x)$ on B. However, the results in Section 18.5.4.2 showed that the parameters $a_1$ and $a_2$ depend on the structure.
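As also mentioned in Section 18.5.4.2, Sheldrick has found that the $Z_i$ in $N_i$ is better replaced by $Z_i^{\#}$, the scattering factor at $\sin\theta/\lambda = 0.3$ Å$^{-1}$. Hence, $N_i$ may be taken as
$$N_i = \sum_j Z_j^{\#2}\big/Z_i^{\#2}. \qquad (18.5.6.4)$$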
A useful comparison of the relative precision of different structures may be obtained by comparing atoms with the respective $B = B_{\rm avg}$ in the different structures. (18.5.6.3) then reduces to
$$\sigma(x, B_{\rm avg}) = 1.0\,(N_i/p)^{1/2}\,C^{-1/3}\,R\,d_{\min}. \qquad (18.5.6.5)$$
The smaller the $d_{\min}$ and the R, the better the precision of the structure. If the difference between oxygen, nitrogen and carbon atoms is ignored, $N_i$ may be taken simply as the number of fully occupied sites. For heavy atoms, (18.5.6.4) must be used for $N_i$.
Equation (18.5.6.5) is not to be regarded as having absolute validity. It is a quick and rough guide for the diffraction-data-only error component for an atom with Debye B equal to the $B_{\rm avg}$ for the structure. It is named the diffraction-component precision index, or DPI. It contains none of the restraint data.
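As a rough illustration, the DPI can be evaluated in a few lines. The sketch below assumes the usual form of the index, $\sigma(x, B_{\rm avg}) = k\,(N_i/p)^{1/2}\,C^{-1/3}\,R\,d_{\min}$ with k = 1.0 and $p = n_{\rm obs} - n_{\rm params}$; the function name, argument names and input numbers are hypothetical and are not taken from any table in this chapter.

```python
import math


def dpi_sigma_x(n_sites, n_obs, n_params, completeness, r_factor, d_min, k=1.0):
    """Rough diffraction-only s.u. (Å) of a coordinate for an atom with B = B_avg.

    Assumed form: k * sqrt(N_i / p) * C**(-1/3) * R * d_min, with
    p = n_obs - n_params. n_sites stands in for N_i, taken here as the
    number of fully occupied sites (C, N and O treated alike).
    """
    p = n_obs - n_params  # degrees of freedom
    if p <= 0:
        raise ValueError("p = n_obs - n_params must be positive for the R-based DPI")
    if not 0.0 < completeness <= 1.0:
        raise ValueError("completeness must lie in (0, 1]")
    return k * math.sqrt(n_sites / p) * completeness ** (-1.0 / 3.0) * r_factor * d_min


# Made-up inputs: 1000 sites, 20000 reflections, isotropic model (4000 parameters),
# complete data to 2.0 Å and R = 0.20.
sigma_x = dpi_sigma_x(n_sites=1000, n_obs=20000, n_params=4000,
                      completeness=1.0, r_factor=0.20, d_min=2.0)
print(f"sigma(x, B_avg) ~ {sigma_x:.3f} Å")
```

The explicit check on p is deliberate: when the parameters outnumber the data the square root has no real value, and it seems safer for a sketch like this to fail loudly than to return a meaningless number.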
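For low-resolution structures, the number of parameters may exceed the number of diffraction data. In (18.5.6.3) and (18.5.6.5), $p = n_{\rm obs} - n_{\rm params}$ is then negative, so that $\sigma(x)$ is imaginary. This difficulty can be circumvented empirically by replacing p with $n_{\rm obs}$ and R with $R_{\rm free}$ (Brünger, 1992). The counterpart of the DPI (18.5.6.5) is then
$$\sigma(x, B_{\rm avg}) = 1.0\,(N_i/n_{\rm obs})^{1/2}\,C^{-1/3}\,R_{\rm free}\,d_{\min}. \qquad (18.5.6.6)$$
Here $n_{\rm obs}$ is the number of reflections included in the refinement, not the number in the $R_{\rm free}$ set.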
It may be asked: how can there be any estimate for the precision of a coordinate from the diffraction data only when there are insufficient diffraction data to determine the structure? By following the line of argument of Cruickshank's (1960) analysis, (18.5.6.6) is a rough estimate of the square root of the reciprocal of one diagonal element of the diffraction-only least-squares matrix. All the other parameters can be regarded as having been determined from a diffraction-plus-restraints matrix.
Clearly, (18.5.6.6) can also be used as a general alternative to (18.5.6.5) as a DPI, irrespective of whether the number of degrees of freedom $p = n_{\rm obs} - n_{\rm params}$ is positive or negative.
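The free-R variant can be sketched in the same way. The code assumes the form $\sigma(x, B_{\rm avg}) = k\,(N_i/n_{\rm obs})^{1/2}\,C^{-1/3}\,R_{\rm free}\,d_{\min}$ with k = 1.0, where $n_{\rm obs}$ counts only the reflections used in refinement; the names and numbers are hypothetical.

```python
import math


def dpi_sigma_x_free(n_sites, n_obs, completeness, r_free, d_min, k=1.0):
    """Rough diffraction-only coordinate s.u. (Å) based on R_free.

    Assumed form: k * sqrt(N_i / n_obs) * C**(-1/3) * R_free * d_min.
    n_obs is the number of reflections used in refinement (the free set
    is excluded), so the estimate stays real even if n_params > n_obs.
    """
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    if not 0.0 < completeness <= 1.0:
        raise ValueError("completeness must lie in (0, 1]")
    return k * math.sqrt(n_sites / n_obs) * completeness ** (-1.0 / 3.0) * r_free * d_min


# Made-up low-resolution case: 1000 sites but only 4000 working reflections,
# i.e. no more data than the roughly 4000 isotropic parameters.
sigma_x = dpi_sigma_x_free(n_sites=1000, n_obs=4000,
                           completeness=1.0, r_free=0.30, d_min=3.0)
print(f"sigma(x, B_avg) ~ {sigma_x:.2f} Å")
```

Because the number of parameters never enters, this form can be applied uniformly across a set of structures, at the cost of depending on how the free set was chosen.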
Comment. When p is positive, (18.5.6.6) would be exactly equivalent to (18.5.6.5) only if $R_{\rm free} = (n_{\rm obs}/p)^{1/2}R$. Tickle et al. (1998b) have shown that the expected relationship in a restrained refinement is actually
where
, the latter term, as in (18.5.3.1)
, being the weighted sum of the squares of the restraint residuals.
The DPI (18.5.6.9) with R was offered as a quick and rough guide for the diffraction-data-only error for an atom with $B = B_{\rm avg}$. The necessary data for the comparison with the two unrestrained full-matrix inversions of Section 18.5.5 are given in Table 18.5.7.1. For concanavalin A with
, the full-matrix quadratic (18.5.4.2b)
gives 0.033 Å for a carbon atom and the DPI gives 0.034 Å for an unspecified atom. For the immunoglobulin with
, the full-matrix quadratic (18.5.4.2a)
gives
for a carbon atom, while the DPI gives 0.22 Å.
For these two structures, the simple DPI formula compares surprisingly well with the unrestrained full-matrix calculations at $B = B_{\rm avg}$.
For the restrained full-matrix calculations on concanavalin A, the quadratic (18.5.4.2c) with
gives
for a carbon atom, which is only 15% smaller than the unrestrained 0.033 Å. This small decrease matches the discussion of
and
in Section 18.5.4.1
following equation (18.5.4.1)
. But that discussion also indicates that for the immunoglobulin, the restrained
, which was not computed, will be proportionately much lower than the unrestrained value of
, since the restraints are relatively more important in the immunoglobulin.
Table 18.5.7.2 shows a range of examples of the application of the DPI (18.5.6.9) using R to proteins of differing precision, starting with the smallest $d_{\min}$. In all the examples, $N_i$ has been set equal to $N_{\rm atoms}$, the total number of atoms. The ninth and tenth columns show $\langle\Delta r\rangle$ values derived from Luzzati (1952) and Read (1986) plots described later in Section 18.5.8.
The first entry is for crambin at 0.83 Å resolution and 130 K (Stec et al., 1995). Their results were obtained from an unrestrained full-matrix anisotropic refinement. Inversion of the full matrix gave s.u.'s
for backbone atoms, 0.0168 Å for side-chain atoms and 0.0409 Å for solvent atoms, with an average for all atoms of 0.022 Å. The DPI
corresponds to
, which is satisfactorily intermediate between the full-matrix values for the backbone and side-chain atoms.
Sevcik et al. (1996) carried out restrained anisotropic full-matrix refinements on data from two slightly different crystals of ribonuclease Sa, with
of 1.15 and 1.20 Å. They inverted full-matrix blocks containing parameters of 20 residues to estimate coordinate errors. The overall r.m.s. coordinate error for protein atoms is given as 0.03 Å, and for all atoms (including waters and ligands) as 0.07 Å for MGMP and 0.05 Å for MSA. The DPI gives
for both structures.
The next entries concern the two lower-resolution (1.8 and 1.95 Å) studies of TGF-β2 (Daopin et al., 1994
). The DPI gives
for 1TGI and 0.24 Å for 1TGF. This indicates an r.m.s. position difference between the structures for atoms with
of
. Daopin et al. reported the differences between the two determinations, omitting poor parts, as
(main chain) and 0.29 Å (all atoms).
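The comparison of two independent determinations rests on combining their individual uncertainties. A minimal sketch, assuming the errors in the two structures are independent so that the variances add, is given below; the function name and the input values are hypothetical.

```python
import math


def expected_rms_difference(sigma_r_a, sigma_r_b):
    """Expected r.m.s. position difference (Å) between two determinations.

    Assumption: the position errors of the two structures are independent,
    so the variance of the difference is the sum of the two variances.
    """
    return math.sqrt(sigma_r_a ** 2 + sigma_r_b ** 2)


# Illustrative values only, not the entries of Table 18.5.7.2.
print(round(expected_rms_difference(0.30, 0.40), 3))
```

If the observed r.m.s. difference between two structures is much larger than this combined value, either the uncertainties are underestimated or the structures differ genuinely.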
Human diferric lactoferrin (Haridas et al., 1995) is an example of a large protein at the lower resolution of 2.2 Å, with a high value of
, leading to
.
Three crystal forms of thaumatin were studied by Ko et al. (1994). The orthorhombic and tetragonal forms diffracted to 1.75 Å, but the monoclinic C2 form diffracted only to 2.6 Å. The structures with 1552 protein atoms were successfully refined with restraints by XPLOR and TNT. For the monoclinic form, the number of parameters exceeds the number of diffraction observations, so $p = n_{\rm obs} - n_{\rm params}$ is negative and no estimate by (18.5.6.9) of the diffraction-data-only error is possible. The DPI (18.5.6.9) gives 0.17 and 0.16 Å for the orthorhombic and tetragonal forms, respectively.
As in the case of monoclinic thaumatin, for low-resolution structures the number of parameters may exceed the number of diffraction data. To circumvent this difficulty, it was proposed in Section 18.5.6.3 to replace $p$ by $n_{\rm obs}$ and R by $R_{\rm free}$ in a revised formula (18.5.6.10) for the DPI. Table 18.5.7.3 shows examples for some structures for which both R and $R_{\rm free}$ were available. The second row for each protein shows the alternative values for
,
and the DPI
from (18.5.6.10)
.
For the structures with , the DPI is much the same whether it is based on R or
$R_{\rm free}$.
Tickle et al. (1998a) have made full-matrix error estimates for isotropic restrained refinements of γB-crystallin with
and of βB2-crystallin with
. The DPI
calculated for the two structures is 0.14 and 0.25 Å with R in (18.5.6.9), and 0.14 and 0.22 Å with $R_{\rm free}$ in (18.5.6.10). The full-matrix weighted averages of $\sigma(r)$ for all protein atoms were 0.10 and 0.15 Å, for only main-chain atoms 0.05 and 0.08 Å, for side-chain atoms 0.14 and 0.20 Å, and for water oxygens 0.27 and 0.35 Å. Again, the DPI gives reasonable overall indices for the quality of the structures.
For the complex of bovine ribonuclease A and porcine ribonuclease inhibitor (Kobe & Deisenhofer, 1995) with
, the number of reflections is only just larger than the number of parameters, so that
is very large, and the DPI with R gives an unrealistic 1.85 Å. With
,
.
The HyHEL-5–lysozyme complex (Cohen et al., 1996) had
. Here the number of reflections is less than the number of parameters, but the
formula gives
.
The DPI (18.5.6.9) or (18.5.6.10) provides a very simple formula for $\sigma(r, B_{\rm avg})$. It is based on a very rough approximation to a diagonal element of the diffraction-data-only matrix. Using a diagonal element is a reasonable approximation for atomic resolution structures, but for low-resolution structures there will be significant off-diagonal terms between overlapping atoms. The effect can be simulated in the two-atom protein model of Section 18.5.3.2 by introducing positive off-diagonal elements into the diffraction-data matrix (18.5.3.3)
. As expected,
is increased. So the DPI will be an underestimate of the diffraction component in low-resolution structures.
However, the true restrained variance in the new counterpart of (18.5.3.12) remains less than the diagonal diffraction result (18.5.3.11). Thus for low-resolution structures, the DPI should be an overestimate of the true uncertainty given by a restrained full-matrix calculation (where the restraints act to hold the overlapping atoms apart). This is confirmed by the results for the 2.1 Å study of βB2-crystallin (Tickle et al., 1998a) discussed in Section 18.5.7.3 and Table 18.5.7.3. The restrained full-matrix average for all protein atoms was 0.15 Å, compared with the DPI 0.25 Å (on R) or 0.22 Å (on $R_{\rm free}$). The ratio between the unrestrained DPI and the restrained full-matrix average is consistent with a view of a low-resolution protein as a chain of effectively rigid peptide groups. The ratio no doubt gets much worse for resolutions of 3 Å and above.
The DPI estimate of $\sigma(r, B_{\rm avg})$ is given by a formula of `back-of-an-envelope' simplicity. $B_{\rm avg}$ is taken to be the average B for fully occupied sites, but the weights implicit in the averaging are not well defined in the derivation of the DPI. Thus the DPI should perhaps be regarded as simply offering an estimate of a typical $\sigma(r)$ for a carbon or nitrogen atom with a mid-range B. From the evidence of the tables in this section, except at low resolution, it seems to give a useful overall indication of protein precision, even in restrained refinements.
The DPI evidently provides a method for the comparative ranking of different structure determinations. In this regard it is a complement to the general use of the resolution $d_{\min}$ as a quick indicator of possible structural quality.
Note that (18.5.6.3) and (18.5.6.4) offer scope for making individual error estimates for atoms of different B and Z.
Luzzati (1952) provided a theory for estimating, at any stage of a refinement, the average positional shifts which would be needed in an idealized refinement to reach $R = 0$. He did not provide a theory for estimating positional errors at the end of a normal refinement.
Luzzati gave families of curves for R versus $2\sin\theta/\lambda$ for varying average positional errors $\langle\Delta r\rangle$ for both centrosymmetric and noncentrosymmetric structures. The curves do not depend on the number N of atoms in the cell. They all rise from $R = 0$ at $2\sin\theta/\lambda = 0$ to the Wilson (1950) values 0.828 and 0.586 for random structures at high $2\sin\theta/\lambda$. Table 18.5.8.1 gives R as a function of $\langle\Delta r\rangle\,(2\sin\theta/\lambda)$ for three-dimensional noncentrosymmetric structures.
In a footnote (p. 807), Luzzati suggested that at the end of a normal refinement (with R nonzero due to experimental and model errors, etc.), the curves would indicate an upper limit for $\langle\Delta r\rangle$. He noted that typical small-molecule $\langle\Delta r\rangle$'s of 0.01–0.02 Å, if used as $\langle\Delta r\rangle$ in the plots, would give much smaller R's than are found at the end of a refinement.
As examples, the Luzzati plots for the two structures of TGF-β2 are shown in Fig. 18.5.8.1. Daopin et al. (1994) inferred average $\langle\Delta r\rangle$'s around 0.21 Å for 1TGI and 0.23 Å for 1TGF.
Figure 18.5.8.1. Luzzati plots showing the refined R factor as a function of resolution for 1TGI (solid squares) and 1TGF (open squares) (Daopin et al., 1994).
Of the three Luzzati assumptions summarized above, the most attractive is the third, which does not require the atoms to be identical nor the position errors to be small. For proteins, there are very obvious difficulties with assumption (2). Errors do depend very strongly on Z and B. In the high-angle data shells, atoms with large B's contribute neither to nor to
, and so have no effect on R in these shells. In their important paper on protein accuracy, Chambers & Stroud (1979)
said `the [Luzzati] estimate derived from reflections in this range applies mainly to [the] best determined atoms.'
Thus a Luzzati plot seems to allow a cautious upper-limit statement about the precision of the best parts of a structure, but it gives little indication for the poor parts.
One reason for the past popularity of Luzzati plots has been that the R values for the middle and outer shells of a structure often roughly follow a Luzzati curve. Evidently, the effective average $\langle\Delta r\rangle$ for the structure must be decreasing as $2\sin\theta/\lambda$ increases, since atoms of high B are ceasing to contribute, whereas the proportionate experimental errors must be increasing. This also suggests that the upper limit for $\langle\Delta r\rangle$ for the low-B atoms could be estimated from the lowest Luzzati theoretical curve touched by the experimental R plot. Thus in Fig. 18.5.8.1 the upper limits for the low-B atoms could be taken as 0.18 and 0.21 Å, rather than the 0.21 and 0.23 Å chosen by Daopin et al.
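The idea of taking the lowest Luzzati curve touched by the experimental R plot can be sketched numerically. The sketch replaces the tabulated acentric curves by their small-error linear limit, $R \simeq (2\pi)^{1/2}\,s\,\langle\Delta r\rangle$ with $s = 2\sin\theta/\lambda = 1/d$. That substitution is an assumption made here for simplicity: it is reasonable only while $s\langle\Delta r\rangle$ is small (roughly below 0.1), and the tabulated curves should be used beyond that. The function names and shell data are hypothetical.

```python
import math

# Slope of the acentric Luzzati curves in the small-error limit (assumption:
# the linear approximation is used in place of the tabulated curves).
LUZZATI_SLOPE = math.sqrt(2.0 * math.pi)


def implied_mean_error(r_shell, d_spacing):
    """<Delta r> (Å) of the linearized Luzzati curve passing through one shell.

    r_shell is the R factor of the shell and d_spacing its mean
    resolution in Å, so that s = 1 / d_spacing.
    """
    s = 1.0 / d_spacing
    return r_shell / (LUZZATI_SLOPE * s)


def luzzati_upper_limit(shells):
    """Lowest linearized curve touched by the (d_spacing, R) shell points."""
    return min(implied_mean_error(r, d) for d, r in shells)


# Made-up middle and outer shells as (d in Å, R) pairs.
shells = [(3.0, 0.18), (2.5, 0.20), (2.2, 0.23), (2.0, 0.26)]
print(f"upper limit on <Delta r> ~ {luzzati_upper_limit(shells):.2f} Å")
```

Taking the minimum over shells mirrors the graphical procedure: each shell point lies on one curve of the family, and the curve with the smallest $\langle\Delta r\rangle$ is the lowest one the experimental plot reaches.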
From the introduction of $R_{\rm free}$ by Brünger (1992) and the discussion of $R_{\rm free}$ by Tickle et al. (1998b), it can be seen that Luzzati plots should be based on a residual more akin to $R_{\rm free}$ than R in order to avoid bias from the fitting of data.
The mean positional error of atoms can also be estimated from the $\sigma_A$ plots of Read (1986, 1990). This method arose from Read's analysis of improved Fourier coefficients for maps using phases from partial structures with errors. It is preferable in several respects to the Luzzati method, but like the Luzzati method it assumes that the coordinate distribution is the same for all atoms. Luzzati and/or Read estimates of $\langle\Delta r\rangle$ are available for some of the structures in Tables 18.5.7.2 and 18.5.7.3. Often, the two estimates are not greatly different.
Luzzati plots are fundamentally different from other statistical estimates of error. The Luzzati theory applies to an idealized incomplete refinement and estimates the average shifts needed to reach $R = 0$. In the least-squares method, the equations for shifts are quite different from the equations for estimating variances in a converged refinement. However, Luzzati-style plots of R versus
can be reinterpreted to give statistically based estimates of
.
During Cruickshank's (1960) derivation of the approximate equation (18.5.6.2)
for
in diagonal least squares, he reached an intermediate equation
He then assumed R to be independent of
and took R outside the summation to reach (18.5.6.2)
above.
Luzzati (1952) calculated the acentric residual R as a function of
, the average radial error of the atomic positions. His analysis shows that R is a linear function of s and
for a substantial range of
, with
The theoretical Luzzati plots of R are nearly linear for small-to-medium
(see Fig. 18.5.8.1)
. If we substitute this R in the least-squares estimate (18.5.8.1)
and use the three-dimensional-Gaussian relation
, some manipulation (Cruickshank, 1999
) along the lines of Section 18.5.6
eventually yields a statistically based formula,
where
is the value of R at some value of
on the selected Luzzati curve. Equation (18.5.8.3)
provides a means of making a very rough statistical estimate of error for an atom with
(the average B for fully occupied sites) from a plot of R versus
.
Protein structures always show a great range of B values. The Luzzati theory effectively assumes that all atoms have the same B. Nonetheless, the Luzzati method applied to high-angle data shells does provide an upper limit for $\langle\Delta r\rangle$ for the atoms with low B. It is an upper limit since experimental errors and model imperfections are not allowed for in the theory.
Low-resolution structures can be determined validly by using restraints, even though the number of diffraction observations is less than the number of atomic coordinates. The Luzzati method, based preferably on $R_{\rm free}$, can be applied to the atoms of low B in such structures. As the number of observations increases, and the resolution improves, the Luzzati $\langle\Delta r\rangle$ increasingly overestimates the true $\langle\Delta r\rangle$ of the low-B atoms.
In the use of Luzzati plots, the method of refinement, and its degree of convergence, is irrelevant. A Luzzati plot is a statement for the low-B atoms about the maximum errors associated with a given structure, whether converged or not.