International Tables for Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 3.6, pp. 158-162

Section 3.6.6.2. Refinement

P. M. D. Fitzgerald (a)*, J. D. Westbrook (b), P. E. Bourne (c), B. McMahon (d), K. D. Watenpaugh (e) and H. M. Berman (f)

(a) Merck Research Laboratories, Rahway, New Jersey, USA
(b) Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, New Jersey, USA
(c) Research Collaboratory for Structural Bioinformatics, San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
(d) International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
(e) retired; formerly Structural, Analytical and Medicinal Chemistry, Pharmacia Corporation, Kalamazoo, Michigan, USA
(f) Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, New Jersey, USA
Correspondence e-mail: paula_fitzgerald@merck.com

3.6.6.2. Refinement


The categories describing refinement are as follows:

REFINE group
  Overall description of the refinement (§3.6.6.2.1)
    REFINE
    REFINE_FUNCT_MINIMIZED
  Analysis of the refined structure (§3.6.6.2.2)
    REFINE_ANALYZE
  Restraints and refinement by shells of resolution (§3.6.6.2.3)
    REFINE_LS_RESTR
    REFINE_LS_RESTR_NCS
    REFINE_LS_RESTR_TYPE
    REFINE_LS_SHELL
    REFINE_LS_CLASS
  Equivalent atoms in the refinement (§3.6.6.2.4)
    REFINE_B_ISO
    REFINE_OCCUPANCY
  History of the refinement (§3.6.6.2.5)
    REFINE_HIST

The macromolecular CIF dictionary contains many more data items for describing the refinement process than the core CIF dictionary does. In addition to new items in the REFINE category itself, additional categories have been introduced to describe in great detail the function minimized and the restraints applied, and the history of the refinement process, which often has many cycles. The REFINE_ANALYZE category can be used to give details of many of the quantities that may be used to assess the quality of the refinement. The REFINE_LS_SHELL category allows results to be reported by shells of resolution, and in effect replaces the more general core CIF category REFINE_LS_CLASS.

3.6.6.2.1. Overall description of the refinement


The data items in these categories are as follows:

(a) REFINE [Scheme scheme60]

(b) REFINE_FUNCT_MINIMIZED [Scheme scheme61]

The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_) except where indicated by the ~ symbol. Data items marked with a plus (+) have companion data names for the standard uncertainty in the reported value, formed by appending the string _esd to the data name listed.

There is already an extensive set of data names in the REFINE category of the core dictionary, and Section 3.2.3.1 should be read alongside the present section. The only data items discussed here are entries in the mmCIF dictionary that have no counterpart in the core CIF dictionary. Analogues of a number of R factors in the core CIF dictionary have been added to the mmCIF dictionary so that these same R factors can be expressed independently for the free and working sets of reflections. The remaining new data items have more specialized roles, which are discussed below.

The data item _refine.entry_id has been added to the REFINE category to provide the formal category key required by the DDL2 data model.

Many macromolecular structure refinements now use the statistical cross-validation technique of monitoring a 'free' R factor (Brünger, 1997). Rfree is calculated in the same way as the conventional least-squares R factor, but using a small subset of reflections that are not used in the refinement of the structural model. Thus Rfree tests how well the model predicts experimental observations that are not themselves used to fit the model.
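
As a minimal sketch of this relationship, the following Python computes the conventional unweighted residual, R = Σ||Fo| − |Fc|| / Σ|Fo|, separately for the working and free sets. The function name `r_factor`, the tuple layout and the amplitudes are illustrative assumptions and are not taken from the mmCIF dictionary; the calculated amplitudes are assumed to be already on the scale of the observed ones.

```python
from typing import Iterable, Tuple

# One reflection as (|Fo|, |Fc|, is_free). This layout is hypothetical.
Reflection = Tuple[float, float, bool]


def r_factor(reflections: Iterable[Reflection], free: bool) -> float:
    """Unweighted R = sum(||Fo| - |Fc||) / sum(|Fo|) over one set of reflections."""
    numerator = 0.0
    denominator = 0.0
    for f_obs, f_calc, is_free in reflections:
        if is_free == free:
            numerator += abs(f_obs - f_calc)
            denominator += f_obs
    if denominator == 0.0:
        raise ValueError("no reflections in the requested set")
    return numerator / denominator


# Made-up amplitudes, for illustration only.
data = [
    (100.0, 90.0, False),
    (50.0, 55.0, False),
    (80.0, 60.0, True),
    (20.0, 20.0, True),
]
r_work = r_factor(data, free=False)  # (10 + 5) / 150
r_free = r_factor(data, free=True)   # (20 + 0) / 100
```

The same formula is applied to both sets; only the membership flag differs, which is the point of the cross-validation.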

The mmCIF dictionary provides data names for Rfree and for the complementary Rwork values for the 'working' set of reflections, which are the reflections that are used in the refinement. Separate data items are provided for unweighted and weighted versions of each R factor. A fixed percentage of the total number of reflections is usually assigned to the free group, and this percentage can be specified. Further details about the method used for selecting the free reflections can be given using _reflns.R_free_details. The estimated error in the Rfree value may also be given, along with the method used for determining its value.
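
The assignment of a fixed percentage of reflections to the free group might be sketched as follows. Random selection is an assumption made for this example; the text above does not prescribe a selection method, and the function name `select_free_set` and the index list are hypothetical.

```python
import random
from typing import List, Set, Tuple

Hkl = Tuple[int, int, int]


def select_free_set(hkls: List[Hkl], percent_free: float, seed: int = 0) -> Set[Hkl]:
    """Randomly flag about percent_free per cent of the reflections as 'free'."""
    if not 0.0 < percent_free < 100.0:
        raise ValueError("percent_free must lie between 0 and 100")
    n_free = max(1, round(len(hkls) * percent_free / 100.0))
    rng = random.Random(seed)  # fixed seed so that the selection is reproducible
    return set(rng.sample(hkls, n_free))


# 100 made-up Miller indices, for illustration only.
hkls = [(h, k, l) for h in range(5) for k in range(5) for l in range(4)]
free = select_free_set(hkls, percent_free=5.0)
working = [hkl for hkl in hkls if hkl not in free]
```

Whatever method is actually used, a short description of it is the kind of information that _reflns.R_free_details is intended to hold.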

A set of reflections that is not used in the refinement serves two purposes: to monitor the progress of the refinement and to ensure that the R factor is not being artificially reduced by the introduction of too many parameters. However, as the refinement converges, the working and free R factors both approach stable values. It is common practice, particularly for structures at high resolution, to stop monitoring Rfree at this point and to include all the reflections in the final rounds of refinement. It is thus worth noting a distinction between _refine.ls_R_factor_obs and _refine.ls_R_factor_R_work: _refine.ls_R_factor_obs relates to a refinement in which all reflections more intense than a specified threshold were used, while _refine.ls_R_factor_R_work relates to a refinement in which a subset of the observed reflections was excluded from the refinement and used to calculate the free R factor. The dictionary allows both values to be given if a free R factor was calculated for most of the refinement but all of the observed reflections were used in the final rounds; the protocol for this may be explained in _refine.details. When a full history of the refinement is provided using data items in the REFINE_HIST category, it is preferable to specify a change in protocol using data items in that category.

Other data items help to provide an assessment of the quality of the refinement. The scale-independent correlation coefficient between the observed and calculated structure factors may be recorded for the reflections included in the refinement using the data item _refine.correlation_coeff_Fo_to_Fc. There is a similar data item for the reflections that were not included in the refinement.
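
A short sketch of such a correlation coefficient is given below. It uses the standard Pearson form, which is assumed here to correspond to the quantity recorded in _refine.correlation_coeff_Fo_to_Fc; the function name and the amplitudes are illustrative only.

```python
import math
from typing import Sequence


def correlation_fo_fc(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """Pearson correlation coefficient between observed and calculated amplitudes."""
    if len(f_obs) != len(f_calc) or len(f_obs) < 2:
        raise ValueError("need two equally long lists with at least two values")
    n = len(f_obs)
    mean_o = sum(f_obs) / n
    mean_c = sum(f_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(f_obs, f_calc))
    var_o = sum((o - mean_o) ** 2 for o in f_obs)
    var_c = sum((c - mean_c) ** 2 for c in f_calc)
    if var_o == 0.0 or var_c == 0.0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_o * var_c)


# Made-up amplitudes, for illustration only.
fo = [10.0, 20.0, 30.0, 40.0]
fc = [12.0, 18.0, 33.0, 41.0]
cc = correlation_fo_fc(fo, fc)
cc_rescaled = correlation_fo_fc(fo, [2.0 * c for c in fc])  # same value as cc
```

Multiplying all calculated amplitudes by a constant leaves the result unchanged, which is what is meant by scale-independent.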

Overall standard uncertainties for positional and displacement parameters can be recorded according to a number of conventions. A maximum-likelihood residual for the positional parameters can be given using _refine.overall_SU_ML and the corresponding value for the displacement parameters can be given using _refine.overall_SU_B. Diffraction-component precision indexes for the displacement parameters based on the crystallographic R factor (the Cruickshank DPI; Cruickshank, 1999) can be given using _refine.overall_SU_R_Cruickshank_DPI. The corresponding value for Rfree can be given using _refine.overall_SU_R_free.

The quality of a data set used for the refinement of a macromolecular structure is often given not only in terms of the scaling residuals, but also in terms of the data redundancy (the ratio of the number of reflections measured to the number of crystallographically unique reflections). Data items are provided to express the redundancy of all reflections, as well as those that have been marked as 'observed' (i.e. exceeding the threshold for inclusion in the refinement). The percentage of the total number of reflections that are considered observed is another metric of the quality of the data set, and a data item is provided for this (_refine.ls_percent_reflns_obs).
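
The two ratios described above amount to simple arithmetic, sketched here in Python. The function names and the reflection counts are hypothetical and serve only to make the definitions concrete.

```python
def redundancy(n_measured: int, n_unique: int) -> float:
    """Ratio of reflections measured to crystallographically unique reflections."""
    if n_unique <= 0:
        raise ValueError("n_unique must be positive")
    return n_measured / n_unique


def percent_observed(n_observed: int, n_total: int) -> float:
    """Percentage of the total number of reflections that count as observed."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_observed / n_total


# Made-up counts, for illustration only.
overall_redundancy = redundancy(n_measured=48000, n_unique=12000)
percent_obs = percent_observed(n_observed=10800, n_total=12000)
```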

The limited resolution of many macromolecular data sets makes it inappropriate to refine anisotropic displacement factors for each atom. For these low- to medium-resolution studies, an overall anisotropic displacement model may be refined. The data items _refine.aniso_B* are provided for recording the unique elements of the matrix that describes the refined anisotropy.

The two-parameter method for modelling the contribution of the bulk solvent to the scattering proposed by Tronrud is used in several refinement programs. The data items _refine.solvent_model_* can be used to record the scale and displacement factors of this model, and any special aspects of its application to the refinement.

The average phasing figure of merit can be given for the working and free reflections. Unusually high or low values of displacement factors or occupancies can be a sign of problems with the refinement, so data items are provided to record the high, low and mean values of each. Further indicators of the quality of the refinement are found in the REFINE_ANALYZE category (Section 3.6.6.2.2).

The data items in the REFINE_FUNCT_MINIMIZED category allow a brief description of the function minimized during refinement to be given (Example 3.6.6.7). It is not possible to reconstruct the function minimized during the refinement by automatic parsing of the values of these data items, but the details given in them may still be helpful to someone reading the mmCIF.

Example 3.6.6.7. Results of the overall refinement of an HIV-1 protease structure (PDB 5HVP) described using data items in the REFINE and REFINE_FUNCT_MINIMIZED categories.

[Scheme scheme62]

3.6.6.2.2. Analysis of the refined structure


The data items in this category are as follows:

REFINE_ANALYZE [Scheme scheme63]

The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item.

In small-molecule crystallography, there is general agreement on the metrics that should be used to assess the quality of a structure determination, and data items in the REFINE category of the core CIF dictionary can be used to record them. For macromolecular structure determinations, no such agreement has been achieved yet and new metrics are frequently suggested as the field evolves. The REFINE_ANALYZE category can be used to record the metrics that were in common use at the time that the mmCIF dictionary was constructed; it is anticipated that new metrics will be added in future versions of the dictionary, and that some of the current metrics may fall into disuse.

Luzzati (1952) devised a method for estimating the average positional shift that would be needed in an idealized refinement to reach an R factor of zero by using a plot of R factors against resolution. For some time, macromolecular crystallographers have used a modification of this approach to assess the average positional error. Recent practice has used Luzzati plots based on the free R values to yield a cross-validated error estimate. Data items are provided for recording these coordinate-error estimates and the range of resolution included in the plot (Example 3.6.6.8). Related data names allow the specification of the value of σ_a used in constructing the Luzzati plot.

Example 3.6.6.8. Aspects of the refinement of an HIV-1 protease structure (PDB 5HVP) described with data items in the REFINE_ANALYZE category.

[Scheme scheme65]

A general feature of introducing more parameters in the model of the structure is a reduction in the R factor, but the statistical significance of this is often obscured by the simultaneous reduction in the ratio of observations to parameters. Attempts to extend Hamilton's (1965) test to macromolecular structures are usually confounded by the use of restraints. Tickle et al. (1998) proposed the use of a Hamilton generalized R factor analyzed separately for reflections in the working set (those used in the refinement) and for reflections in the free set (those set aside for cross validation), and these metrics are often reported in the literature. Data items are provided for recording the Hamilton generalized R factor for the working and free sets of reflections, and for the ratio of the two.
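
A simplified sketch of a generalized R factor and of the free-to-working ratio is shown below. It assumes independent (diagonal) weights, so that RG = [Σ w(|Fo| − G|Fc|)² / Σ w|Fo|²]^(1/2), where G is a scale factor; a full treatment may involve a general weight matrix. The function name, weights and amplitudes are assumptions made for illustration.

```python
import math
from typing import Sequence


def generalized_r(
    f_obs: Sequence[float],
    f_calc: Sequence[float],
    weights: Sequence[float],
    scale: float = 1.0,
) -> float:
    """Weighted residual sqrt(sum w (Fo - G Fc)^2 / sum w Fo^2), diagonal weights only."""
    if not (len(f_obs) == len(f_calc) == len(weights)):
        raise ValueError("input lists must have equal length")
    num = sum(w * (o - scale * c) ** 2 for o, c, w in zip(f_obs, f_calc, weights))
    den = sum(w * o * o for o, w in zip(f_obs, weights))
    if den == 0.0:
        raise ValueError("sum of weighted squared observations is zero")
    return math.sqrt(num / den)


# Made-up amplitudes with unit weights, for illustration only.
rg_work = generalized_r([10.0, 20.0], [9.0, 22.0], [1.0, 1.0])  # sqrt(5 / 500)
rg_free = generalized_r([10.0, 20.0], [8.0, 24.0], [1.0, 1.0])  # sqrt(20 / 500)
rg_ratio = rg_free / rg_work
```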

Other indicators of a successful refinement involve the relative order of the model. Data items are provided for recording the sum of the occupancies of the hydrogen and non-hydrogen atoms in the model. The number of disordered residues may also be recorded.

3.6.6.2.3. Restraints and refinement by shells of resolution


The data items in these categories are as follows:

(a) REFINE_LS_RESTR [Scheme scheme64]

(b) REFINE_LS_RESTR_NCS [Scheme scheme66]

(c) REFINE_LS_RESTR_TYPE [Scheme scheme67]

(d) REFINE_LS_SHELL [Scheme scheme68]

(e) REFINE_LS_CLASS [Scheme scheme69]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item.

These categories were introduced in the mmCIF dictionary to allow a detailed description of several aspects of structure refinement to be given. Data items in the REFINE_LS_RESTR category allow geometric restraints to be specified and the deviations of restrained parameters from ideal values in the final model to be given. The type of the geometric restraints can be described in more detail using data items in the REFINE_LS_RESTR_TYPE category. Data items in the REFINE_LS_RESTR_NCS category can be used to give information about any restraints on noncrystallographic symmetry used in the refinement and the category REFINE_LS_SHELL contains data items that allow the results of refinement to be given by shells of resolution.

Data items in the REFINE_LS_RESTR category can be used to record details about the restraints applied to various classes of parameters during least-squares refinement (Example 3.6.6.9). It is clearly useful to tabulate the various classes of restraint, their deviation from ideal target values and the criteria used to reject parameters that lie too far from a target, as these data are often published as part of a description of the refinement and are often deposited with the coordinates in an archive. However, the types of restraints applied depend strongly on the software package used, and as new refinement packages regularly become available, it was not advisable to provide program-specific data items in the mmCIF dictionary. The approach taken in the mmCIF dictionary has been to allow the value of _refine_ls_restr.type to be a free-text field, so that arbitrary labels can be given to restraints that are particular to a software package, but to recommend the use of specific labels for restraints applied by particular programs. The dictionary provides examples of labels specific to the programs PROTIN/PROLSQ (Hendrickson & Konnert, 1979) and RESTRAIN (Driessen et al., 1989). These program-specific representations have particular prefixes; thus the value p_bond_d is a bond-distance restraint as applied by PROTIN/PROLSQ. Values for _refine_ls_restr.type appropriate for other refinement programs may be suggested in future versions of the mmCIF dictionary.

Example 3.6.6.9. Results of the refinement of an HIV-1 protease structure (PDB 5HVP) described with data items in the REFINE_LS_RESTR and REFINE_LS_SHELL categories.

[Scheme scheme70]

Data items in the REFINE_LS_RESTR_TYPE category can be used to specify the ranges within which quantities are allowed to vary for each type of restraint. The special value indicated by a full stop (.) represents a restraint unbounded on the high or low side.
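
A reader of such values has to treat the full stop as 'no bound on this side'. A minimal sketch of that logic follows; the helper names `parse_bound` and `within_range` are hypothetical, and the values are passed as strings as they would appear in a CIF.

```python
from typing import Optional


def parse_bound(value: str) -> Optional[float]:
    """Return None for the special value '.', otherwise the numeric bound."""
    text = value.strip()
    return None if text == "." else float(text)


def within_range(x: float, low: str, high: str) -> bool:
    """True if x lies inside [low, high]; a '.' bound is treated as unbounded."""
    lo = parse_bound(low)
    hi = parse_bound(high)
    return (lo is None or x >= lo) and (hi is None or x <= hi)


# Made-up limits, for illustration only.
open_above = within_range(2.5, low="2.0", high=".")   # no upper limit
below_lower = within_range(1.5, low="2.0", high=".")  # fails the lower limit
open_below = within_range(1.5, low=".", high="3.0")   # no lower limit
```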

Data items in the REFINE_LS_RESTR_NCS category can be used to record details about the restraints applied to atom positions in domains related by noncrystallographic symmetry during least-squares refinement, and also to record the deviation of the restrained atomic parameters at the end of the refinement. The domains related by noncrystallographic symmetry are defined in the STRUCT_NCS_DOM and related categories (see Section 3.6.7.5.5). The quantities that can be recorded for each restrained domain are the root-mean-square deviations of the displacement and positional parameters, and the weighting coefficients used in the noncrystallographic restraint of each type of parameter. Any special aspects of the way the restraints were applied may be described using _refine_ls_restr_ncs.ncs_model_details.

Data items in the REFINE_LS_SHELL category are used to summarize details of the results of the least-squares refinement by shells of resolution (Example 3.6.6.9). The resolution range, in ångströms, forms the category key; for each shell the quantities reported, such as the number of reflections above the threshold for counting as significantly intense, are all defined in the same way as the corresponding data items used to describe the results of the overall refinement in the REFINE category.
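
Because the per-shell quantities are defined in the same way as the overall ones, reporting by shell amounts to partitioning the reflections by resolution and repeating the calculation. The sketch below does this for an unweighted R factor; the function name, the half-open interval convention for shell membership and all numerical values are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A shell is (d_res_low, d_res_high) in angstroms, with d_res_low > d_res_high.
Shell = Tuple[float, float]
# One reflection as (d spacing in angstroms, |Fo|, |Fc|). This layout is hypothetical.
Reflection = Tuple[float, float, float]


def r_by_shell(
    reflections: List[Reflection], shells: List[Shell]
) -> Dict[Shell, Tuple[int, float]]:
    """Return {shell: (number of reflections, unweighted R)} for non-empty shells."""
    totals = {shell: [0.0, 0.0, 0] for shell in shells}
    for d, f_obs, f_calc in reflections:
        for shell in shells:
            d_low, d_high = shell
            if d_high <= d < d_low:  # assumed convention for shell membership
                acc = totals[shell]
                acc[0] += abs(f_obs - f_calc)
                acc[1] += f_obs
                acc[2] += 1
                break
    return {
        shell: (acc[2], acc[0] / acc[1])
        for shell, acc in totals.items()
        if acc[1] > 0.0
    }


# Made-up data; the last reflection lies outside both shells and is ignored.
reflections = [
    (6.0, 100.0, 90.0),
    (5.0, 50.0, 55.0),
    (3.5, 40.0, 28.0),
    (3.2, 60.0, 60.0),
    (2.5, 10.0, 9.0),
]
shells = [(8.0, 4.0), (4.0, 3.0)]
per_shell = r_by_shell(reflections, shells)
```

Keying the result on the pair of resolution limits mirrors the way the resolution range forms the category key.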

The core dictionary category REFINE_LS_CLASS was introduced after the release of the first version of the mmCIF dictionary. It provides a more general way of describing the treatment of particular subsets of the observations, but it is not expected to be used in macromolecular structural studies, where partition by shells of resolution is traditional.

3.6.6.2.4. Equivalent atoms in the refinement


The data items in these categories are as follows:

(a) REFINE_B_ISO [Scheme scheme71]

(b) REFINE_OCCUPANCY [Scheme scheme72]

The bullet (•) indicates a category key.

In macromolecular structure refinement, displacement factors or occupancies are often treated as equivalent for groups of atoms. An example would be the case where most of the atoms in the structure are refined with isotropic displacement factors, but a bound metal atom is allowed to refine anisotropically. Another example would be where the occupancies for all of the atoms in the protein part of a macromolecular complex are fixed at 1.0, but the occupancies of atoms in a bound inhibitor are refined. The REFINE_B_ISO and REFINE_OCCUPANCY categories can be used to record this information (Example 3.6.6.10).

Example 3.6.6.10. The handling of displacement factors and occupancies during the refinement of an HIV-1 protease structure (PDB 5HVP) described with data items in the REFINE_B_ISO and REFINE_OCCUPANCY categories.

[Scheme scheme73]

Data items in the REFINE_B_ISO category can be used to record details of the treatment of isotropic B (displacement) factors during refinement. There is no formal link between the classes identified by _refine_B_iso.class and individual atom sites, although relationships may be inferred if the class names are carefully chosen. The category allows the treatment of the atoms in each class (isotropic, anisotropic or fixed) and the value assigned for fixed isotropic B factors to be recorded. Any special details can be given in a free-text field.

Data items in the REFINE_OCCUPANCY category can be used to record details of the treatment of occupancies of groups of atom sites during refinement. As with the treatment of displacement factors in the REFINE_B_ISO category, the classes itemized by _refine_occupancy.class are not formally linked to the individual atom sites, but the relationships may be deduced if the class names are chosen carefully.

3.6.6.2.5. History of the refinement


The data items in this category are as follows:

REFINE_HIST [Scheme scheme75]

The bullet (•) indicates a category key.

Data items in the REFINE_HIST category can be used to record details about the various steps in the refinement of the structure. They do not provide as thorough a description of the refinement as can be given in other categories for the final model, but instead allow a summary of the progress of the refinement to be given and supported by a small set of representative statistics.

The category is sufficiently compact that a large number of cycles could be summarized, but it is not expected that every cycle of refinement would be routinely reported. Example 3.6.6.11 shows an entry for a single cycle of refinement. It is likely that an author would present a representative sequence of entries in a looped list.
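
To illustrate what a looped list of representative cycles might look like, the following Python sketch formats a few rows as a CIF loop. The helper `format_loop` is hypothetical, the R-factor values are invented, and the data names used (cycle_id, R_factor_R_work, R_factor_R_free) are assumed to be defined in the REFINE_HIST category and should be checked against the dictionary. Values containing white space would need quoting, which this sketch does not attempt.

```python
from typing import Dict, List


def format_loop(category: str, rows: List[Dict[str, str]]) -> str:
    """Format rows sharing the same keys as a CIF loop_ for one category."""
    if not rows:
        raise ValueError("at least one row is required")
    names = list(rows[0].keys())
    lines = ["loop_"]
    lines += [f"_{category}.{name}" for name in names]
    for row in rows:
        # Every row must supply a value for every data name in the header.
        lines.append(" ".join(row[name] for name in names))
    return "\n".join(lines)


# Invented statistics for two representative cycles, for illustration only.
cycles = [
    {"cycle_id": "C10", "R_factor_R_work": "0.231", "R_factor_R_free": "0.274"},
    {"cycle_id": "C20", "R_factor_R_work": "0.198", "R_factor_R_free": "0.246"},
]
loop_text = format_loop("refine_hist", cycles)
```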

Example 3.6.6.11. An example of one cycle of refinement described with data items in the REFINE_HIST category.

[Scheme scheme74]

References

Brünger, A. T. (1997). Free R value: cross-validation in crystallography. Methods Enzymol. 277, 366–396.
Cruickshank, D. W. J. (1999). Remarks about protein structure precision. Acta Cryst. D55, 583–601.
Driessen, H., Haneef, M. I. J., Harris, G. W., Howlin, B., Khan, G. & Moss, D. S. (1989). RESTRAIN: restrained structure-factor least-squares refinement program for macromolecular structures. J. Appl. Cryst. 22, 510–516.
Hamilton, W. C. (1965). Significance tests on the crystallographic R factor. Acta Cryst. 18, 502–510.
Hendrickson, W. A. & Konnert, J. H. (1979). Stereochemically restrained crystallographic least-squares refinement of macromolecule structures. In Biomolecular structure, conformation, function and evolution, edited by R. Srinivasan, Vol. I, pp. 43–57. New York: Pergamon Press.
Luzzati, V. (1952). Traitement statistique des erreurs dans la determination des structures cristallines. Acta Cryst. 5, 802–810.
Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998). Rfree and the Rfree ratio. I. Derivation of expected values of cross-validation residuals used in macromolecular least-squares refinement. Acta Cryst. D54, 547–557.







































