International
Tables for
Crystallography
Volume F
Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F. ch. 15.1, pp. 314–316

Section 15.1.2.2. Histogram matching

K. Y. J. Zhang,a K. D. Cowtanb* and P. Mainc

a Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N., Seattle, WA 90109, USA, b Department of Chemistry, University of York, York YO1 5DD, England, and c Department of Physics, University of York, York YO1 5DD, England
Correspondence e-mail:  cowtan+email@ysbl.york.ac.uk

15.1.2.2. Histogram matching

Histogram matching seeks to bring the distribution of electron-density values of a map to that of an ideal map. The density histogram of a map is the probability distribution of electron-density values. It provides a global description of the appearance of the map, and all spatial information is discarded. The comparison of the histogram for a given map with that expected for an ideal map can serve as a measure of quality. Furthermore, the initial map can be improved by adjusting density values in a systematic way to make its histogram match the ideal histogram.

15.1.2.2.1. Introduction

Histogram matching is a standard technique in image processing. It is aimed at bringing the density distribution of an image to an ideal distribution, thereby improving the image quality. The first attempt at modifying the electron-density distribution was that by Hoppe & Gassmann (1968), who proposed the '3–2' rule. The electron density was first normalized to a maximum of 1 and modified by imposing positivity. Subsequently, the electron density was modified by [\rho_{\rm mod} = 3\rho^{2} - 2\rho^{3}]. Podjarny & Yonath (1977) used the skewness of the density histogram as a measure of quality of the modified map. Harrison (1988) used a Gaussian function as the ideal histogram in his histogram-specification method for protein phase refinement and extension. The choice of the Gaussian function as the ideal electron-density distribution was based on theoretical arguments instead of experimental evaluation. The Gaussian function was also made independent of resolution. Lunin (1988) used the electron-density distribution to retrieve the values of low-angle structure factors whose amplitudes had not been measured during an X-ray experiment. The electron-density distribution was thought to be structure specific and was derived from a homologous structure. Moreover, the histogram was derived from the entire unit cell, including both the protein and the solvent. Zhang & Main (1988) systematically examined the electron-density histogram of several proteins and found that the ideal density histogram is dependent on resolution, the overall temperature factor and the phase error. It is, however, independent of structural conformation. The sensitivity to phase error suggests that the density histogram could be used for phase improvement. The structural conformation independence made it possible to predict the ideal histogram for unknown structures.

15.1.2.2.2. The prediction of the ideal histogram

Polypeptide structures in particular, and biological macromolecules in general, display a broadly similar atomic composition, and the way in which these atoms bond together is also conserved across a wide range of structures. These similarities between different protein structures can be used to predict the ideal histogram even when positional information for individual atoms is not available in a map. If the positional information is removed from an electron-density map, then what remains is an unlabelled list of density values. This list is the histogram of the electron-density distribution, which is independent of the relative disposition of these densities. The shape of the histogram is primarily based on the presence of atoms and their characteristic distances from each other. This is true for all polypeptide structures.

The frequency distribution, [P(\rho)], of electron-density values in a map can be constructed by sampling the map and counting the density values in different ranges. In practice, once the electron-density map has been sampled on a discrete grid, this frequency distribution becomes a histogram, but for convenience, it is treated here as a continuous distribution.
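The counting step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `density_histogram`, the equal-width binning between the minimum and maximum density, and the default of 200 bins are choices made here, not details given in the text.

```python
def density_histogram(values, n_bins=200):
    """Bin sampled density values into a normalized frequency distribution.

    `values` is a flat list of density samples (e.g. grid points inside the
    protein envelope). Returns (edges, freqs), where `freqs` sums to 1.
    Equal-width bins are an assumption made for this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all density values are equal; no histogram to build")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value lies on the last edge; clamp it into the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    total = len(values)
    freqs = [c / total for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, freqs
```

For example, `density_histogram([0.0, 0.1, 0.2, 0.9, 1.0], n_bins=2)` places three samples in the lower bin and two in the upper bin.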

At resolutions of better than 6.0 Å and after exclusion of the solvent region, the frequency distribution of electron-density values for protein density over a wide range of proteins varies only with resolution and overall temperature factor to a good approximation. If the overall temperature factor is artificially adjusted, for example, by sharpening to [B_{\rm overall} = 0], then the frequency distributions may be treated as a function of resolution only. Once a good approximation to the molecular envelope is known, the frequency distribution of electron densities in the protein region may therefore be assumed to be known as a function of resolution. Consequently, the ideal density histogram for an unknown map at a given resolution can be taken from any known structure at the same resolution (Zhang & Main, 1988, 1990a).

The ideal electron-density histogram can also be predicted by an analytical formula (Lunin & Skovoroda, 1991; Main, 1990a). The method adopted by Main (1990a) represents the density histogram by components that correspond to three types of electron density in the map. The first component is the region of overlapping densities, which can be represented by a randomly distributed background noise. The second component is the region of partially overlapping densities. The third component is the region of non-overlapping atomic peaks, which can be represented by a Gaussian.

The histogram for the overlapping part of the density can be represented by a Gaussian distribution, [P_{o} (\rho) = N\exp \left[- {{\left({\rho - \overline{\rho}}\right)^{2}/{2\sigma^{2}}}}\right], \eqno(15.1.2.8)] where [\overline{\rho}] is the mean density and σ is the standard deviation. The region of partially overlapping densities can be modelled by a cubic polynomial function, [P_{po} (\rho) = N\left({a\rho^{3} + b\rho^{2} + c\rho + d}\right). \eqno(15.1.2.9)] The histogram for the non-overlapping part of the density can be derived analytically from a Gaussian atom, [P_{no} (\rho) = N(A/\rho)[\ln (\rho_{0}/\rho)]^{1/2}, \eqno(15.1.2.10)] where [\rho_{0}] is the maximum density, N is a normalizing factor and A is the relative weight of the terms between equation (15.1.2.8) and equation (15.1.2.10).

If we use two threshold values, [\rho_{1}] and [\rho_{2}], to divide the three density regions, the complete formula can be expressed as [P(\rho) = \left\{\matrix{N \exp \left[- (\rho - \overline{\rho})^{2}/2\sigma^{2}\right]\hfill & \hbox{ for }\hfill& \rho \leq \rho_{2}\hfill \cr N (a\rho^{3} + b\rho^{2} + c\rho + d)\hfill& \hbox{ for }\hfill& \rho_{2} \lt \rho \leq \rho_{1} \hfill\cr N (A/\rho) [\ln (\rho_{0}/\rho)]^{1/2}\hfill & \hbox{ for }\hfill& \rho_{1} \lt \rho \leq \rho_{0}.\hfill \cr}\right. \eqno(15.1.2.11)]

The parameters a, b, c, d in the cubic polynomial are calculated by matching function values and gradients at [\rho_{1}] and [\rho_{2}]. The parameters in the histogram formula, [\overline{\rho}], σ, A, [\rho_{0}], [\rho_{1}], [\rho_{2}], can be obtained from histograms of known structures.
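One way to assemble the three-part formula is sketched below. It is an illustration rather than the published implementation. The cubic is written in Hermite form, which gives the same polynomial as solving for a, b, c, d from the values and gradients at the two thresholds. The names `rho_lo` and `rho_hi` stand for the lower and upper threshold, whichever labels they carry in the formula. The normalizing factor N is omitted, and the parameter values in the usage note are hypothetical.

```python
import math

def gaussian_part(rho, mean, sigma):
    """Overlapping-density term, equation (15.1.2.8) without N."""
    return math.exp(-(rho - mean) ** 2 / (2.0 * sigma ** 2))

def gaussian_slope(rho, mean, sigma):
    """Derivative of gaussian_part with respect to rho."""
    return -(rho - mean) / sigma ** 2 * gaussian_part(rho, mean, sigma)

def peak_part(rho, weight, rho_max):
    """Non-overlapping term, equation (15.1.2.10) without N (needs 0 < rho <= rho_max)."""
    return (weight / rho) * math.sqrt(math.log(rho_max / rho))

def peak_slope(rho, weight, rho_max):
    """Derivative of peak_part with respect to rho (needs rho < rho_max)."""
    root = math.sqrt(math.log(rho_max / rho))
    return -(weight / rho ** 2) * (root + 0.5 / root)

def bridging_cubic(x0, y0, m0, x1, y1, m1):
    """Cubic matching the values (y0, y1) and gradients (m0, m1) at x0 and x1."""
    h = x1 - x0
    def cubic(x):
        t = (x - x0) / h
        h00 = 2 * t ** 3 - 3 * t ** 2 + 1
        h10 = t ** 3 - 2 * t ** 2 + t
        h01 = -2 * t ** 3 + 3 * t ** 2
        h11 = t ** 3 - t ** 2
        return h00 * y0 + h10 * h * m0 + h01 * y1 + h11 * h * m1
    return cubic

def ideal_histogram(mean, sigma, weight, rho_max, rho_lo, rho_hi):
    """Return the unnormalized three-part histogram function P(rho)."""
    cubic = bridging_cubic(
        rho_lo, gaussian_part(rho_lo, mean, sigma), gaussian_slope(rho_lo, mean, sigma),
        rho_hi, peak_part(rho_hi, weight, rho_max), peak_slope(rho_hi, weight, rho_max),
    )
    def p(rho):
        if rho <= rho_lo:
            return gaussian_part(rho, mean, sigma)
        if rho <= rho_hi:
            return cubic(rho)
        if rho <= rho_max:
            return peak_part(rho, weight, rho_max)
        return 0.0  # no density above the maximum
    return p
```

With hypothetical parameters, `ideal_histogram(mean=0.3, sigma=0.25, weight=0.1, rho_max=3.0, rho_lo=0.6, rho_hi=1.2)` returns a function that is continuous in value and gradient at both thresholds. In practice the parameters would be taken from histograms of known structures, as stated above.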

15.1.2.2.3. The process of histogram matching

Zhang & Main (1990a) demonstrated that, at better than 4 Å resolution, the histogram for an MIR map is generally significantly different from the ideal distribution calculated from atomic coordinates. The obvious course is therefore to alter the map in such a way as to make its density histogram equal to the ideal distribution. Unfortunately, there are an infinite number of maps corresponding to any chosen density distribution, so we must choose a systematic method of altering the map.

The conventional method of performing such a modification is to retain the ordering of the density values in the map. The highest point in the original map will be the highest point in the modified map, the second highest points will correspond in the same way, and so on.

Mathematically, this transformation is represented as follows. Let [P(\rho)] be the current density histogram and [P'(\rho)] be the desired distribution, normalized such that their sums are equal to 1. The cumulative distribution functions, [N(\rho)] and [N'(\rho)], may then be calculated: [\eqalign{N(\rho) &= {\textstyle\int\limits_{\rho_{\min}}^{\rho}} P(\rho)\ \hbox{d} \rho,\cr N'(\rho') &= {\textstyle\int\limits_{\rho_{\min}}^{\rho'}} P'(\rho)\ \hbox{d} \rho.} \eqno(15.1.2.12)] The cumulative distribution function of a variable transforms a value chosen from the distribution into a number between 0 and 1, representing the position of that value in an ordered list of values chosen from the distribution.

The transformation may, therefore, be performed in two stages. A density value is taken from the initial distribution and the cumulative distribution function of the initial distribution is applied to obtain the position of that value in the distribution. The inverse of the cumulative distribution function for the desired distribution is applied to this value to obtain the density value for the corresponding point in the desired distribution. Thus, given a density value, ρ, from the initial distribution, the modified value, ρ′, is obtained by [\rho' = N'^{-1} \left[{N(\rho)}\right]. \eqno(15.1.2.13)] The distribution of ρ′ will then match the desired distribution after the above transformation. The transformation of an electron-density value by this method is illustrated in Fig. 15.1.2.3. The transformation in equation (15.1.2.13) can be achieved through a linear transform applied separately within each density bin, [\rho'_{i} = a_{i} \rho_{i} + b_{i}, \eqno(15.1.2.14)] where [i = \left\{1, \ldots, n\right\}] and n is the number of density bins. The above linear transform is sufficient if the number of density bins is large enough. An n value of about 200 is usually quite satisfactory.

Figure 15.1.2.3. Transformation of density ρ to [\rho'_{\rm mod}] by histogram matching.
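The two-stage mapping of equation (15.1.2.13) can be sketched with binned histograms as follows. This is a minimal illustration, assuming both histograms are given as bin edges plus frequencies and that linear interpolation within bins is acceptable; the function names are invented for this sketch. The piecewise-linear interpolation is similar in spirit to the per-bin linear transform of equation (15.1.2.14), but is not claimed to reproduce it exactly.

```python
from bisect import bisect_right

def cumulative(freqs):
    """Cumulative distribution at the bin edges: starts at 0 and ends at 1."""
    cdf = [0.0]
    for f in freqs:
        cdf.append(cdf[-1] + f)
    total = cdf[-1]
    return [c / total for c in cdf]

def interpolate(x, xs, ys):
    """Piecewise-linear y(x) on non-decreasing xs, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1  # guarantees xs[k] <= x < xs[k + 1]
    return ys[k] + (ys[k + 1] - ys[k]) * (x - xs[k]) / (xs[k + 1] - xs[k])

def match_histogram(values, edges, freqs, target_edges, target_freqs):
    """Map each density value so that the map follows the target histogram.

    (edges, freqs) describe the current map; (target_edges, target_freqs)
    describe the ideal histogram. The ordering of values is preserved.
    """
    cdf = cumulative(freqs)
    target_cdf = cumulative(target_freqs)
    matched = []
    for rho in values:
        rank = interpolate(rho, edges, cdf)  # N(rho), between 0 and 1
        matched.append(interpolate(rank, target_cdf, target_edges))  # N'^{-1}
    return matched
```

As a simple check, a map distributed uniformly on 0 to 1 and matched to a target distributed uniformly on 0 to 2 has every density value doubled.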

Various properties of the electron density are specified in the density histogram, such as the minimum, maximum and mean density, the density variance, and the entropy of the map. The mean density of the ideal map can be obtained by [\overline{\rho} = {\textstyle\int\limits_{\rho_{\min}}^{\rho_{\max}}} {\rho P(\rho)\ \hbox{d}\rho}. \eqno(15.1.2.15)] The standard deviation of the density in the ideal map, which is the square root of the variance, can be obtained by [\sigma (\rho) = \left({\overline {\rho^{2}} - \overline{\rho}^{2}}\right)^{1/2}, \eqno(15.1.2.16)] where [\overline{\rho^{2}} = {\textstyle\int\limits_{\rho_{\min}}^{\rho_{\max}}} {\rho^{2} P(\rho)\ \hbox{d}\rho}. \eqno(15.1.2.17)] The entropy of the ideal map can be calculated by [S = - {\textstyle\int\limits_{\rho_{\min}}^{\rho_{\max}}} {P(\rho)} \rho \ln (\rho)\ \hbox{d}\rho. \eqno(15.1.2.18)]
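For a binned histogram, the integrals in equations (15.1.2.15) to (15.1.2.18) reduce to sums over bins. The sketch below assumes each bin is represented by its centre value and that bins with non-positive density are skipped in the entropy sum, since the logarithm is undefined there; both choices, and the function name, are assumptions of this example.

```python
import math

def histogram_statistics(centres, freqs):
    """Return (mean, sigma, entropy) of a density histogram.

    `centres` are bin-centre density values and `freqs` the matching
    frequencies; the frequencies are renormalized to sum to 1.
    """
    total = sum(freqs)
    probs = [f / total for f in freqs]
    mean = sum(p * r for p, r in zip(probs, centres))         # (15.1.2.15)
    mean_sq = sum(p * r * r for p, r in zip(probs, centres))  # (15.1.2.17)
    # Guard against a tiny negative value caused by rounding.
    sigma = math.sqrt(max(mean_sq - mean * mean, 0.0))        # (15.1.2.16)
    # Entropy, (15.1.2.18); non-positive densities are skipped (assumption).
    entropy = -sum(p * r * math.log(r) for p, r in zip(probs, centres) if r > 0.0)
    return mean, sigma, entropy
```

For a two-bin histogram with centres 1 and 3 and equal frequencies, this gives a mean of 2, a standard deviation of 1 and an entropy of −1.5 ln 3.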

Therefore, the process of histogram matching applies a minimum and a maximum value to the electron density, imposes the correct mean and variance, and defines the entropy of the new map. The order of electron-density values remains unchanged after histogram matching.

Histogram matching is complementary to solvent flattening since it is applied to the protein region of a map, whereas solvent flattening only operates on the solvent region of the map. The same envelope that was used for isolating the solvent region can be used to determine the protein region of the cell. An alternative approach is to define separate solvent and protein masks, with uncertain regions excluded from either mask and allowed to keep their unmodified values.
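The division of labour between the two masks can be sketched as below. The string labels, the function name and the choice of flattening the solvent to a single constant level are assumptions made for illustration; `match_protein` stands for any density mapping, such as a histogram-matching function.

```python
def modify_map(values, labels, match_protein, solvent_level):
    """Apply histogram matching in the protein mask and flattening in the solvent mask.

    `labels` holds one of "protein", "solvent" or "unknown" per grid point
    (hypothetical labels). Points in neither mask keep their original values.
    """
    modified = []
    for rho, label in zip(values, labels):
        if label == "protein":
            modified.append(match_protein(rho))
        elif label == "solvent":
            modified.append(solvent_level)
        else:
            modified.append(rho)  # uncertain region: left unmodified
    return modified
```

For example, `modify_map([0.2, 0.9, 0.5], ["solvent", "protein", "unknown"], lambda rho: 2.0 * rho, 0.33)` changes the first two values and leaves the third untouched.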

15.1.2.2.4. Scaling the observed structure-factor amplitudes according to the ideal density histogram

In the process of density modification, electron density or structure factors from different sources are compared and combined. It is, therefore, crucial to ensure that all the structure factors and maps are on the same scale. The observed structure factors can be put on the absolute scale by Wilson statistics (Wilson, 1949) using a scale and an overall temperature factor. This is accurate when atomic or near atomic resolution data are available. The scale and overall temperature factor obtained from Wilson statistics are less accurate when only medium- to low-resolution data are available. A more robust method of scaling non-atomic resolution data is through the density histogram (Cowtan & Main, 1993; Zhang, 1993).

The ideal density histogram defines the mean and variance of an electron density, as shown in equations (15.1.2.15) and (15.1.2.16). We can scale the observed structure-factor amplitudes to be consistent with the target histogram using the following formulae, obtained from the structure-factor equation and Parseval's theorem. The mean density and the standard deviation of the density of the observed map can be calculated as [\eqalignno{\overline{\rho}' &= (1/V)F(000), &(15.1.2.19)\cr \sigma '(\rho) &= (1/V) \left[{\textstyle\sum\limits_{\bf h}} | F({\bf h})|^{2}\right]^{1/2}. &(15.1.2.20)}]

The mean and variance of the electron-density map at the desired resolution are calculated using the target histogram, the mean value of the solvent density, [\overline{\rho}_{\rm solv}], and the solvent volume of the cell, [V_{\rm solv}]. The F(000) term can then be evaluated from equations (15.1.2.15) and (15.1.2.19): [{F(000) = (V - V_{\rm solv})\overline{\rho} + V_{\rm solv} \overline{\rho}_{\rm solv}.} \eqno(15.1.2.21)] The scale of the observed amplitudes can be obtained from equations (15.1.2.16) and (15.1.2.20), [F'({\bf h}) = KF({\bf h}), \eqno(15.1.2.22)] where [K = \left[(\overline{\rho^{2}} - \overline{\rho}^{2})\right]^{1/2}\bigg/\bigg\{(1/V) \left[{\textstyle\sum\limits_{\bf h}} | F({\bf h})|^{2}\right]^{1/2}\bigg\}. \eqno(15.1.2.23)] This method is adequate for scaling observed structure factors at any resolution.
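Equations (15.1.2.21) and (15.1.2.23) translate directly into a few lines of arithmetic. In this sketch the function and argument names are illustrative, and the amplitude list is assumed to contain every reflection that enters the sum over h in equation (15.1.2.20).

```python
import math

def f000(cell_volume, solvent_volume, protein_mean, solvent_mean):
    """F(000) from the mean protein and solvent densities, equation (15.1.2.21)."""
    return (cell_volume - solvent_volume) * protein_mean + solvent_volume * solvent_mean

def amplitude_scale(amplitudes, cell_volume, target_mean, target_mean_sq):
    """Scale factor K of equation (15.1.2.23).

    `target_mean` and `target_mean_sq` are the mean and mean-square density
    of the target histogram, equations (15.1.2.15) and (15.1.2.17).
    `amplitudes` is assumed to list |F(h)| for every h in the sum.
    """
    target_sigma = math.sqrt(target_mean_sq - target_mean ** 2)
    observed_sigma = math.sqrt(sum(f * f for f in amplitudes)) / cell_volume
    return target_sigma / observed_sigma

def scale_amplitudes(amplitudes, k):
    """Apply F'(h) = K F(h), equation (15.1.2.22)."""
    return [k * f for f in amplitudes]
```

With amplitudes of 3 and 4, a cell volume of 10, a target mean of 1 and a target mean square of 2, the observed standard deviation is 0.5, the target is 1, and K is therefore 2.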

References

Cowtan, K. D. & Main, P. (1993). Improvement of macromolecular electron-density maps by the simultaneous application of real and reciprocal space constraints. Acta Cryst. D49, 148–157.
Harrison, R. W. (1988). Histogram specification as a method of density modification. J. Appl. Cryst. 21, 949–952.
Hoppe, W. & Gassmann, J. (1968). Phase correction, a new method to solve partially known structures. Acta Cryst. B24, 97–107.
Lunin, V. Yu. (1988). Use of the information on electron density distribution in macromolecules. Acta Cryst. A44, 144–150.
Lunin, V. Yu. & Skovoroda, T. P. (1991). Frequency-restrained structure-factor refinement. I. Histogram simulation. Acta Cryst. A47, 45–52.
Main, P. (1990a). A formula for electron density histograms for equal-atom structures. Acta Cryst. A46, 507–509.
Podjarny, A. D. & Yonath, A. (1977). Use of matrix direct methods for low-resolution phase extension for tRNA. Acta Cryst. A33, 655–661.
Wilson, A. J. C. (1949). The probability distribution of X-ray intensities. Acta Cryst. 2, 318–321.
Zhang, K. Y. J. (1993). SQUASH – combining constraints for macromolecular phase refinement and extension. Acta Cryst. D49, 213–222.
Zhang, K. Y. J. & Main, P. (1988). Histogram matching as a density modification technique for phase refinement and extension of protein molecules. In Improving protein phases, edited by S. Bailey, E. Dodson & S. Phillips. Report DL/SCI/R26, pp. 57–64. Warrington: Daresbury Laboratory.
Zhang, K. Y. J. & Main, P. (1990a). Histogram matching as a new density modification technique for phase refinement and extension of protein molecules. Acta Cryst. A46, 41–46.







































