International
Tables for
Crystallography
Volume F
Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F, ch. 23.1, pp. 577–578

Section 23.1.2. Locating domains in 3D structures

L. Holm and C. Sander

23.1.2.1. Introduction

Modular design is beneficial in many areas of life, including computer programming, manufacturing and even protein folding.

Protein-structure analysis has long operated with the notion of domains, i.e., dividing large structures into quasi-independent substructures or modules (Wetlaufer, 1973; Bork, 1992). In various contexts, these substructures are thought to fold autonomously, to carry specific molecular functions such as binding or catalysis, to move relative to each other as semi-rigid bodies and to speed the evolution of new functions by recombination (Fig. 23.1.2.1).

Figure 23.1.2.1. The structure of diphtheria toxin (Bennett & Eisenberg, 1994) beautifully illustrates domains as structural, functional and evolutionary units. Structurally, note the compact globular shape of each domain and the flexible linkers between them. Functionally, note how each domain carries out a different stage of infection by the bacterium: receptor binding, membrane penetration and ADP-ribosylation of the target protein. Evolutionarily, note the occurrence of domains homologous to the catalytic domain of diphtheria toxin in exo-, entero- and pertussis toxins, and in poly-ADP-ribose polymerase (Holm & Sander, 1999). Arrows point to recurrent substructures in structural neighbours (Lionetti et al., 1991; Li et al., 1996; Tormo et al., 1996) of each domain of diphtheria toxin. Drawn using MOLSCRIPT version 2 (Kraulis, 1991).

The problem of subdividing protein molecules into structural and functional units has received the attention of numerous researchers over the last 25 years. Early algorithms focused on protein folding or unfolding pathways and aimed at identifying substructures that would be physically stable on their own. Nowadays, with bulging macromolecular databases, the focus has shifted to devising automatic methods for identifying domains that can form the basis for a consistent protein-structure classification (Murzin et al., 1995; Orengo et al., 1997; Holm & Sander, 1999).

This review presents the concepts underlying computational methods for locating domains in 3D structures. Those interested in implementations are referred to the web services of the European Bioinformatics Institute and related sites.

23.1.2.2. Compactness

A variety of ingenious techniques have been invented for locating structural domains in 3D structures. These include inspection of distance maps, clustering, neighbourhood correlation, plane cutting, interface area minimization, specific volume minimization, searching for mechanical hinge points, maximization of compactness and maximization of buried surface area (Rossmann & Liljas, 1974; Rashin, 1976; Crippen, 1978; Nemethy & Scheraga, 1979; Rose, 1979; Schulz & Schirmer, 1979; Go, 1981; Lesk & Rose, 1981; Sander, 1981; Wodak & Janin, 1981; Zehfus & Rose, 1986; Kikuchi et al., 1988; Moult & Unger, 1991; Holm & Sander, 1994b; Zehfus, 1994; Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Holm & Sander, 1996; Sowdhamini et al., 1996; Zehfus, 1997; Holm & Sander, 1998; Jones et al., 1998; Wernisch et al., 1999).

Common to most approaches are the assumptions that folding units are compact and that the interactions between them are weak. These notions can be made quantitative, for example, by counting interatomic contacts and placing domain borders so that the number of contacts between the resulting groups of residues is minimized. The hierarchic organization of putative folding units can be inferred by starting from the complete structure and recursively cutting it (in silico) into smaller and smaller substructures. Alternatively, one may start from the residue or secondary-structure-element level and successively associate the most strongly interacting groups. The procedure involves two optimization problems.
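
Before turning to these, the contact-counting idea can be illustrated with a minimal sketch. It assumes one representative coordinate per residue, an arbitrary 8 Å distance cutoff and a single cut of a continuous chain; the function names are hypothetical, and published methods use more elaborate contact measures and normalizations than a raw count.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, within `cutoff` of each other.

    `coords` holds one (x, y, z) point per residue; the cutoff value is an
    illustrative assumption, not a recommended setting.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts

def best_single_cut(n_residues, contacts, min_size=2):
    """Find the chain position k minimizing contacts between [0, k) and [k, n).

    Returns (k, number_of_crossing_contacts). Both parts must contain at
    least `min_size` residues.
    """
    best = None
    for k in range(min_size, n_residues - min_size + 1):
        crossing = sum(1 for i, j in contacts if i < k <= j)
        if best is None or crossing < best[1]:
            best = (k, crossing)
    return best

# Toy chain: two well-separated clusters of five residues each.
toy_coords = ([(float(i), 0.0, 0.0) for i in range(5)]
              + [(30.0 + i, 0.0, 0.0) for i in range(5)])
toy_cut = best_single_cut(len(toy_coords), contact_map(toy_coords))
```

For this toy chain the cut falls between the two clusters, where no contacts cross the border. On real structures a raw contact count tends to favour cuts near the chain termini, which is one reason why practical criteria are normalized in some way.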

The first optimization problem is algorithmic and concerns finding the optimal subdivisions. This problem is complicated by the possibility of the chain passing several times between domains (discontinuous domains). Without the constraint of sequential continuity, the number of ways of dividing a set of residues into subsets grows combinatorially (Zehfus, 1994). This hurdle has been overcome by fast heuristics (Holm & Sander, 1994b; Zehfus, 1997; Wernisch et al., 1999).
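
The size of the search space can be made concrete for the simplest case, a division into two non-empty subsets. The following sketch (function names are illustrative) compares the number of single cuts of a continuous chain with the number of unconstrained two-way divisions, and checks the closed-form count against brute-force enumeration for a small chain.

```python
from itertools import product

def continuous_two_way_splits(n):
    """Number of ways to cut a chain of n residues once: n - 1 cut points."""
    return n - 1

def unconstrained_two_way_splits(n):
    """Number of ways to divide n residues into two non-empty subsets."""
    return 2 ** (n - 1) - 1

def count_two_way_splits_brute_force(n):
    """Enumerate all labelings and count distinct unordered divisions."""
    seen = set()
    for labels in product((0, 1), repeat=n):
        if len(set(labels)) < 2:
            continue  # one of the two subsets would be empty
        # The subset containing residue 0 identifies the division uniquely.
        group_a = frozenset(i for i, lab in enumerate(labels) if lab == labels[0])
        seen.add(group_a)
    return len(seen)

small_chain_check = count_two_way_splits_brute_force(6)
```

For a chain of 100 residues there are 99 single cuts but 2^99 − 1 (about 6 × 10^29) unconstrained two-way divisions, so exhaustive enumeration is not an option once discontinuous domains are allowed.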

The second optimization problem concerns formulating physical criteria that distinguish between autonomous and nonautonomous folding units, i.e., defining termination criteria for recursive algorithms. Since compactness-related criteria do not have a clear bimodal distribution, domain-assignment algorithms (Holm & Sander, 1994b; Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Sowdhamini et al., 1996; Wernisch et al., 1999) use cutoff parameters that have been fine-tuned against an external reference set of domain definitions.
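
The role of the cutoff can be seen in a minimal sketch of recursive cutting. It is restricted to continuous segments and uses a deliberately crude termination rule (a cut is accepted only if at most `max_crossing` contacts span it); the rule, the parameter names and their values are assumptions for illustration and do not reproduce any of the cited algorithms.

```python
def split_recursively(start, end, contacts, max_crossing=1, min_size=3):
    """Recursively cut residues [start, end) into continuous segments.

    `contacts` is a set of residue pairs (i, j) with i < j. The best cut is
    the position with the fewest contacts spanning it. If even the best cut
    is spanned by more than `max_crossing` contacts, or the segment is too
    short to cut, recursion stops and the segment is reported as one unit.
    """
    best_k, best_crossing = None, None
    for k in range(start + min_size, end - min_size + 1):
        crossing = sum(1 for i, j in contacts if start <= i < k <= j < end)
        if best_crossing is None or crossing < best_crossing:
            best_k, best_crossing = k, crossing
    if best_k is None or best_crossing > max_crossing:
        return [(start, end)]
    return (split_recursively(start, best_k, contacts, max_crossing, min_size)
            + split_recursively(best_k, end, contacts, max_crossing, min_size))

# Toy contact set: three blocks of four fully connected residues,
# with a single contact linking each pair of neighbouring blocks.
toy_contacts = {(i, j)
                for block in (range(0, 4), range(4, 8), range(8, 12))
                for i in block for j in block if i < j}
toy_contacts |= {(3, 4), (7, 8)}

units_lenient = split_recursively(0, 12, toy_contacts, max_crossing=1)
units_strict = split_recursively(0, 12, toy_contacts, max_crossing=0)
```

With the lenient cutoff the toy chain is divided into three units; with the strict cutoff it is left whole. The outcome thus depends entirely on a parameter that the compactness model itself does not determine.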

23.1.2.3. Recurrence

Most fold classifications use a hierarchical model where evolutionary families are a subcategory of fold type and it is natural to assume that domain boundaries should be conserved in evolution. Consistency concerns lead to a reformulation of the goals of the domain-assignment problem, away from (imprecise) physical models of stable folding units and towards recognizing such units phenomenologically in the database of known structures through recurrence. The concept of recurrence has long been the cornerstone of domain assignments by experts based on visual inspection (Richardson, 1981). Recurrence means recognizing architectural units in one protein that have already been defined (named) in another.

The practical importance of domain identification is illustrated by the discoveries made by a systematic structure comparison of recurrent domains between histidine triad (HIT) proteins and galactose-1-phosphate uridylyltransferase [homodimer and internally duplicated common catalytic core, respectively (Holm & Sander, 1997)], and between beta-glucosyltransferase and glycogen phosphorylase [bare and heavily decorated common catalytic core, respectively (Holm & Sander, 1995; Artymiuk et al., 1995)], even though the contours of the molecules look quite different.

Let us restate the goal of domain identification as an economical description of all known protein structures in terms of a small set of large substructures. This is an intuitive goal and conceptually related to the principle of minimal encoding in information theory. The key ingredients of the optimization problem are the gain associated with reusing a substructure and the cost associated with using many small substructures to describe a protein. An analogy in writing is that copying blocks of text is cheap, but for coherence some thought and effort is necessary for bridging the blocks.

With a suitably defined cost function, recurrence can be used to select an optimal set of substructures from the hierarchic folding or unfolding trees generated using compactness criteria. Thus, the unsatisfactorily solved problem of defining termination criteria for compactness algorithms can be turned into an optimization problem that does not rely on any external reference and leads to an internally consistent set of domain definitions.
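
How such a selection might proceed can be sketched on a toy tree. Each node carries a hypothetical `recurrence_gain` (the reward for reusing that substructure) and each selected unit incurs a fixed `unit_cost`; both quantities, and the additive score, are assumptions made for illustration and are not the objective function of any published method. The best set of units covering a node is either the node itself or the union of the best sets of its children.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    recurrence_gain: float          # hypothetical reward for reusing this unit
    children: list = field(default_factory=list)

def select_domains(node, unit_cost=1.0):
    """Return (score, names) for the best set of tree nodes covering `node`.

    Keeping the node as one unit scores recurrence_gain - unit_cost.
    Splitting it scores the sum of the best scores of its children.
    """
    own = (node.recurrence_gain - unit_cost, [node.name])
    if not node.children:
        return own
    total, names = 0.0, []
    for child in node.children:
        child_score, child_names = select_domains(child, unit_cost)
        total += child_score
        names += child_names
    return own if own[0] >= total else (total, names)

# Toy unfolding tree: the whole chain recurs rarely, its two halves recur
# often, and the small fragments of half A gain little from being reused.
toy_tree = Node("chain", 1.0, [
    Node("A", 5.0, [Node("a1", 1.5), Node("a2", 1.5)]),
    Node("B", 4.0),
])
toy_score, toy_domains = select_domains(toy_tree)
```

In this toy case the two halves A and B are selected: the whole chain gains too little from reuse, and the fragments of A are too costly relative to their gain. No externally tuned cutoff enters the decision; the level at which recursion stops follows from the scores alone.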

The key difficulty is in quantifying the notion of economy so that it leads to a selection of substructures of `appropriate' size, i.e., globular folds and not, for example, supersecondary-structure motifs. One solution, which is physical nonsense but has the desired qualitative behaviour, is a heuristic objective function used in the DALI domain dictionary (Holm & Sander, 1998). Recurrence is quantified in terms of the statistical significance of structural similarity for many pairs of substructures. The statistical significance is highest for structural similarities that involve large units and that completely cover a substructure unit. Exploiting these effects, a sum-of-pairs objective function is defined that favours recurrences of large substructures with distinct topological arrangements and packing of secondary-structure elements, and disfavours small substructures consisting of one or two secondary-structure elements despite their higher frequency of recurrence. Though other formulations of the optimization problem are possible, this empirically chosen objective function combined with a heuristic algorithm for optimization yields a useful set of substructures (domains).

23.1.2.4. Conclusion

While we do not foresee that automatically delineated domains will be accepted as the gold standard of the trade, modern methods, based on a combination of recurrence and compactness criteria, yield domain definitions that are consistent within protein families and often coincide with biologically functional units, recover the well known folding topologies with many members, produce clusters with good coverage of common secondary-structure elements, and provide a useful basis for large-scale structure analysis and classification.

References

Artymiuk, P. J., Rice, D. W., Poirrette, A. R. & Willett, P. (1995). Beta-glucosyltransferase and phosphorylase reveal their common theme. Nature Struct. Biol. 2, 117–120.
Bennett, M. J. & Eisenberg, D. (1994). Refined structure of monomeric diphtheria toxin at 2.3 Å resolution. Protein Sci. 3, 1464–1475.
Bork, P. (1992). Mobile modules and motifs. Curr. Opin. Struct. Biol. 2, 413–421.
Crippen, G. (1978). The tree structural organization of proteins. J. Mol. Biol. 126, 315–332.
Go, M. (1981). Correlation of DNA exonic regions with protein structural units in hemoglobin. Nature (London), 291, 90–92.
Holm, L. & Sander, C. (1994b). Parser for protein folding units. Proteins, 19, 256–268.
Holm, L. & Sander, C. (1995). Evolutionary link between glycogen phosphorylase and a DNA modifying enzyme. EMBO J. 14, 1287–1293.
Holm, L. & Sander, C. (1996). Mapping the protein universe. Science, 273, 595–602.
Holm, L. & Sander, C. (1997). Enzyme HIT. Trends Biochem. Sci. 22, 116–117.
Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Holm, L. & Sander, C. (1999). Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244–247.
Islam, S. A., Luo, J. & Sternberg, M. J. (1995). Identification and analysis of domains in proteins. Protein Eng. 8, 513–525.
Jones, S., Stewart, M., Michie, A. D., Swindells, M. B., Orengo, C. A. & Thornton, J. M. (1998). Domain assignment for protein structures using a consensus approach: characterisation and analysis. Protein Sci. 7, 233–242.
Kikuchi, T., Nemethy, G. & Scheraga, H. A. (1988). Prediction of the location of structural domains in globular proteins. J. Protein Chem. 7, 427–471.
Kraulis, P. J. (1991). MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950.
Lesk, A. M. & Rose, G. D. (1981). Folding units in globular proteins. Proc. Natl Acad. Sci. USA, 78, 4304–4308.
Li, M., Dyda, F., Benhar, I., Pastan, I. & Davies, D. R. (1996). Crystal structure of the catalytic domain of Pseudomonas exotoxin A complexed with a nicotinamide adenine dinucleotide analog: implications for the activation process and for ADP ribosylation. Proc. Natl Acad. Sci. USA, 93, 6902–6906.
Lionetti, C., Guanziroli, M. G., Frigerio, F., Ascenzi, P. & Bolognesi, M. (1991). X-ray crystal structure of the ferric sperm whale myoglobin: imidazole complex at 2.0 Å resolution. J. Mol. Biol. 217, 409–412.
Moult, J. & Unger, R. (1991). An analysis of protein folding pathways. Biochemistry, 30, 3816–3824.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of the protein database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Nemethy, G. & Scheraga, H. A. (1979). A possible folding pathway of bovine pancreatic Rnase. Proc. Natl Acad. Sci. USA, 76, 6050–6054.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Rashin, A. A. (1976). Location of domains in globular proteins. Nature (London), 291, 85–87.
Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339.
Rose, G. D. (1979). Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134, 447–470.
Rossmann, M. & Liljas, A. (1974). Recognition of structural domains in globular proteins. J. Mol. Biol. 85, 177–181.
Sander, C. (1981). Physical criteria for folding units of globular proteins. In Structural aspects of recognition and assembly in biological macromolecules, Vol. I. Proteins and protein complexes, fibrous proteins, edited by M. Balaban, pp. 183–195. Jerusalem: Alpha Press.
Schulz, G. E. & Schirmer, H. (1979). Principles of protein structure, ch. 5. New York: Springer Verlag.
Siddiqui, A. S. & Barton, G. J. (1995). Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci. 4, 872–884.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). A database of globular protein structural domains: clustering of representative family members into similar folds. Structure Fold. Des. 1, 209–220.
Swindells, M. B. (1995). A procedure for detecting structural domains in proteins. Protein Sci. 4, 103–112.
Tormo, J., Lamed, R., Chirino, A. J., Morag, E., Bayer, E. A., Shoham, Y. & Steitz, T. A. (1996). Crystal structure of a bacterial family-III cellulose-binding domain: a general mechanism for attachment to cellulose. EMBO J. 15, 5739–5751.
Wernisch, L., Hunting, M. & Wodak, J. (1999). Identification of structural domains in proteins by a graph heuristic. Proteins, 35, 338–352.
Wetlaufer, D. B. (1973). Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl Acad. Sci. USA, 70, 697–701.
Wodak, J. & Janin, J. (1981). Location of structural domains in proteins. Biochemistry, 20, 6544–6552.
Zehfus, M. H. (1994). Binary discontinuous compact protein domains. Protein Eng. 7, 335–340.
Zehfus, M. H. (1997). Identification of compact, hydrophobically stabilized domains and modules containing multiple peptide chains. Protein Sci. 6, 1210–1219.
Zehfus, M. H. & Rose, G. D. (1986). Compact units in proteins. Biochemistry, 25, 5759–5765.







































