Locating domains in 3D structures

Holm, L.; Sander, C.

doi:10.1107/97809553602060000714

International Tables for Crystallography
Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F. ch. 23.1, pp. 577-578

L. Holm and C. Sander

23.1.2. Locating domains in 3D structures

23.1.2.1. Introduction

Modular design is beneficial in many areas of life, including computer programming, manufacturing and even protein folding.

Protein-structure analysis has long operated with the notion of domains, i.e., dividing large structures into quasi-independent substructures or modules (Wetlaufer, 1973; Bork, 1992). In various contexts, these substructures are thought to fold autonomously, to carry specific molecular functions such as binding or catalysis, to move relative to each other as semi-rigid bodies and to speed the evolution of new functions by recombination (Fig. 23.1.2.1).

Figure 23.1.2.1

The structure of diphtheria toxin (Bennett & Eisenberg, 1994) beautifully illustrates domains as structural, functional and evolutionary units. Structurally, note the compact globular shape of each domain and the flexible linkers between them. Functionally, note how each domain carries out a different stage of infection by the bacterium: receptor binding, membrane penetration and ADP-ribosylation of the target protein. Evolutionarily, note the occurrence of domains homologous to the catalytic domain of diphtheria toxin in exo-, entero- and pertussis toxins, and in poly-ADP-ribose polymerase (Holm & Sander, 1999). Arrows point to recurrent substructures in structural neighbours (Lionetti et al., 1991; Li et al., 1996; Tormo et al., 1996) of each domain of diphtheria toxin. Drawn using MOLSCRIPT version 2 (Kraulis, 1991).

The problem of subdividing protein molecules into structural and functional units has received the attention of numerous researchers over the last 25 years. Early algorithms focused on protein folding or unfolding pathways and aimed at identifying substructures that would be physically stable on their own. Nowadays, with bulging macromolecular databases, the focus has shifted to devising automatic methods for identifying domains that can form the basis for a consistent protein-structure classification (Murzin et al., 1995; Orengo et al., 1997; Holm & Sander, 1999).

This review presents the concepts underlying computational methods for locating domains in 3D structures. Those interested in implementations are referred to the web services of the European Bioinformatics Institute and related sites.

23.1.2.2. Compactness

A variety of ingenious techniques have been invented for locating structural domains in 3D structures. These include inspection of distance maps, clustering, neighbourhood correlation, plane cutting, interface area minimization, specific volume minimization, searching for mechanical hinge points, maximization of compactness and maximization of buried surface area (Rossmann & Liljas, 1974; Rashin, 1976; Crippen, 1978; Nemethy & Scheraga, 1979; Rose, 1979; Schulz & Schirmer, 1979; Go, 1981; Lesk & Rose, 1981; Sander, 1981; Wodak & Janin, 1981; Zehfus & Rose, 1986; Kikuchi et al., 1988; Moult & Unger, 1991; Holm & Sander, 1994b; Zehfus, 1994; Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Holm & Sander, 1996; Sowdhamini et al., 1996; Zehfus, 1997; Holm & Sander, 1998; Jones et al., 1998; Wernisch et al., 1999).

Common to most approaches are the assumptions that folding units are compact and that the interactions between them are weak. These notions can be made quantitative, for example, by counting interatomic contacts and placing domain borders so that residues fall into groups with a minimal number of contacts between groups. The hierarchic organization of putative folding units can be inferred by starting from the complete structure and recursively cutting it (in silico) into smaller and smaller substructures. Alternatively, one may start from the residue or secondary-structure-element level and successively associate the most strongly interacting groups. The procedure involves two optimization problems.
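The contact-counting idea can be illustrated with a minimal sketch. It is not any of the published algorithms: it assumes one representative point per residue, an 8 Å contact cutoff, a single cut that yields two continuous segments, and a minimum segment size (without which cutting off a chain terminus would trivially minimize the count). The function names `contact_map` and `best_single_cut` and the toy coordinates are hypothetical.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix, True where two residues are in contact.

    coords: one (x, y, z) point per residue; the 8 A cutoff is an assumption.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


def best_single_cut(contacts, min_size=2):
    """Find the chain position k minimizing contacts between [0, k) and [k, n).

    min_size is an assumed guard against trivially small segments.
    Returns (k, number_of_crossing_contacts).
    """
    n = len(contacts)
    best_k, best_count = None, None
    for k in range(min_size, n - min_size + 1):
        count = sum(contacts[i][j] for i in range(k) for j in range(k, n))
        if best_count is None or count < best_count:
            best_k, best_count = k, count
    return best_k, best_count


# Toy structure: two compact lobes of four residues each, 20 A apart.
lobe_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (3.0, 3.0, 0.0)]
lobe_b = [(x + 20.0, y, z) for x, y, z in lobe_a]
cut, crossing = best_single_cut(contact_map(lobe_a + lobe_b))
print(cut, crossing)  # prints: 4 0
```

The cut falls between the two lobes because no contacts cross that boundary. Real methods refine this idea with normalized criteria and with cuts that allow discontinuous domains.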

The first optimization problem is algorithmic and concerns finding the optimal subdivisions. This problem is complicated by the possibility of the chain passing several times between domains (discontinuous domains). Without the constraint of sequential continuity, there is a combinatorial number of possibilities for dividing a set of residues into subsets (Zehfus, 1994). This hurdle has been overcome by fast heuristics (Holm & Sander, 1994b; Zehfus, 1997; Wernisch et al., 1999).
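To make the size of the search space concrete, the following count compares the number of candidate two-domain splits with and without the continuity constraint. The symbol n (number of residues) and the restriction to exactly two non-empty domains are simplifications introduced here for illustration.

```latex
\begin{align*}
N_{\text{continuous}}(n) &= n - 1 \\
N_{\text{unconstrained}}(n) &= 2^{\,n-1} - 1 \\
n = 100:\quad N_{\text{continuous}} &= 99, \qquad
N_{\text{unconstrained}} = 2^{99} - 1 \approx 6.3 \times 10^{29}
\end{align*}
```

A single chain cut can be placed at any of the n - 1 peptide bonds, whereas an unrestricted assignment of residues to two groups grows exponentially, which is why exhaustive enumeration is impractical.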

The second optimization problem concerns formulating physical criteria that distinguish between autonomous and nonautonomous folding units, i.e., defining termination criteria for recursive algorithms. Since compactness-related criteria do not have a clear bimodal distribution, domain-assignment algorithms (Holm & Sander, 1994b; Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Sowdhamini et al., 1996; Wernisch et al., 1999) use cutoff parameters that have been fine-tuned against an external reference set of domain definitions.
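The role of a cutoff parameter as a termination criterion can be sketched as follows. This is an illustrative toy, not one of the cited algorithms: the criterion (contacts crossing the cut divided by the internal contacts of the less-connected part), the default `max_ratio` of 0.2, the restriction to continuous segments and all function names are assumptions made for this example.

```python
def crossing_contacts(contacts, left, right):
    """Count contacts between two groups of residue indices."""
    return sum(contacts[i][j] for i in left for j in right)


def internal_contacts(contacts, group):
    """Count contacts within one group, each residue pair once."""
    return sum(contacts[i][j] for a, i in enumerate(group) for j in group[a + 1:])


def split_recursively(contacts, residues, max_ratio=0.2, min_size=2):
    """Recursively bisect a continuous segment until no cut passes the cutoff.

    A cut is accepted only if crossing / min(internal_left, internal_right)
    is at most max_ratio (an assumed, tunable termination criterion).
    """
    best = None
    for k in range(min_size, len(residues) - min_size + 1):
        left, right = residues[:k], residues[k:]
        inside = min(internal_contacts(contacts, left),
                     internal_contacts(contacts, right))
        cross = crossing_contacts(contacts, left, right)
        ratio = cross / inside if inside else float("inf")
        if best is None or ratio < best[0]:
            best = (ratio, left, right)
    if best is None or best[0] > max_ratio:
        return [residues]  # termination: treat the segment as one unit
    _, left, right = best
    return (split_recursively(contacts, left, max_ratio, min_size)
            + split_recursively(contacts, right, max_ratio, min_size))


def toy_contacts():
    """Three fully connected lobes of four residues, joined by single contacts."""
    n = 12
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and i // 4 == j // 4:
                contacts[i][j] = 1
    for i, j in [(3, 4), (7, 8)]:
        contacts[i][j] = contacts[j][i] = 1
    return contacts


contacts = toy_contacts()
print(split_recursively(contacts, list(range(12))))
# prints: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(len(split_recursively(contacts, list(range(12)), max_ratio=0.1)))
# prints: 1
```

With the looser cutoff the toy chain is split into its three lobes. With the stricter one it is left intact, which shows how strongly the resulting domain assignment depends on the chosen parameter.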

23.1.2.3. Recurrence

Most fold classifications use a hierarchical model in which evolutionary families are a subcategory of fold type, and it is natural to assume that domain boundaries should be conserved in evolution. Consistency concerns lead to a reformulation of the goals of the domain-assignment problem, away from (imprecise) physical models of stable folding units and towards recognizing such units phenomenologically in the database of known structures through recurrence. The concept of recurrence has long been the cornerstone of domain assignments by experts based on visual inspection (Richardson, 1981). Recurrence means recognizing architectural units in one protein that have already been defined (named) in another.

The practical importance of domain identification is illustrated by discoveries made through systematic structure comparison of recurrent domains: between histidine triad (HIT) proteins and galactose-1-phosphate uridylyltransferase [homodimer and internally duplicated common catalytic core, respectively (Holm & Sander, 1997)], and between beta-glucosyltransferase and glycogen phosphorylase [bare and heavily decorated common catalytic core, respectively (Holm & Sander, 1995; Artymiuk et al., 1995)]. In both cases the relationship was recognized even though the contours of the molecules look quite different.

Let us restate the goal of domain identification as an economic description of all known protein structures in terms of a small set of large substructures. This is an intuitive goal and conceptually related to the principle of minimal encoding in information theory. The key ingredients of the optimization problem are the gain associated with reusing a substructure and the cost associated with using many small substructures to describe a protein. An analogy in writing is that copying blocks of text is cheap, but for coherence some thought and effort are necessary for bridging the blocks.
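The trade-off between reuse and fragmentation can be made concrete with a toy cost model. It is an assumption made purely for illustration and is not the objective function of any published method: each distinct substructure is stored once at a cost equal to its size in residues, and every piece used in a description incurs a fixed bridging cost. The function name `description_cost`, the labels and the sizes are hypothetical.

```python
def description_cost(proteins, link_cost):
    """Toy encoding cost of a set of proteins.

    proteins: list of descriptions, each a list of (label, size) pieces.
    Cost = total size of the distinct substructures (stored once each)
           + link_cost for every piece used (assumed bridging cost).
    """
    dictionary = {}
    n_pieces = 0
    for pieces in proteins:
        for label, size in pieces:
            dictionary[label] = size
            n_pieces += 1
    return sum(dictionary.values()) + link_cost * n_pieces


# Three 200-residue proteins built from three 100-residue domains A, B and C.
whole = [[("AB", 200)], [("AC", 200)], [("BC", 200)]]
domains = [[("A", 100), ("B", 100)],
           [("A", 100), ("C", 100)],
           [("B", 100), ("C", 100)]]
# The same proteins described by ten 20-residue motifs of two types each.
motifs = [[("h", 20), ("e", 20)] * 5 for _ in range(3)]

for name, description in [("whole", whole), ("domains", domains), ("motifs", motifs)]:
    print(name, description_cost(description, link_cost=20))
# prints:
# whole 660
# domains 420
# motifs 640
```

With this bridging cost, the domain-level description is the most economical. Lowering `link_cost` to 5 makes the motif-level description cheapest (190 against 330 for domains), which illustrates why the weighting of reuse against fragmentation has to be chosen with care.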

With a suitably defined cost function, recurrence can be used to select an optimal set of substructures from the hierarchic folding or unfolding trees generated using compactness criteria. Thus, the unsatisfactorily solved problem of defining termination criteria for compactness algorithms can be turned into an optimization problem that does not rely on any external reference and leads to an internally consistent set of domain definitions.

The key difficulty is in quantifying the notion of economy so that it leads to a selection of substructures of 'appropriate' size, i.e., globular folds and not, for example, supersecondary-structure motifs. One solution, which is physical nonsense but has the desired qualitative behaviour, is a heuristic objective function used in the DALI domain dictionary (Holm & Sander, 1998). Recurrence is quantified in terms of the statistical significance of structural similarity for many pairs of substructures. The statistical significance is highest for structural similarities that involve large units and that completely cover a substructure unit. Exploiting these effects, a sum-of-pairs objective function is defined that favours recurrences of large substructures with distinct topological arrangements and packing of secondary-structure elements, and disfavours small substructures consisting of one or two secondary-structure elements despite their higher frequency of recurrence. Though other formulations of the optimization problem are possible, this empirically chosen objective function combined with a heuristic algorithm for optimization yields a useful set of substructures (domains).

23.1.2.4. Conclusion

While we do not foresee that automatically delineated domains will be accepted as the gold standard of the trade, modern methods, based on a combination of recurrence and compactness criteria, yield domain definitions that are consistent within protein families and often coincide with biologically functional units, recover the well known folding topologies with many members, produce clusters with good coverage of common secondary-structure elements, and provide a useful basis for large-scale structure analysis and classification.

References

Artymiuk, P. J., Rice, D. W., Poirrette, A. R. & Willett, P. (1995). Beta-glucosyltransferase and phosphorylase reveal their common theme. Nature Struct. Biol. 2, 117–120.

Bennett, M. J. & Eisenberg, D. (1994). Refined structure of monomeric diphtheria toxin at 2.3 Å resolution. Protein Sci. 3, 1464–1475.

Bork, P. (1992). Mobile modules and motifs. Curr. Opin. Struct. Biol. 2, 413–421.

Crippen, G. (1978). The tree structural organization of proteins. J. Mol. Biol. 126, 315–332.

Go, M. (1981). Correlation of DNA exonic regions with protein structural units in hemoglobin. Nature (London), 291, 90–92.

Holm, L. & Sander, C. (1994b). Parser for protein folding units. Proteins, 19, 256–268.

Holm, L. & Sander, C. (1995). Evolutionary link between glycogen phosphorylase and a DNA modifying enzyme. EMBO J. 14, 1287–1293.

Holm, L. & Sander, C. (1996). Mapping the protein universe. Science, 273, 595–602.

Holm, L. & Sander, C. (1997). Enzyme HIT. Trends Biochem. Sci. 22, 116–117.

Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.

Holm, L. & Sander, C. (1999). Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244–247.

Islam, S. A., Luo, J. & Sternberg, M. J. (1995). Identification and analysis of domains in proteins. Protein Eng. 8, 513–525.

Jones, S., Stewart, M., Michie, A. D., Swindells, M. B., Orengo, C. A. & Thornton, J. M. (1998). Domain assignment for protein structures using a consensus approach: characterisation and analysis. Protein Sci. 7, 233–242.

Kikuchi, T., Nemethy, G. & Scheraga, H. A. (1988). Prediction of the location of structural domains in globular proteins. J. Protein Chem. 7, 427–471.

Kraulis, P. J. (1991). MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950.

Lesk, A. M. & Rose, G. D. (1981). Folding units in globular proteins. Proc. Natl Acad. Sci. USA, 78, 4304–4308.

Li, M., Dyda, F., Benhar, I., Pastan, I. & Davies, D. R. (1996). Crystal structure of the catalytic domain of Pseudomonas exotoxin A complexed with a nicotinamide adenine dinucleotide analog: implications for the activation process and for ADP ribosylation. Proc. Natl Acad. Sci. USA, 93, 6902–6906.

Lionetti, C., Guanziroli, M. G., Frigerio, F., Ascenzi, P. & Bolognesi, M. (1991). X-ray crystal structure of the ferric sperm whale myoglobin: imidazole complex at 2.0 Å resolution. J. Mol. Biol. 217, 409–412.

Moult, J. & Unger, R. (1991). An analysis of protein folding pathways. Biochemistry, 30, 3816–3824.

Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of the protein database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.

Nemethy, G. & Scheraga, H. A. (1979). A possible folding pathway of bovine pancreatic RNase. Proc. Natl Acad. Sci. USA, 76, 6050–6054.

Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.

Rashin, A. A. (1976). Location of domains in globular proteins. Nature (London), 291, 85–87.

Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339.

Rose, G. D. (1979). Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134, 447–470.

Rossmann, M. & Liljas, A. (1974). Recognition of structural domains in globular proteins. J. Mol. Biol. 85, 177–181.

Sander, C. (1981). Physical criteria for folding units of globular proteins. In Structural aspects of recognition and assembly in biological macromolecules, Vol. I. Proteins and protein complexes, fibrous proteins, edited by M. Balaban, pp. 183–195. Jerusalem: Alpha Press.

Schulz, G. E. & Schirmer, H. (1979). Principles of protein structure, ch. 5. New York: Springer Verlag.

Siddiqui, A. S. & Barton, G. J. (1995). Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci. 4, 872–884.

Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). A database of globular protein structural domains: clustering of representative family members into similar folds. Structure Fold. Des. 1, 209–220.

Swindells, M. B. (1995). A procedure for detecting structural domains in proteins. Protein Sci. 4, 103–112.

Tormo, J., Lamed, R., Chirino, A. J., Morag, E., Bayer, E. A., Shoham, Y. & Steitz, T. A. (1996). Crystal structure of a bacterial family-III cellulose-binding domain: a general mechanism for attachment to cellulose. EMBO J. 15, 5739–5751.

Wernisch, L., Hunting, M. & Wodak, J. (1999). Identification of structural domains in proteins by a graph heuristic. Proteins, 35, 338–352.

Wetlaufer, D. B. (1973). Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl Acad. Sci. USA, 70, 697–701.

Wodak, J. & Janin, J. (1981). Location of structural domains in proteins. Biochemistry, 20, 6544–6552.

Zehfus, M. H. (1994). Binary discontinuous compact protein domains. Protein Eng. 7, 335–340.

Zehfus, M. H. (1997). Identification of compact, hydrophobically stabilized domains and modules containing multiple peptide chains. Protein Sci. 6, 1210–1219.

Zehfus, M. H. & Rose, G. D. (1986). Compact units in proteins. Biochemistry, 25, 5759–5765.