International Tables for Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 5.5, pp. 539-540

Section Ontology representation of macromolecular structure data

J. D. Westbrook (a)*, H. Yang (a), Z. Feng (a) and H. M. Berman (a)

(a) Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
* Corresponding author.


In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) assumed the management responsibilities for the PDB. One important outcome was the change in the underlying data representation used to process PDB data. The PDB now collects and processes data using a data representation based on a comprehensive ontology of macromolecular structure and experiment: the PDB exchange data dictionary. This representation is an extension of the mmCIF data dictionary, now the standard data representation for experimentally determined three-dimensional macromolecular structures. The dictionary and data files based on this data ontology (Westbrook & Bourne, 2000) are expressed using Self-defining Text Archival and Retrieval (STAR) syntax (Chapter 2.1).

Although the mmCIF dictionary was developed within the crystallographic community, the metadata model employed by mmCIF is quite general and has been adopted by other application domains including NMR, molecular modelling and molecular recognition (dictionaries are available at ). Within the crystallographic community, metadata dictionaries have also been developed for other types of diffraction experiments, electron-microscopy data and for the general description of image data. The metadata concepts and tools that have been developed to support mmCIF are sufficiently general that they may be applied to the description of data in virtually any application.

The demands of structural genomics projects have driven the development of extensions to capture an increased level of experimental detail. These are available at . Extensions have also been introduced to describe NMR, cryo-electron microscopy and all aspects of protein production. The ability to rapidly add extensions and incorporate these into the PDB data-processing system is an important feature for supporting the rapidly evolving technologies associated with high-throughput structure determinations.

The mmCIF metadata architecture is built from three levels, as illustrated in the figure below (see also Chapter 2.6). Individual data files are described at the top level [e.g. panel (a)]. The contents of these data files are defined by a data dictionary [e.g. panel (b)] in the next lower level (see Chapters 3.6 and 4.5). The attributes used in this data dictionary to build data definitions are in turn defined in the dictionary description language (DDL) [e.g. panel (c)] in the lowest level (see Chapters 2.6 and 4.10).


Figure. Files at different levels of the mmCIF metadata architecture. (a) mmCIF data file excerpt. (b) Example mmCIF data dictionary definition. (c) Example DDL dictionary attribute definition.

The major syntactical constructs used by mmCIF are illustrated in the data file example in panel (a) of the figure. Each data item or group of data items is preceded by an identifying keyword. Groups of related data items are organized into data categories. Two categories, CELL and ENTITY_POLY_SEQ, are shown in the example. CELL contains an individual instance describing a single set of crystallographic cell constants. ENTITY_POLY_SEQ contains a loop_ (i.e. table) of instances describing a polymer residue sequence. Essentially all mmCIF data are described as a set of tabular data structures.
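The tabular view of mmCIF data can be made concrete with a short Python sketch. The fragment below is an illustrative stand-in for the data file excerpt (the values are invented), and the parser is a deliberate simplification: it assumes whitespace-delimited values only and ignores quoted strings, multi-line text fields and save frames. The function names are hypothetical and do not belong to any PDB tool.

```python
# Illustrative mmCIF-style fragment: one single-instance category (cell)
# and one looped category (entity_poly_seq). Values are invented.
SAMPLE = """\
data_EXAMPLE
_cell.entry_id    EXAMPLE
_cell.length_a    58.39
_cell.length_b    86.70
_cell.length_c    46.27
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
1 1 PRO
1 2 GLN
1 3 ILE
"""


def is_keyword(token):
    """True for tokens that start a new construct rather than carry a value."""
    return token.startswith(("_", "data_")) or token == "loop_"


def parse_simple_mmcif(text):
    """Parse a simplified fragment into {category: [row dict, ...]}."""
    tokens = [tok for line in text.splitlines()
              if not line.lstrip().startswith("#")
              for tok in line.split()]
    tables = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("data_"):
            i += 1  # data block header; the block name is not used here
        elif token == "loop_":
            i += 1
            names = []
            while i < len(tokens) and tokens[i].startswith("_"):
                names.append(tokens[i])
                i += 1
            values = []
            while i < len(tokens) and not is_keyword(tokens[i]):
                values.append(tokens[i])
                i += 1
            category = names[0][1:].split(".", 1)[0]
            attributes = [name.split(".", 1)[1] for name in names]
            rows = tables.setdefault(category, [])
            width = len(attributes)
            for start in range(0, len(values), width):
                rows.append(dict(zip(attributes, values[start:start + width])))
        elif token.startswith("_"):
            # A keyword-value pair is treated as a one-row table.
            category, attribute = token[1:].split(".", 1)
            rows = tables.setdefault(category, [{}])
            rows[0][attribute] = tokens[i + 1]
            i += 2
        else:
            raise ValueError(f"unexpected token {token!r}")
    return tables


tables = parse_simple_mmcif(SAMPLE)
print(tables["cell"][0]["length_a"])
print([row["mon_id"] for row in tables["entity_poly_seq"]])
```

Treating the single-instance category as a one-row table keeps both constructs in the same shape, which mirrors the remark that essentially all mmCIF data are tabular.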

Each mmCIF data item is defined in a data dictionary. Data definitions are given between save-frame delimiters (i.e. save_); apart from this, the data definitions share the same simple syntax as used in data files. An example definition for a crystallographic cell constant is shown in panel (b) of the figure. Many features of the cell constant are described in this definition, including data type, range restrictions, units of expression, dependent quantities, related definitions, necessity and related precision estimate. Although not shown in this example, dictionary definitions can also include parent–child relationships that have important consequences in maintaining data consistency.
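How a definition of this kind might drive validation can be sketched as follows. The Python dictionary below is a hypothetical, much reduced stand-in for a save-frame definition of a cell length: it keeps only the data type, the range restriction, the units and the necessity flag, and the key names are chosen for illustration rather than taken from the dictionary itself.

```python
# Hypothetical, reduced stand-in for a dictionary definition of a cell length.
CELL_LENGTH_A = {
    "name": "_cell.length_a",
    "type": "float",
    "minimum": 0.0,        # lower bound of the allowed range
    "maximum": None,       # None means no upper bound
    "units": "angstroms",
    "mandatory": False,
}


def check_value(definition, raw):
    """Return a list of problems found when checking one raw string value."""
    name = definition["name"]
    # In CIF, '?' marks an unknown value and '.' an inapplicable one.
    if raw in ("?", "."):
        if definition["mandatory"]:
            return [f"{name}: mandatory item has no value"]
        return []
    problems = []
    if definition["type"] == "float":
        try:
            value = float(raw)
        except ValueError:
            return [f"{name}: {raw!r} is not a valid float"]
        low, high = definition["minimum"], definition["maximum"]
        if low is not None and value < low:
            problems.append(f"{name}: {value} is below the minimum {low}")
        if high is not None and value > high:
            problems.append(f"{name}: {value} is above the maximum {high}")
    return problems


print(check_value(CELL_LENGTH_A, "58.39"))
print(check_value(CELL_LENGTH_A, "-1.0"))
```

A fuller treatment would also follow the dependent quantities and parent–child relationships mentioned above; those are omitted here to keep the sketch short.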

The attributes of each data definition are defined in the DDL dictionary. Panel (c) of the figure shows example DDL definitions describing data types. DDL definitions have the same syntax as definitions used in the data dictionary. Because the attributes of the DDL are also used in DDL definitions, this metadata architecture is described as self-defining.
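The self-defining property can be illustrated with a toy model in Python. Here a three-attribute "DDL" and a one-item "dictionary" are both written as plain mappings from item name to attribute values; the contents are invented for illustration and are far smaller than any real DDL. The helper function (a hypothetical name) reports any attribute that a set of definitions uses but that the DDL does not define.

```python
# Toy DDL: each attribute is itself defined using the same three attributes.
DDL = {
    "_item.name": {
        "_item.name": "_item.name",
        "_item.mandatory_code": "yes",
        "_item_type.code": "name",
    },
    "_item.mandatory_code": {
        "_item.name": "_item.mandatory_code",
        "_item.mandatory_code": "yes",
        "_item_type.code": "code",
    },
    "_item_type.code": {
        "_item.name": "_item_type.code",
        "_item.mandatory_code": "yes",
        "_item_type.code": "code",
    },
}

# Toy data dictionary: a single definition built from DDL attributes.
DICTIONARY = {
    "_cell.length_a": {
        "_item.name": "_cell.length_a",
        "_item.mandatory_code": "no",
        "_item_type.code": "float",
    },
}


def undefined_attributes(definitions, ddl):
    """Return the attributes used in definitions that the DDL does not define."""
    return {attribute
            for definition in definitions.values()
            for attribute in definition
            if attribute not in ddl}


# The data dictionary is checked against the DDL ...
print(undefined_attributes(DICTIONARY, DDL))
# ... and the DDL is checked against itself in exactly the same way.
print(undefined_attributes(DDL, DDL))
```

Both calls return an empty set, which is the sense in which the lowest level closes on itself: no fourth level is needed to describe the DDL.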

The RCSB PDB distributes parsing tools that support all three levels of this metadata architecture ( ). The CIFPARSE_OBJ package (Tosic & Westbrook, 2000) provides high-level methods to read, write, validate and manage data from data files, dictionaries and DDLs. Data files can be validated relative to an input data dictionary, and dictionary files can be validated relative to an input DDL. CIFPARSE_OBJ stores information in a collection of table objects. Access methods are provided to search and manipulate the table objects. A companion package, CIFOBJ (Schirripa & Westbrook, 1996), provides an alternative representation of dictionary and DDL data. CIFOBJ organizes dictionary information into a collection of category and item-level objects. Access methods are provided for all dictionary attributes.
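The idea of holding each category in a table object with search and update methods can be sketched in a few lines of Python. This is only an illustration of the general design; the class and method names are hypothetical and do not reproduce the CIFPARSE_OBJ or CIFOBJ interfaces.

```python
class Table:
    """A minimal table object: named columns plus a list of rows."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = list(columns)
        self.rows = []

    def add_row(self, values):
        """Append one row; the number of values must match the columns."""
        if len(values) != len(self.columns):
            raise ValueError("row length does not match the column count")
        self.rows.append(dict(zip(self.columns, values)))

    def search(self, **criteria):
        """Return the indices of rows whose columns equal all given values."""
        return [index for index, row in enumerate(self.rows)
                if all(row.get(column) == value
                       for column, value in criteria.items())]

    def update(self, row_index, column, value):
        """Change one cell of an existing row."""
        if column not in self.columns:
            raise KeyError(column)
        self.rows[row_index][column] = value


# Invented example rows for a polymer sequence category.
seq = Table("entity_poly_seq", ["entity_id", "num", "mon_id"])
seq.add_row(["1", "1", "PRO"])
seq.add_row(["1", "2", "GLN"])
seq.add_row(["1", "3", "ILE"])

print(seq.search(mon_id="GLN"))
print(len(seq.search(entity_id="1")))
```

Returning row indices rather than copies of rows lets a caller pass a search result straight to an update, which is one plausible way to combine the search and manipulation roles described above.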


References

Schirripa, S. & Westbrook, J. D. (1996). CIFOBJ. A class library of mmCIF access tools. Reference guide.
Tosic, O. & Westbrook, J. D. (2000). CIFParse. A library of access tools for mmCIF. Reference guide.
Westbrook, J. & Bourne, P. E. (2000). STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics, 16, 159–168.
