Representing macromolecular structure data

Westbrook, J. D.; Yang, H.; Feng, Z.; Berman, H. M.

doi:10.1107/97809553602060000755

RELATED SITES: IUCr | IUCr Journals

International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 5.5, pp. 539-541

Section 5.5.2. Representing macromolecular structure data

J. D. Westbrook,^a ^* H. Yang,^a Z. Feng^a and H. M. Berman^a

^a Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
Correspondence e-mail: [email protected]

5.5.2. Representing macromolecular structure data

| top | pdf |

Macromolecular structure data have historically been represented in a simple record-oriented format developed by the PDB; this format has been widely used in structural and computational biology. While this PDB format has in general been adequate for representing coordinate data, it has proved less satisfactory for the description of related information such as chemical and biological features and experimental methodology. To provide a more rigorous data encoding that includes all of this related information, the Protein Data Bank has in recent years adopted a comprehensive ontology of structure and experiment based on the content of the mmCIF data dictionary.

5.5.2.1. PDB format

| top | pdf |

For the past 30 years, the PDB has served as the single central repository for macromolecular structure data. The data format used to store archival entries in the PDB is a column-oriented data format resembling many data formats developed to accommodate the limitations of paper punched-card technology (see Chapter 1.1 ). An example of the data format is shown in Fig. 5.5.2.1.

Figure 5.5.2.1 | top | pdf |

Excerpt of records from a PDB data file.

Many of the data records in this format are prefixed with a record tag (e.g. CRYST1, ATOM) followed by individual items of data. The specifications for the records in this data format are described informally by Callaway et al. (1996). In addition to the labelled records as in Fig. 5.5.2.1, many data records in the PDB format are presented as unstructured or only semi-structured remark records.

5.5.2.2. Ontology representation of macromolecular structure data

| top | pdf |

In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) assumed the management responsibilities for the PDB. One important outcome was the change in the underlying data representation used to process PDB data. The PDB now collects and processes data using a data representation based on a comprehensive ontology of macromolecular structure and experiment: the PDB exchange data dictionary. This representation is an extension of the mmCIF data dictionary, now the standard data representation for experimentally determined three-dimensional macromolecular structures. The dictionary and data files based on this data ontology (Westbrook & Bourne, 2000) are expressed using Self-defining Text Archival and Retrieval (STAR) syntax (Chapter 2.1 ).

Although the mmCIF dictionary was developed within the crystallographic community, the metadata model employed by mmCIF is quite general and has been adopted by other application domains including NMR, molecular modelling and molecular recognition (dictionaries are available at http://mmcif.pdb.org/ ). Within the crystallographic community, metadata dictionaries have also been developed for other types of diffraction experiments, electron-microscopy data and for the general description of image data. The metadata concepts and tools that have been developed to support mmCIF are sufficiently general that they may be applied to the description of data in virtually any application.

The demands of structural genomics projects have driven the development of extensions to capture an increased level of experimental detail. These are available at http://mmcif.pdb.org/ . Extensions have also been introduced to describe NMR, cryo-electron microscopy and all aspects of protein production. The ability to rapidly add extensions and incorporate these into the PDB data-processing system is an important feature for supporting the rapidly evolving technologies associated with high-throughput structure determinations.

The mmCIF metadata architecture is built from three levels as illustrated in Fig. 5.5.2.2 (see also Chapter 2.6 ). Individual data files are described at the top level (e.g. Fig. 5.5.2.2a). The contents of these data files are defined by a data dictionary (e.g. Fig. 5.5.2.2b) in the next lower level (see Chapters 3.6 and 4.5 ). The attributes used in this data dictionary to build data definitions are in turn defined in the dictionary description language (DDL) (e.g. Fig. 5.5.2.2c) in the lowest level (see Chapters 2.6 and 4.10 ).

Figure 5.5.2.2 | top | pdf |

Files at different levels of the mmCIF metadata architecture. (a) mmCIF data file excerpt. (b) Example mmCIF data dictionary definition. (c) Example DDL dictionary attribute definition.

The major syntactical constructs used by mmCIF are illustrated in the data file example of Fig. 5.5.2.2(a). Each data item or group of data items is preceded by an identifying keyword. Groups of related data items are organized into data categories. Two categories, CELL and ENTITY_POLY_SEQ, are shown in the example. CELL contains an individual instance describing a single set of crystallographic cell constants. ENTITY_POLY_SEQ contains a loop_ (i.e. table) of instances describing a polymer residue sequence. Essentially all mmCIF data are described as a set of tabular data structures.

Each mmCIF data item is defined in a data dictionary. Data definitions are given between save-frame delimiters (i.e. save_); apart from this, the data definitions share the same simple syntax as used in data files. An example definition for a crystallographic cell constant is shown in Fig. 5.5.2.2(b). Many features of the cell constant are described in this definition, including data type, range restrictions, units of expression, dependent quantities, related definitions, necessity and related precision estimate. Although not shown in this example, dictionary definitions can also include parent–child relationships that have important consequences in maintaining data consistency.

The attributes of each data definition are defined in the DDL dictionary. Fig. 5.5.2.2(c) shows example DDL definitions describing data types. DDL definitions have the same syntax as definitions used in the data dictionary. Because the attributes of the DDL are also used in DDL definitions, this metadata architecture is described as self-defining.

The RCSB PDB distributes parsing tools that support all three levels of this metadata architecture (http://sw-tools.pdb.org/ ). The CIFPARSE_OBJ package (Tosic & Westbrook, 2000) provides high-level methods to read, write, validate and manage data from data files, dictionaries and DDLs. Data files can be validated relative to an input data dictionary, and dictionary files can be validated relative to an input DDL. CIFPARSE_OBJ stores information in a collection of table objects. Access methods are provided to search and manipulate the table objects. A companion package, CIFOBJ (Schirripa & Westbrook, 1996), provides an alternative representation of dictionary and DDL data. CIFOBJ organizes dictionary information into a collection of category and item-level objects. Access methods are provided for all dictionary attributes.

5.5.2.3. Supporting other data formats and data delivery methods

| top | pdf |

One of the greatest benefits of a dictionary-based informatics infrastructure is the flexibility that it provides in supporting alternative data formats and delivery methods. Because the data and all of their defining attributes are electronically encoded, translation between data and dictionary formats can be achieved using light-weight software filters without loss of any information.

XML provides a particularly good example of the ease with which data can be converted to and from the mmCIF format. XML translations of mmCIF data files are currently provided on the Worldwide PDB ftp site (ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/XML/ ). These XML files use mmCIF dictionary data-item names as XML tags. These files were created by a translation tool (http://sw-tools.pdb.org/apps/MMCIF-XML-UTIL/ ) that translates mmCIF data files to XML in compliance with an XML schema. The XML schema is similarly software-translated from the PDB exchange data dictionary.

Other delivery methods such as Corba (http://www.omg.org/cgi-bin/doc?lifesci/00–02-02 ) do not require a data format, as data are exchanged using an application program interface (API). A Corba API for macromolecular structure (Greer et al., 2002) based on the content of the mmCIF data dictionary has been approved by the Object Management Group (OMG). Software tools supporting this Corba API (OpenMMS, http://openmms.sdsc.edu , and FILM, http://sw-tools.pdb.org/apps/FILM ) take full advantage of the data dictionary in building the interface definitions and supporting server on which the API is based (see also Section 5.3.8.2 ).

References

Callaway, J., Cummings, M., Deroski, B., Esposito, P., Forman, A., Langdon, P., Libeson, M., McCarthy, J., Sikora, J., Xue, D., Abola, E., Bernstein, F., Manning, N., Shea, R., Stampf, D. & Sussman, J. (1996). Protein Data Bank contents guide: Atomic coordinate entry format description. Brookhaven National Laboratory, New York, USA. Available from http://www.wwpdb.org/documentation/PDB_format_1996.pdf .Google Scholar

Greer, D. S., Westbrook, J. D. & Bourne, P. E. (2002). An ontology driven architecture for derived representations of macromolecular structure. Bioinformatics, 18, 1280–1281.Google Scholar

Schirripa, S. & Westbrook, J. D. (1996). CIFOBJ. A class library of mmCIF access tools. Reference guide. http://sw-tools.pdb.org/apps/CIFOBJ/cifobj/index.html .Google Scholar

Tosic, O. & Westbrook, J. D. (2000). CIFParse. A library of access tools for mmCIF. Reference guide. http://sw-tools.pdb.org/apps/CIFPARSE-OBJ/cifparse/index.html .Google Scholar

Westbrook, J. & Bourne, P. E. (2000). STAR/mmCIF: an ontology for macromolecular structure. Bioinformatics, 16, 159–168.Google Scholar

International Tables for Crystallography (2006). Vol. G. ch. 5.5, pp. 539-541