Classification and use of macromolecular data

Fitzgerald, P. M. D.; Westbrook, J. D.; Bourne, P. E.; McMahon, B.; Watenpaugh, K. D.; Berman, H. M.

doi:10.1107/97809553602060000738

International Tables for Crystallography, Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon


International Tables for Crystallography (2006). Vol. G. ch. 3.6, pp. 195-197
https://doi.org/10.1107/97809553602060000738

Appendix A3.6.2. The Protein Data Bank exchange data dictionary

J. D. Westbrook,^b K. Henrick,^g E. L. Ulrich^h and H. M. Berman^f

In developing a data-management infrastructure, the Protein Data Bank (PDB; Berman et al., 2000) has chosen the mmCIF dictionary technology for describing the data that it collects and disseminates. To accommodate the growth in the PDB's activities, data collection, processing and annotation now occur at three sites worldwide: the Research Collaboratory for Structural Bioinformatics (RCSB/PDB), the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI) and the Protein Data Bank Japan (PDBj) at Osaka. Together these facilities form the Worldwide PDB (wwPDB; Berman et al., 2003). To maintain the fidelity of the single archive of three-dimensional macromolecular structures, a precise content description is required to support both the accurate exchange of data among the different sites and the exchange of information between different file formats.

A key strength of the mmCIF technology is the extensibility afforded by a framework based on a software-accessible data dictionary. The PDB has exploited this functionality by using the mmCIF dictionary as a foundation and supplementing it with extensions in order to describe all aspects of data processing and database operations.

These extensions include content required to support reversible format translation, noncrystallographic structure determination methods and the details of protein production. They also support recommendations by the International Union of Crystallography (IUCr) and the International Structural Genomics Organization (ISGO) as to which data should be deposited. The following sections describe the extensions to the mmCIF data dictionary developed by the PDB (http://mmcif.pdb.org/).

A3.6.2.1. Data exchange and format translation


The majority of crystallographic and structural concepts embodied in the PDB are already well described in the mmCIF data dictionary. However, although the mmCIF dictionary provides a conceptual description of most of the crystallographic information found in PDB-format files, the precise representation of this information can differ subtly between the two formats. To guarantee accurate data exchange and to facilitate reversible format translation between PDB and mmCIF formats, all such differences in representation must be resolved.

To accommodate content and semantic differences between formats, extensions to the dictionary have been created. These extensions take one of two forms: the addition of new definitions to existing categories or the creation of new categories. Where possible, extensions are added to existing categories. This is done when the new definition supplements the content of the category without changing the category definition or its fundamental organization. However, if a new definition cannot be added to an existing category, a new category is created to hold the extension. All new data items and categories include the prefix pdbx_ in their names.
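
The naming convention described above can be illustrated with a short sketch. The Python below is not part of the PDB exchange dictionary or of any PDB software; the function names `split_item_name` and `classify_extension` are hypothetical, and the sketch assumes only that an mmCIF data name has the form `_category.item` and that names are compared without regard to case. The data names in the usage lines are chosen for illustration.

```python
from typing import Tuple


def split_item_name(name: str) -> Tuple[str, str]:
    """Split an mmCIF data name of the form '_category.item' into its parts."""
    if not name.startswith("_") or "." not in name:
        raise ValueError(f"not an mmCIF data name: {name!r}")
    category, item = name[1:].split(".", 1)
    return category, item


def classify_extension(name: str) -> str:
    """Report which of the two forms of extension a data name appears to use.

    Assumption: the pdbx_ prefix is the only signal examined here.
    """
    category, item = split_item_name(name)
    if category.lower().startswith("pdbx_"):
        return "new category"
    if item.lower().startswith("pdbx_"):
        return "new item in existing category"
    return "no pdbx_ prefix"


# Illustrative data names only.
print(classify_extension("_atom_site.pdbx_PDB_ins_code"))
print(classify_extension("_pdbx_struct_sheet_hbond.sheet_id"))
print(classify_extension("_atom_site.Cartn_x"))
```

A name without the prefix is not necessarily part of the original mmCIF dictionary, so the third label is deliberately worded as the absence of a prefix rather than as a statement about where the item was defined.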

For example, the level of detail in the PDB description of the biological source exceeds the description provided by mmCIF. In this case, dictionary extensions have been added to the existing categories ENTITY_SRC_NAT and ENTITY_SRC_GEN (where 'nat' and 'gen' stand for naturally occurring and genetically engineered, respectively). The PDB description of atomic coordinates includes two items that are not described in mmCIF: the insertion code and the model number. These have been added to the mmCIF category ATOM_SITE (as _atom_site.pdbx_PDB_ins_code and _atom_site.pdbx_PDB_model_num) and to all related categories that include atom nomenclature.
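
To make the two ATOM_SITE extensions concrete, the following sketch reads a small, invented ATOM_SITE loop and recovers a PDB-style residue key (model number, residue number, insertion code) for each atom. It is a minimal illustration under stated assumptions rather than a real mmCIF parser: the helper names `read_simple_loop` and `pdb_residue_key` are hypothetical, the sample values are made up, and the reader handles only a single loop with whitespace-separated, unquoted values, treating `?` and `.` as an absent insertion code.

```python
from typing import Dict, List, Tuple

# Invented sample data; only the two pdbx_ item names come from the text above.
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.pdbx_PDB_model_num
ATOM 1 CA GLY 27 ? 1
ATOM 2 CA SER 27 A 1
ATOM 3 CA GLY 27 ? 2
"""


def read_simple_loop(text: str) -> List[Dict[str, str]]:
    """Read one loop_ block with unquoted, whitespace-separated values."""
    names: List[str] = []
    values: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_") and not values:
            names.append(line)          # column header
        else:
            values.extend(line.split())  # data values
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the column count")
    width = len(names)
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


def pdb_residue_key(row: Dict[str, str]) -> Tuple[int, int, str]:
    """Return (model number, residue number, insertion code) for one atom row."""
    ins = row["_atom_site.pdbx_PDB_ins_code"]
    ins = "" if ins in ("?", ".") else ins
    return (int(row["_atom_site.pdbx_PDB_model_num"]),
            int(row["_atom_site.auth_seq_id"]),
            ins)


rows = read_simple_loop(SAMPLE)
for row in rows:
    print(row["_atom_site.id"], pdb_residue_key(row))
```

In this sample the first two atoms share residue number 27 and model 1 and are distinguished only by the insertion code, while the third differs only in model number, which is why both items are needed for a reversible translation to and from the PDB format.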

The convention for defining the hydrogen bonding in β-sheets differs between the PDB and mmCIF representations. Because the PDB model is fundamentally different from that found in mmCIF, a new category was created to hold the PDB data: PDBX_STRUCT_SHEET_HBOND. The correspondence between the PDB and mmCIF formats is tabulated at http://deposit.pdb.org/mmCIF/dictionaries/pdb-correspondence/pdb2mmcif.html.

A3.6.2.2. Extensions for structural genomics


An International Task Force on Deposition, Archiving, and Curation of Primary Information for Structural Genomics was formed under the auspices of the International Structural Genomics Organization (ISGO) in 2001 (Berman, 2001) and was asked to develop specifications for data from structural genomics projects to be deposited with the PDB. The recommendations of this task force are summarized at http://deposit.pdb.org/mmcif/sg-data/xstal.html and http://deposit.pdb.org/mmcif/sg-data/nmr.html. For data from crystallography-based projects, the content extensions focus largely on a more detailed description of phasing, tracing and density modification. All of the ISGO recommendations have been incorporated into the PDB exchange dictionary.

A3.6.2.3. Noncrystallographic methods


The IUCr-sponsored development of data dictionaries has been focused exclusively on crystallographic methods. As the repository for all three-dimensional macromolecular structure data, the PDB accepts structures determined using noncrystallographic techniques such as NMR and cryo-electron microscopy. The description of noncrystallographic methods is beyond the remit of the IUCr, so the PDB has worked with the NMR and cryo-electron microscopy communities to develop data dictionaries that describe these techniques within the mmCIF framework.

A3.6.2.3.1. NMR


The PDB exchange dictionary includes a description of NMR sample preparation, structure solution methodology, refinement and refinement metrics. These extensions were developed in collaboration with the BioMagResBank (BMRB; Ulrich et al., 1989). The BMRB is the archive for experimental NMR data for biological macromolecules and has played an active role in the development of the mmCIF data dictionary. In selecting a format for archiving NMR data, the BMRB opted to use the STAR syntax (Hall, 1991) rather than the more restrictive CIF syntax. Despite this difference in syntax, the conceptual representation of macromolecular structure in the NMR dictionary (NMRStar) has remained semantically very close to the mmCIF representation. This has facilitated the exchange of data and dictionaries between the BMRB and the PDB, the sharing of software tools, and the development of a common platform for depositing data.

A3.6.2.3.2. Cryo-electron microscopy


Cryo-electron microscopy, as a technique for determining the structure of large molecular assemblies, is also described in the PDB exchange dictionary. The data extensions for cryo-electron microscopy include a description of the sample preparation, raw volume data (Henrick et al., 2003), structure solution and refinement. These extensions have a prefix of em_ (http://mmcif.pdb.org/dictionaries/mmcif_iims.dic/Index/).

A3.6.2.3.3. Protein production


The International Task Force on Deposition, Archiving, and Curation of Primary Information for Structural Genomics (Section A3.6.2.2) has also provided recommendations for the deposition of information about protein production. These recommendations are summarized at http://deposit.pdb.org/mmcif/sg-data/protprod.html. These data extensions have been used as the foundation for the Protein Expression Purification and Crystallization database (PEPCdb; http://pepcdb.pdb.org/) and for the protein production process model developed to support the Structural Proteomics in Europe initiative (SPINE; http://www.spineurope.org/).

A3.6.2.4. Supporting software


The RCSB/PDB has developed a set of software tools that support the PDB exchange dictionary framework (Chapter 5.5). These include PDB_EXTRACT, a tool to extract data from the output files of structure determination applications; ADIT, a web-based editor for data files based on the PDB exchange dictionary; and CIFTr, a translator from mmCIF to PDB format. These applications and other supporting utilities can be downloaded from http://sw-tools.pdb.org/.

References

Berman, H. M. (2001). Chair. Report of the task force on the deposition, archiving, and curation of the primary information. Task Force Reports from the Second International Structural Genomics Meeting, Airlie, Virginia, USA. http://www.nigms.nih.gov/news/reports/airlie_tasks.html.

Berman, H. M., Henrick, K. & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nature Struct. Biol. 10, 980.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242.

Hall, S. R. (1991). The STAR file: a new format for electronic data transfer and archiving. J. Chem. Inf. Comput. Sci. 31, 326–333.

Henrick, K., Newman, R., Tagari, M. & Chagoyen, M. (2003). EMDep: a web-based system for the deposition and validation of high-resolution electron microscopy macromolecular structural information. J. Struct. Biol. 144, 228–237.

Ulrich, E. L., Markley, J. L. & Kyogoku, Y. (1989). Creation of a nuclear magnetic resonance data repository and literature database. Protein Seq. Data Anal. 2, 23–37.
