International Tables for Crystallography Volume G: Definition and exchange of crystallographic data. Edited by S. R. Hall and B. McMahon. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. G. ch. 3.6, pp. 190-194
Section 3.6.8. Publication
P. M. D. Fitzgerald (a)*, J. D. Westbrook (b), P. E. Bourne (c), B. McMahon (d), K. D. Watenpaugh (e) and H. M. Berman (f)
(a) Merck Research Laboratories, Rahway, New Jersey, USA
(b) Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, New Jersey, USA
(c) Research Collaboratory for Structural Bioinformatics, San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
(d) International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
(e) Retired; formerly Structural, Analytical and Medicinal Chemistry, Pharmacia Corporation, Kalamazoo, Michigan, USA
(f) Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, New Jersey, USA
The results of the determination of the crystal structure of a biological macromolecule might be published in an academic journal and/or deposited in a structural database. The data items in the core CIF dictionary cover most of the requirements for constructing an article for publication from an mmCIF, and the many well-defined data fields in mmCIF allow an extensively annotated record of the structure to be deposited in a database. However, the formalism of two of the core CIF categories for publication did not fit the relational database model of mmCIF, so new categories were required. The core CIF category COMPUTING, which is used to list the programs used to determine the structure, is replaced by the mmCIF category SOFTWARE, and the core CIF category DATABASE, which is used to identify the records associated with the structure in various databases, is replaced by the mmCIF category DATABASE_2.
The category groups discussed here are: the CITATION group, which is used to give citations to the literature (Section 3.6.8.1); the COMPUTING group, which is used to cite software (Section 3.6.8.2); the DATABASE group for citing related database entries (Section 3.6.8.3), which includes a group of categories used to ensure compatibility with specific database records in the Protein Data Bank (Section 3.6.8.3.2); journal administration categories that might be used by a publisher (Section 3.6.8.4.1); and the PUBL family of categories used to store the text of an article for publication (Section 3.6.8.4.2).
The categories describing literature citations are CITATION, CITATION_AUTHOR and CITATION_EDITOR.
Data items in these categories are as follows:
The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_).
The original core CIF dictionary contained the data item _publ_section_references for citations of journal articles, book chapters and monographs. The authors of the mmCIF dictionary felt that a more detailed and structured approach to literature citations was required. This is provided by the mmCIF categories CITATION, CITATION_AUTHOR and CITATION_EDITOR. These categories were subsequently included in the core CIF dictionary and are used in the same way in both dictionaries. Section 3.2.5.1 may be consulted for details. Although _publ.section_references remains a valid mmCIF data item, it is expected that the CITATION, CITATION_AUTHOR and CITATION_EDITOR categories will be used for literature citations in mmCIFs.
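The relationship between the two spellings of a data name can be illustrated with a minimal Python sketch. It assumes only the rule stated above, that an aliased mmCIF data name is converted to its core CIF form by changing the full stop to an underscore. The function name core_alias is hypothetical, and the function does not check whether the dictionary actually defines an alias for a given item.

```python
def core_alias(mmcif_name: str) -> str:
    """Return the core CIF spelling of an aliased mmCIF data name.

    Hypothetical helper: it applies the full-stop-to-underscore rule only and
    does not check whether the dictionary really defines an alias for the item.
    """
    category, stop, item = mmcif_name.partition(".")
    if not stop or not item:
        raise ValueError(f"not a category.item data name: {mmcif_name!r}")
    return f"{category}_{item}"


# The pair of names quoted in the text.
alias = core_alias("_publ.section_references")
```

Here alias holds the core CIF name _publ_section_references.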
The categories describing software citations are COMPUTING and SOFTWARE.
It is expected that citations of software packages in an mmCIF will be made using data items in the SOFTWARE category. However, in some cases, a particular publisher or database may require that this information is given using data items in the COMPUTING category instead (see Section 3.2.5.2 for details).
Data items in these categories are as follows:
The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_).
The data item _computing.entry_id has been added to the COMPUTING category to provide the formal category key required by the DDL2 data model.
The data items in the SOFTWARE category are used to cite the software packages used in the structure analysis. The software can be described in great detail if necessary. However, for most applications a small subset of these data items, for example just _software.name and _software.version, could be used (see Example 3.6.8.1).
Most data items in the SOFTWARE category are self-explanatory, but a few require further comment. The data item _software.citation_id provides a way to link the details of a program to the citation of an article in the literature that describes the program; this data item must match a value of _citation.id in the CITATION category. The name and e-mail address of the author of the software can also be given using _software.contact_author and _software.contact_author_email, respectively. (This may be the original author or someone who subsequently modifies or maintains the software; these data items would generally refer to the person most closely associated with the maintenance of the code at the time it was used.) The release date of the software may be recorded in _software.date. As far as possible, the date should be that of the version recorded in _software.version. The data item _software.location may be used to supply a URL from which the software may be downloaded or where it is described in detail.
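The requirement that _software.citation_id match a value of _citation.id can be checked mechanically. The following is a minimal sketch in Python, assuming that the SOFTWARE and CITATION categories have already been read into lists of dictionaries keyed by item name. The function name, the row representation and the program and citation identifiers are all invented for illustration.

```python
MISSING = (None, "?", ".")  # CIF placeholders for unknown / inapplicable values


def unmatched_software_citations(software_rows, citation_rows):
    """Return names of programs whose citation_id has no matching _citation.id."""
    citation_ids = {row["id"] for row in citation_rows}
    unmatched = []
    for row in software_rows:
        citation_id = row.get("citation_id")
        if citation_id in MISSING:
            continue  # no citation was given, so there is nothing to check
        if citation_id not in citation_ids:
            unmatched.append(row["name"])
    return unmatched


# Invented example data: three programs, one pointing at an absent citation.
citations = [{"id": "primary"}, {"id": "2"}]
software = [
    {"name": "PROGRAM_A", "version": "1.0", "citation_id": "2"},
    {"name": "PROGRAM_B", "version": "3.2", "citation_id": "7"},
    {"name": "PROGRAM_C", "version": "0.9"},
]
problems = unmatched_software_citations(software, citations)
```

In this invented data set only PROGRAM_B is reported, because its citation_id has no counterpart in the CITATION rows; PROGRAM_C is skipped because it gives no citation at all.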
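Categories describing related database entries are DATABASE and DATABASE_2, together with the DATABASE_PDB_CAVEAT, DATABASE_PDB_MATRIX, DATABASE_PDB_REMARK, DATABASE_PDB_REV, DATABASE_PDB_REV_RECORD and DATABASE_PDB_TVECT categories used for compatibility with Protein Data Bank records.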
The purpose of entries in the DATABASE category group is to provide pointers that link the mmCIF to all database entries that result from the deposition of the file. For mmCIF, the relevant category is DATABASE_2, which replaces the DATABASE category of the core dictionary.
Note the distinction between the database pointers provided here and those in the STRUCT_REF family of categories. The latter are intended to provide links to external database entries for any aspect of any subset of the structure that the author may wish to record, including previous determinations of the same structure, other structures containing the same ligand or references to the sequence(s) of the macromolecule(s) in sequence databases. In contrast, the links provided in DATABASE_2 refer to the entire contents of the mmCIF and are designed to cover situations in which the entire file is deposited in more than one database (for example, in the PDB and in a database for protein kinases).
Data items in these categories are as follows:
The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_).
The DATABASE category is retained in the mmCIF dictionary, but only for consistency with the core dictionary.
The role of the data items in the DATABASE_2 category is to store identifiers assigned by one or more databases to the structure described in the mmCIF. In the data model used in the core CIF dictionary, each database has an individual data item. The data model in mmCIF is more general. It comprises the data items _database_2.database_id, which identifies the database, and _database_2.database_code, which is the code assigned by the database to the entry. Thus a new database can be referred to without needing to add an additional data item to the dictionary. If a structure has been deposited in more than one database, the values of _database_2.database_id and _database_2.database_code can be looped.
The institutions and individual databases recognized in the DATABASE_2 category in the current version of the mmCIF dictionary are CAS (Chemical Abstracts Service), CSD (Cambridge Structural Database), ICSD (Inorganic Crystal Structure Database), MDF (Metals Data File), NDB (Nucleic Acid Database), NBS (the Crystal Data database of the National Institute of Standards and Technology, formerly the National Bureau of Standards), PDB (Protein Data Bank), PDF (Powder Diffraction File), RCSB (Research Collaboratory for Structural Bioinformatics) and EBI (European Bioinformatics Institute). It is intended that new databases will be added to this list on an ongoing basis; the purpose of specifying a list of possible databases in the dictionary is to ensure that each database is referenced consistently.
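As an illustration of the looped form, the sketch below writes a DATABASE_2 loop from a list of (database, code) pairs and rejects database identifiers that are not in the list given above. It is a minimal Python sketch rather than a definitive writer: the function name and the entry codes are hypothetical, and codes are assumed to contain no white space, so no CIF quoting is attempted.

```python
# Database identifiers named in the text as recognized by DATABASE_2.
RECOGNIZED = ("CAS", "CSD", "ICSD", "MDF", "NDB", "NBS", "PDB", "PDF", "RCSB", "EBI")


def database_2_loop(entries):
    """Format (database_id, database_code) pairs as an mmCIF loop."""
    lines = ["loop_", "_database_2.database_id", "_database_2.database_code"]
    for database_id, code in entries:
        if database_id not in RECOGNIZED:
            raise ValueError(f"unrecognized database identifier: {database_id!r}")
        lines.append(f"{database_id} {code}")
    return "\n".join(lines)


# Hypothetical codes for one structure deposited in two databases.
loop_text = database_2_loop([("PDB", "1ABC"), ("NDB", "XX0001")])
```

Restricting the identifier to a fixed list mirrors the stated purpose of the enumeration, which is to ensure that each database is referenced consistently.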
Data items in these categories are as follows:
The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item.
A major goal of the design of the mmCIF data model was that a file could be transformed from Protein Data Bank (PDB) format to mmCIF format and back again without loss of information. This required the creation of mmCIF data items whose sole purpose is to capture PDB-specific records that do not map onto mmCIF data items. These records would never be created for a de novo mmCIF. This family of categories also belongs to the PDB category group (see Section 3.6.9.3).
The items in the categories DATABASE_PDB_MATRIX and DATABASE_PDB_TVECT are derived from the elements of transformation matrices and vectors used by the Protein Data Bank. The items in the categories DATABASE_PDB_REV and DATABASE_PDB_REV_RECORD record details about the revision history of the data block as archived by the Protein Data Bank.
The items in the DATABASE_PDB_CAVEAT category record comments about the data block flagged as 'CAVEATS' by the Protein Data Bank at the time the original PDB archive file was created. A PDB CAVEAT record indicates that the entry contains severe errors. In PDB format, extended comments were stored as a sequence of fixed-length (80-character) records, columns 9 and 10 being reserved for continuation sequence numbering. The mmCIF representation retains each record as a separate data value and does not attempt to merge continuation records to provide more readable running text. Hence a PDB CAVEAT entry that extends over several records is represented in mmCIF as the same number of separate data values.
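The record-by-record conversion can be sketched as follows. This is a minimal Python illustration, not the converter used by the Protein Data Bank. It assumes that the comment text occupies columns 20 to 79 of each CAVEAT record, that the mmCIF values are held in items named _database_PDB_caveat.id and _database_PDB_caveat.text, and that identifiers are assigned sequentially. The function names, the record text and the entry code 1ABC are invented.

```python
def caveat_values(records):
    """Return one (id, text) pair per CAVEAT record; continuations are not merged."""
    values = []
    for line in records:
        if not line.startswith("CAVEAT"):
            continue
        line = line.ljust(80)
        # Columns 9-10 (line[8:10]) carry the continuation number; it is not
        # needed here because each record becomes its own data value.
        text = line[19:79].strip()  # assumed comment field, columns 20-79
        values.append((str(len(values) + 1), text))
    return values


def caveat_loop(values):
    """Format the pairs as an mmCIF loop using semicolon-delimited text fields."""
    out = ["loop_", "_database_PDB_caveat.id", "_database_PDB_caveat.text"]
    for ident, text in values:
        out += [ident, ";" + text, ";"]
    return "\n".join(out)


# Invented records for a hypothetical entry 1ABC, built column by column:
# name (1-6), two blanks, continuation (9-10), blank, code (12-15), four blanks, text.
parts = [("", "EXAMPLE CAVEAT TEXT, FIRST RECORD"),
         ("2", "EXAMPLE CAVEAT TEXT, SECOND RECORD")]
records = [f"CAVEAT  {cont:>2} 1ABC    {text}" for cont, text in parts]
values = caveat_values(records)
loop_text = caveat_loop(values)
```

The two input records yield two separate data values, so the line structure of the PDB file is preserved rather than merged into running text.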
The PDB format used 'REMARK' records to store information relating to several aspects of the structure in free or loosely structured text. In some cases, the conventions used for individual types of REMARK record allow structured data to be extracted automatically and translated to specific mmCIF data items. Where this is not possible, the DATABASE_PDB_REMARK category may be used to retain the information that appeared in these parts of PDB format files. Unlike the CAVEAT records, it is possible to collect together several REMARK records sharing a common numbering into a single free-text field. For example, PDB practice has been to repeat the contents of CAVEAT records (see above) as records of type 'REMARK 5'. While each separate CAVEAT record is converted to a separate mmCIF data value, the complete text of a REMARK 5 record may be gathered into a single mmCIF data value. Hence the text of a CAVEAT entry would also appear in a PDB file as part of a 'REMARK 5' and would appear in an mmCIF as a single data value.
Note that by convention the value of _database_PDB_remark.id matches the class of the REMARK record in the PDB file.
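Gathering REMARK records that share a number into one value can be sketched in the same way. The Python below is illustrative only: it assumes that the REMARK number occupies columns 8 to 10 and the text columns 12 to 79, and that the gathered text would be stored in an item named _database_PDB_remark.text. The function name and the records are invented.

```python
def gather_remarks(records):
    """Map each REMARK number to the text of all its records, joined by newlines."""
    grouped = {}
    for line in records:
        if not line.startswith("REMARK"):
            continue
        line = line.ljust(80)
        number = line[7:10].strip()  # assumed REMARK class, columns 8-10
        text = line[11:79].rstrip()  # assumed free text, columns 12-79
        grouped.setdefault(number, []).append(text)
    return {number: "\n".join(lines) for number, lines in grouped.items()}


# Invented records: two of class 5 and one of class 2.
texts = ["EXAMPLE CAVEAT TEXT, FIRST RECORD", "EXAMPLE CAVEAT TEXT, SECOND RECORD"]
records = [f"REMARK {5:>3} {t}" for t in texts] + [f"REMARK {2:>3} RESOLUTION EXAMPLE"]
remarks = gather_remarks(records)
```

Each key of the resulting dictionary would serve as the value of _database_PDB_remark.id, and the two class 5 records end up in a single multi-line value.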
Categories used during the publication of an article are JOURNAL, JOURNAL_INDEX, PUBL, PUBL_AUTHOR, PUBL_BODY and PUBL_MANUSCRIPT_INCL.
These categories cover both the metadata for the article (information about the article) and the text of the article itself.
Data items in these categories are as follows:
The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_).
In mmCIF, the families of categories used to contain the text of an article for publication and to record information about the handling and processing of the article by a publisher are assigned to the IUCR category group. The name arose from the fact that CIF is sponsored by the International Union of Crystallography and several of the journals of the IUCr can handle articles submitted for publication in CIF format. However, these data items may be freely used by other publishers who wish to handle articles submitted in CIF format. The JOURNAL and JOURNAL_INDEX categories are used in the same way in the core CIF and mmCIF dictionaries, and Section 3.2.5.4 can be consulted for details.
Data items in these categories are as follows:
The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_).
The categories PUBL, PUBL_AUTHOR, PUBL_BODY and PUBL_MANUSCRIPT_INCL are also members of the IUCR group in the mmCIF dictionary. They are used in the same way in the core CIF and mmCIF dictionaries, and Section 3.2.5.5 can be consulted for details.