International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. F. ch. 24.5, pp. 675-677
Section 24.5.2. Data acquisition and processing
H. M. Berman (a)*, J. Westbrook (a), Z. Feng (a), G. Gilliland (b), T. N. Bhat (b), H. Weissig (c), I. N. Shindyalov (c) and P. E. Bourne (d)
(a) Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
(b) National Institute of Standards and Technology, Biotechnology Division, 100 Bureau Drive, Gaithersburg, MD 20899, USA
(c) San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
(d) Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
A key component of creating the public archive of information is the efficient capture and curation of the data – data processing. Data processing consists of data deposition, annotation and validation. These steps are part of the fully documented and integrated data-processing system shown in Fig. 24.5.2.1.
In the present system (Fig. 24.5.2.2), data (atomic coordinates, structure factors and NMR restraints) may be submitted via e-mail or via the AutoDep Input Tool [ADIT: http://deposit.rcsb.org/adit/ (Westbrook et al., 1998)] developed by the RCSB. ADIT, which is also used to process the entries, is built on top of the mmCIF dictionary, which is an ontology of 1700 terms that define the macromolecular structure and the crystallographic experiment (Bourne et al., 1997), and a data-processing program called MAXIT (Macromolecular Exchange and Input Tool; Feng, Hsieh et al., 1998). This integrated system helps to ensure that the data that are deposited for an entry are consistent and error-free after annotation.
After a structure has been deposited using ADIT, a PDB identifier is sent to the author automatically and immediately (Fig. 24.5.2.1, step 1). This is the first stage in which information about the structure is loaded into the internal core database (see Section 24.5.3). The entry is then annotated by PDB staff using ADIT; several validation reports about the structure are produced. The completely annotated entry as it will appear in the PDB resource, together with the validation information, is sent back to the depositor (step 2). After reviewing the processed file, the author sends any revisions (step 3). Depending on the nature of these revisions, steps 2 and 3 may be repeated. Once approval is received from the author (step 4), the entry and the tables in the internal core database are ready for distribution.
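The four steps described above can be read as a small state machine. The following Python sketch is only an illustration of that reading, not a description of how ADIT is implemented; the names `Stage`, `TRANSITIONS` and `follow` are hypothetical, and the set of allowed transitions is an assumption drawn from the text (in particular, that steps 2 and 3 may alternate any number of times before approval).

```python
from enum import Enum, auto


class Stage(Enum):
    """Stages of the deposition cycle (hypothetical names)."""
    DEPOSITED = auto()       # step 1: identifier issued, entry loaded into the core database
    SENT_TO_AUTHOR = auto()  # step 2: annotated entry and validation reports returned
    REVISED = auto()         # step 3: author sends revisions
    APPROVED = auto()        # step 4: author approves; entry ready for distribution


# Assumed transitions: steps 2 and 3 may repeat until the author approves.
TRANSITIONS = {
    Stage.DEPOSITED: {Stage.SENT_TO_AUTHOR},
    Stage.SENT_TO_AUTHOR: {Stage.REVISED, Stage.APPROVED},
    Stage.REVISED: {Stage.SENT_TO_AUTHOR, Stage.APPROVED},
    Stage.APPROVED: set(),
}


def follow(stages):
    """Check that a sequence of stages is allowed and return the final stage."""
    current = stages[0]
    for nxt in stages[1:]:
        if nxt not in TRANSITIONS[current]:
            raise ValueError(f"cannot go from {current.name} to {nxt.name}")
        current = nxt
    return current
```

For example, a deposition that needs one round of revisions would pass through `DEPOSITED`, `SENT_TO_AUTHOR`, `REVISED`, `SENT_TO_AUTHOR` and `APPROVED`, whereas jumping straight from `DEPOSITED` to `APPROVED` is rejected.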
All aspects of data processing, including communications with the author, are recorded and stored in the correspondence archive. This makes it possible for the PDB staff to retrieve information about any aspect of the deposition process and to monitor the efficiency of PDB operations closely.
Current status information, including a list of authors, title and release category, is stored for each entry in the core database and is made accessible for query via the WWW interface (http://www.rcsb.org/pdb/status.html). Entries before release are categorized as 'in processing' (PROC), 'in depositor review' (WAIT), 'to be held until publication' (HPUB) or 'on hold until a depositor-specified date' (HOLD).
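As a small illustration, the four pre-release categories can be held in a lookup table. The codes and their meanings are taken from the text; the names `PRE_RELEASE_STATUS` and `describe_status` are hypothetical and do not refer to any PDB software.

```python
# Pre-release status codes and their meanings, as listed in the text.
PRE_RELEASE_STATUS = {
    "PROC": "in processing",
    "WAIT": "in depositor review",
    "HPUB": "to be held until publication",
    "HOLD": "on hold until a depositor-specified date",
}


def describe_status(code: str) -> str:
    """Return the meaning of a pre-release status code (case-insensitive)."""
    try:
        return PRE_RELEASE_STATUS[code.upper()]
    except KeyError:
        raise ValueError(f"unknown pre-release status code: {code!r}") from None
```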
All the data collected from depositors by the PDB are considered primary data. Primary data contain, in addition to the coordinates, general information required for all deposited structures and information specific to the method of structure determination. Table 24.5.2.1 contains the general information that the PDB collects for all structures as well as the additional information collected for those structures determined by X-ray methods. The additional items listed for the NMR structures are derived from the International Union of Pure and Applied Chemistry recommendations (Markley et al., 1998) and will be implemented in the near future.
The information content of data submitted by the depositor is likely to change as new methods for data collection, structure determination and refinement evolve and advance. In addition, the ways in which these data are captured are likely to change as the software for structure determination and refinement produces the necessary data items as part of its output. The data-input system for the PDB, ADIT, has been designed to incorporate these likely changes easily.
Validation refers to the procedure for assessing the quality of deposited atomic models (structure validation) and for assessing how well these models fit the experimental data (experimental validation). The PDB validates structures using accepted community standards as part of ADIT's integrated data-processing system. All validation reports are communicated directly to the depositor. It is also possible to run these validation checks against structures that are not being deposited. A validation server (http://deposit.rcsb.org/validate/) has been made available for this purpose.
Several types of checks are used in this process: PROCHECK (Laskowski et al., 1993) is used for checking the structural features of proteins and NUCheck (Feng, Westbrook & Berman, 1998) is used for checking the structural features of nucleic acids. The information currently checked includes the following: bond lengths and bond angles, nomenclature, sequence, stereochemistry, torsion angles, ligand geometry, planarity of peptide bonds, intermolecular contacts, and positions of water molecules. In consultation with the community, other structure checks will be implemented over the next few years.
The experimental data are also checked. Currently, X-ray crystallographic data are validated and plans for checking NMR data are in progress. For X-ray crystallographic structures, the structure factors are validated using SFCHECK (Vaguine et al., 1999). This program extracts the deposited R factor, resolution and model information, and then compares them with values calculated from coordinate and structure-factor files. It also calculates an overall B factor, coordinate errors, an effective resolution and completeness. A summary of the density correlation, shift and B factor is reported for each residue. As specific procedures are developed for checking NMR structures against experimental data, they will be incorporated into the PDB validation procedures.
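To make the comparison of deposited and recalculated R factors concrete, the sketch below computes the conventional crystallographic R factor, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|, from two lists of structure-factor amplitudes and compares it with a reported value. It is not the SFCHECK implementation: scaling of calculated to observed amplitudes is omitted, and the function names and the default `tolerance` are assumptions chosen purely for illustration.

```python
def r_factor(f_obs, f_calc):
    """Conventional R factor from observed and calculated amplitudes.

    Assumes the two sequences are already on the same scale and refer to
    the same reflections in the same order.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    return numerator / denominator


def agrees_with_reported(reported_r, calculated_r, tolerance=0.05):
    """True if the reported and recalculated R factors differ by at most
    `tolerance` (a hypothetical threshold, not one taken from SFCHECK)."""
    return abs(reported_r - calculated_r) <= tolerance
```

With the toy amplitudes `[100, 200, 300]` (observed) and `[90, 210, 300]` (calculated), the absolute differences sum to 20 against a total of 600, so the function returns 20/600.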
The PDB staff recognize that NMR data need a special development effort. Historically these data have been retro-fitted into a PDB format defined around crystallographic information. As a first step towards improving this situation, the PDB carried out an extensive assessment of the current NMR holdings and presented the findings to a task force consisting of a cross section of NMR researchers. The PDB is working with this group, the BioMagResBank (BMRB; Ulrich et al., 1989) and other members of the NMR community to develop an NMR data dictionary along with deposition and validation tools specific for NMR structures.
Production processing of PDB entries by the RCSB began on 27 January 1999. As of 1 July 1999, when the RCSB became fully responsible for the PDB, approximately 80% of all structures submitted to the PDB are deposited via ADIT and processed by the RCSB. Another 20% are submitted via AutoDep to the European Bioinformatics Institute (EBI), who process these submissions and forward them to the PDB for archiving and distribution. The average time from deposition to the completion of data processing including author interactions is two weeks. The number of structures with a HOLD release status remains at about 20% of all submissions; 57% are held until publication (HPUB); and 23% are released immediately after processing.
Table 24.5.2.2 shows the breakdown of the types of structures in the PDB. As of 14 September 1999, the PDB contained 10 714 publicly accessible structures with another 1169 entries on hold (not shown). Of these, 8789 (82%) were determined by X-ray methods, 1692 (16%) were determined by NMR and 233 (2%) were theoretical models. Overall, 35% of the entries have deposited experimental data.
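The percentages quoted above follow directly from the counts, as the short check below shows. The counts are those given in the text; the variable names are illustrative only.

```python
# Publicly accessible structures as of 14 September 1999, by method.
counts = {"X-ray": 8789, "NMR": 1692, "theoretical model": 233}

# The three categories together account for all public entries.
total = sum(counts.values())

# Share of each method, rounded to the nearest whole percent.
shares = {method: round(100 * n / total) for method, n in counts.items()}

for method, n in counts.items():
    print(f"{method}: {n} of {total} ({shares[method]}%)")
```

The three counts sum to 10 714, and the rounded shares are 82%, 16% and 2%, in agreement with the figures in the text.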
References
Bourne, P., Berman, H. M., Watenpaugh, K., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). The macromolecular Crystallographic Information File (mmCIF). Methods Enzymol. 277, 571–590.
Feng, Z., Hsieh, S.-H., Gelbin, A. & Westbrook, J. (1998). MAXIT: macromolecular exchange and input tool. NDB-120. Rutgers University, New Brunswick, NJ, USA.
Feng, Z., Westbrook, J. & Berman, H. M. (1998). NUCheck. NDB-407. Rutgers University, New Brunswick, NJ, USA.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
Markley, J. L., Bax, A., Arata, Y., Hilbers, C. W., Kaptein, R., Sykes, B. D., Wright, P. E. & Wüthrich, K. (1998). Recommendations for the presentation of NMR structures of proteins and nucleic acids. IUPAC–IUBMB–IUPAB Inter-Union Task Group on the standardization of data bases of protein and nucleic acid structures determined by NMR spectroscopy. J. Biomol. Nucl. Magn. Reson. 12, 1–23.
Ulrich, E. L., Markley, J. L. & Kyogoku, Y. (1989). Creation of a nuclear magnetic resonance data repository and literature database. Protein Seq. Data Anal. 2, 23–37.
Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). SFCHECK: a unified set of procedures for evaluating the quality of macromolecular structure-factor data and their agreement with the atomic model. Acta Cryst. D55, 191–205.
Westbrook, J., Feng, Z. & Berman, H. M. (1998). ADIT – the AutoDep Input Tool. RCSB-99. Department of Chemistry, Rutgers, The State University of New Jersey, USA.