Data acquisition and processing

Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G. L.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E.

doi:10.1107/97809553602060000722

International Tables for Crystallography
Volume F
Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F. ch. 24.5, pp. 675-677

Section 24.5.2. Data acquisition and processing

H. M. Berman,^a* J. Westbrook,^a Z. Feng,^a G. Gilliland,^b T. N. Bhat,^b H. Weissig,^c I. N. Shindyalov^c and P. E. Bourne^d

^a Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA, ^b National Institute of Standards and Technology, Biotechnology Division, 100 Bureau Drive, Gaithersburg, MD 20899, USA, ^c San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA, and ^d Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
Correspondence e-mail: berman@rcsb.rutgers.edu

24.5.2. Data acquisition and processing

A key component of creating the public archive of information is the efficient capture and curation of the data – data processing. Data processing consists of data deposition, annotation and validation. These steps are part of the fully documented and integrated data-processing system shown in Fig. 24.5.2.1.

Figure 24.5.2.1. The steps in PDB data processing. Ellipses represent actions and rectangles define content.

In the present system (Fig. 24.5.2.2), data (atomic coordinates, structure factors and NMR restraints) may be submitted via e-mail or via the AutoDep Input Tool [ADIT: http://deposit.rcsb.org/adit/ (Westbrook et al., 1998)] developed by the RCSB. ADIT, which is also used to process the entries, is built on top of the mmCIF dictionary, an ontology of 1700 terms that define the macromolecular structure and the crystallographic experiment (Bourne et al., 1997), and a data-processing program called MAXIT (Macromolecular Exchange and Input Tool; Feng, Hsieh et al., 1998). This integrated system helps to ensure that the data deposited for an entry are consistent and error-free after annotation.
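The idea of checking deposited items against a data dictionary can be illustrated with a minimal Python sketch. It is not ADIT or MAXIT code: the function names, the tiny `KNOWN_ITEMS` set (a handful of mmCIF item names standing in for the full dictionary) and the sample text are assumptions made for illustration, and shell-style quoting via `shlex` only approximates real CIF quoting rules.

```python
import shlex

# Illustrative subset of mmCIF item names (assumption: stands in for
# the full dictionary of about 1700 terms).
KNOWN_ITEMS = {
    "_entry.id",
    "_exptl.method",
    "_cell.length_a",
    "_cell.length_b",
    "_cell.length_c",
    "_symmetry.space_group_name_H-M",
}


def parse_items(text):
    """Parse single-valued '_category.item value' lines into a dict."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip data_ headers, loops, comments and blank lines
        parts = shlex.split(line)  # handles simple quoted values
        if len(parts) == 2:
            items[parts[0]] = parts[1]
    return items


def undefined_items(items, dictionary=KNOWN_ITEMS):
    """Return item names that the dictionary does not define."""
    return sorted(name for name in items if name not in dictionary)


# Hypothetical deposition fragment with one misspelled item name.
SAMPLE = """\
data_EXAMPLE
_entry.id EXAMPLE
_exptl.method 'X-RAY DIFFRACTION'
_cell.length_a 61.2
_symmetry.space_group_name_H-M 'P 21 21 21'
_cell.lenght_b 75.0
"""

if __name__ == "__main__":
    parsed = parse_items(SAMPLE)
    print(undefined_items(parsed))
```

Because every item name is looked up in one shared vocabulary, a misspelled or unknown item can be flagged at input time rather than during later annotation.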

Figure 24.5.2.2. The integrated tools of the PDB data-processing system.

After a structure has been deposited using ADIT, a PDB identifier is sent to the author automatically and immediately (Fig. 24.5.2.1, step 1). This is the first stage in which information about the structure is loaded into the internal core database (see Section 24.5.3). The entry is then annotated by PDB staff using ADIT; several validation reports about the structure are produced. The completely annotated entry as it will appear in the PDB resource, together with the validation information, is sent back to the depositor (step 2). After reviewing the processed file, the author sends any revisions (step 3). Depending on the nature of these revisions, steps 2 and 3 may be repeated. Once approval is received from the author (step 4), the entry and the tables in the internal core database are ready for distribution.
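The four steps described above can be read as a small state machine. The following Python sketch is one possible model, not the PDB's implementation: the class and stage names are hypothetical, and it assumes that approval may follow either the author's review (step 2) or a round of revisions (step 3).

```python
from enum import Enum


class Stage(Enum):
    DEPOSITED = 1       # step 1: identifier issued, entry loaded
    SENT_TO_AUTHOR = 2  # step 2: annotated entry and reports returned
    REVISED = 3         # step 3: author sends revisions
    APPROVED = 4        # step 4: author approves, ready for distribution


# Allowed transitions; steps 2 and 3 may repeat (assumed model).
ALLOWED = {
    Stage.DEPOSITED: {Stage.SENT_TO_AUTHOR},
    Stage.SENT_TO_AUTHOR: {Stage.REVISED, Stage.APPROVED},
    Stage.REVISED: {Stage.SENT_TO_AUTHOR, Stage.APPROVED},
    Stage.APPROVED: set(),
}


class Entry:
    """A deposition that records every stage it passes through."""

    def __init__(self, pdb_id):
        self.pdb_id = pdb_id
        self.stage = Stage.DEPOSITED
        self.history = [Stage.DEPOSITED]

    def advance(self, new_stage):
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.name} to {new_stage.name}"
            )
        self.stage = new_stage
        self.history.append(new_stage)


if __name__ == "__main__":
    entry = Entry("XXXX")  # placeholder identifier
    for stage in (Stage.SENT_TO_AUTHOR, Stage.REVISED,
                  Stage.SENT_TO_AUTHOR, Stage.APPROVED):
        entry.advance(stage)
    print([s.name for s in entry.history])
```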

All aspects of data processing, including communications with the author, are recorded and stored in the correspondence archive. This makes it possible for the PDB staff to retrieve information about any aspect of the deposition process and to monitor the efficiency of PDB operations closely.

Current status information, including a list of authors, title and release category, is stored for each entry in the core database and is made accessible for query via the WWW interface (http://www.rcsb.org/pdb/status.html). Entries before release are categorized as 'in processing' (PROC), 'in depositor review' (WAIT), 'to be held until publication' (HPUB) or 'on hold until a depositor-specified date' (HOLD).
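As an illustration of how the four pre-release categories might be tallied, the sketch below counts entries by status code. Only the codes and their meanings come from the text; the record layout (a dict with a `status` key) and the function name are assumptions and do not describe the core database schema.

```python
from collections import Counter

# Pre-release status codes as listed in the text.
STATUS = {
    "PROC": "in processing",
    "WAIT": "in depositor review",
    "HPUB": "to be held until publication",
    "HOLD": "on hold until a depositor-specified date",
}


def summarize(entries):
    """Count entries per status code, rejecting unknown codes."""
    counts = Counter(entry["status"] for entry in entries)
    unknown = set(counts) - set(STATUS)
    if unknown:
        raise ValueError(f"unknown status codes: {sorted(unknown)}")
    return {code: counts.get(code, 0) for code in STATUS}


if __name__ == "__main__":
    # Hypothetical records with placeholder identifiers.
    records = [
        {"id": "A", "status": "PROC"},
        {"id": "B", "status": "HPUB"},
        {"id": "C", "status": "HPUB"},
        {"id": "D", "status": "HOLD"},
    ]
    for code, n in summarize(records).items():
        print(f"{code} ({STATUS[code]}): {n}")
```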

24.5.2.1. Content of the data collected by the PDB

All the data collected from depositors by the PDB are considered primary data. Primary data contain, in addition to the coordinates, general information required for all deposited structures and information specific to the method of structure determination. Table 24.5.2.1 contains the general information that the PDB collects for all structures as well as the additional information collected for those structures determined by X-ray methods. The additional items listed for the NMR structures are derived from the International Union of Pure and Applied Chemistry recommendations (Markley et al., 1998) and will be implemented in the near future.

Table 24.5.2.1. Content of data in the PDB

(a) Content of all depositions (X-ray and NMR)

Source – specifications such as genus, species, strain, or variant of gene (cloned or synthetic); expression vector and host, or description of method of chemical synthesis

Sequence – full sequence of all macromolecular components

Chemical structure of cofactors and prosthetic groups

Names of all components in structure

Qualitative description of characteristics of structure

Literature citations for the structure submitted

Three-dimensional coordinates

(b) Additional items for X-ray structure determinations

Temperature factors and occupancies assigned to each atom

Crystallization conditions, including pH, temperature, solvents, salts, methods

Crystal data, including the unit-cell dimensions and space group

Presence of noncrystallographic symmetry

Data-collection information describing the methods used to collect the diffraction data including instrument, wavelength, temperature and processing programs

Data-collection statistics including data coverage, R_sym, data above 1, 2, 3σ levels and resolution limits

Refinement information including R factor, resolution limits, number of reflections, method of refinement, σ cutoff, geometry r.m.s.d.

Structure factors – h, k, l, F_obs, σ(F_obs)

(c) Additional items for NMR structure determinations

Model number for each coordinate set that is deposited and an indication if one should be designated as a representative, or an energy-minimized average model provided

Data-collection information describing the types of methods used, instrumentation, magnetic field strength, console, probe head, sample tube

Sample conditions, including solvent, macromolecule concentration ranges, concentration ranges of buffers, salts, antibacterial agents, other components, isotopic composition

Experimental conditions, including temperature, pH, pressure and oxidation state of structure determination and estimates of uncertainties in these values

Non-covalent heterogeneity of sample, including self-aggregation, partial isotope exchange, conformational heterogeneity resulting in slow chemical exchange

Chemical heterogeneity of the sample (e.g. evidence for deamidation or minor covalent species)

A list of NMR experiments used to determine the structure including those used to determine resonance assignments, NOE/ROE data, dynamical data, scalar coupling constants, and those used to infer hydrogen bonds and bound ligands. The relationship of these experiments to the constraint files is given explicitly

Constraint files used to derive the structure as described in task-force recommendations

The information content of data submitted by the depositor is likely to change as new methods for data collection, structure determination and refinement evolve and advance. In addition, the ways in which these data are captured are likely to change as the software for structure determination and refinement produces the necessary data items as part of its output. The data-input system for the PDB, ADIT, has been designed to incorporate these likely changes easily.

24.5.2.2. Validation

Validation refers to the procedure for assessing the quality of deposited atomic models (structure validation) and for assessing how well these models fit the experimental data (experimental validation). The PDB validates structures using accepted community standards as part of ADIT's integrated data-processing system. All validation reports are communicated directly to the depositor. It is also possible to run these validation checks against structures that are not being deposited. A validation server (http://deposit.rcsb.org/validate/) has been made available for this purpose.

Several types of checks are used in this process: PROCHECK (Laskowski et al., 1993) is used for checking the structural features of proteins and NUCheck (Feng, Westbrook & Berman, 1998) is used for checking the structural features of nucleic acids. The information currently checked includes the following: bond lengths and bond angles, nomenclature, sequence, stereochemistry, torsion angles, ligand geometry, planarity of peptide bonds, intermolecular contacts, and positions of water molecules. In consultation with the community, other structure checks will be implemented over the next few years.
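A bond-length check of the kind listed above can be sketched in a few lines of Python. This is not the algorithm used by PROCHECK or NUCheck: the function name, the data layout, the 4σ threshold and the target values are all assumptions, and the targets are rounded placeholders rather than an authoritative parameter set.

```python
import math


def bond_length_outliers(coords, bonds, targets, n_sigma=4.0):
    """Flag bonds whose length deviates from a target by > n_sigma.

    coords:  atom label -> (x, y, z) in angstroms
    bonds:   list of (atom_a, atom_b, bond_type)
    targets: bond_type -> (ideal_length, sigma), both in angstroms
    """
    outliers = []
    for atom_a, atom_b, kind in bonds:
        ideal, sigma = targets[kind]
        length = math.dist(coords[atom_a], coords[atom_b])
        z = (length - ideal) / sigma
        if abs(z) > n_sigma:
            outliers.append((atom_a, atom_b, round(length, 3), round(z, 1)))
    return outliers


if __name__ == "__main__":
    # Placeholder targets for illustration only.
    targets = {"CA-C": (1.52, 0.02), "C-N": (1.33, 0.02)}
    # Hypothetical coordinates with a deliberately stretched C-N bond.
    coords = {
        "CA": (0.0, 0.0, 0.0),
        "C": (1.52, 0.0, 0.0),
        "N": (1.52, 1.60, 0.0),
    }
    bonds = [("CA", "C", "CA-C"), ("C", "N", "C-N")]
    print(bond_length_outliers(coords, bonds, targets))
```

Expressing each deviation in units of the expected standard deviation lets one threshold serve bonds of different types.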

The experimental data are also checked. Currently, X-ray crystallographic data are validated and plans for checking NMR data are in progress. For X-ray crystallographic structures, the structure factors are validated using SFCHECK (Vaguine et al., 1999). This program extracts the deposited R factor, resolution and model information, and then compares them with values calculated from the coordinate and structure-factor files. It also calculates an overall B factor, coordinate errors, an effective resolution and completeness. A summary of the density correlation, shift and B factor is reported for each residue. As specific procedures are developed for checking NMR structures against experimental data, they will be incorporated into the PDB validation procedures.
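The comparison of a deposited R factor with a recalculated one can be illustrated with the conventional definition R = Σ| |F_obs| − k|F_calc| | / Σ|F_obs|. The Python sketch below is not SFCHECK's procedure: the single overall scale k = Σ|F_obs| / Σ|F_calc|, the function names and the tolerance of 0.01 are assumptions chosen to keep the example short.

```python
def r_factor(f_obs, f_calc):
    """Conventional R factor from structure-factor amplitudes.

    Uses one overall scale k = sum(F_obs) / sum(F_calc), a
    simplification adopted for this sketch.
    """
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty lists of equal length")
    scale = sum(f_obs) / sum(f_calc)
    numerator = sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def agrees_with_deposited(deposited_r, calculated_r, tolerance=0.01):
    """True if the two R factors differ by at most the tolerance.

    The tolerance is a hypothetical value, not a PDB criterion.
    """
    return abs(deposited_r - calculated_r) <= tolerance


if __name__ == "__main__":
    # Hypothetical amplitudes for two reflections.
    f_obs = [100.0, 50.0]
    f_calc = [90.0, 60.0]
    r = r_factor(f_obs, f_calc)
    print(round(r, 4), agrees_with_deposited(0.13, r))
```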

24.5.2.3. NMR data

The PDB staff recognize that NMR data need a special development effort. Historically these data have been retro-fitted into a PDB format defined around crystallographic information. As a first step towards improving this situation, the PDB carried out an extensive assessment of the current NMR holdings and presented the findings to a task force consisting of a cross section of NMR researchers. The PDB is working with this group, the BioMagResBank (BMRB; Ulrich et al., 1989) and other members of the NMR community to develop an NMR data dictionary along with deposition and validation tools specific for NMR structures.

24.5.2.4. Data-processing statistics

Production processing of PDB entries by the RCSB began on 27 January 1999. As of 1 July 1999, when the RCSB became fully responsible for the PDB, approximately 80% of all structures submitted to the PDB are deposited via ADIT and processed by the RCSB. Another 20% are submitted via AutoDep to the European Bioinformatics Institute (EBI), which processes these submissions and forwards them to the PDB for archiving and distribution. The average time from deposition to the completion of data processing, including author interactions, is two weeks. The proportion of structures with a HOLD release status remains at about 20% of all submissions; 57% are held until publication (HPUB); and 23% are released immediately after processing.

Table 24.5.2.2 shows the breakdown of the types of structures in the PDB. As of 14 September 1999, the PDB contained 10 714 publicly accessible structures with another 1169 entries on hold (not shown). Of these, 8789 (82%) were determined by X-ray methods, 1692 (16%) were determined by NMR and 233 (2%) were theoretical models. Overall, 35% of the entries have deposited experimental data.

Table 24.5.2.2. Demographics of the released data in the PDB as of 14 September 1999

	Molecule type
Experimental technique	Proteins, peptides, and viruses	Protein–nucleic acid complexes	Nucleic acids	Carbohydrates and other	Total
X-ray diffraction and other	7946	390	439	14	8789
NMR	1365	53	270	4	1692
Theoretical modelling	202	16	15	0	233
Total	9513	459	724	18	10714
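The totals in Table 24.5.2.2 and the percentages quoted in the text can be reproduced with simple arithmetic. The short Python sketch below uses only the counts from the table; the variable names are illustrative.

```python
# Counts from Table 24.5.2.2, in column order:
# proteins/peptides/viruses, protein-nucleic acid complexes,
# nucleic acids, carbohydrates and other.
TABLE = {
    "X-ray diffraction and other": [7946, 390, 439, 14],
    "NMR": [1365, 53, 270, 4],
    "Theoretical modelling": [202, 16, 15, 0],
}

# Sum across molecule types for each technique.
row_totals = {name: sum(values) for name, values in TABLE.items()}

# Sum across techniques for each molecule type.
column_totals = [sum(column) for column in zip(*TABLE.values())]

grand_total = sum(row_totals.values())

# Share of each technique, rounded to the nearest percent.
shares = {
    name: round(100 * total / grand_total)
    for name, total in row_totals.items()
}

if __name__ == "__main__":
    print(row_totals)
    print(column_totals, grand_total)
    print(shares)
```

The row sums give 8789, 1692 and 233 structures, the grand total is 10 714, and the rounded shares are 82%, 16% and 2%, in agreement with the figures in the text.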

References

Bourne, P., Berman, H. M., Watenpaugh, K., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). The macromolecular Crystallographic Information File (mmCIF). Methods Enzymol. 277, 571–590.

Feng, Z., Hsieh, S.-H., Gelbin, A. & Westbrook, J. (1998). MAXIT: macromolecular exchange and input tool. NDB-120. Rutgers University, New Brunswick, NJ, USA.

Feng, Z., Westbrook, J. & Berman, H. M. (1998). NUCheck. NDB-407. Rutgers University, New Brunswick, NJ, USA.

Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.

Markley, J. L., Bax, A., Arata, Y., Hilbers, C. W., Kaptein, R., Sykes, B. D., Wright, P. E. & Wüthrich, K. (1998). Recommendations for the presentation of NMR structures of proteins and nucleic acids. IUPAC–IUBMB–IUPAB Inter-Union Task Group on the standardization of data bases of protein and nucleic acid structures determined by NMR spectroscopy. J. Biomol. Nucl. Magn. Reson. 12, 1–23.

Ulrich, E. L., Markley, J. L. & Kyogoku, Y. (1989). Creation of a nuclear magnetic resonance data repository and literature database. Protein Seq. Data Anal. 2, 23–37.

Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). SFCHECK: a unified set of procedures for evaluating the quality of macromolecular structure-factor data and their agreement with the atomic model. Acta Cryst. D55, 191–205.

Westbrook, J., Feng, Z. & Berman, H. M. (1998). ADIT – the AutoDep Input Tool. RCSB-99. Department of Chemistry, Rutgers, The State University of New Jersey, USA.
