The CSD and the PDB: data acquisition and data quality

Allen, F. H.; Cole, J. C.; Verdonk, M. L.

doi:10.1107/97809553602060000713

International Tables for Crystallography, Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F, ch. 22.4, pp. 558–559

F. H. Allen,^a* J. C. Cole^a and M. L. Verdonk^a

^aCambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, England
Correspondence e-mail: allen@ccdc.cam.ac.uk

22.4.2. The CSD and the PDB: data acquisition and data quality

22.4.2.1. Statistical inferences

With a current total of 200 000 structures and a doubling period of seven years (Fig. 22.4.2.1a), we may expect at least half a million small-molecule crystal structures to be in the CSD by the year 2010. The Protein Data Bank (PDB) (Abola et al., 1997; Berman et al., 2000), which began operations in the mid-1970s, and the Nucleic Acid Database (NDB) (Berman et al., 1992) are the international repositories for macromolecular structure information. Input to the PDB was initially slow but is now showing a rapid growth rate reminiscent of the CSD of the 1970s (Fig. 22.4.2.1b). The PDB archive has a current total of ca 8500 structures (mid-1999) and a doubling period of close to two years. As with the CSD, this early high rate of growth will almost certainly decrease, thus increasing the doubling period. Nevertheless, by the year 2010, we might expect the PDB to contain more than 100 000 structures.

Figure 22.4.2.1

(a) Growth rate of the CSD and (b) growth rate of the PDB, in terms of the numbers of structures published per annum for the period 1970–1995.
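The projections above follow from simple exponential growth with a fixed doubling period. The sketch below is illustrative only: the function names are hypothetical, and the starting dates (1999 for the CSD total, mid-1999 for the PDB total) are assumptions, since the text gives an explicit date only for the PDB figure.

```python
import math


def projected_count(current: float, doubling_years: float, elapsed_years: float) -> float:
    """Archive size after elapsed_years, assuming a constant doubling period."""
    return current * 2.0 ** (elapsed_years / doubling_years)


def required_doubling_period(current: float, target: float, elapsed_years: float) -> float:
    """Longest average doubling period that still reaches target in elapsed_years."""
    return elapsed_years / math.log2(target / current)


# Totals and doubling periods quoted in the text; start dates are assumed.
csd_2010 = projected_count(200_000, 7.0, 2010 - 1999)
pdb_2010_fixed = projected_count(8_500, 2.0, 2010 - 1999.5)
pdb_max_doubling = required_doubling_period(8_500, 100_000, 2010 - 1999.5)

print(f"CSD in 2010 (7-year doubling):        {csd_2010:,.0f}")
print(f"PDB in 2010 (2-year doubling held):   {pdb_2010_fixed:,.0f}")
print(f"PDB doubling period for 100 000:      {pdb_max_doubling:.2f} years")
```

Under these assumptions the CSD estimate comes out just under 600 000, consistent with "at least half a million". For the PDB, an unchanged two-year doubling period would give well over 300 000 entries, so the more cautious figure of 100 000 allows the average doubling period to lengthen to nearly three years.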

22.4.2.2. Data acquisition and completeness

Given the size and diversity of the CSD, it is surprising that searches for some common chemical substructures often yield far fewer hits than might have been expected. Sometimes, the absence of just a few key CSD entries would have negated a successful systematic analysis: some points in a graph would have been missing and a correlation would not have been detected. Similarly, completeness of the PDB is vital for the future of 'data mining' or 'knowledge engineering' in the macromolecular arena.

Data acquisition by the PDB has always had one valuable advantage in comparison with the CSD. The volume of numerical data generated by a protein structure determination is far too large for primary publication or hard-copy deposition. Thus, the PDB has always acquired data through direct deposition in electronic form, and authors have usually been involved in the validation of their entries. Further, the vast majority of journals require, and the appropriate professional organizations clearly recommend, that data be deposited with the PDB prior to primary publication. This key involvement of the PDB in the publication process acts as a vital guarantee of the completeness of the archive. The prior-deposition rule must be rigidly adhered to for the long-term benefit of science.

22.4.2.3. Standard formats: CIF and mmCIF

The CSD, on the other hand, reflects the published literature, and much of its data content has been re-keyboarded from hard-copy material. The Cambridge Crystallographic Data Centre (CCDC) is now beginning to receive significant amounts of electronic input, a development that owes much to the rapid international acceptance of an agreed standard electronic interchange format, the crystallographic information file or CIF (Hall et al., 1991), and the rapid incorporation of CIF generators within most major structure solution and refinement packages. The CIF offers many advantages, some of which are only just being addressed within the CSD: (a) a clear definition of input data items and their representation; (b) a significant reduction in time spent correcting simple typographical errors; and (c) the possibility of enhancing the overall database content through the electronic availability of all information from the analysis, i.e. more than could reasonably be re-typed from hard-copy material. For the PDB, the recent adoption of the macromolecular CIF (mmCIF) as the agreed international standard offers similar advantages. This development, together with advances in communications technology, now makes it possible to automate the deposition process more effectively, but the advantages of mmCIF can only be fully realized once it also becomes a standard output format of all the relevant software packages.
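Advantage (a) can be illustrated with a minimal sketch of how named CIF data items might be read by a program. The data names used are standard core CIF items, but the sample values are invented and the two functions are hypothetical. The sketch handles only single-line tag-value pairs; loops, multi-line text fields and the rest of the CIF syntax are deliberately ignored, so this is not a substitute for a full CIF parser.

```python
import re

# Invented example data; only the data names are standard core CIF items.
SAMPLE_CIF = """\
data_example
_cell_length_a                  10.123(2)
_cell_length_b                  11.456(3)
_cell_angle_beta                95.12(1)
_symmetry_space_group_name_H-M  'P 21/c'
"""

# A number optionally followed by a standard uncertainty, e.g. 10.123(2)
NUMBER_WITH_SU = re.compile(r"^([+-]?\d+)(?:\.(\d+))?(?:\((\d+)\))?$")


def parse_simple_cif(text):
    """Collect single-line '_name value' pairs into a dictionary of strings."""
    items = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].startswith("_"):
            continue  # skip data_ headers, blank lines and anything unsupported
        name, value = parts[0], parts[1].strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]  # remove matching surrounding quotes
        items[name] = value
    return items


def parse_number_with_su(value):
    """Return (value, su) for strings such as '10.123(2)', or None if not numeric."""
    match = NUMBER_WITH_SU.match(value)
    if match is None:
        return None
    whole, fraction, su_digits = match.groups()
    number = float(whole + ("." + fraction if fraction else ""))
    if su_digits is None:
        return number, None
    # The uncertainty applies to the last quoted decimal places.
    return number, int(su_digits) * 10 ** -len(fraction or "")


items = parse_simple_cif(SAMPLE_CIF)
for name, value in items.items():
    print(name, "->", parse_number_with_su(value) or value)
```

Because every value is tied to an explicit data name, a receiving database can check each item against its definition instead of relying on the position of a number in a typed table.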

22.4.2.4. Structure validation

The value of research results derived from the CSD and the PDB depends crucially on the accuracy of the underlying data [see e.g. Hooft et al. (1996) with respect to protein data]. As with the early CSD, much current research involves use of data from the developing PDB to establish rules and protocols for the validation of new protein structures (see e.g. Laskowski et al., 1993). This activity, in turn, means that earlier entries in the archive may have to be reassessed periodically to bring their representations into line with best current practice. This sequence of events was commonplace in the CSD of the 1970s and, even now, new structure types entering the CSD can still provoke a reassessment of subclasses of earlier entries.

It is also important that errors and warnings raised by validation software have clear meanings and that validation results are clearly encoded within each entry. The end user can then make informed choices about which entries to include (or not) in any given application. Recent moves to apply a range of agreed and unambiguous primary checks to new data, and to require resolution of any problems prior to the issue of a publication ID code, represent an important development.
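The idea of encoding validation results within each entry, so that users can select entries for a given application, might be sketched as follows. Everything here is hypothetical: the severity levels, check names and entry codes are invented for illustration and do not correspond to the flags actually stored by the CSD or the PDB.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Severity(IntEnum):
    """Hypothetical outcome levels for a single validation check."""
    OK = 0
    WARNING = 1
    ERROR = 2


@dataclass
class Entry:
    """A database entry carrying the outcome of each named validation check."""
    code: str
    checks: Dict[str, Severity] = field(default_factory=dict)

    def worst(self) -> Severity:
        # An entry with no recorded checks is treated as OK in this sketch.
        return max(self.checks.values(), default=Severity.OK)


def select_entries(entries: List[Entry], max_severity: Severity = Severity.OK) -> List[Entry]:
    """Keep entries whose worst check outcome does not exceed max_severity."""
    return [entry for entry in entries if entry.worst() <= max_severity]


# Invented entries and check names, for illustration only.
entries = [
    Entry("ENTRY_A", {"bond_lengths": Severity.OK, "chirality": Severity.OK}),
    Entry("ENTRY_B", {"bond_lengths": Severity.WARNING, "chirality": Severity.OK}),
    Entry("ENTRY_C", {"bond_lengths": Severity.OK, "chirality": Severity.ERROR}),
]

strict = select_entries(entries)
tolerant = select_entries(entries, max_severity=Severity.WARNING)
print([e.code for e in strict])
print([e.code for e in tolerant])
```

A strict selection suits an analysis that is sensitive to individual outliers, whereas a more tolerant threshold may be acceptable when the sample would otherwise be too small.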

References

Abola, E. E., Sussman, J. L., Prilusky, J. & Manning, N. O. (1997). Protein Data Bank archives of three-dimensional macromolecular structures. Methods Enzymol. 277, 556–571.

Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R. & Schneider, B. (1992). The Nucleic Acid Database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys. J. 63, 751–759.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242.

Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The crystallographic information file (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685.

Hooft, R. W. W., Vriend, G., Sander, C. & Abola, E. E. (1996). Errors in protein structures. Nature (London), 381, 272.

Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
