International Tables for Crystallography
Volume F
Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F, ch. 24.5, pp. 677–678

Section 24.5.3. The PDB database resource

H. M. Berman,(a)* J. Westbrook,(a) Z. Feng,(a) G. Gilliland,(b) T. N. Bhat,(b) H. Weissig,(c) I. N. Shindyalov(c) and P. E. Bourne(d)

(a) Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
(b) National Institute of Standards and Technology, Biotechnology Division, 100 Bureau Drive, Gaithersburg, MD 20899, USA
(c) San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
(d) Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
* Correspondence e-mail: berman@rcsb.rutgers.edu

24.5.3. The PDB database resource

24.5.3.1. The database architecture

Because no single architecture can fully express the information content of the PDB, an integrated system of heterogeneous databases and indices has been created to store and organize the structural data. At present there are five major components (Fig. 24.5.3.1):

  • (1) The core relational database managed by Sybase (Sybase Inc., 1995) provides the central physical storage for the primary experimental and coordinate data described in Table 24.5.2.1. The core PDB relational database contains all deposited information in a tabular form that can be accessed across any number of structures.

    Figure 24.5.3.1. The integrated query interface to the PDB.

  • (2) The final curated data files (in PDB format) and data dictionaries are the archival data and are present as ASCII files in the ftp archive.

  • (3) The POM-based databases (Shindyalov & Bourne, 1997) consist of indexed objects containing native (e.g. atomic coordinates) and derived properties (e.g. calculated secondary-structure assignments and property profiles). Some properties require no derivation, for example, B factors; others must be derived, for example, exposure of each amino-acid residue (Lee & Richards, 1971) or Cα contact maps. Properties requiring significant computation time, such as structure neighbours (Shindyalov & Bourne, 1998), are pre-calculated when the database is incremented to save considerable user-access time.

  • (4) The Biological Macromolecule Crystallization Database (BMCD; Gilliland, 1988) is organized as a relational database within Sybase and contains three general categories of literature-derived information: macromolecular, crystal and summary data.

  • (5) The Netscape LDAP server is used to index the textual content of the PDB in a structured format and provides support for keyword searches.
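
The distinction drawn in item (3) between native and derived properties can be illustrated with a minimal Python sketch. It reads B factors directly from PDB-format ATOM records (no derivation needed) and derives a Cα contact map from the coordinates. The 8 Å distance cutoff, the function names and the `atom_record` helper that builds demonstration records are assumptions made for this example; none of them is taken from the POM implementation.

```python
import math

def atom_record(serial, name, res_name, chain, res_seq, x, y, z, b_factor):
    """Build a fixed-column PDB ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:3s} {chain:1s}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}")

def parse_ca_atoms(pdb_lines):
    """Return (residue number, (x, y, z), B factor) for every C-alpha ATOM record."""
    atoms = []
    for line in pdb_lines:
        # PDB format, 1-based columns: atom name 13-16, residue number 23-26,
        # x/y/z 31-54, temperature (B) factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_seq = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            b_factor = float(line[60:66])
            atoms.append((res_seq, xyz, b_factor))
    return atoms

def contact_map(atoms, cutoff=8.0):
    """Derived property: symmetric boolean matrix of C-alpha pairs within cutoff (angstroms)."""
    n = len(atoms)
    cmap = [[False] * n for _ in range(n)]  # diagonal stays False (no self-contacts)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(atoms[i][1], atoms[j][1]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

# Toy coordinates, not taken from any PDB entry.
demo_lines = [
    atom_record(1, " N  ", "ALA", "A", 1, -1.4, 0.0, 0.0, 14.1),
    atom_record(2, " CA ", "ALA", "A", 1, 0.0, 0.0, 0.0, 15.2),
    atom_record(3, " CA ", "GLY", "A", 2, 3.8, 0.0, 0.0, 17.9),
    atom_record(4, " CA ", "SER", "A", 3, 12.0, 0.0, 0.0, 21.4),
]
ca_atoms = parse_ca_atoms(demo_lines)
b_factors = [b for _, _, b in ca_atoms]   # native property, read as stored
cmap = contact_map(ca_atoms)              # derived property, computed
```

In a system of the kind described above, a derived matrix such as `cmap` would be computed once when an entry is loaded and stored alongside the native data, so that queries do not pay the pairwise-distance cost at access time.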

In the current implementation, communication among databases has been accomplished using the common gateway interface (CGI). An integrated web interface dispatches a query to the appropriate database(s), which then executes the query. Each database returns the PDB identifiers that satisfy the query, and the CGI program integrates the results. Complex queries are performed by repeating the process and having the interface program perform the appropriate Boolean operation(s) on the collection of query results. A variety of output options are then available for use with the final list of selected structures.
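
The dispatch-and-combine pattern described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the backend names, the query strings and the PDB-style identifiers are invented toy data, and in-memory dictionaries stand in for the Sybase, POM and LDAP components. It is not the RCSB CGI program.

```python
from typing import Callable, Dict, List, Set, Tuple

# A backend takes a query term and returns the set of matching PDB identifiers.
Backend = Callable[[str], Set[str]]

def make_backend(index: Dict[str, Set[str]]) -> Backend:
    """Wrap a toy in-memory index as a backend (hypothetical stand-in for a database)."""
    def search(term: str) -> Set[str]:
        return set(index.get(term, set()))
    return search

# Invented identifiers and index contents, for illustration only.
BACKENDS: Dict[str, Backend] = {
    "keyword_index": make_backend({"lysozyme": {"1ABC", "2DEF", "3GHI"}}),
    "relational_db": make_backend({"resolution < 2.0": {"2DEF", "3GHI", "4JKL"}}),
}

def dispatch(clauses: List[Tuple[str, str]]) -> List[Set[str]]:
    """Send each (backend name, term) clause to its backend and collect the ID sets."""
    return [BACKENDS[name](term) for name, term in clauses]

def combine(result_sets: List[Set[str]], operator: str) -> Set[str]:
    """Fold the ID sets left to right with one Boolean operator."""
    if not result_sets:
        return set()
    combined = set(result_sets[0])
    for ids in result_sets[1:]:
        if operator == "AND":
            combined &= ids
        elif operator == "OR":
            combined |= ids
        elif operator == "NOT":
            combined -= ids
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return combined

clauses = [("keyword_index", "lysozyme"), ("relational_db", "resolution < 2.0")]
results = dispatch(clauses)
both = combine(results, "AND")        # entries matching both clauses
either = combine(results, "OR")       # entries matching at least one clause
first_only = combine(results, "NOT")  # keyword matches that fail the second clause
```

Because every backend returns only identifiers, the integrating layer needs no knowledge of how each database stores its data; adding another database amounts to registering one more function that maps a term to a set of identifiers.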

The CGI approach (and in the future a CORBA-based approach) will permit other databases to be integrated into this system, for example, those containing extended data on different protein families. The same approach could also be applied to include NMR data found in the BMRB or data found in other community databases.

24.5.3.2. Database queries

Three distinct query interfaces are available for querying data within the PDB: Status Query (http://www.rcsb.org/pdb/status.html), SearchLite (http://www.rcsb.org/pdb/searchlite.html) and SearchFields (http://www.rcsb.org/pdb/cgi/queryForm.cgi). Table 24.5.3.1 summarizes the current query and analysis capabilities of the PDB. Fig. 24.5.3.2 illustrates how the various query options are organized.

Table 24.5.3.1. Current query capabilities of the PDB

(a) Query – single or iterative

Free text – any word in the PDB
Specific data items – compound name, author, description, deposition date, resolution, source, citation, cell dimensions, experimental method, data-collection method, refinement method, broad structure type, ligand (using the PDB HET records)
Property pattern – sequence, secondary structure
Structure similarity – 3D comparison

(b) Results analysis – single structure

Synopsis/Snapshot/Atlas – compound name, sequence, chemical components, citation, space group, cell constants, crystallization conditions, refinement details, structure views
Quick report – compound name, author, description, deposition date, resolution, source, citation, cell dimensions, experimental method, data-collection method, refinement method, geometry features
Full report – Quick report results plus secondary structure, chemical components, solvent
Property profiles – sequence, secondary structure
Links – see Table 24.5.3.2
Render – RasMol, Chime, QuickPDB (Java applet), VRML, Protein Explorer
Geometry – bond lengths, bond angles, dihedrals, close contacts, summary visual inspection

(c) Results analysis – multiple structure

Quick report – as above, but collated over multiple structures
Full report – as above, but collated over multiple structures
Structure neighbours – pairwise structure comparison

(d) Other query output options

mmCIF and PDB data files
Compressed files (gzip, tar, compressed)

Figure 24.5.3.2. The various query options that are available for the PDB.

SearchLite, which provides a single form field for keyword searches, was introduced in February 1999. All textual information within the PDB files as well as dates and some experimental data are accessible via simple or structured queries. SearchFields, accessible since May 1999, is a customizable query form that allows searching over many different data items, including compound, citation authors, sequence (via a FASTA search; Pearson & Lipman, 1988) and release or deposition dates.

Two user interfaces provide extensive information for result sets from SearchLite or SearchFields queries. The ‘Query result browser’ interface gives access to general information and to more detailed information in tabular format, and allows whole sets of data files to be downloaded when a result set consists of multiple PDB entries. The ‘Structure explorer’ interface provides information about individual structures as well as cross-links to many external resources for macromolecular structure data (Table 24.5.3.2). Both interfaces are accessible to other data resources through the simple CGI application programmer interface (API) described at http://www.rcsb.org/pdb/linking.html.

Table 24.5.3.2. Static cross-links to other data resources currently provided by the PDB

Resource | Information content
3Dee (Siddiqui & Barton, 1996) | Structural domain definitions
BMCD (Gilliland, 1988) | Crystallization information about biomacromolecules
CATH (Orengo et al., 1997) | Protein fold classification
CE (Shindyalov & Bourne, 1998) | Complete PDB and representative structure comparison and alignments
DSSP (Kabsch & Sander, 1983) | Secondary-structure classification
Enzyme Structures Database (Laskowski & Wallace, 1998) | Enzyme classifications and nomenclature
FSSP (Holm & Sander, 1998) | Structurally similar families
GRASS (Nayal et al., 1999) | Graphical representation and analysis
HSSP (Dodge et al., 1998) | Homology-derived secondary structures
Image (Sühnel, 1996) | Image library of biological macromolecules
MMDB (Hogue et al., 1996) | Database of three-dimensional structures
MEDLINE (National Library of Medicine, 1989) | Direct access to MEDLINE at NCBI
NDB (Berman et al., 1992) | Database of three-dimensional nucleic acid structures
PDBObs (Weissig et al., 1998) | Obsolete structures database
PDBSum (Laskowski et al., 1997) | Summary information about protein structures
SCOP (Murzin et al., 1995) | Structure classifications
STING (Neshich et al., 1998) | Simultaneous display of structural and sequence information
Tops (Westhead et al., 1998) | Protein structure motif comparisons topological diagrams
VAST (Gibrat et al., 1996) | Vector Alignment Search Tool (NCBI)
Whatcheck (Hooft et al., 1996) | Protein structure checks

Table 24.5.3.3 shows that usage has climbed sharply since the system was first introduced in February 1999, from a daily average of about 2900 hits in that month to more than 60 000 in July and August 1999. Currently the PDB receives approximately 90 000 web hits per day, or, on average, about one query every second, seven days a week, 24 hours a day.

Table 24.5.3.3. Web query statistics for the primary RCSB site (www.rcsb.org)

Month | Daily average: Hits | Daily average: Files | Monthly totals: Sites | Monthly totals: Kbytes | Monthly totals: Files | Monthly totals: Hits
August 1999 | 63768 | 47675 | 34928 | 31781561 | 1477927 | 1976818
July 1999 | 75693 | 54427 | 38698 | 35652864 | 1687265 | 2346495
June 1999 | 33256 | 27054 | 11586 | 11164410 | 622264 | 764894
May 1999 | 26890 | 22085 | 12405 | 12463441 | 684650 | 833597
April 1999 | 21140 | 17099 | 12261 | 9925351 | 512990 | 634224
March 1999 | 8406 | 6911 | 6292 | 3560629 | 214255 | 260610
February 1999 | 2944 | 2433 | 2246 | 844536 | 68133 | 82453
January 1999 | 1563 | 1353 | 1153 | 92014 | 35202 | 40641
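
The arithmetic behind these figures can be checked with a short Python sketch. It reads the first numeric column of Table 24.5.3.3 as the daily average of hits and the last as the monthly total of hits, divides one by the other to estimate how many days of traffic each monthly total covers, and converts the quoted 90 000 hits per day into a per-second rate. The interpretation of the result (that the daily averages for some months were taken over fewer days than the calendar month) is an inference from the numbers, not something stated in the text, and the variable names are chosen for this example only.

```python
import calendar

# (month number in 1999, daily-average hits, monthly-total hits) from Table 24.5.3.3
ROWS = [
    (8, 63768, 1976818),
    (7, 75693, 2346495),
    (6, 33256, 764894),
    (5, 26890, 833597),
    (4, 21140, 634224),
    (3, 8406, 260610),
    (2, 2944, 82453),
    (1, 1563, 40641),
]

def implied_days(daily_average, monthly_total):
    """Number of days over which the daily average appears to have been taken."""
    return round(monthly_total / daily_average)

implied = [implied_days(avg, total) for _, avg, total in ROWS]
calendar_days = [calendar.monthrange(1999, month)[1] for month, _, _ in ROWS]

# Months whose totals seem to cover fewer days than the calendar month.
partial_months = [
    calendar.month_name[month]
    for (month, _, _), n_implied, n_calendar in zip(ROWS, implied, calendar_days)
    if n_implied != n_calendar
]

# The quoted current load: about 90 000 hits per day, expressed per second.
hits_per_second = 90_000 / (24 * 60 * 60)

print(implied)                    # [31, 31, 23, 31, 30, 31, 28, 26]
print(partial_months)             # ['June', 'January']
print(round(hits_per_second, 2))  # 1.04
```

For six of the eight months the daily average multiplied by the calendar length of the month reproduces the monthly total to within rounding, which supports the column assignment used here; 90 000 hits per day corresponds to roughly 1.04 hits per second, consistent with the statement of about one query every second.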

References

Sybase Inc. (1995). 70202–01–1100–01 SYBASE SQL server release 11.0. Emeryville, CA, USA.
Gilliland, G. L. (1988). A Biological Macromolecule Crystallization Database: a basis for a crystallization strategy. J. Cryst. Growth, 90, 51–59.
Lee, B. & Richards, F. M. (1971). The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 55, 379–400.
Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Shindyalov, I. N. & Bourne, P. E. (1997). Protein data representation and query using optimized data decomposition. Comput. Appl. Biosci. 13, 487–496.
Shindyalov, I. N. & Bourne, P. E. (1998). Protein structure alignment by incremental combinatorial extension of the optimum path. Protein Eng. 11, 739–747.