The PDB database resource

Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G. L.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E.

doi:10.1107/97809553602060000722

International Tables for Crystallography, Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F. ch. 24.5, pp. 677-678

Section 24.5.3. The PDB database resource

H. M. Berman,^a ^* J. Westbrook,^a Z. Feng,^a G. Gilliland,^b T. N. Bhat,^b H. Weissig,^c I. N. Shindyalov^c and P. E. Bourne^d

^a Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA, ^b National Institute of Standards and Technology, Biotechnology Division, 100 Bureau Drive, Gaithersburg, MD 20899, USA, ^c San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA, and ^d Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
Correspondence e-mail: berman@rcsb.rutgers.edu

24.5.3. The PDB database resource

24.5.3.1. The database architecture

In recognition of the fact that no single architecture can fully express the information content of the PDB, an integrated system of heterogeneous databases and indices that store and organize the structural data has been created. At present there are five major components (Fig. 24.5.3.1):

(1) The core relational database managed by Sybase (Sybase Inc., 1995) provides the central physical storage for the primary experimental and coordinate data described in Table 24.5.2.1. The core PDB relational database contains all deposited information in a tabular form that can be accessed across any number of structures.
(2) The final curated data files (in PDB format) and data dictionaries are the archival data and are present as ASCII files in the ftp archive.
(3) The POM-based databases (Shindyalov & Bourne, 1997) consist of indexed objects containing native (e.g. atomic coordinates) and derived properties (e.g. calculated secondary-structure assignments and property profiles). Some properties require no derivation, for example, B factors; others must be derived, for example, exposure of each amino-acid residue (Lee & Richards, 1971) or Cα contact maps. Properties requiring significant computation time, such as structure neighbours (Shindyalov & Bourne, 1998), are pre-calculated when the database is incremented to save considerable user-access time.
(4) The Biological Macromolecule Crystallization Database (BMCD; Gilliland, 1988) is organized as a relational database within Sybase and contains three general categories of literature-derived information: macromolecular, crystal and summary data.
(5) The Netscape LDAP server is used to index the textual content of the PDB in a structured format and provides support for keyword searches.

Figure 24.5.3.1. The integrated query interface to the PDB.
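Item (3) distinguishes native properties, which are stored as deposited, from derived properties, which must be computed. As a minimal sketch of one derived property, the Python below builds a Cα contact map from a list of Cα coordinates. The function name `ca_contact_map`, the 8 Å cutoff and the toy coordinates are illustrative assumptions and are not taken from the POM implementation.

```python
import math

def ca_contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix that is True where two
    C-alpha atoms lie within `cutoff` angstroms of each other."""
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            close = math.dist(coords[i], coords[j]) <= cutoff
            contacts[i][j] = contacts[j][i] = close
    return contacts

# Toy coordinates in angstroms (hypothetical, not from any PDB entry).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = ca_contact_map(toy_coords)
```

Because a map like this depends only on the coordinates, it could be computed once when an entry is added and then stored alongside the native data, which is the general idea behind pre-calculating expensive properties.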

In the current implementation, communication among databases has been accomplished using the common gateway interface (CGI). An integrated web interface dispatches a query to the appropriate database(s), which then executes the query. Each database returns the PDB identifiers that satisfy the query, and the CGI program integrates the results. Complex queries are performed by repeating the process and having the interface program perform the appropriate Boolean operation(s) on the collection of query results. A variety of output options are then available for use with the final list of selected structures.
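The dispatch-and-integrate step described above can be sketched with Python sets. This is an illustration only: the component names, the stand-in query functions and the PDB identifiers are hypothetical placeholders, and the real system communicates through CGI programs rather than in-process function calls.

```python
from functools import reduce

# Hypothetical stand-ins for component databases. Each takes a query
# term and returns the set of PDB identifiers that satisfy it.
def query_relational(term):
    fake_index = {"resolution<2.0": {"1ABC", "2DEF", "3GHI"}}
    return fake_index.get(term, set())

def query_keyword(term):
    fake_index = {"kinase": {"2DEF", "3GHI", "4JKL"}}
    return fake_index.get(term, set())

def query_property(term):
    fake_index = {"helix-rich": {"3GHI", "5MNO"}}
    return fake_index.get(term, set())

DISPATCH = {
    "relational": query_relational,
    "keyword": query_keyword,
    "property": query_property,
}

def run_query(subqueries, operator="AND"):
    """Send each (database, term) pair to its database, then combine
    the returned identifier sets with the requested Boolean operator."""
    if operator not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {operator}")
    results = [DISPATCH[db](term) for db, term in subqueries]
    if not results:
        return set()
    combine = set.intersection if operator == "AND" else set.union
    return reduce(combine, results)

both = run_query([("relational", "resolution<2.0"), ("keyword", "kinase")], "AND")
either = run_query([("relational", "resolution<2.0"), ("property", "helix-rich")], "OR")
# Iterative refinement: exclude entries matched by a further query (NOT).
refined = both - run_query([("property", "helix-rich")])
```

Exchanging only identifier sets keeps the integration layer independent of how each database stores its data, which is what would allow further databases to be added to the dispatch table.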

The CGI approach (and in the future a CORBA-based approach) will permit other databases to be integrated into this system, for example, those containing extended data on different protein families. The same approach could also be applied to include NMR data found in the BMRB or data found in other community databases.

24.5.3.2. Database queries

Three distinct query interfaces are available for querying data within the PDB: Status Query (http://www.rcsb.org/pdb/status.html), SearchLite (http://www.rcsb.org/pdb/searchlite.html) and SearchFields (http://www.rcsb.org/pdb/cgi/queryForm.cgi). Table 24.5.3.1 summarizes the current query and analysis capabilities of the PDB. Fig. 24.5.3.2 illustrates how the various query options are organized.

Table 24.5.3.1. Current query capabilities of the PDB

(a) Query – single or iterative

Free text – any word in the PDB

Specific data items – compound name, author, description, deposition date, resolution, source, citation, cell dimensions, experimental method, data-collection method, refinement method, broad structure type, ligand (using the PDB HET records)

Property pattern – sequence, secondary structure

Structure similarity – 3D comparison

(b) Results analysis – single structure

Synopsis/Snapshot/Atlas – compound name, sequence, chemical components, citation, space group, cell constants, crystallization conditions, refinement details, structure views

Quick report – compound name, author, description, deposition date, resolution, source, citation, cell dimensions, experimental method, data-collection method, refinement method, geometry features

Full report – Quick report results plus secondary structure, chemical components, solvent

Property profiles – sequence, secondary structure

Links – see Table 24.5.3.2

Render – RasMol, Chime, QuickPDB (Java applet), VRML, Protein Explorer

Geometry – bond lengths, bond angles, dihedrals, close contacts, summary visual inspection

(c) Results analysis – multiple structures

Quick report – as above, but collated over multiple structures

Full report – as above, but collated over multiple structures

Structure neighbours – pairwise structure comparison

(d) Other query output options

mmCIF and PDB data files

Compressed files (gzip, tar, compressed)

Figure 24.5.3.2. The various query options that are available for the PDB.

SearchLite, which provides a single form field for keyword searches, was introduced in February 1999. All textual information within the PDB files as well as dates and some experimental data are accessible via simple or structured queries. SearchFields, accessible since May 1999, is a customizable query form that allows searching over many different data items, including compound, citation authors, sequence (via a FASTA search; Pearson & Lipman, 1988) and release or deposition dates.

Two user interfaces provide extensive information for result sets from SearchLite or SearchFields queries. The 'Query result browser' interface gives access to general information, to more detailed information in tabular format, and to downloads of whole sets of data files for result sets consisting of multiple PDB entries. The 'Structure explorer' interface provides information about individual structures as well as cross-links to many external resources for macromolecular structure data (Table 24.5.3.2). Both interfaces are accessible to other data resources through the simple CGI application programmer interface (API) described at http://www.rcsb.org/pdb/linking.html.

Table 24.5.3.2. Static cross-links to other data resources currently provided by the PDB

Resource	Information content
3Dee (Siddiqui & Barton, 1996)	Structural domain definitions
BMCD (Gilliland, 1988)	Crystallization information about biomacromolecules
CATH (Orengo et al., 1997)	Protein fold classification
CE (Shindyalov & Bourne, 1998)	Complete PDB and representative structure comparison and alignments
DSSP (Kabsch & Sander, 1983)	Secondary-structure classification
Enzyme Structures Database (Laskowski & Wallace, 1998)	Enzyme classifications and nomenclature
FSSP (Holm & Sander, 1998)	Structurally similar families
GRASS (Nayal et al., 1999)	Graphical representation and analysis
HSSP (Dodge et al., 1998)	Homology-derived secondary structures
Image (Sühnel, 1996)	Image library of biological macromolecules
MMDB (Hogue et al., 1996)	Database of three-dimensional structures
MEDLINE (National Library of Medicine, 1989)	Direct access to MEDLINE at NCBI
NDB (Berman et al., 1992)	Database of three-dimensional nucleic acid structures
PDBObs (Weissig et al., 1998)	Obsolete structures database
PDBSum (Laskowski et al., 1997)	Summary information about protein structures
SCOP (Murzin et al., 1995)	Structure classifications
STING (Neshich et al., 1998)	Simultaneous display of structural and sequence information
Tops (Westhead et al., 1998)	Protein structure motif comparisons and topological diagrams
VAST (Gibrat et al., 1996)	Vector Alignment Search Tool (NCBI)
Whatcheck (Hooft et al., 1996)	Protein structure checks

Table 24.5.3.3 indicates that usage has climbed dramatically since the system was first introduced in February 1999. Currently the PDB receives approximately 90 000 web hits per day, or, on average, one query every second, seven days a week, 24 hours a day.

Table 24.5.3.3. Web query statistics for the primary RCSB site (www.rcsb.org)

Month	Hits (daily average)	Files (daily average)	Sites (monthly total)	Kbytes (monthly total)	Files (monthly total)	Hits (monthly total)
August 1999	63768	47675	34928	31781561	1477927	1976818
July 1999	75693	54427	38698	35652864	1687265	2346495
June 1999	33256	27054	11586	11164410	622264	764894
May 1999	26890	22085	12405	12463441	684650	833597
April 1999	21140	17099	12261	9925351	512990	634224
March 1999	8406	6911	6292	3560629	214255	260610
February 1999	2944	2433	2246	844536	68133	82453
January 1999	1563	1353	1153	92014	35202	40641
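
The 'one query every second' figure follows from dividing the quoted daily hits by the 86 400 seconds in a day. The short Python check below reproduces that arithmetic and, as a second illustration, computes the growth in daily average hits between February and August 1999 from the table. The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400 seconds

# Daily average hits, copied from Table 24.5.3.3.
daily_hits = {
    "February 1999": 2944,
    "August 1999": 63768,
}

# Approximate current load quoted in the text: 90 000 hits per day.
hits_per_second = 90_000 / SECONDS_PER_DAY

# Ratio of the August 1999 daily average to the February 1999 daily average.
growth = daily_hits["August 1999"] / daily_hits["February 1999"]

print(f"{hits_per_second:.2f} hits per second")
print(f"{growth:.1f}-fold growth in daily hits, February to August 1999")
```

The first value comes to about 1.04 hits per second, consistent with the statement in the text.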

References

Sybase Inc. (1995). 70202–01–1100–01 SYBASE SQL server release 11.0. Emeryville, CA, USA.

Gilliland, G. L. (1988). A Biological Macromolecule Crystallization Database: a basis for a crystallization strategy. J. Cryst. Growth, 90, 51–59.

Lee, B. & Richards, F. M. (1971). The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol. 55, 379–400.

Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

Shindyalov, I. N. & Bourne, P. E. (1997). Protein data representation and query using optimized data decomposition. Comput. Appl. Biosci. 13, 487–496.

Shindyalov, I. N. & Bourne, P. E. (1998). Protein structure alignment by incremental combinatorial extension of the optimum path. Protein Eng. 11, 739–747.
