Contents and access to the PDB archives

Sussman, J. L.; Lin, D.; Jiang, J.-S.; Manning, N. O.; Prilusky, J.; Abola, E. E.

doi:10.1107/97809553602060000718

International Tables for Crystallography
Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold


International Tables for Crystallography (2006). Vol. F, ch. 24.1, pp. 650–653

Section 24.1.3.1. Contents and access to the PDB archives

J. L. Sussman,^a* D. Lin,^b J. Jiang,^b N. O. Manning,^b J. Prilusky^c and E. E. Abola^d

^a Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel, ^b Biology Department, Bldg 463, Brookhaven National Laboratory, Upton, NY 11973-5000, USA, ^c Bioinformatics Unit, Weizmann Institute of Science, Rehovot 76100, Israel, and ^d The Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
Correspondence e-mail: joel.sussman@weizmann.ac.il

The archives contain atomic coordinates, bibliographic citations, primary- and secondary-structure information, crystallographic structure factors, and NMR experimental data. Annotations in the structure entries include amino-acid or nucleotide sequences (with notes of any conflicts between the structure in the PDB and sequence databases), source organisms from which the biological materials were derived, descriptions of the experiments, secondary structures, complexes with small molecules included within the structures, and references to papers. Third-party annotations include images and movies of structures; pointers to specialized databases maintained by others, such as the Protein Kinase Resource (http://www.kinasenet.org/pkr/Welcome.do) and ESTHER (ESTerases and α/β Hydrolase Enzymes and Relatives; http://www.ensam.inra.fr/cholinesterase/); and pointers to databases that provide additional experimental information, such as the BioMagResBank (BMRB) NMR structural database (http://www.bmrb.wisc.edu/). Table 24.1.3.1 gives a summary of the contents of the PDB archives.

Table 24.1.3.1
PDB archive contents as of May 1999

9862	Atomic coordinate entries
2768	Structure-factor files
560	NMR restraint files

Molecule type:
8754	Proteins, peptides and viruses
415	Protein/nucleic acid complexes
681	Nucleic acids
12	Carbohydrates

Experimental technique:
8103	Diffraction
1544	NMR
215	Theoretical modelling

PDB entries are available on CD-ROM, which PC users can search using the PDB-SHELL browser included (Abola, 1994). UNIX users can also search the CD-ROM if they download a copy of the browser software. The entries are also available over the WWW from Brookhaven and 17 mirror sites worldwide (Table 24.1.3.2). They can be searched and retrieved via the PDB's 3DB Browser (Sussman, 1997), which is interfaced through web browsers such as Netscape Communicator and Internet Explorer. Probably the best way to get a feeling for 3DB Browser is just to try it. A simple example of its use is illustrated in Fig. 24.1.3.1 in a search for a structure related to recent papers in Nature (Kwong et al., 1998) and Science (Rizzuto et al., 1998).

Table 24.1.3.2
PDB mirror sites as of May 1999

Official PDB mirror sites
Argentina: University of San Luis
Australia: Australian National Genomic Information Service, Sydney; The Walter and Eliza Hall Institute of Medical Research, Melbourne
Brazil: ICB-UFMG, Inst. de Ciencias Biologicas, Univ. Federal de Minas Gerais
China: Institute of Physical Chemistry, Peking University, Beijing
France: Institut de Génétique Humaine, Montpellier
Germany: GMD, German National Research Center for Information Technology, Sankt Augustin
India: Bioinformatics Centre, University of Pune
Israel: Weizmann Institute of Science, Rehovot
Japan: Institute of Protein Research, Osaka University
Poland: ICM - Interdisciplinary Centre for Modelling, Warsaw University
Taiwan: National Tsing Hua University, HsinChu
United Kingdom: Cambridge Crystallographic Data Centre, Cambridge; EMBL Outstation, EBI, Hinxton
United States: Bio Molecular Engineering Research Center, Boston University; North Carolina Supercomputing Center, Research Triangle Park; University of Georgia, Athens, Georgia; PDB at Brookhaven National Laboratory

Figure 24.1.3.1

3DB Browser as a tool to visualize recently published structures. (1) Search for author: Hendrickson; text query: HIV. (2) Six hits obtained, PDB ID Code 1GC1 highlighted. (3) 3DB Browser Atlas page. Ovals highlight the expression systems used for the different components in the multicomponent system. (4) Structure as visualized with MDL's Chemscape Chime plug-in.

3DB Browser has a number of features that make it easy to access information found in PDB entries. Users can search according to any combination of fields, such as compound name, experiment title, authors (depositors), biological source, journal references, date of deposition and nature of small molecules (ligands and heterogens) complexed with the structure. Boolean operators allow highly complex search strings. Entries selected can be retrieved automatically, and the molecular structures can be displayed using the public-domain molecular viewer RasMol (Sayle & Milner-White, 1995), MDL's Chemscape Chime plug-in, or a similar viewer. The entries also include HyperText links to the SwissProt protein-sequence database (Bairoch & Boeckmann, 1994), the BioMagResBank (BMRB) NMR structural database (Seavey et al., 1991), the Enzyme Commission Database (Bairoch, 1994), PubMed access to the Medline database, and several other databases (see Table 24.1.3.3 for a list of linked external data sources).

Table 24.1.3.3
3DB Browser's linked external data sources

Source name	Short description
BioMagResBank	Relational database for sequence-specific protein NMR data
BLOCKS	Database of conserved regions in groups of proteins
CATH	Protein structure classification
DALI/FSSP	Families of structurally similar proteins
EMBL	European Molecular Biology Laboratory sequence database
Entrez	NCBI's documentation database
ENZYME	Enzyme nomenclature database
ESTHER	Esterases and alpha/beta hydrolase enzymes and relatives database
GenBank	NIH genetic sequence database
GDB	Genome Data Base
Kinase	Protein Kinase Database Project
KineMage	Protein Science's Kinemage server
LPFC	Library of Protein Family Cores
MacroMolecule	Crystal MacroMolecule files at the EBI
MMDB	Molecular Modelling Database
NDB	Nucleic Acid Database
OLDERADO	Core, domain and representative structure database
PDBObs	Archive of obsolete PDB entries at SDSC
PDBREPORT	Structure verification reports for X-ray structures
PIR	Protein Information Resource
PROSITE	Dictionary of protein sites and patterns
ProtMotDB	Protein Motions Database
PubMed	Medline bibliographic database
SCOP	Structural classification of proteins
Swiss 3D-Image	3D images of proteins and other biological macromolecules
SwissProt	Annotated protein sequence database
TREMBL	Translation from EMBL sequence database

The main source of information for the 3DB Browser is the data from the PDB. These data are highly structured, and most crystallographers think of a datum from a PDB entry as belonging to a particular `record' or `field'. It makes sense to use these fields to constrain the search: searching for `rich' as a keyword, for example, has a different meaning from searching for the author Rich.

The simplest operation with the browser is to enter one or more words in the `Text query' field and press the `Search' button. The browser engine will come back with those entries from the database that contain or are related to the words provided.

The symbol `*' can be used as a wild card to denote a sequence of any number (including 0) of arbitrary characters. Just add an asterisk, `*', at the beginning or end of a word (or both) to `extend' the search. For example, enter `*tox*' in the keyword field to retrieve those entries containing keywords like neurotoxic and toxin. Wild cards have no meaning in number-only fields, like Resolution and Date.
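The wild-card rule can be illustrated with a minimal Python sketch. It is not the 3DB Browser implementation: the function names are hypothetical, and case-insensitive matching is an assumption that the text does not confirm for search words.

```python
import re


def wildcard_to_regex(query):
    """Translate a query using `*` as a wild card into a compiled regex.

    Each `*` stands for any run of characters, including none. Every
    other character is matched literally.
    """
    parts = query.split("*")
    pattern = ".*".join(re.escape(part) for part in parts)
    # Assumption: matching ignores case.
    return re.compile(pattern, re.IGNORECASE)


def wildcard_match(query, word):
    """Return True if the whole word matches the wild-card query."""
    return wildcard_to_regex(query).fullmatch(word) is not None


# Illustrative keyword list, not taken from real PDB entries.
keywords = ["neurotoxic", "toxin", "kinase"]
hits = [word for word in keywords if wildcard_match("*tox*", word)]
```

Here `hits` holds `neurotoxic` and `toxin`. A query of `tox*` would match only `toxin`, because the wild card extends the word in one direction only.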

The Boolean operator AND is the default for 3DB Browser and is mandatory (it cannot be changed) between fields (see Table 24.1.3.4). If `ATP' is entered in the Associated group field and `kinase' in the Keyword field, only those entries matching both constraints are returned. Inside a given field, Boolean logical operators may be applied at will to the words entered. The available Boolean logical operators are AND, OR and NOT. The case is unimportant. The operator AND can be represented by `+' and the operator NOT can be represented by `−'.

Table 24.1.3.4
Search fields of 3DB Browser

Search field	Content searched (PDB record types)
Entry ID code	Four-character accession code
Keyword	Molecule name, class or family, or related term (HEADER, TITLE, KEYWDS and COMPND fields)
Author	Family name of depositor or author of associated publication (AUTHOR and JRNL fields)
Text query	Any word in the complete PDB text, excluding most field names
Experiment	Method of structure determination
FASTA Search	FASTA search of the sequence
Resolution	A unique value or range of values, in Å (REMARK 2 field)
Space group	Both extended and standard Hermann–Mauguin symbols (CRYST1 field)
Organism	Trivial name, systematic name or expression system (SOURCE field)
Date (lower)	Date entry was deposited or released
Date (upper)	Date entry was deposited or released
Associated group	Prosthetic group, metal ion, ligand, substrate, or its three-letter PDB abbreviation (HET and HETNAM fields)
Chain size	A unique value or range of values

For example, `zinc and (torpedo or snake)' in the Text query field will return those entries that contain either the word torpedo or the word snake, but only if the word zinc is also present. In addition, many specific records can be searched for regular expressions or numerical limits, as shown in Table 24.1.3.4 [see Protein Data Bank Quarterly Newsletter (1998), 83, pp. 3–5, The `Intelligent' Search Engine Behind the 3DB Browser™, and Protein Data Bank Quarterly Newsletter (1998), 84, pp. 3–4, 3DB Browser™: Tips, Questions and Answers, at http://www.rcsb.org/pdb/general_information/news_publications/newsletters/newsletter.html].
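How such a Boolean query could be evaluated against the text of one entry is sketched below in Python. This is an illustration only, not the 3DB Browser engine. It assumes that NOT binds more tightly than AND, and AND more tightly than OR; that adjacent words are joined by the default AND; and that `+` and `-` act as operators only when typed as separate tokens. The function name `query_matches` is hypothetical.

```python
import re


def query_matches(query, text):
    """Evaluate a Boolean word query against the text of one entry."""
    words = set(re.findall(r"\w+", text.lower()))
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        value = parse_and()
        while peek() == "or":
            advance()
            rhs = parse_and()  # always parse, so the position advances
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() not in (None, "or", ")"):
            if peek() in ("and", "+"):
                advance()  # explicit AND; otherwise AND is implied
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() in ("not", "-"):
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = peek()
        if token is None or token in ("and", "or", "+", ")"):
            raise ValueError("search word expected in query: " + query)
        advance()
        if token == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis: " + query)
            advance()
            return value
        return token in words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token in query: " + query)
    return result


# Illustrative entry text, not copied from a real PDB file.
entry_text = "Zinc-bound toxin from snake venom"
found = query_matches("zinc and (torpedo or snake)", entry_text)
```

With this entry text `found` is true, whereas the same query applied to a text that lacks the word zinc returns false, whichever of torpedo or snake is present.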

One of the main concerns for us, as database-interface developers, is the `false negative', that is, the failure to return data after a query even when the data are available in the database. Frequently, this happens because the user was unable to express the query in a way compatible with the search engine or used words or keywords unknown to the search engine.

3DB Browser deals with this problem by incorporating several automatic and semi-automatic mechanisms to help the user retrieve the requested data. The request from the user gets filtered and transformed by one or more engines. At the end, the resulting query is the one used for the search (see Table 24.1.3.5).

Table 24.1.3.5
Search engines used by 3DB Browser

Engine	Example
American–British	`Amoeba' and `ameba' are equivalent
Synonyms	`Protease' is equivalent to `proteinase'
Spelling search	Based on a dictionary built from the current PDB data, the spelling engine will produce words that are close to the entered one. As an example, entering `imune' will offer `immune' as a valid alternative.
Soundex search	Based on the soundex algorithm that approximates the sound of the word when spoken by an English speaker. Looking for author `Weich' will offer as alternatives Weiss, Wess and Wyss
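
The last two engines in Table 24.1.3.5 can be illustrated with a short Python sketch. It assumes the standard American Soundex rules and, for the spelling engine, swaps in the similarity ranking of Python's `difflib` module. The text does not say which Soundex variant or spelling algorithm 3DB Browser uses, and the small dictionary below is invented for the example.

```python
import difflib

# Standard American Soundex digit for each coded consonant.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate equal codes
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset the previous code
    return (code + "000")[:4]


def suggest_spelling(word, dictionary, cutoff=0.8):
    """Return dictionary words close to `word`, best match first."""
    return difflib.get_close_matches(word, dictionary, n=3, cutoff=cutoff)


# Hypothetical author list and dictionary, for illustration only.
authors = ["Weiss", "Wess", "Wyss", "Rich"]
sound_alikes = [a for a in authors if soundex(a) == soundex("Weich")]
suggestions = suggest_spelling("imune", ["immune", "insulin", "kinase"])
```

Under these assumptions `Weich`, `Weiss`, `Wess` and `Wyss` all receive the code `W200`, so the three authors are offered as alternatives, and `immune` is the closest dictionary word to `imune`.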

A search in 3DB Browser brings up a rich Atlas page, summarizing additional information about the entry of interest. The links in this Atlas page carry one to the original sources of information. The number of external sources that 3DB Browser searches and dynamically incorporates into the Atlas pages increases daily (Table 24.1.3.3).

The PDB's WWW server is the major tool used to access the three-dimensional macromolecular structural information archived at the PDB. Thousands of times a day, scientists, students and other users around the world visit the PDB to browse through and access these data. In order to meet the need for rapid access worldwide, a global network of 17 official mirror sites has been established. To help orient the user, 3DB Browser incorporates CloserSite (see http://pdb.weizmann.ac.il/pdb-docs/closerSite.html), an automatic script that detects one's location and offers closer alternative sites (in the network sense).

The information on the PDB's web server changes frequently. New information is generated on a daily basis. Synchronizing the PDB and its mirror sites to provide exactly the same services while requiring minimum human involvement is a necessary but nontrivial task.

A protocol for the automatic mirroring of the web sites was developed at BNL based on ftp mirroring technology. This protocol has been used successfully by PDB and its mirror sites for approximately two years.

Fig. 24.1.3.2 outlines the web mirroring protocol, which consists of the following five steps.

(1) Develop and test HTML pages and common-gateway-interface (CGI) codes on the development server in a special source-code control area.
(2) Copy the working code and HTML pages to a read-only area.
(3) Mirror the updated information onto an internal test server that uses its own directory tree, distinct from that used for development. This internal server simulates the production environment under controlled conditions. For example, we verify that updated files are mirrored properly and that relative HTML links work.
(4) Copy the files outside the firewall to an account accessible only to the mirror sites.
(5) Activate the mirror software to transfer the updated files to the PDB web server. Official mirror-site servers are updated automatically by their own mirroring procedures.

Figure 24.1.3.2

Schematic diagram of the PDB WWW mirror system.
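
The idea of transferring only updated files can be sketched in Python for two local directory trees. This is a simplified illustration, not the mirroring software used by the PDB: the real procedure works over ftp, and the update test used here (file missing, size changed, or source more recently modified) is an assumption.

```python
import os
import shutil


def needs_update(src, dst):
    """Return True if dst is missing or looks older than src."""
    if not os.path.exists(dst):
        return True
    src_stat, dst_stat = os.stat(src), os.stat(dst)
    if src_stat.st_size != dst_stat.st_size:
        return True
    # Compare whole seconds to tolerate coarse filesystem timestamps.
    return int(src_stat.st_mtime) > int(dst_stat.st_mtime)


def mirror_tree(src_root, dst_root):
    """Copy new or changed files from src_root into dst_root.

    Returns the relative paths of the files that were copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if needs_update(src, dst):
                shutil.copy2(src, dst)  # copy2 keeps the modification time
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return copied
```

Because `shutil.copy2` preserves modification times, a second call with unchanged sources copies nothing, which is the behaviour that makes a scheduled, unattended run practical.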

Specific areas on the http server are dedicated to PDB web activities: HTML documents are stored under the directory /pdb-docs/, executables (CGI scripts) under /pdb-bin/, and index files and local configuration files under /PDB-support/. Isolating the files in this way avoids confusing PDB applications with other applications on the same server, and with files and directories unrelated to PDB web activities, which would complicate the mirror procedure.

Relative links are used in all the HTML pages, including those generated by the scripts. For example, to create a hyperlink to 3DB Browser in the file named index.html, <a href="/pdb-bin/pdbmain">3DB Browser</a> is used instead of <a href="http://www.pdb.bnl.gov/pdb-bin/pdbmain">3DB Browser</a>. The advantage of relative links is that pages copied to the mirror sites' machines will point to local resources without having to be edited locally. This is one of the key points in automating the web mirror procedure. To make relative links work properly, each mirror site maintains a local configuration file that reflects its local directory tree and available resources. The PDB provides a generic template, and mirror sites modify it according to their setup. This configuration file is excluded from the automatic mirroring procedure so that it is not overwritten by the original template file. Changes to the configuration file are sent to the mirror sites by e-mail one week in advance, to be incorporated manually.
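
The effect of the relative link can be checked with a short Python sketch that resolves the same `href` against the address of the page on two different hosts, using the standard-library function `urllib.parse.urljoin`. The host names appear elsewhere in this section, but the location of index.html under /pdb-docs/ on each host is assumed for illustration.

```python
from urllib.parse import urljoin

# The link as written in index.html: no host name is given.
RELATIVE_HREF = "/pdb-bin/pdbmain"
# The alternative that the text advises against: host name hard-coded.
ABSOLUTE_HREF = "http://www.pdb.bnl.gov/pdb-bin/pdbmain"

# Assumed addresses of the same page on the main site and on a mirror.
PAGES = [
    "http://www.pdb.bnl.gov/pdb-docs/index.html",
    "http://pdb.weizmann.ac.il/pdb-docs/index.html",
]

# A browser resolves each link against the page that contains it.
relative_targets = [urljoin(page, RELATIVE_HREF) for page in PAGES]
absolute_targets = [urljoin(page, ABSOLUTE_HREF) for page in PAGES]
```

The relative link resolves to the host that served the page, so the mirror's copy of index.html leads to the mirror's own copy of the script. The absolute link sends every user back to the Brookhaven server, whichever mirror served the page.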

To avoid duplication and allow easy maintenance of the resources, PDB's web and ftp servers share some files. All mirror sites support both web and ftp servers. When a hyperlink points to a file on the ftp server, a server side include (SSI) script is used to access the local ftp server of each mirror site. Its function is to use configuration variables to generate a path to the local file dynamically.
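
The role of the configuration variables can be sketched in Python. The actual mechanism is a server-side include script whose variable names are not given in the text; the names `FTP_HOST` and `FTP_ROOT`, the host names and the file path below are all hypothetical.

```python
from urllib.parse import quote

# Hypothetical local settings; each mirror site would keep its own values.
MIRROR_CONFIG = {
    "FTP_HOST": "ftp.mirror.example",
    "FTP_ROOT": "/pub/pdb",
}


def local_ftp_url(relative_path, config=MIRROR_CONFIG):
    """Build the address of a shared file on the local ftp server.

    `relative_path` is the location of the file inside the mirrored
    tree, which is the same at every site. Only the host name and the
    root directory come from the local configuration.
    """
    root = config["FTP_ROOT"].strip("/")
    relative = relative_path.lstrip("/")
    path = "/".join(part for part in (root, relative) if part)
    return "ftp://{}/{}".format(config["FTP_HOST"], quote(path))


# Hypothetical file name, for illustration only.
url = local_ftp_url("newsletters/news.txt")
```

Because the page stores only the site-independent part of the path, the same page can be mirrored unchanged, and each site's configuration supplies the rest when the link is generated.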

HTML pages and CGI scripts are put into a read-only account available to official mirror sites. Mirror sites use the ftp mirror tool mirror.pl (ftp://sunsite.org.uk/packages/mirror/) to mirror the updated information from this account. For security reasons, this account is not an anonymous ftp account but requires a password for access, and it can be accessed only by ftp. The process can be run as a cron job to automate the update procedure fully. Although the procedure is automatic, an e-mail message is sent to mirror sites for update verification. For details on the PDB mirror system, see Protein Data Bank Quarterly Newsletter (1999), 87, pp. 3–5, PDB World Wide Web Mirroring System, at http://www.rcsb.org/pdb/general_information/news_publications/newsletters/newsletter.html.

Web access to the archives has become the primary mode of retrieving entries from the PDB. However, the PDB continues to receive a considerable number of orders for our CD-ROM product. The PDB anticipates that this will continue to be so for a variety of reasons. For example, network performance still remains poor in a number of locations, and these disks, released quarterly, provide local access to the contents of the archive. PDB files may first be copied from the CD-ROM to a local disk, and then incremental updates can easily be made using mirroring software.

References

Abola, E. E. (1994). PDB-SHELL. Available at ftp://pdb.bmc.uu.se/pub/databases/pdb/pdb_software/pdbshell/.

Bairoch, A. (1994). The ENZYME data bank. Nucleic Acids Res. 22, 3626–3627.

Bairoch, A. & Boeckmann, B. (1994). The SWISS-PROT protein sequence data bank: current status. Nucleic Acids Res. 22, 3578–3580.

Kwong, P. D., Wyatt, R., Robinson, J., Sweet, R. W., Sodroski, J. & Hendrickson, W. A. (1998). Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature (London), 393, 648–659.

Rizzuto, C. D., Wyatt, R., Hernandez-Ramos, N., Sun, Y., Kwong, P. D., Hendrickson, W. A. & Sodroski, J. (1998). A conserved HIV gp120 glycoprotein structure involved in chemokine receptor binding. Science, 280, 1949–1953.

Sayle, R. A. & Milner-White, E. J. (1995). RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374–376.

Seavey, B. R., Farr, E. A., Westler, W. M. & Markley, J. L. (1991). A relational database for sequence-specific protein NMR data. J. Biomol. Nucl. Magn. Reson. 1, 217–236.

Sussman, J. L. (1997). Bridging the gap. Nature Struct. Biol. 4, 517.
