International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006
International Tables for Crystallography (2006). Vol. F. ch. 24.1, pp. 650-653
Section 24.1.3.1. Contents and access to the PDB archives
a Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel; b Biology Department, Bldg 463, Brookhaven National Laboratory, Upton, NY 11973-5000, USA; c Bioinformatics Unit, Weizmann Institute of Science, Rehovot 76100, Israel; and d The Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
The archives contain atomic coordinates, bibliographic citations, primary- and secondary-structure information, crystallographic structure factors, and NMR experimental data. Annotations in the structure entries include amino-acid or nucleotide sequences (with notes of any conflicts between the structure in the PDB and sequence databases), source organisms from which the biological materials were derived, descriptions of the experiments, secondary structures, complexes with small molecules included within the structures, and references to papers. Third-party annotations include images and movies of structures; pointers to specialized databases (maintained by others), such as the Protein Kinase Resource (http://www.kinasenet.org/pkr/Welcome.do) and ESTHER (ESTerases and α/β Hydrolase Enzymes and Relatives) (http://www.ensam.inra.fr/cholinesterase/); and pointers to databases that provide additional experimental information, such as the BioMagResBank (BMRB) NMR structural database (http://www.bmrb.wisc.edu/). Table 24.1.3.1 gives a summary of the contents of the PDB archives.
PDB entries are available on CD-ROM, which PC users can search using the PDB-SHELL browser included (Abola, 1994). UNIX users can also search the CD-ROM if they download a copy of the browser software. The entries are also available over the WWW from Brookhaven and 17 mirror sites worldwide (Table 24.1.3.2). They can be searched and retrieved via the PDB's 3DB Browser (Sussman, 1997), which is interfaced through web browsers such as Netscape Communicator and Internet Explorer. Probably the best way to get a feeling for 3DB Browser is just to try it. A simple example of its use is illustrated in Fig. 24.1.3.1 in a search for a structure related to recent papers in Nature (Kwong et al., 1998) and Science (Rizzuto et al., 1998).
3DB Browser has a number of features that make it easy to access information found in PDB entries. Users can search according to any combination of fields, such as compound name, experiment title, authors (depositors), biological source, journal references, date of deposition and nature of small molecules (ligands and heterogens) complexed with the structure. Boolean operators allow highly complex search strings. Entries selected can be retrieved automatically, and the molecular structures can be displayed using the public-domain molecular viewer RasMol (Sayle & Milner-White, 1995), MDL's Chemscape Chime plug-in, or a similar viewer. The entries also include HyperText links to the SwissProt protein-sequence database (Bairoch & Boeckmann, 1994), the BioMagResBank (BMRB) NMR structural database (Seavey et al., 1991), the Enzyme Commission Database (Bairoch, 1994), PubMed access to the Medline database, and several other databases (see Table 24.1.3.3 for a list of linked external data sources).
The main source of information for the 3DB Browser is the data from the PDB. These data are highly structured, and most crystallographers think of a datum from a PDB entry as belonging to a particular `record' or `field'. It therefore makes sense to use these fields to constrain the search: searching for `rich' as a keyword has a different meaning from searching for the author Rich.
The simplest operation with the browser is to enter one or more words in the `Text query' field and press the `Search' button. The browser engine will come back with those entries from the database that contain or are related to the words provided.
The symbol `*' can be used as a wild card to denote a sequence of any number (including 0) of arbitrary characters. Just add an asterisk, `*', at the beginning or end of a word (or both) to `extend' the search. For example, enter `*tox*' in the keyword field to retrieve those entries containing keywords like neurotoxic and toxin. Wild cards have no meaning in number-only fields, like Resolution and Date.
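The wild-card behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the 3DB Browser implementation: the function name match_keywords and the sample keyword list are invented here, and the standard-library fnmatch module is used as a stand-in (it also gives special meaning to `?' and `[...]', which the text does not mention).

```python
from fnmatch import fnmatchcase


def match_keywords(pattern, keywords):
    """Return the keywords that match a pattern in which '*' stands for
    any run of characters, including none. Matching ignores case."""
    lowered = pattern.lower()
    return [kw for kw in keywords if fnmatchcase(kw.lower(), lowered)]


# Hypothetical keyword list, used only for illustration.
keywords = ["neurotoxic", "toxin", "kinase", "hydrolase"]

print(match_keywords("*tox*", keywords))  # ['neurotoxic', 'toxin']
print(match_keywords("tox*", keywords))   # ['toxin']
```

With the asterisk on both sides, the pattern matches `tox' anywhere in the word; with a trailing asterisk only, the word must begin with `tox'.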
The Boolean operator AND is the default for 3DB Browser and is mandatory (it cannot be changed) between fields (see Table 24.1.3.4). If `ATP' is entered in the Associated group field and `kinase' in the Keyword field, only those entries matching both constraints are returned. Inside a given field, Boolean logical operators may be applied at will to the words entered. The available Boolean logical operators are AND, OR and NOT. The case is unimportant. The operator AND can be represented by `+' and the operator NOT can be represented by `−'.
For example, `zinc and (torpedo or snake)' in the Text query field will return those entries that contain either the word torpedo or the word snake, but only if the word zinc is also present. In addition, many specific records can be searched for regular expressions or numerical limits, as shown in Table 24.1.3.4 [see Protein Data Bank Quarterly Newsletter (1998), 83, pp. 3–5, The `Intelligent' Search Engine Behind the 3DB Browser™, and Protein Data Bank Quarterly Newsletter (1998), 84, pp. 3–4, 3DB Browser™: Tips, Questions and Answers, at http://www.rcsb.org/pdb/general_information/news_publications/newsletters/newsletter.html].
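To make the semantics of such a query concrete, the following Python sketch evaluates a Boolean expression against the words of a piece of text. It is an assumption-laden illustration rather than the search engine behind 3DB Browser: the function names and the sample entry descriptions are invented, only the keyword operators AND, OR and NOT and parentheses are handled (the `+' and `−' shorthands are omitted), and adjacent words with no operator between them are combined with AND, following the stated default.

```python
import re


def tokenize(query):
    """Split a query into parentheses and whitespace-separated words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def evaluate(query, text):
    """Return True if the text satisfies the Boolean query."""
    words = set(re.findall(r"\w+", text.lower()))
    tokens = tokenize(query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "or":
            advance()
            rhs = parse_and()  # always parse the right-hand side
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() not in (None, "or", ")"):
            if peek() == "and":  # explicit AND; otherwise AND is implied
                advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "not":
            advance()
            return not parse_not()
        if peek() == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        tok = peek()
        if tok is None or tok in ("and", "or", ")"):
            raise ValueError("expected a search word")
        return advance() in words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


# Made-up entry descriptions, not real PDB records.
entries = {
    "entry A": "Zinc-dependent enzyme from Torpedo californica",
    "entry B": "Snake venom toxin",
    "entry C": "Zinc finger protein from mouse",
}
query = "zinc and (torpedo or snake)"
hits = [name for name, text in entries.items() if evaluate(query, text)]
print(hits)  # ['entry A']
```

Only the first description is returned: the second lacks zinc, and the third contains zinc but neither torpedo nor snake.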
One of the main concerns for us, as database-interface developers, is the `false negative', that is, the failure to return data after a query even when the data are available in the database. Frequently, this happens because the user was unable to express the query in a way compatible with the search engine or used words or keywords unknown to the search engine.
3DB Browser deals with this problem by incorporating several automatic and semi-automatic mechanisms to help the user retrieve the requested data. The request from the user gets filtered and transformed by one or more engines. At the end, the resulting query is the one used for the search (see Table 24.1.3.5).
A search in 3DB Browser brings up a rich Atlas page, summarizing additional information about the entry of interest. The links in this Atlas page carry one to the original sources of information. The number of external sources that 3DB Browser searches and dynamically incorporates into the Atlas pages increases daily (Table 24.1.3.3).
The PDB's WWW server is the major tool used to access the three-dimensional macromolecular structural information archived at the PDB. Thousands of times a day, scientists, students and other users around the world visit the PDB to browse through and access these data. In order to meet the need for rapid access worldwide, a global network of 17 official mirror sites has been established. To help orient the user, 3DB Browser incorporates CloserSite (see http://pdb.weizmann.ac.il/pdb-docs/closerSite.html ), an automatic script that detects one's location and offers closer alternative sites (in the network sense).
The information on the PDB's web server changes frequently. New information is generated on a daily basis. Synchronizing the PDB and its mirror sites to provide exactly the same services while requiring minimum human involvement is a necessary but nontrivial task.
A protocol for the automatic mirroring of the web sites was developed at BNL based on ftp mirroring technology. This protocol has been used successfully by PDB and its mirror sites for approximately two years.
Fig. 24.1.3.2 outlines the web mirroring protocol, which consists of the following five steps.
Special steps are taken to isolate files, thus obviating problems associated with the existence of files and directories not related to PDB web activities. HTML documents are stored under the directory /pdb-docs/, and executables are stored under the directory /pdb-bin/. In addition, index files and local configuration files are stored in the directory /PDB-support/.
Specific areas on the http server are dedicated to PDB web activities. All the HTML pages and CGI scripts are in the /pdb-docs/ and /pdb-bin/ directories, respectively. There are also index files and local configuration files in /PDB-support/. This avoids confusing PDB applications with other applications on the same server, which would complicate the mirror procedure.
Relative links are used in all the static HTML pages and in the HTML pages generated by the scripts. For example, to create a hyperlink to 3DB Browser in the file named index.html, <a href="/pdb-bin/pdbmain">3DB Browser</a> is used instead of <a href="http://www.pdb.bnl.gov/pdb-bin/pdbmain">3DB Browser</a>. The advantage of relative links is that pages copied to the mirror sites' machines will point to local resources without having to be edited locally. This is one of the key points in automating the web mirror procedure. To make relative links work properly, the mirror sites maintain a local configuration file. The configuration file reflects the local directory tree and available resources. The PDB provides a generic template, and mirror sites modify it according to their setup. This configuration file is excluded from the automatic mirroring procedure to avoid being overwritten by the original template file. Changes to the configuration files are sent to mirrors by e-mail one week in advance, to be included manually.
To avoid duplication and allow easy maintenance of the resources, PDB's web and ftp servers share some files. All mirror sites support both web and ftp servers. When a hyperlink points to a file on the ftp server, a server side include (SSI) script is used to access the local ftp server of each mirror site. Its function is to use configuration variables to generate a path to the local file dynamically.
HTML pages and CGI scripts are put into a read-only account available to official mirror sites. Mirror sites use the ftp mirror tool mirror.pl (ftp://sunsite.org.uk/packages/mirror/) to mirror the updated information from this account. For security reasons, this account is not an anonymous ftp account, but requires a password for access. In addition, this account can only be accessed by ftp. The process can be run as a cron job to automate the update procedure fully. Although the procedure is automatic, an e-mail message is sent to mirror sites for update verification. For details on the PDB mirror system, see Protein Data Bank Quarterly Newsletter (1999), 87, pp. 3–5, PDB World Wide Web Mirroring System, at http://www.rcsb.org/pdb/general_information/news_publications/newsletters/newsletter.html.
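The effect of the relative-link step described above can be checked with Python's standard urljoin function, which resolves a link against the address of the page that contains it, much as a web browser does. This is a sketch under assumptions: the helper name resolve is invented, and the two page addresses are hypothetical, built from host names and directory names that appear in this section.

```python
from urllib.parse import urljoin


def resolve(page_url, link):
    """Resolve a hyperlink against the URL of the page containing it."""
    return urljoin(page_url, link)


# The same unedited link, as copied to every mirror.
link = "/pdb-bin/pdbmain"

# Assumed locations of index.html on the primary site and on one mirror.
pages = [
    "http://www.pdb.bnl.gov/pdb-docs/index.html",
    "http://pdb.weizmann.ac.il/pdb-docs/index.html",
]

for page in pages:
    print(resolve(page, link))
# http://www.pdb.bnl.gov/pdb-bin/pdbmain
# http://pdb.weizmann.ac.il/pdb-bin/pdbmain
```

Because the link carries no host name, each copy of the page points to the script on the server that delivered it, so the mirrored file needs no local editing.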
Web access to the archives has become the primary mode of retrieving entries from the PDB. However, the PDB continues to receive a considerable number of orders for our CD-ROM product. The PDB anticipates that this will continue to be so for a variety of reasons. For example, network performance still remains poor in a number of locations, and these disks, released quarterly, provide local access to the contents of the archive. PDB files may first be copied from the CD-ROM to a local disk, and then incremental updates can easily be made using mirroring software.
References
Abola, E. E. (1994). PDB-SHELL. Available at ftp://pdb.bmc.uu.se/pub/databases/pdb/pdb_software/pdbshell/.
Bairoch, A. (1994). The ENZYME data bank. Nucleic Acids Res. 22, 3626–3627.
Bairoch, A. & Boeckmann, B. (1994). The SWISS-PROT protein sequence data bank: current status. Nucleic Acids Res. 22, 3578–3580.
Kwong, P. D., Wyatt, R., Robinson, J., Sweet, R. W., Sodroski, J. & Hendrickson, W. A. (1998). Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature (London), 393, 648–659.
Rizzuto, C. D., Wyatt, R., Hernandez-Ramos, N., Sun, Y., Kwong, P. D., Hendrickson, W. A. & Sodroski, J. (1998). A conserved HIV gp120 glycoprotein structure involved in chemokine receptor binding. Science, 280, 1949–1953.
Sayle, R. A. & Milner-White, E. J. (1995). RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20, 374–376.
Seavey, B. R., Farr, E. A., Westler, W. M. & Markley, J. L. (1991). A relational database for sequence-specific protein NMR data. J. Biomol. Nucl. Magn. Reson. 1, 217–236.
Sussman, J. L. (1997). Bridging the gap. Nature Struct. Biol. 4, 517.