Building a structure-determination data pipeline

Westbrook, J. D.; Yang, H.; Feng, Z.; Berman, H. M.

doi:10.1107/97809553602060000755

International Tables for Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 5.5, pp. 542-543

Section 5.5.3.3. Building a structure-determination data pipeline

J. D. Westbrook,^a* H. Yang,^a Z. Feng^a and H. M. Berman^a

^a Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
Correspondence e-mail: jwest@rcsb.rutgers.edu

One goal of high-throughput structural genomics is the automatic capture of all the details of each step in the process of structure determination. Fig. 5.5.3.5 shows a simplified structure-determination data pipeline. The essential details of each pipeline step are extracted and later assembled to make a data file for PDB deposition. The RCSB PDB data-processing infrastructure has been developed in anticipation of a data pipeline in which automated deposition would be the terminal step. The dictionary technology and software tools developed by the RCSB PDB to process and manage mmCIF data can be reused to provide the data-handling operations required to build the pipeline.

Figure 5.5.3.5. Schematic diagram of a structure-determination data pipeline.

Dictionary definitions have been carefully developed to describe the details of each step in the structure-determination pipeline. These data items are typically accessible in electronic form after each program step: the information is either exported directly in mmCIF format or printed in a program output file. To deal with the latter case, a utility program, PDB_EXTRACT (http://sw-tools.pdb.org/apps/PDB_EXTRACT), has been developed to parse program output files and extract key data values. In either case, the results of this incremental extraction of data from each program step must be merged to build a complete mmCIF data file ready for deposition. The PDB_EXTRACT program also carries out this merging operation.
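
The merging step described above can be illustrated with a minimal Python sketch. It is not the PDB_EXTRACT implementation: the function names (parse_items, merge_steps, write_mmcif) and the sample values are hypothetical, and the parser assumes only simple one-line "name value" data items, with no loop_ constructs or multi-line text fields. The data item names are taken from the mmCIF dictionary.

```python
def parse_items(text):
    """Parse simple one-line mmCIF data items into a dict (assumed subset of the syntax)."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip data_ block headers, comments and blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a name without a value on the same line is ignored here
        name, value = parts
        items[name] = value.strip().strip("'\"")
    return items


def merge_steps(step_outputs):
    """Merge items from each pipeline step in order and record conflicting values."""
    merged, conflicts = {}, []
    for step_name, text in step_outputs:
        for name, value in parse_items(text).items():
            if name in merged and merged[name] != value:
                conflicts.append((name, merged[name], value, step_name))
            merged[name] = value  # assumption: the later step takes precedence
    return merged, conflicts


def write_mmcif(block_name, items):
    """Write the merged items as a single data block, quoting values that contain spaces."""
    lines = ["data_" + block_name]
    for name, value in items.items():
        if " " in value:
            value = "'" + value + "'"
        lines.append(name + " " + value)
    return "\n".join(lines) + "\n"


# Hypothetical output captured after two program steps.
scaling = """data_scaling
_cell.length_a 52.10
_reflns.d_resolution_high 1.80
"""
refinement = """data_refinement
_cell.length_a 52.10
_symmetry.space_group_name_H-M 'P 21 21 21'
_refine.ls_R_factor_R_work 0.195
"""

merged, conflicts = merge_steps([("scaling", scaling), ("refinement", refinement)])
deposition_text = write_mmcif("deposit", merged)
```

Recording conflicts rather than silently overwriting is a design choice of this sketch: when two steps report different values for the same item, the discrepancy is kept for review before deposition.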

Some steps in the structure-determination pipeline may not be driven by software. For instance, the details of protein production may be held in laboratory databases or within laboratory notebooks. A version of ADIT with a data view including all of the structural genomics data extensions has been created for entering these data. This ADIT tool can also be used to validate and check the completeness of the final data file.
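
A completeness check of the kind mentioned above might look like the following Python sketch. It does not reproduce the checks performed by ADIT: the REQUIRED_ITEMS list and the missing_items function are hypothetical, chosen only to show the idea of comparing a merged data file against a list of expected data items. It relies on the mmCIF convention that a value of "?" denotes an unknown value.

```python
# Hypothetical list of data items a laboratory might require before deposition.
REQUIRED_ITEMS = [
    "_cell.length_a",
    "_refine.ls_d_res_high",
    "_exptl_crystal_grow.method",
    "_entity_src_gen.pdbx_gene_src_scientific_name",
]


def missing_items(items, required=REQUIRED_ITEMS):
    """Return the required item names that are absent, empty or marked unknown ('?')."""
    return [name for name in required if items.get(name, "?") in ("?", "")]


# Items as they might stand after merging the software-driven steps;
# the protein-production details have not yet been entered.
final_items = {
    "_cell.length_a": "52.10",
    "_refine.ls_d_res_high": "1.80",
    "_entity_src_gen.pdbx_gene_src_scientific_name": "?",
}

to_enter = missing_items(final_items)
```

Here to_enter lists the items that would still have to be supplied by hand, for example from a laboratory notebook or database.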
