Strategies in designing a CIF-aware application

Bernstein, H. J.

doi:10.1107/97809553602060000751

International Tables for Crystallography
Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 5.1, pp. 483-486

H. J. Bernstein^a ^*

^a Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA
Correspondence e-mail: yaya@bernstein-plus-sons.com

5.1.3. Strategies in designing a CIF-aware application

There are three main strategies to consider when designing a CIF-aware application: using external filters, using existing CIF-aware libraries, or writing CIF-aware code from scratch.

5.1.3.1. Working with filter utilities

One solution to making an existing application aware of a new data format is to leave the application unchanged and change the data instead. For almost all crystallographic formats other than CIF, the Swiss-army knife of conversion utilities is Babel (Walters & Stahl, 1994). Babel includes conversions to and from PDB format. Therefore, by the use of cif2pdb (Bernstein & Bernstein, 1996) and pdb2cif (Bernstein et al., 1998) combined with Babel, many macromolecular applications can be made CIF-aware without changing their code (see Figs. 5.1.3.1 and 5.1.3.2). If the need is to extract mmCIF data from the output of a major application, the PDB provides PDB_EXTRACT (http://sw-tools.pdb.org/apps/PDB_EXTRACT/).

Figure 5.1.3.1

Example of using filters to make a PDB-aware application CIF-aware.

Figure 5.1.3.2

Example of using filters to make a general application CIF-aware.

Creating a filter program to go from almost any small-molecule format to core CIF is easy. In many cases one need only insert the appropriate `loop_' headers. Creating a filter to go from CIF to a particular small-molecule format can be more challenging, because a CIF may have its data in any order. This can be resolved by use of QUASAR (Hall & Sievers, 1993) or cif2cif (Bernstein, 1997), which accept request lists specifying the order in which data are to be presented (see Fig. 5.1.3.3).
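As a minimal sketch of the loop_ insertion described above, the following Python fragment wraps a whitespace-separated coordinate listing in a data block with a loop_ header. The input layout, the block name and the function name rows_to_cif_loop are assumptions made for illustration. The sketch also assumes that no value contains white space, so no quoting is needed.

```python
def rows_to_cif_loop(block_name, tags, rows):
    """Return CIF text holding `rows` as one loop_ under `tags`."""
    lines = [f"data_{block_name}", "loop_"]
    lines.extend(tags)
    for row in rows:
        if len(row) != len(tags):
            raise ValueError("each row needs exactly one value per tag")
        # Assumption: values contain no white space, so no quoting is needed.
        lines.append(" ".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical small-molecule listing: label, x, y, z per line.
listing = """\
C1 0.1234 0.5678 0.9012
O1 0.2500 0.7500 0.0000
"""

tags = [
    "_atom_site_label",
    "_atom_site_fract_x",
    "_atom_site_fract_y",
    "_atom_site_fract_z",
]
rows = [line.split() for line in listing.splitlines() if line.strip()]
cif_text = rows_to_cif_loop("example", tags, rows)
print(cif_text)
```

A real filter would also need to quote values that contain white space and to supply any items that the source format leaves implicit.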

Figure 5.1.3.3

Using QUASAR or cif2cif to reorder CIF data for an order-dependent application or filter.
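The idea of a request list can be illustrated with a short Python sketch. It is not the QUASAR or cif2cif implementation. The function name reorder_items is hypothetical, and emitting `?` (the CIF placeholder for an unknown value) for a requested item that is absent is a choice made here for illustration only.

```python
def reorder_items(items, request_list, missing="?"):
    """Return (tag, value) pairs in the order given by the request list."""
    return [(tag, items.get(tag, missing)) for tag in request_list]


# Items as they happened to arrive from some CIF (arbitrary order).
items = {
    "_cell_length_c": "7.25",
    "_cell_length_a": "5.43",
    "_cell_length_b": "6.10",
}

# The order-dependent application wants them in this fixed order.
request_list = [
    "_cell_length_a",
    "_cell_length_b",
    "_cell_length_c",
    "_cell_angle_beta",
]

ordered = reorder_items(items, request_list)
for tag, value in ordered:
    print(f"{tag:<20} {value}")
```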

There is a significant and growing number of filter programs available. Several of them, namely QUASAR, cif2cif, ciftex (ftp://ftp.iucr.org/pub/ciftex.tar.Z), which converts from CIF to TeX, and ZINC (Stampf, 1994), which unrolls CIFs for use by Unix utilities, are discussed in Chapter 5.3. In addition there are CIF2SX by Louis J. Farrugia (http://www.chem.gla.ac.uk/~louis/software/utils/), to convert from CIF to SHELXL format, and DIFRAC (Flack et al., 1992), to translate many diffractometer output formats to CIF. The program cif2xml (Bernstein & Bernstein, 2002) translates from CIF to XML and CML. The PDB provides CIFTr by Zukang Feng and John Westbrook (http://sw-tools.pdb.org/apps/CIFTr/) to translate from the extended mmCIF format described in Appendix 3.6.2 to PDB format, and MAXIT (http://sw-tools.pdb.org/apps/MAXIT/), a more general package that includes conversion capabilities. See also Chapter 5.5 for an extended discussion of the handling of mmCIF in the PDB software environment.

5.1.3.2. Using existing CIF libraries and APIs

Another approach to making an existing application CIF-aware, or to designing a new CIF-aware application, is to make use of one (or more) of the existing CIF libraries and application programming interfaces (APIs). Because the data involved need not be reprocessed, code that uses a library directly is often faster than equivalent code working with filter programs. The code within an application can be tuned to the internal data structures and coding conventions of the application.

The approach to internal design depends on the language, data structures and operating environment of the application. A few years ago, the precise details of language version and operating system would have been major stumbling blocks to conversion. Today, however, almost every platform supports a variation of the Unix application programming interface and many languages have viable interfaces to C and/or C++. Therefore it is often feasible to consider use of C, C++ or Objective-C libraries, even for Fortran applications.

Star_Base (Spadaccini & Hall, 1994; Chapter 5.2) is a program for extracting data from STAR Files. It is written in ANSI C and includes the code needed to parse a STAR File. OOSTAR (Chang & Bourne, 1998; Chapter 5.2) is an Objective-C package that includes another parser for STAR Files (http://www.sdsc.edu/pb/cif/OOSTAR.html). CIFLIB (Westbrook et al., 1997) provides a CIF-specific API. CIFPARSE (Tosic & Westbrook, 1998) is another C-based library for CIF. CBFlib (Chapter 5.6) is an ANSI C API for both CIF and CBF/imgCIF files. The CifSieve package (Hester & Okamura, 1998) provides specialized code generation for retrieval of particular data items in either C or Fortran (see Chapter 5.3 for more details). The package cciflib (Keller, 1996) (http://www.ccp4.ac.uk/dist/html/mmcifformat.html) is used by the CCP4 program suite to support mmCIF in both C and Fortran applications. If an application in Fortran is to be converted with a purely Fortran-based library, the package CIFtbx (Hall, 1993; Hall & Bernstein, 1996) is a solution. See Chapter 5.4 for more details.

The common interface provided in C-based applications is for the library to buffer the entire CIF file into an internal data structure (usually a tree), essentially creating a memory-resident database (see Fig. 5.1.3.4). This preload greatly reduces any demands on the application to deal with the order-independence of CIF, at the expense of what can be a very high demand for memory. The problem of excessive memory demand is dealt with in CBFlib by keeping large text fields on disk, with only pointers to them in memory. In some libraries, validation of tags against dictionaries is handled by the API. In others it is the responsibility of the application programmer. While the former approach helps to catch errors early, the latter, `lightweight' approach is more popular when fast performance is required.

Figure 5.1.3.4

Typical dataflow of a C-based CIF API.
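The preloading strategy can be illustrated with a small Python sketch, which stands in for the C libraries named above and does not reproduce any of their APIs. The names lex and load_cif are hypothetical. The tokenizer is deliberately simplified: it handles comments, bare words and quoted strings on one line, and ignores semicolon-delimited text fields. Once the file is held in memory, the application looks items up by tag, so the order in which they were written no longer matters.

```python
import re

# Simplified tokens: quoted strings on one line, comments, bare words.
TOKEN = re.compile(r"""'[^'\n]*'|"[^"\n]*"|#[^\n]*|\S+""")


def lex(text):
    """Yield (kind, value) pairs; comments are dropped."""
    for match in TOKEN.finditer(text):
        tok = match.group()
        if tok[0] == "#":
            continue
        if len(tok) >= 2 and tok[0] in "'\"" and tok[-1] == tok[0]:
            yield ("value", tok[1:-1])
        elif tok.lower().startswith("data_"):
            yield ("data", tok[5:])
        elif tok.lower() == "loop_":
            yield ("loop", tok)
        elif tok[0] == "_":
            yield ("tag", tok.lower())  # tags are case-insensitive
        else:
            yield ("value", tok)


def load_cif(text):
    """Preload a CIF into {block: {tag: value or list of values}}."""
    toks = list(lex(text))
    blocks, current, i = {}, None, 0
    while i < len(toks):
        kind, val = toks[i]
        if kind == "data":
            current = blocks.setdefault(val, {})
            i += 1
            continue
        if current is None:
            raise ValueError("data item found before any data_ block")
        if kind == "loop":
            i += 1
            tags, values = [], []
            while i < len(toks) and toks[i][0] == "tag":
                tags.append(toks[i][1])
                i += 1
            while i < len(toks) and toks[i][0] == "value":
                values.append(toks[i][1])
                i += 1
            for k, tag in enumerate(tags):
                current[tag] = values[k::len(tags)]  # one column per tag
        elif kind == "tag":
            if i + 1 >= len(toks) or toks[i + 1][0] != "value":
                raise ValueError(f"tag {val} has no value")
            current[val] = toks[i + 1][1]
            i += 2
        else:
            raise ValueError(f"value {val!r} has no tag")
    return blocks


cif_a = """data_demo
_cell_length_a 5.43
_symmetry_space_group_name_H-M 'P 21/c'
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1
O1 0.2
"""
db = load_cif(cif_a)
print(db["demo"]["_cell_length_a"], db["demo"]["_atom_site_label"])
```

The cost noted in the text is visible here: every token of the file is held in memory before the application asks for anything.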

The most commonly used versions of Fortran do not include dynamic memory management. In order to preload an arbitrary CIF, one needs to use one of the C-based libraries. Alternatively, a pure Fortran application can transfer CIFs being read to a disk-based random access file. CIFtbx does this each time it opens a CIF. The user never works directly with the original CIF data set. This provides a clean and simple interface for reading, but slows all read access to CIFs. In Fortran, compromises are often necessary, with critical tables handled in memory rather than on disk, but this may force changes in dimensions and then recompilation when dictionaries or data sets become larger than anticipated.

5.1.3.3. Creating a CIF-aware application from scratch

The primary disadvantage of using an existing CIF library or API in building an application is that there can be a loss of performance or a demand for more resources than may be needed. The common practice followed by most libraries of building and preloading an internal data structure that holds the entire CIF may not be the optimal choice for a given application. When reading a CIF it is difficult to avoid the need for extra data structures to resolve the issue of CIF order independence. However, when writing data to a CIF, it may be sufficient simply to write the necessary tags and values from the internal data structures of an application, rather than buffering them through a special CIF data structure.

It is tempting to apply the same reasoning to the reading of CIF and create a fixed ordering in which data are to be processed, so that no intermediate data structure will be needed to buffer a CIF. Unless the application designer can be certain that externally produced CIFs will never be presented to the application, or will be filtered through a reordering filter such as QUASAR or cif2cif, working with CIFs in an order-dependent mode is a mistake.

Because of the importance of being able to accept CIFs written by any other application, which may have written its data in a totally different order than is expected, it is a good idea to make use of one of the existing libraries or APIs if possible, unless there is some pressing need to do things differently.

If a fresh design is needed, e.g. to achieve maximal performance in a time-critical application, it will be necessary to create a CIF parser to translate CIF documents into information in the internal data structures of the application. In doing this, the syntax specification of the CIF language given in Chapter 2.2 should be adhered to precisely. This result is most easily achieved if the code that does the parsing is generated as automatically as possible from the grammar of the language. Current `industrial' practice in creating parsers is based on use of commonly available tools for lexical scanning of tokens and parsing of grammars based on lex (Lesk & Schmidt, 1975) and yacc (Johnson, 1975). Two accessible descendants of these programs are flex (by V. Paxson et al.) and bison (by R. Corbett et al.). See Fig. 5.1.3.5 for an example of bison data in building a CIF parser. Both flex and bison are available from the GNU project at http://www.gnu.org.

Figure 5.1.3.5

Example of bison data defining a CIF parser (taken from CBFlib).

Neither flex nor bison is used directly by the final application. Each may be used to create code that becomes part of the application. For example, both are used by CifSieve to generate the code it produces. There is an important division of labour between flex and bison; flex is used to produce a lexical scanner, i.e. code that converts a string of characters into a sequence of `tokens'. In CIF, the important tokens are such things as tags and values and reserved words such as loop_. Once tokens have been identified, responsibility passes to the code generated by bison to interpret them. In practice, because of the complexities of context-sensitive management of white space to separate tokens and the small number of distinct token types, flex is not always used to generate the lexical scanner for a CIF parser. Instead, a hand-coded lexer might be used.
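A hand-coded lexer of the kind mentioned above might look like the following Python sketch. It is an illustration only and is not taken from CBFlib or any other package. The function name cif_tokens is hypothetical. It shows the white-space-sensitive cases: a quote closes a string only when followed by white space, a semicolon opens a text field only at the start of a line, and a `#` starts a comment only at the start of a token. Only data_ and loop_ are recognized as reserved words here. Other reserved words are treated as ordinary values, which a complete parser would not do.

```python
def cif_tokens(text):
    """Yield (kind, value) tokens from CIF text (simplified sketch)."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        at_line_start = i == 0 or text[i - 1] == "\n"
        if ch.isspace():
            i += 1
        elif ch == "#":
            # A comment runs to the end of the line.
            while i < n and text[i] != "\n":
                i += 1
        elif ch == ";" and at_line_start:
            # A text field ends at the next line that starts with ';'.
            end = text.find("\n;", i)
            if end == -1:
                raise ValueError("unterminated text field")
            yield ("value", text[i + 1:end])
            i = end + 2
        elif ch in "'\"":
            # The closing quote counts only if white space (or end) follows.
            j = i + 1
            while j < n and text[j] != "\n":
                if text[j] == ch and (j + 1 == n or text[j + 1].isspace()):
                    break
                j += 1
            else:
                raise ValueError("unterminated quoted string")
            yield ("value", text[i + 1:j])
            i = j + 1
        else:
            # A bare word runs to the next white space.
            j = i
            while j < n and not text[j].isspace():
                j += 1
            word = text[i:j]
            low = word.lower()
            if low == "loop_":
                kind = "loop"
            elif low.startswith("data_"):
                kind = "data"
            elif word[0] == "_":
                kind = "tag"
            else:
                kind = "value"
            yield (kind, word)
            i = j


sample = (
    "data_x\n"
    "_name 'O'Neil group' # note\n"
    "_title\n"
    ";Two lines\n"
    "of title\n"
    ";\n"
    "loop_ _a _b 1 2\n"
)
for token in cif_tokens(sample):
    print(token)
```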

The parser generated by bison uses a token-based grammar and actions to be performed as tokens are recognized. There are two major alternatives to consider in the design: event-driven interaction with the application or building of a complete data structure to hold a representation of the CIF before interaction with the application. The advantage of the event-driven approach is that a full extra data structure does not have to be populated in order to access a few data items. The advantage of building a complete representation of the CIF is that the application does not have to be prepared for tags to appear in an arbitrary order.
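The event-driven alternative can be sketched in Python as follows. This is an illustration under simplifying assumptions, not bison-generated code: tokens are obtained by splitting on white space, so quoted strings and text fields are not supported, and the names parse_events and keep_wanted are hypothetical. The parser calls back into the application for each item as it is recognized, and the application keeps only what it wants without building a representation of the whole CIF.

```python
def is_structural(word):
    """True for tags and for the reserved words data_... and loop_."""
    low = word.lower()
    return word.startswith("_") or low.startswith("data_") or low == "loop_"


def parse_events(text, on_item):
    """Call on_item(block, tag, value) as each data item is recognized."""
    words = text.split()  # assumption: no quoted values or text fields
    block, i = None, 0
    while i < len(words):
        word = words[i]
        low = word.lower()
        if low.startswith("data_"):
            block = word[5:]
            i += 1
        elif low == "loop_":
            i += 1
            tags = []
            while i < len(words) and words[i].startswith("_"):
                tags.append(words[i].lower())
                i += 1
            if not tags:
                raise ValueError("loop_ without tags")
            k = 0
            while i < len(words) and not is_structural(words[i]):
                # Values cycle through the loop's tags in order.
                on_item(block, tags[k % len(tags)], words[i])
                k += 1
                i += 1
        elif word.startswith("_"):
            if i + 1 >= len(words) or is_structural(words[i + 1]):
                raise ValueError(f"tag {word} has no value")
            on_item(block, low, words[i + 1])
            i += 2
        else:
            raise ValueError(f"unexpected value {word!r}")


# The application is interested in two items only.
wanted = {"_cell_length_a", "_cell_length_b"}
found = {}


def keep_wanted(block, tag, value):
    if tag in wanted:
        found[tag] = value


cif = """data_demo
_cell_length_b 6.10
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1
O1 0.2
_cell_length_a 5.43
"""
parse_events(cif, keep_wanted)
print(found)
```

Because the callback is keyed on the tag rather than on position, the application still tolerates items arriving in any order. Anything that depends on several items at once has to wait until all of them have been seen.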

References

Bernstein, F. C. & Bernstein, H. J. (1996). Translating mmCIF data into PDB entries. Acta Cryst. A52 (Suppl.), C-576.

Bernstein, H. J. (1997). cif2cif – CIF copy program. Bernstein + Sons, Bellport, NY, USA. Included in http://www.bernstein-plus-sons.com/software/ciftbx.

Bernstein, H. J. & Bernstein, F. C. (2002). YAXDF and the interaction between CIF and XML. Acta Cryst. A58 (Suppl.), C257.

Bernstein, H. J., Bernstein, F. C. & Bourne, P. E. (1998). CIF applications. VIII. pdb2cif: translating PDB entries into mmCIF format. J. Appl. Cryst. 31, 282–295. Software available from http://www.bernstein-plus-sons.com/software/pdb2cif.

Chang, W. & Bourne, P. E. (1998). CIF applications. IX. A new approach for representing and manipulating STAR files. J. Appl. Cryst. 31, 505–509.

Flack, H. D., Blanc, E. & Schwarzenbach, D. (1992). DIFRAC, single-crystal diffractometer output-conversion software. J. Appl. Cryst. 25, 455–459.

Hall, S. R. (1993). CIF applications. IV. CIFtbx: a tool box for manipulating CIFs. J. Appl. Cryst. 26, 482–494.

Hall, S. R. & Bernstein, H. J. (1996). CIF applications. V. CIFtbx2: extended tool box for manipulating CIFs. J. Appl. Cryst. 29, 598–603.

Hall, S. R. & Sievers, R. (1993). CIF applications. I. QUASAR: for extracting data from a CIF. J. Appl. Cryst. 26, 469–473.

Hester, J. R. & Okamura, F. P. (1998). CIF applications. X. Automatic construction of CIF input functions: CifSieve. J. Appl. Cryst. 31, 965–968.

Johnson, S. C. (1975). YACC: Yet Another Compiler-Compiler. Bell Laboratories Computing Science Technical Report No. 32. Bell Laboratories, Murray Hill, New Jersey, USA. (Also in UNIX Programmer's Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984.)

Keller, P. A. (1996). A mmCIF toolbox for CCP4 applications. Acta Cryst. A52 (Suppl.), C-576.

Lesk, M. E. & Schmidt, E. (1975). Lex – a lexical analyzer generator. Bell Laboratories Computing Science Technical Report No. 39. Bell Laboratories, Murray Hill, New Jersey, USA. (Also in UNIX Programmer's Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984.)

Spadaccini, N. & Hall, S. R. (1994). Star_Base: accessing STAR File data. J. Chem. Inf. Comput. Sci. 34, 509–516.

Stampf, D. R. (1994). ZINC – galvanizing CIF to work with UNIX. Manual. Protein Data Bank, Brookhaven National Laboratory, USA.

Tosic, O. & Westbrook, J. D. (1998). CIFPARSE: A library of access tools for mmCIF. Reference guide. Version 3.1. Nucleic Acid Database Project, Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, USA. http://sw-tools.pdb.org/apps/CIFPARSE/cifparse/cifparse.html.

Walters, P. & Stahl, M. (1994). BABEL reference manual. Version 1.06. Dolata Research Group, Department of Chemistry, University of Arizona, USA.

Westbrook, J. D., Hsieh, S.-H. & Fitzgerald, P. M. D. (1997). CIF applications. VI. CIFLIB: an application program interface to CIF dictionaries and data files. J. Appl. Cryst. 30, 79–83.
