Creating a CIF-aware application from scratch

Bernstein, H. J.

doi:10.1107/97809553602060000751

International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

pdf | chapter contents | chapter index | related articles

International Tables for Crystallography (2006). Vol. G. ch. 5.1, pp. 484-486

Section 5.1.3.3. Creating a CIF-aware application from scratch

H. J. Bernstein^a ^*

^a Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA
Correspondence e-mail: yaya@bernstein-plus-sons.com

5.1.3.3. Creating a CIF-aware application from scratch

| top | pdf |

The primary disadvantage of using an existing CIF library or API in building an application is that there can be a loss of performance or a demand for more resources than may be needed. The common practice followed by most libraries of building and preloading an internal data structure that holds the entire CIF may not be the optimal choice for a given application. When reading a CIF it is difficult to avoid the need for extra data structures to resolve the issue of CIF order independence. However, when writing data to a CIF, it may be sufficient simply to write the necessary tags and values from the internal data structures of an application, rather than buffering them through a special CIF data structure.

It is tempting to apply the same reasoning to the reading of CIF and create a fixed ordering in which data are to be processed, so that no intermediate data structure will be needed to buffer a CIF. Unless the application designer can be certain that externally produced CIFs will never be presented to the application, or will be filtered through a reordering filter such as QUASAR or cif2cif, working with CIFs in an order-dependent mode is a mistake.

Because of the importance of being able to accept CIFs written by any other application, which may have written its data in a totally different order than is expected, it is a good idea to make use of one of the existing libraries or APIs if possible, unless there is some pressing need to do things differently.

If a fresh design is needed, e.g. to achieve maximal performance in a time-critical application, it will be necessary to create a CIF parser to translate CIF documents into information in the internal data structures of the application. In doing this, the syntax specification of the CIF language given in Chapter 2.2 should be adhered to precisely. This result is most easily achieved if the code that does the parsing is generated as automatically as possible from the grammar of the language. Current `industrial' practice in creating parsers is based on use of commonly available tools for lexical scanning of tokens and parsing of grammars based on lex (Lesk & Schmidt, 1975) and yacc (Johnson, 1975). Two accessible descendants of these programs are flex (by V. Paxson et al.) and bison (by R. Corbett et al.). See Fig. 5.1.3.5 for an example of bison data in building a CIF parser. Both flex and bison are available from the GNU project at http://www.gnu.org .

Figure 5.1.3.5 | top | pdf |

Example of bison data defining a CIF parser (taken from CBFlib).

Neither flex nor bison is used directly by the final application. Each may be used to create code that becomes part of the application. For example, both are used by CifSieve to generate the code it produces. There is an important division of labour between flex and bison; flex is used to produce a lexicographic scanner, i.e. code that converts a string of characters into a sequence of `tokens'. In CIF, the important tokens are such things as tags and values and reserved words such as loop_. Once tokens have been identified, responsibility passes to the code generated by bison to interpret. In practice, because of the complexities of context-sensitive management of white space to separate tokens and the small number of distinct token types, flex is not always used to generate the lexicographic scanner for a CIF parser. Instead, a hand-coded lexer might be used.

The parser generated by bison uses a token-based grammar and actions to be performed as tokens are recognized. There are two major alternatives to consider in the design: event-driven interaction with the application or building of a complete data structure to hold a representation of the CIF before interaction with the application. The advantage of the event-driven approach is that a full extra data structure does not have to be populated in order to access a few data items. The advantage of building a complete representation of the CIF is that the application does not have to be prepared for tags to appear in an arbitrary order.

References

Johnson, S. C. (1975). YACC: Yet Another Compiler-Compiler. Bell Laboratories Computing Science Technical Report No. 32. Bell Laboratories, Murray Hill, New Jersey, USA. (Also in UNIX Programmer's Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984.)Google Scholar

Lesk, M. E. & Schmidt, E. (1975). Lex – a lexical analyzer generator. Bell Laboratories Computing Science Technical Report No. 39. Bell Laboratories, Murray Hill, New Jersey, USA. (Also in UNIX Programmer's Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984.) Google Scholar

International Tables for Crystallography (2006). Vol. G. ch. 5.1, pp. 484-486