Implementation issues

Spadaccini, N.; Hall, S. R.; McMahon, B.

doi:10.1107/97809553602060000752

International Tables for Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon


International Tables for Crystallography (2006). Vol. G. ch. 5.2, p. 494

Section 5.2.3.5. Implementation issues

N. Spadaccini,^a ^* S. R. Hall^b and B. McMahon^c

^a School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia, ^b School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia, and ^c International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
Correspondence e-mail: nick@csse.uwa.edu.au


Star_Base is implemented in the C programming language. It uses the flex lexer generator and the GNU bison parser generator to produce a lexer and parser for the STAR File, and a separate lexer and parser for the Star_Base query language.

The STAR File parser builds an in-memory representation of the file contents, much as most programming-language compilers do. In this it differs from similar applications that are based on a single pass over a stream (like SAX for XML applications).

While a system like CIFtbx retains a block copy of the STAR File in memory, the initial Star_Base processing removes all comments and formatting, and stores the meaningful tokens in a binary tree representation. A C structure is defined for each kind of STAR File container (global block, data block, save frame, loop or data item), and for each of these a further structure is defined that holds a sequence of such containers. The nodes of the tree are populated with these structures. Each leaf of the tree is a data item consisting of the data name and its associated value. A binary tree of the global-block sequences is built in reverse order (that is, in an order reverse to that in which they appear in the file), making it simple to identify the global values in scope for a specific data block. It will be recalled that the STAR File semantics require a backward scan through the file to pick up the global blocks in scope.
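The reverse-order arrangement of global blocks can be illustrated with a minimal sketch. The type and function names below (`DataItem`, `GlobalBlock`, `DataBlock`, `global_push`, `resolve`) are hypothetical and are not taken from the Star_Base source; a singly linked chain stands in for the binary tree described above; and the lookup rule (a data block's own items first, then the nearest preceding global block, then earlier ones) is an assumption about the scoping semantics made for the purpose of the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container types; the real Star_Base structures are not shown in the text. */
typedef struct DataItem {
    const char *name;            /* data name, e.g. "_cell_length_a" */
    const char *value;           /* associated value, kept as text */
    struct DataItem *next;
} DataItem;

typedef struct GlobalBlock {
    DataItem *items;
    struct GlobalBlock *earlier; /* the global block that precedes this one in the file */
} GlobalBlock;

typedef struct DataBlock {
    const char *name;
    DataItem *items;
    GlobalBlock *globals_in_scope; /* head of the chain when this block was parsed */
} DataBlock;

static void *xmalloc(size_t n) {
    void *p = malloc(n);
    if (p == NULL) abort();      /* keep the sketch short: give up on out-of-memory */
    return p;
}

/* Prepend a data item to a list and return the new head. */
DataItem *item_push(DataItem *head, const char *name, const char *value) {
    DataItem *it = xmalloc(sizeof *it);
    it->name = name;
    it->value = value;
    it->next = head;
    return it;
}

/* Each new global block becomes the head, so the chain runs in reverse file order. */
GlobalBlock *global_push(GlobalBlock *head, DataItem *items) {
    GlobalBlock *g = xmalloc(sizeof *g);
    g->items = items;
    g->earlier = head;
    return g;
}

static const char *find_item(const DataItem *it, const char *name) {
    for (; it != NULL; it = it->next)
        if (strcmp(it->name, name) == 0) return it->value;
    return NULL;
}

/* Look in the data block itself, then scan backward through the global blocks. */
const char *resolve(const DataBlock *b, const char *name) {
    const char *v = find_item(b->items, name);
    for (const GlobalBlock *g = b->globals_in_scope; v == NULL && g != NULL; g = g->earlier)
        v = find_item(g->items, name);
    return v;                    /* NULL if the name is not in scope at all */
}
```

Because a data block records only the head of the chain as it stood when the block was parsed, global blocks that appear later in the file are never reachable from it, and the backward scan needs no explicit position bookkeeping.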

The binary tree search employed is the classic algorithm of Knuth (1973), which is available as tsearch in the C libraries of POSIX systems. Given modern computer systems, the implementation is extremely fast and efficient. There are no files in existence whose size would test the limits of Star_Base.
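For readers unfamiliar with the tsearch family, the following minimal sketch shows how data items might be stored and retrieved by data name using the POSIX `<search.h>` functions `tsearch`, `tfind` and `twalk`. The `Item` type and the `item_*` wrappers are hypothetical names introduced for illustration; how Star_Base itself keys and wraps its tree is not described in the text.

```c
#include <search.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one data item keyed by its data name. */
typedef struct {
    const char *name;
    const char *value;
} Item;

/* Ordering function required by tsearch/tfind: compare by data name. */
static int item_cmp(const void *a, const void *b) {
    return strcmp(((const Item *)a)->name, ((const Item *)b)->name);
}

/* Insert an item. Returns the item held in the tree, which is the
 * previously stored one if the name was already present, or NULL if
 * tsearch could not allocate a node. */
const Item *item_insert(void **root, const Item *it) {
    void *node = tsearch(it, root, item_cmp);
    return node ? *(const Item *const *)node : NULL;
}

/* Look up an item by name without modifying the tree. */
const Item *item_find(void *const *root, const char *name) {
    Item key = { name, NULL };
    void *node = tfind(&key, root, item_cmp);
    return node ? *(const Item *const *)node : NULL;
}

/* twalk passes no user argument, so the counter lives at file scope. */
static size_t walk_count;

static void count_action(const void *nodep, VISIT which, int depth) {
    (void)nodep;
    (void)depth;
    /* Internal nodes are visited three times; count them once (postorder
     * is the in-order visit). Leaves are visited once, as 'leaf'. */
    if (which == postorder || which == leaf) walk_count++;
}

size_t item_count(const void *root) {
    walk_count = 0;
    twalk(root, count_action);
    return walk_count;
}
```

Note that `tsearch` stores only the pointer it is given, so the items must outlive the tree; the tree handle itself is an opaque `void *` that starts out as `NULL`.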

The use of a binary tree simplifies both the process by which a legitimate STAR File is returned as output by Star_Base and the way in which the user can control the scope over which the conditionals operate. The program stores references to the data nodes of the tree it needs to extract when outputting. Since the location in the original data tree is always stored, the program is easily able to reconstruct the correct structure of the file by walking the tree, identifying the nodes that need to be output in addition to the data.
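The idea of reconstructing a valid file from references to selected nodes can be sketched as follows. This is an illustration under stated assumptions, not the Star_Base implementation: the `Node` type, the `wanted` flag and the functions `node_attach`, `node_select` and `write_selected` are hypothetical, a general first-child/next-sibling tree is used for brevity, and only data blocks and data items are modelled (loops and save frames are omitted).

```c
#include <stdio.h>
#include <string.h>

typedef enum { NODE_FILE, NODE_BLOCK, NODE_ITEM } NodeKind;

/* Hypothetical tree node that remembers its location through 'parent'. */
typedef struct Node {
    NodeKind kind;
    const char *name;            /* block name or data name */
    const char *value;           /* NULL for containers */
    struct Node *parent;
    struct Node *first_child;
    struct Node *next_sibling;
    int wanted;                  /* set if the node must appear in the output */
} Node;

/* Attach a child as the last child of its parent, preserving file order. */
void node_attach(Node *parent, Node *child) {
    Node **slot = &parent->first_child;
    child->parent = parent;
    child->next_sibling = NULL;
    while (*slot != NULL) slot = &(*slot)->next_sibling;
    *slot = child;
}

/* Mark a matched data item together with every container that encloses it.
 * The climb stops early at a marked node, whose ancestors are already marked. */
void node_select(Node *n) {
    for (; n != NULL && !n->wanted; n = n->parent)
        n->wanted = 1;
}

/* Append the marked part of the subtree to 'out', in original order. */
static void emit(const Node *n, char *out, size_t cap) {
    size_t used = strlen(out);
    if (!n->wanted) return;
    if (n->kind == NODE_BLOCK)
        snprintf(out + used, cap - used, "data_%s\n", n->name);
    else if (n->kind == NODE_ITEM)
        snprintf(out + used, cap - used, "%s %s\n", n->name, n->value);
    for (const Node *c = n->first_child; c != NULL; c = c->next_sibling)
        emit(c, out, cap);
}

/* Write the selected nodes as text; output is truncated if 'cap' is too small. */
void write_selected(const Node *root, char *out, size_t cap) {
    if (cap == 0) return;
    out[0] = '\0';
    emit(root, out, cap);
}
```

Selecting a single data item is enough to bring its enclosing data block header into the output, so the result remains a well-formed file even though only the matching data were requested.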

Star_Base is by default the 'gold standard' for testing other applications for correctness with respect to the syntax and semantics of the STAR File. It can be said that the output of Star_Base is not optimal, since it is yet another STAR File and one which is devoid of the original comments and formatting. However, Star_Base is in essence an API for STAR File applications rather than a stand-alone program (although it is often used in that way). Star_Base was the platform from which BioMagResBank's starlib (Section 5.2.6.4) was developed.

References

Knuth, D. E. (1973). The art of computer programming. Vol. 3, Sorting and searching, pp. 422–447. Reading, MA: Addison-Wesley.
