International Tables for Crystallography, Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 2.1, pp. 16-19
https://doi.org/10.1107/97809553602060000727

Appendix A2.1.1. Backus–Naur form of the STAR syntax and grammar

S. R. Hall(a)* and N. Spadaccini(b)

(a) School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia, and (b) School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia
Correspondence e-mail: syd@crystal.uwa.edu.au

This description of the STAR syntax and grammar is annotated to clarify issues that cannot be represented in a pure extended Backus–Naur form (EBNF) definition.

The allowed character set in STAR is restricted to ASCII 09–13 and 32–126. Other characters from the ASCII set are illegal. If such characters are present in a file, the error state is well defined, but the functionality of the error handler is not specified. For instance, one may choose to return an illegal-file exception and terminate the application, or equally one may choose to ignore and skip over the illegal characters.

The concept of white space 〈wspace〉 includes a comment, since comments only serve (in a parser sense) to delimit tokens anyway. We adopt the convention here of enclosing terminal symbols in single forward quotes. There are necessary provisos to this; for representing formatting characters they are:

  • '\'' represents the single-quote character, i.e. the \ is an escape character.

  • '\\' represents the single backslash character, i.e. the \ is an escape character.

  • '\f' represents the form-feed character, i.e. ASCII 12.

  • '\n' represents the new-line character, i.e. ASCII 10.

  • '\r' represents the carriage-return character, i.e. ASCII 13.

  • '\t' represents the tab character, i.e. ASCII 09.

  • '\v' represents the vertical-tab character, i.e. ASCII 11.

There are STAR specifications not definable in the EBNF. The EBNF can be used to define the tokenization of the input stream, and a STAR parser should test that the following condition is true. The number of data elements in 〈data_loop_values〉 of a 〈data_loop〉 production must be an integer multiple of the number of data names in the associated 〈data_loop_field〉.
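For example, in the following fragment (the data names are hypothetical) the loop declares two data names, so the number of values supplied, here six, must be an integer multiple of two:

    loop_
        _atom_id
        _atom_charge
        C1    0.12
        N2   -0.34
        O3   -0.56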

The STAR syntax specified in the EBNF follows.

A2.1.1.1. Lexical tokens


We accept a space, a horizontal tab and a vertical tab as 〈blank〉. [Scheme 16]
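Scheme 16 is not reproduced in this extract; an illustrative reading of the production it describes (not the published scheme verbatim) is:

    〈blank〉 ::= ' ' | '\t' | '\v'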

The non-printing ASCII characters 10, 12 and 13, or any sequence of these, are always interpreted as a single line terminator. In this way there should be no operating-system-dependent ambiguity on architectures that use character sequences as line terminators. This necessarily requires that these characters be used only for line termination. [Scheme 17]
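A sketch consistent with this description, using an assumed helper production 〈line_term〉 for the individual termination characters:

    〈line_term〉 ::= '\n' | '\f' | '\r'
    〈terminate〉 ::= 〈line_term〉+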

We define a `comment' to be initiated with a 〈blank〉 or 〈terminate〉 and the character #, followed by any sequence of characters (which includes 〈blank〉). The only characters not allowed are those in the production 〈terminate〉, and hence these characters terminate a comment. Note that the requirement of a leading 〈blank〉 or 〈terminate〉 is dropped if the # character is the first character in the file. [Scheme 18]
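One possible rendering (the exception for a # as the very first character of the file is handled outside the production):

    〈comment〉 ::= { 〈blank〉 | 〈terminate〉 } '#' 〈char〉* 〈terminate〉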

We accept as white space all elements in the above three productions. White spaces are the lexemes able to delimit the lexical tokens. Note that a comment is a legitimate white space because it must end with a line terminator, and hence delimits tokens. [Scheme 19]
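In sketch form (again an illustrative reading rather than the published scheme):

    〈wspace〉 ::= 〈blank〉 | 〈terminate〉 | 〈comment〉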

Non-blank characters are composed of all the characters in our set, excluding 〈blank〉 and 〈terminate〉 characters. [Scheme 20]

〈char〉 characters are composed of all the characters in our set, excluding 〈terminate〉 characters. [Scheme 21]
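Illustrative sketches of these two character classes, with the production names assumed and the exclusions written informally:

    〈non_blank_char〉 ::= any allowed character other than a 〈blank〉 or 〈terminate〉 character
    〈char〉           ::= 〈blank〉 | 〈non_blank_char〉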

We define a `line of text' to be a line contained within a semicolon-bounded text block. Hence the first character cannot be a semicolon; it is followed by any number of characters from the set 〈char〉 and terminated with a line-termination character, or the line consists of just the termination character. This allows for `blank' lines in the semicolon-bounded text block. [Scheme 22]
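A possible reading, where 〈char_not_semi〉 is an assumed name for any 〈char〉 other than ';':

    〈line_of_text〉 ::= 〈char_not_semi〉 〈char〉* 〈terminate〉 | 〈terminate〉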

Productions for specific characters (a combined sketch of these character classes is given after the descriptions below). [Scheme 23]

All printable characters except the double quote. [Scheme 24]

All printable characters except the single quote. [Scheme 25]

All printable characters except the left and right square brackets. [Scheme 26]

All printable characters except the semicolon. [Scheme 27]

Ordinary characters are all those printable characters that can initiate a non-quoted text string. These exclude the special characters ", #, $, ' and _, and in some cases ;. [Scheme 28]
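Schemes 23–28 are not reproduced here; the character classes they define might be sketched along the following lines (the production names are illustrative and the exclusions are written informally):

    〈not_a_double_quote〉   ::= any printable character except '"'
    〈not_a_single_quote〉   ::= any printable character except '\''
    〈not_a_square_bracket〉 ::= any printable character except '[' and ']'
    〈not_a_semi_colon〉     ::= any printable character except ';'
    〈ordinary_char〉        ::= any printable character except '"', '#', '$', '\'' and '_' (and, where noted, ';')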

The keywords (in a case-insensitive form). [Scheme 29]
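The five keywords are those used throughout this appendix; a sketch is given below (how case-insensitivity is expressed in the published scheme is not shown here):

    〈DATA_〉   ::= 'data_'
    〈LOOP_〉   ::= 'loop_'
    〈GLOBAL_〉 ::= 'global_'
    〈SAVE_〉   ::= 'save_'
    〈STOP_〉   ::= 'stop_'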

The operating-system-dependent end-of-file marker. [Scheme 30]

A2.1.1.2. STAR grammar


A STAR File may be an empty file, or it may contain one or more data blocks or global blocks. [Scheme 31]
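A minimal sketch of the top-level production consistent with this statement:

    〈STAR_file〉 ::= { 〈data_block〉 | 〈global_block〉 }*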

There can be any amount of white space (remember that 〈wspace〉 includes comments) before a data or global block, and at least one white space or an end of file (EOF) after it. This forces white space between data (and global) blocks in a single file. There must be at least one data item in any data or global block; this means a file consisting of just a data or global block heading is invalid. [Scheme 32]
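A sketch of the two block productions under these constraints (save frames, which may occur within a data block, are not shown here):

    〈data_block〉   ::= 〈wspace〉* 〈data_heading〉 〈data〉+ { 〈wspace〉+ | 〈EOF〉 }
    〈global_block〉 ::= 〈wspace〉* 〈GLOBAL_〉 〈data〉+ { 〈wspace〉+ | 〈EOF〉 }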

There can be any amount of white space (remember that 〈wspace〉 includes comments) before a save-frame block. This forces white space between save-frame blocks also. There is no need to include the { 〈wspace〉+ | 〈EOF〉 } found in the data- and global-block productions, since those productions cover the situation of a save-frame block terminating the file. [Scheme 33]

A data-block or save-frame heading consists of the relevant five-character keyword (case-insensitive) immediately followed by at least one non-blank character. This does not preclude the associated block name or frame name consisting of just one or more punctuation characters. [Scheme 34]
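Illustrative sketches of the two heading productions:

    〈data_heading〉 ::= 〈DATA_〉 〈non_blank_char〉+
    〈save_heading〉 ::= 〈SAVE_〉 〈non_blank_char〉+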

Data come in the following three forms (a sketch of the corresponding production is given after the list).

  • (1) A data-name tag separated from its associated value by a trailing 〈blank〉. Note it is explicitly a 〈blank〉 and not a 〈wspace〉. These are type I data.

  • (2) A data-name tag separated from its associated value by a 〈terminate〉. These are type II data.

  • (3) Looped data.

[Scheme 35]
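A skeleton of the 〈data〉 production consistent with this description; the leading 〈blank〉 (type I) or 〈terminate〉 (type II) that separates the tag from its value is assumed to be carried by the 〈data_value〉 production sketched further below:

    〈data〉 ::= 〈data_name〉 〈data_value〉 | 〈data_loop〉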

We must allow for white space preceding the loop_ (case-insensitive) keyword, since this is not covered by any of the other productions. [Scheme 36]

The name list for a loop must include at least one data name or a nested loop. [Scheme 37]

A data name is initiated by an underscore character and followed by one or more non-blank and non-terminating characters from the STAR character set. This does not preclude data names consisting of just one or more punctuation characters. [Scheme 38]

Loop values are represented in the same way as the 〈data〉 production, except that the possibility of nested data loops introduces the need for the stop_ keyword. [Scheme 39]
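The four loop-related productions described above might be sketched as follows. This is only an illustrative reading of the preceding paragraphs; in particular, 〈nested_loop〉 is a placeholder for however the published grammar expresses loop nesting, and the exact placement of white space may differ:

    〈data_loop〉        ::= 〈wspace〉+ 〈LOOP_〉 〈data_loop_field〉 〈data_loop_values〉
    〈data_loop_field〉  ::= { 〈wspace〉+ 〈data_name〉 | 〈wspace〉+ 〈nested_loop〉 }+
    〈data_name〉        ::= '_' 〈non_blank_char〉+
    〈data_loop_values〉 ::= { 〈data_value〉 | 〈wspace〉+ 〈STOP_〉 }+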

Data values of type I data are immediately preceded by a 〈blank〉. Data values of type II data are immediately preceded by a 〈terminate〉. [Scheme 40]

A type-I unquoted string is immediately preceded by a 〈blank〉. It cannot begin with any of the characters in the complement of the 〈ordinary_char〉 set, i.e. ", #, $, ', [, ] or _. However, it can begin with a semicolon. It is then followed by any number of non-blank characters. [Scheme 41]

A type-II unquoted string is immediately preceded by a line break. As with type I, it cannot begin with ", #, $, ', [, ] or _. It also cannot begin with a semicolon, since this would match the semicolon-delimited data production. [Scheme 42]
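Sketches of the value productions described in the last three paragraphs; 〈value_I〉 and 〈value_II〉 are assumed names standing for the type I and type II forms of the unquoted, quoted, semicolon-bounded and bracket-bounded strings:

    〈data_value〉         ::= 〈blank〉 〈value_I〉 | 〈terminate〉 〈value_II〉
    〈unquoted_string_I〉  ::= 〈ordinary_char〉 〈non_blank_char〉* | ';' 〈non_blank_char〉*
    〈unquoted_string_II〉 ::= 〈ordinary_char〉 〈non_blank_char〉*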

Specific exceptions to lexemes which match both types of unquoted strings are:

  • (1) No string beginning with an underscore is an unquoted string.

  • (2) No string that matches a production for 〈data_heading〉, 〈save_heading〉, 〈LOOP_〉, 〈STOP_〉, 〈SAVE_〉 or 〈GLOBAL_〉 is an unquoted string.

Data values that match the lexemes excluded in cases (1) and (2) above must be given as quoted data values.
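For example (the data names here are hypothetical), values that would otherwise fall under these exclusions can simply be quoted:

    _example_keyword_value   'stop_'
    _example_name_value      '_cell_length_a'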

The string between a set of double quotes can consist of any character that is not a double quote, or it can be a double quote as long as it is immediately followed by a non-blank character or any number of double quotes at the end of the string. This final rule picks up cases of double-quote delimited strings that end in one or more double quotes, like "ABC"". [Scheme 43]

The string between a set of single quotes can consist of any character that is not a single quote, or it can be a single quote as long as it is immediately followed by a non-blank character or any number of single quotes at the end of the string. This final rule picks up cases of single-quote delimited strings that end in one or more single quotes, like 'ABC''. [Scheme 44]
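A possible reading of the two quoted-string productions (not the published schemes verbatim); the trailing-quote rule is expressed by allowing a run of quote characters immediately before the closing delimiter:

    〈double_quote_string〉 ::= '"' { 〈not_a_double_quote〉 | '"' 〈non_blank_char〉 }* { '"' }* '"'
    〈single_quote_string〉 ::= '\'' { 〈not_a_single_quote〉 | '\'' 〈non_blank_char〉 }* { '\'' }* '\''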

The string bounded by semicolons can begin with any number of characters (including those in the 〈blank〉 production) but is necessarily terminated by a line break. This forces a line break on the line that contains the `opening' semicolon. After the first line, one can have any number of 〈line_of_text〉. Note that we treat the first line as special, since it can contain a leading semicolon, which is not true of 〈line_of_text〉. A 〈line_of_text〉 is always terminated with a line break, thus ensuring the closing semicolon is in column 1. [Scheme 45]
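An illustrative sketch, with both bounding semicolons shown as part of the production:

    〈semi_colon_bounded〉 ::= ';' 〈char〉* 〈terminate〉 〈line_of_text〉* ';'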

The string bounded by square brackets can consist of any character, including 〈terminate〉 and 〈blank〉, but excluding the characters [ and ] unless they are escaped or balanced. [Scheme 46]
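A possible reading, assuming that the backslash is the escape character and expressing balanced brackets by recursion; 〈non_bracket_char〉 is an assumed name for any character, including 〈blank〉 and 〈terminate〉, other than '[' or ']':

    〈square_bracket_string〉 ::= '[' { 〈non_bracket_char〉 | '\\' '[' | '\\' ']' | 〈square_bracket_string〉 }* ']'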
