Lexical tokens

Hall, S. R.; Spadaccini, N.; Brown, I. D.; Bernstein, H. J.; Westbrook, J. D.; McMahon, B.

doi:10.1107/97809553602060000728

International Tables for Crystallography
Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon


International Tables for Crystallography (2006). Vol. G. ch. 2.2, pp. 30-32


2.2.7.3. Lexical tokens


(45) We define a `comment' to be initiated with the character #. This can be followed by any sequence of characters, including 〈SP〉 and 〈HT〉. The only characters not allowed are those in the production 〈eol〉, which terminate the comment. A comment is recognized only at the beginning of a line or after blanks, i.e. only after a space, tab or 〈eol〉. For this reason we define both comments and `tokenized comments'. No portion of the essential machine-readable content of a CIF is conveyed by comments. Comments are for the convenience of human readers of CIFs and may be freely introduced or removed. Note, however, the optional structured comment sanctioned in paragraph (34) above, whose purpose is to indicate the file type and revision level to general-purpose file-handling software. [Scheme scheme11]
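The recognition rule for comments can be sketched in a few lines of Python. This is only an illustration, not part of the formal grammar: the function names `starts_comment` and `comment_end` are hypothetical, and the sketch assumes the caller is not currently inside a quoted string or a semicolon-delimited text field.

```python
def starts_comment(text, pos):
    """Return True if the '#' at text[pos] would be recognized as a comment.

    A comment is recognized only at the beginning of the input or a line,
    or directly after a space or tab.
    """
    if text[pos] != "#":
        return False
    return pos == 0 or text[pos - 1] in " \t\r\n"


def comment_end(text, pos):
    """Return the index just past the last character of the comment at pos.

    The comment runs up to, but does not include, the end-of-line characters.
    """
    end = pos
    while end < len(text) and text[end] not in "\r\n":
        end += 1
    return end


sample = "_tag value # note\nabc#def\n"
first = sample.index("#")
second = sample.index("#", first + 1)
print(starts_comment(sample, first))            # '#' after a space
print(sample[first:comment_end(sample, first)])
print(starts_comment(sample, second))           # '#' inside 'abc#def'
```

In the second case the # follows a non-blank character, so it is part of the token abc#def and does not start a comment.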

(46) We accept as white space all appropriate combinations of spaces, tabs, ends of line and comments, as well as the beginning of the file. White space consists of the characters that can delimit the lexical tokens. [Scheme scheme12]

(47) Non-blank characters are all the characters in our set except the 〈SP〉, 〈HT〉 and 〈eol〉 characters. [Scheme scheme13]

(48) AnyPrintChar characters are composed of all the characters in our set excluding 〈eol〉 characters. [Scheme scheme14]

(49) We define a `line of text' to be a line contained within a semicolon-bounded text field. Hence the first character cannot be a semicolon; it may be followed by any number of characters from the set 〈char〉 and terminated with a line-termination character. We define the characters in 〈TextLeadChar〉 as those in 〈AnyPrintChar〉 except for the semicolon. [Scheme scheme15]
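As an illustration of why the leading character of each line matters, the following Python sketch collects the lines of a semicolon-bounded text field. It is a simplification under stated assumptions: the input has already been split into lines, the hypothetical function `read_text_field` is not part of any CIF library, and the check for trailing white space after the closing semicolon is omitted.

```python
def read_text_field(lines, start):
    """Collect a semicolon-bounded text field that opens on lines[start].

    Returns (text, index_of_closing_line). A semicolon is a delimiter only
    when it is the first character of a line; elsewhere it is ordinary text.
    """
    if not lines[start].startswith(";"):
        raise ValueError("text field must open with ';' in the first column")
    body = [lines[start][1:]]          # text after the opening semicolon
    for i in range(start + 1, len(lines)):
        if lines[i].startswith(";"):   # semicolon in column one closes the field
            return "\n".join(body), i
        body.append(lines[i])          # a 'line of text': no leading semicolon
    raise ValueError("unterminated text field")


field = [";first line", "second; still text", ";"]
print(read_text_field(field, 0))
```

The semicolon in the middle of the second line is kept as content because it is not the first character of the line.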

(50) Ordinary characters are all those printable characters that can initiate a non-quoted character string. These exclude the special characters ", #, $, ', [, ] and _, and in some cases ;. [Scheme scheme16]
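A membership test for ordinary characters might look like the Python sketch below. Two assumptions are made that the text above does not state: the character set is taken to be printable ASCII (codes 33 to 126 for non-blank characters), and the `in some cases' treatment of the semicolon is modelled by a hypothetical `allow_semicolon` flag that the caller sets from the column position.

```python
# Special characters that may not initiate a non-quoted character string.
SPECIAL_CHARS = set("\"#$'[]_")


def is_ordinary_char(ch, allow_semicolon=False):
    """Return True if ch can initiate a non-quoted character string.

    Assumption: non-blank printable characters are ASCII codes 33-126.
    The semicolon is accepted only when the caller says so, for example
    when it does not occur at the beginning of a line.
    """
    if len(ch) != 1 or not (33 <= ord(ch) <= 126):
        return False
    if ch == ";":
        return allow_semicolon
    return ch not in SPECIAL_CHARS


print([c for c in "a_#?;" if is_ordinary_char(c)])
```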

(51) The reserved word data_ (in a case-insensitive form). [Scheme scheme17]

(52) The reserved word loop_ (in a case-insensitive form). [Scheme scheme18]

(53) The reserved word save_ (in a case-insensitive form). [Scheme scheme19]

(54) The reserved word stop_ (in a case-insensitive form). [Scheme scheme20]

(55) The reserved word global_ (in a case-insensitive form). This is actually a reserved word of STAR, but we define it here so that it may be explicitly excluded as an unquoted string. We do this so that any possible future adoption of STAR features will not invalidate existing CIFs. [Scheme scheme21]

(56) Quoted strings need to be recognized in the lexical scan, because their definition is context-sensitive. A string quoted by single quotes may contain a single quote as long as it is not followed by white space. A string quoted by double quotes may contain a double quote as long as it is not followed by white space. Formally we express this with context-sensitive productions. In practice, this requires a one-character look-ahead: when the character that opened the string is encountered again, the scan continues if the following character is not a space, tab or end of line. When processing a semicolon-delimited text field, the column position has to be remembered to decide whether a semicolon should be recognized.

For a semicolon-delimited text string, failure to provide trailing white space is an error. The 〈WhiteSpace〉 on the left-hand side must evaluate to the same string instance on the right-hand side and the parse must terminate on the first valid match reading left to right. [Scheme scheme22]
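The one-character look-ahead for quoted strings can be sketched in Python as follows. The function name `scan_quoted` is hypothetical, and the sketch makes one assumption beyond the text: a closing quote at the very end of the input is accepted as terminating the string.

```python
def scan_quoted(text, start):
    """Scan a quoted string whose opening quote is at text[start].

    Returns (content, index_just_past_closing_quote). An embedded quote
    character is kept as content unless it is followed by white space.
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not at an opening quote")
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":
            break                      # quoted strings do not span lines
        if ch == quote:
            # One-character look-ahead decides whether this quote closes the string.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == "" or nxt in " \t\r\n":
                return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


print(scan_quoted("'a dog's life' x", 0))
```

Here the quote inside dog's is followed by the letter s, so the scan continues; the quote after life is followed by a space and closes the string.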

(57) Tags and values are appropriate lexical tokens. The special values of `.' and `?' represent data that are inapplicable or unknown, respectively.

(i) No string that matches the production for 〈LOOP_〉 is accepted as a non-quoted string.

(ii) No string that matches the production for 〈STOP_〉 is accepted as a non-quoted string.

(iii) No string in which the initial five characters match the production for 〈DATA_〉 is accepted as a non-quoted string.

(iv) No string in which the initial five characters match the production for 〈SAVE_〉 is accepted as a non-quoted string.

(v) No string that matches the production for 〈GLOBAL_〉 is accepted as a non-quoted string.

Unquoted strings are described by a pair of productions to permit the initial character of an unquoted string to be a semicolon, provided that it does not occur at the beginning of a line. The parser is required to evaluate 〈noteol〉 to the same string instance on both sides of the production. [Scheme scheme23]
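Restrictions (i) to (v) can be checked with a small Python helper. This is a sketch only: the name `violates_reserved_word_rules` is hypothetical, and the function tests the reserved-word rules alone, not the character-level productions for unquoted strings.

```python
def violates_reserved_word_rules(s):
    """Return True if s may not be accepted as a non-quoted string.

    Rules (i), (ii) and (v): the whole string matches loop_, stop_ or global_.
    Rules (iii) and (iv): the initial five characters match data_ or save_.
    All comparisons are case-insensitive.
    """
    lower = s.lower()
    if lower in ("loop_", "stop_", "global_"):
        return True
    if lower.startswith(("data_", "save_")):
        return True
    return False


for candidate in ["LOOP_", "loop_x", "Data_abc", "save_", "1.23"]:
    print(candidate, violates_reserved_word_rules(candidate))
```

Note the asymmetry: loop_x is acceptable because only an exact match with loop_ is excluded, whereas any string beginning with data_ or save_ is rejected.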

2.2.7.3.1. CIF grammar


(58) A CIF may be an empty file, or it may contain only comments or white space, or it may contain one or more data blocks. Comments before the first block are acceptable, and there must be white space between blocks. [Scheme scheme24]

(59) For a data block, there must be a data-block heading and zero or more data items or save frames. [Scheme scheme25]

(60) A data-block heading consists of the five characters data_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters. [Scheme scheme26]
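A data-block heading can be recognized with a regular expression, as in the Python sketch below. It assumes, beyond what the text states, that the non-blank printable characters are ASCII codes 33 to 126; the names `DATA_HEADING_RE` and `parse_data_heading` are hypothetical.

```python
import re

# data_ (case-insensitive) followed by at least one non-blank character.
# Assumption: non-blank printable characters are ASCII codes 0x21-0x7E.
DATA_HEADING_RE = re.compile(r"data_([\x21-\x7e]+)", re.IGNORECASE)


def parse_data_heading(token):
    """Return the block name if token is a data-block heading, else None."""
    match = DATA_HEADING_RE.fullmatch(token)
    return match.group(1) if match else None


print(parse_data_heading("data_quartz"))
print(parse_data_heading("DATA_Q1"))
print(parse_data_heading("data_"))
```

The bare word data_ with no following character is not a valid heading, so the last call returns None. A save-frame heading, described in paragraph (62), has the same shape with save_ in place of data_.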

(61) For a save frame, there must be a save-frame heading, some data items and then the reserved word save_. [Scheme scheme27]

(62) A save-frame heading consists of the five characters save_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters. [Scheme scheme28]

(63) Data come in two forms:

(i) A data-name tag separated from its associated value by a 〈WhiteSpace〉.

(ii) Looped data. The number of values in the body must be a multiple of the number of tags in the header. [Scheme scheme29]
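The constraint on looped data can be illustrated with a Python sketch that groups a flat list of values into rows. The function name `rows_from_loop` and the example tags are hypothetical, and the sketch assumes the tags and values have already been tokenized.

```python
def rows_from_loop(tags, values):
    """Group the values of a loop body into one dict per row.

    Raises ValueError if the number of values is not a multiple of the
    number of tags in the loop header.
    """
    if not tags:
        raise ValueError("a loop header needs at least one tag")
    if len(values) % len(tags) != 0:
        raise ValueError("number of values is not a multiple of number of tags")
    n = len(tags)
    return [dict(zip(tags, values[i:i + n])) for i in range(0, len(values), n)]


tags = ["_atom_label", "_atom_x"]          # illustrative tag names
values = ["C1", "0.10", "O1", "0.25"]
print(rows_from_loop(tags, values))
```

With two tags and four values the body yields two complete rows; a fifth value would leave a partial row and is rejected.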
