International Tables for Crystallography
Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 5.3, pp. 512-515

Section 5.3.5.3. ciftex: translating to a typesetting language

B. McMahon*

International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
*Correspondence e-mail: bm@iucr.org

The program ciftex (McMahon, 1993) was developed to create files for typesetting the journal Acta Crystallographica using the text-formatting language TeX (Knuth, 1986). Details of its use in the journal production process are given in Chapter 5.7. It is discussed here as an example of translating a CIF to some output format where data values are annotated with different text depending on their accompanying data names.

5.3.5.3.1. Basic operation of ciftex

The program is designed to act as a filter, typically in a Unix-style environment, reading a CIF on the standard input channel and outputting a modified data stream to standard output. The output is a file of TeX code that is processed by the TeX program to produce a device-independent file describing the content of a formatted typeset document. Further post-processing allows the formatted document to be viewed on the screen or printed.

Each input token (number, character or text string; data name; loop_ or data_ keywords) is transformed as it is identified; there is no lookahead and minimal retention of context. The data stream is treated purely syntactically; no transformations are applied on the basis of the supposed meaning of any of the file contents.

5.3.5.3.1.1. Non-looped data

For portions of the CIF that are not contained in looped lists, the transformations are trivial. A (data name, data value) pair is transformed to a TeX macro and its argument. The macro name is determined from an external `map' file which the program reads at run time; this file associates CIF data names and the corresponding TeX macros through a simple lookup table.

A CIF data value is in most cases passed as the argument to the corresponding TeX macro with few modifications. If the data value is a character string beginning with a digit, full point, hyphen or plus character, it is assumed to be of type `numb'. A space is introduced ahead of an embedded open parenthesis (to separate a standard uncertainty from its parent value). A leading zero is printed before any bare decimal point. An embedded E is taken to indicate exponential notation and the format of the number is modified accordingly.
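The three adjustments to `numb' values can be sketched in a few lines of Python. This is an illustration only, not the ciftex source: the function names are hypothetical, and the rendering of exponential notation as a power of ten is an assumption, since the text does not say how ciftex reformats such numbers.

```python
import re

# Leading characters that mark a value as type 'numb' (from the text).
NUMB_LEAD = set("0123456789.-+")


def looks_numeric(value: str) -> bool:
    """True if the value begins with a digit, full point, hyphen or plus."""
    return bool(value) and value[0] in NUMB_LEAD


def format_numb(value: str) -> str:
    """Apply the adjustments described for 'numb' values."""
    # An embedded E separates the mantissa from the exponent.
    mantissa, sep, exponent = value.partition("E")
    # Leading zero before a bare decimal point: .5 -> 0.5, -.5 -> -0.5
    mantissa = re.sub(r"^([+-]?)\.(?=\d)", r"\g<1>0.", mantissa)
    # Space ahead of an embedded open parenthesis: 1.234(5) -> 1.234 (5)
    mantissa = re.sub(r"(?<=\S)\(", " (", mantissa)
    if sep:
        # ASSUMPTION: a power-of-ten rendering; the real output is not given.
        exp = exponent.lstrip("+")
        return f"{mantissa}\\times 10^{{{exp}}}"
    return mantissa


print(format_numb("1.234(5)"))  # 1.234 (5)
print(format_numb("-.25(3)"))   # -0.25 (3)
```

Because the test for `numb' looks only at the first character, a text value that happens to begin with a digit would be treated as a number in this sketch too.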

If the input data value is of type `char' (i.e. is a single token beginning with characters other than those recognized as the leading characters for numerical data; or contains multiple tokens delimited by quote marks or semicolons), the program will search the map file for key values exactly matching each token, and if found will substitute the token by its replacement word or text. If no replacement is specified in the map file, the token is passed unchanged to the standard output channel. This facility was found to be useful in making global substitutions of individual words during file processing, but must be used with care since the substitutions are unconditional, without any reference to context.
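A minimal Python sketch of this unconditional, exact-match substitution follows. The function name and the use of a dictionary for the map entries are assumptions made for illustration; splitting on white space is also a simplification of how ciftex identifies tokens.

```python
from typing import Mapping


def substitute_tokens(value: str, replacements: Mapping[str, str]) -> str:
    """Replace each white-space-delimited token that exactly matches a key.

    Tokens with no entry in the lookup table pass through unchanged.
    Runs of white space collapse to single spaces in this sketch.
    """
    return " ".join(replacements.get(token, token) for token in value.split())


words = {"sulphate": "sulfate"}
print(substitute_tokens("copper sulphate pentahydrate", words))
# copper sulfate pentahydrate
```

The match is on whole tokens only, so in this sketch `sulphates' or `sulphate,' would be left alone; equally, every exact occurrence of `sulphate' is replaced regardless of context, which is the reason for the caution expressed above.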

Some small examples of typical non-looped data items are shown in Fig. 5.3.5.4 and the corresponding ciftex translation based on a map file used for typesetting Acta Crystallographica Section C is shown in Fig. 5.3.5.5.

Figure 5.3.5.4. Sample CIF data input to ciftex.

Figure 5.3.5.5. Output from ciftex run on the data of Fig. 5.3.5.4. Note the transformations of the numerical arguments and the translation of `sulphate' to `sulfate'.

5.3.5.3.1.2. Looped data

If the input token is a loop_ keyword, the program enters a different mode of operation. Looped data may be represented in print either as repetitive lists or in tabular format. There is no indication in a CIF dictionary of the appropriate representation (nor should there be, for what is essentially a matter of presentation) and the choice is made based on a flag associated with each data name in the map file. For non-tabular lists, the structure [Scheme scheme5] is translated to a sequence of TeX codes of the form [Scheme scheme6]

In the case of tabulated data, the loop_ header is translated into a set of table headings and typographic codes are introduced to lay out in columnar format the values in the body of the list. The number of different data names in the loop header is counted and the data values are identified by their position in the loop modulo the total number of data names in the header (in effect, by their `phase' in the loop). In the simplest case, a TeX command is emitted that builds a table with n columns, where n is the number of different data names. Then the data values are counted as they are processed. After every nth data value, a TeX code is emitted indicating `end of table row' and a further code is emitted before the next value (if there is one) that means `beginning of new table row'. In all other cases, a code is emitted signifying `move to next column'.
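The counting scheme can be sketched in Python as follows. The macro names (\begintable, \row, \col, \endrow, \endtable) are placeholders invented for this illustration and are not the codes that ciftex emits; the function name is likewise hypothetical. The closing \endtable is emitted here simply to mark the end of the value list.

```python
from typing import Iterable, List

# Hypothetical typographic codes, for illustration only.
TABLE_BEGIN = "\\begintable{%d}\n"
ROW_BEGIN = "\\row "
NEXT_COL = " \\col "
ROW_END = " \\endrow\n"
TABLE_END = "\\endtable\n"


def emit_table(names: List[str], values: Iterable[str]) -> str:
    """Lay out looped values in n columns using only a running count."""
    n = len(names)  # number of data names in the loop header
    pieces = [TABLE_BEGIN % n]
    for count, value in enumerate(values):
        phase = count % n  # position of this value within its row
        if phase == 0:
            pieces.append(ROW_BEGIN)  # a new row starts before this value
        pieces.append(value)
        # After every nth value the row ends; otherwise move to next column.
        pieces.append(ROW_END if phase == n - 1 else NEXT_COL)
    pieces.append(TABLE_END)
    return "".join(pieces)


header = ["_atom_site_label", "_atom_site_fract_x", "_atom_site_fract_y"]
body = ["O1", "0.1", "0.2", "C1", "0.3", "0.4"]
print(emit_table(header, body))
```

Each value is handled as it arrives, with no lookahead: the only state carried between values is the running count, which matches the minimal retention of context described earlier.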

Fig. 5.3.5.6 is a simplified extract from a table of atomic coordinates derived from the _atom_site_ loop in a CIF.

Figure 5.3.5.6. TeX markup for typesetting a table of atomic coordinates.

5.3.5.3.1.3. The ancillary map file

The translation between a CIF data name and its replacement text in the TeX output file is defined in the external map file. The format of the translation is very simple, as illustrated in Fig. 5.3.5.7.

Figure 5.3.5.7. Example map file for use with ciftex.

Each line starts with a CIF data name, which is terminated by a space character. The next character is either `T' or `N' to indicate whether the output should be tabulated or not. The next character is an arbitrary character from the ASCII character set, and is chosen to collect together data that will appear in the same logical section of the output file. This locator character may be associated, in another ancillary file described below, with additional text for output. The remainder of the line is the replacement text.

In the example supplied, the cell-length parameters map to the TeX macros \cella, \cellb and \cellc (each preceded by a standard TeX macro forbidding a page break immediately before the contents are printed). The details of the publication authors are described by a set of TeX macros that will occur in two different locations in the output file (the authors' names and addresses may be looped together in the location labelled by the character a; any explanatory footnotes and email addresses will be printed elsewhere in the paper, at the location labelled X). The anisotropic displacement parameters Uij will be printed in a table and the replacement text consists of the TeX codes that will be printed at the head of each column in the table.

The initial text on the line need not be a CIF data name; it may be any other single word. In this case, every occurrence of that word in the input CIF will be replaced by the replacement text.

If the initial character of the line is a hash mark #, the line is treated as a comment and discarded.
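The line layout just described could be parsed as in the Python sketch below. The names MapEntry and parse_map_line are hypothetical, and two points are assumptions: any white space between the locator character and the replacement text is discarded, and lines that begin with an ordinary word rather than a data name are taken to carry the same flag and locator characters, which the text does not state.

```python
from typing import NamedTuple, Optional


class MapEntry(NamedTuple):
    key: str          # CIF data name (or other single word)
    tabulated: bool   # True for flag 'T', False for flag 'N'
    locator: str      # single character grouping related output
    replacement: str  # text emitted in place of the key


def parse_map_line(line: str) -> Optional[MapEntry]:
    """Parse one map-file line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # hash-mark lines are comments and are discarded
    key, sep, rest = line.partition(" ")
    if not sep or len(rest) < 2 or rest[0] not in "TN":
        raise ValueError(f"unrecognized map line: {line!r}")
    # ASSUMPTION: white space before the replacement text is not significant.
    return MapEntry(key, rest[0] == "T", rest[1], rest[2:].lstrip())


# Hypothetical line: not tabulated, locator g, replacement text \cella.
print(parse_map_line("_cell_length_a Ng \\cella"))
```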

5.3.5.3.1.4. The ancillary format file

Because a printed paper may be more verbose than its parent CIF data file, it is necessary to add text to the output from ciftex to represent section headings, line spaces or other formatting instructions. The program reads an ancillary file, known as the format file, for such additional text.

Each line in the format file begins with a hash mark #, a single ASCII character and a colon. The second character is chosen to match the corresponding locator character associated with data names in the map file. The rest of the line is text to be output. When the locator character associated with the data name currently being processed differs from the previous one, the text from all lines in the format file with the new locator character is output.

The special strings #[: and #]: indicate text to be emitted at the beginning and end of the output stream, respectively.

Fig. 5.3.5.8 is an example of a simplified format file. The first line is printed at the start of the output TeX file; the second line at the end. The next line will be printed on the first occurrence of a data name flagged with the locator code a in the map file. In this example, that will be the name or address of an author of the paper; some typographic directives are emitted immediately before the authors' names and addresses, including the introduction of a blank line (`vertical skip', or `vskip') of height 10 typographic points.

Figure 5.3.5.8. Example format file for ciftex.

The lines beginning #g: are emitted immediately before the first data name in the group that is associated with locator code g. In this example, the effect is to output a heading and subheading before printing the cell-length parameters and to switch to double-column format. The line containing only the characters #g: provides for the introduction of a blank line into the TeX file, with the sole purpose of making the file more readable by human editors.

The lines beginning #U: are emitted at the beginning of the table of anisotropic U values.

The mechanism looks complicated at first sight, but addresses the need to generate headings at standard locations in a printed paper when the exact content of the paper is not known in advance.
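The interplay of locator characters and format-file lines can be sketched in Python. The function names, the sample \heading macro and the sample format lines are invented for this illustration; only the #x: line syntax, the special #[: and #]: strings and the rule of emitting text when the locator changes are taken from the description above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_format_file(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the output text for each locator character, in file order."""
    table: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # A directive is a hash mark, one character and a colon.
        if len(line) >= 3 and line[0] == "#" and line[2] == ":":
            table[line[1]].append(line[3:])
    return table


def emit_with_headings(items: Iterable[Tuple[str, str]],
                       fmt: Dict[str, List[str]]) -> List[str]:
    """items are (locator, translated text) pairs in input order."""
    out = list(fmt.get("[", []))  # text for the start of the output stream
    previous = None
    for locator, text in items:
        if locator != previous:
            # Locator changed: emit every format line for the new locator.
            out.extend(fmt.get(locator, []))
            previous = locator
        out.append(text)
    out.extend(fmt.get("]", []))  # text for the end of the output stream
    return out


fmt = parse_format_file(["#[:\\input macros", "#]:\\bye",
                         "#g:\\heading{Crystal data}", "#g:"])
items = [("g", "\\cella{5.959 (1)}"), ("g", "\\cellb{14.956 (1)}")]
print("\n".join(emit_with_headings(items, fmt)))
```

The heading appears once, ahead of the first item with locator g, and is not repeated for the second item because the locator has not changed. This is also why the order of items in the input matters.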

The different format for directives in the map and format files means that the same file can be used for both purposes, if required. In practice it is often easier to maintain different files: the same mapping between CIF data names and TeX macros might be common to different journals, while each journal uses its own format file.

5.3.5.3.2. Invocation of the program

The program reads a CIF on the standard input channel and outputs TeX code on standard output. There is no provision to specify file names. It is therefore invoked within a Unix-style operating system by a command such as

ciftex < infile > outfile

where infile and outfile are the input and output files respectively; or it may be called as part of a pipeline of procedures:

program1 infile | ciftex | program2

A number of command-line options may be supplied to modify the operation of the program. Other than the specification of the map and format files, they are largely relevant to differing house styles for IUCr journals.

The options -map mapfile and -format formatfile specify the names of the ancillary map and format files. If not specified, they are sought in default locations on the user's file system (different values may be defined when the program is compiled) or as specified in the environment variables $CIFTEX_MAP and $CIFTEX_FORMAT, respectively.

The options -H and -N specify, respectively, whether or not hydrogen atoms in coordinate tables should be printed. The hydrogen-atom lines in the table are in fact always emitted on standard output, but in the case of the -N option are prefixed by a % (TeX comment) character and so ignored by TeX.
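The effect of the -N option on a hydrogen-atom row can be sketched in Python as below. The function name and its arguments are hypothetical; in particular, the text does not say how ciftex decides that a row belongs to a hydrogen atom, so that decision is passed in as a flag.

```python
def emit_atom_row(row_tex: str, is_hydrogen: bool, print_hydrogens: bool) -> str:
    """Return the row as is, or commented out when hydrogens are suppressed."""
    if is_hydrogen and not print_hydrogens:
        # Prefix each line with % so that TeX treats it as a comment.
        return "\n".join("%" + line for line in row_tex.split("\n"))
    return row_tex


print(emit_atom_row("H1 0.1 0.2 0.3", is_hydrogen=True, print_hydrogens=False))
# %H1 0.1 0.2 0.3
```

Keeping the suppressed rows in the file as comments means that they can be restored by removing the % characters, without rerunning the program.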

Options -c and -F specify the printing of centred decimal points or commas for decimal points, respectively. Finally, the option -d modifies certain assumptions that ciftex makes when typesetting CIF dictionaries. The details are of interest only to a specialist.

5.3.5.3.3. Some general comments

Although ciftex is available for public use and redistribution within the academic community, it is clearly of most interest to users who need to generate typeset representations of the contents of CIFs. Nevertheless, some elements of its design are relevant to other applications that perform on-the-fly file transformations on a strictly syntactic basis.

First, the functionality is very simple, essentially tokenizing the input data stream and exchanging tokens for replacement text as directed. An immediate consequence of this is the need for additional utilities to manipulate the input file if, for example, the data need to be presented in a particular order. In the journals production process, QUASAR is used to reorder an input file before passing it to ciftex.

Second, the replacement text should be externalized as much as possible. The use of map and format files means that the same basic program can be used for formatting according to any set of typographic rules; only the ancillary files need to be modified. In the current version of ciftex, the program performs some replacements internally; an objective of further development is to remove this function from the program and to externalize it either in more sophisticated table lookup files or in separate methods modules.

Third, the concept of replacement should be abstracted as much as possible. The software was written initially with the objective of replacing data names with TeX macros. Experience suggests that a generic transformation program could be written with the philosophy of replacing data names and data values by directives implemented before and after the occurrence of the data value, and as events upon its first, last and intervening occurrences. Such `directives' and `events' could be mapped to arbitrary replacement strings in any markup scheme, such as SGML, XML, HTML, TeX, LaTeX or commercial word-processing encodings.

References

Knuth, D. E. (1986). The TeXbook. Computers and Typesetting, Vol. A. Reading, MA: Addison-Wesley.
McMahon, B. (1993). ciftex: translation utility from CIF to TeX. ftp://ftp.iucr.org/pub/ciftex.tar.Z