Data instances and context

Spadaccini, N.; Hall, S. R.; McMahon, B.

doi:10.1107/97809553602060000752

International Tables for Crystallography, Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon


International Tables for Crystallography (2006). Vol. G. ch. 5.2, pp. 488-491

Section 5.2.2. Data instances and context

N. Spadaccini,^a* S. R. Hall^b and B. McMahon^c

^a School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, Perth, WA 6009, Australia, ^b School of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia, and ^c International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
Correspondence e-mail: nick@csse.uwa.edu.au



In a STAR File, a data item consists of a value, which is a simple ASCII character string, and an associated identifier, or data name, which precedes the value. The data name is always an ASCII character string that begins with an underscore character and contains no white-space characters, for example _date or _chemical_formula_sum. (The detailed and formal syntax rules for STAR Files are given in Chapter 2.1.)

5.2.2.1. Single and multiple values


A data item may have a single value, in which case the data name may immediately precede the data value, separated only by white space, e.g. [Scheme scheme1]

Alternatively, a data item may occur multiple times, in a vector or a list. In such a case, the data identifiers appear in a loop header and the values follow in the order of presentation in the loop header. For the simple example of a tabular array, the loop header plays the role of column header, e.g. [Scheme scheme2]

Here the instances of the data item identified by the data name _chapter_number have two values, 5.2 and 5.3. Likewise the instances of the data item identified by _chapter_title have two values.
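The column-like reading of a simple loop can be illustrated with a minimal Python sketch. It is not part of the STAR specification or of any existing STAR library: the function name read_simple_loop is hypothetical, the sample text is reconstructed from the values quoted in the paragraph above (the second title is a placeholder), and shell-style tokenizing via shlex is only an approximation to the STAR quoting rules given in Chapter 2.1.

```python
import shlex

# Illustrative sample only. The two chapter numbers and the first title are
# quoted in the text; the second title is a placeholder, not the real value.
SAMPLE = """
loop_
  _chapter_number
  _chapter_title
  5.2 'STAR File utilities'
  5.3 'Second chapter title (placeholder)'
"""


def read_simple_loop(text):
    """Return {data_name: [values, ...]} for one un-nested loop_ list.

    Assumption: shell-style tokenizing is good enough for this sample.
    Real STAR quoting rules differ in detail.
    """
    tokens = shlex.split(text)
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected the text to start with loop_")

    # Loop header: every token beginning with an underscore is a data name.
    names = []
    position = 1
    while position < len(tokens) and tokens[position].startswith("_"):
        names.append(tokens[position])
        position += 1

    values = tokens[position:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the header length")

    # Values follow in header order, so every len(names)-th value
    # belongs to the same data name.
    return {name: values[k::len(names)] for k, name in enumerate(names)}


columns = read_simple_loop(SAMPLE)
print(columns["_chapter_number"])  # ['5.2', '5.3']
```

Each data name can be looked up on its own in the returned dictionary, without reference to the other names declared in the same loop header.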

Note an important point: the example has been chosen to suggest to the reader a tabular relationship between the two data items, and in many STAR File applications such a relationship is intended and perhaps formalized through an external dictionary defining the relationships between these data names. However, the existence of such a relationship is not mandated by the STAR File syntax. It is legitimate for a generic STAR application to extract a single data item from such an aggregated loop without making any supposition about its relationship with other data items in the same loop. (It should be emphasized that in practice such physical juxtaposition of data items will almost invariably represent a real relationship, and that most application-specific programming will depend on this fact; but it is not an essential component of STAR in its most abstract form.)

It is also axiomatic that the ordering of the multiple values within a list structure has no intrinsic significance in the STAR paradigm. (Again, specific applications may override this by enforcing an ordering, but this is not fundamental to STAR.)

5.2.2.2. Loop packets and context within lists


Where multiple data names are declared in a loop header, STAR does, however, enforce the notion of a 'loop packet'. The loop packet is the data structure that includes all the individual data values at a particular iteration through the loop. Hence, in the simple example above, the values 5.2 and 'STAR File utilities' form the tuple of values in a single loop packet. For the single level of loop considered so far, the loop packet plays the role of a table row.
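For a single loop level, the grouping of values into loop packets can be sketched as follows. The function name loop_packets is hypothetical, and the second chapter title is a placeholder rather than a value taken from the original example.

```python
def loop_packets(names, values):
    """Group a flat list of loop values into packets, one tuple per iteration."""
    width = len(names)
    if width == 0 or len(values) % width != 0:
        raise ValueError("value count is not a multiple of the header length")
    return [tuple(values[i:i + width]) for i in range(0, len(values), width)]


names = ["_chapter_number", "_chapter_title"]
values = ["5.2", "STAR File utilities", "5.3", "Second chapter title (placeholder)"]

packets = loop_packets(names, values)
print(packets[0])  # ('5.2', 'STAR File utilities')
```

Each tuple corresponds to one table row; the data names supply the column headings.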

For nested loops, the situation is more complex. Consider Fig. 5.2.2.1, which is an example of quantum chemistry basis sets for hydrogen and lithium. (The examples in this chapter are derived from various test applications, and do not represent specific adopted exchange protocols in the selected subject areas.) For each element, a list of basis sets is presented, each containing a set of parameters and a table of functional values. At the outermost level of looping in this example, a loop packet comprises all the data associated with an individual atom type, for example hydrogen. At the next inner level of looping, a loop packet corresponds to an individual basis set (including its embedded table of coefficients). At the innermost loop level, a loop packet is simply a row within a table of exponents and coefficients of the basis set function.

Figure 5.2.2.1

Example quantum chemistry basis set functions in STAR File format.

If one were to treat this example file as a database of indeterminate structure and query the values associated with one of the data names, for example, _basis_set_function_exponent, one would retrieve a series of strings 1.3324838E+01, 2.0152720E-01 etc. However, the value strings in themselves are insufficient to allow the reconstitution of any data structure in the file. One also needs an expression of the levels within the nested loop structure at which the values were located, and an indication that they were associated with different packets of information at those various levels. This additional information about the context of each value is sufficient to determine its position within the data structure without any other a priori information regarding the data model. The context is most easily expressed by listing the output values in STAR File format.

Fig. 5.2.2.2 is an output listing of the requested values for this example, where the context is expressed as the innermost of three nested loop levels and distinct packets at this level are indicated. It will be seen also that by tracing the disposition of stop_ words the embedding within higher-level loop packets can also be inferred.
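The idea of returning each value together with its position in the nested loop structure can be sketched with an in-memory model. Everything in this sketch other than the data name _basis_set_function_exponent is an assumption made for illustration: the other data names, the nested-dictionary representation and the values "e1" to "e4" are placeholders and are not taken from Fig. 5.2.2.1.

```python
# Hypothetical in-memory model of a three-level nested loop. Each packet holds
# the items declared at its own level plus a list of packets one level down.
DATA = [
    {"items": {"_atom_symbol": "H"}, "inner": [
        {"items": {"_basis_name": "A"}, "inner": [
            {"items": {"_basis_set_function_exponent": "e1"}},
            {"items": {"_basis_set_function_exponent": "e2"}},
        ]},
        {"items": {"_basis_name": "B"}, "inner": [
            {"items": {"_basis_set_function_exponent": "e3"}},
        ]},
    ]},
    {"items": {"_atom_symbol": "Li"}, "inner": [
        {"items": {"_basis_name": "A"}, "inner": [
            {"items": {"_basis_set_function_exponent": "e4"}},
        ]},
    ]},
]


def find_with_context(packets, name, path=()):
    """Yield (path, value) pairs for every instance of a data name.

    The path holds one packet index per loop level, outermost first, so its
    length is the nesting level at which the value was found.
    """
    for index, packet in enumerate(packets):
        here = path + (index,)
        if name in packet["items"]:
            yield here, packet["items"][name]
        yield from find_with_context(packet.get("inner", []), name, here)


for path, value in find_with_context(DATA, "_basis_set_function_exponent"):
    print(path, value)
# (0, 0, 0) e1
# (0, 0, 1) e2
# (0, 1, 0) e3
# (1, 0, 0) e4
```

The index paths carry the same information that the STAR-format listing conveys through loop levels and stop_ words: which values share an inner packet, and which outer packets enclose them.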

Figure 5.2.2.2

Retrieval from the example file in Fig. 5.2.2.1 of the value of _basis_set_function_exponent with associated context.

5.2.2.3. Context in data sets


Another indicator of context in the previous example is the data-block header, which was reproduced in the output of Fig. 5.2.2.2.

The STAR File allows data instances in three types of location: in a data block, in a save frame or in a global block.

The usual way to partition a STAR File is by data blocks; each such block represents a data set in which a data name (associated with a single or multiple values) may be declared once only.

Data blocks may include save frames. A save frame is an encapsulated subsidiary data set, effectively insulated from the contents of the surrounding data block, in which data items may occur that have the same names as items in the parent data block. Indeed, 'parent' is potentially a misleading term, since no relationship is implied between the data within a save frame and those in the data block in which the save frame occurs. A reference to a save frame may, however, occur as a data value within the data block where the save frame is specified. Recall from Section 2.1.3.6 that save frames within a data block are uniquely identified by the framecode header.

Global blocks may also occur in a STAR File, preceding or interspersed between data blocks. For each data item defined within a global block, that definition is inherited by each succeeding data block that does not contain an internal definition of a data item with the same name. If there is a definition of a data item with the same name within a data block, that internal definition overrides the global definition within that data block. The situation is then re-evaluated in the next data block. If that data block does not contain an internal definition, the global definition holds.
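The inheritance rule can be sketched as a single pass over the blocks in file order. The representation of a block as a (kind, name, items) tuple and the function name resolve_blocks are assumptions made for this sketch. It also assumes that a later global block replaces earlier global definitions of the same data name for the blocks that follow it.

```python
def resolve_blocks(blocks):
    """Return {data_block_name: effective items} for blocks in file order.

    blocks is a list of (kind, name, items) tuples, where kind is "global"
    or "data" and items maps data names to values.
    """
    inherited = {}   # global definitions in force at the current position
    resolved = {}
    for kind, name, items in blocks:
        if kind == "global":
            # Assumption: a later global definition replaces an earlier one.
            inherited.update(items)
        elif kind == "data":
            effective = dict(inherited)   # start from the inherited values
            effective.update(items)       # internal definitions override them
            resolved[name] = effective
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    return resolved


blocks = [
    ("global", None, {"_example": "global value"}),
    ("data", "first", {}),
    ("data", "second", {"_example": "local value"}),
    ("data", "third", {}),
]

for block_name, items in resolve_blocks(blocks).items():
    print(block_name, items["_example"])
# first global value
# second local value
# third global value
```

The override in the second block does not leak into the third, because each data block starts again from the global definitions.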

The scope of data values is well defined (see Section 2.1.3.9). Only data expressed in a global block have values that are inherited in later portions of the STAR File. Data values in data blocks or save frames are restricted in scope to the current data block or save frame, respectively.

A consequence of these rules of scope and encapsulation is that a full description of the context of a STAR data value must also reflect any values carried through as global data or by de-referencing associated save frames. The results are not always intuitive.

Consider Fig. 5.2.2.3, which represents a partial description of a chemical reaction where one of the reactants is expressed as a generic structure described by the save frame save_R1. However, the generic structure in this case is restricted to a small number of alkyl groups, each described in its own save frame. Setting aside this prior knowledge, we see that a request for _atom_identity_symbol must return not only the data values in their embedded save frames, but also the save frames in their entirety and the higher-order data values that reference the matching save frames. It is only in this way that we can guarantee that the value can be used by any application. Fig. 5.2.2.4 demonstrates the full context of the returned requested data values.

Figure 5.2.2.3

Example STAR data structure where save frames encapsulate related data sets. See text for details.

Figure 5.2.2.4

Context for the requested values of _atom_identity_symbol in the preceding example. See text for details.

Notice that the requested item occurs (among other places) in the save frame save_carboxylic_acid and this instance of the item is presented solely in the context of the save header and closure strings (it is shown in italics in Fig. 5.2.2.4). However, one of the values extracted from this location is the save-frame reference pointer $R1 that identifies save_R1, and the complete contents of this save frame are presented (because the data structure represented by the save frame is itself one of the values of the requested data item). Further de-referencing of the save-frame pointers within save_R1 results in the extraction also of the complete save frames save_methyl and save_ethyl. In this example it is coincidental that there are instances of the requested data item (_atom_identity_symbol) within these returned save frames as well.

Notice, however, that establishing the full context of the returned data demands also that data values referencing the save_carboxylic_acid frame be presented. In this example, the value of _reaction_component_symbol at the outermost level of the data block is returned, a result that may at first seem surprising. It is only in this way that one can be sure that an arbitrary application will have access to the full semantic information carried by the data item.
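The three kinds of context described above (frames that contain the requested item, frames reached by de-referencing, and outer values that point at the matching frames) can be sketched in a simplified model. This is not the algorithm of any particular STAR utility. The frame codes and the two data names _atom_identity_symbol and _reaction_component_symbol follow the text; the data name _alternative_group and all frame contents are hypothetical stand-ins for Fig. 5.2.2.3.

```python
def is_reference(value):
    """A value beginning with '$' is treated as a save-frame pointer."""
    return value.startswith("$")


def request_with_context(top_items, frames, name):
    """Return (hits, reached, referrers) for a requested data name.

    hits:      frame codes whose frame contains the requested item
    reached:   frame codes found by following pointers out of the hits
    referrers: data-block items whose values point at one of the hits
    """
    hits = [code for code, items in frames.items() if name in items]

    reached = []
    pending = [v[1:] for code in hits for v in frames[code][name] if is_reference(v)]
    while pending:
        code = pending.pop(0)
        if code in reached or code not in frames:
            continue
        reached.append(code)
        # Keep de-referencing pointers found anywhere in the returned frame.
        for values in frames[code].values():
            pending.extend(v[1:] for v in values if is_reference(v))

    referrers = {
        item: values
        for item, values in top_items.items()
        if any(is_reference(v) and v[1:] in hits for v in values)
    }
    return hits, reached, referrers


# Hypothetical contents, loosely modelled on the description in the text.
top_items = {"_reaction_component_symbol": ["$carboxylic_acid"]}
frames = {
    "carboxylic_acid": {"_atom_identity_symbol": ["C", "O", "O", "H", "$R1"]},
    "R1": {"_alternative_group": ["$methyl", "$ethyl"]},
    "methyl": {"_atom_identity_symbol": ["C", "H", "H", "H"]},
    "ethyl": {"_atom_identity_symbol": ["C", "C", "H", "H", "H", "H", "H"]},
}

hits, reached, referrers = request_with_context(top_items, frames, "_atom_identity_symbol")
print(hits)       # ['carboxylic_acid', 'methyl', 'ethyl']
print(reached)    # ['R1', 'methyl', 'ethyl']
print(referrers)  # {'_reaction_component_symbol': ['$carboxylic_acid']}
```

In this model save_methyl and save_ethyl appear both as direct hits and as de-referenced frames, which mirrors the coincidence noted in the text.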

Data values declared in global blocks should be presented in the same spirit of supplying the complete context in which the value was instantiated, and not simply the value in isolation. For example, given the trivial STAR File [Scheme scheme3] a request for _example should return the identical file, not the interpolated result [Scheme scheme4] despite the latter's equivalence purely in terms of the non-contextual values returned.
