Constructing a DDL2 dictionary

McMahon, B.

doi:10.1107/97809553602060000733

International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

pdf | chapter contents | chapter index | related articles

International Tables for Crystallography (2006). Vol. G. ch. 3.1, pp. 79-83

Section 3.1.6. Constructing a DDL2 dictionary

B. McMahon^a ^*

^a International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
Correspondence e-mail: bm@iucr.org

3.1.6. Constructing a DDL2 dictionary

| top | pdf |

The DDL2 dictionary definition language was designed to specify a relational data model and has provision for including within a dictionary tables of relationships between data entries. Like a relational database which contains tables describing the data tables in the database, DDL2-based dictionaries contain definition blocks describing CIF categories, units and relationships as well as data items.

Unlike DDL1 dictionaries, a DDL2 dictionary is presented as a single data block. Within this data block a number of looped lists describe properties of the dictionary as a whole, or properties and relationships shared across the items defined in the dictionary. Typically these are: the dictionary name, version identifiers and revision history; the category groupings that give structure to the items defined by the dictionary; the labels that identify closely related data items; and the physical units employed in the dictionary, their definitions in terms of base units and their interconversion factors.

Definitions of individual data items and categories are contained within save frames. While the save frames are not referenced by name in any dictionary application, they permit multiple occurrences of data definition tags within the scope of a single data block and are therefore suitable for structuring a data dictionary. It is a convention that the name of a save frame defining a category is given in capitals, and the name of a save frame for a definition of a data item is given as lower-case. For example, save_ATOM_SITE is the name of the save frame defining the category with the atom_site identifier, while save_ _atom_site.details is the name of the save frame holding the definition of the individual data name _atom_site.details (note how the initial underscore character of the data name is preserved following the initial save_ string of the save-frame name).

As with DDL1 dictionaries, the name of the dictionary itself (given by the data name _dictionary.title) is usually of the form cif_identifier.dic, where the identifier is a short code for the topic area of the dictionary (e.g. `img' for the image dictionary, `sym' for the symmetry dictionary).

As is invariable with DDL2 data names, the names themselves are formed from the category name separated by a full stop from the specific descriptor of the item.

Fig. 3.1.6.1 shows the structure of the macromolecular CIF dictionary. The ordering of the various looped lists and save frames is of no significance for machine parsing. The sole data block has the same name as the dictionary title string and the data block is introduced by the dictionary identification data items. The dictionary revision history introduces the file, followed by information about the extended data types and physical units used within the current dictionary. These are followed by the lists of closely related items (corresponding to `irreducible sets' in DDL1 dictionaries and called `subcategories' in the terminology of DDL2) and lists of category groupings. The body of the dictionary contains category and item definitions. Each category definition is followed by the definitions of its component data items. The ordering is alphabetic by category and then alphabetic by item name within categories.

Figure 3.1.6.1 | top | pdf |

Schematic structure of the macromolecular CIF dictionary. (a) Dictionary identifiers. (b) Dictionary history. (c) Subcategory and category group listings. (d) Data types, units descriptions and conversion tables. (e) Multiple category and item definition blocks.

3.1.6.1. Dictionary identification

| top | pdf |

Dictionary files must contain information that unambiguously states their identity and version. In DDL2-based dictionaries this is done using the dictionary attributes described in Section 2.6.6.4 . The name of the data block comprising the whole content of a DDL2 dictionary is by convention the same as the dictionary identification string given as _dictionary.title. This value is repeated as the value of _dictionary.datablock_id (see Example 3.1.6.1) for use in checking the consistency of the dictionary.

Example 3.1.6.1. DDL2 dictionary identification entries.

[Scheme scheme10]

The dictionary history is also an important audit record of changes to the dictionary content. Unlike in DDL1-based dictionaries where the history is contained in a single field, DDL2 provides a looped list of version labels, dates and annotations. For convenience, the history records in large DDL2-based dictionaries are sometimes placed at the end of the dictionary file.

3.1.6.2. Subcategory definitions

| top | pdf |

In the DDL1 formalism, particular relationships between data items may sometimes be stated within a text description or may be implied by the organization of the dictionary (where several data items are defined in the same data block and are understood to share the common attributes itemized in that data block).

Within DDL2, there are mechanisms for more formal and machine-parsable statements of relationships. The _sub_category.id attribute is a label shared by several data items within a category that are related in a specific way described by the associated _sub_category.description attribute. The relationships may be rather general, such as elements of a matrix; or they may be specific physical properties or attributes, such as the collection of axis lengths of a unit cell. The dictionary should list all such labels that occur within its included data definition blocks. Example 3.1.6.2 is an extract from the macromolecular dictionary.

Example 3.1.6.2. DDL2 subcategories defined in the mmCIF dictionary.

[Scheme scheme11]

3.1.6.3. Category groupings

| top | pdf |

In the DDL2 data model, a category of data corresponds to a set of related data items that may be stored in a single relational database table. A number of such tables may collectively describe the complete properties of some physical object. This is expressed formally by assigning the same label ( _category_group.id) to the relevant categories. While relationships between categories are implied in DDL1 dictionaries by the hierarchical structure of the names of data items, in DDL2 dictionaries the relationships are formally stated.

For subcategories, the category-group relationships present in the dictionary are listed in a separate looped list. Example 3.1.6.3 is an extract from the macromolecular dictionary. The inclusive_group entry shows the common parentage of all categories (and ultimately all data items) in the dictionary.

Example 3.1.6.3. Category groups in a DDL2 dictionary.

[Scheme scheme12]

3.1.6.4. Category definitions

| top | pdf |

In the DDL2 formalism, a category of data items may be mapped to a relational table. The dictionary entry for a category includes the name of the category (an identifying label which is referenced by the _item.category_id attribute of each component data item) and a list of the category groups of which it may be considered a member. The category key is explicitly specified – that is, the data item (or group of items) that uniquely identifies an individual row in a table of data of that category.

Where a category encompasses a set of data items that are not normally specified in a looped list, the category may nevertheless be taken to represent a degenerate table with a single row, and therefore there is still a category key. For degenerate categories the key value is often set equal to the name of the parent data block.

Example 3.1.6.4 shows a category of non-looped core data items. It may be compared with the DDL1 version in Example 3.1.5.2.

Example 3.1.6.4. A category description in a DDL2 dictionary.

[Scheme scheme13]

For categories of looped items (those normally presented in a table of values) it is sometimes appropriate to have as the category key a data item that has the sole function of indexing unique table rows. However, it is also often the case that a composite key is formed from existing data items, and in these cases the category definition must loop the components of the key, as in Example 3.1.6.5 from the macromolecular dictionary definition of the GEOM_BOND category.

Example 3.1.6.5. A DDL2 category with a composite key.

[Scheme scheme14]

It must be remembered that, in practice, data files may lack some of the items required to determine the category key formally. For example, in the data set given in the GEOM_BOND example here, it is possible that the _geom_bond.site_symmetry_ items may be absent because the listing is for a single connected molecule within an asymmetric unit. Robust parsing software must construct data keys by assigning NULL or other suitable default values to the missing key components.

Careful inspection of corresponding definitions in the DDL1 and DDL2 versions of core data items will demonstrate that the explicit category key specification in DDL2 dictionaries may be deduced within DDL1 dictionaries from the appropriate _list_reference, _list_mandatory and _list_uniqueness attributes of data-item definitions within a category (see also Section 2.5.6.4 ).

3.1.6.5. Data-item definitions

| top | pdf |

The bulk of a DDL2 data dictionary comprises the save frames that include descriptions of the meaning and properties of individual data names.

Unlike DDL1 dictionaries, where the definitions of several data names may be contained in a single data block (most commonly for a set of items that form a logical irreducible set), save frames in DDL2 dictionaries each contain the definition for a single addressable concept.

For example, the three Miller index components of a diffraction reflection ( _diffrn_refln_index_h, _diffrn_refln_index_k, _diffrn_refln_index_l that are described in the DDL1 core CIF dictionary in the data block data_diffrn_refln_) are described in a DDL2 dictionary in three separate save frames, save_ _diffrn_refln.index_h, save_ _diffrn_refln.index_k and save_ _diffrn_refln.index_l. In the DDL2 formalism, the intimate relationship between these three components is expressed through the common _item_sub_category.id value of miller_index and the mutual reference of the other Miller-index components by the _item_dependent.dependent_name entries in each separate save frame.

An apparent exception to this general rule is the case of save frames defining an item, often a category key, that is an identifier common to several categories. In this case, the save frame defining the `parent' identifier implicitly defines the complete property set of each child identifier. For completeness, the respective child identifiers are each declared in their own save frames, but these act only as back references to the parent definition. This is explained more completely in Section 3.1.6.5.1 below.

3.1.6.5.1. Inheritance of identifiers

| top | pdf |

Example 3.1.6.6 is from an mmCIF of two related categories that describe characteristics of an active site in a macromolecular complex. The sites are described in general terms with a label and textual description in the STRUCT_SITE category (the first looped list in the example). Details of how each site is generated from a list of structural features form the STRUCT_SITE_GEN category (second loop or table).

Example 3.1.6.6. Illustration of parent/child relationships between identifiers in related categories.

[Scheme scheme15]

It is clear that each instance of the data item _struct_site_gen.site_id in the second table must have one of the values listed as _struct_site.id in the first loop, because it is the purpose of these identifiers to relate the two sets of data: they are the glue between the two separate tables and must have the same values to ensure the referential integrity of the data set (that is, the consistency and completeness of cross-references between tables). Within a group of related categories like this, it is normal to consider one as the `parent' and the others as `children'.

Because all such linking data items must have compatible attributes, it is conventional in DDL2 dictionaries to define all the attributes in a single location, namely the save frame which hosts the definition of the `parent' data item. In early drafts of DDL2 dictionaries, the `children' were not referenced at all in separate save frames; software validating a data file against a dictionary was required to obtain all information about a child identifier from the contents of the save frame defining the parent. However, subsequent drafts introduced a minimal save frame for the children to accommodate dictionary browsers that depended on the existence of a separate definition block for each individual data item.

Consequently, the definition blocks in current DDL2 dictionaries conform to the structure in Example 3.1.6.7, which refers to the simple STRUCT_SITE example used above.

Example 3.1.6.7. A definition of an identifier which is parent to identifiers in other categories.

[Scheme scheme16]

Note that the dependent data names are listed twice: once in the loop that declares their _item.name values and the categories with which they are associated; and again in a loop that makes the direction of the relationship explicit. A parent data item may have several children, but each child can have only a single parent (i.e. related data name whose value may be checked for referential integrity). Note also that each listed item has an _item.mandatory_code value of yes: because they are identifiers which link categories, they must be present in a table to allow the relationships between data items in different tables to be traced.

Other than the specific description text field, any declared attributes (in this example only the data type) have a common value across the set of related identifiers.

As mentioned above, it is not formally necessary to have a separate save frame for the individual children; but it is conventional to have such individual save frames containing minimal definitions that serve as back references to the primary information in the parent frame. These also provide somewhere for the specific text definitions for the children to be stored. The definition frame for _struct_site_gen.id is shown in Example 3.1.6.8.

Example 3.1.6.8. Definition of a child identifier.

[Scheme scheme17]

3.1.6.5.2. Definitions of single quantities

| top | pdf |

While it is important to ensure the referential integrity of the data in a CIF through proper book-keeping of links between tables, the crystallographer who wishes to create or extend a CIF dictionary will be more interested in the definitions of data items that refer to real physical quantities, the properties of a crystal or the details of the experiment. The DDL2 formalism makes it easy to create a detailed machine-readable listing of the attributes of such data.

Example 3.1.6.9 parallels the example chosen for DDL1 dictionaries of the ambient temperature during the experiment.

Example 3.1.6.9. DDL2 definition of a physical quantity.

[Scheme scheme18]

In the definition save frame, the category is specifically listed (although it is deducible from the DDL2 convention of separating the category name from the rest of the name by a full stop in the data name). The data type is specified as a floating-point number. (In the core dictionary there are fewer data types and the fact that the value may be a real rather than integer number must be inferred from the declared range.) The range of values is also specified with separate maximum and minimum values (unlike in DDL1 dictionaries, which give a single character string that must be parsed into its component minimum and maximum values). The assignment of the same value to a maximum and a minimum means that the absolute value is permitted; without the repeated `0.0' line the range in this example would be constrained to be positive definite; the equal value of 0.0 for maximum and minimum means that it may be identically zero.

The _item_units.code value must be one of the entries in the units table for the dictionary and can thus be converted into other units as specified in the units conversion table.

The aliases entries identify the corresponding quantity defined in the DDL1 core dictionary.

3.1.6.6. Units

| top | pdf |

As with data files described by DDL1 dictionaries, the physical unit associated with a quantitative value in a DDL2-based file is specified in the relevant dictionary. There is no option to express the quantity in other units. However, DDL2 permits a dictionary file to store not only a table of the units referred to in the dictionary (listed under _item_units_list.code and the accompanying descriptive item _item_units_list.detail), but also a table specifying the conversion factors between individual codes in the _item_units_list.code list. In principle, this allows a program to combine or otherwise manipulate different physical quantities while handling the units properly.

References

International Tables for Crystallography (2006). Vol. G. ch. 3.1, pp. 79-83