Detailed DDL2 specifications

Westbrook, J. D.; Berman, H. M.; Hall, S. R.

doi:10.1107/97809553602060000732

International Tables for Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 2.6, pp. 65-70

J. D. Westbrook,^a ^* H. M. Berman^a and S. R. Hall^b

^a Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA, and ^bSchool of Biomedical and Chemical Sciences, University of Western Australia, Crawley, Perth, WA 6009, Australia
Correspondence e-mail: jwest@rcsb.rutgers.edu

2.6.6. Detailed DDL2 specifications

DDL2 is presented here (Chapter 4.10 ) in the form of a dictionary that is defined in terms of its own definitional elements. This self-consistent description not only provides a prototype for other application dictionaries, but also provides a mechanism by which the consistency and relational integrity of the DDL data model can be independently verified. DDL2 defines a relatively simple set of organizational elements including data blocks, categories, category groups, subcategories and items. Data dictionaries (e.g. mmCIF) apply these elements provided by the DDL to describe the knowledge base of an application domain. The following sections provide detailed specifications of each definitional element of DDL2.

2.6.6.1. DDL2 definitions describing data items

In this section, the DDL2 categories that describe the properties of data items are presented. Figs 2.6.4.1 and 2.6.4.2 illustrate the organization of definitional elements in these categories.

2.6.6.1.1. ITEM

The category named ITEM is used to assign membership of data items to categories. This category forms the bridge between the category and data-item levels of abstraction. The key data item in this category is the full data-item name, _item.name. This name contains both the category and data-item identifiers, and is thus a unique identifier for the data item. The category identifier, _item.category_id, is included in this category as a separate mandatory data item. This has been done to provide an explicit reference to those categories that use the category identifier as a unique identifier.
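The structure of the full item name can be illustrated with a minimal Python sketch. The function name `split_item_name` is hypothetical; the only assumption taken from the text is that a full name has the form `_<category>.<item>`.

```python
def split_item_name(full_name: str) -> tuple[str, str]:
    """Split a full item name such as '_item.name' into (category_id, item_id)."""
    if not full_name.startswith("_") or "." not in full_name:
        raise ValueError(f"not a DDL2-style item name: {full_name!r}")
    # Drop the leading underscore, then split at the first period only.
    category_id, item_id = full_name[1:].split(".", 1)
    if not category_id or not item_id:
        raise ValueError(f"empty category or item identifier in {full_name!r}")
    return category_id, item_id


print(split_item_name("_item_linked.parent_name"))
```

Because the category identifier can always be recovered in this way, storing `_item.category_id` separately is redundant in principle, which is the point made in the paragraph that follows.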

One could alternatively use the category and item identifiers as the basis for this category rather than the concatenated form of the item name, and thus eliminate the redundant specification of the category identifier. The full name has been used here in order to provide compatibility with existing applications.

The item category also includes a code to indicate whether a data item is mandatory in a category and therefore must be included in any tuple of items in the category. This code, _item.mandatory_code, may have three values: yes, no and implicit. This last named value indicates that the item is mandatory, but that the value of this item may be derived from the context. In the case of an item name or a category identifier, these values can be obtained from the current save-frame name. Implicit specification dramatically simplifies the appearance of each dictionary definition because it avoids the repeated declaration of item names and category identifiers that are basis components or the unique identifiers for most categories.

Although the data item _item.name is the basis for all of the item-level categories, its definition and properties need only be specified at a single point. Here, the data items that occur in multiple categories are defined only in the parent category. In certain situations, a child data item may be used in a manner which requires a description distinct from the parent data item. For instance, _item_linked.parent_name and _item_linked.child_name are both data-item names as well as children of _item.name, but clearly the manner in which these items are used in the ITEM_LINKED category requires additional description. It is important to note that although the design of this DDL supports the definition of data items in multiple categories within the parent category, it is also possible to provide separate complete definitions within each category.

2.6.6.1.2. ITEM_ALIASES

The DDL category ITEM_ALIASES defines the alias names that can be substituted for a data-item name. The alias mechanism also provides a means of identifying items by names other than those that follow the naming conventions used in this DDL. This feature should be used primarily to guarantee the stability of names defined in previously published dictionaries. The items _item_aliases.name, _item_aliases.dictionary and _item_aliases.version form the key for this category. The items _item_aliases.dictionary and _item_aliases.version are provided to distinguish between dictionaries and different versions of the same dictionary. Any number of unique alias names can be defined for a data item.
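A minimal sketch of alias resolution, assuming the aliases have been loaded into a Python mapping keyed by the three key items (alias name, dictionary, version). The function `resolve_alias` and the single table entry are illustrative assumptions, not content taken from a particular dictionary.

```python
# Key: (_item_aliases.name, _item_aliases.dictionary, _item_aliases.version)
# Value: the DDL2 item name that the alias stands for.
# The entry below is illustrative only.
ALIASES = {
    ("_atom_site_Cartn_x", "cif_core.dic", "2.0.1"): "_atom_site.Cartn_x",
}


def resolve_alias(name, dictionary, version, aliases=ALIASES):
    """Return the DDL2 item name for an alias, or the name unchanged if no alias matches."""
    return aliases.get((name, dictionary, version), name)


print(resolve_alias("_atom_site_Cartn_x", "cif_core.dic", "2.0.1"))
```

Keying the table on all three items mirrors the category key, so the same alias name can be mapped differently for different dictionaries or versions.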

2.6.6.1.3. ITEM_DEFAULT

The DDL category ITEM_DEFAULT holds default values assigned to data items. Default data values are specified in item _item_default.value. Default values are assigned to data items that are not declared within a category. The key item for this category, _item_default.name, is a child of _item.name. A single default value may be specified for a data item.

2.6.6.1.4. ITEM_DEPENDENT

The ITEM_DEPENDENT category defines dependency relationships among data items within a category. Each data item on which a particular data item depends is specified as an item _item_dependent.dependent_name. For a data item to be considered completely defined, each of its dependent data items must also be specified.
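The completeness rule can be sketched as a simple check over one tuple (row) of a category. The function name `missing_dependents` and the item names used in the example are hypothetical.

```python
def missing_dependents(row, dependents):
    """Report dependent items that are absent from one row of a category.

    row        -- mapping of item name -> value for a single tuple
    dependents -- mapping of item name -> list of item names it depends on
    Returns {item name: [missing dependent item names]}.
    """
    problems = {}
    for name, required in dependents.items():
        if name in row:
            missing = [dep for dep in required if dep not in row]
            if missing:
                problems[name] = missing
    return problems


# Hypothetical dependency: an x component requires y and z to be given too.
DEPENDENTS = {"_demo_vector.x": ["_demo_vector.y", "_demo_vector.z"]}
print(missing_dependents({"_demo_vector.x": 1.0, "_demo_vector.y": 0.0}, DEPENDENTS))
```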

2.6.6.1.5. ITEM_DESCRIPTION

The DDL category ITEM_DESCRIPTION holds a description for each data item. The key item for this category is _item_description.name, which is defined in the parent category ITEM. The text of the item description is held by data item _item_description.description. A single description may be provided for each data item.

2.6.6.1.6. ITEM_ENUMERATION

The DDL category ITEM_ENUMERATION holds lists of permissible values for a data item. Each enumerated value is specified in item _item_enumeration.value, each of which may have an associated description item _item_enumeration.detail. The combination of items _item_enumeration.name and _item_enumeration.value form the key for this category. The parent definition of the former item is defined in the category ITEM. Multiple unique enumeration values may be specified for each data item.

2.6.6.1.7. ITEM_EXAMPLES

The DDL category ITEM_EXAMPLES is provided to hold examples associated with individual data items. An example specification consists of the text of the example, _item_examples.case, and an optional comment item, _item_examples.detail, which can be used to qualify the example. Multiple examples may be provided for each item.

2.6.6.1.8. ITEM_LINKED

The ITEM_LINKED category defines parent–child relationships between data items. This provides the mechanism for specifying the relationships between data items that may exist in multiple categories. Link relationships are most commonly defined between key items, which form the keys for many different categories.

In the DDL definition, all child relationships are expressed within the parent category.

Because the item _item_linked.parent_name has been defined as an implicit item, the child relationships can be specified most economically in the parent category where the parent item name can be automatically inferred. If link relationships are specified in a child category, then both parent and child item names must be specified.

Both parent and child item names in this category are children of _item.name, which ensures that all link relationships can be properly resolved. However, it is possible to define cyclical link relationships within this category. Any implementation of this DDL category should include a method to check for the existence of such pathological cases.
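One possible form of such a check is a depth-first search over the link graph. The sketch below assumes the links are available as (parent name, child name) pairs; the function name `find_link_cycle` and the item names in the example are hypothetical.

```python
def find_link_cycle(links):
    """Return one cycle as a list of item names, or None if the links are acyclic.

    links -- iterable of (parent_name, child_name) pairs
    """
    children = {}
    for parent, child in links:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNVISITED for name in children}
    path = []

    def visit(name):
        state[name] = IN_PROGRESS
        path.append(name)
        for nxt in children[name]:
            if state[nxt] == IN_PROGRESS:
                # nxt is already on the current path: a cycle has been closed.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[name] = DONE
        return None

    for name in children:
        if state[name] == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


print(find_link_cycle([("_a.id", "_b.id"), ("_b.id", "_c.id"), ("_c.id", "_a.id")]))
```

The recursive form is adequate for dictionaries of modest size; a very deep link chain would call for an iterative traversal instead.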

2.6.6.1.9. ITEM_METHODS

The ITEM_METHODS category is used to associate method identifiers with data items. Any number of unique method identifiers may be associated with a data item. The method identifiers reference the full method definitions in the parent METHOD_LIST category.

2.6.6.1.10. ITEM_RANGE

The ITEM_RANGE category defines a restricted range of permissible values for a data item. The restrictions are specified as one or more sets of the items _item_range.minimum and _item_range.maximum. These items give the lower and upper bounds for a permissible range. To specify that an item value may be equal to the upper or lower bound of a range, the minimum and maximum values of the range are equated. The special STAR value indicating that a data value is not appropriate (denoted by a period, '.') can be used to avoid expressing an upper or lower bound value. When limits are applied to character data, comparisons are made following the collating sequence of the character set. When limits are applied to abstract data types, methods must be provided to define any comparison operations that must be performed to check the boundary conditions.
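A minimal sketch of range checking for numeric data, under the assumption that the bounds of a range are exclusive unless the minimum and maximum are equal, in which case that single value is permitted. The function name `in_range` is hypothetical, and character or abstract data types would need their own comparison rules.

```python
def in_range(value, ranges):
    """Return True if value satisfies at least one (minimum, maximum) pair.

    ranges -- list of (minimum, maximum) strings; '.' means "no bound given".
    Assumes bounds are exclusive unless minimum == maximum.
    """
    for minimum, maximum in ranges:
        lo = None if minimum == "." else float(minimum)
        hi = None if maximum == "." else float(maximum)
        if lo is not None and lo == hi:
            # Equated bounds admit exactly that boundary value.
            if value == lo:
                return True
            continue
        if (lo is None or value > lo) and (hi is None or value < hi):
            return True
    return False


# An angle allowed to lie anywhere in the closed interval from 0 to 180.
ANGLE_RANGES = [("0.0", "0.0"), ("0.0", "180.0"), ("180.0", "180.0")]
print(in_range(180.0, ANGLE_RANGES), in_range(180.1, ANGLE_RANGES))
```

Under this reading, a closed interval is written as three ranges: one for each end point and one for the open interval between them.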

2.6.6.1.11. ITEM_RELATED

The ITEM_RELATED category describes specific relationships that exist between data items. These relationships are distinct from the parent–child relationships that are expressed in the category ITEM_LINKED. The related item is identified by the item _item_related.related_name, which is a child of _item.name.

Item relationships defined by _item_related.function_code in this category include the following (Table 2.6.5.1): an item is related to another item by a conversion factor; an item is a replacement for another item; an item is replaced by another item; an item is an alternative expression of another item; items differ only in some convention of their expression; and items express a set of related characteristics. One can also identify whether the declaration of an item is mutually exclusive with its alternative item. Multiple related items can be associated with each data item and multiple relationship codes can be specified for each related item.

2.6.6.1.12. ITEM_STRUCTURE

The ITEM_STRUCTURE category holds a code which identifies a structure definition that is associated with a data item. A structure in this context is a reusable matrix or vector definition declared in category ITEM_STRUCTURE_LIST. The data item _item_structure.code is a child of the item _item_structure_list.code. The item _item_structure.code provides an indirect reference into the list of structure-type definitions in category ITEM_STRUCTURE_LIST. The _item_structure.organization item describes the row/column precedence of the matrix organization.

2.6.6.1.13. ITEM_STRUCTURE_LIST

The ITEM_STRUCTURE_LIST category holds definitions of matrices and vectors that can be associated with data items. A component of the key for this category is _item_structure_list.code, which is referenced by _item_structure.code to assign a structure type to a data item. The definition of a structure involves the specification of a length for each dimension of the matrix structure. The combination of items _item_structure_list.code and _item_structure_list.index forms the key for this category. The latter index item is the identifier for the dimension, hence multiple unique dimensions can be specified for each structure code. The length of each dimension is assigned to _item_structure_list.dimension.
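As a sketch of how the (code, index) key organizes the dimensions, the following Python function assembles the shape of each structure from rows of this category. The function name `structure_shapes` and the structure codes in the example are hypothetical.

```python
def structure_shapes(rows):
    """Build {structure code: (length of dimension 1, length of dimension 2, ...)}.

    rows -- iterable of (code, index, dimension) tuples, one per category row
    """
    by_code = {}
    for code, index, dimension in rows:
        dims = by_code.setdefault(code, {})
        if index in dims:
            # (code, index) is the category key, so it must not repeat.
            raise ValueError(f"duplicate key ({code!r}, {index})")
        dims[index] = dimension
    return {
        code: tuple(dims[i] for i in sorted(dims))
        for code, dims in by_code.items()
    }


ROWS = [("matrix3x3", 1, 3), ("matrix3x3", 2, 3), ("vector3", 1, 3)]
print(structure_shapes(ROWS))
```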

2.6.6.1.14. ITEM_SUB_CATEGORY

The ITEM_SUB_CATEGORY category is used to assign subcategory membership for data items. A data item may belong to any number of subcategories. Each subcategory must be defined in a category named SUB_CATEGORY.

2.6.6.1.15. ITEM_TYPE

The ITEM_TYPE category holds a code that identifies the data type of each data item. The data item _item_type.code is a child of the item _item_type_list.code. Data-type definitions are actually made in the ITEM_TYPE_LIST parent category. The item _item_type.code provides an indirect reference into the list of data-type definitions in category ITEM_TYPE_LIST. This indirect reference is provided as a convenience to avoid the redeclaration of the full data-type specification for each data item. The key item for this category is _item_type.name, which is defined in the parent category ITEM. Only one data type may be specified for a data item.

2.6.6.1.16. ITEM_TYPE_CONDITIONS

The category ITEM_TYPE_CONDITIONS defines special conditions applied to a data-item type. This category has been included in order to comply with previous applications of STAR and CIF. Since the constructions that are embodied in this category are antithetical to the data model that underlies DDL2, it is recommended that this category only be used for the purpose of parsing existing data files and dictionaries.

2.6.6.1.17. ITEM_TYPE_LIST

The ITEM_TYPE_LIST category holds the list of item data-type definitions. The key item in this category is _item_type_list.code. Data types are associated with data items by references to this key from the ITEM_TYPE category. One of the data-type codes defined in this category must be assigned to each data item.

The definition of a data type consists of the specification of the item's primitive type and a regular expression that defines the pattern that must be matched by any occurrence of the item. The primitive type code, _item_type_list.primitive_code, can assume values of char, uchar, numb and null. This code is provided for backward compatibility with STAR and CIF applications that employ loose data typing. The data item _item_type_list.construct holds the regular expression that must be matched by the data type. Simple regular expressions can be used to define character fields of restricted width, floating-point and integer formats.

Molecular Information File (MIF) applications (Allen et al., 1995) have extended the notion of the regular expression to include data-item components. This permits the construction of complex data items from one or more component data items using regular expression algebra. These extended regular expressions are defined in the category ITEM_TYPE_CONDITIONS.

Example 2.6.6.1 illustrates the data types that are defined within this DDL. The DDL uses a number of character data types which have subtly different definitions. For instance, the data type identified as code defines a single-word character string; char extends the code type with the addition of a white-space character; and text extends the char type with the addition of a newline character. Two special character data types name and idname are used to define the full STAR data name and the STAR name components, respectively. The data type any is used to match any potential data type. This type is used for data items that may hold a variety of data types. The data type int is defined as one or more decimal digits and the yyyy-mm-dd type defines a date string.
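The role of the regular expression held in _item_type_list.construct can be sketched in Python as follows. The patterns below are assumptions written to match the verbal descriptions of the int and yyyy-mm-dd types given above; they are not copied from the DDL2 dictionary, and the function name `matches_type` is hypothetical.

```python
import re

# Hypothetical constructs keyed by data-type code.
TYPE_CONSTRUCTS = {
    "int": r"[0-9]+",                            # one or more decimal digits
    "yyyy-mm-dd": r"[0-9]{4}-[0-9]{2}-[0-9]{2}",  # a date string
}


def matches_type(value, type_code, constructs=TYPE_CONSTRUCTS):
    """Return True if the whole of value matches the construct for type_code."""
    return re.fullmatch(constructs[type_code], value) is not None


print(matches_type("2006-01-31", "yyyy-mm-dd"), matches_type("4.2", "int"))
```

`re.fullmatch` is used so that the entire value, not merely a prefix, must conform to the pattern. The date pattern checks only the layout of the string, not whether the date exists.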

Example 2.6.6.1. The description of permitted data types in the DDL2 dictionary.

[Scheme scheme6]

2.6.6.1.18. ITEM_UNITS

The ITEM_UNITS category holds a code that identifies the system of units in which a data item is expressed. The data item _item_units.code is a child of the item _item_units_list.code. Unit definitions are actually made in the ITEM_UNITS_LIST parent category. The item _item_units.code provides an indirect reference into the list of unit definitions in category ITEM_UNITS_LIST. This indirect reference is provided as a convenience to avoid the redeclaration of the full unit specification for each data item. The key item for this category is _item_units.name, which is defined in the parent category ITEM. Only one type of unit may be specified for a data item.

2.6.6.1.19. ITEM_UNITS_CONVERSION

The ITEM_UNITS_CONVERSION category holds a table of conversion factors between the systems of units described in the ITEM_UNITS_LIST category. The systems of units are identified by a *.from_code and a *.to_code, which are both children of the item _item_units_list.code. The conversion is defined in terms of an arithmetic operator and a conversion factor, _item_units_conversion.operator and _item_units_conversion.factor, respectively.
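A minimal sketch of how such a table might be applied, assuming the operator is one of the four characters `+`, `-`, `*` and `/` and that the converted value is obtained by applying the operator to the original value and the factor. The unit codes, table entries and function name `convert` are illustrative assumptions.

```python
import operator

# Assumed set of operator codes and their arithmetic meaning.
_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# Key: (from_code, to_code); value: (operator, factor). Entries are illustrative.
CONVERSIONS = {
    ("angstroms", "picometres"): ("*", 100.0),
    ("kelvins", "celsius"): ("-", 273.15),
}


def convert(value, from_code, to_code, table=CONVERSIONS):
    """Convert value between two unit systems using the conversion table."""
    if from_code == to_code:
        return value
    op, factor = table[(from_code, to_code)]
    return _OPERATORS[op](value, factor)


print(convert(2.0, "angstroms", "picometres"))
```

A lookup for a pair that is absent from the table raises `KeyError`; the sketch makes no attempt to chain or invert conversions.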

2.6.6.1.20. ITEM_UNITS_LIST

The ITEM_UNITS_LIST category holds the descriptions of systems of physical units. The key item in this category is _item_units_list.code. Units are assigned to data items by references to this key from the ITEM_UNITS category.

2.6.6.2. DDL2 definitions describing categories

In this section, the DDL definitions that describe the properties of categories, category groups and subcategories are presented. Fig. 2.6.4.2 illustrates the organization of these categories.

2.6.6.2.1. CATEGORY

The category named CATEGORY contains the data items that describe the properties of collections of related data items. A DDL category is essentially a table. In this category the characteristics of the table as a whole are defined. This category includes the data items _category.id to identify a category name; _category.description to describe a category; _category.mandatory_code to indicate whether the category must appear in a data block; and _category.implicit_key, which can be used to merge like categories between data blocks. The category identifier _category.id is a component of the key in most of the DDL categories in this section. The parent definition of the category identifier and all its child relationships are defined in this category.

Because special rules exist in the STAR grammar for the specification of data items that belong to a common category, the organization of data items within categories has a significant influence on how these items may be expressed in a data file. For example, a data category may be specified only once within a STAR data block or save frame, and at any level of a STAR loop structure only data items of a common category may appear.

2.6.6.2.2. CATEGORY_EXAMPLES

The category named CATEGORY_EXAMPLES holds examples that apply to an entire category. This typically includes a complete specification of the category with annotations. An example specification consists of the text of the example, _category_examples.case, and an optional comment item, _category_examples.detail, which can be used to qualify the example. The key for this category includes the items _category_examples.id and _category_examples.case. The former is completely defined in the parent category named CATEGORY.

2.6.6.2.3. CATEGORY_GROUP

The category CATEGORY_GROUP names the category groups to which a category belongs. The assignment of a category to a category group is made when the category is defined. Each category group that is specified in this category must also be defined in the parent category, CATEGORY_GROUP_LIST. The basis for this category also includes the category identifier _category_group.category_id, which is completely defined in the parent category named CATEGORY.

2.6.6.2.4. CATEGORY_GROUP_LIST

The DDL category CATEGORY_GROUP_LIST holds data items that define category groups. Category groups are collections of related categories. Parent–child relationships may be defined for these groups. The specification of category groups and the relationships between these groups allow a complicated collection of categories to be organized into a hierarchy of more relevant groups. This higher level of structure is essential for large application dictionaries that may contain hundreds of category definitions.

The category CATEGORY_GROUP_LIST holds the description of each category group, _category_group_list.description, and an optional identifier of the parent group, _category_group_list.parent_id. Category groups can be formed from collections of base categories, those categories that hold data. Category groups can also be formed from collections of base categories and category groups.

Example 2.6.6.2 illustrates the category groups that are defined in this DDL. These include the group of categories that define categories, the group of categories defining data items and the group of categories that define properties of the dictionary. An additional compliance group is also defined for categories that are included specifically for compliance with previous versions of DDL. Each of these category groups is defined as a child of the group named ddl_group to which all of the base DDL categories belong.
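The parent–child organization of category groups can be sketched as a mapping from each group identifier to its parent identifier. In the Python sketch below, only `ddl_group` is a name taken from the text; the other group identifiers and the function name `group_ancestry` are assumptions chosen for illustration.

```python
# group id -> parent group id (None for the top-level group).
# Child group names are assumed for illustration.
GROUP_PARENT = {
    "ddl_group": None,
    "category_group": "ddl_group",
    "item_group": "ddl_group",
    "dictionary_group": "ddl_group",
    "compliance_group": "ddl_group",
}


def group_ancestry(group_id, parents=GROUP_PARENT):
    """Return the chain of groups from group_id up to its top-level ancestor."""
    chain = []
    seen = set()
    current = group_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cyclic parent relationship at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain


print(group_ancestry("item_group"))
```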

Example 2.6.6.2. Category groups defined in the DDL2 dictionary.

[Scheme scheme8]

2.6.6.2.5. CATEGORY_KEY

The category CATEGORY_KEY identifies the data items within a category that form the basis for the category. The category basis uniquely identifies each group or tuple of items in the category. In the analogy of the category as a table, no row in a table may have duplicate values for its key data items.

The choice of basis has important consequences in the specification of a category. It is important to ensure that the key items that form the category basis can unambiguously identify any tuple of data items within the category. If this is not the case, then it may not be possible to reliably recover data items that are stored in the category. Because key items are required to address each tuple of items in a category, key items are considered mandatory items in the category.

It is interesting to note how the key data items have been selected for the categories that define the DDL, and how this choice of key items influences the structure of the DDL dictionary. In the DDL category CATEGORY_KEY, the basis includes both the identifier for the category, _category_key.id, and the name of the key data item, _category_key.name. This choice of basis allows for any unique groups of items in a category to be defined as key items. Duplicate key-item values within a category are forbidden by the data model. In the DDL category ITEM_TYPE, the basis includes only the identifier for the item name, _item_type.name. This choice of basis has the desired effect of limiting the specification of item data type, _item_type.code, to a single choice for each data item.
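The uniqueness requirement on key items can be sketched as a check over the rows of a category. The function name `duplicate_keys` is hypothetical, and the rows in the example are illustrative instances of the CATEGORY_KEY category.

```python
def duplicate_keys(rows, key_items):
    """Return the key tuples that occur more than once among the rows.

    rows      -- list of mappings of item name -> value (one per tuple)
    key_items -- names of the items that form the category basis
    """
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(row[name] for name in key_items)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


KEY = ["_category_key.id", "_category_key.name"]
ROWS = [
    {"_category_key.id": "item_aliases", "_category_key.name": "_item_aliases.name"},
    {"_category_key.id": "item_aliases", "_category_key.name": "_item_aliases.dictionary"},
    {"_category_key.id": "item_aliases", "_category_key.name": "_item_aliases.name"},
]
print(duplicate_keys(ROWS, KEY))
```

With a compound basis, the first two rows are distinct even though they share a category identifier; only the repeated third row is reported. Checking the same rows against a single-item basis would reject any second row for the same category, which is the effect described for ITEM_TYPE.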

2.6.6.2.6. CATEGORY_METHODS

The CATEGORY_METHODS category is used to associate method identifiers with categories. Any number of unique method identifiers may be associated with a category. The method identifiers reference the full method definitions in the parent METHOD_LIST category.

2.6.6.2.7. SUB_CATEGORY

The category SUB_CATEGORY provides data items to describe a subcategory and to associate a procedure with the subcategory (see Section 2.6.6.2.9). A subcategory is a set of data items within a category that have a particular association. A typical example would be a triad of positional coordinates x, y, z that are collectively assigned to a 'cartesian' subcategory.

2.6.6.2.8. SUB_CATEGORY_EXAMPLES

The DDL category SUB_CATEGORY_EXAMPLES holds examples of a subcategory. A subcategory example might illustrate valid instances of the items comprising the subcategory. An example specification contains the text of the example, _sub_category_examples.case, and an optional comment item, _sub_category_examples.detail, that can be used to qualify the example. The key for this category includes the items _sub_category_examples.id and _sub_category_examples.case. This compound basis permits multiple unique examples to be provided for each subcategory.

2.6.6.2.9. SUB_CATEGORY_METHODS

The SUB_CATEGORY_METHODS category is used to associate method identifiers with subcategories. Any number of unique method identifiers may be associated with a subcategory. The method identifiers reference the full method definitions in the parent METHOD_LIST category.

The procedure that is identified as _sub_category_methods.method_id may be used to validate the subcategory identified as _sub_category_methods.sub_category_id. Subcategory validation may be required in instances where conditions are placed on the values of data items within the subcategory that are more restrictive than those associated with each component data item. A simple example of such a restriction would be a normalization restriction on the components of a subcategory. Any procedure referenced in this category must also be defined in the category METHOD_LIST.
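As an illustration of a normalization restriction, the following Python sketch validates that the components of a subcategory form a unit vector. The DDL leaves method implementation to dictionary developers, so the function name `is_normalized`, its tolerance and the example values are all assumptions.

```python
def is_normalized(components, tolerance=1e-6):
    """Return True if the squared components sum to 1 within the tolerance."""
    squared_length = sum(c * c for c in components)
    return abs(squared_length - 1.0) <= tolerance


# Each component may be individually valid (for example, between -1 and 1)
# while the subcategory as a whole fails the normalization condition.
print(is_normalized((0.6, 0.8, 0.0)), is_normalized((1.0, 1.0, 0.0)))
```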

2.6.6.3. DDL2 definitions describing methods

In this section, the DDL categories that define the methods associated with data blocks, categories, subcategories and items are presented. Figs. 2.6.4.1, 2.6.4.2 and 2.6.4.3 illustrate the relationships between the method categories and other DDL categories.

2.6.6.3.1. METHOD_LIST

The METHOD_LIST category defines methods that can be associated with data blocks, categories, subcategories and items. This category attempts to capture only the essential information required to define these methods, without defining any implementation details. The implementation details are appropriately left to application-dictionary developers. It is assumed here that, within a domain of dictionaries, a consistent method interface will be adopted that is tailored to the requirements of that domain. This of course complicates the sharing of methods between domains; however, it would be impossible at this time to define an implementation strategy inside the DDL that would even begin to satisfy the diverse requirements of potential DDL users. Consequently, the definition of each method is limited to: its unique identifier, _method_list.id; a textual description, _method_list.detail; the source text of the method, _method_list.inline; the name of the language in which the method is expressed, _method_list.language; and a code to identify the purpose of the method, _method_list.code.

2.6.6.4. DDL2 definitions describing dictionaries and data blocks

In this section, the DDL categories that describe the characteristics of dictionaries and data blocks are presented. In this context, a dictionary is defined as a group of related definitions within a STAR data block. Fig. 2.6.4.3 illustrates the organization for these categories.

2.6.6.4.1. DATABLOCK

The DATABLOCK category holds the essential identifying information for a data block: the name of the data block, _datablock.id; and a description of the block, _datablock.description. The _datablock.id is the parent identifier for both _category.implicit_key and _dictionary.datablock_id. The former guarantees that the identifier for the data block, and hence the dictionary, is added implicitly to the key of each category.

2.6.6.4.2. DATABLOCK_METHODS

The DATABLOCK_METHODS category is used to associate method identifiers with a data block. The method identifiers reference the full method definitions in the parent METHOD_LIST category.

2.6.6.4.3. DICTIONARY

The DICTIONARY category holds the essential identifying information for a data dictionary. The items recorded in this category include the title for the dictionary, _dictionary.title, the current version identifier, _dictionary.version, and the data-block identifier in which the dictionary is defined, _dictionary.datablock_id. The version identifier references the parent identifier in the DICTIONARY_HISTORY category in which each dictionary revision is described.

2.6.6.4.4. DICTIONARY_HISTORY

The DICTIONARY_HISTORY category holds the revision history for a dictionary. Each revision is assigned a version identifier that acts as the key item for the category. Along with the version information, a text description of the revision and date of revision must be specified.

References

Allen, F. H., Barnard, J. M., Cook, A. F. P. & Hall, S. R. (1995). The Molecular Information File (MIF): core specifications of a new standard format for chemical data. J. Chem. Inf. Comput. Sci. 35, 412–427.
