Background

Bernstein, H. J.

doi:10.1107/97809553602060000751

International Tables for Crystallography
Volume G: Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon


International Tables for Crystallography (2006). Vol. G. ch. 5.1, pp. 481-483

Section 5.1.2. Background

H. J. Bernstein^a ^*

^a Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA
Correspondence e-mail: yaya@bernstein-plus-sons.com

5.1.2. Background


There have been many efforts to create agreed data formats for use in crystallography (see Chapter 1.1). We need to consider how software has been created to make use of such formats, especially software that makes use of CIF.

Agreement on formats evolved from the earliest efforts at collaboration among research groups. Within crystallography, recognition of the need to use data formats as standards and to adapt applications to agreed formats, rather than to adapt formats to the caprices of particular applications or diffractometers or graphics engines, began in the late 1960s and early 1970s with the establishment of computerized data resources for the chemical and crystallographic community and the increasing availability of computer networks (Lykos, 1975). We will discuss three early data-resource efforts: the Cambridge Crystallographic Data Centre Structural Database File (CSD) (Allen et al., 1973), the Brookhaven National Laboratory Protein Data Bank (PDB) (Bernstein et al., 1977) and the NIH/EPA Chemical Information System (CIS) (Heller et al., 1977). The differences and similarities among application development efforts related to these resources illustrate some of the issues that now face software developers working with CIF: conformance to agreed formats versus deviations from standards to improve performance, as well as cross-platform portability.

The Cambridge Crystallographic Data Centre was established in 1965 ‘to compile a database containing comprehensive information on small-molecule crystal structures, i.e. organics and metallo-organic compounds containing up to 500 non-H atoms, the structures of which had been determined by X-ray or neutron diffraction’ (Allen, 2002). The Protein Data Bank was established at Brookhaven National Laboratory in 1971 as an archive of macromolecular structural information. The NIH/EPA Chemical Information System was established in 1975 as a confederation of databases including mass spectroscopy, NMR and the data from the CSD.

The three resources, CSD, PDB and CIS, took different approaches to applications development. The CSD was an integrated software system centred on a database. Both the software and the database were distributed on magnetic tape for users to use on their local computers. The developers of the software had to be concerned with portability of the software across the multiple computer systems used by crystallographers, but retained control of the design of the retrieval software and a core suite of applications. The PDB was an archive, rather than a database. Some software and the data were distributed on magnetic tape, but the application development model was what would now be called ‘open’, with users and software developers taking the data and the PDB format specification and creating software that would do useful things with PDB entries. The CIS was a remotely accessed confederation of databases on a central computer. The developers of software for the CIS did not have to be concerned with cross-platform portability, or with changes in syntax or semantics of data files impacting on external software developers. Developers of software for the CSD and the PDB had to be concerned with strict compliance with the rules for the respective data formats, albeit on somewhat different timescales. Developers of software for the centralized CIS database could negotiate for immediate changes in the data format to improve performance of the relevant application.

The CSD had agreed internal formats (Cambridge Structural Database, 1978). However, as noted in Chapter 1.1, there were many different formats in use for small-molecule crystallography and related fields. One may conjecture that one of many causes for such divergence was the CCDC practice of acquiring much of its data from journals, after differences among data formats had been masked by the publication process. The transition from this Tower of Babel to CIF is described in Chapter 1.1, and that history will not be repeated here, but it is important to note that an application writer working in the domain of small-molecule crystallography still has to be aware of a wide variety of formats in addition to CIF.

In the beginning, the PDB went through a relatively rapid format change and then achieved a stable format for more than two decades. The PDB differed from the CSD in depending on user deposition of data prior to publication. The better a user conformed to PDB data-format conventions, the more efficiently the data could move from deposition to release. The initial standard PDB format (PDB, 1974) was derived from the format used in a popular refinement program of the day (Diamond, 1971) and used 132-character records identified by the character strings in the first six columns. Starting in 1976, the PDB spent more than a year (PDB, 1976a,b, 1977) converting to an 80-column format, extensions of which are still in use to this day. Many external programs were developed using this 80-column format and it has become a major de facto standard for macromolecular software applications. Most application packages producing crystallographic macromolecular structures made a gradual transition from having output options for producing ‘Diamond format’ to having output options for producing PDB format. Macromolecular applications working with other disciplines shared the small-molecule applications' penchant for multiple formats.
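As an illustration of what working with such a fixed-column format involves, the following is a minimal sketch in Python. It assumes the column positions of the later, widely published PDB ATOM record layout (record name in columns 1-6, coordinates in columns 31-54), which is not necessarily identical to the 1976 layout; the names AtomRecord, parse_atom_line and build_sample_line, and the sample values, are hypothetical.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one 80-column ATOM/HETATM record by fixed column positions."""
    line = line.rstrip("\n").ljust(80)   # pad short lines to full width
    record_type = line[0:6].strip()      # columns 1-6 identify the record
    if record_type not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {record_type!r}")
    return AtomRecord(
        serial=int(line[6:11]),          # columns 7-11
        name=line[12:16].strip(),        # columns 13-16
        res_name=line[17:20].strip(),    # columns 18-20
        chain_id=line[21],               # column 22
        res_seq=int(line[22:26]),        # columns 23-26
        x=float(line[30:38]),            # columns 31-38
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
    )


def build_sample_line() -> str:
    """Assemble an illustrative record field by field (made-up values)."""
    return (
        "ATOM  "                 # columns 1-6: record name
        f"{1:>5d}"               # columns 7-11: serial number
        " "                      # column 12: blank
        f"{' N':<4s}"            # columns 13-16: atom name
        " "                      # column 17: alternate-location indicator
        f"{'MET':>3s}"           # columns 18-20: residue name
        " "                      # column 21: blank
        "A"                      # column 22: chain identifier
        f"{1:>4d}"               # columns 23-26: residue sequence number
        "    "                   # columns 27-30: insertion code and blanks
        f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"  # columns 31-54: x, y, z
    )


record = parse_atom_line(build_sample_line())
```

Because every field is located purely by position, a reader of this kind needs no tokenizer, but it also breaks silently if a writer shifts a field by even one column, which is one reason strict conformance mattered to PDB depositors and developers.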

The CIS, working in a completely closed, central service environment, had little direct impact on the formats to be used for applications. The CIS would acquire data from existing archives and databases and meld them into its master database. It would deliver its data as text on a CRT. Much of the impact of CIS data formats was to be restricted to its own internal application development.

Most of the formats resulting from these early efforts were fixed-field, fixed-order formats. The result was that adapting an application to a data format was simple if the processing flow of the application conformed to the fixed order of the data format. Frequently, the data flow did conform. When the processing flow did not conform, it was necessary to create internal data structures or temporary files to allow the unfortunately timed arrival of data to be time-shifted until it was needed. In general, the heaviest burden was imposed on applications that needed to write data conforming to one of the agreed formats. As the complexity of such time-shifting processes increased, it became clear that the cleanest solution was to base an application on an internal database and to populate the database as the data were processed. When data were to be written by an application, the data could be extracted from the database in whatever order was required.
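The internal-database approach described above can be sketched in a few lines of Python. This is an illustration only: the field names, the field width and the output order are hypothetical and do not correspond to any of the formats discussed here.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical fixed-order, fixed-field output format: the fields must be
# written in exactly this order, each right-justified in 12 characters.
OUTPUT_ORDER: List[str] = ["title", "cell_a", "cell_b", "cell_c", "space_group"]
FIELD_WIDTH = 12


def collect(items: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Populate the internal store in whatever order the data arrive."""
    store: Dict[str, str] = {}
    for key, value in items:
        store[key] = value
    return store


def write_fixed_order(store: Dict[str, str]) -> str:
    """Extract the fields in the order the output format requires."""
    missing = [key for key in OUTPUT_ORDER if key not in store]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return "".join(f"{store[key]:>{FIELD_WIDTH}s}" for key in OUTPUT_ORDER)


# Data arrive in the application's processing order, not the format's order.
arrival = [
    ("space_group", "P212121"),
    ("cell_c", "30.1"),
    ("title", "demo"),
    ("cell_a", "10.5"),
    ("cell_b", "20.2"),
]
line = write_fixed_order(collect(arrival))
```

The store decouples the order in which values become available from the order in which they must be written, at the cost of holding all of them in memory until output time.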

In the 1970s and early 1980s, such a procedure was a serious burden to place on an application. With limited memory and processor speeds, there was a strong argument for adapting agreed formats to the ‘natural’ processing flow, reducing or avoiding the need for an internal database. As the speed and size of computers have changed and as programming language and operating-system support for dynamic allocation of resources has improved, the need to have agreed formats driven by applications has become less pressing.

We need to understand three major thrusts in data representation: the development of markup languages, of data-representation frameworks and of database application support. Modern applications can benefit from all three.

5.1.2.1. Markup languages


A markup language allows the raw text of a document to be annotated with interleaved ‘markup’ specifying layout information for the bracketed text. For document processing, the implicit assumption of the use of an internal database became formalized with the gradual adoption of agreed markup languages in the late 1980s and early 1990s [e.g. TeX (Knuth, 1986), SGML (ISO, 1986), RTF (Andrews, 1987), HTML (Berners-Lee, 1989)]. When used in this manner, such a language has the implicit ordering assumption of reading forward in the document. However, with modern demands for multidimensional layout and document reflow, applications managing such documents achieve the best performance and flexibility when they store the entire marked-up document in an internal data structure that allows random access to all the information.

5.1.2.2. Data-representation frameworks


A data-representation framework provides the concepts for managing data and data about the management of data (‘metadata’). Such frameworks may be based on programming languages or markup languages or built from scratch. They provide a mechanism for representing data (e.g. as data sets, graphs or trees) and a mechanism for representing metadata (e.g. as dictionaries or schemas). Four are of particular importance in crystallography: CIF, ASN.1, HDF and XML.

As noted in Chapter 1.1, CIF was created to rationalize the publication process for small molecules. It combines a very simple tag–value data representation with a dictionary definition language (DDL) and well populated dictionaries. CIF is table-oriented, naturally row-based, has case-insensitive tags and allows two levels of nesting. CIF is order-independent and uses its dictionaries both to define the meanings of its tags and to parameterize its tags. It is interesting to note that, even though CIF is defined as order-independent, it effectively fills the role of an order-dependent markup language in the publication process. We will discuss this issue later in this chapter.
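To make the tag–value idea concrete, the following Python sketch reads a deliberately small subset of CIF syntax: one data block, single tag–value items and simple loops. It is not a conforming CIF parser. It assumes that comments occupy whole lines, that quoted values contain no embedded quote characters, and that there are no multi-line text fields or save frames; the function name parse_cif_subset and the sample values are hypothetical.

```python
import re
from typing import Dict, List

# Simplified tokenizer: a quoted string or a run of non-whitespace characters.
_TOKEN = re.compile(r"""'[^']*'|"[^"]*"|\S+""")


def _tokens(text: str) -> List[str]:
    out: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # whole-line comments only
            continue
        out.extend(_TOKEN.findall(line))
    return out


def _unquote(tok: str) -> str:
    if len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in "'\"":
        return tok[1:-1]
    return tok


def _is_structural(tok: str) -> bool:
    low = tok.lower()
    return tok.startswith("_") or low == "loop_" or low.startswith("data_")


def parse_cif_subset(text: str) -> Dict[str, List[str]]:
    """Return {lower-cased tag: list of values}; tag order is irrelevant."""
    toks = _tokens(text)
    data: Dict[str, List[str]] = {}
    i = 0
    while i < len(toks):
        tok = toks[i]
        if tok.lower().startswith("data_"):
            i += 1                              # block name ignored here
        elif tok.lower() == "loop_":
            i += 1
            tags: List[str] = []
            while i < len(toks) and toks[i].startswith("_"):
                tags.append(toks[i].lower())    # tags are case-insensitive
                data[tags[-1]] = []
                i += 1
            if not tags:
                raise ValueError("loop_ without tags")
            col = 0
            while i < len(toks) and not _is_structural(toks[i]):
                data[tags[col % len(tags)]].append(_unquote(toks[i]))
                col += 1
                i += 1
        elif tok.startswith("_"):
            if i + 1 >= len(toks):
                raise ValueError(f"tag {tok} has no value")
            data[tok.lower()] = [_unquote(toks[i + 1])]
            i += 2
        else:
            raise ValueError(f"unexpected token {tok!r}")
    return data


SAMPLE = """
data_demo
_cell_length_a 10.5
_Cell_Length_B 20.2
_symmetry_space_group_name_H-M 'P 21 21 21'
loop_
_atom_site_label
_atom_site_fract_x
C1 0.10
O1 0.25
"""
items = parse_cif_subset(SAMPLE)
```

Storing the items in a dictionary keyed by lower-cased tag reflects the two properties mentioned above: lookups do not depend on the order in which items appeared in the file, and `_Cell_Length_B` and `_cell_length_b` resolve to the same entry.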

Abstract Syntax Notation One (ASN.1) (Dubuisson, 2000; ISO, 2002) was developed to provide a data framework for data communications, where great precision in the bit-by-bit layout of data to be seen by very different systems is needed. Although targeted for communications software, ASN.1 is suitable for any application requiring precise control of data structures and, as such, primarily supports the metadata of an application, rather than the data. ASN.1 can be compiled directly to C code. The resulting C code then supports the data of the application. ASN.1 notation found application in NCBI's macromolecular modelling database (Ohkawa et al., 1995). ASN.1 has case-sensitive tags and allows case-insensitive variants. It manages order-dependent data structures in a mixed order-dependent/order-independent environment.

HDF (NCSA, 1993) is ‘a machine-independent, self-describing, extendible file format for sharing scientific data in a heterogeneous computing environment, accompanied by a convenient, standardized, public domain I/O library and a comprehensive collection of high quality data manipulation and analysis interfaces and tools’ (http://ssdoo.gsfc.nasa.gov/nost/formats/hdf.html). HDF was adopted by the Neutron and X-ray Data Format (NeXus) effort (Klosowski et al., 1997). HDF allows the building of a complete data framework, representing both data and metadata. Two parallel threads of software development, focused on the management and exchange of raw data from area detectors, began in the mid-1990s: the Crystallographic Binary File (CBF) (Hammersley, 1997) and NeXus. The volumes of data involved were daunting and efficiency of storage was important. Therefore both proposed formats assumed a binary format. CBF was based on a combination of CIF-like ASCII headers with compressed binary images. NeXus was based on HDF. The first API for CBF was produced by Paul Ellis in 1998. CBF rapidly evolved into CBF/imgCIF with a complete DDL2 dictionary and a fully CIF-compliant API (Chapter 5.6). As of mid-2004, NeXus was still evolving (see http://www.nexusformat.org/).

XML is a simplified form of SGML, drawing on years of development of tools for SGML and HTML. XML is tree-oriented with case-sensitive entity names. It allows unlimited nesting and is order-dependent. Metadata are managed as a ‘document type definition’ (DTD), which provides minimal syntactic information, or as schemas, which allow for more detail and are more consistent with database conventions. In fields close to crystallography, the first effort at adopting XML was the chemical markup language (CML) (Murray-Rust & Rzepa, 1999). CML is intentionally imprecise in its ontology to allow for flexibility in development. The CSD and PDB have released their own XML representations (http://www.ccdc.cam.ac.uk/support/documentation/relibase/3_0/relibase_DPG/toc.html; http://pdbml.rcsb.org).

It may seem from this discussion that the application designer faces an unmanageable variety of data frameworks in an unstable, evolving environment. To some extent this is true. Fortunately, however, there are signs of convergence on CIF dictionary-based ontologies and the use of transliterated CIFs. This means that an application adapted to CIF should be relatively easy to adapt to other data frameworks.
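The idea of a transliterated CIF can be illustrated with a short Python sketch that writes tag–value items out as an XML tree using the standard library. The element and attribute names (datablock, item, value, tag, name) are hypothetical choices made for this illustration; they are not the names used by PDBML or by any other released schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def cif_items_to_xml(block_name: str, items: Dict[str, List[str]]) -> ET.Element:
    """Transliterate {tag: values} into a small XML tree.

    The CIF tags are carried over unchanged (lower-cased) as attribute
    values, so the meaning of each item is still defined by the CIF
    dictionary rather than by the XML layer.
    """
    root = ET.Element("datablock", {"name": block_name})
    for tag in sorted(items):                    # any stable order will do
        item = ET.SubElement(root, "item", {"tag": tag.lower()})
        for value in items[tag]:                 # looped items give several values
            ET.SubElement(item, "value").text = value
    return root


example_items = {
    "_cell_length_a": ["10.5"],
    "_atom_site_label": ["C1", "O1"],
}
tree = cif_items_to_xml("demo", example_items)
xml_text = ET.tostring(tree, encoding="unicode")
```

Because the ontology stays in the dictionary-defined tags, an application that already organizes its data around those tags only needs a different reader and writer to move between the two representations.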

References

Cambridge Structural Database (1978). Cambridge Crystallographic Database User Manual. Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, England.

ISO (1986). ISO 8879. Information processing – Text and office systems – Standard Generalized Markup Language (SGML). Geneva: International Organization for Standardization.

ISO (2002). ISO/IEC 8824–1. Abstract Syntax Notation One (ASN.1). Specification of basic notation. Geneva: International Organization for Standardization.

NCSA (1993). NCSA HDF: specification and developer's guide. Version 3.2. University of Illinois at Urbana-Champaign, USA.

PDB (1974). PDB Newsletter 1. Brookhaven National Laboratory, USA.

PDB (1976a). PDB Newsletter 2. Brookhaven National Laboratory, USA.

PDB (1976b). PDB Newsletter 3. Brookhaven National Laboratory, USA.

PDB (1977). PDB Newsletter 4. Brookhaven National Laboratory, USA.

Allen, F. H. (2002). The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Cryst. B58, 380–388.

Allen, F. H., Kennard, O., Motherwell, W. D. S., Town, W. G. & Watson, D. G. (1973). Cambridge Crystallographic Data Centre. II. Structural Data File. J. Chem. Doc. 13, 119–123.

Andrews, N. (1987). Rich Text Format standard makes transferring text easier. Microsoft Syst. J. 2, 63–67.

Berners-Lee, T. (1989). Information management: a proposal. Internal Report. Geneva: CERN. http://www.w3.org/History/1989/proposal-msw.html.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.

Diamond, R. (1971). A real-space refinement procedure for proteins. Acta Cryst. A27, 436–452.

Dubuisson, O. (2000). ASN.1 – communication between heterogeneous systems. San Francisco, CA: Morgan Kaufmann. (Translated from the French by P. Fouquart.)

Hammersley, A. P. (1997). FIT2D: an introduction and overview. ESRF Internal Report ESRF97HA02T. Grenoble: ESRF.

Heller, S. R., Milne, G. W. A. & Feldmann, R. J. (1977). A computer-based chemical information system. Science, 195, 253–259.

Klosowski, P., Koennecke, M., Tischler, J. Z. & Osborn, R. (1997). NeXus: a common format for the exchange of neutron and synchrotron data. Physica B Condens. Matter, B241–243, 151–153.

Knuth, D. E. (1986). The TeXbook. Computers and typesetting, Vol. A. Reading, MA: Addison-Wesley.

Lykos, P. (1975). Editor. Computer networking and chemistry. ACS Symposium Series, Vol. 19. Washington DC: American Chemical Society.

Murray-Rust, P. & Rzepa, H. (1999). Chemical markup, XML and the WWW, Part I: Basic principles. J. Chem. Inf. Comput. Sci. 39, 928–942.

Ohkawa, H., Ostell, J. & Bryant, S. (1995). MMDB: an ASN.1 specification for macromolecular structure. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, 16–19 July 1995, pp. 259–267. Menlo Park, CA: American Association for Artificial Intelligence.
