International Tables for Crystallography
Volume I
X-ray absorption spectroscopy and related techniques
Edited by C. T. Chantler, F. Boscherini and B. Bunker

International Tables for Crystallography (2024). Vol. I. ch. 1.3, pp. 13-15
https://doi.org/10.1107/S1574870723004585

Chapter 1.3. Deposition of XAFS data

Sydney R. Hall (a), James R. Hester (b) and Brian McMahon (c)*

(a) School of Molecular Sciences, University of Western Australia, 35 Stirling Highway, Perth, WA 6009, Australia; (b) Australian Nuclear Science and Technology Organisation, Locked Bag 2001, Kirrawee DC, NSW 2232, Australia; and (c) International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, United Kingdom
Correspondence e-mail:  [email protected]

Best practice in modern research data management argues for the deposition of validated experimental data sets in publicly accessible repositories. Such depositions should meet the FAIR principles (that scientific data should be findable, accessible, interoperable and reusable), on which funding bodies increasingly insist. In so doing, they will include sufficient metadata to allow verification and reproducibility of published research results, and provide the scientific community with a curated collection of valuable data. The benefits of such a collection include independent validation, reanalysis and the ability to extract new science from existing data as new techniques appear.

Keywords: deposition; data management.

1. Introduction

It has become fashionable in the late twentieth and early twenty-first centuries to speak of 'data-driven science' in response to developments in instrumentation, computation and storage. The term is certainly a misnomer, as observational and experimental sciences have always relied upon the collection, evaluation and analysis of data to confirm or suggest hypotheses and models. Nevertheless, it reflects the recognized scale of current scientific research, and the potential for greater collaboration, data sharing and knowledge discovery that modern technologies provide (Hey et al., 2009).

Funding bodies began to require that scientific research groups in receipt of their support should formulate and work to data-management plans, so that observational and experimental data sets were organized and preserved to allow their reuse, both to validate original research results and to make possible other studies based on those data sets, which were often collected at significant expense. A consortium of scientists and research organizations formulated a set of principles intended to develop best practice in such reusability (Wilkinson et al., 2016). These principles used the acronym FAIR to emphasize the need for findability, accessibility, interoperability and reusability of collected data sets. That is, in order for them to be reusable, scientists need to become aware of the existence of useful data sets, and they need to be able to retrieve these data sets and process them with their own software tools. Missing from the acronym FAIR, but implicit in the intention to facilitate reuse, is the notion that the data should be trustworthy according to some agreed criteria, and that they should therefore be validated against those criteria.

We shall consider the application of all of these principles to XAFS data, but here we shall concentrate on the best approach to making experimental data sets accessible to other researchers, and we shall argue for the orderly deposition of primary data in repositories that provide unique identifiers, persistent locators and some measure of searchability.

2. Data characterization and standards

The goals of findability and interoperability are equally addressed by standardization efforts, so that the community of users has a shared understanding of the terminology associated with the core concepts in their domain, and shared formats to facilitate the loading of numerical data into different computer programs.

Let us first address the second of these considerations. Historically, it was important to establish rigid file formats because programs loaded data from fixed-width records into limited in-memory representations of numerical values and text strings (Hall & McMahon, 2005). Current programming languages are better able to handle free-format input, and it is increasingly common to find data sets that use plain-text files where embedded tags identify the nature of each item of data. Examples of such formats are XML, JSON and the Crystallographic Information File (CIF). Where very large volumes of data are collected, especially at very high speed, binary files such as HDF5, in which data are stored in a hierarchical structure, are preferred. Such files are not immediately readable on a computer screen, but the contents can be displayed by dedicated libraries that understand the hierarchy, which is defined by associating attributes with each node of a tree structure whose nodes may contain additional children ('branches' or 'leaves'). Conceptually, these attributes fulfil the same role as the tags mentioned in the description of free-format text files: namely, they provide in-place descriptions of the associated data points. Regardless of the physical structure of the file (for a fuller description of common formats, see Hester, 2024a), the essential requirement for portable data is a well defined ontology; that is, a set of agreed concepts and relationships that can be associated with the discrete data items presented in a file. Given the existence of such an ontology, the transformation of representations between different file formats is sometimes trivial, and at least mechanical.
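The observation that conversion between formats becomes mechanical once the data items are tied to agreed concepts can be illustrated with a minimal Python sketch. The data names used here (`xas.element`, `xas.energy_eV` and so on) are hypothetical and invented purely for illustration; they are not taken from any published dictionary, and the CIF-style output is a simplified tag-value and loop rendering rather than a complete CIF writer.

```python
import json

# Hypothetical data names standing in for an agreed ontology; they are
# illustrative only and do not come from any published dictionary.
data_set = {
    "xas.element": "Cu",
    "xas.edge": "K",
    "xas.energy_eV": [8970.0, 8980.0, 8990.0],
    "xas.mu": [0.12, 0.85, 1.02],
}


def to_json(items):
    """Serialize the tagged items as JSON text."""
    return json.dumps(items, indent=2, sort_keys=True)


def to_cif_like(items, block_name="example"):
    """Render the same items as simplified CIF-style text.

    Scalars become tag-value pairs; list-valued items (assumed here to be
    equal-length columns) are grouped into a single loop.
    """
    lines = [f"data_{block_name}"]
    scalars = {k: v for k, v in items.items() if not isinstance(v, list)}
    columns = {k: v for k, v in items.items() if isinstance(v, list)}
    for name, value in scalars.items():
        lines.append(f"_{name} {value}")
    if columns:
        lines.append("loop_")
        lines.extend(f"_{name}" for name in columns)
        for row in zip(*columns.values()):
            lines.append(" ".join(str(v) for v in row))
    return "\n".join(lines)


json_text = to_json(data_set)
cif_text = to_cif_like(data_set)
print(cif_text)
```

Both serializations are driven by the same dictionary of named items, so neither writer needs any knowledge of the other format; only the shared names carry the meaning.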

A good starting point in the development of an ontology is an agreed nomenclature of technical terms, and a set of definitions was proposed in 2009 by the IUCr Commission on XAFS and incorporated into the Online Dictionary of Crystallography. This nomenclature is discussed in the current volume by Chantler (2024).

An early illustration of how to present X-ray absorption spectral information in CIF format was given by Ravel et al. (2012) and further exemplified in the supporting information to Trevorah et al. (2019). However, there is not as yet a standard CIF dictionary in this area, and further work on standardization would be timely in order to reduce the likelihood of a great multiplicity of representational formats. A benefit of developing a suitable ontology within the Crystallographic Information Framework would be its compatibility with a large number of related structural science techniques and disciplines (Hall & McMahon, 2016).

To achieve reusability, the data set must contain not only the numeric data values of measurements and observations, but also sufficient metadata (information describing the parameters of the experiment) to allow other users to interpret those numbers correctly. Hester (2024b) identifies a number of experimental metadata items suitable for basic validation of XAFS data sets.
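As a rough illustration of what a basic metadata check before deposition might look like, the following Python sketch reports which required items are absent or empty. The list of required items is an assumption made for this example only; it is not the list identified by Hester (2024b), and a real check would draw its requirements from an agreed community dictionary.

```python
# Hypothetical required metadata items; a real list would come from an
# agreed community dictionary, not from this sketch.
REQUIRED_ITEMS = ("element", "edge", "beamline", "monochromator_crystal")


def missing_metadata(metadata, required=REQUIRED_ITEMS):
    """Return the required items that are absent or empty in `metadata`."""
    return [name for name in required if metadata.get(name) in (None, "")]


# An invented example deposition with one empty and one absent item.
deposit = {
    "element": "Cu",
    "edge": "K",
    "beamline": "",
    "energy_eV": [8970.0, 8980.0, 8990.0],
}

problems = missing_metadata(deposit)
print(problems)
```

A repository or journal could run a check of this kind at submission time and return the list of missing items to the depositor.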

3. The importance of data deposition

In recent years, there has been much discussion about the 'reproducibility crisis' in science (Baker, 2016). In brief, there is a perception that there is not sufficient deposition of experimental data sets in open repositories to allow published research findings to be replicated (and, therefore, verified or challenged). The FAIR data principles mentioned above arose at least in part to satisfy the requirements for reproducibility, and there is no doubt that a larger proportion of scientific research is now accompanied by the deposition of well characterized data sets.

There has been some pushback against this sense of crisis (see, for example, Fanelli, 2018), and it is fair to say that the level of concern depends on factors such as the scientific domain, the impact of research findings on human wellbeing and the cost of conducting the experiment, as well as the availability of tests for determining the extent to which a repeat study does reproduce the original results. Nevertheless, there is broad agreement that it is desirable to deposit experimental data sets as much as possible in accordance with the FAIR principles.

In the structural science domain, in 2011 the IUCr commissioned a Diffraction Data Deposition Working Group (DDDWG) to consider the rationale and practicalities of routine deposition, initially of X-ray diffraction images. The scope of the working group later extended to other experimental techniques, and its recommendations (Helliwell et al., 2017) continue to be monitored by a successor body, the IUCr Committee on Data (https://www.iucr.org/resources/data/commdat). The working group identified a number of reasons for depositing data sets and making them available alongside a scientific publication (Kroon-Batenburg & Helliwell, 2014):

(i) To enhance the reproducibility of a scientific experiment.

(ii) To verify or support the validity of deductions from an experiment.

(iii) To safeguard against error.

(iv) To better safeguard against fraud than is apparently the case at present.

(v) To allow other scholars to conduct further research based on experiments already conducted.

(vi) To allow reanalysis at a later date, especially to extract 'new' science as new techniques are developed.

(vii) To provide example materials for teaching and learning.

(viii) To provide long-term preservation of experimental results and future access to them.

(ix) To permit systematic collection for comparative studies.

Although there is some overlap amongst these reasons, they are broadly applicable to many other experimental methodologies, and certainly to the XAS field. The working group and the Committee on Data went on to conduct further studies and analysis (see, for example, Guss & McMahon, 2014; Coles & Sarjeant, 2020). From these, it became apparent that different communities had different attitudes towards and requirements for the routine deposition of raw data sets. For biological macromolecular structure determination, deposition of raw data images is recommended for IUCr journals (Helliwell et al., 2019), while for chemical crystallography the outcome of a targeted workshop (Diffraction Data Deposition Working Group, 2021) was more nuanced. One particular outcome was the development of a new journal section, Raw Data Letters (Kroon-Batenburg et al., 2022), in IUCrData to provide a home for the discussion of 'interesting' data sets: namely, those identified by researchers as of interest to methods and software developers for purposes such as reanalysis by newer methods, or those that possess features that might be relevant to the structural interpretation but are not amenable to routine analysis techniques.

The longevity and economics of a raw data archive have been discussed by Guss & McMahon (2014), and subsequent practice has not favoured the establishment of curated domain-specific archives, although some of these have been established for biological macromolecules (Grabowski et al., 2016; Kurisu, 2021). An essential requirement for making deposited data sets available over a long time period is the assignment of a persistent URL (i.e. one that is not dependent on specific server-based domain names or an underlying method of serving to web browsers). A resource that is currently favoured by many researchers for depositing their scientific data sets is Zenodo (https://zenodo.org), which was developed under the European Union OpenAIRE programme and is operated by CERN. Zenodo will provide for each submission a persistent URL in the form of a Digital Object Identifier, thus providing for the accessibility requirement of the FAIR principles.

The findability of such deposited data sets currently relies primarily on hyperlinks to the deposition from associated journal publications, and Zenodo itself offers only limited search capability. The metadata characterizing each deposition are rather general in nature (as befits a general-purpose resource). The archive may be explored programmatically using the OAI-PMH protocol and a REST API, which support a metadata schema developed by the DataCite organization for the characterization of research data sets (DataCite, 2017). As argued by Guss & McMahon (2014), domain-specific repositories can provide much richer metadata and thus enhance the findability of deposited data sets, and the XAS community may wish to carry out its own cost–benefit analysis of the merits of developing such a dedicated resource.
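A minimal Python sketch of such programmatic exploration is given here. It only constructs request URLs and parses a response, without contacting any server. The endpoint addresses and the `oai_datacite` metadata prefix are assumptions based on our understanding of the Zenodo service and should be checked against its current documentation; the sample response and its record identifiers are invented for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoints; verify against the current Zenodo documentation.
REST_RECORDS = "https://zenodo.org/api/records"
OAI_BASE = "https://zenodo.org/oai2d"

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def rest_search_url(query, size=10):
    """Build a REST search URL for a free-text query."""
    return f"{REST_RECORDS}?{urlencode({'q': query, 'size': size})}"


def oai_list_records_url(metadata_prefix="oai_datacite", set_spec=None):
    """Build an OAI-PMH ListRecords URL, optionally restricted to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{OAI_BASE}?{urlencode(params)}"


def record_identifiers(oai_xml):
    """Extract the record identifiers from a ListRecords response."""
    root = ET.fromstring(oai_xml)
    path = ".//oai:header/oai:identifier"
    return [el.text for el in root.findall(path, OAI_NS)]


# Invented sample response, trimmed to the header elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:zenodo.org:0000001</identifier></header></record>
    <record><header><identifier>oai:zenodo.org:0000002</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

print(rest_search_url("XAFS", size=5))
print(oai_list_records_url())
print(record_identifiers(SAMPLE))
```

Because the harvested metadata are general in nature, a free-text query of this kind can locate depositions only by keywords that depositors happened to supply, which illustrates the limitation noted above.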

References

Baker, M. (2016). Nature, 533, 452–454.
Chantler, C. T. (2024). Int. Tables Crystallogr. I, ch. 9.1, 1059–1067.
Coles, S. J. & Sarjeant, A. (2020). IUCr Newsl. 28, https://www.iucr.org/news/newsletter/volume-28/number-1/raw-data-availability-the-small-molecule-crystallography-perspective.
DataCite (2017). DataCite OAI Schema v.1.1. https://schema.datacite.org/oai/oai-1.1/.
Diffraction Data Deposition Working Group (2021). Workshop on When Should Small Molecule Crystallographers Publish Raw Diffraction Data? https://www.iucr.org/resources/data/commdat/prague-workshop-cx.
Fanelli, D. (2018). Proc. Natl Acad. Sci. USA, 115, 2628–2631.
Grabowski, M., Langner, K. M., Cymborowski, M., Porebski, P. J., Sroka, P., Zheng, H., Cooper, D. R., Zimmerman, M. D., Elsliger, M.-A., Burley, S. K. & Minor, W. (2016). Acta Cryst. D72, 1181–1193.
Guss, J. M. & McMahon, B. (2014). Acta Cryst. D70, 2520–2532.
Hall, S. R. & McMahon, B. (2005). International Tables for Crystallography, Vol. G, edited by S. R. Hall & B. McMahon, pp. 2–10. Dordrecht: Springer.
Hall, S. R. & McMahon, B. (2016). Data Sci. J. 15, 3.
Helliwell, J. R., McMahon, B., Androulakis, S., Szebenyi, M., Kroon-Batenburg, L., Terwilliger, T., Westbrook, J. & Weckert, E. (2017). Final Report of the IUCr Diffraction Data Deposition Working Group. https://www.iucr.org/resources/data/dddwg/final-report.
Helliwell, J. R., Minor, W., Weiss, M. S., Garman, E. F., Read, R. J., Newman, J., van Raaij, M. J., Hajdu, J. & Baker, E. N. (2019). Acta Cryst. D75, 455–457.
Hester, J. R. (2024a). Int. Tables Crystallogr. I, ch. 7.1, 857–860.
Hester, J. R. (2024b). Int. Tables Crystallogr. I, ch. 7.2, 861–862.
Hey, T., Tansley, S., Tolle, K. & Gray, J. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond: Microsoft Research.
Kroon-Batenburg, L. M. J. & Helliwell, J. R. (2014). Acta Cryst. D70, 2502–2509.
Kroon-Batenburg, L. M. J., Helliwell, J. R. & Hester, J. R. (2022). IUCrData, 7, x220821.
Kurisu, G. (2021). Acta Cryst. A77, C667.
Ravel, B., Hester, J. R., Solé, V. A. & Newville, M. (2012). J. Synchrotron Rad. 19, 869–874.
Trevorah, R. M., Chantler, C. T. & Schalken, M. J. (2019). IUCrJ, 6, 586–602.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J. G., Groth, P., Goble, C., Grethe, J. S., Heringa, J., 't Hoen, P. A. C., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J. & Mons, B. (2016). Sci. Data, 3, 160018.







































