International Tables for Crystallography, Volume I: X-ray absorption spectroscopy and related techniques. Edited by C. T. Chantler, F. Boscherini and B. Bunker. © International Union of Crystallography 2024
International Tables for Crystallography (2024). Vol. I, ch. 1.3, pp. 13–15
https://doi.org/10.1107/S1574870723004585
Chapter 1.3. Deposition of XAFS data
(a) School of Molecular Sciences, University of Western Australia, 35 Stirling Highway, Perth, WA 6009, Australia, (b) Australian Nuclear Science and Technology Organisation, Locked Bag 2001, Kirrawee DC, NSW 2232, Australia, and (c) International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, United Kingdom
Best practice in modern research data management argues for the deposition of validated experimental data sets in publicly accessible repositories. Such depositions should meet the FAIR principles (that scientific data should be findable, accessible, interoperable and reusable), on which funding bodies increasingly insist. In so doing, they will include sufficient metadata to allow verification and reproducibility of published research results, and provide the scientific community with a curated collection of valuable data. The benefits of such a collection include independent validation, reanalysis and the ability to extract new science from existing data as new techniques appear.
Keywords: deposition; data management.
It has become fashionable in the late twentieth and early twenty-first centuries to speak of ‘data-driven science’ in response to developments in instrumentation, computation and storage. The term is certainly a misnomer, as observational and experimental sciences have always relied upon the collection, evaluation and analysis of data to confirm or suggest hypotheses and models. Nevertheless, it reflects a recognition of the scale of current scientific research, and of the potential for greater collaboration, data sharing and knowledge discovery that modern technologies provide (Hey et al., 2009).
Funding bodies began to require that scientific research groups in receipt of their support should formulate and work to data-management plans, so that observational and experimental data sets were organized and preserved to allow their reuse, both to validate original research results and to make possible other studies based on those data sets, which were often collected at significant expense. A consortium of scientists and research organizations formulated a set of principles intended to develop best practice in such reusability (Wilkinson et al., 2016). These principles used the acronym FAIR to emphasize the need for findability, accessibility, interoperability and reusability of collected data sets. That is, in order for them to be reusable, scientists need to become aware of the existence of useful data sets, and they need to be able to retrieve these data sets and process them with their own software tools. Missing from the acronym FAIR, but implicit in the intention to facilitate reuse, is the notion that the data should be trustworthy according to some agreed criteria, and that they should therefore be validated against those criteria.
We shall consider the application of all of these principles to XAFS data, but here we shall concentrate on the best approach to making experimental data sets accessible to other researchers, and we shall argue for the orderly deposition of primary data in repositories that provide unique identifiers, persistent locators and some measure of searchability.
The goals of findability and interoperability are equally addressed by standardization efforts, so that the community of users has a shared understanding of the terminology associated with the core concepts in their domain, and shared formats to facilitate the loading of numerical data into different computer programs.
Let us first address the second of these considerations. Historically, it was important to establish rigid file formats because program input routines loaded data from fixed-width records into limited in-memory representations of numerical values and text strings (Hall & McMahon, 2005). Current programming languages are better able to handle free-format input, and it is increasingly common to find data sets that use plain-text files where embedded tags identify the nature of each item of data. Examples of such formats are XML, JSON and the Crystallographic Information File (CIF). Where very large volumes of data are collected, especially at very high speed, binary files are preferred, such as HDF5, where data are stored in a hierarchical structure. Such files are not immediately readable on a computer screen, but the contents can be displayed by dedicated libraries that understand the hierarchy, which is defined by associating attributes with each node of the tree structure that may contain additional children (‘branches’ or ‘leaves’). Conceptually, these attributes fulfil the same role as the tags mentioned in the description of free-format text files: namely, they provide in-place descriptions of the associated data points. Regardless of the physical structure of the file (for a fuller description of common formats, see Hester, 2024a), the essential requirement for portable data is a well defined ontology; that is, a set of agreed concepts and relationships that can be associated with the discrete data items presented in a file. Given the existence of such an ontology, the transformation of representations between different file formats is sometimes trivial, and at least mechanical.
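As an illustration of how a shared ontology makes conversion between formats mechanical, the following minimal Python sketch serializes one set of data items both as JSON and as CIF-style tag–value text. The concept names, the tag strings (such as `_xas_hypothetical.edge`) and the sample values are invented for illustration only; they are not taken from any published dictionary.

```python
import json

# Hypothetical ontology: concept name -> (CIF-style tag, JSON key).
# None of these tags comes from a published dictionary.
ONTOLOGY = {
    "absorbing_element": ("_xas_hypothetical.element", "element"),
    "absorption_edge": ("_xas_hypothetical.edge", "edge"),
    "temperature_kelvin": ("_xas_hypothetical.temperature", "temperature_K"),
}

# Illustrative values only, keyed by concept rather than by file format.
record = {
    "absorbing_element": "Cu",
    "absorption_edge": "K",
    "temperature_kelvin": 295.0,
}


def to_json(data):
    """Serialize the concepts using the JSON keys from the ontology."""
    return json.dumps({ONTOLOGY[c][1]: v for c, v in data.items()}, sort_keys=True)


def to_tagged_text(data):
    """Serialize the concepts as one 'tag value' pair per line (CIF-like)."""
    return "\n".join(f"{ONTOLOGY[c][0]} {v}" for c, v in data.items())


def from_json(text):
    """Recover the concept-keyed record from the JSON representation."""
    key_to_concept = {key: concept for concept, (_, key) in ONTOLOGY.items()}
    return {key_to_concept[k]: v for k, v in json.loads(text).items()}


print(to_json(record))
print(to_tagged_text(record))
```

Because both writers consult the same concept table, adding a further output format in this sketch would require only another column in the table and another small writer function, not a change to the data themselves.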
A good starting point in the development of an ontology is an agreed nomenclature of technical terms, and a set of definitions was proposed in 2009 by the IUCr Commission on XAFS and incorporated into the Online Dictionary of Crystallography. This nomenclature is discussed in the current volume by Chantler (2024).
An early illustration of how to present X-ray absorption spectral information in CIF format was given by Ravel et al. (2012) and further exemplified in the supporting information to Trevorah et al. (2019). However, there is not as yet a standard CIF dictionary in this area, and further work on standardization would be timely in order to reduce the likelihood of a great multiplicity of representational formats. A benefit of developing a suitable ontology within the Crystallographic Information Framework would be its compatibility with a large number of related structural science techniques and disciplines (Hall & McMahon, 2016).
To achieve reusability, the data set must contain not only the numeric data values of measurements and observations, but also sufficient metadata (information describing the parameters of the experiment) to allow other users to interpret those numbers correctly. Hester (2024b) identifies a number of experimental metadata items suitable for basic validation of XAFS data sets.
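A basic completeness check of this kind can be sketched in a few lines of Python. The list of required items below is purely hypothetical and is not the set identified by Hester (2024b); in practice the list would be drawn from whatever standard the community agrees upon.

```python
# Hypothetical required items; a real list would come from a community standard.
REQUIRED_ITEMS = ("element", "edge", "energy_unit", "monochromator")


def missing_metadata(metadata):
    """Return the required items that are absent or empty in `metadata`."""
    return [k for k in REQUIRED_ITEMS if metadata.get(k) in (None, "")]


# Illustrative deposit that lacks one of the assumed required items.
deposit = {"element": "Fe", "edge": "K", "energy_unit": "eV"}
print(missing_metadata(deposit))
```

A repository could run such a check at deposition time and ask the depositor to supply whatever is reported as missing before the data set is accepted.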
In recent years, there has been much discussion about the ‘reproducibility crisis’ in science (Baker, 2016). In brief, there is a perception that there is not sufficient deposition of experimental data sets in open repositories to allow published research findings to be replicated (and, therefore, verified or challenged). The FAIR data principles mentioned above arose at least in part to satisfy the requirements for reproducibility, and there is no doubt that a larger proportion of scientific research is now accompanied by the deposition of well characterized data sets.
There has been some pushback against this sense of crisis (see, for example, Fanelli, 2018), and it is fair to say that the level of concern depends on factors such as the scientific domain, the impact of research findings on human wellbeing and the cost of conducting the experiment, as well as the availability of tests for determining the extent to which a repeat study does reproduce the original results. Nevertheless, there is broad agreement that it is desirable to deposit experimental data sets as much as possible in accordance with the FAIR principles.
In the structural science domain, in 2011 the IUCr commissioned a Diffraction Data Deposition Working Group (DDDWG) to consider the rationale and practicalities of routine deposition, initially of X-ray diffraction images. The scope of the working group later extended to other experimental techniques, and its recommendations (Helliwell et al., 2017) continue to be monitored by a successor body, the IUCr Committee on Data (https://www.iucr.org/resources/data/commdat). The working group identified a number of reasons for depositing data sets and making them available alongside a scientific publication (Kroon-Batenburg & Helliwell, 2014):
(i) To enhance the reproducibility of a scientific experiment.
(ii) To verify or support the validity of deductions from an experiment.
(iii) To safeguard against error.
(iv) To better safeguard against fraud than is apparently the case at present.
(v) To allow other scholars to conduct further research based on experiments already conducted.
(vi) To allow reanalysis at a later date, especially to extract `new' science as new techniques are developed.
(vii) To provide example materials for teaching and learning.
(viii) To provide long-term preservation of experimental results and future access to them.
(ix) To permit systematic collection for comparative studies.
Although there is some overlap amongst these reasons, they are broadly applicable to many other experimental methodologies, and certainly to the XAS field. The working group and the Committee on Data went on to conduct further studies and analysis (see, for example, Guss & McMahon, 2014; Coles & Sarjeant, 2020). From these, it became apparent that different communities had different attitudes towards and requirements for the routine deposition of raw data sets. For biological macromolecular structure determination, deposition of raw data images is recommended for IUCr journals (Helliwell et al., 2019), while for chemical crystallography the outcome of a targeted workshop (Diffraction Data Deposition Working Group, 2021) was more nuanced. One particular outcome was the development of a new journal section, Raw Data Letters (Kroon-Batenburg et al., 2022), in IUCrData to provide a home for the discussion of ‘interesting’ data sets, namely those identified by researchers as of interest to methods and software developers for purposes such as reanalysis by newer methods, or those possessing features, potentially relevant to the structural interpretation, that are not amenable to routine analysis techniques.
The longevity and economics of a raw data archive have been discussed by Guss & McMahon (2014), and subsequent practice has not favoured the establishment of curated domain-specific archives, although some of these have been established for biological macromolecules (Grabowski et al., 2016; Kurisu, 2021). An essential requirement for making deposited data sets available over a long time period is the assignment of a persistent URL (i.e. one that is not dependent on specific server-based domain names or an underlying method of serving to web browsers). A resource that is currently favoured by many researchers for depositing their scientific data sets is Zenodo (https://zenodo.org), which was developed under the European Union OpenAIRE programme and is operated by CERN. Zenodo will provide for each submission a persistent URL in the form of a Digital Object Identifier, thus providing for the accessibility requirement of the FAIR principles.
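The way in which a Digital Object Identifier acts as a persistent URL can be sketched as follows. The helper name `doi_to_url` is hypothetical, and the example uses the DOI of the present chapter; the only assumption about infrastructure is that the public resolver at https://doi.org/ redirects a DOI to wherever the object is currently hosted.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn a bare DOI (or a 'doi:'-prefixed string) into a resolver URL."""
    doi = doi.strip()
    # Remove a leading 'doi:' label or an already-present resolver prefix.
    for prefix in ("doi:", DOI_RESOLVER):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' readable; percent-encode characters unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1107/S1574870723004585"))
```

The citing publication records only the identifier; if the hosting server or its domain name changes, the registered target of the DOI is updated and the cited link continues to work.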
The findability of such deposited data sets currently relies primarily on hyperlinks to the deposition from associated journal publications, and Zenodo itself offers only limited search capability. The metadata characterizing each deposition are rather general in nature (as befits a general-purpose resource). The archive may be explored programmatically using OAI-PMH and REST API protocols, which support a metadata schema developed by the DataCite organization for the characterization of research data sets (DataCite, 2017). As argued by Guss & McMahon (2014), domain-specific repositories can provide much richer metadata and thus enhance the findability of deposited data sets, and the XAS community may wish to carry out its own cost–benefit analysis of the merits of developing such a dedicated resource.
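Programmatic exploration of the archive might begin with request URLs such as those built in the following Python sketch. The endpoint paths (`/api/records` for REST searches and `/oai2d` for OAI-PMH) and the `oai_datacite` metadata prefix are assumptions based on Zenodo's public documentation and should be verified against the current service before use; the sketch only constructs the URLs and makes no network request.

```python
from urllib.parse import urlencode

ZENODO_REST = "https://zenodo.org/api/records"  # assumed REST search endpoint
ZENODO_OAI = "https://zenodo.org/oai2d"         # assumed OAI-PMH base URL


def rest_search_url(query, size=10):
    """Build a REST search URL for a free-text query (not sent here)."""
    return f"{ZENODO_REST}?{urlencode({'q': query, 'size': size})}"


def oai_list_records_url(metadata_prefix="oai_datacite"):
    """Build an OAI-PMH ListRecords request for DataCite-style metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{ZENODO_OAI}?{urlencode(params)}"


print(rest_search_url("XAFS spectrum", size=5))
print(oai_list_records_url())
```

A free-text query of this kind can only match the general-purpose metadata supplied with each deposition, which is the limitation on findability noted above.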
References
Baker, M. (2016). Nature, 533, 452–454.
Chantler, C. T. (2024). Int. Tables Crystallogr. I, ch. 9.1, 1059–1067.
Coles, S. J. & Sarjeant, A. (2020). IUCr Newsl. 28, https://www.iucr.org/news/newsletter/volume-28/number-1/raw-data-availability-the-small-molecule-crystallography-perspective.
DataCite (2017). DataCite OAI Schema v.1.1. https://schema.datacite.org/oai/oai-1.1/.
Diffraction Data Deposition Working Group (2021). Workshop on When Should Small Molecule Crystallographers Publish Raw Diffraction Data? https://www.iucr.org/resources/data/commdat/prague-workshop-cx.
Fanelli, D. (2018). Proc. Natl Acad. Sci. USA, 115, 2628–2631.
Grabowski, M., Langner, K. M., Cymborowski, M., Porebski, P. J., Sroka, P., Zheng, H., Cooper, D. R., Zimmerman, M. D., Elsliger, M.-A., Burley, S. K. & Minor, W. (2016). Acta Cryst. D72, 1181–1193.
Guss, J. M. & McMahon, B. (2014). Acta Cryst. D70, 2520–2532.
Hall, S. R. & McMahon, B. (2005). International Tables for Crystallography, Vol. G, edited by S. R. Hall & B. McMahon, pp. 2–10. Dordrecht: Springer.
Hall, S. R. & McMahon, B. (2016). Data Sci. J. 15, 3.
Helliwell, J. R., McMahon, B., Androulakis, S., Szebenyi, M., Kroon-Batenburg, L., Terwilliger, T., Westbrook, J. & Weckert, E. (2017). Final Report of the IUCr Diffraction Data Deposition Working Group. https://www.iucr.org/resources/data/dddwg/final-report.
Helliwell, J. R., Minor, W., Weiss, M. S., Garman, E. F., Read, R. J., Newman, J., van Raaij, M. J., Hajdu, J. & Baker, E. N. (2019). Acta Cryst. D75, 455–457.
Hester, J. R. (2024a). Int. Tables Crystallogr. I, ch. 7.1, 857–860.
Hester, J. R. (2024b). Int. Tables Crystallogr. I, ch. 7.2, 861–862.
Hey, T., Tansley, S., Tolle, K. & Gray, J. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond: Microsoft Research.
Kroon-Batenburg, L. M. J. & Helliwell, J. R. (2014). Acta Cryst. D70, 2502–2509.
Kroon-Batenburg, L. M. J., Helliwell, J. R. & Hester, J. R. (2022). IUCrData, 7, x220821.
Kurisu, G. (2021). Acta Cryst. A77, C667.
Ravel, B., Hester, J. R., Solé, V. A. & Newville, M. (2012). J. Synchrotron Rad. 19, 869–874.
Trevorah, R. M., Chantler, C. T. & Schalken, M. J. (2019). IUCrJ, 6, 586–602.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J. G., Groth, P., Goble, C., Grethe, J. S., Heringa, J., 't Hoen, P. A. C., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J. & Mons, B. (2016). Sci Data, 3, 160018.