Data mining. I. Machine learning in crystallography

Hofmann, D. W. M.; Kuleshova, L. N.

doi:10.1107/S1574870721008235

International
Tables for
Crystallography
Volume C
Mathematical, physical and chemical tables
Edited by T. R. Welberry

pdf

International Tables for Crystallography (2022). Vol. C. ch. 1.4,
https://doi.org/10.1107/S1574870721008235

Chapter 1.4. Data mining. I. Machine learning in crystallography

Detlef Walter Maria Hofmann^a ^* and Liudmila Nikolaevna Kuleshova^b

^aCRS4, Loc. Piscina Manna 1, 09010 Pula, Cagliari, Italy, and ^bFlexCryst, Schleifweg 23, 91080 Uttenreuth, Germany
Correspondence e-mail: dwmhofmann@aol.com

The application of data-mining and machine-learning techniques in crystallography is discussed and insights are presented into the development of force fields for molecular modelling. The mathematical concepts behind the latest developments in data mining in crystallography are highlighted. The importance of anomaly detection and precise analysis of outliers in the cleansing of structural information in databases, and validation of data-mining models, is also considered.

Keywords: crystallography; data mining; force fields; machine learning; crystal structure prediction.

1. Introduction

The scientific community became aware of the value of crystal structure data immediately after the discovery of the diffraction of X-rays by crystals and the realization that this could be used to determine crystal structures (Laue, 1913 ; Bragg & Bragg, 1913 ). Crystallographers soon realized that detailed crystal structure information about the arrangements of molecules and atoms was new knowledge that could contribute to all fields of science, from applied physics to biology. Initially, the crystallographic data were recorded in books; see, for example, Ewald & Hermann (1931 ) and Hendricks (1930 ). For several generations of students, memorizing these structures was part of their education. Just 37 crystal structures were stored in the first book of this kind. Today, improved and automated methods of crystal structure solution have boosted the number of known crystal structures to in excess of a million, many of them stored in the Cambridge Structural Database (CSD) managed by the Cambridge Crystallographic Data Centre (CCDC).

As early as the 1930s, analysis of crystallographic data had already provided important knowledge about the radii of atoms and typical lengths of chemical bonds and non-covalent interactions (Pauling, 1939 ). Based on these data, many theoretical models of chemical bonding were tested; see Pauling (1939 ) and references therein. In parallel, while some crystallographers tried to derive rules from and find patterns within the data, for example Donohue (1952 ), others attempted to use these rules for crystal structure prediction (CSP); for example Kitaigorodskii (1957 , 1961 ).

An important milestone in data collection was established in the 1960s with the introduction of electronic media for the storage and management of data. As a result, electronic databases such as the Cambridge Structural Database (CSD), the Inorganic Crystal Structure Database (ICSD) and, most recently, the Protein Data Bank (PDB), were developed. These databases provided new impulse to systematic studies. Bürgi & Dunitz (1994 ) gave many examples of how `a wealth of valuable but hitherto unused information can be extracted from available structural data'. Examination and comparison of large numbers of crystal structures aided the development of the theory of supramolecular synthons by Nangia & Desiraju (1998 ), and contributed to previous works in this field by Kuleshova & Zorky (1980 ), Etter et al. (1990 ) and Etter (1991 ). These ideas have in turn led to the development of a new scientific discipline – crystal engineering, as a part of materials science.

Manual extraction of patterns and trends from data had to be automated as the volume of data increased. Early methods of identifying patterns in data include application of Bayes' theorem (dating from the 1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology has increased the rate of data collection and storage. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. Data mining has appeared as the tool of choice for extracting hidden patterns and new knowledge from data by using computing power and applying new techniques of machine learning. This has been aided by such discoveries in computer science as neural networks (1943), clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). A volume dedicated to the application of some techniques of data mining and machine learning in crystallography has been published as part of the series Structure and Bonding (Hofmann & Kuleshova, 2010 ). Some further developments in this field will be described in this chapter, which is organized as follows. In Section 2 we will introduce the overall concept of data mining and the terminology used. In the following sections, we will highlight the main tasks of data mining in crystallography and the methods for handling these tasks. The application of methods such as clustering (Section 3 ), classification and regression (Section 4 ), anomaly detection (Section 5 ), and dimensionality reduction (Section 6 ) will be discussed based on recent trends in organic crystallography.

2. Key concepts and definitions

There is no field in science where data mining and machine learning have not found an application. This diversity of applications has resulted in a huge amount of highly specialized literature, with diverse terminology, ways of presentation and notation. This is exacerbated by the rapid evolution of methods and tools for data mining, and the explosive growth of computational possibilities. In the different areas of crystallography (inorganic, organic and protein) the tasks of data mining (and accordingly the terminology and choice of methods of machine learning) differ significantly (Hofmann & Kuleshova, 2010 ). Therefore, we give below some of the definitions and terminology used in the present chapter, following Fayyad et al. (1996 ). Even though most attention will be paid to the applications of data mining in organic crystallography, the basic approaches in the other areas will be mentioned.

Data mining is a very powerful suite of tools and algorithms for analysing data and predicting properties. Sometimes the terms data mining and machine learning are conflated or mixed up (Mannila, 1996 ). Here, we will distinguish between data mining, which is the scientific task (or problem to be solved), and machine learning, which is the set of mathematical concepts and algorithms used to tackle that task. Broadly speaking, data mining has two `high-level' primary goals: pattern recognition (description) and pattern prediction (Fayyad et al., 1996 ). Both functions are very important for modern crystal engineering and form the foundations for understanding the behaviour of materials. They serve as a basis for deriving correlations, trends, clusters, trajectories and anomalies among disparate data. The interpretation of these patterns is intrinsically tied to an understanding of materials physics and chemistry. In contrast to other procedures, this approach does not need any a priori hypothesis about the behaviour of the system or its structure–property relationships, and therefore allows the description of highly complex processes.

The key data-mining components are shown in Fig. 1 . In this scheme the `Experimental data' (measured or simulated properties of objects) serve as an initial driving force (input information). During data mining the experimental input data will be connected with the `Target property' (for example, force fields for molecular modelling) through `Model development', while the parameters of the model will be trained, refined and validated using the methods of `Machine learning', such as recursive fitting, decision trees, support vector machines (SVMs), singular value decomposition (SVD), neural networks (NNs) and genetic algorithms (GAs). Ultimately, it is these models that can then be used for crystal engineering and design, to obtain the desired properties `intelligently' rather than through trial and error. It is inherent in data mining that during the training not only will the model be improved and the parameters of the model be refined, but also that the initial data set will be cleansed, subsequent to receiving feedback from the training results (outputs). The mechanisms of this scheme are intended to stress the strong coherence between the experimental data, the mathematical concepts and the development of predictive models.

Figure 1

Data mining requires close cooperation between four scientific areas: experimenters, who collect data; theoreticians, who formulate reasonable models; the application, which formulates worthwhile aims; and data miners, who connect the other three groups. Adapted from Hofmann & Kuleshova (2018 ), with permission from Elsevier.

2.1. Experimental data

In crystallography (as in all other fields) obtaining and handling experimental data (including data collection, data cleansing and data warehousing) is usually the first, and always a very important, stage. The comprehensiveness of the structural data collection, the structure and quality of the data sets, and the selection of relevant data sets are extremely important for getting reasonable results from the fully automatic procedures of data mining. The application of data mining in crystallography is natural because crystallographic data sets are very large. They are stored in the three main databases: the CSD, the main repository for small-molecule organic and metal-organic crystal structures (more than 1 160 000 entries); the ICSD (completely identified inorganic crystal structures; more than 185 000 entries); and the PDB (biological macromolecular structures; more than 187 000 structural hits). The Powder Diffraction File from the International Centre for Diffraction Data, with over 1 076 000 powder diagrams, should be mentioned too. For discussions of these databases, we refer the reader to numerous reviews and publications, for example Buchsbaum et al. (2010 ), Allen (2002 ) and Berman (2008 ). Recent developments in CSD tools have been summarized by Groom et al. (2016 ).

The collections of experimental data serve as input information in the data-mining procedure. In the data-mining literature one can also find the terms `input', `input object', `input vector', `input variables' and so on. In the following we will use the term input vector, $[\boldnu]$ . The input vector in data mining can contain the values of the various descriptors¹ for an individual object, which can be given their own weights w (in the following the terms weight vector and weight matrix are used), reflecting the importance of a given descriptor for a given input vector. The input vector can be presented as a list of values or as a row in a table (matrix). In a more general way, the `descriptors' are the template for defining individual objects. In crystallography the individual objects are obviously the crystals, and the most widespread template for describing individual crystals and their structures is the Crystallographic Information Format (CIF) [see International Tables for Crystallography, Volume G (Hall & McMahon, 2006 )]. The CIF template contains fields for the unit-cell parameters, crystal symmetry, atomic coordinates, molecular formula, crystal density, data on the X-ray experiment, temperature, pressure and so on. Which of these descriptors will be used in the input vector depends on the target property or pattern of interest. To complete the input vector, data filtering is normally applied. Here it is worthwhile to note that in the following discussions we will distinguish between the filtering and the cleansing of data. While the filtering serves to extract subsets from the whole data set according to some features, for example the experimental conditions, cleansing is the process of detecting and correcting (or removing) incomplete, incorrect, inaccurate or irrelevant parts of the data.

2.2. Data analysis and model development

Data analysis and model development are consecutive, and strongly correlated, steps that use the algorithms of machine learning. The choice of appropriate instruments and approaches (or their combination) depends on the current task. The global goals of `pattern recognition' (description) and `pattern prediction', mentioned above, are normally achieved in crystallography within the following most important data-mining tasks:

Clustering – This is the task of discovering groups in the data that were not predefined. The members of each group are `similar' in some way. Cluster analysis, when integrated with the proper similarity function, can serve as a powerful tool for crystal structure prediction, crystal structure determination and rapid screening of databases. The applications of cluster analysis in crystallography will be discussed in Section 3 .

Anomaly detection – This does the opposite to clustering, by identifying data or observations significantly different from the norm. The ability to detect such anomalies (or outliers) is critical in materials science because it can help to identify a new class of materials with an unusual property, or to anticipate potential harmful effects which would normally only appear through a retrospective analysis. Anomaly detection can be used to isolate faulty data in databases and clean initial data sets. It can identify unusual data records that might be interesting and reward further investigation. Some examples of uses of anomaly detection in crystallography will be given in Section 5 .

Classification – This is the task of finding a function that arranges the data into predefined groups that may have well behaved correlations and can form the basis of physically and statistically based models. Classification allows estimation of physical properties as well as the derivation of model parameters (for example, parameters of force fields). Classification predictive modelling approximates a mapping function from input vectors to discrete output vectors (for more details see Section 4 ).

Regression – Regression attempts to find a function that maps the input vectors onto predefined target values. These target values are the measurement of a physical property, which should be predicted by the parameterized model. Regression predictive modelling maps the obtained function from input vectors (variables) to a continuous output vector. The success of the regression models also helps to refine the usefulness and relevance of the input vectors. Some results obtained with regression will be given in Section 4 .

Summarization (dimensionality reduction) – This provides a more compact representation of the data set, including visualization and report generation (Section 6 ).

From a mathematical point of view, the tasks and the algorithms to manage the tasks can be classified as unsupervised or supervised methods. Cluster analysis is an unsupervised method, whereas classification and regression are supervised methods. Anomaly detection and summarization can be either (or both). For example, summarization (also known as dimensionality reduction) is the process of reducing the number of random descriptors (also referred to in the literature as `variables', `attributes' or `features') needed to describe the variations in the data set. In some cases, data analysis such as regression or classification can be done in the dimensionally-reduced space more accurately than in the original space. It can be unsupervised (e.g. principal component analysis, PCA, and singular value decomposition, SVD), or supervised (e.g. linear discriminant analysis, LDA). The algorithms most often used in crystallography are recursive fitting, decision tree, support vector machines (SVMs), singular value decomposition (SVD), neural networks (NNs) and genetic algorithms (GAs). Some of them have specific applications, others are broader in application and can be used in both unsupervised and supervised tasks. For example, recursive fitting was used in Hofmann et al. (2007 ) for deriving a `reactive' potential for water. SVMs for classification tasks have been used and described in detail in a series of works (Hofmann & Apostolakis, 2003 ; Apostolakis et al., 2001 ; Hofmann, 2010 ). The application of SVD will be described in Sections 4.2 [see also Hofmann (2002 )] and 4.3.4 .

Neural networks (NNs) were first described by McCulloch & Pitts (1943 ) and achieved enormous popularity in the 1990s (Craven & Shavlik, 1997 ; Boser et al., 1992 ), especially in protein crystallography [see, for example, Riis & Krogh (1996)]. The neural-network algorithm was attractive owing to its high flexibility and ability to fit any type of function. However, the choice of network architecture is not always trivial and therefore the method often produces incomprehensible models and requires long training times in classification tasks (Maimon, 2010 ; Zhang, 2010 ; Nikam, 2015 ). These days, these types of problems are addressed in a significantly simpler way by SVMs (Apostolakis, 2010 ). Even so, both algorithms (SVMs and NNs) are used in data mining to manage classification and regression tasks.

It is important to note that SVD allows for the physical interpretation of results (weight vectors, mathematical models and parameters, Section 4.3.4 ), while for NNs this is not possible owing to the large number of parameters in the concatenated matrix (Fig. 2 ).

Figure 2

Generalized NN flowchart for predicting the structure of ice. The sum formula is multiplied by the weight matrix to propagate from one layer to the next layer. The multiplication in the last layer produces the desired crystal-structure properties as the output vector.

2.3. Target property (pattern)

The final step of knowledge discovery from data is to verify that the patterns produced by the data-mining algorithms occur in the wider data set. Not all patterns found by the data-mining algorithms are necessarily valid. It is common for the algorithms to find patterns in the training set that are not present in the general data set. This is called `overfitting', and often happens if a small number of data points are used. To overcome this, the evaluation uses a test set of data that the data-mining algorithm was not trained on. The learned patterns are applied to this test set and the resulting output is compared with the desired output. A number of statistical methods may be used to evaluate the algorithm, such as receiver operating characteristic curves. If the learned patterns do not meet the desired standards, then it is necessary to re-evaluate and change the preprocessing and data mining. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

If databases also contain target values, or if a measurement for the desired target property exists for at least some input vectors, the supervised learning algorithms `learn' how to predict these values with new records when the output is unknown. More generally, they can also consider several properties of interest. In this context `the target property (pattern)' might be more appropriately defined as the output vector or output variables. In the data-mining literature the values being predicted are also called the `dependent variables', `output variables', `target variables' or `outcome variables'.

3. Clustering

Cluster analysis is an unsupervised method, and as an instrument of `pattern recognition' assigns crystal structures to different groups (clusters) such that the structures in each subset share some common traits. These days it is widely used in crystallography, for example in crystal structure prediction (CSP), where a huge number of generated hypothetical structures can be converged to the same structure, so removing duplicates to produce a list of unique structures (see Section 3.4 ). It is also useful in crystal structure determination (see Section 3.5 ), where, for example, thousands of simulated powder diagrams have to be compared with the experimental powder diagram; or for testing the databases themselves to find polymorphs, manifold entrances and erroneous data; see Section 5 .

3.1. Clustering algorithms

Data clustering according to similarity is a common technique for statistical data analysis. It is used in many fields, including machine learning, pattern recognition, image analysis and bioinformatics (Duda & Hart, 1973 ). Clustering is normally performed using one of two types of clustering algorithms; hierarchical or partitional (Fig. 3 ). Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Partitional algorithms divide a set, $[{\cal X}]$ , into non-overlapping `parts' (or `blocks' or `cells') that cover all of $[{\cal X}]$ , Fig. 3 (a). More formally, these cells are both collectively exhaustive and mutually exclusive with respect to the set being partitioned. Hierarchical algorithms can be agglomerative (`bottom-up') or divisive (`top-down'), Fig. 3 (b). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Two-way clustering, co-clustering or bi-clustering, which should be mentioned here, are the names for clustering where not only the objects but the features of the objects are clustered (i.e., if the data are represented in a data matrix, the rows and columns are clustered simultaneously).

Figure 3

(a) Partitional and (b) hierarchical clustering.

The most popular algorithm for clustering in biology, pharmacy and crystallography is hierarchical clustering, with the traditional representation of the hierarchy as a tree, or a dendrogram, Fig. 3 (b), with individual elements at one end and a single cluster containing every element at the other. Familiar examples of dendrograms are family trees and the evolutionary tree of Darwin. Cutting the tree at a given height will give a clustering at a selected precision, e.g. in family trees different levels of relationship can be selected. Cutting at zero gives identities.

The common algorithms for agglomerative hierarchical clustering are single linkage, complete linkage, and average or centroid linkage. These methods differ in how the distance (similarity) between each cluster is measured. Clustering is started by assigning each structure of the set its own cluster. For example, if we have set of 50 crystal structures, we will define 50 different clusters. Clusters are then successively merged according to the calculated distance (similarity) between elements of the clusters. The distance between two clusters $[{\cal A}]$ with elements a_i and $[{\cal B}]$ with elements b_j (Fig. 4 ) is defined by minimum, maximum or average values of a similarity s_ij (distance) (Apostolakis, 2010 ) between elements of different clusters as follows.

Figure 4

Hierarchical clustering according to similarity.

Single linkage. In single-linkage clustering, the distance between two clusters $[{\cal A}]$ and $[{\cal B}]$ is equal to the minimal (shortest) distance between elements a_i and b_j in each cluster: $[d^{\rm min}_{{\cal A}{\cal B}} = \min{\{\,s_{ij}:a_{i}\in{\cal A}\,,b_{j} \in{\cal B}\,\}}.\eqno(1)]$ It is therefore very efficient and simple to understand; it tends, however, to produce large elongated clusters that are not necessarily internally compact. The reason for this is that the merging criterion does not take the overall structure of the new cluster into account.

Complete linkage. In complete-linkage hierarchical clustering, the distance (similarity) between two clusters $[{\cal A}]$ and $[{\cal B}]$ is defined as the maximal (longest) distance between elements a_i and b_j in each cluster: $[d^{\rm max}_{{\cal A}{\cal B}} = \max{\{\,s_{ij}:a_{i}\in{\cal A}\,,b_{j} \in{\cal B}\,\}}.\eqno(2)]$ Complete linkage tends to create many small clusters at intermediate levels of the cluster tree. While complete-linkage clustering is not able to identify clusters of complex (e.g. elongated) structure, even when they clearly exist in the data, it has often been observed in practice to produce very useful hierarchies.

Average or centroid linkage. In average-linkage hierarchical clustering, the distance between two clusters $[{\cal A}]$ and $[{\cal B}]$ is defined as the average distance between each element a_i in one cluster to every element b_j in the other cluster: $[d^{\rm average}_{{\cal A}{\cal B}} = {1\over{N_{{\cal A}}N_{{\cal B}}}} \sum_{i}^{N_{{\cal A}}}\sum_{j}^{N_{{\cal B}}}s_{ij},\eqno(3)]$ where $[N_{\cal A}]$ and $[N_{\cal B}]$ are the number of elements in the clusters. While average-linkage clustering can overcome the problems of the other two methods, it requires more computation. In some versions the average distance to the centroid is taken, which then again corresponds to a minimization of the squared error, in the sense that the two clusters are merged that lead to the lowest possible (average) deviation from the cluster representative (the centroid).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there are a sufficiently small number of clusters (number criterion). A graphical representation of the results of hierarchical cluster analysis is the dendrogram, Fig. 4 . The branches represent clusters obtained at each step of hierarchical clustering.

We also mention k-means clustering, a simple partitional algorithm which provides good results when the number of required clusters is predefined, or the user has a reasonable idea what it is (Rajan, 2010 ). The results of k-means clustering are often presented as `structural maps'. The principal components obtained from PCA may be used as indicators for k-means clustering (Ding & He, 2004 ); see, for example, Balachandran & Rajan (2012 ) and Rajan (2010 ). This approach is widely used for crystal structure (and property) prediction in inorganic crystallography.

3.2. Similarity functions

3.2.1. The search for similarity

It is well known that directly comparing crystal structures, based on space groups and cell parameters, can wrongly find two structures to be different because they are described in different ways. This can occur because different unit-cell vectors were selected, or by choosing different space groups during X-ray analysis. For example, if a unit cell is described by the unit-cell vectors a, b, c, it can always be transformed into another crystal structure with a′, b′, c′, where the transformed vectors can be represented by a linear combination of the original vectors. Another problem that often arises is the result of the molecule itself having some symmetry. For example, if, in a crystal structure of space group $[P\bar{1}]$ , the molecule takes a position on the crystallographic inversion centre, the structure could be described in space group P1 as well. Several ways to avoid ambiguity in crystal structure description have been suggested. As an example, the program CRYCOM uses the atomic coordinates in addition to the space group and unit cell (Dzyabchenko, 1994 ). The similarity between two structures is evaluated by calculating the root-mean-square deviation (r.m.s.d.) between structures, based on a comparison of the fractional atomic coordinates for the reference molecule, suitably transformed by space-group operators for maximum overlap. Another example is Polymorph Predictor, which uses, in addition to the lattice vectors of a reduced elementary cell and the space-group symmetry, comparison of the positions of the molecular centres (van Eijck & Kroon, 1998 ). It is also possible to use a radial distribution function that combines atomic coordinates with partial atomic charges to take into account intermolecular interactions (Verwer & Leusen, 1998 ).

Modern crystal structure engineering needs a fully automated computational method for comparing and clustering crystal structures. In the past, the obstacle for computational clustering of crystal structures was the lack of an appropriate similarity (or dissimilarity) index that would fulfil three main requirements: (1) the similarity between crystal structures should be evaluated quickly; (2) the algorithm should recognize similar crystal structures, even if they are described in different unit cells and/or space groups; and (3) it should allow control over the convergence of the calculation.

It was found that a successful algorithm for similarity search in crystal structures should compare distances between selected atomic positions rather than lattice vectors or space-group symmetry. This was achieved in, for example, the program COMPACK, which uses the relative positions and orientations of molecules (Chisholm & Motherwell, 2005 ). Other ways to determine the similarity of crystal structures independent of the choice of origin and setting of the unit cell are by comparing crystal structures in reciprocal space, or by direct comparison of computed powder diagrams (Karfunkel et al., 1993 ; Putz et al., 1999 ; de Gelder et al., 2001 ), or by comparing integrated powder diagrams (Hofmann & Kuleshova, 2005 ). By using such methods, cluster analysis (clustering) has become one of the most important data-mining instruments in crystallography.

3.2.2. Comparing molecular coordination spheres, RMSD

This algorithm (Chisholm & Motherwell, 2005 ; Lommerse, 2000 ; Motherwell, 2010 ) is based on the original idea of Gavezzotti & Filippini (1995 ) to compare the coordination sphere of molecules around a central molecule in the crystal structure. The algorithm is used in the program COMPACK and is implemented in the Materials Module of Mercury (Macrae et al., 2020 ).

The reference crystal structure (for example, benzene, Fig. 5 ) is analysed and represented by a coordination shell of the central molecule and its N closest neighbouring molecules. The precise arrangements of the molecules in the cluster, their relative positions and their orientations are captured using a set of interatomic distances drawn between nearest-neighbour molecules. Once a crystal structure, or part of a crystal structure, has been represented in this way, the task of searching for structural similarity becomes a matter of searching the three-dimensional coordinates of a target structure for an arrangement of molecules that matches the search query within a specified tolerance. If the structures match within the specified geometric tolerances (i.e., distances and angles) then the coordination shells are overlaid and the root-mean-squared deviation of the atomic positions is calculated for all matching molecules. This method was used for an analysis of the results of CSP in the blind tests organized by the CCDC (Bardwell et al., 2011 ; Day et al., 2009 ).

Figure 5

The coordination shell of 15 molecules in benzene [prepared with Mercury (CCDC, 2016b )].

Currently, comparison of coordination shells is recognized as the most precise method of determining the similarity of crystal structures. Nevertheless, the computational expense of the method calls for a more appropriate approach to bring about the possibility of automated screening and clustering.

3.2.3. Comparing crystal structures in reciprocal space and the `integral similarity index', s(i, j)

Powder diffraction diagrams are independent of the choice of origin and setting of the unit cell and, being one-dimensional, do not need to be re-oriented to find the best fit. However, comparison of powder diffraction patterns is very sensitive to shifts in peak positions and so depends on the temperature and pressure at which the data were collected. Thus, the direct comparison of powder diagrams point-by-point is problematic. Two possible corrections have been suggested by van der Streek & Motherwell (2005 ): the normalized similarity measure (de Gelder et al., 2001 ), and volume normalization based on Hofmann's average atomic values (Hofmann, 2002 ) (see also Section 4.2 ).

The similarity index s(i, j) introduced by Hofmann & Kuleshova (2005 ) is based on the comparison of integrated powder diagrams instead of powder diagrams themselves, and has to be introduced for each pair in a set $[{\cal A}]$ of crystal structures. The index is defined as the mean difference between the integrals of powder diagrams PD₁ and PD₂, and is proportional to the area between these two curves.

In the first step, the PDs should be normalized for the scattering angles θ and the intensities of the reflections I so that $[{\rm PD}(\theta,I)\ \mapsto\ {\rm PD}^{\rm norm}(\theta^{\rm norm},I^{\rm norm}),\eqno(4)]$ where $[\theta^{{\rm norm}} = {\theta-\theta_{{\rm min}}\over\theta_{ {\rm max}}-\theta_{{\rm min}}}]$ and $[I^{{\rm norm}}_{i} = {I_{i}\over\textstyle\sum_{i = 0}^{{\rm all}}I_{i}}.]$ For a given PD, θ_min and θ_max are the minimum and maximum scattering angles, and I_i is one of the reflections of the PD. The normalization will guarantee in the following the boundedness of the similarity index, since the intensities and the angles are bounded now to the interval between zero and one: $[\theta^{{\rm norm}},I^{{\rm norm}}\in[0,1].\eqno(5)]$

In Fig. 6 , we illustrate the evaluation of the similarity index, s(i, j), for the case of powder diagrams of the experimental and the minimized crystal structure of a small molecule, 1-aza-2-cyclobutanone (CSD refcode FEPNAP), determined by Yang et al. (1987 ). In the positive region in Fig. 6 , the calculated normalized powder diagram (red) for the minimized structure, $[{\rm PD}^{{\rm norm}}_{i}]$ , is plotted; in the negative region, the calculated normalized powder diagram (black) for the experimental structure of FEPNAP, $[{\rm PD}^{{\rm norm}}_{j}]$ , is shown. The powder diagrams look different, since the force field used for minimization distorts the experimental crystal structure slightly. However, the similarity index stays small and clustering visualizes the similarity of the structures.

Figure 6

The simulated powder diagrams of the experimental (black) and the minimized (red) crystal structures of FEPNAP, along with their anti-derivative, F(x), (grey) and corresponding similarity index (blue). Reprinted from Hofmann et al. (2009 ), with permission from Elsevier.

The subtraction of the second normalized powder diagram, $[{\rm PD}^{{\rm norm}}_{j}]$ , from the first, $[{\rm PD}^{{\rm norm}}_{i}]$ , gives us the anti-derivative, F(x), shown in Fig. 6 in grey: $[F(x) = \textstyle\int\limits_{0}^{x}\left({\rm PD}^{{\rm norm}}_{i}-{\rm PD}^{{\rm norm}}_{j}\right)\,{\rm d} \theta^{{\rm norm}}.\eqno(6)]$ The integration of the absolute value of the anti-derivative, F(x), gives us the similarity index, s(i, j): $[s(i,j) = \textstyle\int\limits_{0}^{1}\left|F(x)\right|\,{\rm d}x.\eqno(7)]$

The similarity index defined in this way has the properties of commutativity, identity and boundedness. This means that the similarity of two crystal structures does not depend on the order of the subtraction of the powder diagrams: $[s(i,j) = s(j,i)\quad \forall\,i,j\,\in\,{\cal A}\quad {\rm (commutativity)}.\eqno(8)]$ For identical powder diagrams, the similarity index is zero: $[s(i,i) = 0\quad\forall\,i\,\in\,{\cal A}\quad{\rm (identity).}\eqno(9)]$ For all other cases, it has a positive value and ranges from one to zero: $[0\leq s(i,j)\leq 1\quad\forall\,i,j\,\in\,{\cal A}\quad{\rm (boundedness).}\eqno(10)]$

This similarity index allows us to compare several hundred crystal structures within a few seconds and does not depend on the size of the molecule.

3.3. Clustering of polymorphs and structural correlations

The calculation of the similarity indexes and subsequent clustering can be used for rapid direct screening of databases to find polymorphic and isostructural crystal structures and manifold entrances. This is because we have to take into account only a very limited number of reflections. The calculation time depends on the choice of the interval θ_min and θ_max. For simulations of powder diagrams the interval from 0° to 60° was used in the work we describe here. The upper value is large enough to include reflections even for very small cells, and there is no limit for the lower value in simulated powder diagrams. In experimental powder diagrams at small angles there is a blind region and the minimum angle is greater than zero.

Specific examples of cluster analysis help us to visualize and to understand the effects of external conditions on structural changes in crystals.

As an example, we show the clustering of benzene polymorphs, which allowed the detection and analysis of structural changes in polymorphs caused by temperature and pressure variations (Hofmann & Kuleshova, 2014 ). The set of 15 structures was obtained by extracting the experimental crystal structures with atomic coordinates from the CSD. The results of clustering are presented in Fig. 7 . The clustering threshold is set at s(i, j) = 0.012, corresponding to the RMSD similarity criterion used for the Cambridge Blind Test (Day et al., 2009 ; Bardwell et al., 2011 ; Reilly et al., 2016 ). The clustering clearly revealed high-pressure polymorphs II (16, 17) and III (03, 04), which exist over a narrow pressure interval. The entries for polymorph I, despite being defined over a broad range of temperatures and pressures, are subdivided into three structural groups reflecting the external pressure and temperature conditions: a sub-cluster consisting of entries 00 (indicated just as BENZEN in Fig. 7), 02, 15 and 18 (defined at quasi-ambient conditions), a single structure 12 (at extreme high pressure) and a cluster comprised of entries 01, 07, 06, 14, 13 and 11 (defined at non-ambient conditions). The crystal densities in the quasi-ambient and non-ambient clusters differ significantly: from 1.02 to 1.12 g cm⁻³ in the first cluster and from 1.14 to 1.26 g cm⁻³ in the second cluster.

Figure 7

Clustering of benzene polymorphs reflects structural changes caused by temperature and pressure variations.

3.4. Clustering in crystal structure prediction (CSP)

Clustering followed by summarizing of results is a crucial instrument in CSP, where a huge number of generated crystal structures have to be analysed for similarity, and multiple and manifold structures have to be removed. A simple ranking based on an energy criterion does not work here, because, firstly, several hundred crystal structures can be found for one molecule within a very narrow energy range and, secondly, completely different crystal structures can have nearly identical energies.

As an example, we analyse the results of CSP for alanine, an α-amino acid that is used in the biosynthesis of proteins. It contains an α-amino group (which is in the protonated $[{\rm NH}_{3}^{+}]$ form), an α-carboxylic acid group (which is in the deprotonated COO⁻ form) and a side-chain methyl group, classifying it as a nonpolar (at physiological pH) aliphatic amino acid. The L-isomer (left-handed version) of alanine is one of the 20 amino acids encoded by the human genetic code. The right-handed form, D-alanine, occurs in bacterial cell walls and in some peptide antibiotics. CSP was performed for this molecule with FlexCryst (Hofmann et al., 2016 ). The CSP algorithm used by this program is described in detail in Hofmann (2010 ). It is implemented with a particular energy function (Hofmann & Apostolakis, 2003 ) derived by data mining. This force field is able to reproduce crystal structures with hydrogen bonds (Hofmann et al., 2004 ) and special sorts of atoms, including ions (see Sections 4.3.4 and 4.4 ). Because the reference molecule is optically active, the calculations were performed with the most frequently occurring acentric space groups P2₁ and P2₁2₁2₁. After 10 minutes, 200 structures had been generated with a coverage (completeness of crystal structure generation) of 76% (Hofmann et al., 2009 ). In the following steps the generated crystal structures then needed to be clustered, manifold structures needed to be removed and the resulting structures needed to be compared with the experimental one.

For alanine, additional problems arise: the CSD contains in total more than 90 entries for crystal structures of alanine. Among them are 48 entries for L-alanine (eight defined at ambient conditions of temperature and pressure), 21 entries for D-alanine (one at ambient conditions) and 23 entries for DL-alanine (three at ambient conditions). So, firstly, the reference experimental structure had to be chosen. In Fig. 8 we show the clustering of crystal structures that were only determined at ambient conditions. At the threshold s(i, j) = 0.012, two clusters were defined: one cluster with the DL form of alanine (DLALNI) and the other with optically active L and D forms (LALNIN and ALUCAL). Since the similarity works in reciprocal space, the L and D forms fall into one cluster. After the summarization, two structures were retained: ALLUCAL04 (space group P2₁2₁2₁) and DLALNI01 (space group Pna2₁). Both of them were used as reference experimental structures of alanine.

Figure 8

Clustering of experimental structures of alanine from the CSD. Two structures (marked with a cross) are retained as representative after summarization.

The clustering of generated (predicted) crystal structures is shown in Fig. 9 . For clarity of visualization, we present only the 20 top-ranked crystal structures with the lowest values of crystal energy. At the threshold [s(i, j) = 0.012], four clusters were retained; from each cluster only the structure of lowest energy is retained as representative after summarization.

Figure 9

Clustering of predicted crystal structures of L-alanine (the 20 top-ranked structures are presented). From each cluster only the structure of lowest rank (marked with a cross) is retained after summarization.

Finally, the experimental and predicted crystal structures were compared (Fig. 10 ). At the standard threshold of s(i, j) = 0.012, the correctly predicted crystal structure was revealed as that of rank 15.

Figure 10

Comparison of the retained predicted structures with the experimental structures revealed the correct crystal structure to be of rank 15.

Comparing crystal structures within the molecular coordination spheres consisting of the 15 closest neighbours (the criterion in RMSD), performed with Mercury (CCDC, 2016b ), gave the same results. In Fig. 11 we show the overlay of the predicted crystal structure of rank 15 and the experimental crystal structure ALUCAL04.

Figure 11

Overlay of the experimental (grey) and predicted (green) structures of alanine, r.m.s.d. = 0.272.

The method presented here was used for the evaluation of the results in the fourth blind test organized by the Cambridge Crystallographic Data Centre (Day et al., 2009 ), along with the traditionally used estimations of r.m.s.d. values using RMSD (Chisholm & Motherwell, 2005 ) and of energy functions (van Eijck, 2005 ). The results obtained with all three methods were identical. However, the cluster method using similarity introduced by Hofmann & Kuleshova (2005 ), working in reciprocal space, reduced the calculation times by orders of magnitude.

3.5. Clustering in crystal structure determination

The possibility of easily finding isostructural entries in databases hints at the use of clustering for crystal structure determination from powder diagrams. In fact, if the available single crystals are too low quality for single-crystal structure determination and only an experimental powder diagram can be obtained, it is worth performing a search of databases for entries that might be isostructural. If an isostructural entry is found in a database, it can be used as a starting point for finding the crystal structure corresponding to the experimental powder diagram. In this way, the crystal structures of Pigment Red 3 (4-methyl-2-triphenylarsine-2-naphthol, CSD refcode MNIPZN) and Pigment Orange 5 [1-((2,4-dinitrophenyl)azo)-2-naphthol, CSD refcode CICCUN] (Schmidt et al., 2007 ) were solved. In a similar way, the crystal structure Pigment Red 170 was solved: first the crystal structure of the methyl derivative of Pigment Red 170 was found, and then the crystal structure of Pigment Red 170 was found to be isostructural to it (Schmidt et al., 2006 ).

Another example is crystal structure determination from unindexed powder diagrams. The algorithm for this works as follows: first the crystal structures are generated (using CSP), then their powder diagrams are simulated and compared (clustered) with the available experimental powder diagram. Afterwards, the structures most similar to the experimental crystal structure are minimized according to the evaluation function (in this case it is the similarity between the experimental and simulated powder diagrams). In principle, crystal structure determination from powder X-ray data can be considered as a combination of CSP and clustering of a virtual data set. In the case of indexed powder diagrams, the X-ray diffraction patterns can be included as additional information during crystal structure generation as a boundary condition on the lattice parameters.

The advantages of the integral similarity index for crystal structure determination are (Hofmann & Kuleshova, 2005 ): similar powder diagrams can be recognized, even if peaks are shifted; the similarity index is calculated quickly because the evaluation works pointwise; and during refinement, interchanged peaks can be reordered correctly. The most widespread similarity index used for comparing powder diagrams is the mean-square deviation between two diagrams (R value, as used in Rietveld refinement). Unfortunately, this metric is very sensitive to even very small shifts of the peaks, and does not allow minimization to be performed if the unit cells have slightly different parameters. The similarity indices mentioned above (Karfunkel et al., 1993 ; Hofmann & Kuleshova, 2005 ) are tolerant of small cell differences. However, the index proposed by Karfunkel et al. (1993 ) arbitrarily provides either positive or negative values of similarity and therefore cannot be used for refinement.

As an example of a crystal structure determination, we show the case of Pigment Red 181 (Hofmann & Kuleshova, 2005 ). The comparison of simulated powder diagrams of predicted crystal structures with the experimental powder diagram and subsequent structure refinement indicated that the candidate of rank 12 was the most promising crystal structure. The initial value of the similarity index of this structure was 0.036, and this decreased (improved) during refinement to 0.003.

In Fig. 12 , one can observe shifted peaks at 10 and 20° 2θ for the predicted crystal structure. The shift of these peaks (the 010 and 020 peaks) is caused by the predicted cell parameter being 0.4 Å too short. During refinement, this inaccuracy was corrected and the cell parameter was elongated from 9.21 to 9.63 Å. Another problem for refinement would be caused by the two peaks at 25° and 27°. In the powder diagram for the predicted crystal structure, these peaks overlap and are located at 27°. The algorithm correctly recognizes that this peak has to be split and separates it. As for the case of interchanged peaks, other similarity indices would have a local minimum for this situation, making further refinement impossible. The packing of the molecules that was determined is shown in Fig. 13 .

Figure 12

Structure determination of Pigment Red 181. From the predicted polymorphs, the structure giving a simulated powder diagram closest to the experimental powder diagram is selected. After refinement, the powder diagrams are nearly identical. Reprinted from Hofmann & Kuleshova (2005).

Figure 13

Molecular packing of Pigment Red 181. Reprinted from Hofmann & Kuleshova (2005).

4. Classification and regression in crystallography

4.1. Supervised methods and model parameterization

The goal of supervised machine learning is the parameterization of a model function, f, based on labelled training data P^exp, to produce (simulate) the desired property P^sim as well as possible. Ideally P^exp = P^sim. The training data consist of a set of training examples. In supervised learning, each example k is a pair consisting of an input object (typically an input vector $[\boldnu_k]$ ) and an output value, $[P_{k}^{{\rm exp}}]$ . The labelled input objects are different descriptors. In our cases they contain the crystallographic, chemical and some physical properties of the compounds, for example the sum formula, all atom–atom distances, the atom weights or the sublimation heat. The desired output value might be the crystal density, the crystal structure, the solubility, the sublimation heat, the chemical shift in nuclear magnetic resonance (NMR) spectroscopy, or whether the property belongs to a certain class (in particular, the possible existence or non-existence of an object). From the mathematical point of view, we can differentiate several cases of functional relationship between the output training $[P_{k}^{{\rm exp}}]$ and input data $[\boldnu_k]$ .

Linear regression. The model function is linear and the output value is an experimental value. The output value is obtained as scalar product of the input vector $[\boldnu]$ and the weight vector w with the basic equation $[P_{k}^{{\rm exp}} = P_{k}^{{\rm sim}}({\boldnu}_{k}) = w_{0}+{\bf w}{\boldnu }_{k}.\eqno(11)]$

Nonlinear regression. The model function is nonlinear and the output value is an experimental value. There is no basic form of an equation, as nonlinear equations can take many different forms. A few of the most flexible curve-fitting forms are n degrees (n > 1), trigonometric functions, exponents and so on: $[P_{k}^{{\rm exp}} = P_{k}^{{\rm sim}}({\boldnu}_{k}) = f({\boldnu}_{k}^{n}, \sin({\boldnu}_{k}),\cos({\boldnu}_{k}),\exp({\boldnu}_{k}),\ldots).\eqno(12)]$

Classification. The model is nonlinear and the output value is the membership of a class. In this case the non-equality is used (δP is normally introduced to avoid a trivial solution, when all the variables are zero, i.e. the solution is the zero vector): $[P_{k}^{{\rm exp}}\leq P_{k}^{{\rm sim}}({\boldnu}_{k})+\delta P.\eqno(13)]$

Depending on the type of function (equality, non-equality, linear or not) several methods can be used to fit the experimental data and parameterize the model. Recursive fitting, regression and classification are most often used. Which technique to use, classification or regression, depends on the results that we want to achieve. If we are trying to predict which class (or category) something belongs to, we use classification. If we are trying to predict a certain value, regression should be used. The difference between classification and regression is defined by the loss function. In regression and classification, the goal is to minimize the value of the loss function. In regression, the loss function increases the further you are away from the correct value for a training instance. In classification, typically, the loss function is 0 for the right class and 1 for the wrong class (perception). The result is that classification only `cares' if it gets the right class, and there is no measure of how `close' it was (or how likely it is that the class is correct). The `goal' of regression is to predict the training values as closely as possible. In regression, different loss functions affect how being close and being really far off are weighted in relation to all training examples. Regression will be also used in Section 6 , where structure–property relationships, together with the methods of PCA and dimensionality reduction, will be discussed.

4.2. Simple and multiple linear regression (crystal density prediction)

The simplest case of data fitting is `simple linear regression'. Simple linear regression fits a straight line through the set of k points in such a way that it makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible. The solution of the problem is straightforward and well known.

The most widely known application of this tool in crystallography is crystal density prediction. Here, the experimentally determined values of the crystal density serve as the target values and the molecular sum formula serves as the input vector. The sum formula can be considered as the most important input vector $[\boldnu]$ in chemistry. The descriptors in the case of a chemical formula are the elements and the input vector contains the frequency of the elements. Many approaches to quantitative structure–property relation (QSPR) and quantitative structure–activity relation (QSAR) analysis use this vector together with some additional descriptors. In our examples we are using it for the prediction of the densities of crystals.

In principle, if a molecule is built up from different sorts of atoms, this is a multiple linear regression task. By assuming that all non-hydrogen atoms of a molecule are of only one sort, Kempster & Lipson (1972 ) simplified the case to a simple linear regression solution. The result was the so-called 18 Å³ rule: $[V^{\rm sim} = 18\times\boldnu_{\rm atoms}]$ , where V^sim is the estimated molecular volume, 18 is the weight factor (i.e. w = 18) and $[\boldnu_{\rm atoms}]$ is the input vector, which in this case is the sum of all the non-hydrogen atoms in the molecule.

Obviously, this model is very approximate and can be improved by reverting to the `multiple linear regression' task. Multiple linear regression fits a hyperplane through the set of k points in such a way that it makes the sum of squared distances (see Section 3.1 ) of the model (i.e., the distances from the points to the plane) as small as possible. The equation system that has to be solved is an over-determined linear equation system. The optimal solution of such a system can be obtained with SVD. In this case, the parameterization of a model results in an increment system. The use of increment systems is very simple in practical application. The user just has to count some descriptors, multiply them by the weights and sum them up, which is easy to do. Therefore, increment systems are very popular in crystallography. One of the first increment systems, to predict the sublimation energy of crystals, was developed by Bondi as early as 1963 (Bondi, 1963 ). Another example is the estimation of the hydration energy of drug-like molecules by counting functional groups in a specific drug set (Pèpe et al., 2002 ).

The density prediction method of Kempster & Lipson (1972 ) has been significantly improved by extension of the single descriptor for the non-hydrogen atoms $[\boldnu]$ to the sum formula itself (Hofmann, 2002 ). In this approach, every atom sort i, where i is the atomic number, gets its own number of atoms ν_i and its own `weight' w_i. For simple linear regression $[\displaystyle V_{{\rm molecule}}^{\rm sim} = 18\,{\rm \AA}^{3}\times\boldnu_{ {\rm atoms}},]$ whereas for multiple linear regression $[V_{{\rm molecule}}^{\rm sim} = \textstyle\sum\limits_{i = 1}^{{i = 108}}w_{i} \times\nu_{i} = {\bf w}\times{\boldnu}\eqno(14)]$ (where 108 is the total number of accessible atom types from more than 110 known at the time of writing) and the density is given by $[\displaystyle\rho = {m_{{\rm molecule}}\over V_{{\rm molecule}}},]$ where m_molecule is the molecular mass.

In Fig. 14 , the results obtained with simple linear regression (red) and with multiple linear regression (green) are compared. It is clear that the predictability, in this case, was improved significantly. The simple but effective approach of multiple linear regression to estimate the molecular volume is widely used these days for cleansing erroneous data from data sets. For instance, the estimate of the density is obtained by multiplying the input vector of the elements $[\nu_i]$ by the weight vector w_i, which is the average volume for the element [equation (14 )]. A more complex case is that of force fields, which will be discussed in following sections.

Figure 14

Comparing the results of crystal-density predictions obtained with simple linear regression (red) and multiple linear regression (green) for 10 000 crystal structures taken at random from the CSD.

In the common case of linear models the weight vector can be presented as a weight matrix. If the output vector consists of a single value, the weight matrix shrinks to a weight vector. Note that in statistics the application of a `weight' (or `weight vector') is a mathematical device used when performing a sum, integral or average to give to some descriptors more weight, or influence, on the result compared with other descriptors in the same set. Very often, the weight is a simple factor and the score for a possible event is obtained by multiplying the input vector $[\boldnu]$ by the weight factor. However, in science the weight factor very often has a physical interpretation.

4.3. Nonlinear regression (force fields and intermolecular interactions)

Nonlinear regression is a form of regression analysis in which observational data are modelled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. This kind of analysis is very important in crystal engineering, because along with the classification suggested earlier (Apostolakis et al., 2001 ), it allows the derivation of intermolecular potentials and their parameterization. Since obtaining the direct solutions of the systems becomes very difficult and time consuming, the data are normally fitted by successive approximations. The `discretization' and `linearization' techniques are used the most often to solve nonlinear problems.

4.3.1. Discretization and linearization

In mathematics, the term `discretization' concerns the process of transferring continuous functions, models and equations into discrete counterparts. This process is usually carried out as the first step towards making them suitable for numerical evaluation and implementation on digital computers. In the simplest case, a continuous function f(r) with $[r\in{\bb R}]$ (i.e. a real number) is mapped to a discrete function f(k) defined only for the natural numbers $[k\in{\bb N}]$ , which are the grid points. $[\displaystyle f(r:r\in {\bb R})\ \mapsto\ f(g(k):k\in {\bb N}).\eqno(15)]$ The set of grid points g(k) can then be chosen as equidistant [for example, all natural numbers $[{\bb N}\,(g(k) = k)]$ ], or obtained with a special function for computational convenience or for different significance in the different regions of the function. For instance, for intermolecular interactions a useful partitioning might be $[r = g(k) = \left({{{N_{{\rm mesh}}} / {k}}} \right)^{1/2},\eqno(16)]$ where N_mesh is the mesh size of the interval in which the potential function will be defined. (For example, N_mesh = 200 defines an interval of 10 Å for k = 2.)

The application of this method in knowledge discovery requires balancing the mesh size (the total number of mesh points n_mesh) with the available data. The predictions become more accurate with a larger number of available data, not only in terms of statistical significance, but also because the mesh can be finer.

The term `linearization' in mathematics refers to finding the linear approximation of a function at a given point. In the first step the function has to be discretized and thereafter the function should be linearly interpolated between the mesh points.

To illustrate, the discretization of a Lennard-Jones potential (where r are interatomic distances) is $[\eqalignno{\displaystyle\varepsilon(r) &= {A\over r^{12}}-{B\over r^{6}}&\cr&\approx\varepsilon_{k+1} \left[k(r)-\lfloor k(r)\rfloor\right]+\varepsilon_{k}\left[\lceil k(r)\rceil-k(r)\right],&(17)}]$ where A and B are parameters, and $[k(r) = g^{-1}(g(k)) = {{n_{{\rm mesh}}} \over {r^{2}}}.]$

In Fig. 15 , a typical atom pair function for intermolecular interactions is shown. The potential is defined on the mesh points 200/k. At short distances the points are dense, to describe the interaction potential accurately. At large distances they are sparse to economize on the number of descriptors. In between the mesh points the potential is linearly interpolated.

Figure 15

An atom pair potential, which is discretized and linearly interpolated between the mesh points.

4.3.2. Unknown functional dependency

This technique is of special interest, because it can be used to derive intermolecular potentials when the functional dependency (the form of the potential function) between the input and output data is unknown.

Only three requirements must be met:

(1) the problem is `right unique';

(2) the functional dependency between the output and the input values is continuous; and

(3) the observed state is a local minimum (in the special case of thermodynamics this means the same as a metastable state).

The requirement of being a `right unique' problem means that the output value is uniquely defined for a given set of input values. The most common case in which this requirement is not fulfilled is if data for some significant descriptor were not collected. An illustrative example is the sales of umbrellas. Obviously, the sales cannot be predicted if the data on rainfall have not been collected as input data. In the extreme situation where not one single input value exists, it is obvious that no output value can be predicted from the input value.

The second requirement, that of continuity of functional dependency, allows us to linearize the functional dependency. Any continuous function can be linearized by replacing the function ɛ = f(r) by a piecewise linear function, which assigns to mesh points a specific value commonly named the `weight'; $[r_{i}\ \mapsto\ \varepsilon_{i}]$ . During the linearization, the functional dependency is replaced by a series of descriptors. Consequently, rather than evaluating the function to obtain the output value, the number of occurrences of the descriptors has to be counted. As a result, a input vector, $[\boldnu]$ , is obtained. The output value, G, is not obtained by evaluating the function; it is obtained as the scalar product between the frequency vector, $[\boldnu]$ , and the `weight vector', $[\boldvarepsilon]$ : $[G=\boldnu\cdot\boldvarepsilon]$ .² In this general case we obtain an over-determined equation system of the type $[\left(\matrix{\nu_{11}&\ldots&\nu_{1n}\cr \nu_{21}&\ldots&\nu_{2n}\cr \vdots&\ddots&\cr \vdots&&\cr \nu_{m-1}&\ldots&\nu_{m-1n}\cr \nu_{m1}&\ldots&\nu_{mn}\cr }\right)\cdot\left(\matrix{\varepsilon_{1}\cr \vdots\cr \varepsilon_{n}\cr }\right) = \left(\matrix{G_{1}\cr G_{2}\cr \vdots\cr \vdots\cr G_{m-1}\cr G_{m}\cr }\right).\eqno(18)]$

The requirement `belongs to a local minimum' says that the experimental data must be a local minimum in the degrees of freedom. In other words, the experimental system must be a minimum for the Gibbs energy G.

For instance, the set of degrees of freedom X₁,…, X_N for a crystal with one rigid molecule is the lengths of the cell vectors a, b, c, the angles α, β, γ, and the Euler angles φ, θ, ψ for the rotation of the molecules. The above requirement implies that the derivative $[{{\partial G} / {\partial X_i}}]$ for the change of a crystal structure in one of the degrees of freedom must be zero or, in other words, the limit for a tiny distortion must be zero ( $[\lim_{\delta{\boldnu}\rightarrow 0}\delta{\boldnu} = 0]$ ). Under this condition the general case in equation (18 ) transforms to the specific case of an over-determined equation system: $[\left(\matrix{\delta\nu_{11}&\ldots&\delta\nu_{1n}\cr \delta\nu_{21}&\ldots&\delta\nu_{2n}\cr \vdots&\ddots&\cr \vdots&&\cr \delta\nu_{m-1}&\ldots&\delta\nu_{m-1n}\cr \delta\nu_{m1}&\ldots&\delta\nu_{mn}\cr }\right)\cdot\left(\matrix{\varepsilon_{1}\cr \vdots\cr \varepsilon_{n}\cr }\right) = \left(\matrix{0\cr 0\cr \vdots\cr \vdots\cr 0\cr 0\cr }\right).\eqno(19)]$

The standard method for solving an equation system such as (18 ) or (19 ) is singular value decomposition (SVD). The advantage of this approach compared with the classification solution suggested earlier (Hofmann & Apostolakis, 2003 ; Apostolakis et al., 2001 ; Hofmann, 2010 ) is that the solution of this system is unique and does not depend on an initial guess (see Section 4.4 ). It is only necessary to take care that during the linearization the number of available data is balanced with the number of mesh points; obviously, the number of mesh points cannot exceed the number of available data per descriptor, $[\eqalignno{&{\rm No.\ of\ descriptors}\times{\rm No.\ of\ mesh\ points}&\cr&\quad\,\lt\, {\rm No.\ of\ experimental\ data},&(20)}]$ otherwise we have more parameters than data and we end up with an under-determined equation system. The equation system can quickly become very large and might cause some program-language problems. For example, Java allows for the indexing of arrays up to a maximum of 2 GB of addresses. This corresponds to a equation system of 200 000 equations, 100 descriptors ɛ and 100 mesh points per descriptor.

4.3.3. The basic functional form of force fields

The basic functional form of a force field encapsulates both bonded terms, relating to atoms that are linked by covalent bonds, and non-bonded `non-covalent' terms. For the non-covalent terms, a large number of functionals exist, which differ in their atom types and the functional dependency of the energy on the distance. One of the earliest potentials used was derived from theoretical considerations about the dispersion and electrostatic interactions. For the long-range interactions it followed an r⁻⁶ law and for the electrostatic term it varied as r⁻¹. This potential was then completed with a term for short-range repulsive interactions. For computational convenience an r⁻¹² (Lennard-Jones, 1931 ) term $[\varepsilon(r) = 4\varepsilon_{{\rm vdW}}\left[\left({{\sigma} \over {r}}\right)^{12}- \left({{\sigma} \over {r}}\right)^{6}\right],\eqno(21)]$ or an exponential term (Buckingham, 1938 ) $[\varepsilon(r) = {A\exp({-Br})-Cr^{-6}}\eqno(22)]$ were chosen. By the 1970s it was recognized that this functional form does not describe hydrogen bonds well, and an additional `10–12 potential' was proposed for hydrogen bonds (McGuire et al., 1972 ): $[\varepsilon(r) = {Ar^{-10}-Br^{-12}}.\eqno(23)]$

Since then, more and more complex forms of potentials have been developed. For example, the shifted force n–m potential (Clarke et al., 1986 ) is $[\eqalignno{\displaystyle\varepsilon(r) &= {\alpha E_{0}\over n-m}\left[m\beta^{n}\left\{\left({ r_{0}\over r}\right)^{n}-\left({1\over\gamma}\right)^{n}\right\}-n\beta^{m} \left\{\left({r_{0}\over r}\right)^{m}-\left({1\over\gamma}\right)^{m}\right\} \right]&\cr&\quad+{nm\alpha E_{0}\over n-m}{r-\gamma r_{0}\over\gamma r_{0}}\left\{\left ({\beta\over\gamma}\right)^{n}-\left({\beta\over\gamma}\right)^{m}\right\}&(24)}]$ with $[\eqalign{\displaystyle\gamma &= r_{{\rm cutoff}}/r_{0},\cr \displaystyle\beta &= \gamma\left({\gamma^{m+1}-1\over\gamma^{n+1}-1}\right)^{1 / (n-m)}\ {\rm and}\cr \displaystyle\alpha &= {n-m\over n\beta^{m}(1+m/\gamma-m-1/\gamma^{m})-m\beta^{n} (1+n/\gamma-n-1/\gamma^{n})}.}]$ The various parameters of the three formulas above can be interpreted by curve sketching or by a machine-learning approach. In terms of curve sketching, ɛ_vdW (E₀) defines the minimum depth of the potential curve (usually referred as a dispersion energy), r₀ is the position of the minimum and σ is the distance at which the value of the potential curve is zero; r_cutoff is introduced for a numerical reason and defines the distance at which the curve should go to zero for all values above the cutoff. A, B and C are parameters that can be interpreted in the terms of machine learning as weight factors for the strengths of the different interactions (see Table 1 ).

Table 1
The physical significance of the various negative powers

Power	Interaction	Magnitude (kJ mol⁻¹)	Name of potential	Temperature dependence
1	charge–charge	250	Coulomb
2	charge–dipole	15		T⁻¹
3	dipole–dipole	2		T⁻¹
4	dipole–quadrupole	< 1		T⁻¹
5	quadrupole–quadrupole	$[\ll 1]$		T⁻¹
6	induced dipole–induced dipole	1	London dispersion
10	hydrogen bond
12	electron–electron		Exchange energy, Pauli energy

4.3.4. Derivation of intermolecular potentials

The method described above in Section 4.3.2 allows derivation of the effective non-covalent interaction without assuming a functional dependency from the outset. The derivation and parameterization of force fields with regression is a more complex case than density estimation. In this case, as a very special representation of the input vector we consider the radial distribution function, RDF. The radial distribution function (or pair correlation function) g(r) in a system of particles (atoms, molecules, colloids and so on) describes how their density (occurrence), ρ, varies as a function of distance from a reference particle. If a given particle is taken to be at the origin O, and if ρ₀ = N/V is the average value of the density of particles, then the density (occurrence) at a distance r from O is $[{{\rho(r)}/ {\rho_{0}}}]$ . If the descriptors are the discretized frequencies of interactions within the interval r(k) to r(k + 1), then the density ρ₀ is connected with the input vector by $[\nu_{k} = \rho_{0}\textstyle\int\limits_{r = r(k)}^{r(k+1)}4\pi r^{2}g(r)\,{\rm d}r.\eqno(25)]$

If the RDF is calculated for a single crystal structure, it reflects all the intermolecular distances that occur between all atom types in this crystal structure. The energy of the system, G, can be written as the integral of the RDF with the corresponding atom pair function ɛ. The integral transforms, after discretization, into a sum. For example, in the simplified case of aromatic hydrocarbons with only two kinds of atoms (hydrogen, H, and sp²-hybridized carbon, C) we obtain $[G = {\boldvarepsilon}_{\rm CC}{\boldnu}_{\rm CC}+ {\boldvarepsilon}_{\rm CH}{\boldnu}_{\rm CH}+{\boldvarepsilon}_{\rm HH}{\boldnu}_{\rm HH},\eqno(26)]$ where the weights $[\boldvarepsilon_{ij}]$ are the energies of the atom pair function at the mesh points. The weights are still simple factors, but, in contrast to a simple statistical analysis, can be negative as well as positive. The input vectors $[\boldnu_{ij}]$ are obtained from the transformed RDF [equation (25 )]. For this case, the number of required potentials results from n_potentials = n_atom types(n_atom types + 1)/2 and for two atom types n_potentials = 3. If the energy is added as a dimension, the problem is a four-dimensional one. The method also works for higher dimensions (we have tested it for up to 67 dimensions), but a four-dimensional space is the limit for reasonable clarity of presentation.

To complete the input vector for deriving the force field for aromatic hydrocarbons, as a first step we extracted from the CSD crystal structures that contain only sp²-hybridized carbon atoms and hydrogen atoms using the `atom sort' filter. This subset of structures was then cleansed. Typical tasks in cleansing are checking for inconsistency in the input vector and checking for its completeness. A further task is the analysis of the data for outliers. For the details and results of this analysis we refer readers to Section 5 .

Here the use of filters simplifies the problem of the visualization of the results, as it restricts the number of required descriptors to only two atom types and significantly reduces the dimension of the input vector from roughly 100 to 2. In real applications [see, for example, Hofmann (2010 )] we use as a filter the `only organic compounds' option. Obviously, the results obtained using the trained model can only be used for predictions about the corresponding object class: hydrocarbons as in the given case, or organic crystals as in Hofmann (2010 ).

In total after filtering and cleansing the input vector was composed of 215 structures. On average, these structures had around 15 degrees of freedom and we obtained 3282 equations. The distances for the mesh points were defined at $[r_{i} = {200/i^{2}}]$ . This choice was advantageous, because it avoided the need to calculate the root of the distances during evaluation of the energy. The experimental data exhibited 90 descriptors; contacts longer than a cutoff of 10 Å were not taken into account and contacts at short distances, below the minimum van der Waals distance, did not occur. The result is shown in Fig. 16 . The curves obtained are similar to common interaction potentials, but they are not completely smooth. Since we did not presuppose a functional dependency between the distance and the potential, the potentials show some noise. As a reference, in Fig. 16 we have added the radial distribution functions for the corresponding contacts. To a first approximation one would expect that the minimum of each potential should coincide with the maximum of the RDF and the curves should be related to the Boltzmann law, $[\varepsilon(r)]$ = $[-k_{\rm B}T\ln[{{\rm RDF}(r)/{\rm lim}_{r\rightarrow\infty}{\rm RDF}(r)}]]$ , where k_B is the Boltzmann constant. In fact, the minima are shifted to slightly higher distances than the RDFs suggest.

Figure 16

The potentials for aromatic hydrocarbons obtained by singular value decomposition (SVD) are shown in dark red, dark green and dark blue. For comparison, the corresponding radial distribution functions (pale red, pale green and pale blue) are also shown.

It is important to mention that the solution of the equation system using SVD indicated that the matrix was not full rank. It gave a solution for only 88 weights, and two weights could not be determined. This kind of result is common. In general, some equations can appear as co-linear. An obvious reason for this is if several descriptors occur only in a single experimental data point. In this case, the weights of these descriptors can assume arbitrary values and a defined value only exists for the sum of them. For example, this happened for the structure with the CSD refcode SALFUG (Fig. 17 ), which has an intermolecular H⋯H interaction of length 2.098 Å. This singular distance contributed to the frequency vectors of the two adjacent descriptors: as the floor $[\lfloor{200/ 2.098\,\AA}\rfloor]$ is the same as the ceiling $[\lceil{200/ 2.098\,\AA}\rceil]$ , these results were co-linear. Only a visual check of the structure allowed determination of whether the structure was correct or whether some inaccuracies in the H-atom position were present, hence this structure had not been removed by cleansing earlier. The molecule is quite unusual, which might be responsible for unusual contacts as well. This anomaly could only be detected at the stage of SVD.

Figure 17

The structure with the CSD refcode SALFUG has a very short H⋯H distance and was recognized as unusual during multiple regression.

4.4. Classification (intermolecular interactions)

Another method for the parameterizing of force-field models is classification. This method, using a support vector machine (SVM) algorithm, was described in detail and used in Hofmann & Apostolakis (2003 ), Apostolakis et al. (2001 ), and Hofmann (2010 ).

In classification, if data can be divided into different classes, then the parameters of a model can be optimized to separate these classes as effectively as possible. Often only two classes are considered, one `true' and one `false'. The advantages of the method are that the parameters can be determined for an arbitrary model, the model can be multidimensional and any arbitrary function for the functional dependency is allowed. The largest drawback of the approach becomes apparent as soon as we begin to consider optimization. Any algorithm for optimization results in a local minimum, and no one algorithm can be guaranteed to find a global minimum. The more complex and multidimensional the input data are, the more likely it is that the algorithm does not find the global minimum. As a result, iterations of the parameterization with different starting points will finish with different values, whereas (and this should be emphasized) parameterization of the force field by regression give us the unique solution. Therefore, force-field parameterization by classification requires a very good initial guess. Nevertheless, the method is very attractive because of its tolerance of any model and its broad applicability. The problem has only two requirements for its application:

(1) for any experimental data a reference state is known, which is higher in its score (energy); and

(2) the experimental data are a local minimum of the model.

If these two requirements are fulfilled, we can always define two classes: `true' and `false'. The first class, which we will call `experimental data' or `true' henceforth, can be defined with the help of the first requirement: it contains the experimental data minus the reference state. The scores of all elements in this class must be zero. In crystals, this means explicitly that the interaction energy within a crystal must be below the energy of the free molecules. If the conformation energy remains constant or there are only trivial changes in the conformation during crystallization, it is equivalent to the statement that the free energy of the crystal formation must be below zero. The second class, which we will call `virtual data' or `false' henceforth, can be defined with the help of the second requirement: it contains the virtual data minus the reference state. The virtual data can be generated in the case of crystal structures by applying some small distortions to experimental structures. From the second requirement it follows now that the score (the free energy) of the distorted structure minus the score of the experimental structure must be above zero. In total, we obtain the class `true', where each element must have a score below zero, and the class `false', in which each element must have a score above zero. In Fig. 18 we show the results of classification for the case of aromatic hydrocarbons, described above in Section 4.3.4 . The set of the data to classify was composed of the 215 experimental structures used above plus the set of the virtual structures (around 10 virtual structures are generated for every experimental structure). The algorithms that allow the parameterization to achieve an optimal separation of the two classes are SVMs. They were first used in crystallography as early as 2001 (Apostolakis et al., 2001 ).

Figure 18

Results of the classification according to the energy (vertical axis, interval 200 score points). The structures on the horizontal axis are ordered according to their scores (from the lowest to the highest value). All the experimental structures (blue) have energies below zero and most of virtual structures have energies above zero (green). However, some virtual structures have energy values below zero (red).

As pointed out above, the first step of the method is to define a good starting guess. As a starting point, the parameters of one of the common force fields or the force field obtained by regression (Section 4.3.4 ) can be used. The possibility of using the force field obtained from regression opens new horizons in developing highly customized potential functions for special groups of compounds, or for compounds under unusual or extreme conditions. A general possibility for overcoming the problem of the unknown functional dependency of the atom pair interaction as function of the distance is to develop the function as a series. Obviously, the terms of the series are forbidden to be co-linear and they should fulfil the limit conditions. In the case of atom pair functions, one implication of this is that for infinite distance the interaction energy has to be zero. It is possible to use reciprocal power series of the form $[\varepsilon(r) = \sum{{a_{i}} / {(r^{n})^{i}}}]$ , with n ≥ 2. With this condition for n the common problem of the divergence of the Coulomb term is avoided and the Coulomb energy during parameterization becomes intrinsically included into the other terms. If n = 6 is chosen and the last term is taken as r⁻¹², the potential corresponds to the common Lennard-Jones potential. A full list of the different terms and their physical meaning can be found in Table 1 .

Here we demonstrate the method for the aromatic hydrocarbons, the same example as used in Section 4.3.4 . In the first step we generated the false (virtual) structures, which we then used for training. The number of searched atom pair potentials was three, as before. Series of the form $[\textstyle\sum_{i = 1}^{4}{{a_{i}} / {(r^{3})^{i}}}]$ were used. Any such series contains four terms and, in total, 12 parameters have to be determined. The initial guesses for the parameters were obtained by fitting the functions to the atom pair functions obtained by regression in Section 4.3.4 . Starting with this fit, the sum of the distances of the outliers from the hyperplane was minimized by a simplex minimization. In such a procedure, the distance for the experimental structure is given by $[\Delta_{{\rm exp}}E = \max\{{0,-E_{{\rm exp}}}\}]$ , while for the virtual data the error is given by $[\Delta_{{\rm vir}}E = \max\{{0,E_{{\rm vir}}}\}]$ . Since the system has a trivial solution by setting all parameters to zero, a boundary condition needed to be introduced during minimization. The functions were always scaled to keep the minimum value at −10; $[\min\left\{{\varepsilon_{{\rm HH}}(r),\varepsilon_{{\rm HC}}(r),\varepsilon_{ {\rm HC}}(r)}\right\} = -10]$ .

4.4.1. Outliers in classification (energy cleanser)

Special attention should be paid to the analysis of the results of classification. In Fig. 19 , we plot the energies of the virtual structures against their occurrence in the interval from −0.1 to 0.1 score points, a total range of 0.2 score points, enlarged from Fig. 18 .

Figure 19

More detailed results of the classification according to the energy (vertical axis). The structures on the horizontal axis are ordered according to their scores. Only the virtual structures are shown. The energies of most of the virtual structures are above zero (green). However, some virtual structures (outliers) have values below zero (red).

As one could see in Fig. 18 , all the experimental structures have an energy below zero and most of the virtual structures have energies above zero. However, and this can be seen on the enlargement shown in Fig. 19 , some virtual structures have energy values below zero (red points). The fact that all the experimental structures are assigned to negative energy shows that the internal cleansers of the CSD work very well and, at least for the present set of training structures, no structure with catastrophic errors was found. This was not always the case (Hofmann & Lengauer, 1997 ), and we should thank the continuous improvements in the data cleansing during the last 20 years (Groom et al., 2016 ) for the good quality of the present data.

The importance of outlier analysis for model development and data cleansing was discussed in Hofmann & Kuleshova (2010 ). Here we give some further insights to the problem. As we have mentioned above, most of the outliers were detected in the virtual data. It was noted that 1.89% of the virtual structures have energies below zero. This means that some distorted structures have a lower energy than the experimental structures.

This can happen if the model is not accurate enough and/or (for example) too few parameters (descriptors) are available. Another possible reason can be errors in the experimental data. Therefore, it is always worth looking at the outliers in detail to work out whether it is possible to improve the method. In this case, the strongest outlier was the structure ANNULE10 (the isolated red point on Fig. 19 ). This single structure contributes 12% of the overall error. Fortunately, the database contains a second entry for the same compound, with the CSD refcode ANNULE01. The direct overlaying of the molecules, Fig. 20 , from the two structures immediately reveals a problem: in ANNULE10 (green), some of the hydrogen atoms are positioned inaccurately. The C—H bond is a little bit too long and the hydrogen-atom position is too far out of the plane of the sp²-hybridized carbon atoms. The length of the C—H bond can be normalized easily by using standard bond lengths for hydrogen in the programs Mercury (Macrae et al., 2020 ), FlexCryst (Hofmann et al., 2016 ) and Materials Studio (Accelrys Software Inc., 2013 ). The position of an `out-of-plane' hydrogen atom cannot be corrected automatically yet. As another result of the bad positioning of the hydrogen atoms, one H⋯C interaction is shortened from 2.803 Å to 2.599 Å. These small errors in hydrogen-atom positions were detected only during the classification, and cause the structure ANNULE10 to be the largest outlier.

Figure 20

The overlay of molecular fragments in ANNULE01 (grey) and ANNULE10 (green) illustrates the error found by the classification. ANNULE10 has a C⋯H contact that is too short: 2.599 Å.

5. The importance of anomaly (outlier) detection

All methods of machine learning, supervised and unsupervised, can detect outliers. These outliers are worth close analysis, since they can reveal unusual events and are very often connected with errors. These days, anomaly detection (also known as outlier detection) is the identification of items, events or observations that do not conform to an expected pattern or to other items in a data set. Typically, the anomalous items indicate some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks and detecting ecosystem disturbances.

In crystallography, anomaly detection serves two main aims: on the one hand, experimental errors should be detected and, if possible, corrected; and on the other hand (perhaps not so obviously), deficiencies in models should be recognized. There exists in machine learning an intrinsic interest in detecting anomalous data and correcting or removing them during preprocessing by cleansing. During this process the input vector is checked for consistency and completeness. A typical case of inconsistency is when the molecular formula does not coincide with the given record of the atoms. A typical case of incompleteness is when some atoms are missing. For experimental reasons, this commonly concerns hydrogen atoms. In organic crystallography, deposition of a crystal structure in the CSD requires that two cleansers are passed: the first, enCIFer (CCDC, 2016a ), checks the syntax of the data, while the second, PLATON (Spek, 2016 ), checks for the reasonableness of the structure and, for instance, annotates the records for structures that contain large voids. The removal of anomalous data from the data set often results in a significant increase in accuracy.

The outliers can arise for three different reasons: systematic errors in the data collection, random errors in data collection and missing data, i.e. the failure to collect data for some important descriptor. Erroneous data can be detected by all methods of machine learning, while missing data are most easily spotted by supervised classification.

5.1. Systematic errors (wrong space-group assignment, wrong handedness)

To our knowledge, systematic errors are not common in databases, since the error can usually be eliminated if the cause of the systematic error can be identified. The methods for detecting a systematic error are well known in statistics. A simple check is whether the measured average values coincide with the expected average values or not.

Conceptually, the detection of systematic errors during supervised classification is not completely automatic; nevertheless, it might be the most powerful method. What commonly happens in the case of systematic errors is that during supervised classification a large part of the data cannot be assigned to the correct class and stay as outliers. In the next step these outliers have to be analysed and the crystallographer has to use their intuition to find a descriptor that differentiates between the correctly classified data and the outliers. If such a descriptor is identified, the origin of the error can be traced back.

Density cleanser. The effectiveness of this approach can be illustrated by the inclusion of the crystal density as an additional descriptor for monitoring crystal structures. For example, during the classification of the crystal structures with space group P1 performed by Hofmann (2002 ) (according to the scheme described in Section 4.4 ) a huge number of outliers were observed. Further analysis of the two classes obtained with linear regression for the crystal density (Section 4.2 ) revealed different weights for the average volume of the atoms in these groups. The values for the major class were less than half of those for the minor class (the outliers). Attempts to trace a possible reason for this suggested that in some cases an overbar had been lost during text processing, and that the structures in the minor class in fact belonged to the space group $[P\bar{1}]$ . In the past, several systematic errors have been detected; other examples are errors in the setting (for example P2₁/a ↔ P2₁/n ↔ P2₁/c) or the handedness in certain space groups (P2₁2₁2₁). We have already mentioned that a density cleanser now is one of the cleansers used by the CSD. The other example of anomaly detection with classification was described in Section 4.4.1 and was referred to as an `energy cleanser'.

5.2. Random errors (ill-positioned atoms)

Random errors are always present in experimental data. They depend on the (possible) accuracy of the data acquisition. They are easily recognizable, within the methods of data mining, because the data are scattered around the line in regression, around the hyperplane in multiple regression, or the clusters are blurred during classification. But the errors do not show any structure. Only the estimation of the average value can be improved by cleansing (preprocessing) the experimental data set.

A single set of data can be only improved by changing the data acquisition. A common problem in this context is the determination of hydrogen-atom positions by X-ray crystallography. Since this method detects the electron density rather than the positions of the nuclei, the positions of the nuclei of atoms without electrons cannot be determined. Therefore, the positions of highly acidic hydrogen atoms are very often ill-conditioned. The only ways to overcome the problem are to change the method and collect data using neutron diffraction, or to set the relevant values to average ones (i.e. use normalized hydrogen-atom positions).

RDF cleanser: That some atoms are ill-positioned becomes obvious if the radial distribution functions (RDFs) are plotted as in Fig. 21 . It is seen that at short distances (and especially for hydrogen atoms) some contacts fall outside normality and form isolated clusters.

Figure 21

Average RDFs for the set of aromatic hydrocarbons (shown in dark red, dark blue and dark green) used in Section 4.3.4 , and the RDFs for a single structure (pale red, pale blue and pale red), CSD refcode BOXGAW.

For the detailed investigation of the outliers in this case, a measurement of anomaly (similarity η) has to be introduced. This index has to be high for the outliers and low for normal structures. For this aim the relative difference between the RDF of the sum of the structures, RDF^set, and the RDF of a specific structure, RDF^structure, can be used [equation (27 )]. A good measure for the value of η for the anomaly will be the maximum of this relation at the mesh points. Furthermore, it is desirable that the error should have a Gaussian distribution, since Gaussian distributions can be analysed with descriptive statistics, which are highly developed. For instance, they allow us to differentiate between systematic and random errors and to estimate the likelihood that the structure is wrong. Ratios can be mapped to Gaussian distributions by taking the logarithm, which gives us an equation for differentiating between ill- and correctly positioned atoms: $[\eta = \log{\left(\max{\left\{{{\rm RDF}^{\rm structure}_{k}-{\rm RDF}^{\rm set}_{k}} \over {{\rm RDF}^{\rm set}_{k}}\right\}}\right)}.\eqno(27)]$

The result can be plotted as histogram. In Fig. 22 the result for all organic crystal structures containing common atom types is shown. The set consists of 74 625 crystal structures. The anomaly of the structures rises with increasing value of η, while with decreasing η the structures become more and more similar to the average. At the mean the structures have a score of around two (mean value of η: 2.05; median: 1.97).

Figure 22

Anomalous structures obtained with the RDF cleanser.

The most anomalous was found to be the crystal structure VEXXUS (highlighted in the histogram by the red arrow). VEXXUS (Gallardo et al., 2007 ) is a cocrystal of tetraethylammonium with 1,3,5-trinitrobenzene and, in fact, possesses a very particular crystal structure: analysis revealed very unusual contacts between the molecules of VEXXUS (see Fig. 23 ), including a $[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{2}}]$ interaction with a very short distance of 2.478 Å.

Figure 23

According to the RDF cleanser results shown in Fig. 22, the most anomalous structure is VEXXUS, with a very short C⋯C distance of 2.478 Å.

The ten most anomalous structures are listed in Table 2 . They have errors above 9.85.

Table 2
The most anomalous structures according to the RDF cleanser

CSD refcode	Similarity η	Interaction type	r (Å)
VEXXUS	11.2	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{2}}]$	2.50
WABKAL	10.7	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{3}}]$	2.62
IVEZAK	10.3	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{2}}]$	2.62
PARNIG	10.2	C—H⋯ $[{\rm C}_{sp^{2}}]$	1.96
VUCFUU	10.2	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{2}}]$	2.32
UXIPEX	10.1	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{2}}]$	2.54
SEQBUN	10.0	C—H⋯ $[{\rm C}_{sp^{2}}]$	2.02
PYRIDO	10.0	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{2}}]$	2.62
LAJWIE	9.9	$[{\rm C}_{sp^{2}}\cdots{\rm C}_{sp^{3}}]$	2.88
VAMVEI	9.9	C—H $[\cdots{\rm C}_{sp^{3}}]$	1.94

5.3. Model errors, missing descriptors

Model errors can only be detected during supervised learning. They can be recognized when a lot of data cannot be assigned to the right class or have bad correlation coefficients during regression and the output data can be reflected only poorly by the trained model. To understand that the data are reflected poorly, it is necessary to examine the output data during machine learning, and in this sense only supervised learning is able to detect whether some important descriptors are missing. An obvious strategy for solving this problem is to extend the number of descriptors and to collect all data for any descriptors that can be thought of. Below we highlight the importance of the proper choice of descriptors when looking for errors in models and missing descriptors.

In Section 3.3 on the clustering of crystal structures, it was demonstrated that clustering allows us to recognize descriptors that have been overlooked, like the temperature and pressure. It is astonishing in this context that methods of CSP take no account of the effects of temperature and pressure and yet are successful (Reilly et al., 2016 ), even though some experimental crystallographers specialize in high-pressure or low-temperature studies. For example, crystallography under high pressure has significance for the deep sea (for instance, in the handling of the oil spill in the Gulf of Mexico), whereas crystallography at low temperature is relevant to space and astronomy.

In Section 4.2 it was shown that prediction of the crystal density works much better within the multiple linear regression when a weight factor is added to the single descriptor of the number of atoms.

In Section 4.4 , however, when discussing the analysis of intermolecular interactions in crystals it was demonstrated that extending the descriptors has limits and that the number of descriptors cannot become infinite. The number of descriptors must always be lower than the number of data, otherwise the problem gives rise to an undetermined equation system. If the connectivity between the input data and the output data is a function, this limits the mesh size that can be applied when discretizing the function. The accuracy increases when a finer mesh is chosen, but the mesh size is limited by the available data.

In Section 6.1 , we will show how to reduce an infinite number of descriptors to a reasonable number of descriptors by principal component analysis, PCA, for the prediction of crystal solubility.

Currently the use of PCA is the poor cousin in machine learning. Still, computing power is growing quickly enough to keep pace with the amount of data being collected. While small centres, like the Center for Advanced Studies, Research and Development in Sardinia (CRS4), can store some petabytes, the largest data storage centre of the world, the Range International Information Group data centre in Langfang, China, stores some thousands of petabytes. A theoretical limit for the present technology is that common processors with a 64-bit address bus can address a maximum data set with 16 000 petabytes of entries.

6. Dimensionality reduction and property estimation

Another common tool in machine learning is reduction of the dimensionality of the input information. In materials science, in order to predict the properties of novel, or even hypothesized, crystalline materials we face the problem of handling very large, inhomogeneous and multidimensional data sets. The classification models built on this information could become very complicated and have hundreds or thousands of features (also called attributes, variables or descriptors). At the same time it is likely that many (if not most) of the features have low information content. Some of these features can be common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low-information features can decrease performance. Machine learning can tackle this problem by dimensionality reduction. As an example of this we will look at the molecular volume used in QSPR schemes to predict solubility (see Section 6.2 ).

Eliminating low-information features gives the model clarity by removing noisy data. It can prevent overfitting and helps to avoid the curse of high dimensionality. When we use only the higher-information features, we can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Intuitively, removing some features may look wrong, but it is important to see the results. Dimensionality reduction is not only useful for speeding up an algorithm, but it might actually help with the final classification/clustering accuracy as well. Too many noisy or even faulty input data often lead to less than desirable algorithm performance. Removing uninformative or disinformative data might indeed help the algorithm to find more general classification regions and rules, and overall achieve better performance on new data.

There are two main approaches:

Feature selection – this tries to find a subset of the original variables. There are three strategies: filter (e.g. information gain) and wrapper (e.g. search guided by accuracy) approaches, and embedding (features are selected to add or remove while building the model based on the prediction errors); see Section 6.2 .

Feature extraction – this transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in PCA (Section 6.1 ), but many nonlinear dimensionality-reduction techniques also exist. For multidimensional data, tensor representations can be used in dimensionality reduction through multilinear subspace learning.

6.1. Principal component analysis

Principal component analysis (PCA) is probably the most popular multivariate statistical technique and it is used by almost all scientific disciplines. It is also likely to be the oldest multivariate technique. It can be traced back to Pearson (1901 ) and for more details we direct the reader to, for example, Abdi & Williams (2010 ) and Samudrala et al. (2014 ). PCA analyses a data table representing observations described by several dependent variables, which are, in general, intercorrelated. Its goal is to extract the important information from the data table and to express this information as a set of new orthogonal variables called principal components. PCA also represents the pattern of similarity of the observations and the variables by displaying them as points in maps (Jolliffe, 2002 ).

The main linear technique for dimensionality reduction, PCA, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the correlation matrix of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behaviour of the system. The original space (with dimension equal to the number of points) is reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.

6.1.1. Molecular superposition

PCA is usually explained via an eigen decomposition of the covariance matrix. However, it can also be performed via singular value decomposition, SVD, of the data matrix. An illustrative example of the relationship between SVD and PCA, with which many readers might be familiar, is the application of SVD to the positions of the atoms in a molecule. In Euclidean space the atoms have three descriptors for their position: x, y, z. With SVD we obtain three weight vectors. The first gives the direction in which the molecule is most extended; the second gives the width and the last gives the flatness. The singular values give the variances for the three directions. The more dispersed the points are, the larger the singular values. If the molecule is completely flat one singular value is zero and one dimension is redundant. If the molecule is linear, two singular values are zero and the dimensionality can be reduced to a single descriptor. Let us apply SVD to the determination of the axis of inertia in a molecule. The matrix of interest in this context is the tensor of inertia. In contrast to the more general matrix in equation (18 ), which is rectangular, in the case of the tensor of inertia the matrix shrinks to a quadratic 3 × 3 matrix. In this case SVD becomes equal to solving an eigenvalue problem. The weight vectors are identical to the eigenvectors and the eigenvalues correspond to the singular values. Applying SVD gives as a result the three weight vectors corresponding to the axes of inertia. The largest of the singular values is the moment of inertia for the principal axis. Using this approach gives us a very quick and efficient algorithm for the superposition of molecules. For two molecules to be compared the principal axis of each has to be determined and a principal-axis transformation performed. Then a similarity index can be calculated, for instance, the root-mean square of the shortest distance between the atoms in the molecules. An advantage of this algorithm in comparison with other algorithms is that the molecules do not need to have the same number of atoms. An example, the superposition of biphenyl and fluorene molecules, is shown in Fig. 24 .

Figure 24

Superimposing biphenyl and fluorene with singular value decomposition. The calculated inertia ellipsoids are a measurement for the similarity. For flat molecules one dimension becomes redundant.

6.2. Physical intuition (crystal solubility)

In the previous sections many different sub-areas of machine learning have been described, and the reader might get the impression that everything can be easily solved by machine learning. In this section another method will be presented. One of the main aims of physicists and other scientists is to find out experimentally how an event (the output data) depends on different influences and to try to exclude unimportant influences. In this sense the experimental work is an experimental PCA. Therefore, we can simplify a problem a great deal by taking well known results into account. The example chosen here is the pharmaceutical application of crystallography. Most drugs are microcrystals, and crystallography plays an important role in the search for new drugs, their patenting and in quality control. With the arrival of combinatorial chemistry, it has become possible to prepare millions of compounds in a single process. This raises the question: Is it possible to predict the solubilities or biological activities of the compounds, or it is simpler to determine them by high-throughput screening?

The natural approach using data mining is to describe the molecules as much as possible using descriptors, perform PCA, eliminate the redundant descriptors, and to train the model by a supervised method, as described in Section 4 . This scheme can be used to predict an arbitrary property of a molecule or a crystal. Schemes that use multiple linear regression (Sections 4.2 and 4.3.4 ) are the most popular. In this case, the result is an increment system, which is easy to use. If the aim is to predict an arbitrary property the scheme is called a quantitative structure–property relationship (QSPR) method, and in the specific case of biological activity the scheme is called a quantitative structure–activity relationship (QSAR) method. An early example of QSPR in chemistry was the prediction of the melting points of alkanes in 1842 (Kopp, 1842 ). Recent applications in inorganic crystallography can be found, for example, in Rajan (2010 ), Samudrala et al. (2014 ), Villar et al. (2008 ) and Rajan et al. (2009 ).

The accuracy of this method grows with the increasing numbers of descriptors. In the early days, experimental data were used as input data. Since the measuring of input data for large data sets (such as from combinatorial chemistry) can become tedious, these days calculated descriptors are very often used. To enable the use of as many descriptors as possible, programs have been equipped with estimates of more and more descriptors (Todeschini & Consonni, 2009 ). One of the leading programs in this field, Dragon, calculates 5270 molecular descriptors at present (Mauri et al., 2006 ). The list of descriptors includes the simplest (atom types, functional groups and fragment counts), as well as topological and geometrical descriptors, three-dimensional descriptors, quantum-mechanical descriptors, properties and so on.

One of the first to perform an intuitive reduction of this flood of descriptors for QSAR methods was Lipinski (Lipinski et al., 2012 ). He ended up with `the rule of five'. Only four descriptors were assumed to be valuable. In addition, the input vectors are logicals (true or false), and are weighted equal (true). The simplicity of equation (28 ) makes this rule one of the most popular (the article has been cited more than 10 000 times). $[\eqalignno{&{\rm Activity} = &\cr&({\rm true\ true\ true\ true})\left(\matrix{ {\rm hydrogen}\hbox{-}{\rm bond\ donors}\,\lt\, 5\cr {\rm hydrogen}\hbox{-}{\rm bond\ acceptors}\,\lt\,10\cr {\rm molecular\ mass}\,\lt\,500\ {\rm g\ mol}^{-1}\cr {\rm octanol{-}water\ partition\ coefficient}\leq 5}\right).\cr&&(28)}]$

Looking at the problem of biological activity of oral drugs in more detail, it becomes clear that it can be divided into several aspects. The oral drugs must go into solution and then penetrate the cell membranes. Consequently, methods exist that try to predict separately the solubility or the intestinal permeability. To predict the solubility of an innovative drug at a very early stage, ideally before synthesis, is a challenging problem of high pharmaceutical significance. Fortunately, the solubility is easy and accurate to measure. Therefore, large and accurate databases are available, for instance at the National Institute of Standards and Technology (NIST). This makes the idea of applying QSPR to this question very attractive, and a large number of studies of this kind have been undertaken.

The large number of descriptors in QSAR methods is in contrast to the presentation given in many textbooks. In thermodynamics, the solubility, K_s, of a crystalline substance, for the case of simple dissolution, depends mainly on two kinds of interactions. The first defines how tightly the molecule is bound to its own crystal lattice, G_cryst, and the second how strongly the molecule associates with the solvent, G_solution (or G_hydration for the solubility in water). In thermodynamic equilibrium these two energies have to be equal. The solubility itself is a function of the solution energy in an infinitely dilute solution (the `standard energy of solution') and the concentration. The free energy of solution depends in an additive way on the standard energy of solution and the concentration: $[G_{{\rm cryst}} = G_{{\rm solution}} = G_{{\rm solution}}^{{\rm standard}}+G_{{\rm concentration}}.\eqno(29)]$

The free energy for a solution at a particular concentration is connected, to a first approximation, to the concentration by Boltzmann's entropy formula, which connects the entropy for a system with the energy. The considerations are based on the partition function for a number of equal particles, therefore the concentration has to be expressed in terms of mole fractions, X. Replacing the free energy for a solution at a particular concentration we obtain $[G_{{\rm cryst}} = G_{{\rm solution}} = G_{{\rm solution}}^{{\rm standard}}+RT\ln(X),\eqno(30)]$ where R is the universal gas constant and T is the temperature. $[RT\ln(X)]$ can be expressed as a linear regression: $[RT\ln(X) = a_{{\rm cryst}}G_{{\rm cryst}}+b_{{\rm cryst}}+a_{ {\rm solution}}G_{{\rm solution}}+b_{{\rm solution}},\eqno(31)]$ where a and b are parameters.

It is problematic, however, that neither G_cryst nor G_solution is easily accessible by experiment or by simulations. Nevertheless, recent progress in crystal structure prediction (CSP) allows us to estimate the free lattice energy of the crystals, G_cryst, with passable computational effort. Even if crystal structure prediction by itself still requires enormous computational effort, the lattice energy can be quickly estimated by a simplified version of CSP. It is important in this context that during CSP a lot of polymorphs are found within a very narrow range of energy. For predicting the solubility, it is sufficient to estimate the lattice energy roughly and knowledge of the real crystal structure is not needed. In Fig. 25 the calculated lattice energies of some experimentally determined crystal structures and the lattice energies of their top-ranked predicted structures are compared. With this simplified CSP the lattice energy is constantly predicted as a little bit lower than the real lattice energy, but the correlation is nearly perfect, as the correlation coefficient of 0.99 shows. (A list of the drugs and drug-like compounds that feature in Figs. 25–27 is given in Table 3 .)

Table 3
The list of drugs and drug-like compounds used in Fig. 25 –27

Acetamide	Chrysin	Ibuprofen	L-Proline	Proline
Acetanilide	Cyanuric acid	Ketoprofen	Nitrofurantoin	Hexachlorobenzene
Aspirin	Dexamethasone	Lidocaine	Oxalic acid	Nifedipine
Atropine	Diflunisal	Malononitrile	Paracetamol	Salicylic acid
Benzocaine	1,2-Diformylhydrazine	Methyl carbamate	Uracil	Flurbiprofen
Benzoic acid	Ebalzotan	Morin	Phenacetin	Urea
Caffeine	Fenbufen	N-Methylacetamide	Piperazine	Codeine
Citric acid	2-Azetidinone	N-Methylurea	Primidone	Naringenin
Cocaine	Fluconazole	Naphthalene	Naproxen

Figure 25

Prediction of the lattice energy (in kJ mol⁻¹) with FlexCryst.

The other descriptor, i.e., the solution or the hydration energy, can be obtained from an increment system (Pèpe et al., 2002 ) or by molecular dynamics. It is quicker to evaluate the hydration energy using the increment system; however, using molecular dynamics is more precise. The increment system is based on 2D descriptors of the molecule, i.e. the input vector contains the frequency of different functional groups of the molecule. Combining both descriptors and performing a two-dimensional regression according to equation (31 ) gives a good correlation between the experimental and the predicted solubility.

For further improvement of the prediction more descriptors can be added to the formula. One of the most common descriptors is the molecular volume (Lüder et al., 2009 ). The result of three-dimensional regression with this additional descriptor is plotted in Fig. 26 . The correlation between the solubility and the molecular volume is negligible (the correlation coefficient of 0.01 is tiny). This means that when only two descriptors – the lattice energy and the hydration energy – are used the molecular volume becomes redundant.

Figure 26

The correlation between solubility and molecular volume. The molecular volume is redundant.

A comparison of the predicted solubility method with other methods is shown in Table 4 . The data are taken from Duchowicz et al. (2008 ) and show the success of the different methods for a set of 21 drugs. The r.m.s.d. value of 0.85 is slightly higher than for the best model of Hou et al. (2004 ) (r.m.s.d. = 0.664) and the model of Huuskonen (2000 ) (r.m.s.d. = 0.810). However, their models both use more descriptors than the size of the training set and the ratio of the number of parameters to the number of data sets is 4:1 for the model of Hou et al. (2004 ) and 1.5:1 for the model of Huuskonen (2000 ). In both cases this is questionable; overfitting might have occurred. The method presented here is better as it uses only two descriptors, and the ratio of the number of parameters to the number of data sets is much lower at 1:17.

Table 4
Performance of different methods applied to a test set of drugs

Author	Type of descriptors	No. of descriptors	R.m.s.d.	Test set	Year
Klopman	2D group contributions	34	1.213	21	1992
Huuskonen	molecular topology	30	0.810	21	2000
Yan	3D structure	40	1.286	21	2003
Hou	atomic	78	0.664	21	2004
Duchowicz	molecular, Dragon	3	1.202	21	2008
Hofmann	G_cryst, G_hydration	2	0.85	34

In Fig. 27 the correlation between the predicted and experimental solubility is shown. The experimental data were taken from the online database of the Virtual Computational Chemistry Laboratory (2016 ). The data set covers a broad range of solubility and varies over seven orders of magnitude. Fenbufen, for example, is nearly insoluble, with a pK_s of −5.06, in contrast to acetamide, which is nearly mixable with water (pK_s = 1.58 ≃ 2.25 kg l⁻¹). The correlation coefficient of 0.80 compares well with other methods. By PCA it is possible to show that the irrelevant descriptor, the molecular volume, cannot improve the prediction. As suggested in many textbooks, it is very likely that the concentration should enter in some way as a descriptor. It is well known that for high concentrations, as for urea and acetamide, the hydration shell around the molecules is incomplete because of a lack of water, and that the solute molecules interact directly with each other. To correct this problem activity coefficients are added to the first approximation of equation (30 ).

Figure 27

Solubility prediction with reduced dimensionality (only hydration and lattice energies considered).

6.3. Mathematical reformulation (relative solubility of cocrystals)

Another example, and one of how to estimate the solubility even if some very important data are missing, is the case of cocrystals (Kuleshova et al., 2013 ). In pharmacy, cocrystal formulation is very common, with the aim of improving the solubility of the active pharmaceutical ingredient (API), so it would be very valuable to be able to estimate, even if not the solubility by itself, then the change in solubility after cocrystal (CC) formation with one of the possible coformers (CFs). For crystals of an API, the basic expression for solubility can be written as $[-RT\ln K_{\rm s}^{{\rm API}} = \Delta{G_{\rm s}} = G_{{\rm cryst}}^{{\rm API}}-G_ {{\rm hydr}}^{{\rm API}}.\eqno(32)]$ Here, as in Section 6.2 , the first term, $[G_{{\rm cryst}}^{{\rm API}}]$ , defines how tightly the molecule is bound to its own crystal lattice and the second, $[G_{{\rm hydr}}^{{\rm API}}]$ , how strongly the molecule associates with the solvent. Within the approach described above (see Sections 4.3.4 and 4.4 ), the calculation of the crystal energy of any molecular system is not a problem, and takes very little computational power.

In recent years several other relatively simple and quick approaches have been developed (Karamertzanis et al., 2009 ; Musumeci et al., 2011 ; Perlovich & Raevsky, 2010 ). On the other hand, molecular dynamics (MD) estimation of hydration energy requires complex simulations and high computational power (Orozco & Luque, 2000 ; Wan et al., 2004 ; Nagy, 2004 ; Duffy & Jorgensen, 2000 ; Åberg et al., 2004 ; Jorgensen & Duffy, 2000 ).

A simple mathematical reformulation allows us to exclude the second term and to avoid the tedious calculations of a hydration energy.

Let us assume that an API can form different cocrystals (CCs) with a variety of coformers (CFs). The simple dissolution of a cocrystal in water, Fig. 28 , is described by the reaction $[{\rm CC}_{{\rm API-CF}}\rightleftharpoons {\rm API}^{{\rm aq}}+{\rm CF}^{{\rm aq}}\eqno(33)]$ with the equilibrium constant K_s.

Figure 28

Scheme of simple dissolution of a cocrystal in water.

For the case of a two-component 1:1 cocrystal, equation (32 ) will transform to $[-RT\ln K_{\rm s}^{{\rm CC}} = \Delta{G_{\rm s}} = 2G_{{\rm cryst}}^{{\rm CC}}-G_{ {\rm hydr}}^{{\rm API}}-G_{{\rm hydr}}^{{\rm CF}}.\eqno(34)]$

As soon as we need just information on the change in the solubility, it is sufficient to consider the difference between equations (32 ) and (34 ), which will characterize the `relative solubility' $[K_{\rm s}^{{\rm rel}}]$ and, in addition, will allow us to exclude from consideration the free energy of hydration of the API, $[G_{{\rm hydr}}^{{\rm API}}]$ : $[-RT\ln K_{\rm s}^{{\rm API}}+RT\ln K_{\rm s}^{{\rm CC}} = G_{{\rm cryst}}^{ {\rm API}}-2G_{{\rm cryst}}^{{\rm CC}}+G_{{\rm hydr}}^{{\rm CF}}.\eqno(35)]$

Further simplification of the equation is achieved by replacing the free energy of hydration of the CF, $[G_{{\rm hydr}}^{{\rm CF}}]$ , by $[G_{{\rm hydr}}^{{\rm CF}} = RT\ln K_{\rm s}^{{\rm CF}}+G_{{\rm cryst}}^{ {\rm CF}}.\eqno(36)]$ After consequent transformations the final expression for the relative solubility, $[K_{\rm s}^{{\rm rel}}]$ , can be written as $[K_{\rm s}^{{\rm rel}} = K_{\rm s}^{{\rm CC}}/K_{\rm s}^{{\rm API}} = K_{\rm s}^{{\rm CF}}\times \exp({\Delta G/RT}),\eqno(37)]$ where $[\Delta G = G_{{\rm cryst}}^{{\rm API}}+G_{{\rm cryst}}^{{\rm CF}}-2G _{{\rm cryst}}^{{\rm CC}}.\eqno(38)]$ From expression (37 ) it follows that the relative solubility of the cocrystal will be defined by the solubility of the coformer, CF, and by the value of ΔG calculated according to equation (38 ). For cases where the solubility of the API and coformers are known, the solubility of the cocrystals can be calculated directly: $[K_{\rm s}^{{\rm CC}} = K_{\rm s}^{{\rm API}}\times K_{\rm s}^{{\rm CF}}\times \exp({ \Delta G/RT}).\eqno(39)]$

Data for pharmaceutically acceptable coformers are often stored in specific libraries and their physico-chemical characteristics are known. In these cases expression (37 ) allows a quick estimate of the relative solubility (virtual screening) of extensive numbers of practicable cocrystals.

To illustrate how this scheme works, we found the structures of four cocrystals (COLHER, WIKTEP, DAQYOK and RILJEB) and their pure components with known data on solubility in the CSD (Table 5 ). The APIs are the flavonoid naringenin (DOLRIF) and 4-aminobenzoic acid (AMBNAC); as coformers, CFs, we have triethyldiamine (TETDAM), trinitrobenzene (TNBENZ), 4-nitrobenzoic acid (NBZOAC) and 4-nitroaniline (NANILI). For all of them crystal structure predictions were performed in accordance with the protocol described in Section 3.4 and the values of ΔG were calculated according to equation (38 ).

Table 5
The estimation of the relative solubility for some cocrystals

Cocrystal	$[K_{\rm s}^{{\rm API}}]$ (mol l⁻¹)	$[K_{\rm s}^{{\rm CF}}]$ (mol l⁻¹)	ΔG (kJ mol⁻¹)	$[K_{\rm s}^{{\rm rel}}]$	$[K_{\rm s}^{{\rm CC}}]$ (mol l⁻¹)
COLHER (DOLRIF:TETDAM)	0.002	4.012	4.690	22.5	0.045
WIKTEP (AMBNAC:TNBENZ)	0.398	0.001	6.758	0.153	0.061
DAQYOK (AMBNAC:NBZOAC)	0.398	0.002	3.500	0.007	0.003
RILJEB (AMBNAC:NANILI)	0.398	0.004	2.530	0.010	0.004

In all cases ΔG was found to be positive, which means that our calculations suggest that cocrystal formation is favourable. As one can see, the cocrystallization of naringenin with much more soluble TETDAM helps to increase the solubility of the flavonoid by more than 20-fold. Crystallization of AMBNAC with the poorly soluble coformers reduces the solubility of the API. In these cases the values of the relative solubility are less than unity. For coformers with comparable solubility (in our example TNBENZ, NBZOAC and NANILI) the values of ΔG begin to play a leading role in the final solubility of the cocrystal, so the solubility of WIKTEP is higher, owing to the higher value of ΔG.

7. Concluding remarks

Data mining is still a very young and quickly developing field of science, even if some of the main algorithms used now were known in mathematics as early as the beginning of the last century. The explosive growth in data mining has been caused by the accumulation of huge amounts of data and the growing ability of computers to store them. There is no single field in science where methods of data mining and machine learning have not found an application. Partly for this reason, it is very difficult to find a unified terminology and presentation within the huge amount of specialized literature and handbooks on data mining. In the present chapter we have tried to help the reader to navigate between tasks, algorithms and their applications as used in crystallography.

In Section 2 we presented the concept, terminology and overall scheme of data mining as used at present. In subsequent sections we described the main tasks of data mining in crystallography, and methods for handling these tasks. In Section 3 we gave an overview of clustering algorithms and methods of similarity detection (also known as pattern recognition) which can be used in the fully automatic process of clustering. Several important applications of crystal structure clustering were described. Clustering of polymorphs defined at non-ambient conditions revealed significant structural changes and the need to take into account additional descriptors (temperature and pressure) during CSP.

In Section 4 the mathematical concept behind the tasks of classification and regression was presented, and illustrated using examples drawn from crystal-density estimation and force-field development. The new approach for the refining of the model and parameters of a force field allows us to handle any types of atoms including metals and ions, and to take into account external conditions. It helps to improve significantly the predictability of CSP, which belongs to one of two high-level primary goals of data mining – pattern prediction, see Section 2 ).

In Section 5 the importance of anomaly (outlier) detection was highlighted. Since all methods of machine learning are very sensitive to data quality, they can easily detect anomalies and outliers. An accurate analysis of any outliers is very often worthwhile as it can indicate the kinds of problems present: erroneous data, or the inefficiency or unsuitability of the models. Several cleansers have been developed for use during data preprocessing in addition to the two cleansers used at present in the Cambridge Structural Database, the first of which, enCIFer (CCDC, 2016a ), checks the syntax of the data, while the second cleanser, PLATON (Spek, 2016 ), checks for the reasonableness of the structure and, for instance, annotates the records of structures that contain large voids. We also illustrated how the data cleansers developed during machine learning work within the algorithms and how they can be used for database cleansing. The removal of anomalous data from a data set often results in a significant increase in the accuracy of data-mining procedures.

In Section 6 we discussed approaches to reducing the dimensionality of data for calculating and predicting solubility. The solubility of crystalline materials is of crucial importance for pharmaceutical developments and drug discovery, as are estimates of the crystal energy for monitoring the stability of proposed new drugs.

We hope that the methods and concepts presented here will make data mining and machine learning more accessible to the general practitioner in crystallography, and allow new applications in the field and the discovery of nontrivial and scientifically relevant knowledge.

7.1. Outlook

Further developments will obviously be made possible by growth in both databases and computer power. These will allow for the introduction of additional descriptors, which are still considered to be beyond the current scope of data mining. For instance, data-mining force fields only recently reached an accuracy that allows them to consider the temperature and, accordingly, to describe thermal effects in molecular crystals (Hofmann & Kuleshova, 2018 ). This has become accessible since data for variable-temperature experiments were added as a separate descriptor to the CIF template used by the CSD. The inclusion of more descriptors in the CIF template (for example, pressure, sublimation energy and so on) would open new possibilities for knowledge discovery.

Data miners are also currently developing methods for combining data from various databases. Combining existing databases is problematic for commercial reasons, but more and more algorithms are appearing that allow data to be used from a combination of various databases. This is not only the case for databases of the same class, such as the structural databases (the CSD, PDB and ICSD); it is also possible to combine data from databases of different classes, such as from thermodynamic databases (for instance, from the NIST database and from structural databases). The latter would allow the scaling of score functions obtained with data mining to standard energy units.

As a growing problem, one should mention the appearance of `fake data' in databases. Because it is very often simpler and cheaper to use an existing model to calculate a specific property rather than to measure it or conduct a literature search for experimental data, people tend to confuse virtual data and experimental data. There is thus a danger that the line between the real and the virtual may be blurred. The organizations behind the databases are well aware of this problem; however, we sometimes find structures in the CSD that have been calculated but not measured.

There is also progress at present in `hybrid methods', where the crystal structure is calculated within a common quantum-chemistry approach and only the exchange correlation term of density functional theory is obtained by data mining.

The further development of algorithms is difficult to predict, but we have presented in this chapter instances in which the weight vectors and the vector matrices have a physical meaning. This should help the reader to understand the methods. However, neural networks, for instance, can be used as a completely `black box' tool. While the trained network might predict perfectly, and better than any other model, this has some drawbacks that contradict the classical approach of science. Because black boxes do not produce any knowledge, they cannot extrapolate, and even interpolation can fail. But one can hope for significant progress with such approaches in the future.

References

Abdi, H. & Williams, L. J. (2010). WIREs Comput. Stat. 2, 433–459.Google Scholar

Åberg, K., Lyubartsev, A., Jacobsson, S. & Laaksonen, A. (2004). J. Chem. Phys. 120, 3770–3776.Google Scholar

Accelrys Software Inc. (2013). Materials Studio. San Diego, USA.Google Scholar

Allen, F. H. (2002). Acta Cryst. B58, 380–388.Google Scholar

Apostolakis, J. (2010). An introduction to data mining. Data Mining in Crystallography, edited by D. W. M. Hoffmann & L. N. Kuleshova, pp. 1–36. Structure and Bonding, Vol. 134. Berlin, Heidelberg: Springer.Google Scholar

Apostolakis, J., Hofmann, D. W. M. & Lengauer, T. (2001). Acta Cryst. A57, 442–450.Google Scholar

Balachandran, P. V. & Rajan, K. (2012). Acta Cryst. B68, 24–33.Google Scholar

Bardwell, D. A., Adjiman, C. S., Arnautova, Y. A., Bartashevich, E., Boerrigter, S. X. M., Braun, D. E., Cruz-Cabeza, A. J., Day, G. M., Della Valle, R. G., Desiraju, G. R., van Eijck, B. P., Facelli, J. C., Ferraro, M. B., Grillo, D., Habgood, M., Hofmann, D. W. M., Hofmann, F., Jose, K. V. J., Karamertzanis, P. G., Kazantsev, A. V., Kendrick, J., Kuleshova, L. N., Leusen, F. J. J., Maleev, A. V., Misquitta, A. J., Mohamed, S., Needs, R. J., Neumann, M. A., Nikylov, D., Orendt, A. M., Pal, R., Pantelides, C. C., Pickard, C. J., Price, L. S., Price, S. L., Scheraga, H. A., van de Streek, J., Thakur, T. S., Tiwari, S., Venuti, E. & Zhitkov, I. K. (2011). Acta Cryst. B67, 535–551.Google Scholar

Berman, H. M. (2008). Acta Cryst. A64, 88–95.Google Scholar

Bondi, A. (1963). J. Chem. Eng. Data, 8, 371–381.Google Scholar

Boser, B., Guyon, I. & Vapnik, V. (1992). A Training Algorithm for Optimal Margin Classifiers, pp. 144–152. ACM Press.Google Scholar

Bragg, W. H. & Bragg, W. L. (1913). Proc. R. Soc. Lond. Ser. A, 88, 428–438.Google Scholar

Buchsbaum, C., Hoeler-Schlimm, S. & Rehme, S. (2010). Data bases, the base for data mining. Data Mining in Crystallography, edited by D. W. M. Hoffmann & L. N. Kuleshova, pp. 37–58. Structure and Bonding, Vol. 134. Berlin, Heidelberg: Springer.Google Scholar

Buckingham, R. A. (1938). Proc. Phys. Soc. London A, 168, 264–283.Google Scholar

Bürgi, H-B. & Dunitz, J. D. (1994). Structure Correlation, Vols. 1 and 2. Weinheim: Wiley-VCH Verlag, GmbH.Google Scholar

CCDC (2016a). enCIFer. Cambridge Crystallographic Data Centre, UK.Google Scholar

CCDC (2016b). Mercury. Version 3.8. Cambridge Crystallographic Data Centre, UK.Google Scholar

Chisholm, J. A. & Motherwell, S. (2005). J. Appl. Cryst. 38, 228–231.Google Scholar

Clarke, J., Smith, W. & Woodcock, L. (1986). J. Chem. Phys. 84, 2290–2294.Google Scholar

Craven, M. W. & Shavlik, J. W. (1997). Future Generation Comput. Syst. 13, 211–229.Google Scholar

Day, G. M., Cooper, T. G., Cruz-Cabeza, A. J., Hejczyk, K. E., Ammon, H. L., Boerrigter, S. X. M., Tan, J. S., Della Valle, R. G., Venuti, E., Jose, J., Gadre, S. R., Desiraju, G. R., Thakur, T. S., van Eijck, B. P., Facelli, J. C., Bazterra, V. E., Ferraro, M. B., Hofmann, D. W. M., Neumann, M. A., Leusen, F. J. J., Kendrick, J., Price, S. L., Misquitta, A. J., Karamertzanis, P. G., Welch, G. W. A., Scheraga, H. A., Arnautova, Y. A., Schmidt, M. U., van de Streek, J., Wolf, A. K. & Schweizer, B. (2009). Acta Cryst. B65, 107–125.Google Scholar

Ding, C. & He, X. (2004). Proceedings of the 21st International Conference on Machine Learning, pp. 1–9. Banff, Canada. New York: Association for Computing Machinery.Google Scholar

Donohue, J. (1952). J. Phys. Chem. 56, 502–510.Google Scholar

Duchowicz, P. R., Talevi, A., Bruno-Blanch, L. E. & Castro, E. A. (2008). Bioorg. Med. Chem. 16, 7944–7955.Google Scholar

Duda, R. O. & Hart, P. E. (1973). Pattern Classification and Scene Analysis. New York: John Wiley & Sons.Google Scholar

Duffy, E. & Jorgensen, W. (2000). J. Am. Chem. Soc. 122, 2878–2888.Google Scholar

Dzyabchenko, A. V. (1994). Acta Cryst. B50, 414–425.Google Scholar

Eijck, B. P. van (2005). Acta Cryst. B61, 528–535.Google Scholar

Eijck, B. P. van & Kroon, J. (1998). J. Comput. Chem. 18, 1036–1042.Google Scholar

Etter, M. (1991). J. Phys. Chem. 95, 4601–4610.Google Scholar

Etter, M. C., MacDonald, J. C. & Bernstein, J. (1990). Acta Cryst. B46, 256–262.Google Scholar

Ewald, P. & Hermann, C. (1931). Editors. Strukturbericht 1913–1928. Leipzig: Akademische Verlagsgesellschaft mbH.Google Scholar

Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996). AI Magazine, 17, 37.Google Scholar

Gallardo, I., Guirado, G., Marquet, J. & Vilà, N. (2007). Angew. Chem. Int. Ed. 46, 1321–1325.Google Scholar

Gavezzotti, A. & Filippini, G. (1995). J. Am. Chem. Soc. 117, 12299–12305.Google Scholar

Gelder, R. de, Wehrens, R. & Hageman, J. A. (2001). J. Comput. Chem. 22, 273–289.Google Scholar

Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171–179.Google Scholar

Hall, S. R. & McMahon, B. (2006). Editors. International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data. Chichester: Wiley.Google Scholar

Hendricks, S. (1930). Chem. Rev. 7, 476–477.Google Scholar

Hofmann, D. W. M. (2002). Acta Cryst. B58, 489–493.Google Scholar

Hofmann, D. W. M. (2010). Data mining in organic crystallography. Data Mining in Crystallography, edited by D. W. M. Hoffmann & L. N. Kuleshova, pp. 89–134. Structure and Bonding, Vol. 134. Berlin, Heidelberg: Springer.Google Scholar

Hofmann, D. W. M. & Apostolakis, J. (2003). J. Mol. Struct. 647, 17–39.Google Scholar

Hofmann, D. W. M. & Kuleshova, L. (2005). J. Appl. Cryst. 38, 861–866.Google Scholar

Hofmann, D. W. M. & Kuleshova, L. N. (2010). Editors. Data Mining in Crystallography. Structure and Bonding, Vol. 134. Berlin, Heidelberg: Springer.Google Scholar

Hofmann, D. W. M. & Kuleshova, L. N. (2014). Cryst. Growth Des. 14, 3929–3934.Google Scholar

Hofmann, D. W. M. & Kuleshova, L. (2018). Chem. Phys. Lett. 699, 115–124.Google Scholar

Hofmann, D. W. M., Kuleshova, L. N. & Antipin, M. Y. (2004). Cryst. Growth Des. 4, 1395–1402.Google Scholar

Hofmann, D. W. M., Kuleshova, L. & D'Aguanno, B. (2007). Chem. Phys. Lett. 448, 138–143.Google Scholar

Hofmann, D., Kuleshova, L., Hofmann, F. & D'Aguanno, B. (2009). Chem. Phys. Lett. 475, 149–155.Google Scholar

Hofmann, D. W. M. & Lengauer, T. (1997). Acta Cryst. A53, 225–235.Google Scholar

Hofmann, F., Hofmann, D. & Kuleshova, L. (2016). FlexCryst. Version 2.03.05. FlexCryst, Uttenreuth, Germany.Google Scholar

Hou, T. J., Zand, W. & Xu, J. (2004). J. Chem. Inf. Model. 44, 266–275.Google Scholar

Huuskonen, J. (2000). J. Chem. Inf. Model. 40, 773–777.Google Scholar

Jolliffe, I. (2002). Principal Component Analysis. Springer.Google Scholar

Jorgensen, W. & Duffy, E. (2000). Bioorg. Med. Chem. Lett. 10, 1155–1158.Google Scholar

Karamertzanis, P., Kazantsev, A., Issa, N., Welch, G. W. A., Adjiman, C., Pantelides, C. & Price, S. (2009). J. Chem. Theory Comput. 5, 1432–1448.Google Scholar

Karfunkel, H. R., Rohde, B., Leusen, F. J. J., Gdanitz, R. J. & Rihs, G. (1993). J. Comput. Chem. 14, 1125–1135.Google Scholar

Kempster, C. J. E. & Lipson, H. (1972). Acta Cryst. B28, 3674.Google Scholar

Kitaigorodskii, A. (1957). The Theory of Crystal Structure Analysis. Moscow: Academy of Science of the USSR.Google Scholar

Kitaigorodskii, A. J. (1961). Organic Chemical Crystallography. New York: Consultants Bureau.Google Scholar

Kopp, H. (1842). Ann. Chem. Pharm. 41, 169–189.Google Scholar

Kuleshova, L., Hofmann, D. & Boese, R. (2013). Chem. Phys. Lett. 564, 26–32.Google Scholar

Kuleshova, L. N. & Zorky, P. M. (1980). Acta Cryst. B36, 2113–2115.Google Scholar

Laue, M. (1913). Ann. Phys. 346, 989–1002.Google Scholar

Lennard-Jones, J. E. (1931). Proc. Phys. Soc. 43, 461–482.Google Scholar

Lipinski, C. A., Lombardo, F., Dominy, B. W. & Feeney, P. J. (2012). Adv. Drug Deliv. Rev. 64, 4–17.Google Scholar

Lommerse, J. P. M. (2000). Acta Cryst. A56, s196.Google Scholar

Lüder, K., Lindfors, L., Westergren, J., Nordholm, S., Persson, R. & Pedersen, M. (2009). J. Comput. Chem. 30, 1859–1871.Google Scholar

Macrae, C. F., Sovago, I., Cottrell, S. J., Galek, P. T. A., McCabe, P., Pidcock, E., Platings, M., Shields, G. P., Stevens, J. S., Towler, M. & Wood, P. A. (2020). J. Appl. Cryst. 53, 226–235.Google Scholar

McCulloch, W. & Pitts, W. (1943). Bull. Math. Biophys. 5, 115–133.Google Scholar

McGuire, R., Momany, F. & Scheraga, H. (1972). J. Phys. Chem. 76, 375–393.Google Scholar

Maimon, O. (2010). Knowledge Discovery and Data Mining: The Info-Fuzzy Network (IFN) Methodology. Berlin: Springer-Verlag.Google Scholar

Mannila, H. (1996). In Proceedings of the Eighth International Conference on Scientific and Statistical Database Management, pp. 2–9. IEEE Computer Society.Google Scholar

Mauri, A., Consonni, V., Pavan, M. & Todeschini, R. (2006). Match, 56, 237–248.Google Scholar

Motherwell, W. D. S. (2010). CrystEngComm, 12, 3554–3570.Google Scholar

Musumeci, D., Hunter, C., Prohens, R., Scuderi, S. & McCabe, F. (2011). Chem. Sci. 2, 883–890.Google Scholar

Nagy, P. (2004). J. Phys. Chem. B, 108, 11105–11117.Google Scholar

Nangia, A. & Desiraju, G. (1998). Top. Curr. Chem. 198, 57–95.Google Scholar

Nikam, S. S. (2015). Orient. J. Comput. Sci. Technol. 8, 13–19.Google Scholar

Orozco, M. & Luque, F. J. (2000). Chem. Rev. 100, 4187–4226.Google Scholar

Pauling, L. (1939). The Nature of the Chemical Bond. Ithaca: Cornell University Press.Google Scholar

Pearson, K. (1901). London Edinb. Dubl. Philos. Mag. J. Sci. 2, 559–572.Google Scholar

Pèpe, G., Guiliani, G., Loustalet, S. & Halfon, P. (2002). Eur. J. Med. Chem. 37, 865–872.Google Scholar

Perlovich, G. L. & Raevsky, O. A. (2010). Cryst. Growth Des. 10, 2707–2712.Google Scholar

Putz, H., Schön, J. C. & Jansen, M. (1999). J. Appl. Cryst. 32, 864–870.Google Scholar

Rajan, K. (2010). Data mining and inorganic crystallography. Data Mining in Crystallography, edited by D. W. M. Hoffmann & L. N. Kuleshova, pp. 59–88. Structure and Bonding, Vol. 134. Berlin, Heidelberg: Springer.Google Scholar

Rajan, K. Ch., Suh, C. & Mendez, P. (2009). Stat. Anal. Data Mining, 1, 361–371.Google Scholar

Reilly, A. M., Cooper, R. I., Adjiman, C. S., Bhattacharya, S., Boese, A. D., Brandenburg, J. G., Bygrave, P. J., Bylsma, R., Campbell, J. E., Car, R., Case, D. H., Chadha, R., Cole, J. C., Cosburn, K., Cuppen, H. M., Curtis, F., Day, G. M., DiStasio, R. A. Jr, Dzyabchenko, A., van Eijck, B. P., Elking, D. M., van den Ende, J. A., Facelli, J. C., Ferraro, M. B., Fusti-Molnar, L., Gatsiou, C.-A., Gee, T. S., de Gelder, R., Ghiringhelli, L. M., Goto, H., Grimme, S., Guo, R., Hofmann, D. W. M., Hoja, J., Hylton, R. K., Iuzzolino, L., Jankiewicz, W., de Jong, D. T., Kendrick, J., de Klerk, N. J. J., Ko, H.-Y., Kuleshova, L. N., Li, X., Lohani, S., Leusen, F. J. J., Lund, A. M., Lv, J., Ma, Y., Marom, N., Masunov, A. E., McCabe, P., McMahon, D. P., Meekes, H., Metz, M. P., Misquitta, A. J., Mohamed, S., Monserrat, B., Needs, R. J., Neumann, M. A., Nyman, J., Obata, S., Oberhofer, H., Oganov, A. R., Orendt, A. M., Pagola, G. I., Pantelides, C. C., Pickard, C. J., Podeszwa, R., Price, L. S., Price, S. L., Pulido, A., Read, M. G., Reuter, K., Schneider, E., Schober, C., Shields, G. P., Singh, P., Sugden, I. J., Szalewicz, K., Taylor, C. R., Tkatchenko, A., Tuckerman, M. E., Vacarro, F., Vasileiadis, M., Vazquez-Mayagoitia, A., Vogt, L., Wang, Y., Watson, R. E., de Wijs, G. A., Yang, J., Zhu, Q. & Groom, C. R. (2016). Acta Cryst. B72, 439–459.Google Scholar

Riis, S. K. & Krogh, A. (1996). J. Comput. Biol. 3, 163–183.Google Scholar

Samudrala, S. K., Balachandran, P. V., Zola, J., Rajan, K. & Ganapathysubramanian, B. (2014). Integrating Mater. Manuf. Innov. 3, 1.Google Scholar

Schmidt, M. U., Buchsbaum, C., Schnorr, J. M., Hofmann, D. W. M. & Ermrich, M. (2007). Z. Kristallogr. 222, 30–33.Google Scholar

Schmidt, M., Hofmann, D., Buchsbaum, C. & Metz, H. (2006). Angew. Chem. Int. Ed. 45, 1313–1317.Google Scholar

Spek, A. (2016). PLATON. University of Utrecht, The Netherlands.Google Scholar

Streek, J. van de & Motherwell, S. (2005). Acta Cryst. B61, 504–510.Google Scholar

Todeschini, R. & Consonni, V. (2009). Molecular descriptors for chemoinformatics. Methods and Principles in Medicinal Chemistry, edited by R. Mannhold, H. Kubinyi & G. Folkers, Vol. 41. Weinheim: Wiley-VCH Verlag GmbH.Google Scholar

Verwer, P. & Leusen, F. (1998). Rev. Comput. Chem. 12, 327–365.Google Scholar

Villar, P., Daam, J., Shikata, Y., Rajan, K. & Iwata, S. (2008). Chem. Met. Alloys, 1, 1–23.Google Scholar

Virtual Computational Chemistry Laboratory (2016). Alogps. Version 2.1. http://www.vcclab.org/lab/alogps/.Google Scholar

Wan, S., Stote, R. H. & Karplus, M. (2004). J. Chem. Phys. 121, 9539–9548.Google Scholar

Yang, Q.-C., Seiler, P. & Dunitz, J. D. (1987). Acta Cryst. C43, 565–567.Google Scholar

Zhang, G. P. (2010). Neural networks for data mining. In Data Mining and Knowledge Discovery Handbook. Boston: Springer.Google Scholar

International Tables for Crystallography (2022). Vol. C. ch. 1.4,
https://doi.org/10.1107/S1574870721008235