International Tables for Crystallography, Volume C: Mathematical, physical and chemical tables. Edited by T. R. Welberry.

International Tables for Crystallography (2022). Vol. C. ch. 1.5,
https://doi.org/10.1107/S1574870721008247

Chapter 1.5. Data mining. II. Prediction of protein structure and optimization of protein crystallizability

Maksim Kouza,a,b Eshel Faraggi,b,c,d Taner Z. Sen,e Robert L. Jerniganf and Andrzej Kloczkowskib,g*

aFaculty of Chemistry, University of Warsaw, Poland, bBattelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA, cDepartment of Physics, Indiana University–Purdue University Indianapolis, Indianapolis, Indiana, USA, dResearch and Information Systems, LLC, Indianapolis, Indiana, USA, eCrop Improvement and Genetics Research Unit, US Department of Agriculture, Agricultural Research Service, Albany, California, USA, fDepartment of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, USA, and gDepartment of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio, USA
Correspondence e-mail:  andrzej.kloczkowski@nationwidechildrens.org

Recent advances in computational technology have made it possible to store and mine data sets much larger than ever before. Data-mining techniques, which extract useful information from these massive data sets, have matured into efficient predictive tools across broad scientific fields, including but not limited to molecular biology, astronomy, bioinformatics, physics and medicine. In this chapter, we first discuss an original method (fragment data mining, FDM), which mines structural segments from the Protein Data Bank (PDB) and utilizes the structural information of fragments whose sequences match the query, with the aim of improving the prediction of secondary structure. We then discuss further improvements obtained by combining FDM with the classical GOR V secondary structure prediction method, which is based on information theory and Bayesian statistics coupled with evolutionary information from multiple sequence alignments. We also discuss a second, newer and more accurate approach to secondary structure prediction, the SPINE-X method, which uses machine learning to predict secondary structure by mining protein sequences and structures. Our results strongly suggest that data mining can be an efficient and accurate approach for secondary structure prediction in proteins. The last part of the chapter discusses applications of data mining to the problem of optimizing protein crystallization conditions. Data mining can be used to improve the yield and quality of protein crystals, and thus aid the solution of protein structures by X-ray crystallography. There is a vast amount of data on protein structures, sequences and crystallization conditions that can be mined to aid structure prediction and structure determination.

Keywords: data mining; protein crystallography; protein structure prediction; secondary structure prediction; protein crystallization.

1. Introduction

Understanding the mechanism of protein folding remains one of the most challenging problems in molecular biology (Dill & MacCallum, 2012[link]; Onuchic & Wolynes, 2004[link]). The seminal experiment reported by Anfinsen on unfolding and refolding of ribonuclease A demonstrated that the protein sequence contains all of the information to specify how a protein can fold spontaneously into its active native state (Anfinsen, 1973[link]). Protein structure prediction and folding have been studied by computational modelling at different levels of resolution and time scales. Through advances in computational power it has become possible to set new limits for simulating the folding of proteins (Shaw et al., 2010[link]), but folding simulations for the largest proteins still remain beyond the reach of all-atom molecular dynamics (MD) simulations with explicit solvent. (A list of the abbreviations used in this chapter is provided in Section 7[link].)

1.1. Coarse-grained models and conformational sampling

Various strategies for overcoming the size and time-scale limitations have been proposed, including the development of novel coarse-grained (CG) methods and implicit solvent models. CG models reduce complexity by representing each amino acid with fewer pseudo-atoms (nodes) than its actual number of atoms, which allows a significant decrease in the number of degrees of freedom (Liwo et al., 2011[link]; Kolinski, 2004[link]; Kmiecik et al., 2016[link]; Wabik et al., 2013[link]); coarse graining may typically be performed at the level of one geometric point per amino acid. Although the improved computational performance of CG models comes at the cost of some structural accuracy, CG models remain powerful and successful tools for de novo structure prediction (Kmiecik et al., 2016[link]).

There are various alternative ways to improve conformational sampling efficiency: one can use enhanced sampling techniques such as replica-exchange (Hansmann, 1997[link]; Kouza & Hansmann, 2011[link]) and umbrella sampling (Boczko & Brooks, 1995[link]), or reduce the conformational space further by implementing a fragment-based approach (Zhang, 2008[link]; Simons et al., 1997[link]; Blaszczyk et al., 2013[link], 2016[link]).

1.2. The Rosetta method of protein structure prediction

The fragment-based approach to predicting protein structures was adopted early on in the Rosetta suite (Simons et al., 1997[link]). Rosetta creates a library of possible conformations of short fragments of the target protein sequence. Those short fragments, derived from a set of known structures, are assembled by a Monte Carlo (MC) minimization together with a scoring function rewarding native-like properties. The success of Rosetta in structure and function predictions shows the importance of the information in structural fragments for theoretical predictive studies of proteins. Newer versions using sequence correlation information predict contacts within structures with great success (Ovchinnikov et al., 2017[link]).

1.3. Homology modelling

It is widely assumed that a protein's three-dimensional structure is encoded in its sequence of amino acids. Although there are counterexamples, as sometimes one mutation in a sequence can lead to very different protein folds and functions (Alexander et al., 2009[link]; Kouza & Hansmann, 2012[link]), a sufficiently high similarity between protein sequences generally implies similarity between protein structures. This observation is the basis of many structure prediction algorithms that use sequence similarity searches to identify homologous proteins. While protein crystallography may be the method of choice for structure determination, it is too expensive and too slow to deal with the exponentially growing number of protein sequences being produced by high-throughput genome sequencing. Computational methods are thus going to be needed to solve this problem (Baker & Sali, 2001[link]). One of the key steps in the prediction of tertiary structures of proteins is the prediction of secondary structures. Typically, in ab initio methods this is the first step. Prediction of secondary structure is a one-dimensional problem, which is much easier than three-dimensional structure prediction, and has been a subject of interest for a long time. However, de novo protein structure prediction (and secondary structure prediction) still remains a challenging problem in molecular biology.

1.4. Data mining of protein structures

Recent advances in technology have made storing huge data sets extremely cheap. More than 95% of protein structural data have been generated and stored in the last decade, and there seems to be no end to new data generation. Data mining, the extraction of useful information from huge data sets, is a maturing field. To analyse and process these enormous amounts of biological data, new `big data' methodologies have been developed, based on mapping, searching and analysing sequences for patterns, and on data mining of three-dimensional structures. Mining for information in biological databases involves various forms of data analysis, such as clustering, sequence homology searches, structure homology searches, examination of statistical significance and so on.

1.5. BLAST

The most basic data-mining tool in biology is the basic local alignment search tool (BLAST; Altschul et al., 1990[link]) for examining a new nucleic acid or protein sequence. BLAST compares the new sequence with all sequences in a chosen database (here, the PDB) to find those that are most similar to the query sequence. This is normally the way in which genes and proteins are assigned functions.

1.6. Statistical and machine-learning methods

Data-mining tasks can be descriptive, by uncovering patterns and relationships in the available data, or predictive, based on models derived from these data. Most popular and frequently used are automated data-mining tools that employ sophisticated algorithms based on statistics or machine learning to discover hidden patterns in biological data. Data mining can be applied to a variety of biological problems such as analysis of protein interactions, finding homologous sequences or homologous structures, multiple sequence alignment, construction of phylogenetic trees, genomic sequence analysis, gene finding, gene mapping, gene expression data analysis, drug discovery etc. All of these different problems can be studied by using various data-mining tools and techniques.

1.7. Chapter outline

In this chapter, first we discuss the application of data mining to the problem of protein secondary structure prediction. We describe an original method (fragment data mining, FDM), which has proven to be a good predictor of the secondary structure of proteins. The FDM method mines structural segments in the Protein Data Bank (PDB) and utilizes structural information from the matching sequence fragments, with the aim of improving predictions of secondary structure. Next, we briefly describe the consensus data mining (CDM) method, which combines FDM with the classical GOR V secondary structure prediction method, which is based on information theory and Bayesian statistics, coupled with evolutionary information from multiple sequence alignments. Finally, we discuss the application of the data-mining approach to the problem of optimizing protein crystallization conditions.

2. Fragment database mining method

The fragment database mining method that we proposed in 2005 was motivated by the success of Baker and collaborators, whose Rosetta algorithm (Simons et al., 1997[link], 1999[link]) predicts tertiary structure using short structural fragments. Rosetta takes the distribution of local structures adopted by short sequence segments and identifies patterns (so-called I-sites) that correspond to local structures of proteins in the PDB. These form the I-site library that is used for the prediction of protein three-dimensional structure. The Rosetta method considers both local and superlocal sequence–structure biases and uses fragment-insertion Monte Carlo moves to build the three-dimensional structure of a protein. Its success in structure and function prediction shows the importance of the information carried in structural fragments.

2.1. Local alignment and evolutionary information

Taking inspiration from the approach implemented in Rosetta, we developed the FDM method. Because a sufficiently high similarity between protein sequences implies similarity between protein structures and conserved local motifs may assume a similar shape, we searched for a local alignment method to obtain structure information for predicting the secondary structure of query sequences. In the FDM method, BLAST is applied to query all the structures from the PDB. Then, evolutionary information is introduced to the prediction of secondary structure by analysing the fragments of the alignments belonging to proteins in the PDB. In the prediction and evaluation part, each query sequence from the data set is subjected to the following procedure.

The procedure starts with the assignment of weights to the matching segments obtained from BLAST. In this step various parameters are considered, including different substitution matrices, similarity/identity cutoffs, the degree of solvent exposure of residues, and protein classification and size. In the second step, we calculate normalized scores for each residue; the secondary structure of that residue is then predicted according to these scores. Two different strategies were applied to make the prediction from the normalized scores: the first is to choose the highest-scoring structural class, and the second is to use machine-learning approaches to choose a classification based on training. Finally, we calculate Q3 (the overall fraction of residues whose three-state assignment, α-helix, β-strand or coil, is predicted correctly) and the Matthews correlation coefficients.
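As a concrete illustration, the Q3 accuracy measure used throughout this chapter can be computed as the fraction of residues whose predicted three-state label matches the observed one. A minimal sketch in Python (the example strings below are hypothetical, chosen only to exercise the function):

```python
def q3(predicted, observed):
    """Three-state accuracy: the fraction of residues whose predicted
    state (H, E or C) agrees with the observed secondary structure."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must be of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

# Hypothetical 10-residue example: 8 of the 10 states agree.
print(q3("HHHHEEECCC", "HHHCEEECCH"))  # -> 0.8
```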

2.2. Multistage procedure used in the FDM method

Fig. 1[link] shows a scheme of the multistage procedure employed in the fragment database mining (FDM) method.

[Figure 1]

Figure 1

Multistage procedure for the prediction of secondary structure using the fragment database mining (FDM) method.

For training we used the CB513 benchmark data set of 513 non-redundant domains developed by Cuff & Barton (1999[link], 2000[link]). Local sequence alignments were generated using BLAST with different substitution matrices (Henikoff & Henikoff, 1992[link]), including BLOSUM-45, BLOSUM-62, BLOSUM-80, PAM-30 and PAM-70.

2.3. Secondary structure alphabet

Unlike DSSP (Kabsch & Sander, 1983[link]), which uses an eight-letter structural alphabet, we used a reduced three-letter structural alphabet of secondary structures: α-helix (H), extended (β-strand) (E) and coil (C). H, E and C in our three-letter code correspond, respectively, to helices (H, G and I), strands and bridges (E and B), and turn, bend and coil (T, S and C) in the DSSP code.
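This eight-to-three-state reduction can be sketched as a simple lookup (the treatment of symbols not listed in the mapping is our assumption):

```python
# Reduction of the eight-letter DSSP alphabet to the three-state code
# used here: helices (H, G, I) -> H, strands and bridges (E, B) -> E,
# turn, bend and coil (T, S, C) -> C.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}

def reduce_dssp(dssp):
    # Any unlisted DSSP symbol (e.g. the blank used for unassigned
    # residues) is treated here as coil.
    return "".join(DSSP_TO_THREE.get(s, "C") for s in dssp)

print(reduce_dssp("HGIEBTSC"))  # -> HHHEECCC
```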

2.4. Weight assignment

For weight assignment we defined identity scores and their powers (idc, i.e. the identity score id raised to a power c, where c is a positive real number) as the weights of matching segments. Here id is the ratio of the number of exact residue matches to the total number of residues in the matching segment. Weights were then adjusted to obtain the best match, as illustrated in Fig. 2[link]. At each position, the predicted secondary structure is determined by the secondary structures of the matches at that position. Each match is assigned a weight according to the similarity or identity score of its alignment from BLAST. At each position the weights are normalized, and the normalized scores for each secondary structure state are calculated.

[Figure 2]

Figure 2

An example showing a query sequence and its matching segments based on sequence matches (sequences not shown). The matching segments are expressed as secondary structure elements. The weights are shown for each segment.

2.5. Normalized score

We defined s(H, i) as the normalized score for position i to be in state H,

[s({\rm H}, i) = {{\sum w({\rm H},i)} \over {\sum w({\rm H},i) + \sum w({\rm E},i) + \sum w({\rm C},i)}}.\eqno(1)]

Here w(H, i) is the weight of one matching segment that places the residue at the ith position in a helix; w(E, i) and w(C, i) are defined similarly. For the ith position of a query sequence we thus have three normalized scores, s(H, i), s(E, i) and s(C, i), and the secondary structure state with the highest score is chosen as the final prediction for that position. For example, with the weights shown in Fig. 2[link], s(H, 2) = 0.2/(0.1 + 0.2 + 0.4) = 0.29 and s(E, 5) = (0.1 + 0.2)/(0.1 + 0.2 + 0.3 + 0.4) = 0.3.
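Equation (1) can be sketched directly in Python. The segment representation and the weights below are hypothetical, chosen only to exercise the formula, not taken from Fig. 2:

```python
from collections import defaultdict

def normalized_scores(matches, i):
    """Equation (1): per secondary structure state, sum the weights of
    all matching segments covering query position i, then normalize by
    the total weight at that position. Each match is a tuple
    (start, ss_string, weight), where ss_string[0] sits at query
    position `start`."""
    totals = defaultdict(float)
    for start, ss, weight in matches:
        if start <= i < start + len(ss):
            totals[ss[i - start]] += weight
    z = sum(totals.values())
    return {state: w / z for state, w in totals.items()} if z else {}

# Three hypothetical segments voting at position 2: H (weight 0.1),
# E (0.2) and H (0.4), so s(H, 2) = 0.5/0.7 and the prediction is H.
matches = [(0, "HHHH", 0.1), (1, "EEE", 0.2), (2, "HCC", 0.4)]
scores = normalized_scores(matches, 2)
print(max(scores, key=scores.get))  # -> H
```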

2.6. Substitution matrices

We have used two different substitution matrices: PAM and BLOSUM. The PAM (percent accepted mutation) matrix was introduced by Dayhoff et al. (1978[link]) to quantify the amount of evolutionary change in a protein sequence, based on observation of how often different amino acids replace other amino acids during evolution. BLOSUM (blocks substitution matrix) was introduced by Henikoff & Henikoff (1992[link]) to obtain a better measure of differences between two proteins, specifically for distantly related ones. The BLOSUM matrix is derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. We used several versions of these matrices (BLOSUM-45, BLOSUM-62, BLOSUM-80, PAM-30 and PAM-70) in BLAST. For example, PAM-30 refers to 30 mutations per 100 amino acids of sequence, and BLOSUM-62 means that it was derived from sequence blocks clustered at the 62% identity level.

2.7. Identity powers

We tried different combinations of matrices and identity powers. The best results were obtained by using BLOSUM-45 and id3 as the weight assignment method. [See Fig. 3 in Cheng et al. (2005[link]) for prediction accuracies computed for the CB513 data set using five different substitution matrices and several powers of identity scores.]

2.8. Identity cutoffs

To test the concept of fragment assembly for secondary structure prediction, we set limits on the degree of similarity or identity for fragments to be included by using different cutoffs: 99, 90, 80, 70 and 60% similarity or identity scores. Matches with similarity or identity scores higher than the cutoff were eliminated from the lists of matching segments used for calculating normalized scores. Fig. 3[link] shows the results obtained for BLOSUM-45 and id3. We observe a steady decrease in Q3 as the cutoff is lowered.

[Figure 3]

Figure 3

Prediction accuracy Q3 using different identity cutoffs.

Matches with the highest identity scores (greater than the id cutoff) were filtered out, but `reasonable' high-id matches were kept. We define the `reasonable' high-id matches as those that have identity scores greater than the id cutoff but are neither too short (more than 5 residues) nor too long (less than 90 or 95% of the length of the query sequence).
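The filtering rule described above can be sketched as a single predicate (function and parameter names are hypothetical, and the default values follow the 90% variant):

```python
def keep_match(identity, length, query_length,
               id_cutoff=0.90, max_fraction=0.90):
    """Retain a match for scoring. Matches at or below the identity
    cutoff are always kept; high-identity matches are kept only if
    they are not too short (>5 residues) and not nearly as long as
    the query (< max_fraction of its length)."""
    if identity <= id_cutoff:
        return True
    return length > 5 and length < max_fraction * query_length

# A 20-residue match of 95% identity against a 100-residue query is
# kept; a near-full-length 95%-identity match is filtered out.
print(keep_match(0.95, 20, 100), keep_match(0.95, 95, 100))  # -> True False
```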

2.9. Prediction accuracies

The prediction accuracies Q3 were compared for three cases, always using BLOSUM-45 with an id cutoff of 0.90. In case (1), all high-id matches (matches with identity scores higher than the identity cutoff) are filtered out; this case is used as a control. In case (2), matches with identity scores greater than the id cutoff (0.90 here), lengths longer than 5 residues and lengths less than 90% of the query sequence are retained. In case (3), matches with identity scores greater than the id cutoff, lengths longer than 5 residues and lengths less than 95% of the query sequence are kept.

Table 1[link] shows the accuracies for all three cases. We observe again that id3 gives the best accuracy.

Table 1
Prediction accuracies with an identity cutoff of 90% for three cases using the BLOSUM-45 substitution matrix

The best prediction accuracy (0.742, case 3 with id3) is highlighted by the comparison below.

id cutoff | High-id matches processing | id1/3 | id1/2 | id1 | id2 | id3
0.90 | Case 1 (all high-id matches filtered out) | 0.675 | 0.680 | 0.697 | 0.725 | 0.735
0.90 | Case 2 (length longer than 5 residues and less than 90% of query sequence) | 0.677 | 0.683 | 0.701 | 0.730 | 0.740
0.90 | Case 3 (length longer than 5 residues and less than 95% of query sequence) | 0.678 | 0.683 | 0.702 | 0.731 | 0.742

2.10. Computed solvent-exposed surface area improves predictions

It is interesting to include approximate tertiary information in the computations, as recent studies (Adamczak et al., 2005[link]; Faraggi et al., 2012[link]) have demonstrated that secondary structure prediction can be improved by taking solvent-exposed surface area into account. To do so, we used the software NACCESS (Hubbard & Thornton, 1993[link]), which allows calculation of the degree of solvent exposure for all available PDB sequences. The idea was to differentiate buried and exposed residues by assigning them different weights according to the following rule: if the computed relative accessibility of a residue is less than 5.0% it is regarded as buried, if it is greater than 40.0% the residue is considered exposed, and all others are regarded as intermediate cases. Buried residues were weighted most heavily. We made the following linear changes to the weights of residues: if a residue is buried, its weight is doubled; if it is intermediate, the original weight is multiplied by 1.5; if it is exposed, the weight remains unchanged. However, our results obtained for weights id3, id cutoff = 0.99 and the BLOSUM-45 substitution matrix show that differentiating buried and exposed residues does not have a significant effect. Alternatively, a fast ASA (accessible surface area) predictor (Faraggi et al., 2014[link], 2017[link]) could be used to estimate the benefit of including solvent-exposed surface area, but the present results do not make this look very promising.
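The reweighting rule above amounts to a three-branch multiplier on a segment's weight (a minimal sketch; the function name is ours):

```python
def rsa_weight_factor(rsa_percent):
    """Multiplier applied to a matching segment's weight at a residue,
    following the linear rule above: buried (<5% relative
    accessibility) -> 2.0, intermediate -> 1.5, exposed (>40%) -> 1.0."""
    if rsa_percent < 5.0:
        return 2.0
    if rsa_percent > 40.0:
        return 1.0
    return 1.5

# Buried, intermediate and exposed residues, respectively.
print([rsa_weight_factor(r) for r in (2.0, 20.0, 60.0)])  # -> [2.0, 1.5, 1.0]
```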

2.11. The effect of protein size

In order to evaluate whether the protein size leads to any changes in the accuracy of prediction, we divided all proteins from the CB513 data set into four groups according to the sequence length: very small (n ≤ 100 residues), small (100 < n ≤ 200 residues), large (200 < n ≤ 300 residues) and very large (n > 300 residues). The very small, small, large and very large groups contained 154, 216, 84 and 58 sequences, respectively. As in the previous case, we used the BLOSUM-45 substitution matrix with weight function id3. No optimization was applied. The prediction accuracies for proteins of different size are presented in Table 2[link]. With increasing sizes, we noted a small increase in accuracy from 0.911 for the smallest group to 0.948 for the large category.

Table 2
The accuracies of predictions for proteins of different sizes using the BLOSUM-45 substitution matrix with weight function id3

 | CB513 | Very small (n ≤ 100) | Small (100 < n ≤ 200) | Large (200 < n ≤ 300) | Very large (n > 300)
Q3 | 0.931 | 0.911 | 0.936 | 0.948 | 0.940

3. Consensus data-mining method – combining FDM with GOR V

In an attempt to improve the accuracy of secondary structure predictions with the fragment database mining method, we developed the consensus data mining (CDM) method. This method combines our two previous successful secondary structure prediction methods: the fragment database mining (FDM) method (Cheng et al., 2005[link]) and the GOR V algorithm (Kloczkowski et al., 2002[link]; Sen et al., 2005[link]). The basic assumption with this approach is that the combination of two complementary methods can enhance the performance of the overall secondary structure prediction. An advantage of FDM over the GOR V method is its ability to predict secondary structure accurately when sequentially similar fragments in the PDB are available. However, GOR V predictions are more accurate than FDM when good fragments from the PDB are not available. We combined the FDM and GOR V methods by introducing a novel CDM method that optimally utilizes the distinct advantages of both methods. The CDM algorithm uses a single parameter – the sequence identity threshold – to decide whether to use the FDM or the GOR V prediction at a given site. The consensus in the CDM method is reached as follows: FDM predictions are used if the sequence identity score for residues is greater than the sequence identity cutoff value, otherwise the GOR V predictions are used.
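The single-parameter consensus rule can be sketched per residue as follows (a minimal illustration; the threshold value and the example strings are hypothetical):

```python
def cdm_prediction(fdm, gorv, identity, threshold):
    """Per-residue consensus rule described above: use the FDM
    prediction where the local sequence identity score exceeds the
    threshold, otherwise fall back on the GOR V prediction."""
    return "".join(
        f if s > threshold else g
        for f, g, s in zip(fdm, gorv, identity)
    )

# Hypothetical four-residue example with a threshold of 0.5: the first
# two positions have good fragment coverage, the last two do not.
print(cdm_prediction("HHHH", "CCEE", [0.9, 0.8, 0.2, 0.1], 0.5))  # -> HHEE
```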

3.1. Dependence on the availability of fragments

The success of FDM depends largely on the availability of fragments similar to the target sequence. In practice, however, the availability of similar sequences can vary significantly. In order to analyse the relationship between the performance of CDM and the sequence similarity of fragments, we systematically excluded fragment alignments with sequence identities above a certain limit, which we call the upper sequence identity limit. This limit is not an additional parameter of the CDM method; the results simply demonstrate what would be expected in the absence of fragments with similarities above the limit.

3.2. Multiple sequence alignments improve predictions

The performance of all secondary structure prediction methods can be improved with multiple sequence alignments: the GOR V method tested with the full jack-knife methodology (a popular leave-one-out resampling technique) yields an accuracy of 73.5% when multiple sequence alignments (MSAs) are included; otherwise the accuracy is about 10% less.

One of the significant advantages of FDM is its applicability to various evolutionary problems, because the algorithm does not rely exclusively on the sequences with the highest sequence similarity, but assigns weights to BLAST-aligned sequences that apparently capture divergent evolutionary relationships. As a result, CDM, which incorporates FDM, can be successfully used even when there is a range of sequence similarities for the BLAST-identified sequences (Kandoi et al., 2017[link]).

3.3. Correlated mutations and residue contacts

Recent efforts to improve protein sequence matching are yielding significant gains (Jia & Jernigan, 2021[link]) by incorporating structural information into the sequence matching. This is done by utilizing correlated sequence pairs that are in contact in protein structures to increase the number of allowed substitutions. This approach substantially expands the sequence fragment library, and the results for the methods discussed above should therefore improve.

Note that the results shown here are not conventional machine-learning results in the sense of using a training set and a test set, but are just from trials to investigate how successful the variations on the approach can be, all on the same set of known experimental structures. The next secondary structure prediction method discussed below is a machine-learning method, having training and test sets of data.

4. The SPINE-X method

SPINE-X, a method originally developed by Faraggi et al. (2012[link]), combines prediction of secondary structure, residue solvent accessibility and torsion angles by using a six-step iterative procedure.

4.1. The six-step iterative procedure

The first five steps culminate in the prediction of torsion angles (both φ and ψ). SPINE-X first generates a position-specific scoring matrix (PSSM) using PSI-BLAST, together with seven representative physical parameters (PPs): a steric parameter (graph shape index), hydrophobicity, volume, polarizability, isoelectric point, helix probability and sheet probability. In the first step, a neural network predicts a secondary structure SS0 employing the PSSM and PPs as input. In the second step, another neural network predicts residue solvent accessibility (RSA) with the PSSM, PPs and the predicted SS0 as input. In the third step, the predicted RSA and SS0, together with the PSSM and PPs, are used to predict the torsion angles. In the fourth step, a new round of secondary structure prediction (SS1) is performed based on the previous predictions. In the fifth step, new torsion-angle predictions are made based on the predictions from the previous iterations. In the sixth (final) step, a neural network is trained to predict secondary structure using the PSSM, PPs and the predicted values from the first five steps.
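The dataflow of the six steps can be sketched schematically. Every `predict` call below stands in for one of the trained neural networks, not the real implementation; only the ordering of the steps and the inputs passed between them follow the description above, and the exact inputs to step 5 are our assumption:

```python
def spine_x_pipeline(pssm, pp, predict):
    """Schematic dataflow of the six-step SPINE-X iteration.
    `predict(name, *inputs)` is a stand-in for the neural network of
    each step."""
    ss0 = predict("SS0", pssm, pp)                           # step 1
    rsa = predict("RSA", pssm, pp, ss0)                      # step 2
    angles0 = predict("angles", pssm, pp, ss0, rsa)          # step 3
    ss1 = predict("SS1", pssm, pp, ss0, rsa, angles0)        # step 4
    angles1 = predict("angles-2", pssm, pp, ss1, angles0)    # step 5
    return predict("SS-final", pssm, pp,                     # step 6
                   ss0, rsa, angles0, ss1, angles1)

# Trace the order in which the six predictors are invoked.
calls = []
def trace(name, *inputs):
    calls.append(name)
    return name

spine_x_pipeline("PSSM", "PP", trace)
print(calls)  # -> ['SS0', 'RSA', 'angles', 'SS1', 'angles-2', 'SS-final']
```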

4.2. Neural network architecture and protein data set

In each step, the general architecture of the neural networks is the same, consisting of two hidden layers with 101 hidden nodes. Training and initial testing for all neural networks were performed on the SPINE data set of 2640 proteins from the PDB and on its subset of 2479 proteins with lengths less than 500 residues.

4.3. Prediction accuracy

For secondary structure the accuracy is between 81 and 84%, depending on the data set and the choice of tests. The Pearson correlation coefficient for accessible surface area predictions is 0.75, and the mean absolute errors for the φ and ψ dihedral angles are 20° and 33°, respectively. Full details are available in Faraggi et al. (2012[link]) and Faraggi & Kloczkowski (2017[link]).

5. Crystallization data mining in protein X-ray structure determination

This is another area where data mining can be useful, although at present it does not relate in any direct way to the secondary structure prediction methods described above. A significant limitation to the use of crystallography is the need to obtain the large single crystals required for conventional structure determination by X-ray diffraction.

5.1. X-ray crystallography, NMR and cryo-EM methods

The ultimate goal of protein science is to determine the three-dimensional structures of all proteins and to determine their functions and interactions with all other proteins and ligands. X-ray crystallography (Kendrew et al., 1960[link]) and nuclear magnetic resonance (NMR) spectroscopy (Bax & Tjandra, 1997[link]) are the main techniques used to determine three-dimensional protein structures, although applications of cryo-EM are growing rapidly. NMR spectroscopy allows proteins to be studied in solution rather than in crystals; however, it has the disadvantage of being limited to relatively small and medium-sized proteins. For larger proteins, the NMR spectrum becomes highly crowded with many overlapping peaks and is therefore very difficult to interpret.

5.2. Over 60 years of protein X-ray crystallography

Ninety per cent of the experimental structures in the PDB were determined by X-ray diffraction. X-ray crystallography is based on the diffraction of an X-ray beam by a crystal lattice, discovered in 1912 by Max von Laue. The incoming beam is scattered, and the directions and intensities of the reflections depend on the types and distribution of the atoms within the crystal. Shortly afterwards, Bragg showed that X-ray scattering can be used for structure determination. The first crystal structures, those of myoglobin and haemoglobin, were determined by John Kendrew in 1957 and Max Perutz in 1959, respectively (for which they shared the Nobel Prize in Chemistry in 1962), and since then the amount of structural data deposited in the PDB has grown at a rapid pace. While it took Max Perutz 22 years to determine the crystal structure of haemoglobin, the structure deposition rate has increased from about one structure per year in the 1960s to about one structure per hour in the last decade. Now (as of March 2022), the number of structures solved experimentally and deposited in the Protein Data Bank (Berman et al., 2000[link]) is ∼188 000, while the number of known protein sequences in UniProtKB/TrEMBL (The UniProt Consortium, 2015[link]) is ∼230 million. The number of protein sequences continues to grow at a faster rate than the number of protein structures deposited in the PDB. Note that in 2014 the numbers of solved structures and known protein sequences were ∼114 000 and ∼80 million, respectively (The UniProt Consortium, 2015[link]).

5.3. Protein crystallization – the bottleneck of protein crystallography

There are many steps on the path to crystal structure determination. Most of the stages have been improved, optimized and automated by using robots, faster data-collection devices, better fitting procedures, remote control of the crystallization process etc. However, the main bottleneck that still needs to be addressed is the way in which the conditions for crystallization are chosen in order to obtain a sufficiently large, high-quality crystal for X-ray diffraction. Unsuccessful attempts (the success rate is generally well below 10%) increase the cost of the crystallization process by up to 70%, as reported by the Joint Center for Structural Genomics (Jahandideh et al., 2014[link]). The factors governing the crystallization process have therefore attracted intensive experimental and theoretical interest, as a better understanding could provide insight into the critical variables, i.e. the crystallization space, for optimizing crystal size and morphology. The ability to predict the critical variables (the concentration and nature of the protein, salt concentration, type of salt, pH, buffer, additives, temperature, precipitant type and concentration etc.) for a given protein could significantly speed up the structure determination process. Recently, data-mining approaches have become powerful tools for characterizing the crystallization space and can help to control and improve the crystallization process.

5.4. Biological Macromolecule Crystallization Database

Publicly available databases such as the Biological Macromolecule Crystallization Database (the BMCD) (Gilliland et al., 1996, 2002; Tung & Gallagher, 2009) and the PDB provide crystallization information for macromolecular structures. The PDB is not only the largest database of experimentally determined three-dimensional structures of biological macromolecules, but also contains a wealth of information about the crystallization process. There have been attempts to broaden our understanding of the crystallization process by analysing this information (Peat et al., 2005; Kirkwood et al., 2015; Pérez-Priede & García-Granda, 2017).

The BMCD (Gilliland et al., 1996, 2002) records the protein name, protein concentration, crystallization precipitant, pH, temperature, unit cell and resolution: the parameters describing how a given protein has been successfully crystallized. In version 7.0 of the BMCD, the content has been expanded to 99 211 crystal entries and the macromolecule sequence has been added; this enables more elaborate analyses of the relations among protein properties, crystal-growth conditions, and the geometric and diffraction properties of the crystals (Tung & Gallagher, 2009). The BMCD is available as a server at http://bmcd.ibbr.umd.edu/.
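A simple illustration of the kind of mining such a database permits is tallying which precipitants occur most often among successful entries. The records below are invented for the sketch, not real BMCD entries.

```python
from collections import Counter

# Toy records with the kinds of fields the BMCD stores (invented data).
entries = [
    {"protein": "lysozyme",  "precipitant": "NaCl",     "ph": 4.6},
    {"protein": "thaumatin", "precipitant": "tartrate", "ph": 6.8},
    {"protein": "insulin",   "precipitant": "NaCl",     "ph": 6.2},
    {"protein": "ferritin",  "precipitant": "CdSO4",    "ph": 5.0},
]

# Count successful uses of each precipitant across all entries.
precipitant_counts = Counter(e["precipitant"] for e in entries)
print(precipitant_counts.most_common(1))  # -> [('NaCl', 2)]
```

Richer analyses of the same flavour, e.g. conditioning on sequence-derived properties of the protein, underlie the studies discussed in the following sections.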

5.5. Limitations of the BMCD

It should be noted that there are two main limitations of the BMCD that restrict its application for unbiased data mining. First, it contains only positive data, collected from successful experiments. Negative results are not reported in this database, and are only rarely reported in the literature in general (Newman et al., 2012). Such negative results could be a key to a deeper understanding of the crystallization process. Second, the preparation methods for different entries in the database are not always fully described and can be significantly dissimilar. Despite these limitations, however, the BMCD has been used successfully to guide crystallization (Schiefner et al., 2015; García-Fernández et al., 2012).

5.6. Data mining improves crystallization success rates

In order to improve crystallization success rates, there have been attempts to identify minimal sets of conditions that can be used to crystallize most proteins in a given data set. At the Joint Center for Structural Genomics (the JCSG), mining a data set of 539 T. maritima proteins showed that the ten most effective conditions crystallized 196 proteins, while the 108 best conditions yielded 465 successful crystallization outcomes. The authors identified the 67 conditions that were most productive in promoting protein crystallization, and referred to these as the core screen. Together with the next 29 most effective conditions these formed the expanded core screen, which has been widely used for initial crystallization trials (Page & Stevens, 2004).
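Selecting a small set of conditions that together crystallize as many proteins as possible is an instance of the set-cover problem, for which a greedy heuristic (repeatedly pick the condition covering the most not-yet-crystallized proteins) is the natural sketch. This is our own illustration of the idea, not the procedure used by the JCSG, and the toy data are invented.

```python
# Greedy set-cover heuristic for building a minimal screen (illustrative).
def greedy_screen(hits, n_conditions):
    """hits: dict mapping condition -> set of proteins it crystallizes."""
    covered, screen = set(), []
    for _ in range(n_conditions):
        # Pick the condition that crystallizes the most uncovered proteins.
        best = max(hits, key=lambda c: len(hits[c] - covered))
        if not hits[best] - covered:
            break  # no remaining condition adds new proteins
        screen.append(best)
        covered |= hits[best]
    return screen, covered

hits = {
    "cond_A": {"p1", "p2", "p3"},
    "cond_B": {"p3", "p4"},
    "cond_C": {"p5"},
    "cond_D": {"p1", "p2"},
}
screen, covered = greedy_screen(hits, 2)
print(screen, len(covered))  # -> ['cond_A', 'cond_B'] 4
```

The greedy choice also exposes redundancy: `cond_D` adds nothing once `cond_A` is chosen, mirroring the redundancy in sparse-matrix screens noted below.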

5.7. Proteins from different species exhibit different crystallization behaviour

Kimber et al. (2003) mined a data set of 775 proteins with 48 conditions, using different sample-preparation processes and crystallization conditions. They demonstrated that proteins from different species typically exhibit different crystallization behaviour. A minimal screen with six conditions produced 205 crystals, while an extended screen with 24 conditions led to 318 crystallized proteins out of the 338 trial proteins in the set. These results also show that many of the conditions used in sparse-matrix screens are redundant, and that current screens are not sufficient to crystallize all proteins, since many proteins failed to crystallize at all (Kimber et al., 2003; Page & Stevens, 2004).

Segelke (2001) and Rupp (2003) suggested random sampling of crystallization space, and DeLucas et al. (2003) applied incomplete factorial screens to streamline the crystallization procedure. For some specific target proteins, these strategies are more appropriate and efficient for identifying crystallization conditions. Oldfield (2001) applied data mining of protein fragments to build molecular-replacement models, which are used for solving the phase problem in X-ray crystallography.

5.8. Isoelectric point and grand average hydropathy correlate with crystallizability

Babnigg & Joachimiak (2010) analysed protein properties for more than 1300 proteins that are well expressed but insoluble, and for ∼720 unique proteins for which structures had been solved by X-ray diffraction. They showed that a protein's isoelectric point and grand average hydropathy (GRAVY) correlate with its propensity to crystallize. Additional physicochemical properties of amino acids from the AAindex database (Kawashima et al., 2008) were also considered and used for data mining. The set of attributes most strongly correlated with protein crystallization propensity was identified and incorporated into a support vector machine (SVM) classifier. Using the proposed SVM method, outcomes for insoluble proteins and for proteins with structures deposited in the PDB were predicted with 56% and 75% accuracy, respectively.
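The GRAVY feature named above is simply the mean Kyte-Doolittle hydropathy over all residues of the sequence. A minimal sketch of its computation follows; the hydropathy values are the standard Kyte-Doolittle scale, but the example sequence is arbitrary and the function is our illustration, not code from the cited study.

```python
# Standard Kyte-Doolittle hydropathy scale (one-letter amino-acid codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Grand average hydropathy: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

print(round(gravy("MKV"), 3))  # (1.9 - 3.9 + 4.2) / 3 -> 0.733
```

In a pipeline like that of Babnigg & Joachimiak, such scalar features (GRAVY, isoelectric point, AAindex-derived properties) would form the input vector to the SVM classifier.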

6. Conclusions

It has become evident that protein crystallization depends on many factors, which can be divided into two groups. The first group comprises the intrinsic characteristics of proteins, including their amino-acid sequence, secondary structure, flexibility, order and disorder information, hydrophobicity of side chains, charge, aromatic interactions, etc. The second group comprises the crystallization conditions, including pH, temperature, salt concentration and other environmental properties. Many of these factors (for a recent review see Wang et al., 2018) are used in bioinformatics tools to predict the crystallization propensity of proteins (Slabinski et al., 2007; Smialowski et al., 2006; Kurgan et al., 2009; Overton et al., 2011; Babnigg & Joachimiak, 2010; Kirkwood et al., 2015; Overton & Barton, 2006; Chen et al., 2007; Overton et al., 2008; Charoenkwan et al., 2013; Kandaswamy et al., 2010; Jahandideh & Mahdavi, 2012; Mizianty & Kurgan, 2012; Wang et al., 2014). Despite recent progress in identifying the factors relevant for crystallization, the successful prediction of conditions for protein crystallization remains a challenge in structural biology, as little is still known about how these factors depend upon one another for successful crystallization.

We have seen above that data-mining techniques are powerful tools with high potential for discovering new knowledge in large volumes of data. In the context of protein crystallization, however, successful prediction remains a challenge. An appropriate future goal would be to improve the methods for protein structure prediction and combine them with existing crystallization prediction methods to improve crystallization-condition mining, which should lead to a better overall understanding of the biological behaviour of proteins.

7. Abbreviations

ASA: Accessible surface area

BLAST: Basic local alignment search tool

BLOSUM: Blocks of amino acid substitution matrix

BMCD: Biological Macromolecule Crystallization Database

CB513: Cuff and Barton data set of 513 sequences

CDM: Consensus data mining

CG: Coarse grained

DSSP: Dictionary of secondary structure assignments

FDM: Fragment data mining

GRAVY: Grand average hydropathy

JCSG: Joint Center for Structural Genomics

MC: Monte Carlo

MD: Molecular dynamics

MSAs: Multiple sequence alignments

PAM: Percent accepted mutation

PDB: Protein Data Bank

PPs: Physical parameters

PSIBLAST: Position-specific iterated basic local alignment search tool

PSSM: Position-specific scoring matrix

RSA: Relative solvent accessibility

SVM: Support vector machine

Acknowledgements

We acknowledge financial support from NSF grant DBI1661391, and NIH grants R01GM127701 and R01HG012117.

References

Adamczak, R., Porollo, A. & Meller, J. (2005). Proteins, 59, 467–475.
Alexander, P. A., He, Y. A., Chen, Y. H., Orban, J. & Bryan, P. N. (2009). Proc. Natl Acad. Sci. USA, 106, 21149–21154.
Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403–410.
Anfinsen, C. B. (1973). Science, 181, 223–230.
Babnigg, G. & Joachimiak, A. (2010). J. Struct. Funct. Genomics, 11, 71–80.
Baker, D. & Sali, A. (2001). Science, 294, 93–96.
Bax, A. & Tjandra, N. (1997). J. Biomol. NMR, 10, 289–292.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Blaszczyk, M., Jamroz, M., Kmiecik, S. & Kolinski, A. (2013). Nucleic Acids Res. 41, W406–W411.
Blaszczyk, M., Kurcinski, M., Kouza, M., Wieteska, L., Debinski, A., Kolinski, A. & Kmiecik, S. (2016). Methods, 93, 72–83.
Boczko, E. M. & Brooks, C. L. (1995). Science, 269, 393–396.
Charoenkwan, P., Shoombuatong, W., Lee, H. C., Chaijaruwanich, J., Huang, H. L. & Ho, S. Y. (2013). PLoS One, 8, e72368.
Chen, K., Kurgan, L. & Rahbari, M. (2007). Biochem. Biophys. Res. Commun. 355, 764–769.
Cheng, H., Sen, T. Z., Kloczkowski, A., Margaritis, D. & Jernigan, R. L. (2005). Polymer, 46, 4314–4321.
Cuff, J. A. & Barton, G. J. (1999). Proteins, 34, 508–519.
Cuff, J. A. & Barton, G. J. (2000). Proteins, 40, 502–511.
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). Atlas Protein Seq. Struct. Suppl., pp. 345–352.
DeLucas, L. J., Bray, T. L., Nagy, L., McCombs, D., Chernov, N., Hamrick, D., Cosenza, L., Belgovskiy, A., Stoops, B. & Chait, A. (2003). J. Struct. Biol. 142, 188–206.
Dill, K. A. & MacCallum, J. L. (2012). Science, 338, 1042–1046.
Faraggi, E. & Kloczkowski, A. (2017). Methods Mol. Biol. 1484, 45–53.
Faraggi, E., Kouza, M., Zhou, Y. & Kloczkowski, A. (2017). Methods Mol. Biol. 1484, 127–136.
Faraggi, E., Zhang, T., Yang, Y. D., Kurgan, L. & Zhou, Y. Q. (2012). J. Comput. Chem. 33, 259–267.
Faraggi, E., Zhou, Y. Q. & Kloczkowski, A. (2014). Proteins, 82, 3170–3176.
García-Fernández, R., Pons, T., Meyer, A., Perbandt, M., González-González, Y., Gil, D., de los Angeles Chávez, M., Betzel, C. & Redecke, L. (2012). Acta Cryst. F68, 1289–1293.
Gilliland, G. L., Tung, M. & Ladner, J. (1996). J. Res. Natl Inst. Stand. Technol. 101, 309–320.
Gilliland, G. L., Tung, M. & Ladner, J. E. (2002). Acta Cryst. D58, 916–920.
Hansmann, U. H. E. (1997). Chem. Phys. Lett. 281, 140–150.
Henikoff, S. & Henikoff, J. G. (1992). Proc. Natl Acad. Sci. USA, 89, 10915–10919.
Hubbard, S. J. & Thornton, J. M. (1993). NACCESS. Department of Biochemistry and Molecular Biology, University College London, UK.
Jahandideh, S., Jaroszewski, L. & Godzik, A. (2014). Acta Cryst. D70, 627–635.
Jahandideh, S. & Mahdavi, A. (2012). J. Theor. Biol. 306, 115–119.
Jia, K. J. & Jernigan, R. L. (2021). Proteins, 89, 671–682.
Kabsch, W. & Sander, C. (1983). Biopolymers, 22, 2577–2637.
Kandaswamy, K. K., Pugalenthi, G., Suganthan, P. N. & Gangal, R. (2010). Protein Pept. Lett. 17, 423–430.
Kandoi, G., Leelananda, S. P., Jernigan, R. L. & Sen, T. Z. (2017). Methods Mol. Biol. 1484, 35–44.
Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T. & Kanehisa, M. (2008). Nucleic Acids Res. 36, D202–D205.
Kendrew, J. C., Dickerson, R. E., Strandberg, B. E., Hart, R. G., Davies, D. R., Phillips, D. C. & Shore, V. C. (1960). Nature, 185, 422–427.
Kimber, M. S., Vallee, F., Houston, S., Nečakov, A., Skarina, T., Evdokimova, E., Beasley, S., Christendat, D., Savchenko, A., Arrowsmith, C. H., Vedadi, M., Gerstein, M. & Edwards, A. M. (2003). Proteins, 51, 562–568.
Kirkwood, J., Hargreaves, D., O'Keefe, S. & Wilson, J. (2015). Acta Cryst. F71, 1228–1234.
Kloczkowski, A., Ting, K. L., Jernigan, R. L. & Garnier, J. (2002). Proteins, 49, 154–166.
Kmiecik, S., Gront, D., Kolinski, M., Wieteska, L., Dawid, A. E. & Kolinski, A. (2016). Chem. Rev. 116, 7898–7936.
Kolinski, A. (2004). Acta Biochim. Pol. 51, 349–371.
Kouza, M. & Hansmann, U. H. E. (2011). J. Chem. Phys. 134, 044124.
Kouza, M. & Hansmann, U. H. E. (2012). J. Phys. Chem. B, 116, 6645–6653.
Kurgan, L., Razib, A. A., Aghakhani, S., Dick, S., Mizianty, M. & Jahandideh, S. (2009). BMC Struct. Biol. 9, 50.
Liwo, A., He, Y. & Scheraga, H. A. (2011). Phys. Chem. Chem. Phys. 13, 16890–16901.
Mizianty, M. J. & Kurgan, L. A. (2012). Protein Pept. Lett. 19, 40–49.
Newman, J., Bolton, E. E., Müller-Dieckmann, J., Fazio, V. J., Gallagher, D. T., Lovell, D., Luft, J. R., Peat, T. S., Ratcliffe, D., Sayle, R. A., Snell, E. H., Taylor, K., Vallotton, P., Velanker, S. & von Delft, F. (2012). Acta Cryst. F68, 253–258.
Oldfield, T. J. (2001). Acta Cryst. D57, 1421–1427.
Onuchic, J. N. & Wolynes, P. G. (2004). Curr. Opin. Struct. Biol. 14, 70–75.
Ovchinnikov, S., Park, H., Varghese, N., Huang, P. S., Pavlopoulos, G. A., Kim, D. E., Kamisetty, H., Kyrpides, N. C. & Baker, D. (2017). Science, 355, 294–298.
Overton, I. M. & Barton, G. J. (2006). FEBS Lett. 580, 4005–4009.
Overton, I. M., van Niekerk, C. A. J. & Barton, G. J. (2011). Proteins, 79, 1027–1033.
Overton, I. M., Padovani, G., Girolami, M. A. & Barton, G. J. (2008). Bioinformatics, 24, 901–907.
Page, R. & Stevens, R. C. (2004). Methods, 34, 373–389.
Peat, T. S., Christopher, J. A. & Newman, J. (2005). Acta Cryst. D61, 1662–1669.
Pérez-Priede, M. & García-Granda, S. (2017). J. Cryst. Growth, 459, 146–152.
Rupp, B. (2003). J. Struct. Biol. 142, 162–169.
Schiefner, A., Rodewald, F., Neumaier, I. & Skerra, A. (2015). Biochem. J. 466, 95–104.
Segelke, B. W. (2001). J. Cryst. Growth, 232, 553–562.
Sen, T. Z., Jernigan, R. L., Garnier, J. & Kloczkowski, A. (2005). Bioinformatics, 21, 2787–2788.
Shaw, D. E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R. O., Eastwood, M. P., Bank, J. A., Jumper, J. M., Salmon, J. K., Shan, Y. B. & Wriggers, W. (2010). Science, 330, 341–346.
Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997). J. Mol. Biol. 268, 209–225.
Simons, K. T., Ruczinski, I., Kooperberg, C., Fox, B. A., Bystroff, C. & Baker, D. (1999). Proteins, 34, 82–95.
Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I. A., Lesley, S. A. & Godzik, A. (2007). Bioinformatics, 23, 3403–3405.
Smialowski, P., Schmidt, T., Cox, J., Kirschner, A. & Frishman, D. (2006). Proteins, 62, 343–355.
The UniProt Consortium (2015). Nucleic Acids Res. 43, D204–D212.
Tung, M. & Gallagher, D. T. (2009). Acta Cryst. D65, 18–23.
Wabik, J., Kmiecik, S., Gront, D., Kouza, M. & Koliński, A. (2013). Int. J. Mol. Sci. 14, 9893–9905.
Wang, H., Feng, L., Webb, G. I., Kurgan, L., Song, J. & Lin, D. (2018). Brief. Bioinform. 19, 838–852.
Wang, H., Wang, M., Tan, H., Li, Y., Zhang, Z. & Song, J. (2014). PLoS One, 9, e105902.
Zhang, Y. (2008). BMC Bioinformatics, 9, 40.