International Tables for Crystallography (2022). Vol. C, Chapter 1.5. https://doi.org/10.1107/S1574870721008247

Chapter 1.5. Data mining. II. Prediction of protein structure and optimization of protein crystallizability

Affiliations: (a) Faculty of Chemistry, University of Warsaw, Poland; (b) Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA; (c) Department of Physics, Indiana University–Purdue University Indianapolis, Indianapolis, Indiana, USA; (d) Research and Information Systems, LLC, Indianapolis, Indiana, USA; (e) Crop Improvement and Genetics Research Unit, US Department of Agriculture, Agricultural Research Service, Albany, California, USA; (f) Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Iowa, USA; and (g) Department of Pediatrics, The Ohio State University College of Medicine, Columbus, Ohio, USA

Recent advances in computational technology have made it possible to store and mine huge data sets, much larger than ever before. Data-mining techniques, which extract useful information from these massive data sets, have matured into efficient predictive tools across broad scientific fields, including but not limited to molecular biology, astronomy, bioinformatics, physics and medicine. In this chapter, we first discuss an original method (fragment data mining, FDM), which mines structural segments from the Protein Data Bank (PDB) and utilizes structural information from matching sequence fragments with the aim of improving the prediction of secondary structure. We also discuss further improvements obtained by combining FDM with the classical GOR V secondary structure prediction method, which is based on information theory and Bayesian statistics, coupled with evolutionary information from multiple sequence alignments. We then discuss a second, newer and more accurate approach to secondary structure prediction, the SPINE-X method, which uses machine learning to predict secondary structure by mining protein sequences and structures. Our results strongly suggest that data mining can be an efficient and accurate approach for secondary structure prediction in proteins. The last part of the chapter discusses applications of data mining to the problem of optimizing protein crystallization conditions. Data mining can be used to improve the yield and quality of protein crystals, and thus aid in solving protein structures by X-ray crystallography. There is a vast amount of data for protein structures, sequences and crystallization conditions that can be mined to aid in structure prediction and structure determination.

Keywords: data mining; protein crystallography; protein structure prediction; secondary structure prediction; protein crystallization.
Understanding the mechanism of protein folding remains one of the most challenging problems in molecular biology (Dill & MacCallum, 2012; Onuchic & Wolynes, 2004). The seminal experiment reported by Anfinsen on unfolding and refolding of ribonuclease A demonstrated that the protein sequence contains all of the information to specify how a protein can fold spontaneously into its active native state (Anfinsen, 1973). Protein structure prediction and folding have been studied by computational modelling at different levels of resolution and time scales. Through advances in computational power it has become possible to set new limits for simulating the folding of proteins (Shaw et al., 2010), but folding simulations for the largest proteins still remain beyond the reach of all-atom molecular dynamics (MD) simulations with explicit solvent. (A list of the abbreviations used in this chapter is provided in Section 7.)
Various strategies for overcoming the size and time-scale limitations have been proposed, including the development of novel coarse-grained (CG) methods and implicit solvent models. CG models reduce the complexity of each amino acid by representing it with fewer pseudo-atoms (nodes) than its actual number of atoms, which significantly decreases the number of degrees of freedom (Liwo et al., 2011; Kolinski, 2004; Kmiecik et al., 2016; Wabik et al., 2013), especially since coarse graining may typically be performed at the level of one geometric point per amino acid. Although the improved computational performance of CG models comes at the cost of some structural accuracy, they remain powerful and successful tools for de novo structure prediction (Kmiecik et al., 2016).
There are various alternative ways to improve conformational sampling efficiency: one can use enhanced sampling techniques such as replica-exchange (Hansmann, 1997; Kouza & Hansmann, 2011) and umbrella sampling (Boczko & Brooks, 1995), or reduce the conformational space further by implementing a fragment-based approach (Zhang, 2008; Simons et al., 1997; Blaszczyk et al., 2013, 2016).
The fragment-based approach to predicting protein structures was adopted early on in the Rosetta suite (Simons et al., 1997). Rosetta creates a library of possible conformations of short fragments of the target protein sequence. Those short fragments, derived from a set of known structures, are assembled by a Monte Carlo (MC) minimization together with a scoring function rewarding native-like properties. The success of Rosetta in structure and function predictions shows the importance of the information in structural fragments for theoretical predictive studies of proteins. Newer versions using sequence correlation information predict contacts within structures with great success (Ovchinnikov et al., 2017).
It is widely assumed that a protein's three-dimensional structure is encoded in its sequence of amino acids. Although there are counterexamples, as sometimes one mutation in a sequence can lead to very different protein folds and functions (Alexander et al., 2009; Kouza & Hansmann, 2012), a sufficiently high similarity between protein sequences generally implies similarity between protein structures. This observation is the basis of many structure prediction algorithms that use sequence similarity searches to identify homologous proteins. While protein crystallography may be the method of choice for structure determination, it is too expensive and too slow to deal with the exponentially growing number of protein sequences being produced by high-throughput genome sequencing. Computational methods are thus going to be needed to solve this problem (Baker & Sali, 2001). One of the key steps in the prediction of tertiary structures of proteins is the prediction of secondary structures. Typically, in ab initio methods this is the first step. Prediction of secondary structure is a one-dimensional problem, which is much easier than three-dimensional structure prediction, and has been a subject of interest for a long time. However, de novo protein structure prediction (and secondary structure prediction) still remains a challenging problem in molecular biology.
Recent advances in technology have made storing huge data sets extremely cheap. More than 95% of protein structural data has been generated and stored in the last decade, and there seems to be no end to new data generation. Data mining, the extraction of useful information from huge data sets, is a maturing field. To analyse and process these enormous amounts of biological data, new `big data' methodologies have been developed. They are based on mapping, searching for and analysing patterns in sequences and on data mining of three-dimensional structures. Mining for information in biological databases involves various forms of data analysis such as clustering, sequence homology searches, structure homology searches, examination of statistical significance and so on.
The most basic data-mining tool in biology is the basic local alignment search tool (BLAST; Altschul et al., 1990) for examining a new nucleic acid or protein sequence. BLAST compares a query sequence with all sequences in a chosen database (here, the sequences of structures in the PDB) to find those most similar to it. This is commonly how putative functions are assigned to new genes and proteins.
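As a minimal illustration, the sketch below runs such a search with the stand-alone NCBI BLAST+ programs; it assumes that blastp is installed locally and that a protein database of PDB sequences (here called pdbaa) has already been formatted, and the file names are placeholders.

```python
import subprocess

def blast_query(query_fasta, db="pdbaa", evalue=1e-5, out="hits.tsv"):
    """Run blastp and write tabular hits; database and file names are placeholders."""
    cmd = [
        "blastp",
        "-query", query_fasta,   # FASTA file containing the query sequence
        "-db", db,               # pre-formatted protein database (e.g. PDB sequences)
        "-evalue", str(evalue),  # E-value threshold for reported alignments
        "-outfmt", "6",          # tabular output, one local alignment per line
        "-out", out,
    ]
    subprocess.run(cmd, check=True)
    return out

# Each tabular line reports, among other fields, the percent identity and the
# aligned region of the hit, which is the information used to weight matching
# fragments in the FDM procedure described below.
```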
Data-mining tasks can be descriptive, by uncovering patterns and relationships in the available data, or predictive, based on models derived from these data. Most popular and frequently used are automated data-mining tools that employ sophisticated algorithms based on statistics or machine learning to discover hidden patterns in biological data. Data mining can be applied to a variety of biological problems such as analysis of protein interactions, finding homologous sequences or homologous structures, multiple sequence alignment, construction of phylogenetic trees, genomic sequence analysis, gene finding, gene mapping, gene expression data analysis, drug discovery etc. All of these different problems can be studied by using various data-mining tools and techniques.
In this chapter, first we discuss the application of data mining to the problem of protein secondary structure prediction. We describe an original method (fragment data mining, FDM), which has proven to be a good predictor of the secondary structure of proteins. The FDM method mines structural segments in the Protein Data Bank (PDB) and utilizes structural information from the matching sequence fragments, with the aim of improving predictions of secondary structure. Next, we briefly describe the consensus data mining (CDM) method, which combines FDM with the classical GOR V secondary structure prediction method, which is based on information theory and Bayesian statistics, coupled with evolutionary information from multiple sequence alignments. Finally, we discuss the application of the data-mining approach to the problem of optimizing protein crystallization conditions.
We proposed the fragment database method in 2005 (Cheng et al., 2005), motivated by the success of Baker and collaborators in tertiary structure prediction with the Rosetta algorithm (Simons et al., 1997, 1999), which is based on the use of short structural fragments. Rosetta takes the distribution of local structures adopted by short sequence segments and identifies patterns (so-called I-sites) that correspond to local structures of proteins in the PDB. It assembles the I-site library that is used for the prediction of protein three-dimensional structure. The Rosetta method considers both local and superlocal sequence-structure biases and uses the fragment-insertion Monte Carlo method to build the three-dimensional structure of a protein. Its success in structure and function predictions shows the importance of the information carried in the structural fragments.
Taking inspiration from the approach implemented in Rosetta, we developed the FDM method. Because a sufficiently high similarity between protein sequences implies similarity between protein structures and conserved local motifs may assume a similar shape, we searched for a local alignment method to obtain structure information for predicting the secondary structure of query sequences. In the FDM method, BLAST is applied to query all the structures from the PDB. Then, evolutionary information is introduced to the prediction of secondary structure by analysing the fragments of the alignments belonging to proteins in the PDB. In the prediction and evaluation part, each query sequence from the data set is subjected to the following procedure.
The procedure starts with the assignment of weights to the matching segments obtained from BLAST. In this step various types of parameters are considered, including different substitution matrices, similarity/identity cutoffs, the degree of exposure of residues to solvent, and protein classification and size. In the second step, we calculate normalized scores for each residue, and the secondary structure of that residue is then predicted according to these scores. Two different strategies were applied for this prediction step: the first is to choose the highest-scoring structure class as the prediction, and the second is to use machine-learning approaches to choose a classification based on training. Finally, we calculate Q3 (the sum of the fractions of successful predictions for α-helix, β-strand and coil) and the Matthews correlation coefficients.
Fig. 1 shows a scheme of the multistage procedure employed in the fragment database mining (FDM) method.
Fig. 1. Multistage procedure for the prediction of secondary structure using the fragment database mining (FDM) method.
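As a concrete illustration of the evaluation step, the sketch below computes Q3 and per-state Matthews correlation coefficients for a pair of three-state strings; it is an illustrative reconstruction with invented example strings, not the original evaluation code.

```python
from sklearn.metrics import matthews_corrcoef

def q3(true_ss, pred_ss):
    """Fraction of residues whose three-state (H/E/C) label is predicted correctly."""
    return sum(t == p for t, p in zip(true_ss, pred_ss)) / len(true_ss)

def per_state_mcc(true_ss, pred_ss, state):
    """Matthews correlation coefficient for one state treated as a binary problem."""
    y_true = [int(t == state) for t in true_ss]
    y_pred = [int(p == state) for p in pred_ss]
    return matthews_corrcoef(y_true, y_pred)

true_ss = "CCHHHHHHCCEEEECC"   # invented example strings
pred_ss = "CCHHHHHCCCEEEECC"
print(q3(true_ss, pred_ss))    # 0.9375: 15 of 16 residues correct
for state in "HEC":
    print(state, per_state_mcc(true_ss, pred_ss, state))
```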
For training we used the CB513 benchmarked data set of 513 non-redundant domains developed by Cuff & Barton (2000, 1999). Local sequence alignments were generated by using BLAST with different substitution matrices (Henikoff & Henikoff, 1992), including BLOSUM-45, BLOSUM-62, BLOSUM-80, PAM-30 and PAM-70.
Unlike in DSSP (Kabsch & Sander, 1983), where an eight-letter structural alphabet is used, we instead used a reduced three-letter structural alphabet of secondary structures: α-helix (H), extended (β-sheet) (E) and coil (C). H, E and C in our three-letter structural code correspond, respectively, to helices (H, G and I), to strands and bridges (E and B), and to turn, bend and coil (T, S and C) in the DSSP code.
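A minimal dictionary-based sketch of this reduction (treating blank DSSP assignments as coil) is:

```python
# Reduction of the eight-letter DSSP alphabet to the three-state code:
# helices (H, G, I) -> H; strands and bridges (E, B) -> E;
# turn, bend and coil (T, S, C or blank) -> C.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C", " ": "C", "-": "C",
}

def reduce_dssp(dssp_string):
    """Map a DSSP secondary-structure string onto the H/E/C alphabet."""
    return "".join(DSSP_TO_THREE_STATE.get(ch, "C") for ch in dssp_string)

print(reduce_dssp("HHHHGGTTEEEEBS  CC"))  # -> 'HHHHHHCCEEEEECCCCC'
```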
For weight assignments we defined identity scores and their powers (id^c, where c is a positive real number) as the weights of matching segments. Here id is the ratio of the number of exact matches of residues to the total number of residues in the matching segment. Weights were then adjusted to obtain the best match. This is illustrated in Fig. 2. At each position, the predicted secondary structure is determined by the secondary structures of the matches at that position. Each match is assigned a weight according to the similarity or identity score of the alignment from BLAST. At each position, the weights are normalized, and the normalized scores for each position within each of the secondary structure states are calculated.
We defined s(H, i) as the normalized score for position i to be in the state H,

s(H, i) = Σ w(H, i) / [Σ w(H, i) + Σ w(E, i) + Σ w(C, i)],

where the sums run over all matching segments covering position i, w(H, i) is the weight for one matching segment with a residue at the ith position in a helix, and w(E, i) and w(C, i) are similarly defined. For example, s(H, 2) = 0.2/(0.1 + 0.2 + 0.4) = 0.29 and s(E, 5) = (0.1 + 0.2)/(0.1 + 0.2 + 0.3 + 0.4) = 0.3. The secondary structure state having the highest score is chosen as the final prediction result for a given position in the sequence. For the ith position of a query sequence, we have three normalized scores, one for each secondary structure state: s(H, i), s(E, i) and s(C, i). In our prediction scheme, we always choose the highest of these three to determine the secondary structure prediction at the ith position.
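The sketch below is a toy reconstruction of this scoring step with hypothetical fragment data; the weighting by id^c follows the definition above, but the data structures and numbers are invented for illustration.

```python
from collections import defaultdict

def fdm_predict(query_length, matches, c=3.0):
    """Toy reconstruction of the FDM scoring step.

    `matches` is a list of (start, ss_string, identity) tuples: a BLAST hit
    covering query positions start, ..., start + len(ss_string) - 1, the
    three-state secondary structure of the matched PDB fragment, and its
    identity score id (between 0 and 1). Each match is weighted by id**c,
    per-position weights are normalized over H, E and C, and the state with
    the highest normalized score is chosen.
    """
    weight_sums = [defaultdict(float) for _ in range(query_length)]
    for start, ss, identity in matches:
        w = identity ** c
        for offset, state in enumerate(ss):
            weight_sums[start + offset][state] += w

    prediction = []
    for sums in weight_sums:
        total = sum(sums.values())
        if total == 0.0:                 # no fragment covers this position
            prediction.append("C")
            continue
        prediction.append(max("HEC", key=lambda s: sums[s] / total))
    return "".join(prediction)

# Hypothetical example: three overlapping fragments covering a 10-residue query.
matches = [(0, "CCHHHH", 0.80), (2, "HHHHEE", 0.70), (4, "HHEEEE", 0.95)]
print(fdm_predict(10, matches, c=3.0))   # -> 'CCHHHHEEEE'
```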
We have used two different substitution matrices: PAM and BLOSUM. The PAM (percent accepted mutation) matrix was introduced by Dayhoff et al. (1978) to quantify the amount of evolutionary change in a protein sequence, based on observation of how often different amino acids replace other amino acids during evolution. BLOSUM (blocks substitution matrix) was introduced by Henikoff & Henikoff (1992) to obtain a better measure of differences between two proteins, specifically for distantly related ones. The BLOSUM matrix is derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. We used several versions of these matrices (BLOSUM-45, BLOSUM-62, BLOSUM-80, PAM-30 and PAM-70) in BLAST. For example, PAM-30 refers to 30 mutations per 100 amino acids of sequence, and BLOSUM-62 means that it was derived from sequence blocks clustered at the 62% identity level.
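For reference, these matrices can be inspected directly, for example with Biopython (assuming a recent version, which ships the standard BLOSUM and PAM matrices under Bio.Align.substitution_matrices):

```python
from Bio.Align import substitution_matrices

# Look up substitution scores of the kind used by BLAST.
blosum62 = substitution_matrices.load("BLOSUM62")
print(blosum62["W", "W"])   # identical tryptophans: strongly positive score (11)
print(blosum62["I", "L"])   # conservative Ile/Leu substitution: positive score (2)
print(blosum62["G", "W"])   # dissimilar residues: negative score (-2)
```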
We tried different combinations of matrices and identity powers. The best results were obtained by using BLOSUM-45 and id^3 as the weight assignment method. [See Fig. 3 in Cheng et al. (2005) for prediction accuracies computed for the CB513 data set using five different substitution matrices and several powers of identity scores.]
To test the concept of fragment assembly for secondary structure prediction, we set limits on the extent of similarity or identity for fragments to be included by using different cutoffs: 99, 90, 80, 70 and 60% similarity or identity scores. Matches with similarity or identity scores higher than a cutoff were eliminated from the lists of matching segments used for calculating normalized scores. Fig. 3 shows the results obtained for BLOSUM-45 and id^3. We observe a steady decrease in Q3 as the cutoff is lowered.
Matches with the highest identity scores (greater than the id cutoff) were filtered out, but `reasonably' high-id matches were kept. We define the `reasonable' high-id matches as those that have relatively high identity scores (greater than the id cutoff) and are neither too short (longer than 5 residues) nor too long (shorter than 90 or 95% of the length of the query sequence).
The prediction accuracies Q3 were compared for three cases, always using BLOSUM-45 with an id cutoff of 0.90. In case (1), all high-id matches (matches with identity scores higher than the identity cutoff) are filtered out. This case is used as a control. In case (2), sequences with identity scores greater than the id cutoff (0.90 here), lengths longer than 5 residues and lengths less than 90% of query sequence are retained. In case (3), sequences with identity scores greater than 90%, lengths longer than 5 residues and lengths below 95% of query sequence are kept.
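A small sketch of these filtering rules, with placeholder parameter values, is given below; `matches` is assumed to be a list of (length, identity) pairs for the BLAST hits of one query.

```python
def filter_matches(matches, query_length, id_cutoff=0.90,
                   max_fraction=0.90, keep_reasonable=True):
    """Sketch of the match-filtering rules compared in Table 1.

    Case (1): keep_reasonable=False, so every match with identity above the
    cutoff is discarded. Cases (2) and (3): such high-identity matches are
    retained only if they are longer than 5 residues and shorter than
    max_fraction (0.90 or 0.95) of the query length.
    """
    kept = []
    for length, identity in matches:
        if identity <= id_cutoff:
            kept.append((length, identity))      # ordinary match, always kept
        elif keep_reasonable and 5 < length < max_fraction * query_length:
            kept.append((length, identity))      # 'reasonably' high-id match
    return kept
```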
Table 1 shows the accuracies for all three cases. We observe again that id^3 gives the best accuracy.
It is interesting to include approximate tertiary information in the computations, as recent studies (Adamczak et al., 2005; Faraggi et al., 2012) have demonstrated that secondary structure prediction can be improved by taking into account solvent-exposed surface area. To do so, we use the software NACCESS (Hubbard & Thornton, 1993), which allows calculation of the degree of exposure to solvent for all available PDB sequences. The idea was to differentiate buried and exposed residues by assigning them different weights according to the following rule. If the computed relative accessibility of a residue is less than 5.0%, it is regarded as buried, while the residue is considered exposed if its computed relative accessibility is greater than 40.0%. The others are regarded as intermediate cases. Buried residues were weighted most heavily. We made the following linear changes to the weights of residues: if a residue is buried, its weight is doubled; if it is intermediate, the original weight is multiplied by 1.5; if exposed, the weight remains unchanged. However, our results obtained for weights id^3, id cutoff = 0.99 and the BLOSUM-45 substitution matrix show that differentiating buried and exposed residues does not have a significant effect. Alternatively, computations could use a fast ASA (accessible surface area) predictor (Faraggi et al., 2014, 2017) to estimate whether including solvent-exposed surface area could increase the accuracy of secondary structure prediction, but the present results do not make this look very promising.
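The reweighting rule itself reduces to a few lines; the sketch below assumes the relative accessibilities have already been computed (for example with NACCESS) and is only an illustration of the rule stated above.

```python
def adjust_weight_for_burial(weight, relative_accessibility):
    """Burial-dependent reweighting rule described in the text.

    Relative accessibilities (in per cent) are assumed to have been computed
    beforehand, e.g. with NACCESS. Residues below 5% accessibility are treated
    as buried (weight doubled), above 40% as exposed (weight unchanged), and
    everything in between as intermediate (weight multiplied by 1.5).
    """
    if relative_accessibility < 5.0:
        return 2.0 * weight        # buried residue: weighted most heavily
    if relative_accessibility > 40.0:
        return weight              # exposed residue: weight unchanged
    return 1.5 * weight            # intermediate case
```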
In order to evaluate whether the protein size leads to any changes in the accuracy of prediction, we divided all proteins from the CB513 data set into four groups according to the sequence length: very small (n ≤ 100 residues), small (100 < n ≤ 200 residues), large (200 < n ≤ 300 residues) and very large (n > 300 residues). The very small, small, large and very large groups contained 154, 216, 84 and 58 sequences, respectively. As in the previous case, we used the BLOSUM-45 substitution matrix with weight function id^3. No optimization was applied. The prediction accuracies for proteins of different size are presented in Table 2. With increasing sizes, we noted a small increase in accuracy from 0.911 for the smallest group to 0.948 for the large category.
In an attempt to improve the accuracy of secondary structure predictions with the fragment database mining method, we developed the consensus data mining (CDM) method. This method combines our two previous successful secondary structure prediction methods: the fragment database mining (FDM) method (Cheng et al., 2005) and the GOR V algorithm (Kloczkowski et al., 2002; Sen et al., 2005). The basic assumption with this approach is that the combination of two complementary methods can enhance the performance of the overall secondary structure prediction. An advantage of FDM over the GOR V method is its ability to predict secondary structure accurately when sequentially similar fragments in the PDB are available. However, GOR V predictions are more accurate than FDM when good fragments from the PDB are not available. We combined the FDM and GOR V methods by introducing a novel CDM method that optimally utilizes the distinct advantages of both methods. The CDM algorithm uses a single parameter – the sequence identity threshold – to decide whether to use the FDM or the GOR V prediction at a given site. The consensus in the CDM method is reached as follows: FDM predictions are used if the sequence identity score for residues is greater than the sequence identity cutoff value, otherwise the GOR V predictions are used.
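Since the consensus rule involves only this single threshold, it can be written compactly; the sketch below uses placeholder inputs and an arbitrary cutoff value rather than the published optimum.

```python
def cdm_predict(fdm_states, fdm_identities, gor_states, id_cutoff=0.5):
    """Single-parameter CDM consensus rule (sketch with a placeholder cutoff).

    For each residue the FDM prediction is taken when its per-residue sequence
    identity score exceeds the cutoff; otherwise the GOR V prediction is used.
    """
    return "".join(
        fdm_s if identity > id_cutoff else gor_s
        for fdm_s, identity, gor_s in zip(fdm_states, fdm_identities, gor_states)
    )

# Hypothetical per-residue inputs for a six-residue stretch:
print(cdm_predict("HHHHCC", [0.9, 0.9, 0.2, 0.2, 0.9, 0.9], "CCEEEE"))  # -> 'HHEECC'
```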
The success of FDM largely depends on the availability of fragments similar to the target sequence. In practice, however, the availability of similar sequences can vary significantly. In order to analyse the relationship between the performance of CDM and the sequence similarity of fragments, we methodically excluded fragment alignments with sequence identities above a certain limit and called this limit the upper sequence identity limit. The upper sequence identity limit is not an additional parameter in the CDM method; these results demonstrate what expected results would be in the absence of fragments with similarities above the sequence identity limit.
The performance of all secondary structure prediction methods can be improved with multiple sequence alignments: the GOR V method tested with the full jack-knife methodology (a popular leave-one-out resampling technique) yields an accuracy of 73.5% when multiple sequence alignments (MSAs) are included; otherwise the accuracy is about 10% less.
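For readers unfamiliar with the protocol, the sketch below illustrates the full jack-knife (leave-one-out) idea on synthetic data with a generic classifier; it is not the GOR V method itself, only a schematic of the resampling scheme.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# Schematic full jack-knife (leave-one-out) evaluation: each sample is
# predicted by a model trained on all the others and the scores are averaged.
# The synthetic data and the nearest-neighbour classifier are placeholders;
# they merely stand in for the GOR V method and its protein data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))        # 50 'proteins' with 8 toy features each
y = rng.integers(0, 3, size=50)     # three-state labels

scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(sum(scores) / len(scores))    # jack-knifed accuracy estimate
```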
One of the significant advantages of FDM is its applicability to various evolutionary problems, because the algorithm does not rely exclusively on the sequences with the highest sequence similarity, but assigns weights to BLAST-aligned sequences that apparently capture divergent evolutionary relationships. As a result, CDM, which incorporates FDM, can be successfully used even when there is a range of sequence similarities for the BLAST-identified sequences (Kandoi et al., 2017).
Recent efforts to improve protein sequence matching are yielding significant gains (Jia & Jernigan, 2021) by incorporating structural information into the sequence matching. This is done by utilizing correlated sequence pairs that are in contact in protein structures to increase the number of allowed substitutions. By using this approach, the sequence fragment library is significantly expanded and therefore the results for the methods discussed above should be significantly improved.
Note that the results shown here are not conventional machine-learning results in the sense of using a training set and a test set, but are just from trials to investigate how successful the variations on the approach can be, all on the same set of known experimental structures. The next secondary structure prediction method discussed below is a machine-learning method, having training and test sets of data.
SPINE-X, a method originally developed by Faraggi et al. (2012), combines prediction of secondary structure, residue solvent accessibility and torsion angles by using a six-step iterative procedure.
The first five steps lead up to the prediction of the torsion angles (both φ and ψ). SPINE-X first generates a position-specific scoring matrix (PSSM) using PSIBLAST and seven representative physical parameters (PPs): a steric parameter (graph shape index), hydrophobicity, volume, polarizability, isoelectric point, helix probability and sheet probability. In the first step, a neural network predicts a secondary structure SS0 employing PSSM and PP as input. In the second step, another neural network is built to predict residue solvent accessibility (RSA) with PSSM, PP and the predicted SS0 as input. Then, in the third step, the predicted RSA and SS0 together with PSSM and PP are used to predict the torsion angles. In the fourth step, a new round of secondary structure prediction (SS1) is performed based on the previous predictions. Then, in the fifth step, new torsion-angle predictions are made based on the predictions from the previous iterations. In the sixth (final) step, a neural network is trained to predict secondary structure using PSSM, PP and the predicted values from the first five steps.
In each step, the general architecture of the neural networks is the same, consisting of two hidden layers with 101 hidden nodes. Training and initial testing for all neural networks were performed on the SPINE data set of 2640 proteins from the PDB and on its subset of 2479 proteins with lengths less than 500 residues.
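The staged information flow can be mimicked schematically as below; the features, labels and network sizes are synthetic placeholders (scikit-learn multilayer perceptrons stand in for the published networks), so this illustrates only how each stage's output feeds the next, not the actual SPINE-X model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Schematic sketch of a staged, SPINE-X-style architecture: each stage is a
# neural network whose input is the base profile features (standing in for the
# PSSM and physical parameters) augmented with the outputs of earlier stages.
rng = np.random.default_rng(1)
n, d = 400, 27
X0 = rng.normal(size=(n, d))          # stand-in for windowed PSSM + PP features
ss = rng.integers(0, 3, size=n)       # H/E/C labels (synthetic)
rsa = rng.random(size=n)              # relative solvent accessibility (synthetic)
torsions = rng.normal(size=(n, 2))    # phi/psi targets (synthetic)

# Step 1: secondary structure SS0 from the base features.
ss0_net = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200).fit(X0, ss)
X1 = np.hstack([X0, ss0_net.predict_proba(X0)])

# Step 2: residue solvent accessibility from base features + SS0.
rsa_net = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=200).fit(X1, rsa)
X2 = np.hstack([X1, rsa_net.predict(X1).reshape(-1, 1)])

# Step 3: torsion angles from base features + SS0 + RSA.
tor_net = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=200).fit(X2, torsions)
X3 = np.hstack([X2, tor_net.predict(X2)])

# Final step: refined secondary structure using all earlier predictions
# (steps 4 and 5 of the real method repeat this pattern before the last stage).
final_net = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200).fit(X3, ss)
print(final_net.score(X3, ss))        # training-set accuracy of the toy pipeline
```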
For secondary structure the accuracy is between 81 and 84%, depending on the data set and choice of tests. The Pearson correlation coefficient for accessible surface area predictions is 0.75, and the mean absolute errors for the φ and ψ dihedral angles are 20° and 33°, respectively. All details are available in Faraggi et al. (2012) and Faraggi & Kloczkowski (2017).
Protein crystallization is another area where data mining can be useful, although at present it does not relate in any direct way to the secondary structure prediction methods described above. Obtaining the large single crystals required for conventional structure determination by X-ray diffraction remains a significant limitation on the use of crystallography.
The ultimate goal of protein science is to determine the three-dimensional structures of all proteins and to determine their functions and interactions with all other proteins and ligands. X-ray crystallography (Kendrew et al., 1960) and nuclear magnetic resonance (NMR) spectroscopy (Bax & Tjandra, 1997) are the main techniques used to determine three-dimensional protein structures, although applications of cryo-EM are growing rapidly. NMR spectroscopy allows proteins to be studied in solution rather than in crystals; however, it is limited to relatively small and medium-sized proteins. For larger proteins, the NMR spectrum becomes highly crowded with many overlapping peaks and is therefore very difficult to interpret.
Ninety per cent of the experimental structures in the PDB were determined by X-ray diffraction. X-ray crystallography is based on diffraction of an X-ray beam by a crystal lattice, as discovered in 1912 by Max von Laue. The incoming beam is scattered, and the directions and intensities of the resulting reflections depend on the types and distribution of the atoms within the crystal. In 1912 Bragg showed that X-ray scattering can be used for structure determination. The first crystal structures of myoglobin and haemoglobin were determined by John Kendrew in 1957 and Max Perutz in 1959, respectively (for which they shared the Nobel Prize in Chemistry in 1962), and since then the amount of structural data deposited in the PDB has been growing at a rapid pace. While it took Max Perutz 22 years to determine the crystal structure of haemoglobin, the structure deposition rate has increased from about one structure per year in the 1960s to about one structure per hour in the last decade. Now (as of March 2022), the number of structures solved experimentally and deposited in the Protein Data Bank (Berman et al., 2000) is ∼188 000, while the number of known protein sequences in UniProtKB/TrEMBL (The UniProt Consortium, 2015) is 230 million. The number of protein sequences continues to grow at a faster rate than the number of protein structures deposited in the PDB. Note that in 2014 the numbers of solved structures and of known protein sequences were ∼114 000 and ∼80 million, respectively (The UniProt Consortium, 2015).
There are many steps on the path to crystal structure determination. Most of the stages have been improved, optimized and automated by using robots, faster data-collection devices, better fitting procedures, remote control of the crystallization process etc. However, the main bottleneck in the crystallization process that still needs to be improved is the way in which the conditions for crystallization are chosen in order to obtain a sufficiently large high-quality crystal for X-ray diffraction. Unsuccessful attempts (the success rate is generally well below 10%) increase the cost of the crystallization process by up to 70%, as reported by the Joint Center for Structural Genomics (Jahandideh et al., 2014). A better understanding of all the factors governing the crystallization process has attracted intensive experimental and theoretical interest, as this could provide insight into the critical relevant variables, i.e., the crystallization space, for optimizing crystal size and morphology. The ability to predict the critical variables (the concentration and the nature of the protein, salt concentration, type of salt, pH, buffer, additives, temperature, precipitant type and concentration, etc.) for a given protein could significantly speed up the structure determination process. Recently, data-mining approaches have become powerful tools for characterizing the crystallization space that can help to control and improve the crystallization process.
Publicly available databases such as the Biological Macromolecule Crystallization Database (the BMCD) (Gilliland et al., 1996, 2002; Tung & Gallagher, 2009) and the PDB provide crystallization information for macromolecular structures. The PDB is not only the largest database of experimentally determined three-dimensional structures of biological macromolecules, but also contains a wealth of information about the crystallization process. There have been attempts to broaden our understanding of the crystallization process by analysing this information (Peat et al., 2005; Kirkwood et al., 2015; Pérez-Priede & García-Granda, 2017).
The BMCD (Gilliland et al., 1996, 2002) includes the protein name, protein concentration, crystallization precipitant, pH, temperature, unit cell and resolution, which are the parameters describing how a given protein has been successfully crystallized. In version 7.0 of the BMCD, the content has been expanded to include 99 211 crystal entries, and the macromolecule sequence has been added as a feature, which enables more elaborate analysis of the relations among protein properties, crystal-growth conditions, and the geometric and diffraction properties of the crystals (Tung & Gallagher, 2009). The BMCD is available as a server at http://bmcd.ibbr.umd.edu/.
It should be noted that there are two main limitations of the BMCD database that restrict its application for unbiased data mining. First, it includes only positive data, i.e. data from successful experiments. Negative results are not reported in this database and are only rarely reported in the literature in general (Newman et al., 2012). Such negative results could be a key to a deeper understanding of the crystallization process. A second limitation is that the preparation methods for different entries in the database are not always fully described and can be significantly dissimilar. Despite these limitations, however, the BMCD database has been used successfully to guide crystallization experiments (Schiefner et al., 2015; García-Fernández et al., 2012).
In order to improve crystallization success rates, there have been attempts to obtain minimal sets of conditions that can be used to crystallize most proteins in a given data set. At the Joint Center for Structural Genomics (the JCSG), mining a data set of 539 T. maritima proteins identified the ten most effective conditions, which were sufficient to crystallize 196 proteins, while the 108 best conditions led to 465 successful crystallization outcomes. The authors identified 67 conditions that were the most productive in promoting protein crystallization and referred to these as the core screen. Together with the next 29 most effective conditions these formed the expanded core screen, which has been widely used for initial crystallization trials (Page & Stevens, 2004).
Kimber et al. (2003) mined a data set of 775 proteins with 48 conditions using different sample preparation processes and crystallization conditions. It was demonstrated that proteins from different species typically exhibit different crystallization behaviour. A minimal screen with six conditions produced 205 crystals, while an extended screen with 24 conditions led to 318 crystallized proteins out of the 338 trial proteins in the set. These results also show that many of the conditions used in sparse matrix screens are redundant, and that current screens are not sufficient to crystallize all proteins, since many proteins failed to crystallize at all (Kimber et al., 2003; Page & Stevens, 2004).
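The task of extracting a small screen that covers as many proteins as possible can be framed as a set-cover problem; the sketch below applies the standard greedy heuristic to invented, hypothetical outcome data purely to illustrate the idea.

```python
def greedy_minimal_screen(successes, n_conditions):
    """Greedy selection of crystallization conditions (set-cover heuristic).

    `successes[c]` is the set of protein identifiers known to crystallize under
    condition c, as mined from past trials. Conditions are added one at a time,
    each chosen to cover the largest number of proteins not yet covered.
    """
    covered, screen = set(), []
    for _ in range(n_conditions):
        best = max(successes, key=lambda c: len(successes[c] - covered))
        gained = successes[best] - covered
        if not gained:
            break                       # remaining conditions add nothing new
        screen.append(best)
        covered |= gained
    return screen, covered

# Hypothetical mined outcomes: condition -> proteins it crystallized.
successes = {
    "PEG 3350, pH 7.5": {"p1", "p2", "p3", "p7"},
    "ammonium sulfate, pH 6.0": {"p2", "p4"},
    "MPD, pH 8.0": {"p5", "p6"},
    "sodium malonate, pH 7.0": {"p3", "p7"},
}
screen, covered = greedy_minimal_screen(successes, n_conditions=2)
print(screen, len(covered))   # two conditions covering six of the seven proteins
```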
Segelke (2001) and Rupp (2003) suggested a random sampling of crystallization space and DeLucas et al. (2003) applied incomplete factorial screens to streamline the crystallization procedure. For some specific target proteins, those strategies are more appropriate and efficient to identify crystallization conditions. Oldfield (2001) applied data mining of protein fragments for molecular replacement models that are used for solving the phase problem in X-ray crystallography.
Babnigg & Joachimiak (2010) analysed protein properties for more than 1300 proteins that are well expressed but insoluble, and for ∼720 unique proteins for which structures had been solved by X-ray diffraction. They showed that a protein's isoelectric point and grand average hydropathy (GRAVY) correlate with its propensity to crystallize. Additional physicochemical properties of amino acids from the AAindex database (Kawashima et al., 2008) were considered and used for data-mining purposes. The set of attributes most strongly correlated with protein crystallization propensity was identified and successfully incorporated into a support vector machine (SVM) classifier. Using the proposed SVM method, predictions for the insoluble proteins and for the proteins with solved structures deposited in the PDB achieved 56% and 75% accuracy, respectively.
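In the same spirit, the sketch below builds a toy SVM classifier from two of the sequence-derived properties mentioned above (isoelectric point and GRAVY, computed here with Biopython); the training sequences and labels are invented placeholders, so it only illustrates the workflow, not the published model.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def sequence_features(seq):
    """Two of the sequence-derived properties discussed above, via Biopython:
    the isoelectric point and the grand average hydropathy (GRAVY)."""
    analysis = ProteinAnalysis(seq)
    return [analysis.isoelectric_point(), analysis.gravy()]

# Invented placeholder sequences and labels (1 = solved structure, 0 = insoluble);
# a real classifier would be trained on the mined data sets described in the
# text and would use many more attributes.
train_seqs = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MLLLLLLPPPLLLLWWFFIIVVLLAAGG",
    "MDEEKRRKEEDDRKNNDDEEQQKKRR",
    "MAVGIGALFLGFLGAAGSTMGAASMTLTVQAR",
]
train_labels = [1, 0, 0, 1]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit([sequence_features(s) for s in train_seqs], train_labels)
print(clf.predict([sequence_features("MSTNPKPQRKTKRNTNRRPQDVKFPGG")]))
```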
It has become evident that protein crystallization depends on many factors, which can be divided into two groups. The first group comprises the intrinsic characteristics of proteins, including their amino-acid sequence, secondary structure, flexibility, order and disorder information, hydrophobicity of side chains, charge, aromatic interactions, etc. The second group comprises the crystallization conditions, including pH, temperature, salt concentration and other environmental properties. Many of these factors (for a recent review see Wang et al., 2018) are used in bioinformatics tools to predict the crystallization propensity of proteins (Slabinski et al., 2007; Smialowski et al., 2006; Kurgan et al., 2009; Overton et al., 2011; Babnigg & Joachimiak, 2010; Kirkwood et al., 2015; Overton & Barton, 2006; Chen et al., 2007; Overton et al., 2008; Charoenkwan et al., 2013; Kandaswamy et al., 2010; Jahandideh & Mahdavi, 2012; Mizianty & Kurgan, 2012; Wang et al., 2014). Despite recent progress in identifying the factors relevant for crystallization, the successful prediction of conditions for protein crystallization remains a challenge in structural biology, as little is still known about how these factors depend upon one another for successful crystallization.
We have seen above that data-mining techniques are powerful tools with high potential for discovering new knowledge in large volumes of data. However, the successful prediction of protein crystallization conditions remains a challenge. An appropriate future goal would be to improve the methods for protein structure prediction and to combine them with existing crystallization prediction methods to improve crystallization-condition mining, which should lead to a better overall understanding of the biological behaviour of proteins.
ASA: Accessible surface area
BLAST: Basic local alignment search tool
BLOSUM: Blocks of amino acid substitution matrix
BMCD: Biological Macromolecule Crystallization Database
CB513: Cuff and Barton data set of 513 sequences
CDM: Consensus data mining
CG: Coarse grained
DSSP: Dictionary of secondary structure assignments
FDM: Fragment data mining
GRAVY: Grand average hydropathy
JCSG: Joint Center for Structural Genomics
MC: Monte Carlo
MD: Molecular dynamics
MSAs: Multiple sequence alignments
PAM: Percent accepted mutation
PDB: Protein Data Bank
PPs: Physical parameters
PSIBLAST: Position-specific iterated basic local alignment search tool
PSSM: Position-specific scoring matrix
RSA: Residue solvent accessibility
SVM: Support vector machine
Acknowledgements
We acknowledge financial support from NSF grant DBI1661391, and NIH grants R01GM127701 and R01HG012117.
References
Adamczak, R., Porollo, A. & Meller, J. (2005). Proteins, 59, 467–475.
Alexander, P. A., He, Y. A., Chen, Y. H., Orban, J. & Bryan, P. N. (2009). Proc. Natl Acad. Sci. USA, 106, 21149–21154.
Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403–410.
Anfinsen, C. B. (1973). Science, 181, 223–230.
Babnigg, G. & Joachimiak, A. (2010). J. Struct. Funct. Genomics, 11, 71–80.
Baker, D. & Sali, A. (2001). Science, 294, 93–96.
Bax, A. & Tjandra, N. (1997). J. Biomol. NMR, 10, 289–292.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Blaszczyk, M., Jamroz, M., Kmiecik, S. & Kolinski, A. (2013). Nucleic Acids Res. 41, W406–W411.
Blaszczyk, M., Kurcinski, M., Kouza, M., Wieteska, L., Debinski, A., Kolinski, A. & Kmiecik, S. (2016). Methods, 93, 72–83.
Boczko, E. M. & Brooks, C. L. (1995). Science, 269, 393–396.
Charoenkwan, P., Shoombuatong, W., Lee, H. C., Chaijaruwanich, J., Huang, H. L. & Ho, S. Y. (2013). PLoS One, 8, e72368.
Chen, K., Kurgan, L. & Rahbari, M. (2007). Biochem. Biophys. Res. Commun. 355, 764–769.
Cheng, H., Sen, T. Z., Kloczkowski, A., Margaritis, D. & Jernigan, R. L. (2005). Polymer, 46, 4314–4321.
Cuff, J. A. & Barton, G. J. (1999). Proteins, 34, 508–519.
Cuff, J. A. & Barton, G. J. (2000). Proteins, 40, 502–511.
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). Atlas Protein Seq. Struct. Suppl., pp. 345–352.
DeLucas, L. J., Bray, T. L., Nagy, L., McCombs, D., Chernov, N., Hamrick, D., Cosenza, L., Belgovskiy, A., Stoops, B. & Chait, A. (2003). J. Struct. Biol. 142, 188–206.
Dill, K. A. & MacCallum, J. L. (2012). Science, 338, 1042–1046.
Faraggi, E. & Kloczkowski, A. (2017). Methods Mol. Biol. 1484, 45–53.
Faraggi, E., Kouza, M., Zhou, Y. & Kloczkowski, A. (2017). Methods Mol. Biol. 1484, 127–136.
Faraggi, E., Zhang, T., Yang, Y. D., Kurgan, L. & Zhou, Y. Q. (2012). J. Comput. Chem. 33, 259–267.
Faraggi, E., Zhou, Y. Q. & Kloczkowski, A. (2014). Proteins, 82, 3170–3176.
García-Fernández, R., Pons, T., Meyer, A., Perbandt, M., González-González, Y., Gil, D., de los Angeles Chávez, M., Betzel, C. & Redecke, L. (2012). Acta Cryst. F68, 1289–1293.
Gilliland, G. L., Tung, M. & Ladner, J. (1996). J. Res. Natl Inst. Stand. Technol. 101, 309–320.
Gilliland, G. L., Tung, M. & Ladner, J. E. (2002). Acta Cryst. D58, 916–920.
Hansmann, U. H. E. (1997). Chem. Phys. Lett. 281, 140–150.
Henikoff, S. & Henikoff, J. G. (1992). Proc. Natl Acad. Sci. USA, 89, 10915–10919.
Hubbard, S. J. & Thornton, J. M. (1993). NACCESS. Department of Biochemistry and Molecular Biology, University College London, UK.
Jahandideh, S., Jaroszewski, L. & Godzik, A. (2014). Acta Cryst. D70, 627–635.
Jahandideh, S. & Mahdavi, A. (2012). J. Theor. Biol. 306, 115–119.
Jia, K. J. & Jernigan, R. L. (2021). Proteins, 89, 671–682.
Kabsch, W. & Sander, C. (1983). Biopolymers, 22, 2577–2637.
Kandaswamy, K. K., Pugalenthi, G., Suganthan, P. N. & Gangal, R. (2010). Protein Pept. Lett. 17, 423–430.
Kandoi, G., Leelananda, S. P., Jernigan, R. L. & Sen, T. Z. (2017). Methods Mol. Biol. 1484, 35–44.
Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T. & Kanehisa, M. (2008). Nucleic Acids Res. 36, D202–D205.
Kendrew, J. C., Dickerson, R. E., Strandberg, B. E., Hart, R. G., Davies, D. R., Phillips, D. C. & Shore, V. C. (1960). Nature, 185, 422–427.
Kimber, M. S., Vallee, F., Houston, S., Nečakov, A., Skarina, T., Evdokimova, E., Beasley, S., Christendat, D., Savchenko, A., Arrowsmith, C. H., Vedadi, M., Gerstein, M. & Edwards, A. M. (2003). Proteins, 51, 562–568.
Kirkwood, J., Hargreaves, D., O'Keefe, S. & Wilson, J. (2015). Acta Cryst. F71, 1228–1234.
Kloczkowski, A., Ting, K. L., Jernigan, R. L. & Garnier, J. (2002). Proteins, 49, 154–166.
Kmiecik, S., Gront, D., Kolinski, M., Wieteska, L., Dawid, A. E. & Kolinski, A. (2016). Chem. Rev. 116, 7898–7936.
Kolinski, A. (2004). Acta Biochim. Pol. 51, 349–371.
Kouza, M. & Hansmann, U. H. E. (2011). J. Chem. Phys. 134, 044124.
Kouza, M. & Hansmann, U. H. E. (2012). J. Phys. Chem. B, 116, 6645–6653.
Kurgan, L., Razib, A. A., Aghakhani, S., Dick, S., Mizianty, M. & Jahandideh, S. (2009). BMC Struct. Biol. 9, 50.
Liwo, A., He, Y. & Scheraga, H. A. (2011). Phys. Chem. Chem. Phys. 13, 16890–16901.
Mizianty, M. J. & Kurgan, L. A. (2012). Protein Pept. Lett. 19, 40–49.
Newman, J., Bolton, E. E., Müller-Dieckmann, J., Fazio, V. J., Gallagher, D. T., Lovell, D., Luft, J. R., Peat, T. S., Ratcliffe, D., Sayle, R. A., Snell, E. H., Taylor, K., Vallotton, P., Velanker, S. & von Delft, F. (2012). Acta Cryst. F68, 253–258.
Oldfield, T. J. (2001). Acta Cryst. D57, 1421–1427.
Onuchic, J. N. & Wolynes, P. G. (2004). Curr. Opin. Struct. Biol. 14, 70–75.
Ovchinnikov, S., Park, H., Varghese, N., Huang, P. S., Pavlopoulos, G. A., Kim, D. E., Kamisetty, H., Kyrpides, N. C. & Baker, D. (2017). Science, 355, 294–298.
Overton, I. M. & Barton, G. J. (2006). FEBS Lett. 580, 4005–4009.
Overton, I. M., van Niekerk, C. A. J. & Barton, G. J. (2011). Proteins, 79, 1027–1033.
Overton, I. M., Padovani, G., Girolami, M. A. & Barton, G. J. (2008). Bioinformatics, 24, 901–907.
Page, R. & Stevens, R. C. (2004). Methods, 34, 373–389.
Peat, T. S., Christopher, J. A. & Newman, J. (2005). Acta Cryst. D61, 1662–1669.
Pérez-Priede, M. & García-Granda, S. (2017). J. Cryst. Growth, 459, 146–152.
Rupp, B. (2003). J. Struct. Biol. 142, 162–169.
Schiefner, A., Rodewald, F., Neumaier, I. & Skerra, A. (2015). Biochem. J. 466, 95–104.
Segelke, B. W. (2001). J. Cryst. Growth, 232, 553–562.
Sen, T. Z., Jernigan, R. L., Garnier, J. & Kloczkowski, A. (2005). Bioinformatics, 21, 2787–2788.
Shaw, D. E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R. O., Eastwood, M. P., Bank, J. A., Jumper, J. M., Salmon, J. K., Shan, Y. B. & Wriggers, W. (2010). Science, 330, 341–346.
Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. (1997). J. Mol. Biol. 268, 209–225.
Simons, K. T., Ruczinski, I., Kooperberg, C., Fox, B. A., Bystroff, C. & Baker, D. (1999). Proteins, 34, 82–95.
Slabinski, L., Jaroszewski, L., Rychlewski, L., Wilson, I. A., Lesley, S. A. & Godzik, A. (2007). Bioinformatics, 23, 3403–3405.
Smialowski, P., Schmidt, T., Cox, J., Kirschner, A. & Frishman, D. (2006). Proteins, 62, 343–355.
The UniProt Consortium (2015). Nucleic Acids Res. 43, D204–D212.
Tung, M. & Gallagher, D. T. (2009). Acta Cryst. D65, 18–23.
Wabik, J., Kmiecik, S., Gront, D., Kouza, M. & Koliński, A. (2013). Int. J. Mol. Sci. 14, 9893–9905.
Wang, H., Feng, L., Webb, G. I., Kurgan, L., Song, J. & Lin, D. (2018). Brief. Bioinform. 19, 838–852.
Wang, H., Wang, M., Tan, H., Li, Y., Zhang, Z. & Song, J. (2014). PLoS One, 9, e105902.
Zhang, Y. (2008). BMC Bioinformatics, 9, 40.