| Ab Initio Modeling of Glycosyl Torsions and Anomeric Effects in a Model Carbohydrate: 2-Ethoxy Tetrahydropyran Biophysical Journal, Volume 93, Issue 1, 1 July 2007, Pages 1-10 H. Lee Woodcock, Damian Moran, Richard W. Pastor, Alexander D. MacKerell and Bernard R. Brooks Abstract A range of ab initio calculations were carried out on the axial and equatorial anomers of the model carbohydrate 2-ethoxy tetrahydropyran to evaluate the level of theory required to accurately evaluate the glycosyl dihedral angle and the anomeric ratio. Vacuum CCSD(T)/CBS extrapolations at the global minimum yield Δ=−=1.42kcal/mol. When corrected for solvent (by the IEFPCM model), zero-point vibrations and entropy, Δ=0.49kcal/mol, in excellent agreement with the experimental value of 0.47±0.3kcal/mol. A new additivity scheme, the layered composite method (LCM), yields Δ to within 0.1kcal/mol of the CCSD(T)/CBS result at a fraction of the computer requirements. Anomeric ratios and one-dimensional torsional surfaces generated by LCM and the even more efficient MP2/cc-pVTZ level of theory are in excellent agreement, indicating that the latter is suitable for force-field parameterization of carbohydrates. Hartree-Fock and density functional theory differ from CCSD(T)/CBS for Δ by ∼1kcal/mol; they show similar deviations in torsional surfaces evaluated from LCM. A comparison of vacuum and solvent-corrected one- and two-dimensional torsional surfaces indicates the equatorial form of 2-ethoxy tetrahydropyran is more sensitive to solvent than the axial. Abstract | Full Text | PDF (878 kb) |
| The Stretched Intermediate Model of B-Z DNA Transition Biophysical Journal, Volume 88, Issue 3, 1 March 2005, Pages 1593-1607 Wilber Lim and Yuan Ping Feng Abstract There have been numerous attempts to describe the mechanism of B-Z transition. Our simulations based on the stochastic difference equation with length algorithm show that a short DNA oligomer will tend to unwind and overstretch during the transition. The overstretching of DNA is then understood from the Zhou, Zhang, and Ou-Yang model. Unlike the Harvey model, the stretched intermediate model does not pose any steric dilemma; we are able to show that the chain sense reversal progresses spontaneously using the stretched intermediate model. A nonlinear DNA model is used to describe the origins and mechanism of base rotation in the stretched intermediate state of DNA. We also propose an experiment that can verify the existence of a stretched intermediate state during B-Z transition, thus opening up fresh grounds for experimentation in this field. Abstract | Full Text | PDF (478 kb) |
| Generalized Pattern Search Algorithm for Peptide Structure Prediction Biophysical Journal, Volume 95, Issue 10, 15 November 2008, Pages 4988-4999 Giuseppe Nicosia and Giovanni Stracquadanio Abstract Finding the near-native structure of a protein is one of the most important open problems in structural biology and biological physics. The problem becomes dramatically more difficult when a given protein has no regular secondary structure or it does not show a fold similar to structures already known. This situation occurs frequently when we need to predict the tertiary structure of small molecules, called peptides. In this research work, we propose a new ab initio algorithm, the generalized pattern search algorithm, based on the well-known class of Search-and-Poll algorithms. We performed an extensive set of simulations over a well-known set of 44 peptides to investigate the robustness and reliability of the proposed algorithm, and we compared the peptide conformation with a state-of-the-art algorithm for peptide structure prediction known as PEPstr. In particular, we tested the algorithm on the instances proposed by the originators of PEPstr, to validate the proposed algorithm; the experimental results confirm that the generalized pattern search algorithm outperforms PEPstr by 21.17% in terms of average root mean-square deviation, RMSD C. Abstract | Full Text | PDF (647 kb) |
Copyright © 2008 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 95, Issue 5, 2434-2449, 1 September 2008
doi:10.1529/biophysj.108.133587
Proteins
Yelena A. Arnautova and Harold A. Scheraga
, 
Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York
Address reprint requests to Harold A. Scheraga, Dept. of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, NY 14853-1301. Tel.: 607-255 4034.Accurate prediction of protein structure when the only information provided is about amino acid sequence remains one of the greatest challenges in computational chemistry. Results of the CASP (Critical Assessment of Techniques for Protein Structure Prediction) exercises 1 demonstrated that, in many cases, the tertiary structure of a protein (especially a homologous one) can be predicted with a high degree of certainty. However, these predictions provide information about protein structure at relatively low resolution (>3Å), which may not be sufficient for practical applications (for example, for structure-based drug design). To achieve the atomic level of detail in protein structure prediction, so called refinement methods have been introduced 2,3,4,5,6,7, and significant attention has been paid to their development 2. These methods are designed to be able to shift the low- and medium-resolution models obtained either from statistics-based methods (homology modeling and threading) or from the use of physics-based course-grained force fields closer to the native state. An important element of any refinement method is an accurate scoring function that must be able to discriminate nativelike conformations from nonnative folds. Many different scoring functions, including empirical 3, knowledge-based 4,5,8, and physics-based 2,6,7 functions, are described in the literature. One type of scoring function, i.e., the one including physics-based all-atom force fields 2,6,7, seems to be a very promising tool for refinement, because these force fields are designed to model physical interactions and may therefore help to shed light on protein folding mechanisms. As such, they are also expected to have better transferability.
According to the Anfinsen thermodynamic hypothesis 9, a necessary requirement for energy functions to produce accurate protein structure models is their ability to recognize the native state of the protein as the conformation, or a set of very similar conformations, for which the system, i.e., the protein plus its surroundings, is of lowest free energy. Several physical scoring functions, such as those based on the AMBER 10,11,12, OPLS 13, CHARMM 14,15,16, and GROMOS 17 force fields, combined with different implicit solvent models, were reported to perform well when tested on large sets of decoys generated for many proteins. Despite the significant increase in accuracy of the available all-atom force fields and treatment of solvation effects, the physics-based scoring functions still exhibit some difficulty in differentiating native structures from the sets of decoys generated using very different methods. Their performance worsens when some kind of energy relaxation method (for example, short molecular dynamics (MD) runs) is applied to the decoys 18. Limited success 19,20 has been achieved in refinement of low- and medium-resolution protein models, especially when they have significant unstructured regions and long loops 7.
The accuracy of a physics-based force field is directly related to its ability to reproduce the energetic balance between different interactions accurately. Development of all-atom force fields usually involves the use of a number of assumptions and approximations, such as a simplified form of the energy function, use of fixed charges, omission of polarization effects, and use of implicit solvent models. These approximations definitely affect the accuracy of the resulting force field; however, they are difficult to avoid without significantly increasing the computational cost of the energy calculations. The other source of inaccuracy is the lack ofdirect experimental data, and hence, torsional and solvation energy terms are probably the most poorly determined parts of the physics-based force fields. For example, the torsional energy terms of all-atom force fields are often optimized by using gas phase data (from high-level quantum mechanical calculations) for small molecules and peptides. It has been shown 21 that the propensity of a given residue to form a particular secondary structure in the gas phase is different from that in solution, and therefore, the torsional parameters derived in this way are not directly transferable to larger systems in a solvent environment.
There is also a lack of direct experimental data on solvation free energies of proteins, which has led to the use of experimental data for small molecules and peptides for parameterization of solvent models. The solvation free energy term is intended to capture several complex effects involved for a protein in solution, from solvent entropy to ionization effects. All these effects may manifest differently for small molecules (or peptides) than for proteins. For example, the solvent exposure of atoms of the same type may, on average, be very different in proteins and small molecules. Moreover, the solvation free energy contribution parameterized by using experimental data for small molecules is usually added to the total energy without any adjustment to take into account the differences between small molecules and proteins. All these factors may contribute to the poor performance of all-atom force fields applied to proteins in solution.
One way to improve the accuracy of all-atom force fields is to use explicit water simulations and available experimental data (conformational equilibria of peptides and small proteins) in force-field parameterization. For example, extensive folding and unfolding simulations with an explicit solvent model have been used for optimization of backbone torsional parameters alone22 and with solvation parameters 23. Mohanty and Hansmann 24 used parallel tempering simulations with implicit water, carried out for a small β-sheet peptide, to reparameterize an empirical all-atom force field. The relatively high computational cost of this approach and the difficulties in applying it to more than one protein molecule at a time (to insure better transferability of the resulting force field) are some of the main obstacles to its wide application.
Another approach, which was also used in this work, is based on the thermodynamic hypothesis and involves the use of large sets of protein decoys to optimize the parameters of a force field. It was applied initially to parameterize coarse-grained protein models 25,26 and later to optimize all-atom force fields 27,28,29,30,31,32. Thus, Meirovitch et al. 27,31 optimized solvation parameters associated with solvent-accessible surface areas in all-atom physical energy functions intended for use in predicting surface loops in proteins. Their search of parameter space was restricted to a small number of parameters and was not systematic. A force-field optimization method, called MOPED, was used 28 to improve solvation parameters by creating an energy gap between the native conformations and a small number of decoys of two to three training proteins. Okur et al. 29 applied a genetic algorithm to optimize backbone torsional parameters using a large set of decoys generated for two peptides. Herges and Wenzel 30 parameterized their surface-area solvent model to stabilize the native structure of a single α-helical protein (the villin headpiece) against a large set of nonnative decoys. The resulting force field 30 was shown to be transferable to other α-helical proteins (but not to α/β or β-proteins because they were not considered in the parameterization). In recent work 32, the weights of the AMBER all-atom force field, supplemented by an explicit hydrogen-bond potential, were optimized to stabilize the native structures of a very large number of proteins (namely, 58) against a large number of decoy conformations. The authors also introduced additional energetic and structural criteria into their parameter optimization procedure to achieve better correlation between the energies of decoys and the similarity to the native structure. The force-field optimization led to a significant improvement in performance of the AMBER-based force field. Thus, the fraction of proteins for which the native structure had the lowest energy increased from 0.22 to 0.90. It should be mentioned that the protein decoys used by Wroblewska et al. 32 are characterized by a high degree of similarity (in terms of the secondary and tertiary structure) to the native conformation, and therefore, the ability of the force field they describe to discriminate native structures from very different compact conformations, as well as from the decoys generated by using different decoy generation procedures, remains to be established.
In this work, we introduce a new method of decoy-based force-field optimization. It is based on Anfinsen's thermodynamic hypothesis 9. Therefore, the optimization is aimed at stabilizing nativelike conformations against a large set of decoys by creating free energy gaps between the sets of nativelike and nonnative structures. The search for the best parameters of a force field is carried out with minimization in parameter space. In general, the goal of this work was to test the ability of the new optimization method to find a set of parameters of a given energy function that stabilizes the nativelike conformations of a number of proteins with different folds against large sets of nonnative decoys. To the best of our knowledge, the training set of proteins and the corresponding decoy sets used in this work are among the largest used to date for optimization of physics-based all-atom force fields.
We applied the new method to optimize the torsional and solvation parameters of the effective energy function built by using the physics-based all-atom ECEPP05 force field 33 coupled with the OONS 34 implicit surface-area (SA) solvation free energy term. The original ECEPP05/OONS force field fails to discriminate native structures from the decoys for several nonhomologous proteins (see Results and Discussion section). Although implicit solvent models, especially one as simple as a surface-area model, cannotaccount completely for all the effects of solvation, they are computationally very efficient and were shown to perform well when applied to the prediction of surface loops in proteins 27,31 and to the folding of small proteins 30,35 and peptides 36,37. We decided to consider this simple surface-area model and evaluate whether its accuracy can be improved by parameter optimization. We find that the optimization method succeeds in this task, and that the parameters obtained in learning from only a few proteins are transferable to other proteins. As an independent test of the optimized force field, we also considered the 4state-reduced set of decoys of Park and Levitt 38.
The total free energy of a protein in solution can be represented approximately as the sum of two terms:
![]() | (1) |
The internal free energy is given by
![]() | (2) |
![]() | (3) |
The ECEPP05 force field 33 was used to compute the internal energy (Uint) of a protein in the absence of solvent. The ECEPP05 internal energy is a function of the torsional degrees of freedom, i.e., all the backbone and side-chain torsional angles, of a protein (all bond angles and bond lengths are fixed at standard values 41). The Uint of a protein is given by
![]() | (4) |
The first two terms in Eq. (4) were computed as
![]() | (5) |
![]() | (6) |
The torsional energy term, Etor, in Eq. (4) for each dihedral angle x was computed as
![]() | (7) |
and
are the torsional parameters; x varies from 0 to 180°.It should be mentioned that there is no explicit hydrogen-bonding term in the ECEPP05 potential function. This interaction is represented by a combination of electrostatic and nonbonded interactions with the hydrogen involved in a hydrogen bond treated as a separate atom type with parameters different from those of the other types of hydrogens.
The solvation free energy, ΔGsolv, of each structure is estimated by using a solvent-accessible SA model,
![]() | (8) |
The performance of a given scoring function depends not only on the accuracy of its functional form and parameters but also on how it is applied. Scoring of protein decoys using physics-based functions is usually carried out through energy evaluation. Due to the roughness of the all-atom energy surface characterized by huge energy variations corresponding to small changes in structural parameters, such computations may not provide a realistic picture, especially if the decoys were generated using a very different force field. Scoring can also be carried out using local energy minimization, which relaxes a given conformation to the closest energy minimum. All local energy minimizations of the native structures of proteins from the Protein Data Bank (PDB 42) and of the corresponding structures from the decoy sets considered in this work were carried out using the SUMSL minimizer 43 as implemented in the ECEPPAK program 44,45,46.
For large all-atom systems such as proteins, which have a very rugged potential energy surface, local energy minimization leads to minor changes in the structure compared with the starting conformation and does not provide information about the existence of lower energy minima corresponding to conformationally very similar structures. A conformational search, limitedto the vicinity of the starting conformation, should make it possible to overcome this problem; therefore, in this work, we used two types of runs, 1), local energy minimizations, followed by 2), short Monte Carlo simulated annealing (MCSA) runs, to evaluate the energies of protein decoys. The effective free energy ΔGeff given by Eq. (3) was used as a scoring function for both types of runs.
Each MCSA run started at T0=1000K, and the system was then cooled in N cooling cycles to TN=200K. The resulting structures at 200K were energy-minimized. The same number of steps was carried out at each cooling cycle. The temperature of each cycle i was computed according to the formula Ti=T0−iA47, with
![]() | (9) |
In each cooling cycle, new conformations were generated by performing a short Monte Carlo search at a given Ti. Conformations generated during the Monte Carlo simulation were obtained by a 10% perturbation of the backbone and side-chain torsional angles.
The proteins considered in this work are listed in Table 1,Table 2,Table 3. The training (Table 1) and test (Table 2) sets of proteins were used for force field optimization and evaluation, respectively. These sets contain a total of 12 proteins with 20–76 residues each, and without any stabilizing ligands or disulfide bonds, because these cannot be accounted for by this version of the force field. All types of secondary structure, i.e., α5, α/β5, or β2, are represented in these sets of proteins. All proteins were considered with unblocked N- and C-termini. All ionizable residues and end groups were assumed to be neutral.
| Table 1 Results obtained for the training set of proteins using the ECEPP05/OONS and the optimized ECEPP05/SA force fields |
| ECEPP05/SA (after optimization)‡ | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimization§ | MCSA¶ | |||||||||||||
| Protein (PDB code) | Experimental method | Class | No. of residues | No. of decoys | RMSD range* | ECEPP05/OONS RMSD† | RMSD‖ | ΔGeff** | ΔΔGeff†† | RMSD‖ | ΔGeff** | ΔΔGeff†† | ||
| 1e0l | NMR | β | 37 | 3563 | 0.1–18.0 | 11.2 | 1.66 | −834.1 | −22.5 | 2.01 | −839.9 | −4.0 | ||
| 1gab | NMR | α | 53 | 7191 | 0.1–18.0 | 12.3 | 4.03 | −701.6 | −22.1 | 4.25 | −700.8 | −15.4 | ||
| 1igd | X-ray | α/β | 61 | 1638 | 0.1–30.0 | 23.8 | 1.36 | −922.6 | −16.4 | 1.40 | −946.2 | −34.9 | ||
| 1l2y | NMR | α/β | 20 | 4028 | 0.1–10.0 | 1.9 | 3.16 | −450.0 | −0.4 | 3.16 | −449.6 | −0.4 | ||
| 1csp | X-ray | β | 76 | 1937 | 0.1–27.0 | 16.4 | 2.19 | −1316.0 | −5.1 | 12.8 | −1315.7 | 3.2 | ||
| 1msi | X-ray | α/β | 66 | 4229 | 0.1–25.0 | 14.5 | 2.35 | −1397.8 | −22.0 | 2.27 | −1396.8 | −29.8 | ||
| * Range of RMSDs from the native structure for decoys of a given protein (Å). The first value corresponds to the RMSD of the native structure converted to the ECEPP geometry (i.e., with standard values of bond lengths and bond angles). † RMSD (Å) from the native structure of the decoy with the lowest ECEPP05/OONS energy (after only local energy minimization). ‡ Results obtained using the optimized ECEPP05/SA force field. § Local energy minimization. ¶ Monte Carlo simulated annealing run after local energy minimization. ‖ RMSD of the lowest energy decoy from the native structure (Å). ** Effective free energy (Eq. (3)) of the lowest energy decoy, kcal/mol. †† ΔΔGeff= where and are the effective free energies (in kcal/mol) of the lowest-energy nativelike and nonnative decoys, respectively. ΔΔGeff was computed only for the cases in which the energy distributions in Fig. 5 had two well-defined minima corresponding to nativelike and nonnative decoys. |
| Table 2 Results obtained for the test set of proteins using the optimized ECEPP05/SA force fields |
| Protein (PDB code) | Experimental method | Minimization | MCSA | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class | No. of residues | No. of decoys | RMSD range | RMSD | ΔGeff | ΔΔGeff | RMSD | ΔGeff | ΔΔGeff | ||||
| 1bdd | NMR | α | 46 | 1203 | 0.1–20.0 | 3.92 | −1365.2 | −72.7 | 3.66 | −1381.0 | −48.9 | ||
| 1vii | NMR | α | 36 | 2271 | 0.9–13.0 | 3.07 | −718.1 | −18.5 | 5.45 | −733.4 | —* | ||
| 1res | NMR | α | 43 | 2824 | 0.1–21.0 | 2.69 | −1494.6 | — | 2.66 | −1503.9 | —* | ||
| 1fsd | NMR | α/β | 28 | 3208 | 0.1–12.0 | 3.77 | −1308.1 | −46.0 | 6.47 | −1330.4 | 8.3 | ||
| 1cc7 | X-ray | α/β | 72 | 5755 | 0.1–27.0 | 1.72 | −1166.2 | — | 1.53 | −1171.8 | —* | ||
| 1ail | X-ray | α | 73 | 3622 | 0.1–29.0 | 3.09 | −2440.3 | −35.7 | 3.19 | −2430.4 | −19.4 | ||
| See Table 1 notes for descriptions of parameters. |
| * ΔΔGeff was not computed (see Table 1, last note). |
| Table 3 Results for the 4state-reduced decoy set obtained using the optimized ECEPP05/SA force field |
| Protein (PDB code) | Class | No. of residues | No. of decoys | Energy evaluation* | Minimization† | MCSA‡ | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| RMSD§ | ΔGeff¶ | RMSD§ | ΔGeff¶ | RMSD§ | ΔGeff¶ | ||||||
| 1ctf | α/β | 68 | 630 | 3.20 | −430.5 | 1.17 | −906.7 | 1.73 | −962.0 | ||
| 1r69 | α | 63 | 675 | 0.14 | −1339.7 | 0.76 | −1914.0 | 1.23 | −1981.3 | ||
| 3icb‖ | α | 75 | 653 | 1.80 | −715.4 | — | — | — | — | ||
| 4rxn** | α/β | 54 | 677 | 0.31 | −337.8 | — | — | — | — | ||
| 2cro | α | 65 | 674 | 0.14 | −1230.1 | 2.65 | −1909.7 | 1.17 | −1934.3 | ||
| The 4state-reduced decoy set was developed by Park and Levitt 38. |
| * Energy evaluation for a fixed conformation. † Local energy minimization. ‡ Monte Carlo Simulated Annealing run after local energy minimization. § RMSD from the native structure for the lowest energy decoy, Å. ¶ Effective free energy (Eq. (3)) of the lowest energy decoy, kcal/mol. ‖ Calcium binding protein. ** Metal-binding protein. |
Decoys for all the proteins from Table 1,Table 2 were generated according to the procedure described in Ripoll et al. 48, i.e., starting from 1), native, 2), canonical α-helical (ϕ=−60.0°, ψ=−40.0°, ω=180.0°), and 3), randomly generated conformations, by using the electrostatically driven Monte Carlo method 44 with the ECEPP05 force field coupled with the OONS solvent-accessible SA model. The generated conformations were clustered by using the minimal spanning tree method 49 and assuming a specific root mean-square deviation (RMSD) cutoff of 0.7Å for all heavy atoms and no cutoff in energy. For each protein, the size of the ensemble generated from all three starting points varied from 1,203 to 7,191 conformations and is characterized for most proteins by a uniform distribution of RMSD from the native fold in the range 0.1–30.0Å.
The decoy set generated for each protein also included the native structure. The coordinates for the native structure of each protein used in this work (listed in Table 1,Table 2) were taken from the PDB and subsequently converted to ECEPP-type geometry, i.e., with fixed (standard-value) bond lengths and bond angles. This conversion provides an all-atom representation, including hydrogen atoms, for each of the selected proteins. The RMSD for the heavy-atoms between the native structures before and after the conversion is very low, as can be seen from the 6th column in Table 1,Table 2. When more than one structure for a given protein is present in the PDB (NMR-derived structures), the one corresponding to the PDB code in Table 1,Table 2 was selected. If several conformations were submitted under the same PDB code, the model submitted as model number 1 was used.
It should be mentioned that most authors of force-field optimization methods restrict their use to the native protein structures solved only by x-ray diffraction measurements (avoiding NMR-derived models). The main reason for this choice is the lower accuracy (uncertainties up to 2Å) of NMR structures (due to the much smaller amount of experimental data available for each atom) compared to that of the x-ray structures. Although x-ray-derived protein conformations are more accurate, uncertainties in atomic positions for high-quality structures can be up to 0.6–1.0Å. Atomic-resolution crystal structures exhibit extensive, discrete conformational substates in which a high percentage of side chains can exist in multiple conformations 50 or are completely disordered. The main chains are more conserved, although uncertainty in the positions of the main-chain atoms can become pronounced in flexible surface loops. Kruskal 51 showed that crystal-packing effects are not a main source of structural differences between NMR and x-ray structures of the same proteins, but he suggested that the crystalline environment could have the effect of “freezing out” one conformation from the more diverse ensemble present in solution. This effect may be widespread, since most crystal structures reported today are determined at cryogenic temperature (∼100K). Last but not least, it is not clear what aspects of these low-temperature structures are relevant at room temperature 52. This evidence suggests that use of x-ray diffraction structures has little advantage over that of NMR-derived conformations, and we therefore considered both types of experimental structures in this work.
As an additional test of the optimized force field, we also included the 4state-reduced set of decoys of Park and Levitt 38 (Table 3). This setcontains decoy structures for seven proteins. We considered only five of these, namely, 1ctf, 1r69, 3icb, 4rxn, and 2cro, because the native structures of the remaining two proteins (4pti and 1sn3) are stabilized by disulfide bonds.
The parameter-optimization method described in this work makes use of Anfinsen's thermodynamic hypothesis 9. To develop a force field satisfying this hypothesis, two sets of conformations, namely, nativelike and nonnative ones, were considered, and the optimized parameters were derived by creating a free energy gap between these two sets. This approach differs from other similar optimization methods described in the literature 28,30 which make use of a single native structure instead of a set of nativelike conformations. The similarity measure that we used to define a nativelike structure is described in the next subsection.
We introduced a conformational free energy, (Fi(a)), of a native/nonnative set of structures, computed as
![]() | (10) |
is the effective free energy of the kth conformation; and x represents the backbone and side-chain torsional angles of the kth conformation. β=1/RT, where R is the universal gas constant and T is the absolute temperature. T was considered an empirical parameter. By allowing T to vary, the value of the conformational free energy of a given set of structures can be altered relative to the energy distribution of this set. The value of T used in this work (β=0.5 mol/kcal) was chosen in such a way that the conformational free energy of each level (i.e., a set of nativelike or nonnative structures, as described in next section) was close to the energy of the lowest-energy structure from this level. The Boltzmann summation in Eq. (10) is taken over the conformations from level i (i denotes the native or nonnative level).Using the ECEPP05/SA energy function, we modified the force-field parameters, a, in an attempt to satisfy the condition that, for a training set of proteins, the conformational free energy of the nativelike conformations should be lower than the conformational free energy of a set of nonnative decoys:
![]() | (11) |
Force field parameters (a) were optimized by minimizing the target function
![]() | (12) |
![]() | (13) |
and ymax=−Δj.The energy of a given conformation, represented by a set of backbone and side-chain torsional angles, x, is a function of the force field parameters, a, i.e., all k and σ. As the force-field parameters vary, the conformations based on the initial parameters may no longer correspond to energy minima; therefore, the effective free energies (Eq. (3)) computed with the new parameters will not reflect the real relative stabilities of native and nonnative levels. To solve this problem, we employed the following approach. The parameter optimization is an iterative procedure in which each iteration first involves optimization of a while holding the conformations fixed. Then, all conformations are energy-minimized with the resulting interim parameters, a. This procedure is repeated until all the free energy gaps reach the predefined value Δj (Δj=5 kcal/mol for all j). Both the minimization of the target function Φ as a function of a and the minimization of the effective free energy, ΔGeff, for each conformation as a function of x are carried out by using the SUMSL minimizer 43. At each iteration, we computed the local energy minima with the current force-field parameter set by performing energy minimizations from the fixed initial native and decoy conformations. For each training protein, all decoys plus the native structure are included in the optimization procedure. The flowchart of the optimization method is shown in Fig. 2.
The method described above was applied to optimize the backbone ϕ and ψ torsional (
and
in Eq. (7)) and solvation (σi in Eq. (8)) parameters. The force field parameters for the remaining torsional angles (ω and χ) were kept fixed at the original ECEPP05 values 33. We also attempted to stabilize the nativelike conformations of the training proteins by varying the relative contributions of different energy terms of the effective energy function (Eq. (3)) by optimizing the weights (w) of the equation
![]() | (14) |
As mentioned earlier, we considered a set of nativelike conformations (defined below) instead of a single native structure to optimize parameters of an all-atom force field. Our decision to use a set of nativelike conformations is based on the fact that a protein under physiological conditions exists as a dynamic ensemble of conformations. Ideally, NMR experiments should be able to provide information about this ensemble. Although protein structures solved by x-ray diffraction are represented by a single conformation, work on the interpretation of crystallographic data 54 showed that dynamics and heterogeneity remain even in the crystalline state, and suggested that a single conformation may not provide the best solution to the crystallographic structure-determination problem 54,55. In addition, the authors concluded that use of a single conformation may introduce a bias in the computation of protein properties such as solvent-accessible SA, total energy, etc., which are sensitive to small variations in atomic positions.
All decoy conformations were divided into two groups (levels), namely, nativelike and nonnative, according to two criteria: 1), fraction of residues with the same conformation (according to the conformational letter code of Zimmerman et al. 56) as in the corresponding fragment of the experimental structure, and 2), fraction of contacts in a fragment matching those in the corresponding fragment of the experimental structure. Thus, the similarity of packing of the secondary structure fragments was defined in terms of the fraction of the interfragment native contacts.
A decoy is defined as nativelike if both the secondary structure and packing of the secondary structure elements are similar to those in the native structure. The first of these requirements, i.e., similarity of the secondary structure, means that at least 60% of consecutive residues in each secondary structure element should be from the same regions (defined by Zimmerman et al. 56) of the Ramachandran (ϕ-ψ) map as the corresponding native residues.
To quantify the similarity of the packing of secondary structure elements to that in the experimental structure, we used the Q parameter introduced in Furnham et al. 57, i.e.,
![]() | (15) |
A set of nativelike conformations defined according to the twofold similarity measure introduced above includes more diverse structures than is described by an average experimentally determined native ensemble (RMSD≤2Å); there is no direct correspondence between the measure described above and RMSD. However, the nativelike structures correspond roughly to those with RMSD≤4Å from the PDB structure. Since the goal of this work was to develop a force field capable of discriminating nativelike from nonnative protein conformations (i.e., those with different tertiary and even secondary structure), we felt that the use of the definition of a nativelike structure introduced here is justified.
For analysis of the results, the structural similarity between two protein conformations was also expressed as the RMSD between the best overlap of the heavy atoms (i.e., all atoms except hydrogens) of the two conformations.
In this section, we report the application of the parameter optimization method presented in this article to the development of an accurate all-atom force field including hydration. First, the accuracy of the original ECEPP05/OONS force field to score protein decoys is evaluated. Next, we report optimization of the force field parameters (torsionaland SA solvation parameters (see below)) using the procedure described in the Methods section and shown in Fig. 2. The optimization procedure minimizes the target function Φ (Eq. (12), Fig. 2) and is aimed at stabilizing nativelike conformations relative to nonnative decoys for a set of training proteins. Finally, the resulting optimized force field was used in two types of simulations: 1), local energy minimizations; and 2), MCSA runs after the local energy minimizations. Performance of the force field in these tests carried out for the training and test sets of proteins, as well as for the 4state-reduced decoy set of Park and Levitt 38, is discussed.
Decoys of the training proteins (Table 1) were first energy-minimized using the all-atom ECEPP05 force field 33 combined with the OONS implicit SA solvation model 34 with the original parameters (ECEPP05/OONS). The results of the calculations are reported in Table 1, column 7, and in Fig. 3. For all the proteins but one (1l2y) from the training set, nonnative structures have lower energies than nativelike ones. The nativelike conformation with an RMSD of 1.9Å from the experimental structure was obtained as the lowest-energy one for 1l2y. For all the other proteins from the set, the ECEPP05/OONS force field favors all-helical structures. In the case of the α-helical protein, 1gab, the lowest-energy structure (a two-helix bundle) differs from the native conformation (a three-helix bundle).
Low energies of nonnative helical conformations with ECEPP05/OONS are due to the large contribution of the sum of the nonbonded and torsional energies, which constitutes ∼90% of the total energy. Since this part (i.e., EvdW+Eel+Etor) of the force field was parameterized to reproduce the ab initio (gas phase) ϕ-ψ map of terminally blocked alanine, which has a global minimum corresponding to the α-helical conformation, the ECEPP05 force field is expected to favor this type of conformation. On the other hand, the solvationenergy is more favorable for extended structures. However, the solvation energy contribution to the total energy for the OONS model is very small (∼10%) and does not significantly affect the relative stabilities of different conformations. Most of the high-RMSD decoys shown in Fig. 3 are noncompact α-helical conformations, which are favored by both the gas phase and the solvation energy terms of the ECEPP05/OONS force field. Only in the case of 1l2y is the low total energy of the 1.9-Å nativelike conformation determined by the (Enb+Etor) contribution.
These results indicate that the ECEPP05 force field combined with the OONS solvent model (with their original parameters) is not able to discriminate nativelike structures and has to be improved.
In this work, we attempted to optimize the force-field parameters by minimizing a target function Φ (Eq. (12)). The backbone torsional ϕ and ψ and the solvation parameters were selected as candidates for the optimization because they are the most difficult ones to derive from first principles and, therefore, their values may contain a high degree of uncertainty. First, only the torsional parameters were allowed to vary during the optimization. When more than one training protein was considered, the optimization did not lead to any significant improvement in the stability of the nativelike structures relative to the nonnative decoys. On the other hand, when the target function, Φ, was optimized as a function of either solvation alone or both solvation and torsional parameters, the target free energy gaps were achieved in a small number of iterations. There was also more than one set of parameters that minimized the target function Φ. As a result, it may be necessary to consider a very large set of proteins to obtain a unique set of parameters. Since consideration of a large number of proteins simultaneously (with a large number of decoys) is computationally very demanding, we decided to focus on optimization of the full range of σ parameters of the solvation model, allowing only a limited variation of the torsional parameters (within ±10% of the gas phase values). The torsional parameters were not fixed during the optimization, because their original values were derived from ab initio (gas phase) calculations and therefore may not be adequate for a protein in solution. No restrictions were placed on possible values of the solvation parameters; however, we assumed that the OONS parameter set represents a reasonable starting point for reparameterization, and therefore, we did not attempt to explore the space of solvation parameters for alternative optimized parameters (i.e., only one starting set of parameters (OONS) was considered).
Before starting the optimization, we evaluated how the size of the training set influences the resulting values of the force-field parameters. Five sets, containing 2, 3, 4, 5, and 6 proteins, respectively, were considered. These sets were composed of the proteins from Table 1 by adding one protein at a time, e.g., 1e0l and 1gab (set 1), 1e0l, 1gab, and 1igd (set 2), and so on. Fig. 4 shows the values of the solvation parameters as a function of the size of the training set. The parameters that depend the most on the size of the training set are σ1, σ2, σ5, and σ6 (the corresponding functional groups are shown in Fig. 1). The solvation parameter for the aliphatic group (Fig. 1), σ1, is the most sensitive to the number of proteins used for its optimization. In general, changes in the solvation parameters as a function of the training set size are small (<4%) and become even smaller when the set size reaches five proteins. Although the parameter values obtained from the optimizations carried out using five and six proteins, respectively, are very close, we decided to use six proteins to ensure better transferability of the resulting force field.
The training set containing the six proteins listed in Table 1 was considered in the parameter optimization carried out by using the procedure (Fig. 2) described in the Methods section. The force-field parameters that satisfy Eq. (11), with Δ=5 kcal/mol, were obtained in a small number of iterations (approximately three). The resulting backbone torsional and solvation parameters are given in Table 4,Table 5, respectively. The most significant changes in the parameter values arose for three types of groups, namely, aliphatic, aromatic, and carboxyl/carbonyl oxygen. The aliphatic and aromatic groups became more “hydrophobic” (large positive values of σ), whereas carboxyl/carbonyl oxygen became more “hydrophilic” (more negative values of σ).
| Table 4 Initial and optimized values of the backbone ϕ and ψ torsional parameters (kcal/mol) |
| φ | ψ | |||||||
|---|---|---|---|---|---|---|---|---|
| Parameters | k1 | k2 | k3 | k1 | k2 | k3 | ||
| Initial | −1.43 | 1.41 | 0.19 | −1.70 | 1.95 | −0.46 | ||
| Optimized | −1.41 | 1.41 | 0.17 | −1.64 | 1.95 | −0.49 | ||
| Initial values were taken from Arnautova et al. 33. Force field parameters for the torsional angles ω and χ were taken from ECEPP05, and were kept fixed during the parameter optimization. |
| Table 5 Initial (OONS) and optimized values of the solvation parameters (σ) |
| Chemical group | σOONS | σopt | |||
|---|---|---|---|---|---|
| Aliphatic (CH3, CH2, CH) | (σ1) | 0.008 | 0.171 | ||
| Aromatic (=CH−) | (σ2) | −0.008 | 0.155 | ||
| Hydroxyl (−OH) | (σ3) | 0.427 | 0.416 | ||
| Amide and amine (NH2, NH) | (σ4) | −0.132 | −0.187 | ||
| Carboxyl and carbonyl carbon | (σ5) | −0.172 | −0.245 | ||
| Carboxyl and carbonyl oxygen | (σ6) | −0.038 | −0.269 | ||
| Sulfur –S− and thiol –SH | (σ7) | −0.021 | —* | ||
| Values are given in kcal/mol·Å2. |
| * The training proteins did not have sulfur-containing residues in their sequences. |
Fig. 5 shows energies of the decoys versus all heavy-atom RMSDs from the native structure for the six training proteins. The energies (Fig. 5, red circles) correspond only to the local energy minima of the optimized force field before implementation of MCSA. The optimized ECEPP05/SA force field stabilizes near-native conformations against the competing low-energy decoys for all six proteins (the energy gaps between the lowest-energy nativelike and nonnative decoys (Table 1, column 10) are all negative). The best result, in terms of the RMSD from the native structure, was obtainedfor 1igd, for which the lowest-energy decoy was only 1.36Å from the x-ray conformation (Figure 5c, red circles). For 1gab, the lowest-energy decoy (−701.6 kcal/mol) has a nativelike structure, although its RMSD from the native structure is relatively high (∼4Å, Figure 5b). The only protein for which the optimized force field performed less well than ECEPP05/OONS was 1l2y (Trp-cage, Figure 5d). The lowest-energy decoy (−450.0 kcal/mol) of 1l2y has a relatively high RMSD (3.16Å) from the native structure and is only marginally more stable (∼0.4 kcal/mol) than the low-energy nonnative conformations (Figure 5d and Table 1). At the same time, optimization of the force field led to increased stability of a set of nonnative decoys of 1l2y, with RMSDs around 6.4Å and general flattening of the effective free energy surface for the 2–6.5Å RMSD range (Figure 5d), indicated by the appearance of the large number of decoys in this RMSD range with similar and very low energies (Figure 5d).
The parameters obtained by using the optimization procedure described in this work (see Methods and Fig. 2) and reported in Table 4,Table 5, are not guaranteed to be optimal, even for the training proteins, because low-energy decoys, not included in the training decoy sets, can exist. The free energy relaxation carried out by performing short MCSA runs after energy minimization was used to explore the free energy surface in the vicinity of the minima corresponding to the training decoys. These simulations yielded results (Fig. 5, blue circles, and Table 1, columns 11 and 13) very similar to those obtained from local energy minimization (Table 1, columns 8 and 10). For five out of six training proteins, nativelike structures were scored by the optimized force field as the lowest in energy. Only in the case of 1csp did nonnative all-helical decoys have lower free energies (ΔΔGeff>0 (Table 1, column 13)) than nativelike β-barrel conformations. It should be mentioned that the energies of these new nonnative conformations produced by MCSA runs are still higher (−1315.7 kcal/mol) than the energies of the lowest-energy nativelike structures obtained from local energy minimizations (−1316.0 kcal/mol) (Table 1, columns 12 and 9, respectively). Comparison of the free energies obtained separately by either local energy minimization or MCSA runs after local energy minimization shows that only in the case of 1e0l and 1igd did the MCSA search lead to a significant decrease in energy of the decoys compared to those obtained from local energy minimization (−839.9 vs. −834.1 and −946.2 vs. −922.6 for 1e0l and 1igd, respectively (Table 1, columns 12 and 9)). The MCSA runs did not yield any lower-energy nativelike decoys for 1gab and 1 csp (decoys with RMSD<5Å in Figure 5b and decoys with RMSD<4Å in Figure 5e for 1gab and 1csp, respectively) or any nativelike or nonnative conformations for 1l2y and 1msi (Figure 5df, respectively). For 1l2y, a small 20-residue protein, this result may be caused by good sampling of the conformational space during decoy generation. However, for 1gab, 1csp, and 1msi, MCSA seems to be less efficient in finding new low-energy minima. 1csp and 1msi are the largest proteins in the training set (76 and 66 residues, respectively), so it makes sense that generating near-native decoys would not be easy, because even small conformational changes in the interior residues may lead to atomic clashes and high energies.
The MCSA runs intended to explore the vicinity of free energy minima corresponding to the decoys, and therefore provide additional information about the free energy surface of a protein, did not locate any nonnative conformations with energies lower than those of the nativelike structures. In other words, considering the combined results of the local energy minimizations and the MCSA runs, the optimized ECEPP05/SA force field is able to discriminate nativelike structures for all six training proteins as the lowest in effective free energy.
To evaluate whether the optimized ECEPP05/SA force field is transferable to other nonhomologous proteins, i.e., whether it is able to score near-native conformations as those with lowest energies, we considered sets of decoys generated for the six proteins (test set) listed in Table 2. The test set included four α-helical and two α/β proteins. Both local energy minimization and free energy relaxation (MCSA) runs afterenergy minimization were carried out for the decoys of these six proteins. The results of the calculations are given in Table 2 and Fig. 6.
As a result of local energy minimization with the optimized force field, we find that for each of the six test proteins, a near-native conformation emerges as the lowest in energy when compared to other low-energy decoys (Fig. 6, red circles). For all the proteins, the RMSD of the lowest-energy decoys is <4.0Å (Table 2, column 7). As seen in Fig. 6 e, the most stable decoy of 1cc7 is characterized by the lowest RMSD of 1.72Å from the native structure (the latter being shown in Figure 7a). The highest RMSD, 3.92Å, was obtained for 1bdd (Figure 6a). The two (middle and C-terminal) helices in the lowest-energy 1bdd decoy have the same tertiary alignment as in the NMR structure, but with slightly different orientation of the N-terminal helix (shown in Figure 7b). The lowest-energy decoy of 1fsd also has a relatively high (3.77Å) RMSD from the native structure. The main difference between this conformation and the NMR structure lies in the shape of the N-terminal fragment (shown in Figure 7c).
Somewhat different results were obtained for the test decoy sets using free energy relaxation (MCSA runs after energy minimization (Fig. 6, blue circles)). For five out of six proteins, namely 1bdd, 1vii, 1res, 1cc7, and 1ail, nativelike structures are stabilized relative to nonnative decoys (Fig. 6 and Table 2, column 12). For the remaining protein, 1fsd, a nonnative decoy with an RMSD of 6.47Å (Figure 6d) was scored as the most stable (−1330.4 kcal/mol). In the case of 1vii, the lowest-energy decoy (−733.4 kcal/mol) has a high RMSD of 5.45Å; however, it has overall nativelike topology with slightly higher helix content and, compared to the NMR conformation, a different relative orientation of helices 1 and 2(