Copyright © 2008 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 94, Issue 5, 1575-1588, 1 March 2008

doi:10.1529/biophysj.107.119651

Biophysical Theory and Modeling

In Silico Protein Fragmentation Reveals the Importance of Critical Nuclei on Domain Reassembly

Lydia M. Contreras Martínez 1, Ernesto E. Borrero Quintana, Fernando A. Escobedo, and Matthew P. DeLisa

School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, New York

Address reprint requests to Fernando A. Escobedo, School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY 14853. Tel.: 607-255-8243; Fax: 607-255-9166 or to Matthew P. DeLisa at the same address. Tel.: 607-254-8560; Fax: 607-255-9166.

1 Lydia M. Contreras Martínez and Ernesto E. Borrero Quintana contributed equally to this work.

Abstract

Protein complementation assays (PCAs) based on split protein fragments have become powerful tools that facilitate the study and engineering of intracellular protein-protein interactions. These assays are based on the observation that a given protein can be split into two inactive fragments and these fragments can reassemble into the original properly folded and functional structure. However, one experimentally observed limitation of PCA systems is that the folding of a protein from its fragments is dramatically slower relative to that of the unsplit parent protein. This is due in part to a poor understanding of how PCA design parameters such as split site position in the primary sequence and size of the resulting fragments contribute to the efficiency of protein reassembly. We used a minimalist on-lattice model to analyze how the dynamics of the reassembly process for two model proteins was affected by the location of the split site. Our results demonstrate that the balanced distribution of the “folding nucleus,” a subset of residues that are critical to the formation of the transition state leading to productive folding, between protein fragments is key to their reassembly.

Introduction

Recent advances in molecular biology techniques have led to the development of many powerful research tools that have been key in providing detailed knowledge of the principles underlying highly specific interactions between cellular proteins. Of particular note is the protein fragment complementation assay (PCA), wherein a reporter protein is split into individual fragments that by themselves remain inactive but upon reassembly under the appropriate cellular conditions yield the original, properly folded and active protein structure. For example, the yeast two-hybrid system, based on the functional reconstitution of the split Gal-4 transcriptional activator 1, has facilitated the systematic determination of proteome-scale protein-protein interaction networks within numerous organisms, including humans 2, Drosophila melanogaster 3, Caenorhabditis elegans 4, Saccharomyces cerevisiae 5,6, vaccinia virus 7, and Escherichia coli bacteriophage T7 8.

The increasing interest in protein-protein interactions has motivated the search for additional split reporter proteins that can be used for different applications and in other systems besides yeast 9. Examples include split green fluorescent protein (GFP) and its spectral variants yellow FP and cyan FP 10,11, ubiquitin 12, murine dihydrofolate reductase (DHFR) 13, β-lactamase 14,15, and firefly luciferase 16. The use of these split proteins is highly convenient, since the reconstituted activity of each is directly measurable by fluorescence or other well-established enzymatic assay. Numerous successes notwithstanding 17,18, the use of split proteins can be limited in usefulness because of the slow folding kinetics and formation of misfolded aggregates associated with the reassembly process of the fragments 11,17. For instance, whereas GFP activity can be detected in minutes, the two split fragments that result when the protein is dissected near the middle of the sequence fail to associate and reassemble when expressed in bacteria 11. A similar drawback has also been observed in other split systems like DHFR, β-lactamase, and ubiquitin, where folding is dramatically (or completely) inhibited upon protein fragmentation. In most cases, the addition of two interacting proteins to the split halves dramatically improves the kinetics of split protein reassembly, presumably by nucleating the reassembly reaction 11. However, even when fragments are each fused to strongly interacting leucine zippers (KD ≈ 1–20μM), folding and activity of the reconstituted protein are achieved only after 1–2 days 19. This inefficiency hinders the effective application of these detection systems on biologically relevant timescales. 

In an effort to increase the self-assembly efficiency of protein fragments in the absence of any interacting partners, a number of strategies have been employed, including 1), the identification of “permissive” split sites along the protein sequence using circular permutation 20,21, structure-guided design 14,22, or bioinformatic and theoretical analyses 23; and 2), the optimization of a target sequence for more efficient splitting/reassembly using directed evolution 24,25. In the majority of cases, split sites are selected in regions away from the catalytic site, in areas containing flexible loops that can typically tolerate amino acid insertions, or in linker regions that separate naturally occurring functional domains 17. However, given that a few key residues known as the folding nucleus provide a significant driving force in the folding of a protein 26,27,28, we hypothesized that the way in which this nucleus is distributed between fragments determines the reassembly efficiency of split proteins. In support of this notion, it has been observed that introduction of residues into the folding nucleus that lower its stability can dramatically slow the folding process 29.

To test our hypothesis, we have developed an on-lattice minimalist coarse-grained protein model to address how the reassembly kinetics, thermodynamic stability, and folding mechanism of a lattice model protein are affected upon splitting. Specifically, we designed several two-fragment systems derived from a well characterized 48-mer that is known to follow a nucleation-driven folding mechanism 30. Each of these split 48-mers was analyzed to determine the extent to which the reassembly process was impacted by differential partitioning of the folding nucleus between the two fragments. Our results suggest that a balanced distribution of folding nuclei amino acids between protein fragments is essential for efficient reassembly; this result was corroborated by the behavior observed for the reassembly process of a second set of split proteins derived from a 64-mer model protein. Collectively, these results provide new insights into the thermodynamic and kinetic aspects underlying protein fragment complementation and should prove extremely useful in the forward design and engineering of new split proteins.


Methods

Split protein models

To explore protein fragment complementation experimentally, three two-protein fragment systems (N-split, Mid-split, and C-split) were created by splitting a model 48-mer protein, namely 48-1 (TSKRQQPYPMSLGSPFIRIPMIGPRPRMRLLILLMGYPKRGRSGGGLF) 31, in three different locations (Fig. 1). Folded structures and a detailed thermodynamic and kinetic characterization for the parental 48-1 model protein sequence can be found elsewhere 31,32,33. In the N-split case, the sequence was split near the N-terminus between amino acids 16 and 17, creating one 16-residue fragment and a second 32-residue fragment. In the Mid-split case, the sequence was split in the middle between residues 24 and 25, creating two equal-sized fragments. In the C-split case, the sequence was split C-terminally between residues 32 and 33, creating one 16-residue fragment and a second 32-residue fragment. The symmetry shared by the N- and C-split systems was created so that the two fragments in each system were of equal length (i.e., each system has one 16-mer and one 32-mer fragment). This was done to eliminate any effect on folding due to variations in chain size since it was unclear at the outset how this might impact reassembly.
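The three fragmentation schemes described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the sequence and split positions are taken from the text, while the names `SEQ_48_1`, `SPLIT_SITES`, and `split_sequence` are hypothetical and introduced here for the example.

```python
# 48-1 sequence as given in the text (residues numbered 1..48).
SEQ_48_1 = "TSKRQQPYPMSLGSPFIRIPMIGPRPRMRLLILLMGYPKRGRSGGGLF"

# Last residue of the N-terminal fragment (chain 1) for each split system.
SPLIT_SITES = {"N-split": 16, "Mid-split": 24, "C-split": 32}


def split_sequence(seq, last_residue_of_chain1):
    """Cut the sequence after the given 1-based residue number."""
    return seq[:last_residue_of_chain1], seq[last_residue_of_chain1:]


for name, site in SPLIT_SITES.items():
    chain1, chain2 = split_sequence(SEQ_48_1, site)
    print(f"{name}: chain 1 = {len(chain1)} residues, chain 2 = {len(chain2)} residues")
```

The N- and C-split systems each yield one 16-mer and one 32-mer, which is the size symmetry the design relies on.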

Figure 1
Construction of split protein systems. (a) Linear representation of the parent 48-mer sequence. Fragmentation sites for N-, Mid-, and C-split systems are indicated. Residues comprising the folding nucleus at the center of the folded structure are shown in light gray. (b) Schematic representation of folded structures for split systems. Amino acids or connecting bonds shown in black correspond to N-terminal fragment (chain 1); amino acids or connecting bonds shown in dark gray correspond to those in the C-terminal fragment (chain 2); amino acids shown in light gray correspond to the folding nucleus. (c) Schematic representation of interchain (InterC) contacts that involve interacting residues from both chains, and intrachain (IntraC) contacts that involve interacting residues within the same chain. (d) InterC and IntraC contacts formed by critical core residues upon protein fragmentation. Most probable native contacts found in the transition state (TS) ensemble (i.e., the folding core/nuclei) for folding of the 48-mer sequence at Tf=0.27. Data obtained from Borrero and Escobedo 34.

Modeling folding for two protein fragments

To model the folding process, we adopted an on-lattice minimalist protein model in which the configuration of each protein chain evolves according to a canonical Monte Carlo (MC) algorithm 34. Briefly, space was discretized into a three-dimensional cubic lattice. Proteins were represented as self-avoiding chains, where each bead represents an amino acid with the bonds between the amino acids having uniform length equal to the lattice spacing (σ). Amino acid interactions were simulated by a Miyazawa-Jernigan contact energy potential 35 that takes into account implicit solvent effects and side-chain character. Conformational sampling was performed through a set of MC moves based on the Verdier-Stockmayer algorithm that mimics the diffusive movement of the amino acids during the folding process and includes 1), tail moves of one of the end beads to one of the available four neighboring sites; 2), corner flips for beads characterized by a right angle between directions to both contour neighbors; and 3), crankshaft moves of bead pairs located at the bottom of a U-turn 36. Relative to Verdier-Stockmayer moves, translation of a randomly selected chain was attempted after each MC step with a priori probability ≤10⁻⁴, consisting of adding either +1 or −1 (randomly chosen) to a random axis coordinate of all segment positions. Although this choice of translational move probability has no impact on thermodynamic averages, it affects the apparent kinetic dynamics of the system; for this reason, we only considered relative comparisons of real time kinetics between simulated dynamics for the 48-mer and the split systems 37.
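As a rough illustration of the chain-translation move described above, the following Python sketch shifts every bead of one chain by ±1 along a random axis and checks the result against the other chain and the box. Several details are assumptions made for the example and are not stated in the text: the Metropolis form of the acceptance rule, hard (non-periodic) box walls, reduced units with kB = 1, and all function names.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Metropolis criterion in reduced units (assumed; k_B = 1)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def translate_chain(chain, rng):
    """Add +1 or -1 (random) to one random coordinate of every bead."""
    axis = rng.randrange(3)
    shift = rng.choice((-1, 1))
    return [
        tuple(c + shift if k == axis else c for k, c in enumerate(bead))
        for bead in chain
    ]


def translation_is_valid(moved_chain, other_chain, box):
    """Reject moves that leave the box (hard walls assumed) or overlap."""
    inside = all(0 <= c < box for bead in moved_chain for c in bead)
    overlap = set(moved_chain) & set(other_chain)
    return inside and not overlap


rng = random.Random(1)
chain_1 = [(5, 5, 5), (6, 5, 5), (6, 6, 5)]
chain_2 = [(2, 2, 2), (2, 3, 2)]
trial = translate_chain(chain_1, rng)
print(trial, translation_is_valid(trial, chain_2, box=12))
```

Because the whole chain moves rigidly, its internal contacts are unchanged and only interchain contact energies would enter `delta_e` in a full simulation.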


Characterization of the folded state

To capture the specific chain topology of the folded state, two main parameters were used: the native energy (Enat), which records the sum of the energies of all interresidue contacts, and the similarity parameter (Q), which represents the number of native contacts formed divided by the total number of native contacts that describes the folded structure of each system 38. According to this convention, Q=1 represents the native (folded) conformation and Q=0 represents the highly extended (unfolded) protein. As previously reported, the configuration corresponding to the folded state of the 48-mer structure was distinguished among all other visited configurations by the formation of 57 native contacts and a minimum energy value of −20.24 kBT 31,32,33.
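The similarity parameter Q can be sketched as follows for a lattice conformation. This is a minimal illustration on a toy four-bead chain, not the 48-mer itself; the function names and the representation of bonds as a set of index pairs are assumptions made for the example. A contact is taken to be any pair of non-bonded beads sitting on adjacent lattice sites.

```python
def contacts(coords, bonds):
    """Return non-bonded bead pairs (i, j), i < j, on adjacent lattice sites."""
    found = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if (i, j) in bonds:
                continue  # covalently bonded neighbours are not contacts
            distance = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
            if distance == 1:
                found.add((i, j))
    return found


def similarity_q(coords, bonds, native_contacts):
    """Fraction of native contacts present in the given conformation."""
    formed = contacts(coords, bonds) & native_contacts
    return len(formed) / len(native_contacts)


# Toy example: a four-bead chain folded into a square versus fully extended.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
single_chain_bonds = {(0, 1), (1, 2), (2, 3)}
native = contacts(square, single_chain_bonds)

print(similarity_q(square, single_chain_bonds, native))
print(similarity_q(extended, single_chain_bonds, native))

# Cutting the bond between beads 1 and 2 turns that pair into a contact,
# so the split system has one more native contact than the intact chain.
split_bonds = {(0, 1), (2, 3)}
print(len(contacts(square, split_bonds)) - len(native))
```

The last line mirrors why a split system gains one native contact relative to its parent chain: the severed bond must re-form as a non-covalent contact in the folded state.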


Spatial restriction

To make the association event more likely to occur without unduly constraining the conformations of the individual chains, space was restricted to a cubic box of 12σ length units, corresponding to a volume fraction of chains of ∼3%. However, given the small size (3×4×4) of the folded structure, the small size of each chain (16–32 residues), and the small number of chains involved (two), this spatial restriction was closer to a diluted regime, since the chains had plenty of free space to move. It is also worth noting that by encaging the system, we essentially disregarded the diffusion process that needs to occur before the two chains come near each other; instead, we consider a restricted open space where the local environment was crowded enough to allow for interchain interactions but precluded the chains separating to an infinite distance.
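The quoted volume fraction follows from a short calculation, sketched here under the assumption that each of the 48 beads occupies exactly one lattice site of volume σ³ in a box of side L = 12σ (the symbols φ and L are introduced only for this illustration):

```latex
\phi = \frac{N_{\text{beads}}\,\sigma^{3}}{L^{3}}
     = \frac{48}{12^{3}}
     = \frac{48}{1728}
     \approx 0.028 \approx 3\%
```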


Additional simulation parameters

To collect kinetic data, simulations were run up to the point where the native structure of the system was observed for the first time, and this time was recorded as the folding time. In the case where no folding was observed, simulations were run for a maximum of 5×10⁸ MC steps. Data from each simulation was obtained by taking the mean folding time (MFT) values over 500 independent runs in the canonical ensemble, each one starting from a different unfolded structure (Q≤0.2). Results were determined to be statistically invariant, since the data was not significantly affected when additional runs beyond 500 were included in each simulation.


Thermodynamic analysis

The thermodynamics of the single and multichain systems were studied by employing replica exchange MC (REMC) sampling 36,39 combined with the multihistogram reweighting method (MHR) 40. REMC was used to alleviate problems related to the sampling of a rugged free-energy landscape, in which the polypeptide chains could be temporarily trapped at low temperature. Protein folding was simulated by running several parallel replicas (M), each at a different temperature (Ti). The reduced temperature, T, was normalized by the reference temperature, To, such that kBTo represented the energy unit pertinent to the system. Relative to Verdier-Stockmayer and translation moves, swap moves between systems of different temperatures were attempted after each MC step with a probability ≤0.05. In most calculations, the number of replicas was 9, with T ranging between 0.1 and 0.5. Details of the thermodynamic analysis are given in 32. By using the REMC-MHR method, data from all replicas were combined and analyzed, minimizing the error in the estimation of the density of state function [Ω(E)] and facilitating the calculation of thermodynamic quantities over a wide range of temperatures, such as the specific heat (Cv) via Eq. (1), and free energy via Eq. (2):

(1)
(2)
Here, E represents the energy, kB is Boltzmann's constant, T is the temperature, S is the entropy, the partition function Z(T)=ΣE Ω(E)exp(−E/kBT), and the Boltzmann distribution of states P(E,T)=Ω(E)exp(−E/kBT)/Z(T).
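To make the definitions of Z(T) and P(E,T) concrete, the sketch below computes them from a tabulated density of states and derives two quantities from them. The specific-heat expression used is the standard energy-fluctuation formula, and the free-energy profile is taken as −kBT ln P(E,T), which equals E − TS(E) up to an additive constant; these are common textbook forms and may differ in detail from Eqs. (1) and (2). The replica-exchange swap rule is likewise the standard one and is an assumption here. Reduced units (kB = 1), the toy two-level density of states, and all names are hypothetical.

```python
import math


def swap_acceptance(e_i, t_i, e_j, t_j):
    """Standard replica-exchange swap probability (assumed form; k_B = 1)."""
    exponent = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)


def thermo_from_dos(energies, log_omega, temperature):
    """Return P(E,T), C_v, and -T ln P(E,T) from ln Omega(E)."""
    log_w = [lw - e / temperature for e, lw in zip(energies, log_omega)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(x - shift) for x in log_w]
    z = sum(weights)
    prob = [w / z for w in weights]
    mean_e = sum(p * e for p, e in zip(prob, energies))
    mean_e2 = sum(p * e * e for p, e in zip(prob, energies))
    c_v = (mean_e2 - mean_e ** 2) / temperature ** 2
    free_energy = [-temperature * math.log(p) for p in prob]
    return prob, c_v, free_energy


# Toy two-level system: one low-energy state and ten degenerate excited states.
energies = [-1.0, 0.0]
log_omega = [0.0, math.log(10.0)]
for t in (0.2, 0.4, 0.6):
    prob, c_v, free_energy = thermo_from_dos(energies, log_omega, t)
    print(t, c_v)
```

In a multihistogram analysis, ln Ω(E) would come from combining the energy histograms of all replicas; here it is simply supplied by hand.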



Results

Design of protein fragments

For this study, we chose the model 48-mer protein, 48-1, because its thermodynamic behavior, folding pathway, and transition state have been characterized in detail 31,32,33. The 48-1 sequence was originally designed by Shakhnovich and co-workers to model a well designed sequence that exhibits a stable, fast-folding structure and an all-or-none transition between clearly distinguishable native and unfolded states 31. To generate split lattice model proteins, we dissected the 48-1 sequence at three positions: between residues 16 and 17 (N-split), 24 and 25 (Mid-split), and 32 and 33 (C-split) (Figure 1a). The minimum-energy folded structure recovered from a large MC simulation for each of the N-, Mid-, and C-split systems (Figure 1b) was identical to that reached by the unsplit 48-1 chain (data not shown). However, whereas unsplit 48-1 was characterized by 57 native contacts, the folded state for all split cases was characterized by 58 native contacts, since the additional contact lost upon the excision of the full chain needed to reform between the last amino acid of the first fragment and the first amino acid of the second fragment. Additionally, as a result of this new native contact, the energy values for the N-, Mid-, and C-split systems were −20.43, −20.65, and −20.62kBT, respectively, compared to −20.24kBT for the unsplit 48-mer. It is also worth noting that the split sites for the N-, Mid-, and C-split systems were involved in five, three, and two total native contacts (including the split pair), respectively, that contributed locally to ∼8%, 5%, and 4%, respectively, of the total native energy.

More recently, it was shown that the 48-1 protein folds according to a classical nucleation mechanism, whereby a core of native contacts forms at an early stage of the process and causes the protein to rapidly collapse to more compact nativelike conformations that lead to the fast rearrangement of its residues into the final folded structure 34. These same authors reported that the nucleus was composed of several mostly hydrophobic amino acids that have >60% probability of forming native contacts in the transition-state intermediates; these residues (residues 13, 16, 17, 19–24, 26–31, and 34–47 in Figure 1a) form a core at the center of the folded structure. It is important to note that in the Mid- and C-split cases, folding nuclei residues are well distributed between fragments and participate in a significant number of interchain native contacts (InterC) as seen in Figure 1cd. In contrast, for the N-split case, the folding nuclei residues are disproportionately distributed between fragments and none of these are involved in interchain native contacts (Figure 1cd).
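How the nucleus residues listed above fall on either side of each split site can be tallied directly. The sketch below only counts residues per fragment and does not evaluate interchain contacts, which require the folded structure; the residue list and split positions come from the text, while the names `NUCLEUS`, `SPLIT_SITES`, and `nucleus_partition` are hypothetical.

```python
# Folding-nucleus residues as listed in the text (1-based numbering).
NUCLEUS = {13, 16, 17, *range(19, 25), *range(26, 32), *range(34, 48)}

# Last residue of chain 1 for each split system.
SPLIT_SITES = {"N-split": 16, "Mid-split": 24, "C-split": 32}


def nucleus_partition(split_after):
    """Count nucleus residues in chain 1 and chain 2 for a given split."""
    in_chain1 = sum(1 for residue in NUCLEUS if residue <= split_after)
    return in_chain1, len(NUCLEUS) - in_chain1


for name, site in SPLIT_SITES.items():
    chain1, chain2 = nucleus_partition(site)
    print(f"{name}: {chain1} nucleus residues in chain 1, {chain2} in chain 2")
```

By this count the N-split places almost the entire nucleus on one fragment, whereas the Mid- and C-split share it between both.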


Thermal stability is affected by protein fragmentation and by choice of the split site

The effect of splitting on thermodynamics was studied by determining the transition temperature (Tmax) for the unsplit 48-1 and each multichain system. A plot of heat capacity as a function of temperature revealed a single, strong peak corresponding to the folding temperature (Tmax) for the 48-1, N-, Mid-, and C-split systems (Fig. 2), indicating a single-phase conformational transition. Relative to the single 48-mer chain, all of the two-fragment systems exhibited lower folding temperatures. Normalized transition temperatures were found to be Tmax/Tf=1 for the 48-mer, Tmax/Tf=0.956 for the C-split system, Tmax/Tf=0.937 for the Mid-split system, and Tmax/Tf=0.926 for the N-split system. Thus, whereas the unsplit 48-mer remained stable at a higher temperature, thermal denaturation occurred at lower temperatures when protein folding was reconstituted from multiple fragments. These data also suggest that thermal denaturation was dependent on the choice of split site, as evidenced by the difference in folding temperatures between the entirely symmetric N- and C-split systems.

Figure 2
Thermodynamic analysis of single- and multichain systems. Heat capacities for the 48-1 (▵), N-split (♦), Mid-split (○), and C-split (■) proteins as a function of temperature within the scale of the energy potential implemented in our model. Transition temperature (Tmax), also referred to as the protein's folding temperature, was defined as the temperature at which the heat capacity exhibits a maximum for each system. All temperatures were normalized by 0.27, the folding temperature of the unsplit 48-mer (Tf). Thermodynamic simulations were performed for 5E9 MC steps (a long time relative to folding times).

Whereas we did not explicitly test the effect of protein concentration in this study, the decrease in thermal stability observed in the context of split fragments was consistent with the earlier observation that folding temperature decreased as the concentration of protein chains increased in a system designed to mimic protein aggregation 41. The observed decrease here was related to both 1), an increase in the frequency with which the protein's configurational energies were close to that of the unfolded state (Q≈0); and 2), a decrease in the frequency with which the multichain system explored nativelike configurations (Q≈1.0) during the folding process. For a more detailed analysis, let us assume a pseudoreaction of the following form for the unsplit 48-mer:

(3)
and, for the two-fragment split systems,
(4)
The increase in the number of available nonnative configurations stems from the fact that two chains have more freedom to explore the conformational space separately, and this increases the entropy of the unfolded state. The total entropy change upon folding involved in both the unsplit 48-mer (ΔS48-mer) and the split systems (ΔSsplit) has two main contributions, one due to the reduction of conformational entropy (SConf) and the other due to a reduction of translational entropy (STrans). Using Flory's lattice model to count chain conformations, it can be shown that SConf can be approximated as
(5)
where kB is Boltzmann's constant, N is the number of segments (i.e., amino acids) in a chain, and ρ is the segment density (i.e., the number of amino acids within the volume occupied by the chain) 36. Since the folded state can be taken as a maximally collapsed state (ρ→1) and the unfolded state as an open conformation (ρ→0), a first approximation for the ΔSConf is given by
(6)
and
(7)
where N1 and N2 represent the number of amino acids in each of the two fragments in the split system. Since the length of the unsplit system is N=N1+N2, it follows that the unsplit 48-mer and all the derived split systems entail a roughly similar ΔSConf. The second entropic contribution, STrans, for an ideal molecule (lacking interactions with other molecules) is given by
(8)
where n is the number of molecules and V is the volume accessible to them (e.g., in units of molecular volume) 36. According to Eq. (8), the unsplit 48-mer folding process entails no change of translational entropy, since the number of molecules does not change upon folding (Δn=0) and the entropy is independent of the chain's center of mass. In contrast, the change of translational entropy upon folding for the split processes is given by (with Δn=−1 and assuming V≫1):
(9)
where ξ is a positive constant whose precise value is not important. Hence, when calculating the total entropic difference upon folding between the split and unsplit processes (i.e., Eq. (9)+Eq. (7)−Eq. (6)), a change of is obtained; the fact that this change is always negative indicates that, relative to the folding process of the unsplit 48-mer, the folding process of the split systems results in an overall unfavorable entropic change (i.e.,).

In addition to the entropic differences between the unsplit and split systems, the enthalpy change associated with the folding process (computed from the difference between the average configurational energy of the folded (EF) and unfolded (EU) states) of the split systems is also unfavorable relative to the enthalpy change associated with the folding of the single 48-mer chain. In this case, ΔE=EF−EU increases for the split proteins because the energy of the unfolded state decreases with the number of protein fragments. The lower energy of the unfolded state in split systems can be rationalized by the fact that protein fragmentation allows more freedom for some favorable contacts to form that are not able to form in the unsplit 48-mer (where all amino acids are connected). As shown in Fig. 3, the multichain system can sample configurations around the unfolded state for a range of energies that are not available for the unsplit system. In these plots, free energy landscapes for the unsplit and split chains are projected over the plane of native energy and the fractional nativeness. Note that the configurational energy refers to the total energy of the system (i.e., sum of the configurational energy for chain 1 and chain 2, and that between the two chains). If we assume that the folded state has essentially the same average energy (EF) for the unsplit 48-mer and split systems, the difference in energy between these two processes is always positive, as shown below:

(10)
Collectively, this analysis indicates that the reduced thermodynamic stability of the native state in the split systems arises from two factors: unfavorable enthalpic and unfavorable entropic contributions.

Figure 3
Free energy (ΔA) versus configurational energy at T=0.25 for: 48-1 (▵), N-split (♦), Mid-split (○), and C-split (■) systems. The inset gives a schematic diagram of the free energy of the native (ΔAξ=ATS−AF) and unfolded state (ΔA#=ATS−AU), and the free energy of stabilization (ΔA=ΔA#−ΔAξ). The folded state (F) is defined by the minimum found at the lowest configurational energy, the transition state (TS) is defined by the maximum (peak) of the free-energy curve, and the unfolded state (U) is defined by the minimum found at the highest configurational energy.
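The three quantities in the inset are linked by a one-line identity, written out here as a check, taking the barrier from the unfolded side as ATS minus AU and the barrier from the folded side as ATS minus AF:

```latex
\Delta A^{\#} - \Delta A^{\xi}
  = (A_{TS} - A_{U}) - (A_{TS} - A_{F})
  = A_{F} - A_{U}
  = \Delta A
```

The transition-state free energy cancels, so the free energy of stabilization depends only on the folded and unfolded minima.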

The thermodynamic destabilization of the assembled split chains is also reflected by their higher free energies (ΔA) relative to the free energies observed in the case of the unsplit 48-mer (Fig. 3). ΔA is defined as the difference in free energy between the folded state (AF) and the unfolded state (AU), i.e., ΔA=AF−AU. Using Eq. (10), the difference in free energy changes between the unsplit 48-mer and the split-chain systems can be found by Eq. (11) (i.e., Eq. (13)−Eq. (12)):

(11)
where
(12)
and
(13)
Since we have already argued that all terms on the righthand side of the equation are positive (see Eqs. (9)), the difference in free energy between the 48-mer and the split protein systems (as calculated by Eq. (11)) is always positive as the system goes from the unfolded to the folded state. Importantly, the fact that (ΔAsplit)>(ΔA48mer) indicates that the folding of any split-chain system will have a smaller thermodynamic driving force than its corresponding unsplit system. The relevance of the cage volume (V) on multichain folding can also be appreciated from this simple thermodynamic model.


Kinetics of protein reassembly is sensitive to the split site

To determine the effect of temperature on the relative folding kinetics of the different split protein systems, we calculated the mean folding time for the 48-mer and N-, Mid-, and C-split systems over a wide range of temperatures. The optimum temperature (Topt), defined as the temperature at which a given system folds fastest, was ∼0.23 for the 48-mer, 0.22 for N-split, 0.23 for Mid-split, and 0.22 for C-split (Fig. 4). The MFTs for the N- and Mid-split proteins were approximately three and two times slower, respectively, than that of the 48-mer at their corresponding Topts (Fig. 4). Importantly, the total number of independent runs where the native structure formed within the maximum simulation time (5×10⁸ MC steps) was 500 out of 500, or 100%, for each system. This percentage was defined as the folding frequency (FF). The apparent folding rate (AFR), defined as the ratio of FF to MFT at Topt, was determined to be 1.92×10⁻⁵ for N-split, 3.57×10⁻⁵ for Mid-split, and 6.37×10⁻⁵ for C-split.
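The relative slowdown between split systems follows directly from the ratio of their apparent folding rates. The short sketch below uses the three AFR values reported above; the dictionary and function names are hypothetical, and "percent slower" is taken here to mean the fractional reduction in rate.

```python
# Apparent folding rates at T_opt as reported in the text.
AFR = {"N-split": 1.92e-5, "Mid-split": 3.57e-5, "C-split": 6.37e-5}


def percent_slower(slow_system, fast_system):
    """Reduction in apparent folding rate of one system relative to another."""
    return 100.0 * (1.0 - AFR[slow_system] / AFR[fast_system])


print(round(percent_slower("N-split", "Mid-split")))  # 46
print(round(percent_slower("N-split", "C-split")))    # 70
```

These ratios reproduce the 46% and 70% figures quoted in the discussion of Fig. 4.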

Figure 4
Kinetic analysis of model proteins. Mean folding time (MFT) plotted over a range of temperatures for the 48-mer 48-1 (▵), N-split (♦), Mid-split (○), and C-split (■) proteins. The MFT value corresponds to the MC step at which the folded structure was first observed. Each data point was obtained from an average of 500 simulations. Error values were estimated by finding the difference between the mean folding time calculated from the first 250 simulations and the mean folding time obtained for the last 250 simulations; this value was then divided by 2. The errors are within the symbol size.
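The error estimate described in this caption amounts to comparing the two halves of the run set. A minimal sketch, using a short list of made-up folding times in place of the 500 simulated values (the data and the function name are hypothetical):

```python
from statistics import mean


def mft_with_error(folding_times):
    """Mean folding time and half the difference between half-set means."""
    half = len(folding_times) // 2
    first_half_mean = mean(folding_times[:half])
    second_half_mean = mean(folding_times[half:])
    error = abs(first_half_mean - second_half_mean) / 2
    return mean(folding_times), error


# Hypothetical folding times in MC steps, for illustration only.
example_times = [10, 20, 30, 40]
print(mft_with_error(example_times))
```

With 500 runs, `half` is 250, matching the first-250 versus last-250 comparison in the caption.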

The slower kinetics of fragment reassembly, relative to the folding of a single chain, is not entirely surprising. Intuitively, this could be partially reasoned by the fact that all the residues that need to come into contact to form the folded structure in a single chain are in closer proximity by virtue of their interconnectivity; this is strikingly different from the case of two unconnected chains, where residues that have to associate to enable the formation of native contacts can move independently in space. Thermodynamically, the increase in folding times for the split fragments relative to the folding time of the single 48-mer chain is also not surprising, since it can be argued that the reassembly of split fragments (represented by Eq. (4)) has a larger free-energy barrier (ΔA#=ATS−AU) and thus should be slower than the folding process for the unsplit 48-mer (represented by Eq. (3)). This conjecture can be reached by assuming that the folding “transition state” (TS) is roughly independent of whether or not the protein is split, so that the “folded” state (F) on the righthand sides of Eqs. (3) and (4) can be replaced by the TS. Although the assumption of TS isomorphism is not generally justified, since the TS should depend on the location of the splitting site, it is sensible to expect that the relative decrease of the free energy of the unfolded state (embodied by Eq. (13)) in any two-chain system will also tend to increase the barrier to folding (for the same underlying physical reasons).

Two aspects of the kinetic data shown in Fig. 4 are unexpected and intriguing: 1), the observation that a much smaller change in folding kinetics exists between the 48-mer and the C-split system (relative to the 48-mer and the other split systems), to the extent that there is no significant change in the folding times of these two systems at temperatures neighboring their respective Topts; and 2), the observation that at Topt the N-split folds 46% slower than the Mid-split and 70% slower than the C-split, despite the symmetry in fragment sizes between the N- and C-split systems. These trends prevailed over most of the temperature range tested for each system. It is also worth noting that the fragmentation itself did not dramatically retard folding in the case of the C-split system. This can best be attributed to the spatial constrictions that were placed on this moderately confined system (3-D cage of size 12σ), where a crowded environment relative to open space was created to ensure association between the different fragments. Note that it has been previously shown that, relative to folding in open space, the folding kinetics of this particular unsplit 48-mer remain unchanged when confined within a cubic box of size >10σ unit length 32,33.

The differences in folding kinetics can be rationalized thermodynamically by comparing the differences in free-energy barriers observed between the 48-mer and the different split proteins. For instance, the similarity in folding kinetics between the unsplit 48-mer and the C-split system is reflected in Fig. 3. These data show that, although ΔA# is larger for the C-split than for the unsplit system, these two systems display approximately the same TS dividing surface. Likewise, the much slower folding kinetics of the Mid-split and especially the N-split system relative to the unsplit 48-mer is reflected by the displacement of the TS toward the folded state (i.e., toward states of lower configurational energies, which are more difficult to access). The shift in the transition-state dividing surface observed for the N- and Mid-split systems, but not the C-split, indicates that the reassembly of these systems takes place via a different folding mechanism that appears to be slower. Collectively, our kinetic data and thermodynamic analysis of free energies suggest that in this confined system, the degree of retardation observed as a result of having two separate fragments is modulated by the location of the splitting site with respect to the folding nucleus.


The roughness on the free energy folding landscapes depends on the split site

To further explore the differences underlying the observed trends in MFTs, we plotted the free-energy landscape of the 48-mer, N-, Mid-, and C-split systems at their respective Tmax as a function of the total contact energy and the similarity parameter Q (Fig. 5). In the case of the split-fragment systems, the parameter Q included native contacts that formed within the same chain (intrachain) as well as those formed between different chains (interchain) (Fig. 1, IntraC and InterC, respectively). The free energy was obtained from Eq. (2). Consistent with previous work, the 48-mer exhibited two free-energy minima corresponding to the unfolded (high energy, Q≈0) and folded (low energy, Q≈1) states that were connected by a relatively narrow passage wherein the transition state was identified as a saddle point (Figure 5a). The narrowness of the connecting region between the unfolded and folded states was characteristic of well-designed proteins that exhibit a minimum number of misfolded (i.e., low-energy, low-Q-structure) states 42.

Figure 5
Free-energy landscape for all protein systems at Tmax. Contour plot of the free-energy landscape of the 48-mer (a), N-split (b), Mid-split (c), and C-split (d) at the Tmax for each protein. The lowest elevations are indicated by arrows and appear as darkly shaded regions in the upper left (unfolded basin, U) and lower right (native-state basin, F) of each panel. Q values (x axis) represent the fraction of native contacts, calculated as the number of native contacts formed divided by the total number of native contacts for each sequence (i.e., 57 native contacts for the 48-mer and 58 native contacts for the split proteins). Energy values (y axis) represent the total configurational energy for the system (i.e., the sum of all energies for contacts within chain 1, chain 2, and between the two chains). The folded and unfolded states are represented by the two minima at Q=1.0 and Q≈0.1, respectively. Simulations were performed for a total of 5×10^9 MC steps at equilibrium.
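The construction of such a landscape can be sketched in a few lines. Eq. (2) is not reproduced in this section, so the sketch below assumes the common histogram estimate A(E, Q) = −T ln P(E, Q) in reduced units (k_B = 1). The function name, the binning, and the toy samples are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def free_energy_landscape(energies, q_values, temperature, bins=(20, 20)):
    """Estimate A(E, Q) = -T ln P(E, Q), in units where k_B = 1.

    energies, q_values: equal-length sequences sampled at equilibrium.
    Returns the free-energy grid (shifted so its minimum is 0) and the
    bin edges along the energy and Q axes. Unvisited bins are +inf.
    """
    counts, e_edges, q_edges = np.histogram2d(energies, q_values, bins=bins)
    prob = counts / counts.sum()
    with np.errstate(divide="ignore"):  # log(0) -> -inf for empty bins
        free_energy = -temperature * np.log(prob)
    free_energy -= free_energy[np.isfinite(free_energy)].min()
    return free_energy, e_edges, q_edges

# Toy samples (hypothetical): three folded-like states and one unfolded-like state.
energies = [-20.0, -20.0, -20.0, -1.0]
q_values = [1.0, 1.0, 1.0, 0.1]
landscape, e_edges, q_edges = free_energy_landscape(
    energies, q_values, temperature=1.0, bins=(2, 2)
)
```

With real trajectory data, a contour plot of `landscape` over the bin centers would play the role of one panel of Fig. 5; the bin count controls how much of the roughness near the unfolded basin is resolved.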

The fact that the same lowest-energy configuration state was observed in all the landscapes confirmed that all systems shared the same folded state (Table 1). Moreover, since the additional contact observed in the split systems was favorable, the total configurational energy of these systems decreased with respect to the unsplit 48-mer. It is also important to stress that this folded state remained unique and was only achieved by the reassembly of the two chains; this is implicitly suggested by Figure 5, b–d, where only one low-energy state with a large number of native contacts was observed. The absence of multiple local energy minima in a region of a large number of native contacts supports the observation that single fragments by themselves remained unstructured and high in energy relative to the state they formed upon assembly. These differences separated these landscapes from those observed in a multichain aggregation system 41, where the appearance of low-energy/high-Q states suggested that each chain folded independently and that the formation of interprotein contacts only inhibited their separate folding process and resulted in aggregated, high-energy/low-Q states.

One striking difference between the folding landscape of the 48-mer (Figure 5a) and those of the split proteins (Figure 5, b–d) was the spread of the free-energy minimum neighboring the unfolded state across a wider range of low Q values, closer to the transition-state region of the parent 48-mer protein. This observation was significant, since the extent to which this misfolded (low-energy/low-Q) region was amplified directly correlated with the retardation observed in the kinetics of the reassembly process. That is, whereas the free-energy landscape of the 48-mer did not change significantly when splitting the protein C-terminally (Figure 5a versus 5d), a much more diffusive (i.e., broad and rough) passage from the unfolded to the folded state resulted when splitting the 48-mer near its N-terminus (compare Figure 5, a and b). These data suggest that the efficiency of the reassembly process was decreased by the entrapment of protein fragments in misfolded configurations. Given that slower folding kinetics and a diffusive free-energy landscape were observed for the N-split relative to the C-split system, we hypothesized that the shared distribution of critical core residues between the two fragments is essential for efficient reassembly. This hypothesis is supported by the observation that the distribution pattern of critical core residues is the primary difference between the N-split and C-split fragments.


Productivity of interchain interaction depends on split site

The inefficiency in folding observed for the N-split relative to other systems could have resulted from lack of association between the two fragments (i.e., the fragments never came together) or, if they did associate, from an inability of the fragments to form productive interactions. Since the parameter Q includes both interchain and intrachain native contacts, the free-energy landscapes shown in Fig. 5 do not distinguish between misfolded configurations caused by unproductive interactions between the two fragments and those caused by unproductive interactions within individual fragments. To decouple these effects, we plotted contours of the number of interchain contacts as a function of the similarity parameter, Q, for the N-, Mid-, and C-split systems at their respective Tmax (see Fig. 1 in Supplementary Material). Two highly populated regions were observed in these landscapes. The first region, representing a large number of interchain contacts neighboring the folded state (high InterC, high Q), confirmed that access to the folded state was highly dependent on associations between chains. The second region represented a significant (but not high) number of interchain contacts neighboring the unfolded state (mid-InterC, low Q) and was much more populated for the N-split than for the Mid- and C-split systems. This observation suggests that although associations between fragments occurred for all the systems, the occurrence of these in the N-split case was less likely to result in productive interactions that would lead to the folded state. Taken together, these data support the notion that the efficiency of protein reassembly depends to a great extent on the site at which the protein is split.
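The bookkeeping behind this decoupling can be illustrated with a short sketch that labels each native contact as intrachain or interchain from the split position and then reports Q together with the interchain count for a given configuration. The residue numbering (1-based, with chain 1 holding residues up to the split), the function names, and the example contact list are hypothetical and are not taken from the paper's contact tables.

```python
def classify_contact(i, j, split_after):
    """Label a native contact (i, j) for a protein split after residue `split_after`.

    Residues are numbered from 1; chain 1 holds residues 1..split_after,
    chain 2 holds the rest. Returns 'chain1', 'chain2', or 'inter'.
    """
    i_in_chain1 = i <= split_after
    j_in_chain1 = j <= split_after
    if i_in_chain1 and j_in_chain1:
        return "chain1"
    if not i_in_chain1 and not j_in_chain1:
        return "chain2"
    return "inter"

def contact_summary(native_contacts, formed_contacts, split_after):
    """Return (Q, InterC): fraction of native contacts formed, and how many
    of the formed native contacts bridge the two chains."""
    formed_native = [c for c in native_contacts if c in formed_contacts]
    q = len(formed_native) / len(native_contacts)
    inter = sum(
        1 for (i, j) in formed_native
        if classify_contact(i, j, split_after) == "inter"
    )
    return q, inter

# Hypothetical native contacts for a chain split between residues 27 and 28.
native = [(2, 25), (3, 30), (30, 35), (5, 10)]
formed = {(2, 25), (3, 30)}
q, inter_c = contact_summary(native, formed, split_after=27)
```

Accumulating `(q, inter_c)` pairs over an equilibrium run and histogramming them would give a landscape of the kind described above, in which a well-populated mid-InterC, low-Q region signals association without productive folding.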


A shared critical nucleus “glues” fragments productively during reassembly

Given that the formation of the critical nucleus is key for folding efficiency in the case of a classical nucleation folding mechanism, as is the case for the 48-mer 32, we next analyzed how the dissection of amino acids in the nucleus upon protein fragmentation affected reassembly and folding. Specifically, we plotted landscapes of the critical core residues (Table 2 and Figure 1d) as a function of the total number of native contacts (Q). In the N-split case, a region with a high number of critical contacts and a low number of total native contacts was observed (Figure 6a), but not in the case of the Mid- or C-split proteins (Figure 6, b and c). These data indicate that the more difficult transition to the folded state observed for the N-split protein stems from the formation of the full core in a single chain that trapped the system in a region of highly misfolded states. Further analysis of folding “snapshots” of the N-split system during a typical folding trajectory suggests that intrachain formation of the core leads to preassembly of the largest fragment (chain 2) into a semistable structure that prevents the efficient incorporation of the smallest chain (chain 1) (Figure 6a). This type of isolated preassembled structure was clearly observed in the snapshots (Figure 6a, i and ii), where these chains exhibited minimum association with each other. In stark contrast, the shared formation of the core in the Mid- and C-split systems resulted in transition-state structures of highly interacting fragments that more readily formed the rest of the native contacts, leading to efficient assembly of the folded structure (Figure 6, b and c). However, although these structural configurations were characterized by the formation of interchain native contacts, the part of the fragments that was away from the contact point between the two chains remained highly extended. The structural patterns reflected in these snapshots were repeatedly observed throughout the 10–15 sets of data that we analyzed for each system (data not shown).

The simple thermodynamic model presented above (see Eqs. (5)) was used to rationalize the differences in behavior between the case where one of the two chains preassembles (such as the N-split case) and the case where both chains exhibit more cooperative folding behavior (such as the C-split case). For this analysis, we assume that the “unfolded” state is the one in which the two chain fragments have already collapsed or associated, if strongly inclined to do so. Based on the typical snapshots analyzed for the folding trajectory of the N-split case (Figure 6a), we assume that in the unfolded state, chain 1 (the small chain) has an open conformation (with ΔS_Conf/kB = −N1), chain 2 is prefolded (with ΔS_Conf → 0), and the two chains tend to be separate (with ΔS_Trans/kB = −ξ − ln V); in this case, the total (conformational and translational) entropy can be described as ΔS_N-split/kB = −N1 − ξ − ln V. Consistent with Fig. 6, for the C-split case, we assume that in the unfolded state both chains are not collapsed but tend to be associated (with ΔS_Trans → 0); in this case, the total entropy is purely conformational and can be described as ΔS_C-split = −(N1+N2)kB. Given these expressions for entropy, the free-energy changes upon folding, for the N-split and C-split cases, can be described as

(14)
and
(15)
respectively, so that the difference between these two free-energy changes is
(16)
We assume that the unfolded N-split protein has stronger (more negative) energetic interactions than the unfolded C-split protein; this assumption is consistent with Fig. 3, where we observed that average unfolded-state configurational energies were lower in the case of the N-split than in the case of the C-split system. Additionally, since our simulation results showed that the folded N-split protein was less stable than the folded C-split protein, we conclude that ΔA_N-split > ΔA_C-split. Based on these results, the righthand side of Eq. (16) must be positive. In this case, it appears that the first two (positive) terms in the lefthand side of Eq. (16) dominate, so that ΔA_N-split > ΔA_C-split. It is important to note that this result indicates that the driving force for folding is smaller for the N-split system than for the C-split system. Note, however, the nontrivial interplay of the interactions: 1), the prefolding of a chain fragment favors folding on entropic grounds (since the unfolded states start at lower entropies, i.e., more ordered) but disfavors folding on energetic grounds (since unfolded states are found at lower energies, i.e., closer to the folded state); and 2), the interchain association favors folding on entropic grounds (by reducing translational entropy) but may disfavor it if the associated (unfolded) states are found at very low energies.
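As a sketch of how the two entropy estimates feed into this comparison, assume the Helmholtz relation ΔA = ΔE − TΔS for each system, where ΔE (a symbol introduced here only for illustration) denotes the folded-minus-unfolded configurational energy, and N1, N2, ξ, and V are used as in the entropy expressions above. Under that assumption, the stated entropies would give:

```latex
\begin{align}
\Delta A_{\text{N-split}} &= \Delta E_{\text{N-split}} + k_B T \left( N_1 + \xi + \ln V \right), \\
\Delta A_{\text{C-split}} &= \Delta E_{\text{C-split}} + k_B T \left( N_1 + N_2 \right), \\
\Delta A_{\text{N-split}} - \Delta A_{\text{C-split}} &=
  \left( \Delta E_{\text{N-split}} - \Delta E_{\text{C-split}} \right)
  + k_B T \left( \xi + \ln V \right)
  - k_B T \, N_2 .
\end{align}
```

In this form, the energetic term is positive when the unfolded N-split state lies lower in energy than the unfolded C-split state while both reach folded states of similar energy, and the translational term is positive when ξ + ln V > 0. Only the last term, −k_B T N2, which reflects the prefolded chain 2 having little conformational entropy left to lose, favors the N-split. A positive difference therefore requires the first two terms to outweigh the third.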


A different folding mechanism emerges when core residues are not shared

To obtain insight into the mechanism by which the two fragments assemble, we examined the order in which all native contacts formed over 500 different folding trajectories for each split system. It was observed that the first native contacts to form (i.e., the ones with longer contact waiting time, τf) are those corresponding to the critical core (Fig. 7). Although a precise folding mechanism for the split fragments cannot be inferred by these results alone (i.e., specific transition states are not identified), these data indicate that 1), the same critical core (Table 1) of native contacts seen for the parent protein forms even in the cases when the protein is split; and 2), early formation of this set of native contacts is critical to the folding pathway of the split fragments.

Figure 7
Kinetic evolution of native contact formation in split proteins. The contact waiting time (τf) represents the time a native contact has to wait until complete folding takes place for the N-split (a), Mid-split (b), and C-split (c) proteins. τf was normalized by the total folding time (in MC steps) for each protein and was averaged for each native contact over 500 simulation runs for each system. The number assigned to the NC pair code (x axis) corresponds to the native contact listed in Table 2. Native contacts in chain 1 (●), native contacts in chain 2 (▴), native contacts shared by chains 1 and 2 (■). Encircled symbols represent native contacts that form the critical folding nuclei. Simulations were performed at T=0.25.
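The waiting-time measure described in the caption can be sketched as follows. The sketch assumes that each trajectory records, for every native contact, the MC step at which that contact formed, and that τf is the remaining time to complete folding divided by the total folding time. How the formation step is defined for contacts that form and break repeatedly is not specified here, so that choice is left to the caller. All names and numbers are illustrative.

```python
def waiting_times(formation_steps, folding_step):
    """Normalized waiting time per contact for one trajectory.

    formation_steps: dict mapping contact pair code -> MC step at which it formed.
    folding_step: MC step at which the fully folded state was reached.
    A value near 1 means the contact formed early; near 0 means it formed late.
    """
    return {
        code: (folding_step - step) / folding_step
        for code, step in formation_steps.items()
    }

def mean_waiting_times(runs):
    """Average the normalized waiting time of each contact over many runs.

    runs: list of (formation_steps, folding_step) tuples, one per trajectory.
    """
    collected = {}
    for formation_steps, folding_step in runs:
        for code, tau in waiting_times(formation_steps, folding_step).items():
            collected.setdefault(code, []).append(tau)
    return {code: sum(taus) / len(taus) for code, taus in collected.items()}

# Two hypothetical trajectories tracking pair codes 28 and 58.
runs = [
    ({28: 100, 58: 900}, 1000),
    ({28: 300, 58: 1000}, 1000),
]
tau_f = mean_waiting_times(runs)
```

In this toy input, pair code 28 forms early in both runs and so has a long mean waiting time, whereas pair code 58 forms at or near the end and has a waiting time close to zero.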

Inspection of these data also suggests that the assembly mechanism of the N-split differs significantly from that of the Mid- and C-split cases. For instance, two separate stages were observed in the reassembly process of the N-split protein (Figure 7a). During the first stage (at longer τf), a set of critical native contacts preassembled in the longer chain (chain 2), whereas the smaller chain (chain 1) remained completely unincorporated (no interchain contacts were formed) and unfolded (no native contacts were observed). Then, during a later stage (at shorter τf), the folding process was completed when the smaller chain was incorporated into this already preassembled structure to form the rest of the native contacts. It is important to note that the coassembly stage did not take place until a long time (relative to the total folding time) after the folding process had started. A much different folding process, closer to the one observed for the unsplit protein, was observed for the Mid- and C-split cases. In these systems, both chains coassembled from the beginning of the folding process and jointly proceeded to the folded state. The fact that folding for the N-split system was significantly inhibited (relative to the parent 48-mer protein and to the other two split protein systems) further supports the notion that folding is less efficient when individual folding of one of the fragments (i.e., the nucleus-containing fragment) occurs. The mechanistic insight obtained by this analysis is consistent with our interpretation of the folding landscapes and snapshots shown in Fig. 6.

It is worth noting that in all the split protein cases, contact 58 (where each protein is split; see Table 1) was one of the very last native contacts to form in the folding process, as reflected by the very short τf associated with its formation (Fig. 7). Interestingly, all other native contacts that were locally affected upon protein fragmentation in each system also formed at relatively short τf, toward the very end of the folding process; these contacts included pair codes 9, 23, 28, and 29, pair codes 45 and 37, and pair code 50 for the N-, Mid-, and C-split systems, respectively (Table 1). Additionally, although contact 28 was one of the last to form in the N-split system, this contact was the first to form in both the Mid- and C-split systems. Most noteworthy are the observations that reattachment at (or near) the split site occurred late in all the split folding processes and that formation of interchain nuclei contacts occurred early in the cases of productive folding (i.e., the Mid- and C-split cases). This confirmed that efficient folding depends on the early “gluing” of the fragments specifically by the early interchain formation of folding nuclei contacts. Furthermore, productive folding appears to be independent of the early reconstitution of the original full-length 48-mer sequence, by reattachment of the fragments at the site where they were split.


Importance of folding nuclei in fragment reassembly of a split 64-mer

To test whether a shared folding nucleus contributed to the reassembly efficiency of proteins other than the 48-mer, we analyzed a model 64-mer 41,43,44. It is important to note that, like the 48-mer, this 64-mer also folds according to a classical nucleation mechanism where the core of critical native contacts that forms at an early stage of the folding process is composed of residues 2, 3, and 24–37, which have >90% probability of forming native contacts in the transition-state intermediates 34. Also noteworthy is that in contrast to the folding nucleus of the 48-mer, the amino acid composition of the folding core of the 64-mer is only 50% hydrophobic, and its location is on the side (as opposed to the center) of the folded structure. Additionally, given the larger size of this sequence relative to the 48-mer, it exhibits a more complex and therefore slower pattern of folding, where 81 native contacts characterize the folded structure.

To evaluate the importance of the folding nucleus in the reconstitution of a split 64-mer, two symmetric two-fragment systems, each containing a 27-mer and a 37-mer fragment, were derived (Fig. 8). N-split64 was derived by splitting the 64-mer near the N-terminus of the sequence between residues 27 and 28, whereas C-split64 was derived by splitting the sequence toward the C-terminal end of the sequence between residues 37 and 38. The additional native contact that restores the amino acid connection lost upon excision in each fragmentation case changes the native energy corresponding to the parent 64-mer from −30.13kBT to −29.93kBT and −30.22kBT for the N-split64 and C-split64 systems, respectively. It is important to note that all of the 13 native core contacts of C-split64 form within the larger of the two chains, whereas 8 out of 13 core contacts (>60%) of N-split64 form between the two fragments, and only 5 out of 13 core contacts form within a single chain (two contacts in the shorter chain and three contacts in the longer chain; see Supplementary Material, Table 1S).

Figure 8
Thermodynamic analysis of 64-mer systems. Heat capacities for the 64-mer (▵), N-split64 (■), and C-split64 (○) proteins as a function of temperature. The sequence of this protein is: KEKSTAGRVASGVLDSVACGVLGDIDTLQGSPIAKLKTFYGNKFNDVEASQAHMIRWPNYTLPE. Peaks represent transition temperatures (Tmax) for each system. All temperatures are normalized by 0.27, the folding temperature of the unsplit 64-mer (Tf). Normalized transition temperatures were found to be Tmax/Tf=1 for the 64-mer, Tmax/Tf=0.856 for the N-split64 system, and Tmax/Tf=0.815 for the C-split64 system. Thermodynamic simulations were performed for 5×10^10 MC steps (a long time relative to folding times). (Insets) Schematic representations of the folded structures. Amino acids shown in black or connected by black lines correspond to those in the first fragment (chain 1); amino acids shown in dark gray or connected by dark gray lines correspond to those in the second fragment (chain 2); amino acids shown in light gray correspond to those that form the folding nucleus.

Given the distribution of core contacts for the N- and C-split64, we hypothesized that folding would be more efficient in the case of the N-split64 due to the higher number of interchain critical native contacts in this system relative to the C-split64. Indeed, the resulting MFT, calculated as the average over 100 simulations at T=0.22 (a temperature below the Tmax for the two systems) for protein reassembly within 5×10^8 MC steps, was (3.10±0.48)×10^8 for the N-split64 and (3.99±0.82)×10^8 for the C-split64. Both split cases exhibited slower folding kinetics relative to that of the unsplit 64-mer (MFT=(1.38±0.08)×10^8) at the same temperature. Moreover, the N-split64 protein was observed to reassemble in 66 out of 100 simulation trials (FF=66%) with an AFR of 2.13×10^−7, whereas the C-split64 protein only reassembled 59 times out of 100 trials (FF=59%) with an AFR of 1.48×10^−7, indicating a 31% decrease in folding for the C-split64 relative to the N-split64. It is also important to note that a decrease in thermal stability was observed upon fragmentation of the 64-mer, as reflected by the much lower Tmax of the N-split64 (Tmax=0.23) and C-split64 (Tmax=0.22), relative to that of the unsplit 64-mer (Tmax=0.27). Furthermore, a small and broad Cv peak is observed for the split systems, which implies an increase in near-native conformations. This effect suggests that their thermal transition is less cooperative 42. However, the split systems still follow a two-state mechanism, which is evidenced by the presence of a single Cv peak. Thus, the split 64-mer systems exhibited the same correlation between thermal stability and folding kinetics as was observed for the split 48-mer system.
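The reported rates can be cross-checked with simple arithmetic. The definition of AFR is not restated in this section; the quoted values are numerically consistent with AFR = FF (in percent) divided by MFT, which is the assumption used in this sketch. The function name is illustrative.

```python
def average_folding_rate(percent_folded, mean_folding_time):
    """Assumed relation: AFR = FF (as a percentage) / MFT (in MC steps)."""
    return percent_folded / mean_folding_time

# Values quoted in the text for the split 64-mer systems at T = 0.22.
afr_n = average_folding_rate(66, 3.10e8)   # N-split64
afr_c = average_folding_rate(59, 3.99e8)   # C-split64

# Relative decrease of the C-split64 rate with respect to the N-split64 rate.
percent_decrease = 100 * (1 - afr_c / afr_n)

# Transition temperatures recovered from the normalized values in Fig. 8.
t_f = 0.27
tmax_n = round(0.856 * t_f, 2)
tmax_c = round(0.815 * t_f, 2)

print(f"AFR N-split64: {afr_n:.3g}")
print(f"AFR C-split64: {afr_c:.3g}")
print(f"Decrease: {percent_decrease:.1f}%")
```

Under this assumed relation, the computed rates round to the quoted 2.13×10^−7 and 1.48×10^−7, and the relative decrease comes out at roughly 31%.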



Conclusions

In this work, we used two relatively simple model systems to obtain insight into how the choice of split sites affects the thermodynamics and kinetics of protein reassembly and folding upon fragmentation. Specifically, we focused our studies on understanding how the splitting of critical native contacts, which are located in the critical core that leads to folding, contributes to productive folding. In general, our results showed that the folding process for different split fragment systems is slower relative to the case of an unsplit protein, consistent with experimental observations 10,11,17. Furthermore, the nature and magnitude of reassembly retardation were highly dependent on the distribution of the critical nuclei between the two split fragments. Strategic splitting of the critical core was shown to 1), prevent the permanent preassembly of an individual fragment that would otherwise inhibit the assembly of the two chains; and 2), drive the formation of interchain native contacts that lead to productive folding. The importance of a shared folding core was particularly evident from the slower folding kinetics that were observed in the N-split system, where the critical core was localized in a single fragment, as compared with the C-split system, where the critical core was more equally shared between the two fragments.

Although a precise characterization of the folding mechanism or of the transition states for the N-, Mid-, and C- split systems was not determined, we observed that the concentration of the core native contacts in a single fragment changed the folding mechanism from a cooperative coassembly process, where the two fragments fold together, to a two-step assembly process, where an individual chain preassembles and then forms interchain connections with the second chain. Coassembly was observed for the