| The Gaussian Network Model: Precise Predictions of Residue Fluctuations and Application to Binding Problems Biophysical Journal, Volume 91, Issue 10, 15 November 2006, Pages 3589-3599 Burak Erman Abstract The single-parameter Γ matrix of force constants proposed by the Gaussian Network Model (GNM) is iteratively modified to yield native state fluctuations that agree exactly with experimentally observed values. The resulting optimized Γ matrix contains residue-specific force constants that may be used for an accurate analysis of ligand binding to single or multiple sites on proteins. Bovine Pancreatic Trypsin Inhibitor (BPTI) is used as an example. The calculated off-diagonal elements of the Γ matrix, i.e., the optimized spring constants, obey a Lorentzian distribution. The mean value of the spring constants is ∼−0.1, a value much weaker than −1 of the GNM. Few of the spring constants are positive, indicating repulsion between residues. Residue pairs with large number of neighbors have spring constants around the mean, −0.1. Large negative spring constants are between highly correlated pairs of residues. The fluctuations of the distance between anticorrelated pairs of residues are subject to smaller spring constants. The importance of the number of neighbors of residue pairs in determining the elements of the Γ matrix is pointed out. Allosteric effects of binding on a single or multiple residues of BPTI are illustrated and discussed. Comparison of the predictions of the present model with those of the standard GNM shows that the two models agree at lower modes, i.e., those relating to global motions, but they disagree at higher modes. In the higher modes, the present model points to the important contributions from specific residues whereas the standard GNM fails to do so. Abstract | Full Text | PDF (345 kb) |
| Insights into Correlated Motions and Long-Range Interactions in CheY Derived from Molecular Dynamics Simulations Biophysical Journal, Volume 92, Issue 6, 15 March 2007, Pages 2062-2079 Michael H. Knaggs, Freddie R. Salsbury, Marshall Hall Edgell and Jacquelyn S. Fetrow Abstract CheY is a response regulator protein involved in bacterial chemotaxis. Much is known about its active and inactive conformations, but little is known about the mechanisms underlying long-range interactions or correlated motions. To investigate these events, molecular dynamics simulations were performed on the unphosphorylated, inactive structure from and the active mimic structure (with removed) from . Simulations utilized both sequences in each conformation to discriminate sequence- and structure-specific behavior. The previously identified conformational differences between the inactive and active conformations of the strand-4-helix-4 loop, which are present in these simulations, arise from the structural, and not the sequence, differences. The simulations identify previously unreported structure-specific flexibility features in this loop and sequence-specific flexibility features in other regions of the protein. Both structure- and sequence-specific long-range interactions are observed in the active and inactive ensembles. In the inactive ensemble, two distinct mechanisms based on Thr-87 or Ile-95 rotameric forms, are observed for the previously identified + and rotamer sampling by Tyr-106. These molecular dynamics simulations have thus identified both sequence- and structure-specific differences in flexibility, long-range interactions, and rotameric form of key residues. Potential biological consequences of differential flexibility and long-range correlated motion are discussed. Abstract | Full Text | PDF (2627 kb) |
| Biosynthetic phage display: a novel protein engineering tool combining chemical and genetic diversity Chemistry & Biology, Volume 7, Issue 4, 1 April 2000, Pages 263-274 Mary A Dwyer, Wuyuan Lu, John J Dwyer and Anthony A Kossiakoff Summary Peptide synthesis and phage-display mutagenesis can be combined to produce a very powerful protein engineering tool. The physical properties of the environment surrounding the introduced non-natural residue can be selected for by evaluating all possible combinations of amino acid types at a targeted set of sites using phage-display mutagenesis. Summary | Full Text | PDF (239 kb) |
Copyright © 2007 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 92, Issue 9, 3052-3062, 1 May 2007
doi:10.1529/biophysj.106.099440
Biophysical Theory and Modeling
Ali Rana Atilgan, Deniz Turgut and Canan Atilgan
, 
Faculty of Engineering and Natural Sciences, Sabanci University, 34956 Istanbul, Turkey
Address reprint requests to Canan Atilgan.Interactions, delay, and feedback are the three key characteristics of complex fluids. Using these features, entities at different time and length scales communicate with great accuracy, efficiency, and speed 1. Self-assembling molecular systems are complex fluids with robust and adaptable architectures that incorporate nanoscopic and mesoscopic length scales decisive on their emergent properties over different timescales; proteins, whose internal motions are decisive on their folding, stability, and function, are exquisite examples of these 2,3,4. Proteins regularly experience perturbations in their environment—e.g., in the cell where other small and large molecules are densely and heterogeneously distributed—or in the test tube with only water around, displaying ceaseless fluctuations around their folded structure. Since proteins function efficiently, accurately, and rapidly in the crowded environment of the cell, they are expected to be effective information transmitters by design. Whether the protein is functional or not depends on the size and location of these fluctuations, making use of the concerted action of residues positioned at different regions of the protein 5,6,7,8. It is, therefore, of utmost interest to investigate how proteins respond to changes in the environment under physiological or extreme conditions.
The response of any structure to perturbations depends on its general architecture. For proteins, local, regular packing geometries 9 cannot provide short distances between highly separated residues for fast information transmission. In fact, it has been shown that random packing of hard spheres similar to soft condensed matter is observed in a set of representative proteins 10. Consistent with the concurrent requirement of order and randomness in the protein structure, we 11 and others 12,13,14 have recently shown that proteins are organized within the small-world network topology. A network is referred to as “small-world” if the average shortest path between any two vertices scales logarithmically with the total number of vertices, provided that a high local clustering is observed 15. Such properties are common in many real-world complex networks 16,17, and there are examples from a diverse pool of applications such as the world wide web 18, the internet 19, math coauthorship 20, power grid 15, and residue networks 11.
In recent years, we treated proteins as networks of interacting amino acid pairs to determine their network structure and to identify the adaptive mechanisms in response to perturbations 11,21,22. In fact, similar network treatments of proteins predict collective domain motions, hot spots, and conserved sites 5,23,24,25,26. For these networks we used the term “residue networks” 11 to distinguish them from “protein networks”, which are used to describe systems of interacting proteins 27. We carried out a statistical analysis to show that proteins may be treated within the small-world network topology. We analyzed the local and global properties of these networks with their spatial location in the three-dimensional structure of the protein. We also showed that the shortest path lengths in the residue networks and residue fluctuations are highly correlated. In the past few years, the network treatment of residues in proteins has been adopted to study their various features such as conserved long-range interactions 28, functional residues 29,30, protein-protein association 31, and detection of structural elements 32.
In all these treatments, which have been successful in describing many important properties of proteins and provide insight as to how they function, the identities of individual amino acids are omitted in the calculations. In other words, specificity is taken into account in an indirect manner by assuming that the locations of the different amino acid types along the contour of the polymeric chain have been operational in determining the particular average three-dimensional structure. In this viewpoint, the interactions between different pairs, triplets, etc. of amino acids are assumed to be smeared out, and the observed behavior once the protein is folded is driven by the overall structure. In fact, it has been noted that the residue nonspecific interactions (i.e., those depending on the relative placement of residue pairs, irrespective of their identity) contribute more to the overall stability of proteins by a factor of about five, compared to distinct residue-residue interactions 33. The question remains, however, as to the extent to which such a coarsened description of the folded protein may be used to determine other crucial properties, especially those pertaining to dynamics.
In this study, we further elaborate on the paths between residue pairs, which we term “information pathways”, to understand how they relate to dynamic phenomena in proteins. In particular, it is of interest to understand allosteric interactions mediated through the changes in the dynamic fluctuations around the average structure, both in the presence and absence of conformational changes, the latter having very recently been shown to exist in proteins through a series of NMR experiments 34. To this end, we attribute weights to the links between residue pairs using knowledge-based potentials 35,36 and discuss the relationship between dynamic phenomena occurring in proteins and the optimal path lengths obtained from these weighted networks. We show that it is possible to extract minimal subgraphs from the fully connected networks of residues, where a few designed interactions overlaying the backbone are sufficient to display communication path lengths similar to that of the full residue network. We also demonstrate an application of these ideas using a nonredundant data set of interacting proteins and extract residue pairs on the interface of the receptor/ligand that frequently appear along information pathways.
For the single protein calculations, we utilize 595 single-chain proteins with sequence homology <25% 37 and sizes spanning 54–1021 residues. This protein set is identical to that used in our previous study of residue networks 11. Forty-five of the proteins in the set have fewer than 100 residues, the number of proteins in the ranges (101–200), (201–300), (301–400), and more than 400 residues are 234, 122, 108, and 86, respectively. A list of all the proteins used, their sizes, and the distribution of the sizes in counting bins of size 20 are provided as Supplementary Material . For the receptor-ligand complexes, on the other hand, we use the nonredundant benchmark set of Weng and collaborators developed for testing docking algorithms that contains overall 59 pairs of proteins with 22 enzyme-inhibitor complexes, 19 antibody-antigen complexes, 11 other complexes, and 7 difficult test cases 38. We form spatial residue networks from each of these proteins using their Cartesian coordinates reported in the Protein Data Bank (PDB) 39. In these networks, each residue is represented as a single point centered on the Cβ atoms; the Cα atoms are used for glycine residues. Given the Cβ coordinates of a protein with N residues, a contact map can be formed for a selected cutoff radius, rc, an upper limit for the separation between two residues in contact. This contact map also describes a network that is generated such that if two residues are in contact, then there is a connection (edge) between these two residues (nodes) 11. Thus, the elements of the so-called adjacency matrix, A, are given by
![]() | (1) |
Here, rij is the distance between the ith and jth nodes and H(x) is the Heaviside step function given by H(x)=1 for x>0 and H(x)=0 for x≤0. We adopt the value for the cutoff distance rc=6.7Å, which includes all neighbors within the first coordination shell around a central residue. For the set of 595 proteins here, the Cβ-Cβ radial distribution function was calculated and was displayed as an inset to Fig. 6 of Atilgan et al. 11, where the first, second, third, and fourth coordination shells were shown to be located at 6.7, 8.5, 10.5, and 12.0Å, respectively. The former two show distinct locations, whereas the latter two are interwoven, in agreement with the liquid-like free-volume distributions in proteins 40.
In the case of the weighted residue networks, we assign weights to the edges according to the interresidue interaction “potentials” of Miyazawa and Jernigan 35 and Thomas and Dill 36. These are statistical potentials extracted from a protein database. Both potentials have been extensively tested in threading algorithms 41,42, protein stability, and designability studies 43, folding and binding energetics, as well as amino acid classification 44. The Miyazawa-Jernigan (MJ) potential is based on a set of protein subunit structures exceeding 1600 in number 35. In their treatment of the problem, the system is taken as an equilibrium mixture of unconnected residues and effective solvent atoms. The quasichemical (Bethe) approximation is employed to estimate the contact energies from the numbers of contacts that arise in the sample 45,46. Excluded volume is taken into account by the inclusion of a hard-core repulsion between the residues and a repulsive packing-density-dependent term. The Thomas-Dill (TD) potential, on the other hand, utilizes a much smaller data set of 37 proteins 36. The authors use the folded chain conformation as the reference state, instead of a collection of randomly mixed particles of residues and solvent molecules (in treatments using the Bethe approximation, the problem of reference states has been addressed and corrections have been proposed 47). Thomas and Dill employ an iterative method that extracts pair potentials that incrementally drive the system toward a lowest energy structure that corresponds to the native structure. The main discrepancies in the statistical potentials that result from the approximate treatment or neglect of excluded volume, chain connectivity, and interdependence of pairing frequencies are therefore intrinsically taken care of.
In this study, we have repeated all the calculations using both the MJ and the TD knowledge-based potentials. Despite differences in details, the main results and conclusions reached do not change with the choice of potential. In what follows, we therefore report only results from the TD potentials. We assign eij, the value of the connection between the ith and jth residue, according to the interresidue interaction potential between the ith and jth residue types. Thus, the links connecting the residue pairs with the least favorable interaction energy have the lowest weight, i.e., the highest value.
The networks are classified by local and global parameters, all of which can be derived from the adjacency matrix (Eq. (1)). In the absence of edge weights, the most general descriptors of the network structure are average connectivity of a node and the average shortest path length through the network. The connectivity ki of residue i is the number of neighbors of that residue,
The average connectivity of the network is thus K=〈ki〉, where the brackets denote the average over all nodes. The connectivity distribution of the residue networks follows the Gaussian distribution 11.
The shortest path length,
of a homogeneous network, where the links have no weights, is the average over the minimum number of connections that must be traversed to connect residue pair i and j. In computing the shortest path between a pair of nodes, we make use of the fact that the number of different paths connecting a pair of nodes i and j in n steps is given by (An)ij. Thus, the shortest path between nodes i and j is given by the minimum power, m, of A for which (Am)ij is nonzero.
In the presence of weights, it is possible to redefine the path lengths to take into account the skewing effects of the weights. Weights may be factored into the path lengths using different optimality criteria. We define two criteria for paths between two residues 48,49,50: weak disorder and strong disorder. In the former, the optimal path connecting residues i and j is the length of the path,
that minimizes the sum of the weights along the path. We employ the Dijkstra algorithm to compute the optimal paths in the weak disorder case. In the latter (strong disorder) case,
is the length of the shortest path that minimizes the maximum weight along the path. To obtain
we sort the links in descending order and sequentially remove the links beginning with the highest weight (lowest energy). We continue to remove the links until we find the bottleneck link that will cause the connectivity between vertices i and j to be lost. We then compute the length of this remaining path in terms of the number of intervening links. Note that once the optimal path connecting residues i and j is determined, the path length is simply the sum of the connections along the path; i.e., the step lengths themselves are not weighted.
The characteristic path length of the network is the average,
![]() | (2) |
Proteins may be modeled as networks where a special set of interactions are imposed on chain connectivity, and the extent to which such interactions are specially designed is of interest here. In this study, we generate a variety of networks based on selected proteins. A firm basis for comparing the various networks that may be formed from a given chain with a known contact number is a chain of the same length and the same number of connections for each of its nodes but a randomized set of links between the nodes. To generate such networks, we rewire every residue (node) randomly to another residue chosen from a uniform distribution such that each residue has the same number of neighbors (contact number, ki) while the contact order changes; chain connectivity is preserved by keeping the (i, i+1) contacts intact. For this purpose, we manipulate the adjacency matrix (Eq. (1)) so that the block diagonal of three elements remains unchanged, whereas the rest of the off-diagonal terms are randomly reassigned while maintaining the symmetry of the matrix. Such a network corresponds to the random coil conformation of a polymer chain at an arbitrary point in time. In our previous study, it was established that the proteins have a Poisson distribution of contacts 11. It is also known from network theory that a completely random, Poisson distributed network has the shortest path length 51,
![]() | (3) |
As shown in Fig. 1 (bottom curve) it is verified that the randomized chains behave exactly as expected from a completely random collection of nodes. Average path lengths on the residue networks, Lh, on the other hand, are significantly higher than the randomized networks while still preserving the approximately logarithmic dependence on number of residues, as shown with the filled circles in Fig. 1. The loss of high optimality (i.e., a twofold increase in the shortest path lengths compared to a random network) must be compensated for by the emergence of functionality in the self-organized structure. This exchange is achieved along the scaffold of the nonrandom networks formed by the residues of the proteins.
In the absence of weight information of the links (i.e., for a homogeneous network), Lh is the only parameter we can use as a measure of the distance between nodes in the network with N vertices. In the presence of weights, the heterogeneity of the medium is taken into account; hence different types of optimality criteria can be defined. In the case of weak disorder, the sum of the potentials along the optimal path is minimized to obtain Lw. This can be interpreted as the path that causes the minimum possible total disturbance to the residues along the path. The links with lower potentials are more likely to tolerate the disturbances. In Fig. 1 we display a comparison of shortest paths of homogeneous and weak disordered networks, Lh (shown by a solid circle), and Lw (shown by an open circle), respectively, with that of the random coil. The correlation between the two data sets is excellent, showing that the weighted network in the weak disorder limit behaves similar to the homogeneous network. The optimal path in the strong disorder, on the other hand, is the path that minimizes the maximum of the potentials along the path, which can be interpreted as the shortest path that causes minimal maximum disturbance along the path. As exhibited in Fig. 1 for the strong disorder case (see the data shown by a triangle), Ls is significantly larger than Lw by an average factor of 1.3.
To answer this question, we randomly reassign the potentials attributed to pairs of residues. This is achieved by redistributing the 210 different types of pair potentials in the TD potential matrix; e.g., the original Ala-Thr value may now be assigned to the Val-Glu pairs. As such, the underlying network structure remains unchanged, whereas the optimal paths that are preferred will be affected. The results based on these networks are obtained from five realizations of this randomization.
Two major observations are made for such networks: In the weak disorder limit, the optimal path lengths increase (data not shown), signifying that the residue pairs are specially distributed in the protein network to have similar allotments of weights around a given node, although the values themselves have a large span [−1.8 … 1.5]. Moreover, the strong paths in the weight-randomized networks are longer (shown by the dashed line in Fig. 1), further corroborating this finding with the more stringent constraint that key links minimizing the maximum weight along given paths exist in the folded protein.
We deduce subnetworks from the original residue networks of each of the 595 proteins utilized in this work by systematically removing links that have values higher than a given cutoff value, ecut. Chain connectivity is preserved regardless of the residue types flanking a given bond. We rely on the fact that a protein under external disturbance will have a higher tendency to lose communication through high-energy contacts, whereas the low energy ones will be more cohesive. Thus, although the protein loses the ability to use some paths, it is intrinsically assumed that additional and/or alternate paths do not arise from such disturbances. The shortest path lengths of each of the remaining networks are subsequently computed. Several important cases are presented in Fig. 2 as a function of the random coil of the same size, N, and the same original number of neighbors, K (Eq. (3)). The distribution of the links is shown in the inset to this figure, and the chosen cutoff values are marked on the distribution.
The redundancy in the proteins is such that when approximately half of the nonbonded contacts are disregarded, ecut=0, the system still has the same shortest path length as the full protein that preserves all of its contacts. Upon further removal of links, the paths get longer, and they overlap with Ls at ecut=−0.6 kBT. At this point, only ∼20% of the long-range contacts remain in the subnetworks. Further removal of contacts results in a sudden increase in the shortest path lengths, exemplified by the case of ecut=−1.0 kBT. In Fig. 2, this data set is shown, along with the best-fitting line (slope=22.6, in comparison to the random networks where the slope is 1). Note also that the scatter in the data is extreme, signifying that the logarithmic dependence of path lengths on number of residues is lost.
Another way to observe these data is by plotting the shortest path lengths of the subnetworks as a function of the random coil of the same size, N, and the modified (reduced) number of neighbors, K′ (Fig. 3). Although the path length increases as networks with fewer contacts are formed, as expected, the slope of the best-fitting line remains constant until ecut=−0.6 kBT, i.e., coincides with the original, fully connected network that utilizes the strong paths as was shown in Fig. 2. Further removal of links results in a dramatic increase in the shortest paths, as exemplified by the ecut=−1.0 kBT case (shown by the ♢ symbol; values on the right y axis). Again, it is observed that the scatter in the data increases as the subnetworks approach a linear chain (ecut=−1.8 kBT, i.e., only connectivity remains).
A folded protein needs to perform its function under the constraints that the overall shape is suitable for the task it undertakes while it is not energetically penalized. As a molecular machine, it needs to optimize the time it takes to communicate the incoming information, which, to a first approximation, may be assumed to be linearly dependent on the shortest path length in its residue network. Excluded volume imposes another limit on the size of the molecule. As incoming information, we refer to perturbations that are imparted on one or several of the residues. Changes in the environmental conditions that are reflected on thermodynamic parameters, such as the temperature, will affect the whole system. The latter are not of concern in this study, since these may potentially change the overall network structure.
In the previous section we displayed results that introduce several different perspectives to evaluate how folded proteins are organized to manage their redundancies under suboptimal conditions. Our basis for comparison is the random coil, whereby a Poisson distributed arrangement of residues will always lead to the most optimal path length, given by the analytical relationship of Eq. (3). The random networks constructed for Fig. 1 have the same average number of neighbors as their folded network counterparts (K=6.9, as shown in Atilgan et al. 11). They may be thought of as compact chains that constantly change their partners at different points in time. They, therefore, represent an average over many significantly different configurations, in direct opposition to the case of a folded protein, where residues always keep the same neighbors while they fluctuate in space. For a given amount of excluded volume, decided upon by chain connectivity and the number of long-range contacts, the random coils give a limiting value for how fast information may be spread through the system.
On the other hand, information spreading will take on different forms in a protein depending on the type of local perturbation that is received. Two limiting situations may be distinguished: i), Proteins experience constant random fluctuations from the environment under the usual conditions in which they function; e.g. random collisions with solvent molecules, formation of local hot spots, etc. We classify these perturbations, extensive in number but small in the size of fluctuation they invoke, as “everyday events”. ii), At other times, there will be large perturbations that will be targeted on specific regions, such as those occurring during binding, or approach of a large cellular body to unspecified regions of the protein. We classify these perturbations as “extreme events”. The modes of response from the protein are expected to be different for the two types of events. In other biological systems, such modified reactions to different types of input (global versus pathway-specific noise) were also observed and quantified, e.g., for the variation in the behavior of genetically identical cells 52,53.
In folded proteins, the network structure, equivalent to a coarse graining obtained from the average conformation of the folded structure, is expected to remain nearly the same under both conditions. However, the way the energy will be transmitted throughout the network will differ according to the type of perturbation. Noting that the network is mostly made up of residues held together by nonbonded interactions, the proximity of pairs of residues will not differ; e.g., in many cases, the structure of the bound and unbound forms of a ligand protein to its receptor is less than the experimental uncertainty, as in the case of chymotrypsin inhibitor II 5. However, the transfer of information (energy) along the residue network will only occur if the fluctuations in neighboring residues are correlated along any chosen pathway (as conformational variability increases, the communication of a signal in a molecule, e.g., conductance, occurs with less strength and over a broader range of values, as was recently demonstrated through unique experiments in a series of diphenyl-containing small molecule systems 54). For small perturbations caused by random fluctuations, the correlations between neighboring residues are expected not to be affected, and the most probable pathway for information transmission is the lowest energy one — i.e., Lw. For large impacts (extreme events), although the overall network structure will be preserved due to the pressure exerted by the compact structure of the molecule, the correlations between pairs of residues that are weakly connected to each other will be lost. For the purpose of information propagation, those pathways may be assumed to be nonexistent; i.e., those network connections will be lost.
Usually, the impacts imparted on the protein in its usual environment will be intermediate between the two extremes of small perturbations and large impacts. Our analysis in Fig. 3 shows the operational limits of these molecular machines: We may classify those perturbations that delete nearly half the nonbonded contacts from being functional (i.e., ecut=0.0 kBT) as everyday events. The change in the average path length of the protein relative to the change in that of the randomly rewired counterpart (∂L′/∂L′random, where L′ refers to path length on the subnetworks with the lower average connectivity, K′) remains fixed for that range (Fig. 3). The latter quantity is shown for the whole range of values of ecut in Figure 4a. In the same range of values, the average shortest path length, a size-dependent quantity, is also constant (Figure 4b). The change in the average number of neighbors of a node is also relatively small, decreasing from 6.2 to 5 (Figure 4c). Noting that two of these neighbors are located along the chain, at ecut=0.0 kBT an average node has lost one of its four nonbonded neighbors.
Further removal of the links signifies even larger perturbations to the protein. Up to ∼ecut=−0.6 kBT, where the shortest path lengths on the subnetworks coincide with the strong paths of the original weighted residue networks (marked by the dashed lines in Figure 4ac), the quantity ∂L′/∂L′random shows a decreasing trend (inset to Figure 4a). In the range of ecut=−0.6–0.0 kBT, the increase in L is less than a factor of two for all sizes of proteins, whereas its value increases logarithmically beyond that cutoff (ecut<−0.7 kBT; see Figure 4b). The logarithmic dependence of the path length on chain size is also preserved in this range (see Figure 2 and Figure 3). Note that at this critical value of the cutoff, only about one nonbonded contact per average node remains (Figure 4c).
Representative proteins of α, β, and α/β types are shown in Fig. 5; ribbon diagrams of the structures deposited in the PDB are shown in the first column. All nonbonded contacts (thin lines) superimposed on the backbone (thick lines) are shown in the second column. The strongest links that form the underlying structure and that give the polymeric chain its protein-like path lengths are shown in the third column. Any other interactions added to these create redundancies that contribute to the robustness of the structure so that the protein is able to function under the harsh conditions of the cell. In reality, depending on the size and direction of the impact, some of the weaker links that are located far from that site may be preserved; i.e., we do not expect the links to be lost hierarchically. Nevertheless, the protein's reaction to the perturbation, as measured by the average path lengths of the effectively remaining contacts, is relatively insensitive to size and direction, as long as the most cohesive of the interactions remains intact.
To understand allosteric communication in proteins and how signals that originate in one part of the protein propagate to remote regions, a sequence-based statistical approach was proposed 55,56 and further developed 57. Based on the assumption that residues that are functionally and/or structurally coupled to each other should coevolve, these authors predicted a set of residues that communicate signals through the protein core in a chain-like manner for several protein families. They showed that the predictions were confirmed by mutagenesis studies.
We have studied one of the poxvirus and zinc finger domain representatives (PDB code 1a68) with our structural methodology. We find that the residues that were identified in the sterically connected path as (77→118→149→148) in the statistical study 55 lie on one of the two alternative shortest paths connecting residues 77 and 148 (these are PHE-77→LEU-122→ILE-121→THR-149→PHE-148 and PHE-77→PHE-118→ILE-121→THR-149→PHE-148). The former is slightly more favorable in the weak path due to the lower contact potential between LEU-ILE compared to that of PHE-ILE; the two paths are equivalent in terms of strong path length since the bottleneck contact is brought about by the THR-PHE pair in the last step. Similarly, for the PDZ domain representative, we find that all the optimal paths connecting HIS-76 and ALA-51 go through ILE-31 and PHE-29, where the latter was found to be strongly coupled to perturbations at position 76 (our calculations are based on the structure with PDB code 1bfe, but we use the numbering scheme in Lockless and Ranganathan 55).
In the case of the GPCR family, we study the paths between LYS-296, which was perturbed in the statistical analyses 56,57, and TYR-136, which participates in one of the regions that undergoes a structural change upon light activation in rhodopsin. An allosteric communication exists between them since the two residues are located ∼30Å apart from each other. The weak path connecting the two is five-steps long (LYS-296→SER-298→ALA-124→LEU-128→ALA-132→TYR-136) and mainly utilizes a direct passage from helix 7 to helix 3 that leads to residue 136 on its edge, as a highway. The strong path, on the other hand, is nine-steps long (LYS-296→PHE-293→PHE-294→THR-297→TYR-301→ILE-305→TYR-136→LEU-128→ALA-132→TYR-136) and initially utilizes helix 7, containing the epicenter as a highway, including residue 294, which was identified as functionally important by Suel et al. 56. It then passes on to helix 3 through the single residue MET-257 in helix 6. This residue is known to participate in constitutive activity of the molecule, possibly through interactions with the conserved motif involving residues 302–306 on helix 7 58. As this latter example clearly shows, strong pathways are more descriptive of locating the important residues that participate in allosteric communication.
We postulate that residues, frequently found along the paths connecting a receptor-ligand pair, control the communication between the two proteins. Since binding is an event that requires exchange of large amounts of energy, in this treatment, we use the optimal paths with strong disorder, which emphasize the largest barriers to be crossed along the way. Using the benchmark set of 59 receptor-ligand complexes 38 described in the Methods, we seek the pairs of residues that are most significant in determining key interactions. In the data set, there are ∼2×106 such pathways, giving a statistically significant number for our analysis.
We first record the pairs that form bridges between receptor and ligand for every path that originates in the receptor and ends in the ligand; i.e., residue i is located on the receptor and residue j is located on the ligand and they are connected within the network formed by the protein-protein complex. We then take into account the fact that the propensity of a selected amino acid type being located along the interaction surface significantly varies, as reported by Ma et al. 59; e.g., TRP, ARG, and GLN are the residues that are found most frequently on the interface. Therefore, we normalize the probability of finding a residue pair along the strong pathways, pi↔j. Thus, the conditional probability, p(i↔j|i, j), can be computed by relating the probability that the pair actually appears along the selected paths to the probability of each of the residues in the pair being located on the interface, qi and qj:
![]() | (4) |
| Table 1 Residue pairs that appear in the interface with significantly enhanced probabilities |
| Residue pair (receptor → ligand) | Propensity-normalized probability, p(i↔j|i, j) | Contact potential (units of kBT) | ||
|---|---|---|---|---|
| ILE-VAL | 0.13 | −0.98 | ||
| ALA-ILE | 0.041 | −0.64 | ||
| ILE-ILE | 0.039 | −0.71 | ||
| ILE-LEU | 0.036 | −1.04 | ||
| GLU-LYS | 0.032 | −0.09 | ||
| LEU-ILE | 0.030 | −1.04 | ||
| VAL-VAL | 0.027 | −1.15 | ||
Note that the pairs that are used in the paths consist mostly of the hydrophobic-hydrophobic interaction types, though not necessarily appearing in the order of cohesive energy. In fact, if all amino acids are grouped in the broadest sense of hydrophobic, polar, charged, and GLY, over 42% of all pairs that appear along the interface and that are on the strong paths make hydrophobic-hydrophobic contacts. Furthermore, the interactions need not be symmetric; in fact, the most significant pairs have ILE on the receptor and VAL on the ligand (normalized probability is 0.13). The reverse arrangement does not appear to be significant. A similar observation is also made for the ALA-ILE pair. In contrast, ILE and LEU pairs appear to be involved in specific interactions, though not with a significant preference for the ligand or the receptor.
Although tens of residues appear on the protein-protein interfaces, in general, only a small set of mainly hydrophobic contacts dominate the affinity, as verified by mutagenesis studies (see Atwell et al. 60, and references sited therein). One example ligand-receptor system of α-chymotrypsin in complex with eglin c is shown in Fig. 6. Residue pairs that are on the largest number of pathways between the smaller and the larger polypeptide chains are shown in purple and pink (out of the total of 14,868), respectively. Note that in the large interaction surface of the protein pairs, it is possible to identify four key interactions utilizing three residues on the ligand and four on the substrate. Of these, those involving LEU-45 on eglin c is on two of the most frequently utilized paths connecting the two proteins, participating in 54% of the strong paths connecting the two proteins. This residue is the primary specificity residue of eglin c, whose nature greatly affects the strength and specificity of the association between the inhibitor and the enzyme, as shown by the equilibrium constants determined for the interaction between chymotrypsin and the inhibitors expressed with variants of LEU-45 61. TYR-49, on the other hand, participates in 43% of the strong paths. In fact, in another innovative study, libraries of randomly constructed variants of eglin c at positions 33, 35, 37, 39, 40, 47, 49, 50, 65, and 68 were constructed and screened for activity 62. Therein, position 49 solely emerged as having a significant effect on the binding affinity of eglin c to various substrates.
Another example presented in Fig. 6 is between the chemokine, interleukin 8 (IL-8), and a fragment from its receptor CXCR1 whose structure was determined by NMR 63 and on which we previously conducted an analysis of the structure/function relationships 64. The generally accepted scheme for the binding mode between CXC chemokines that contain an ELR motif and their receptors is that the N-loop residues of the chemokine interact with the N-terminal domain residues of the receptor (site I), and the N-terminal residues of the chemokine interact with the receptor exoloops and transmembrane residues (site II) 65. The structure we study here corresponds to site I, and therein, interactions between the fragment-chemokine residues MET-8-PHE-17, TYR-15-ILE-10, TYR-15-CYS-50 emerge as key bridging pairs on the strong pathways. MET-8 and TYR-15 residues are utilized nearly equally between the two proteins. On the other hand, since TYR-15 is toward the C-terminus of the peptide that leads to the seven transmembrane helices, it is expected that if this study were to be conducted on the full CXCR1, there would be a predominant shift of the distribution toward the contacts made by this residue. In fact, TYR-15 is known from studies on alanine analogs of the fragment to be important for affinity, whereas truncation of residues up to and including MET-8 does not affect binding affinity 63. Thus, this analysis, which emphasizes the strong paths in the case of large perturbations (i.e., binding) experienced by a protein, is successful in identifying the key interactions on the binding surface.
In this study, we have taken a network perspective of analyzing proteins and have shown that residue specificity plays an important role in protein functioning. A statistical analysis on nearly 600 nonhomolog proteins has led us to define key quantities for discriminating the underlying structure that make the protein robust in the environment where it is functional. In particular, the quantity ∂L′/∂L′random (Figure 3a and Figure 4a) has been uniquely defined for finding a critical threshold value to determine the key interactions in the protein, if it is to survive extreme events and to continue carrying out its function. Our results also support the finding that optimized protein sequences can tolerate relatively large random errors in pair potentials obtained using a variety of methodologies 47,66. In fact, none of our conclusions change when the work here is repeated with the pair potentials of Miyazawa and Jernigan 35, rather than that of Thomas and Dill 36, although there are differences in the details of, e.g., Fig. 4.
In this work, we propose that in events involving small perturbations, the total energy to traverse that path will be important and information will flow through the optimal paths with weak disorder, similar to that in the homogeneous network. On the other hand, when large perturbations are involved, such events require surpassing the largest energy barriers along the paths. In this approach, the same pair potentials are used as thermodynamic measures in the former case and as kinetic measures in the latter. If a pair of residues has high contact energy, it may be assumed that the energy that must be used to separate them will be commensurate with its value to a first approximation. Due to other effects such as the size and the shape of the residues, slight modifications may be included. We feel that the network approach used here—involving many approximations as well as a large amount of coarse graining overlaying the atomic structure—has firm grounds. The strong paths, therefore, set a limit on the protein whereby the robust structure resists large amounts of external perturbations and preserves its protein-like communication pathways. Furthermore, using this approach, we have been able to define key contacts that form bridges between interacting proteins (Table 1). Note that nearly half the surface area of the total protein, and therefore an overwhelming number of residue pairs, is involved in protein-protein interactions. As a possible practical application of this approach, the few key contact pairs identified may be used as primary links in recognizing the interaction geometry, overlaid by the energy lowering contributions from the rest of the pairs in solving protein-protein interaction problems.
C.A. acknowledges support by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program (CB/TÜBA-GEBIP/2004-3).
1. (2002). Systems biology: a brief overview. Science 295, 1662–1664. CrossRef | PubMed
2. (1996). The “new” science of “complex fluids”. J. Phys. Chem. 100, 13169–13189. CrossRef | PubMed
3. (2000). Interactions and phase transitions in protein solutions. Curr. Opin. Colloid Interface Sci. 5, 38–43. PubMed
4. (1999). Complexity in chemistry. Science 284, 89–92. CrossRef | PubMed
5. (2001). Coordination topology and stability for the native and binding conformers of chymotrypsin inhibitor 2. Proteins 45, 62–70. CrossRef | PubMed
6. (2002). Relaxation kinetics and the glassiness of proteins: the case of bovine pancreatic trypsin inhibitor. Biophys. J. 83, 699–705. Abstract | Full Text | PDF (138 kb) | PubMed
7. (2005). Relaxation kinetics and the glassiness of native proteins. Coupling of timescales. Biophys. J. 88, 1570–1576. Abstract | Full Text | PDF (137 kb) | CrossRef | PubMed
8. (2000). How soft is a protein? A protein dynamics force constant measured by neutron scattering. Science 288, 1604–1607. CrossRef | PubMed
9. (1997). Ideal architecture of residue packing and its observation in protein structures. Prot. Sci. 6, 2072–2083. PubMed
10. (2000). Voronoi tessellation reveals the condensed matter character of folded proteins. Phys. Rev. Lett. 85, 3532–3535. CrossRef | PubMed
11. (2004). Small-world communication of residues and significance for protein dynamics. Biophys. J. 86, 85–91. Abstract | Full Text | PDF (210 kb) | PubMed
12. (2002). Small-world view of the amino acids that play a key role in protein folding. Phys. Rev. E 65, 061910. PubMed
13. (2003). Uncovering network systems within protein structures. J. Mol. Biol. 334, 781–791. CrossRef | PubMed
14. (2005). Network properties of protein structures. Physica A 346, 27–33. PubMed
15. (1998). Collective dynamics of ‘small-world’ networks. Nature 393, 440–442. CrossRef | PubMed
16. (2000). Models of the small world. J. Stat. Phys. 101, 819–841. PubMed
17. (2001). Exploring complex networks. Nature 410, 268–276. CrossRef | PubMed
18. (1999). Growth dynamics of the world-wide web. Nature 401, 131. CrossRef | PubMed
19. (2002). Large-scale topological and dynamical properties of the Internet. Phys. Rev. E 65, 066130. PubMed
20. (2002). Evolution of the social network of scientific collaborations. Physica A 311, 590–614. PubMed
21. (2001). Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80, 505–515. Abstract | Full Text | PDF (664 kb) | PubMed
22. (2000). Identifying the adaptive mechanism in globular proteins: fluctuations in densely packed regions manipulate flexible parts. J. Chem. Phys. 113, 4454–4464. CrossRef | PubMed
23. (1998). Vibrational dynamics of folded proteins: significance of slow and fast modes in relation to function and stability. Phys. Rev. Lett. 80, 2733–2736. CrossRef | PubMed
24. (1997). Direct evaluation of thermal fluctuations in proteins using a single parameter harmonic potential. Fold. Des. 2, 173–181. CrossRef | PubMed
25. (1999). Collective dynamics of HIV-1 reverse transcriptase: examination of flexibility and enzyme function. J. Mol. Biol. 285, 1023–1037. CrossRef | PubMed
26. (1998). Identification of kinetically hot residues in proteins. Protein Sci. 7, 2522–2532. PubMed
27. (2001). Lethality and centrality in protein networks. Nature 411, 41–42. CrossRef | PubMed
28. (2006). Elucidation of conserved long-range interaction networks in proteins and their significance in determining protein topology. Physica A 368, 595–606. PubMed
29. (2004). Network analysis of protein structures identifies functional residues. J. Mol. Biol. 344, 1135–1146. CrossRef | PubMed
30. (2006). Residue centrality, functionally important residues, and active site shape: analysis of enzyme and non-enzyme families. Prot. Sci. 15, 2120–2128. PubMed
31. (2005). Oligomeric protein structure networks: insights into protein-protein interactions. BMC Bioinformatics 6, 296. CrossRef | PubMed
32. (2006). Graph theoretic properties of networks formed by the Delaunay tessellation of protein structures. Phys. Rev. E 73, 041925. PubMed
33. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophillic interactions at close separation. J. Mol. Biol. 266, 195–214. CrossRef | PubMed
34. (2006). Dynamically driven protein allostery. Nat. Struct. Mol. Biol. 13, 831–838. CrossRef | PubMed
35. (1996). Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623–644. CrossRef | PubMed
36. (1996). An iterative method for extracting energy-like quantities from protein structures. Proc. Natl. Acad. Sci. USA 93, 11628–11633. CrossRef | PubMed
37. (1999). A neural network based predictor of residue contacts in proteins. Protein Eng. 12, 15–21. PubMed
38. (2003). A protein-protein docking benchmark. Proteins 52, 88–91. CrossRef | PubMed
39. (2000). The Protein Data Bank. Nucleic Acids Res. 28, 235–242. CrossRef | PubMed
40. (1999). How much sequence variation can the functions of biological molecules tolerate. In Simplicity and Complexity in Proteins and Nucleic Acids. Frauenfelder, H., Deisenhofer, J., Wolynes, P.G., eds. (Berlin: Dahlem University Press). PubMed
41. (1999). An empirical energy potential with a reference state for protein fold and sequence recognition. Proteins 36, 357–369. CrossRef | PubMed
42. (2004). Three-dimensional threading approach to protein structure recognition. Polymer 45, 687–697. PubMed
43. (2002). Designability of protein structures: a lattice-model study using the Miyazawa-Jernigan matrix. Proteins 49, 403–412. CrossRef | PubMed
44. (2004). A general clustering approach with applications to the Miyazawa-Jernigan potentials for amino acids. Proteins 55, 999–1004. CrossRef | PubMed
45. (1986). An Introduction to Statistical Thermodynamics. (Toronto: Dover). PubMed
46. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18, 534–552. CrossRef | PubMed
47. (1999). Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Prot. Sci. 8, 361–369. PubMed
48. (1994). Optimal paths and domain walls in the strong disorder limit. Phys. Rev. Lett. 72, 2320–2323. CrossRef | PubMed
49. (2003).