| Correction Biophysical Journal, Volume 93, Issue 2, 15 July 2007, Pages 704 Full Text | PDF (164 kb) |
| Correction Biophysical Journal, Volume 90, Issue 2, 15 January 2006, Pages 709 Full Text | PDF (38 kb) |
| Damped-Dynamics Flexible Fitting Biophysical Journal, Volume 95, Issue 7, 1 October 2008, Pages 3192-3207 Julio A. Kovacs, Mark Yeager and Ruben Abagyan Abstract In fitting atomic structures into EM maps, it often happens that the map corresponds to a different conformation of the structure. We have developed a new methodology to handle these situations that preserves the covalent geometry of the structure and allows the modeling of large deformations. The first goal is achieved by working in generalized coordinates (positional and internal coordinates), and the second by avoiding harmonic potentials. Instead, we use dampers (shock absorbers) between every pair of atoms, combined with a force field that attracts the atomic structure toward incompletely occupied regions of the EM map. The trajectory obtained by integrating the resulting equations of motion converges to a conformation that, in our validation cases, was very close to the target atomic structure. Compared to current methods, our approach is more efficient and robust against wrong solutions and to overfitting, and does not require user intervention or subjective decisions. Applications to the computation of transition pathways between known conformers, homology and loop modeling, as well as protein docking, are also discussed. Abstract | Full Text | PDF (1732 kb) |
Copyright © 2007 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 92, Issue 3, 701-716, 1 February 2007
doi:10.1529/biophysj.106.081265
Biophysical Theory and Modeling
C. Brian Roland* and Eugene I. Shakhnovich†,
, 
* Chemical Physics Program, Harvard University, Cambridge, Massachusetts
† Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts
Address reprint requests to E. I. Shakhnovich, Dept. of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138. Tel.: 617-495-4130; Fax: 617-384-9228.Within the field of molecular biophysics, there are unanswered questions regarding the molecular evolution of proteins. In particular: if two protein structural domains have similar structure, do they necessarily have a common ancestor 1? The answer to this question concerns the relative importance of divergent and convergent mechanisms of protein domain fold discovery. In this work, we describe a divergent model of protein evolution in structure space. We compute the properties of this model, and compare the results to the properties of proteomes of real organisms. Before we delineate our model, we first present the experimental observations that check different explanations of protein evolution.
One of the striking observations made in the structural genomics effort is the uneven fold usage within the genomes of individual organisms: the number of SCOP-folds with a given number of genes follows a power law 2,3,4,5,6. For several organisms, the distribution of domain genes among folds was captured by a simple divergent model 3 with a “rich get richer” mechanism. This mechanism presents one explanation of the uneven distribution of domains among folds.
However, it is accepted that alternative models, employing convergent hypotheses, could also explain the uneven fold distribution 7. Dokholyan et al. 8 suggested that in addition to the fold distribution, the structural similarities between domains within a fold also held discriminating evolutionary information. This suggestion is motivated by earlier observations: “Although the definition of discrete fold types is useful for counting purposes, the Dali Domain Dictionary also makes explicit the graded similarities that exist between the members of the same fold type and that may extend beyond the borders of fold categories” 9. The arrangement of the set of Dali domains in structure space (space of all
-traces) is clarified by viewing the set as a network in which the nodes are the domains and the binary interactions between nodes correspond to the pairwise structural similarities (Dali Z-scores). The graph representation of this network is called the protein domain universe graph (PDUG).
In a subsequent work, Deeds et al. 10,11 studied how individual genomes cover structure space through the analysis of organismal PDUGs (oPDUGs). For each of many fully sequenced procaryote genomes, putative structural domain sequences were identified by, and assigned to, Dali domains according to high sequence similarity with a Dali-domain sequence. Thus, the list of Dali domains “present” in the genome was assembled. In an oPDUG, each Dali domain present in the genome is represented as a labeled vertex, and two vertices are connected by an edge if their Dali Z-score exceeds a cutoff. At an organism-independent value of the cutoff, each oPDUG exhibited a nonrandom global connectivity, visible in the degree distribution; in graph terminology, the number of edges emanating from a vertex is called “the degree of a vertex” (see Appendix , The degree in the mathematics of graphs, for further explanation), and the fraction of vertices with a certain degree is called “the degree distribution”. Specifically, it was found that the degree distribution of each oPDUG differed strongly from that of a classical random graph with the same number of vertices and edges (in this type of graph, edges are placed between randomly chosen pairs of vertices). This striking feature of the oPDUGs provides a stringent experimental measurement, even more discerning than the uneven fold distribution, against which to compare divergent and convergent models of protein evolution.
The nonrandom degree distribution of the oPDUGs was captured by a divergent model of oPDUG evolution 8,10. In the following, we attempt to resolve the mechanisms at work in this previous model. In so doing, we formulate and characterize a phenomenological model of oPDUG evolution that 1), reproduces the nonrandom connectivity of real oPDUGs (which have finite size); and 2), allows analytical calculation of the behavior of infinite graphs. Specifically, simulation results suggest that in the limit of large graphs (long evolutionary times), the model generates graphs that are well fit by a power law at high degree, i.e., the graphs are asymptotically scale-free. We analytically compute the scale-free exponent in the large-graph, high-degree limit. Our primary result is that as the graph size increases in our model, the nonrandom connectivity of finite graphs develops into the asymptotically scale-free connectivity of infinite graphs.
First, we attempt to identify important features, in addition to the nonrandom degree distribution, of the studied oPDUGs. In particular, we calculate the distribution of clustering coefficients for four importantly different organisms. Second, we present two models of the time development of oPDUGs, each isolating mechanisms at work in previous models. Specifically, the second of the two models has a mechanism with a type of memory not present in the first. Third, we computer-simulate the models in a finite time regime. We find that, in this time regime, the memory-full mechanism does not outperform the memory-less mechanism with respect to reproducing course features of the real oPDUGs. Fourth, we discuss analytical results for the long-time behavior of the memory-less model, presenting the phase diagram. Last, we discuss the successes and failures of the models, and the implication for the evolution of organisms.
A phylogenetic analysis of many procaryote genomes, in which the characters correspond to the presence or absence of the Dali domains, revealed that the genomes had many domains in common 12. Thus, to study the features generic to all of the oPDUGs, we need only study a small number of examples that are significantly different from each other. We consider a single example from each of four major clades identified in the neighbor-joining phylogeny 12. We study Agrobacterium tumefaciens C58 from the proteobacteria-like clade, Streptococcus pneumonia R6 from the gram-positive-like clade, Campylobacter jejuni from the ϵ-proteobacteria-like clade, and Archaeoglobus fulgidus from the archae-like clade.
Our first aim is to evaluate the evolutionary information content of the oPDUGs. Under the null hypothesis that the oPDUG does not hold signals about the nature of evolution (dynamics or driving forces), we define the CR model for the architecture of an oPDUG. In the CR model, which stands for classical random-graph model, the statistics of an oPDUG is similar to that of a classical random graph 13, with the same number of vertices and edges. If we define a graph ensemble as a collection of graphs such that each graph is prepared according to certain probabilistic rules, a single graph in the CR graph ensemble is made by assigning edges to pairs of vertices at random (uniformly). We call each graph in the CR ensemble a shuffled oPDUG, and compute 50 such shufflings. The assembly of the CR ensemble washes away correlations in the edges shared by any group of three vertices; in this sense, this graph ensemble is devoid of nontrivial evolutionary information.
To compare the CR model to a real oPDUG, we compute the degree distribution and clustering-coefficient distribution as follows. For a given oPDUG, each domain (vertex) has some number of structural neighbors (edges); in graph terminology, the number of edges emanating from a vertex is called the degree (see Appendix , The degree in the mathematics of graphs, and Albert and Barabasi 13). The degree distribution,
![]() | (1) |
,
, the ensemble-averaged degree distribution is denoted by![]() | (2) |
is the graph degree (see Appendix , The degree in the mathematics of graphs), i.e., the average-over-vertices of the degree of a vertex, where E is the total number of edges in the oPDUG. Additionally, for a given oPDUG, we compute the clustering coefficient for each vertex t with degree kt≥2 according to the standard definition![]() | (3) |
is the number of edges between neighbors of vertex t13;
is not defined for vertices with k=0 or k=1. The clustering-coefficient distribution is the histogram of these values, normalized to the total number of
vertices,
(Fig. 2). For the corresponding CR model, we compute the clustering-coefficient distribution for each graph in the ensemble, normalizing each distribution by the value of
for that particular graph. The normalized distribution is averaged over all graphs in the CR ensemble, giving equal weight to each graph; this results in the ensemble-averaged clustering-coefficient distribution shown in Fig. 2.
; the fit was performed on the entire interval spanned by the data 10. D is the graph degree. The curve intersects the abscissa where the fraction of vertices is zero. Also shown is the ensemble-averaged degree distribution for the CR model (circles). Each graph in the CR ensemble has the same number of vertices and edges as the corresponding oPDUG, but the edges are assigned to pairs of vertices at random. The Poisson distribution (dashed line) fits the CR model.
is the total number of vertices with
in the real oPDUG. Also shown is the ensemble-averaged clustering-coefficient distribution for the CR model (circles).For each oPDUG studied, the degree distribution differs strongly from the prediction of the CR model (Fig. 1). By comparison, the degree distribution of a real oPDUG shows 1), many vertices with degree k=0 (orphans); and 2), many vertices with degree
. Since CR represents the null hypothesis, we purport that features 1 and 2 are signatures of evolution (either dynamics or driving forces). To highlight these features and to put them on equal footing, we view the degree distribution in log-log scale with the k axis shifted. Features 1 and 2 result in an approximately straight line in this scale. The simplest function to satisfy this observation is the Pareto law
![]() | (4) |
Additionally, features 1 and 2 can be quantified in terms of the standard deviation (spread) of the degree. Consistent with the generic properties of classical random graphs, CR is well fit by a Poisson distribution
![]() | (5) |
13. For A. tumefaciens, D=2.52 is the graph degree, so for the corresponding CR degree distribution, the mean is 2.5 and the standard deviation is 1.3. Thus, the CR degree distribution is relatively well localized about the mean value, i.e., it has a small spread. However, whereas the oPDUG degree distribution has a mean value of 2.5 as well, the standard deviation is computed to be 6.3. The oPDUG degree distribution has a large spread compared to the corresponding CR degree distribution.For each oPDUG studied, the clustering-coefficient distribution differs strongly from the prediction of the CR model (Fig. 2). In the oPDUGs shown, of the vertices with
, 25–30% have C=1 and 5–15% have C=0 (stars). Fig. 3, which summarizes gross properties of all 59 proteomes, shows that most organisms have a fraction of C=1 and C=0 vertices, consistent with the four clade examples. These results are in strong disagreement with the CR model, suggesting that the clustering-coefficient distribution also contains information about evolution (dynamics or driving forces).
is the number of vertices with degree
, and p(0), p(1) are the values of the clustering-coefficient distribution at C=0 and C=1. The organism index is defined in Deeds et al. (10).In summary, the network statistics of the oPDUGs is nonrandom in the sense that the degree and clustering-coefficient distributions compare poorly with the predictions of the CR model.
In the next section, we complement the null hypothesis by proposing a model for the evolutionary dynamics that generated the oPDUGs. To compare the oPDUGs with this model, we restrict our focus to A. tumefaciens, because it has the largest number of domains. From the key features of the A. tumefaciens oPDUG, we list the following set of criteria for a stochastic-dynamic model of the evolution of an oPDUG. Within the ensemble of graphs generated by the model, there must be a significant fraction of individual graphs with the following features:
The architecture of polypeptide sequence space motivates our models of how a genome’s Dali domains are arranged in structure space, i.e., of its oPDUG. We thus recount the bioinformatic method described by Mirny et al. 14, which attempts to identify the pattern of sequence positions involved in physical interactions important to stabilizing a sequence in its native-state fold. For a given fold, e.g., immunoglobulin, it was found that the positions that are highly conserved within any given family of sequences (high sequence identity) match up, after structural alignment, with the conserved positions in any other family of proteins having the same fold. This matching up appeared in high values of the conservatism-of-conservatism (CoC) for certain sequence positions 14. For a given fold, each family differed in types of residues at the high CoC positions, but had the pattern of high CoC positions in common. These quantitative results are consistent with earlier qualitative observations: “The map [of fold space] further reveals a small number of densely populated regions where the common features are topological motifs at the core of the domains” 9.
Based on this observation, we model the space of all polypeptide sequences by assuming the existence of evolutionarily stable regions called sequence pockets (see Fig. 4) 15. Sequences in a pocket are similar in the sense that there is a pattern of key sequence positions, i.e., the high CoC positions, at which the residue type (in a reduced alphabet) is the same for each sequence. We call this pattern the fold pattern. The residue types at non-key sequence positions provide the degrees of freedom that give volume to the pocket. Each sequence is evolutionarily stable in two respects. First, each sequence in the pocket is useful to the cell because it folds as an independent unit to a well defined native conformation, i.e., each sequence corresponds to a structural domain (for definitions of a structural domain, see Holm and Sander 9). Second, if we make a single residue-type substitution (reduced alphabet) at any one of the positions in the fold pattern, the resulting sequence does not fold independently. A sequence pocket is defined both by the fold pattern and the residue type at each position in the fold pattern.
The pocket of structural domain sequences corresponds to a set of native-state structures that forms a localized cluster in the space of all polypeptide structures (Fig. 5). Thus, a sequence pocket can be labeled with, and well represented by, any member sequence and its associated structure. We take each sequence pocket to precisely define a type of structural domain, i.e., to define a domain type.
If we define an organism’s complete proteome as the collection of all structural domain sequences contained in its genome, we imagine the evolution of its complete proteome as a dynamic process in which previously unoccupied sequence pockets become occupied. In other words, evolution is described as a timeline of domain-type discoveries. Initially, a single seed sequence occupies each of a set of pockets. On some short evolutionary timescale t∘, each seed sequence gives rise to a steady-state population of descendants that are similar in the sense that all are confined to the parent’s pocket (Fig. 6). We call this monophyletic population of sequences in a stable pocket belonging to the same genome a paralog family (group of paralogous sequences), and consider it to be the elementary unit of evolution. On some longer evolutionary timescale, each paralog family gives rise to a sequence that seeds a distinct pocket, thus overcoming the single-residue-substitution barrier that confines the family on short timescales (Fig. 7). In accord with the scenario of divergent evolution, we assume that the distinct pocket is typically unoccupied, i.e., that a new domain type is discovered.
. In a divergent fold discovery, a sequence in some SP spontaneously mutates into a sequence in an SP with a different fold pattern on some long timescale
(see Model, Spontaneous versus fixed mutations).We now accommodate the constraint that the observer can only detect a subset of all sequence pockets, using bioinformatics. Thus, we define the long-timescale evolutionary state of an organism by the occupation states (filled or empty) of detected sequence pockets, i.e., by its structural proteome—which we define as the list of detected domain types found in the organism’s genome (Fig. 5). We now identify the sequence-structure pairs in the Dali Domain Dictionary as an experimental proxy for the set of detected domain types in our caricatured model of protein sequence space.
Although the organization of sequence space into stable pockets is a reasonable model for the high-sequence-identity region of structural-similarity versus sequence-similarity plots 16,17, and the picture of evolution through domain-type discovery is highly plausible, the following question arises: what kinds of domain-type discoveries explain the current evolutionary state (structural proteome) of organisms?
At this point, we distinguish between three kinds of domain-type discovery. An occupied sequence pocket may seed an unoccupied sequence pocket with a fold that is 1), the same as the parent fold—we call this “neutral fold discovery”; 2), different from the parent fold and unoccupied by any paralog family—“divergent fold discovery”; and 3), different from the parent fold but already occupied by paralog families—“convergent fold discovery”.
One explanation of the uneven fold population is the convergent evolutionary hypothesis: there have been many convergent fold discoveries, and the organism either 1), selects domain types corresponding to folds that accommodate many functions; or 2), exhibits a larger number of domain types for folds that can intrinsically accommodate more distinct domain types (high designability). Another explanation, the divergent evolutionary hypothesis, is that convergent fold discoveries are rare or nonexistent. In this work, we attempt to explain the network statistics (degree and clustering-coefficient distributions) of structural proteomes using only neutral and divergent fold discovery.
A physically reasonable assumption for the organization of sequence space is that the spontaneous mutations that cause neutral fold discovery occur much more frequently than the spontaneous mutations that cause divergent fold discovery. That is, the timescale for divergent fold discovery,
, is much longer than the timescale for neutral fold discovery,
, i.e.,
(see Fig. 5 and Fig. 7). We take this inequality to be a key property of our model of protein sequence space (see Model, Sequence pockets and the structural proteome).
However, spontaneous mutations may not survive generations of reproduction of a population of organisms if they do not provide a substantial fitness advantage. It is possible that the spontaneous mutations causing neutral fold discovery are fixed in a population very infrequently, whereas the spontaneous mutations causing divergent fold discovery are fixed very frequently (personal communication 2005, E. J. Deeds). For simplicity, we assume that fixation effects restore the balance between neutral- and divergent-discovery rates. Thus, in the models that follow, there is no explicit distinction between neutral and divergent fold discovery.
According to the evolutionary model above (Sequence pockets and the structural proteome), the evolutionary state of each procaryote is synonymous with its structural proteome. At this point, we choose to reduce the detail of our description by tracking the oPDUG representation of the evolutionary state.
Our first model for the time development of an oPDUG is described as follows. Preliminarily, we define a set of discrete time points
, each interval of time between t−1 and t is a time step labeled with the terminal time point t. We imagine that during the first time step, t=1, of the existence of the structural proteome, a single seed sequence fills some sequence pocket. This sequence pocket is represented by a single vertex on the oPDUG, labeled t=1. At the start of each subsequent time step t, a randomly-chosen occupied sequence pocket spawns an intrepid sequence (due to gene duplication) that is destined to discover a new sequence pocket. At the time of duplication, the intrepid sequence resides in the parent sequence pocket; thus, the structural similarity to the parent’s representative structure is high.
On the graph, the birth of the intrepid sequence is represented by adding a “baby” vertex, with label b indicating the time step b=t of creation, which shares an edge (dashed) with a randomly-chosen parent vertex p (Fig. 8). The baby vertex also shares an edge (dotted) with each neighbor vertex
, where a neighbor is a vertex that shares an edge with the parent. Subsequently, the intrepid sequence undergoes mutations that place it in a new sequence pocket, where it seeds a new paralog family. The representative structure for the new sequence pocket will sometimes be similar to—or different from—the parent pocket’s structure, and thus also to the structures of the parent’s structurally neighboring pockets. This variation in the magnitude of structural divergence is motivated by the diversity of structural similarities in the “twilight” and “midnight” zones of protein sequence alignments 17,18.
(dotted). After a single time step, the
edge is retained with probability
. If the
edge is retained, then each
edge is independently retained with probability
. The edges that survive the divergence (grey), for one particular realization of the probabilistic process, are shown on the right.Thus, on the graph, at the end of its first time step of life, the baby vertex has retained the baby-parent edge with retention probability
(Fig. 8). Given that the baby-parent edge is retained, each baby-neighbor edge is retained with probability
. If the baby-parent edge is not retained, then all of the baby-neighbor edges are lost as well; we call this baby vertex an orphan. The diverged baby vertex corresponds to a newly occupied sequence pocket.
We call this model M0, for zero memory. This designation will be given meaning in the next section.
We now comment on a particularly relevant feature of the model M0. To do so, we define a special type of memory associated with the baby-neighbor edge retention mechanism of duplication-and-divergence models. Consider a particular vertex that parents a baby at time step s and at time step
. Of the parent’s neighbors that exist at time s, the subset that retain edges with baby s may be correlated with the subset that retain edges with baby t. A correlation between the two subsets means that they typically have an unusually large number of vertices in common. If the correlation between these two subsets is nonzero, then we say that the baby-neighbor edge retention mechanism has memory.
Note that the baby-neighbor edge retention mechanism of M0 has zero memory. This feature is the most fundamental distinction between M0 and previous models 8. To evaluate the importance of memory, we considered a second model called M1—for full memory—that contains a mechanism with extremely strong memory. In the Supplementary Material , we show that the memory-full mechanism does not outperform the memory-less mechanism with respect to mimicking the real oPDUGs; we compute the degree and clustering-coefficient distributions of M1. We conclude that memory mechanisms are not essential to explaining the data. Therefore, we present M0 as a simple representation of the evolutionary dynamics that generated the real oPDUGs.
To fit M0 to the A. tumefaciens oPDUG, the “phase diagram” of the model was surveyed. At each of a large number of points in the parameter space—the
-square, a graph ensemble was generated by computer simulation. The graph ensemble, called
, is a collection of M graphs in which each individual graph is independently evolved with parameters
for
. For the results shown, M=103 and N=103.
According to the criteria for agreement with the A. tumefaciens oPDUG listed in Biological Data, Criteria for model development, the parameter values (0.6,0.8) provide best fit for
. The parameters were chosen such that the ensemble-averaged degree distribution has a Pareto-fit exponent that approximates the exponent obtained for A. tumefaciens. Fig. 9 shows that Γ0(0.6,0.8;N) contains individual graphs that mimic the oPDUG degree distribution. Additionally, we compute the Pareto-fit exponent for each member of the ensemble, and plot their distribution (Fig. 10). Thus, Γ0(0.6,0.8;N) contains a significant fraction of individual graphs with exponents similar to that of the real oPDUG. M0 fails, however, to generate the high fraction of orphans observed in the real oPDUG.
minus the standard deviation in
;
is the degree of the most highly connected vertex in graph G.
. Bin size is 0.1.In Fig. 9, we plot the clustering-coefficient distributions of Γ0(0.6,0.8;N). M0 over-represents
, which is the total number of
vertices. This over-representation is probably related to the under-representation of orphans. M0 succeeds in generating a dominant peak at C=1, and a subdominant peak at C=0.
We summarize these results by saying that M0, for finite-graph sizes, captures the nonrandom statistics (degree and clustering-coefficient distributions) of the real oPDUGs. In the next section, we demonstrate that the nonrandom degree distribution of the finite-graph ensemble becomes scale-free at high k for large N.
We study the asymptotic behavior, with increasing N, of the M0 phase diagram. First, we map a nonanalytic transition of the model, using the ensemble-averaged graph degree as a global order parameter. In particular, we find a surface in parameter space at which the graph degree exhibits a divergence for infinite N (Eq. (7)). Second, we obtain the analytical solution of the high-k behavior of the ensemble-averaged degree distribution in the
limit, using a power-law ansatz.
We consider the
ensemble in which each graph
—labeled
with
—is independently evolved under the rules of M0 (see Model, M0: model without memory). We consider
, where
and
are the baby-parent and baby-neighbor edge retention probabilities, respectively. For graph
at time t=N,
is the total number of edges and
is the graph degree, i.e., the average-over-vertices of the single-vertex degree. We note the general relation
.
We define the ensemble-average of
at time t=N as
![]() |
(see Appendix , Ensemble-averaged graph-degree). It can be shown that E(N) obeys the equation![]() | (6) |
![]() | (7) |
. We call the
region of the phase diagram the linear regime, because
grows linearly with N [Fig. 11]. In like manner, we call the line
the linear-log growth regime, and the region
the superlinear regime. We summarize these results by noting that in the linear regime, the ensemble-averaged graph degree approaches a finite value as N increases, whereas in the superlinear regime, it grows without bound (goes to infinity). We call the transition between the two regimes a nonanalytic transition,n because for infinite N,
diverges as the line
is approached from below.Our computer simulations suggest that, as N increases, the ensemble-averaged degree distribution of M0 develops a stable power-law regime at high k(Fig. 12, and we include Fig. 13 for completeness). To characterize M0, we would like to calculate the value of the exponent in the large-N limit. Thus, we again consider the ensemble
, but restrict our attention to
. Given graph
at time t=N, we call
the fraction of vertices with degree k; this is the degree distribution of graph
. The ensemble-averaged degree distribution is then
![]() | (8) |
changes from time t=N to t=N+1, we derive in Appendix , Ensemble-averaged degree distribution, the rate equation![]() | (9) |
, where![]() | (10) |
is termed the attachment kernel in studies of preferential attachment models of complex networks 20. Equation (9) involves no approximations for the model. We look for a solution
that is stationary, meaning that it does not change in time, at
. In this parameter regime, the solution
satisfies the algebraic equation![]() | (11) |
; the fit (dotted line) is shown with vertical shift and indicates the interval on which the fit is performed. Both the quality of fit (r is the correlation coefficient) and the size of the fitting window increase with N, suggesting that the degree distribution is scale-free at high k for large N. Each plot is the ensemble-averaged degree distribution with M=10 and
. The lower limit of the fitting interval is k=4 and the upper limit is the ensemble average of
minus the standard deviation in
;
is the degree of the most highly connected vertex in graph G. The value
in each plot is the ensemble average of
.
.
is the ensemble average of the number of
vertices.As motivated by our simulation results (Fig. 12), we assume that at
, our stationary solution has a power-law form, i.e.,
. This assumption results in an equation for γ as a function of
and
:
![]() | (12) |
, Eq. (12) appears to have two solutions. Comparison with simulation suggests that the physical solution is the larger of the two. For the physical solution, the transition between the
and
regime coincides with the transition between the linear and superlinear growth regimes. The simulation results suggest that the location of the boundary between the two regimes of γ depends on N.
at
, obtained by numerical solution of Eq. (12), and at M=103, N=103, intermediate k, obtained by simulation.We emphasize that although Eq. (12) may be solved anywhere on the
square, we corroborate the power-law ansatz with simulation results only in the specific region near
(Fig. 12).
Our primary result is the development of an asymptotically scale-free model (M0) that is consistent with the statistics of finite procaryote oPDUGs. By comparison with the null model CR, we demonstrated that procaryote oPDUGs hold information about evolution (dynamics or driving forces), and that this information appears in the nonrandom shapes of the degree and clustering-coefficient distributions. We then countered the null model with a dynamical model (M0) that explained the nonrandom statistics in terms of the mechanism of duplication and divergence. Simulation results demonstrated that the high-k (degree) region of the ensemble-averaged degree distribution develops into a power law as N (graph size) increases. We then computed analytically the exponent of the power-law regime in the infinite-graph limit. So, by asymptotically scale-free, we mean that in the large-graph limit, the ensemble-averaged degree distribution is power law at high degree. This work suggests that the statistical features of oPDUGs are consistent with an asymptotically scale-free dynamical process whereby new domains are discovered via duplication and divergence.
We note that, as with any area in scientific research, we cannot prove that models presented in this work, M0 and/or M1, are the only ones that satisfactorily describe the oPDUG. We cannot rule out the existence of alternative models, including convergent ones, which could also explain the peculiarities of the oPDUG. However, this possibility seems somewhat academic at the moment, since no convergent model has been proposed to describe nonrandom, asymptotically scale-free organization of the oPDUGs. In our earlier study 21, we attempted to develop a convergent PDUG model for a much simpler model—lattice proteins—of a protein universe, without apparent success. The analysis presented by Deeds et al. 21 suggests that inventing a satisfactory convergent model of oPDUG is a challenging task. As regards alternative divergent models, they are perhaps possible, but, as this study shows, the simplest memory-less model M0 performs reasonably well, especially in the high-k regime.
Specific results for M0, for finite-graph sizes, are the following. When M0 is fit to the oPDUG of A. tumefaciens, the parameters generate an ensemble of graphs in which the degree distributions of individual graphs are well fit by a Pareto law (at high k) with exponents in the neighborhood of 1.6. Additionally, the normalized clustering-coefficient distributions have a strong peak at C=1 and a strong but subdominant peak at C=0. These results are consistent with our observations of real organisms (see Biological Data).
Specific results for M0, in the infinite-graph limit, are the following. The ensemble of graphs with an asymptotically large number of vertices has two regimes of behavior separated by a sharp phase boundary. In the linear growth regime, the degree
approaches a limit as N increases and for the Pareto-fit exponent of the ensemble-averaged degree distribution,
(for definition of
, see Appendix, The degree in the mathematics of graphs). In the linear-log regime,
grows logarithmically with N and
. In the superlinear regime,
grows algebraically with N and
. Fitting A. tumefaciens to M0 for N=1000 results in parameters
and
, placing this procaryote genome in the linear growth regime. If we assume that these parameters are independent of time, then for the Pareto-fit exponents at time N=1000,
, and at time
,
, we have
(Fig. 14). The structural proteome of A. tumefaciens has not reached its asymptotic behavior because the Pareto-fit exponent of the degree distribution is far from the long-time value. In this sense, A. tumefaciens is young. Fig. 12 confirms that the Pareto-fit exponent γ increases with the graph size.
Our secondary result is that memory mechanisms, as defined in M1: a model with memory, are unnecessary to model the oPDUGs. This result is the key step in distilling certain aspects of the previous model 8 to a form simple enough to extract analytical results about the long-time behavior. The simulation results in the Supplementary Material section shows that this distillation does not compromise accuracy.
Our third main finding, is that our models, both M0 and the memory-full model (M1) discussed in Supplementary Material , fail to quantitatively reproduce the orphan fraction of real procaryote oPDUGs. In both models, only 30% of the vertices are orphans, compared to 60% for the real oPDUG of A. tumefaciens (Fig. 9 and Supplementary Fig. S2). This is the most significant shortcoming of the model. We understand this shortcoming as follows. It is a simple fact that each oPDUG may have some orphan Dali domains that share a Dali fold with other domains, and other orphans that are the sole occupants of folds (singlet folds). Some structural orphans may have been generated by the neutral-fold-discovery mechanism and others by the divergent mechanism (see Model, Divergent versus convergent evolution). That these mechanisms may have distinct timescales may have consequences for the oPDUGs (see Model, Spontaneous versus fixed mutations). Our models do not distinguish between the two kinds of discoveries and we speculate that this is the source of the failure. Work is in progress to evaluate the validity of this conjecture.
We thank Eric Deeds for help at the initial stages of this work and Julius Lucks, Jason Donald, and Edo Kussell for helpful comments on the article.
This work was supported by the National Institutes of Health. C.B.R. also acknowledges support from the Harvard Merit Fellowship.
The single-vertex degree is the number of edges emanating from a particular vertex. The graph degree,
, of a graph G is the average-over-vertices, in the graph, of the single-vertex degree. The ensemble-averaged graph degree,
, is the average of
over all graphs in the ensemble, where the ensemble has been time-evolved for N time steps. These definitions can be found in Albert and Barabasi 13.
We consider the
ensemble in which each graph
—labeled
with
—is independently evolved under the rules of M0 (see Results). We consider
, where
and
are the baby-parent and baby-neighbor edge retention probabilities, respectively. We would like an estimate of the degree of a typical vertex in a typical graph in this ensemble. Therefore, we compute the ensemble-averaged graph degree,
, which we define as follows. Graph
at time N is composed of N vertices, each labeled
, indicating the time step during which that vertex was created and underwent divergence from the parent vertex; if we define a lattice of time points as
, each interval of time between
and t is a time step labeled with the terminal time point t. Each vertex t has some degree that we call
, and for the entire graph we define the graph degree,
, as
![]() | (13) |
of a graph as the average over the M graphs
in the ensemble at time t=N, using the notation![]() | (14) |
is![]() | (15) |
For graph
at time t=N,
is the total number of edges. The change in
from time t=N to t=N+1 must be
![]() | (16) |
. The ensemble average of
is![]() | (17) |
as the subensemble of
in which the baby-parent (
) edge is retained during time-step
. We relabel all graphs so that
, with
, labels the graphs in
. We can write![]() | (18) |
is the number of edges that are retained between the baby and the neighbors of the parent. We must average
over
.We further subdivide
into “structure groups”
, where
and
is the number of graphs in each group. Each group
contains graphs
, where
, and the function
gives the graph label
. In
, the subgraphs
are identical to each other in the sense that for any vertex labeled with birth date t in the first graph, the neighbor-vertex labels are the same as the neighbor-vertex labels for vertex t in the second graph. Note that different structure groups need not have different structure.
Because M is large,
is also large, and within each structure group, there will be many graphs in which the parent vertex has the same label
(giving its birth date). Thus, we further partition each structure group
into “parent groups”
, where
and
is the number of graphs in each group. Each parent group
contains graphs
, where
, such that subgraphs
are identical and baby
is parented by the same vertex in each
;
is the function that relates labels
to label m
![]() | (19) |
over the subensemble
of
is the same as the average over all graphs in the entire ensemble
(here, special attention must be paid to t=N versus t=N+1). We obtain![]() | (20) |
, and in the third line we used the definition of the ensemble average
of
. In the continuous N approximation, this equation becomes![]() | (21) |
The solutions of the equation
![]() | (22) |
![]() | (23) |
and
are the initial values. For our model, we have
,
,
, and
. With these parameter values, our solutions may be expressed in terms of
(defined in Ensemble-averaged degree) to obtain the exact result![]() | (24) |
We again consider the ensemble
as described in Appendix , Ensemble-averaged graph-degree, but restrict our attention to
. We would like to compute a degree distribution, giving the fraction of vertices with degree k, that estimates the degree distribution of a typical graph in
. Thus, we now derive an equation for how the ensemble-averaged degree distribution, called
and defined below, changes between time points
and
.
For graph
at time N,
is the degree of vertex t and
![]() | (25) |
is the Kronecker δ with the above meaning. Thus, the total number of vertices with degree k is![]() | (26) |
, the change in the number of vertices with degree k, from time
to time
, is![]() | (27) |