Article Outline

Article Information

PubMed

Related Articles

  • …more

Copyright © 2007 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 92, Issue 3, 701-716, 1 February 2007

doi:10.1529/biophysj.106.081265

Biophysical Theory and Modeling

Divergent Evolution of a Structural Proteome: Phenomenological Models

C. Brian Roland* and Eugene I. ShakhnovichGo To Corresponding Author 

* Chemical Physics Program, Harvard University, Cambridge, Massachusetts
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts

Address reprint requests to E. I. Shakhnovich, Dept. of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138. Tel.: 617-495-4130; Fax: 617-384-9228.

Abstract

We develop models of the divergent evolution of genomes; the elementary object of sequence dynamics is the protein structural domain. To identify patterns of organization that reflect mechanisms of evolution, we consider the individual genomes of many procaryote species, studying the arrangement of protein structural domains in the space of all polypeptide structures. We view the network of structural similarities as a graph, called the organismal Protein Domain Universe Graph (oPDUG); vertices represent types of structural domains and edges represent strong structural similarity. As observed before, each oPDUG is a highly nonrandom graph, as evidenced in the vertex degree distribution, which resembles a Pareto law (which has a power-law asymptotic). To explain this and other peculiar properties of the oPDUGs, we construct an evolving-graph model for the long-timescale evolutionary dynamics of oPDUGs, containing only divergent mechanisms of domain discovery. The model generates degree distributions (resembling Pareto laws) and clustering-coefficient distributions that are characteristic of the oPDUGs. In the infinite-graph limit, we analytically compute the exponent for specific biological parameters, as well as the complete phase diagram of the model, finding two distinct regimes of domain innovation dynamics. Thus, divergent evolutionary dynamics quantitatively explains the nonrandom organization of oPDUGs.

Introduction

Within the field of molecular biophysics, there are unanswered questions regarding the molecular evolution of proteins. In particular: if two protein structural domains have similar structure, do they necessarily have a common ancestor 1? The answer to this question concerns the relative importance of divergent and convergent mechanisms of protein domain fold discovery. In this work, we describe a divergent model of protein evolution in structure space. We compute the properties of this model, and compare the results to the properties of proteomes of real organisms. Before we delineate our model, we first present the experimental observations that check different explanations of protein evolution.

One of the striking observations made in the structural genomics effort is the uneven fold usage within the genomes of individual organisms: the number of SCOP-folds with a given number of genes follows a power law 2,3,4,5,6. For several organisms, the distribution of domain genes among folds was captured by a simple divergent model 3 with a “rich get richer” mechanism. This mechanism presents one explanation of the uneven distribution of domains among folds.

However, it is accepted that alternative models, employing convergent hypotheses, could also explain the uneven fold distribution 7. Dokholyan et al. 8 suggested that in addition to the fold distribution, the structural similarities between domains within a fold also held discriminating evolutionary information. This suggestion is motivated by earlier observations: “Although the definition of discrete fold types is useful for counting purposes, the Dali Domain Dictionary also makes explicit the graded similarities that exist between the members of the same fold type and that may extend beyond the borders of fold categories” 9. The arrangement of the set of Dali domains in structure space (space of all -traces) is clarified by viewing the set as a network in which the nodes are the domains and the binary interactions between nodes correspond to the pairwise structural similarities (Dali Z-scores). The graph representation of this network is called the protein domain universe graph (PDUG).

In a subsequent work, Deeds et al. 10,11 studied how individual genomes cover structure space through the analysis of organismal PDUGs (oPDUGs). For each of many fully sequenced procaryote genomes, putative structural domain sequences were identified by, and assigned to, Dali domains according to high sequence similarity with a Dali-domain sequence. Thus, the list of Dali domains “present” in the genome was assembled. In an oPDUG, each Dali domain present in the genome is represented as a labeled vertex, and two vertices are connected by an edge if their Dali Z-score exceeds a cutoff. At an organism-independent value of the cutoff, each oPDUG exhibited a nonrandom global connectivity, visible in the degree distribution; in graph terminology, the number of edges emanating from a vertex is called “the degree of a vertex” (see Appendix , The degree in the mathematics of graphs, for further explanation), and the fraction of vertices with a certain degree is called “the degree distribution”. Specifically, it was found that the degree distribution of each oPDUG differed strongly from that of a classical random graph with the same number of vertices and edges (in this type of graph, edges are placed between randomly chosen pairs of vertices). This striking feature of the oPDUGs provides a stringent experimental measurement, even more discerning than the uneven fold distribution, against which to compare divergent and convergent models of protein evolution.

The nonrandom degree distribution of the oPDUGs was captured by a divergent model of oPDUG evolution 8,10. In the following, we attempt to resolve the mechanisms at work in this previous model. In so doing, we formulate and characterize a phenomenological model of oPDUG evolution that 1), reproduces the nonrandom connectivity of real oPDUGs (which have finite size); and 2), allows analytical calculation of the behavior of infinite graphs. Specifically, simulation results suggest that in the limit of large graphs (long evolutionary times), the model generates graphs that are well fit by a power law at high degree, i.e., the graphs are asymptotically scale-free. We analytically compute the scale-free exponent in the large-graph, high-degree limit. Our primary result is that as the graph size increases in our model, the nonrandom connectivity of finite graphs develops into the asymptotically scale-free connectivity of infinite graphs.

First, we attempt to identify important features, in addition to the nonrandom degree distribution, of the studied oPDUGs. In particular, we calculate the distribution of clustering coefficients for four importantly different organisms. Second, we present two models of the time development of oPDUGs, each isolating mechanisms at work in previous models. Specifically, the second of the two models has a mechanism with a type of memory not present in the first. Third, we computer-simulate the models in a finite time regime. We find that, in this time regime, the memory-full mechanism does not outperform the memory-less mechanism with respect to reproducing course features of the real oPDUGs. Fourth, we discuss analytical results for the long-time behavior of the memory-less model, presenting the phase diagram. Last, we discuss the successes and failures of the models, and the implication for the evolution of organisms.


Biological data: arrangement of dali domains in structure space

Genomes under study

A phylogenetic analysis of many procaryote genomes, in which the characters correspond to the presence or absence of the Dali domains, revealed that the genomes had many domains in common 12. Thus, to study the features generic to all of the oPDUGs, we need only study a small number of examples that are significantly different from each other. We consider a single example from each of four major clades identified in the neighbor-joining phylogeny 12. We study Agrobacterium tumefaciens C58 from the proteobacteria-like clade, Streptococcus pneumonia R6 from the gram-positive-like clade, Campylobacter jejuni from the ϵ-proteobacteria-like clade, and Archaeoglobus fulgidus from the archae-like clade.


Failure of the classical random-graph model (CR) for the oPDUG

Our first aim is to evaluate the evolutionary information content of the oPDUGs. Under the null hypothesis that the oPDUG does not hold signals about the nature of evolution (dynamics or driving forces), we define the CR model for the architecture of an oPDUG. In the CR model, which stands for classical random-graph model, the statistics of an oPDUG is similar to that of a classical random graph 13, with the same number of vertices and edges. If we define a graph ensemble as a collection of graphs such that each graph is prepared according to certain probabilistic rules, a single graph in the CR graph ensemble is made by assigning edges to pairs of vertices at random (uniformly). We call each graph in the CR ensemble a shuffled oPDUG, and compute 50 such shufflings. The assembly of the CR ensemble washes away correlations in the edges shared by any group of three vertices; in this sense, this graph ensemble is devoid of nontrivial evolutionary information.

To compare the CR model to a real oPDUG, we compute the degree distribution and clustering-coefficient distribution as follows. For a given oPDUG, each domain (vertex) has some number of structural neighbors (edges); in graph terminology, the number of edges emanating from a vertex is called the degree (see Appendix , The degree in the mathematics of graphs, and Albert and Barabasi 13). The degree distribution,

(1)
is plotted in Fig. 1. For each oPDUG, the number of vertices and edges defines a CR model, for which we compute the ensemble-averaged degree distribution, which is the degree distribution averaged over all graphs in the ensemble (Fig. 1). Specifically, for any ensemble of graphs , , the ensemble-averaged degree distribution is denoted by
(2)
In Fig. 1, N is the number of vertices and is the graph degree (see Appendix , The degree in the mathematics of graphs), i.e., the average-over-vertices of the degree of a vertex, where E is the total number of edges in the oPDUG. Additionally, for a given oPDUG, we compute the clustering coefficient for each vertex t with degree kt≥2 according to the standard definition
(3)
where is the number of edges between neighbors of vertex t13; is not defined for vertices with k=0 or k=1. The clustering-coefficient distribution is the histogram of these values, normalized to the total number of vertices, (Fig. 2). For the corresponding CR model, we compute the clustering-coefficient distribution for each graph in the ensemble, normalizing each distribution by the value of for that particular graph. The normalized distribution is averaged over all graphs in the CR ensemble, giving equal weight to each graph; this results in the ensemble-averaged clustering-coefficient distribution shown in Fig. 2.

Display large version of this figure
Figure 1
oPDUG degree distributions (solid line) for a representative from each of four major clades. N is the total number of vertices in a graph. γ is the exponent of the fit to the Pareto law ; the fit was performed on the entire interval spanned by the data 10. D is the graph degree. The curve intersects the abscissa where the fraction of vertices is zero. Also shown is the ensemble-averaged degree distribution for the CR model (circles). Each graph in the CR ensemble has the same number of vertices and edges as the corresponding oPDUG, but the edges are assigned to pairs of vertices at random. The Poisson distribution (dashed line) fits the CR model.
Display large version of this figure
Figure 2
oPDUG clustering-coefficient distributions (solid line) for a representative from each of four major clades. The bin size is 0.01. For easy viewing, the peaks at C=1 and C=0 are artificially shifted away from plot boundaries. is the total number of vertices with in the real oPDUG. Also shown is the ensemble-averaged clustering-coefficient distribution for the CR model (circles).

For each oPDUG studied, the degree distribution differs strongly from the prediction of the CR model (Fig. 1). By comparison, the degree distribution of a real oPDUG shows 1), many vertices with degree k=0 (orphans); and 2), many vertices with degree . Since CR represents the null hypothesis, we purport that features 1 and 2 are signatures of evolution (either dynamics or driving forces). To highlight these features and to put them on equal footing, we view the degree distribution in log-log scale with the k axis shifted. Features 1 and 2 result in an approximately straight line in this scale. The simplest function to satisfy this observation is the Pareto law

(4)
where parameters A and γ are coupled through the normalization condition. Thus, it has proven convenient to characterize the degree distribution of each oPDUG with a fitted value of γ10. We do not claim, however, that the data is better fit to a Pareto law than every other analytical function. We claim only that the Pareto law is a convenient shorthand for the presence of features 1 and 2, and that γ conveniently gives the relationship between the two.

Additionally, features 1 and 2 can be quantified in terms of the standard deviation (spread) of the degree. Consistent with the generic properties of classical random graphs, CR is well fit by a Poisson distribution

(5)
with parameter D, where the mean value of the distribution is D and the standard deviation is 13. For A. tumefaciens, D=2.52 is the graph degree, so for the corresponding CR degree distribution, the mean is 2.5 and the standard deviation is 1.3. Thus, the CR degree distribution is relatively well localized about the mean value, i.e., it has a small spread. However, whereas the oPDUG degree distribution has a mean value of 2.5 as well, the standard deviation is computed to be 6.3. The oPDUG degree distribution has a large spread compared to the corresponding CR degree distribution.

For each oPDUG studied, the clustering-coefficient distribution differs strongly from the prediction of the CR model (Fig. 2). In the oPDUGs shown, of the vertices with , 25–30% have C=1 and 5–15% have C=0 (stars). Fig. 3, which summarizes gross properties of all 59 proteomes, shows that most organisms have a fraction of C=1 and C=0 vertices, consistent with the four clade examples. These results are in strong disagreement with the CR model, suggesting that the clustering-coefficient distribution also contains information about evolution (dynamics or driving forces).

Display large version of this figure
Figure 3
Gross properties of the 59 procaryote oPDUGs. For a given oPDUG, N is the total number of vertices, is the number of vertices with degree , and p(0), p(1) are the values of the clustering-coefficient distribution at C=0 and C=1. The organism index is defined in Deeds et al. (10).

In summary, the network statistics of the oPDUGs is nonrandom in the sense that the degree and clustering-coefficient distributions compare poorly with the predictions of the CR model.


Criteria for model development

In the next section, we complement the null hypothesis by proposing a model for the evolutionary dynamics that generated the oPDUGs. To compare the oPDUGs with this model, we restrict our focus to A. tumefaciens, because it has the largest number of domains. From the key features of the A. tumefaciens oPDUG, we list the following set of criteria for a stochastic-dynamic model of the evolution of an oPDUG. Within the ensemble of graphs generated by the model, there must be a significant fraction of individual graphs with the following features:

1. the degree distribution is well fit by the Pareto law at low degree
2. the degree distribution is well fit by the Pareto law at high degree
3. the exponent of the Pareto law is ∼1.6
4. the distribution of clustering coefficients has a strong peak at C=1
5. the distribution of clustering coefficients has a strong but subdominant peak at C=0.



Model: the duplication and divergence of paralog families

Sequence pockets and the structural proteome

The architecture of polypeptide sequence space motivates our models of how a genome’s Dali domains are arranged in structure space, i.e., of its oPDUG. We thus recount the bioinformatic method described by Mirny et al. 14, which attempts to identify the pattern of sequence positions involved in physical interactions important to stabilizing a sequence in its native-state fold. For a given fold, e.g., immunoglobulin, it was found that the positions that are highly conserved within any given family of sequences (high sequence identity) match up, after structural alignment, with the conserved positions in any other family of proteins having the same fold. This matching up appeared in high values of the conservatism-of-conservatism (CoC) for certain sequence positions 14. For a given fold, each family differed in types of residues at the high CoC positions, but had the pattern of high CoC positions in common. These quantitative results are consistent with earlier qualitative observations: “The map [of fold space] further reveals a small number of densely populated regions where the common features are topological motifs at the core of the domains” 9.

Based on this observation, we model the space of all polypeptide sequences by assuming the existence of evolutionarily stable regions called sequence pockets (see Fig. 4) 15. Sequences in a pocket are similar in the sense that there is a pattern of key sequence positions, i.e., the high CoC positions, at which the residue type (in a reduced alphabet) is the same for each sequence. We call this pattern the fold pattern. The residue types at non-key sequence positions provide the degrees of freedom that give volume to the pocket. Each sequence is evolutionarily stable in two respects. First, each sequence in the pocket is useful to the cell because it folds as an independent unit to a well defined native conformation, i.e., each sequence corresponds to a structural domain (for definitions of a structural domain, see Holm and Sander 9). Second, if we make a single residue-type substitution (reduced alphabet) at any one of the positions in the fold pattern, the resulting sequence does not fold independently. A sequence pocket is defined both by the fold pattern and the residue type at each position in the fold pattern.

Display large version of this figure
Figure 4
Our model for the organization of polypeptide sequence space. There are stable regions called sequence pockets (SP); inside an SP, all sequences share both the same fold pattern and the same residue type at each position within the pattern. SPs are separated from each other by sequences that do not fold independently.

The pocket of structural domain sequences corresponds to a set of native-state structures that forms a localized cluster in the space of all polypeptide structures (Fig. 5). Thus, a sequence pocket can be labeled with, and well represented by, any member sequence and its associated structure. We take each sequence pocket to precisely define a type of structural domain, i.e., to define a domain type.

Display large version of this figure
Figure 5
Our model for the organization of polypeptide sequence space and structure space. The cartoonized distances in sequence space correlate with the timescale for spontaneous mutation from one sequence to another. In structure space, the representative structures of two SPs are close (far) if the fold pattern is common (distinct). On the far right is an “expanded view” of the region of structure space corresponding to three SPs with common fold pattern; we show structures for all sequences in each SP. Here, we schematize how an organism’s proteome might populate the SPs.

If we define an organism’s complete proteome as the collection of all structural domain sequences contained in its genome, we imagine the evolution of its complete proteome as a dynamic process in which previously unoccupied sequence pockets become occupied. In other words, evolution is described as a timeline of domain-type discoveries. Initially, a single seed sequence occupies each of a set of pockets. On some short evolutionary timescale t, each seed sequence gives rise to a steady-state population of descendants that are similar in the sense that all are confined to the parent’s pocket (Fig. 6). We call this monophyletic population of sequences in a stable pocket belonging to the same genome a paralog family (group of paralogous sequences), and consider it to be the elementary unit of evolution. On some longer evolutionary timescale, each paralog family gives rise to a sequence that seeds a distinct pocket, thus overcoming the single-residue-substitution barrier that confines the family on short timescales (Fig. 7). In accord with the scenario of divergent evolution, we assume that the distinct pocket is typically unoccupied, i.e., that a new domain type is discovered.

Display large version of this figure
Figure 6
A paralog family is a lineage (having common ancestor) of structural domain sequences confined to a sequence pocket. A single seed sequence undergoes duplication-and-divergence events to produce a paralog family on timescale to15, which is the timescale on which a given sequence spontaneously mutates into another sequence within the same SP.
Display large version of this figure
Figure 7
Sequence-pocket discovery events. In a neutral fold discovery, a sequence in some SP spontaneously mutates into a sequence in a different SP with the same fold pattern on some intermediate timescale . In a divergent fold discovery, a sequence in some SP spontaneously mutates into a sequence in an SP with a different fold pattern on some long timescale (see Model, Spontaneous versus fixed mutations).

We now accommodate the constraint that the observer can only detect a subset of all sequence pockets, using bioinformatics. Thus, we define the long-timescale evolutionary state of an organism by the occupation states (filled or empty) of detected sequence pockets, i.e., by its structural proteome—which we define as the list of detected domain types found in the organism’s genome (Fig. 5). We now identify the sequence-structure pairs in the Dali Domain Dictionary as an experimental proxy for the set of detected domain types in our caricatured model of protein sequence space.

Although the organization of sequence space into stable pockets is a reasonable model for the high-sequence-identity region of structural-similarity versus sequence-similarity plots 16,17, and the picture of evolution through domain-type discovery is highly plausible, the following question arises: what kinds of domain-type discoveries explain the current evolutionary state (structural proteome) of organisms?


Divergent versus convergent evolution

At this point, we distinguish between three kinds of domain-type discovery. An occupied sequence pocket may seed an unoccupied sequence pocket with a fold that is 1), the same as the parent fold—we call this “neutral fold discovery”; 2), different from the parent fold and unoccupied by any paralog family—“divergent fold discovery”; and 3), different from the parent fold but already occupied by paralog families—“convergent fold discovery”.

One explanation of the uneven fold population is the convergent evolutionary hypothesis: there have been many convergent fold discoveries, and the organism either 1), selects domain types corresponding to folds that accommodate many functions; or 2), exhibits a larger number of domain types for folds that can intrinsically accommodate more distinct domain types (high designability). Another explanation, the divergent evolutionary hypothesis, is that convergent fold discoveries are rare or nonexistent. In this work, we attempt to explain the network statistics (degree and clustering-coefficient distributions) of structural proteomes using only neutral and divergent fold discovery.


Spontaneous versus fixed mutations

A physically reasonable assumption for the organization of sequence space is that the spontaneous mutations that cause neutral fold discovery occur much more frequently than the spontaneous mutations that cause divergent fold discovery. That is, the timescale for divergent fold discovery, , is much longer than the timescale for neutral fold discovery, , i.e., (see Fig. 5 and Fig. 7). We take this inequality to be a key property of our model of protein sequence space (see Model, Sequence pockets and the structural proteome).

However, spontaneous mutations may not survive generations of reproduction of a population of organisms if they do not provide a substantial fitness advantage. It is possible that the spontaneous mutations causing neutral fold discovery are fixed in a population very infrequently, whereas the spontaneous mutations causing divergent fold discovery are fixed very frequently (personal communication 2005, E. J. Deeds). For simplicity, we assume that fixation effects restore the balance between neutral- and divergent-discovery rates. Thus, in the models that follow, there is no explicit distinction between neutral and divergent fold discovery.


Evolving-graph models

M0: model without memory

According to the evolutionary model above (Sequence pockets and the structural proteome), the evolutionary state of each procaryote is synonymous with its structural proteome. At this point, we choose to reduce the detail of our description by tracking the oPDUG representation of the evolutionary state.

Our first model for the time development of an oPDUG is described as follows. Preliminarily, we define a set of discrete time points , each interval of time between t−1 and t is a time step labeled with the terminal time point t. We imagine that during the first time step, t=1, of the existence of the structural proteome, a single seed sequence fills some sequence pocket. This sequence pocket is represented by a single vertex on the oPDUG, labeled t=1. At the start of each subsequent time step t, a randomly-chosen occupied sequence pocket spawns an intrepid sequence (due to gene duplication) that is destined to discover a new sequence pocket. At the time of duplication, the intrepid sequence resides in the parent sequence pocket; thus, the structural similarity to the parent’s representative structure is high.

On the graph, the birth of the intrepid sequence is represented by adding a “baby” vertex, with label b indicating the time step b=t of creation, which shares an edge (dashed) with a randomly-chosen parent vertex p (Fig. 8). The baby vertex also shares an edge (dotted) with each neighbor vertex , where a neighbor is a vertex that shares an edge with the parent. Subsequently, the intrepid sequence undergoes mutations that place it in a new sequence pocket, where it seeds a new paralog family. The representative structure for the new sequence pocket will sometimes be similar to—or different from—the parent pocket’s structure, and thus also to the structures of the parent’s structurally neighboring pockets. This variation in the magnitude of structural divergence is motivated by the diversity of structural similarities in the “twilight” and “midnight” zones of protein sequence alignments 17,18.

Display large version of this figure
Figure 8
Schematic of M0, showing the divergence of a baby vertex during its first time step of existence. At birth, the baby vertex b shares edges with the parent p (dashed) and each of the neighbors (dotted). After a single time step, the edge is retained with probability . If the edge is retained, then each edge is independently retained with probability . The edges that survive the divergence (grey), for one particular realization of the probabilistic process, are shown on the right.

Thus, on the graph, at the end of its first time step of life, the baby vertex has retained the baby-parent edge with retention probability (Fig. 8). Given that the baby-parent edge is retained, each baby-neighbor edge is retained with probability . If the baby-parent edge is not retained, then all of the baby-neighbor edges are lost as well; we call this baby vertex an orphan. The diverged baby vertex corresponds to a newly occupied sequence pocket.

We call this model M0, for zero memory. This designation will be given meaning in the next section.


M1: a model with memory

We now comment on a particularly relevant feature of the model M0. To do so, we define a special type of memory associated with the baby-neighbor edge retention mechanism of duplication-and-divergence models. Consider a particular vertex that parents a baby at time step s and at time step . Of the parent’s neighbors that exist at time s, the subset that retain edges with baby s may be correlated with the subset that retain edges with baby t. A correlation between the two subsets means that they typically have an unusually large number of vertices in common. If the correlation between these two subsets is nonzero, then we say that the baby-neighbor edge retention mechanism has memory.

Note that the baby-neighbor edge retention mechanism of M0 has zero memory. This feature is the most fundamental distinction between M0 and previous models 8. To evaluate the importance of memory, we considered a second model called M1—for full memory—that contains a mechanism with extremely strong memory. In the Supplementary Material , we show that the memory-full mechanism does not outperform the memory-less mechanism with respect to mimicking the real oPDUGs; we compute the degree and clustering-coefficient distributions of M1. We conclude that memory mechanisms are not essential to explaining the data. Therefore, we present M0 as a simple representation of the evolutionary dynamics that generated the real oPDUGs.




Results

Finite-graph ensembles of M0

To fit M0 to the A. tumefaciens oPDUG, the “phase diagram” of the model was surveyed. At each of a large number of points in the parameter space—the -square, a graph ensemble was generated by computer simulation. The graph ensemble, called , is a collection of M graphs in which each individual graph is independently evolved with parameters for . For the results shown, M=103 and N=103.

According to the criteria for agreement with the A. tumefaciens oPDUG listed in Biological Data, Criteria for model development, the parameter values (0.6,0.8) provide best fit for . The parameters were chosen such that the ensemble-averaged degree distribution has a Pareto-fit exponent that approximates the exponent obtained for A. tumefaciens. Fig. 9 shows that Γ0(0.6,0.8;N) contains individual graphs that mimic the oPDUG degree distribution. Additionally, we compute the Pareto-fit exponent for each member of the ensemble, and plot their distribution (Fig. 10). Thus, Γ0(0.6,0.8;N) contains a significant fraction of individual graphs with exponents similar to that of the real oPDUG. M0 fails, however, to generate the high fraction of orphans observed in the real oPDUG.

Display large version of this figure
Figure 9
The statistics of the model M0 compared to the oPDUG of A. tumefaciens. For the ensemble-averaged degree distribution of M0, the Pareto-law fit with vertical shift is shown, and indicates the interval on which the fit is performed. The lower limit of the fitting interval is k=4 and the upper limit is the ensemble average of minus the standard deviation in ; is the degree of the most highly connected vertex in graph G.
Display large version of this figure
Figure 10
The distribution of single-graph degree distribution exponents within Γ0(0.6, 0.8; N), for M=1000 and N=1000. Each exponent is obtained from a Pareto fit on the interval . Bin size is 0.1.

In Fig. 9, we plot the clustering-coefficient distributions of Γ0(0.6,0.8;N). M0 over-represents , which is the total number of vertices. This over-representation is probably related to the under-representation of orphans. M0 succeeds in generating a dominant peak at C=1, and a subdominant peak at C=0.

We summarize these results by saying that M0, for finite-graph sizes, captures the nonrandom statistics (degree and clustering-coefficient distributions) of the real oPDUGs. In the next section, we demonstrate that the nonrandom degree distribution of the finite-graph ensemble becomes scale-free at high k for large N.


Infinite-graph ensembles of M0

We study the asymptotic behavior, with increasing N, of the M0 phase diagram. First, we map a nonanalytic transition of the model, using the ensemble-averaged graph degree as a global order parameter. In particular, we find a surface in parameter space at which the graph degree exhibits a divergence for infinite N (Eq. (7)). Second, we obtain the analytical solution of the high-k behavior of the ensemble-averaged degree distribution in the limit, using a power-law ansatz.

Ensembled-averaged graph degree

We consider the ensemble in which each graph —labeled with —is independently evolved under the rules of M0 (see Model, M0: model without memory). We consider , where and are the baby-parent and baby-neighbor edge retention probabilities, respectively. For graph at time t=N, is the total number of edges and is the graph degree, i.e., the average-over-vertices of the single-vertex degree. We note the general relation .

We define the ensemble-average of at time t=N as

and similarly define D(N) as the ensemble average of (see Appendix , Ensemble-averaged graph-degree). It can be shown that E(N) obeys the equation
(6)
where we have treated N as a continuous variable. The solutions of this equation can be obtained by an isomorphism with the edge-growth equation for the model of Redner et al. 19; additionally, we give a derivation that includes correction terms in Appendix, Ensemble-averaged graph-degree). The large-N behavior of the solution, given in terms of average graph degree D(N), is
(7)
for . We call the region of the phase diagram the linear regime, because grows linearly with N [Fig. 11]. In like manner, we call the line the linear-log growth regime, and the region the superlinear regime. We summarize these results by noting that in the linear regime, the ensemble-averaged graph degree approaches a finite value as N increases, whereas in the superlinear regime, it grows without bound (goes to infinity). We call the transition between the two regimes a nonanalytic transition,n because for infinite N, diverges as the line is approached from below.

Display large version of this figure
Figure 11
The phase boundary between the linear and superlinear regimes of M0 in the infinite-graph limit.


Ensemble-averaged degree distribution

Our computer simulations suggest that, as N increases, the ensemble-averaged degree distribution of M0 develops a stable power-law regime at high k(Fig. 12, and we include Fig. 13 for completeness). To characterize M0, we would like to calculate the value of the exponent in the large-N limit. Thus, we again consider the ensemble , but restrict our attention to . Given graph at time t=N, we call the fraction of vertices with degree k; this is the degree distribution of graph . The ensemble-averaged degree distribution is then

(8)
To compute how changes from time t=N to t=N+1, we derive in Appendix , Ensemble-averaged degree distribution, the rate equation
(9)
valid for , where
(10)
is the ensemble-averaged change in the number of vertices with degree k, and is termed the attachment kernel in studies of preferential attachment models of complex networks 20. Equation (9) involves no approximations for the model. We look for a solution that is stationary, meaning that it does not change in time, at . In this parameter regime, the solution satisfies the algebraic equation
(11)

Display large version of this figure
Figure 12
The high-k region of the degree distribution (solid line) of M0 is well-fit by a Pareto law ; the fit (dotted line) is shown with vertical shift and indicates the interval on which the fit is performed. Both the quality of fit (r is the correlation coefficient) and the size of the fitting window increase with N, suggesting that the degree distribution is scale-free at high k for large N. Each plot is the ensemble-averaged degree distribution with M=10 and . The lower limit of the fitting interval is k=4 and the upper limit is the ensemble average of minus the standard deviation in ; is the degree of the most highly connected vertex in graph G. The value in each plot is the ensemble average of .
Display large version of this figure
Figure 13
The clustering-coefficient distribution of M0 at increasing values of N. Each plot is the ensemble-averaged distribution with M=10 and . is the ensemble average of the number of vertices.

As motivated by our simulation results (Fig. 12), we assume that at , our stationary solution has a power-law form, i.e., . This assumption results in an equation for γ as a function of and :

(12)
Fig. 14 compares the results of the numerical solution of Eq. (12) to simulation. At moderately low , Eq. (12) appears to have two solutions. Comparison with simulation suggests that the physical solution is the larger of the two. For the physical solution, the transition between the and regime coincides with the transition between the linear and superlinear growth regimes. The simulation results suggest that the location of the boundary between the two regimes of γ depends on N.

Display large version of this figure
Figure 14
The Pareto-fit exponent of the ensemble-averaged degree distribution of at , obtained by numerical solution of Eq. (12), and at M=103, N=103, intermediate k, obtained by simulation.

We emphasize that although Eq. (12) may be solved anywhere on the square, we corroborate the power-law ansatz with simulation results only in the specific region near (Fig. 12).



Conclusions

Our primary result is the development of an asymptotically scale-free model (M0) that is consistent with the statistics of finite procaryote oPDUGs. By comparison with the null model CR, we demonstrated that procaryote oPDUGs hold information about evolution (dynamics or driving forces), and that this information appears in the nonrandom shapes of the degree and clustering-coefficient distributions. We then countered the null model with a dynamical model (M0) that explained the nonrandom statistics in terms of the mechanism of duplication and divergence. Simulation results demonstrated that the high-k (degree) region of the ensemble-averaged degree distribution develops into a power law as N (graph size) increases. We then computed analytically the exponent of the power-law regime in the infinite-graph limit. So, by asymptotically scale-free, we mean that in the large-graph limit, the ensemble-averaged degree distribution is power law at high degree. This work suggests that the statistical features of oPDUGs are consistent with an asymptotically scale-free dynamical process whereby new domains are discovered via duplication and divergence.

We note that, as with any area in scientific research, we cannot prove that models presented in this work, M0 and/or M1, are the only ones that satisfactorily describe the oPDUG. We cannot rule out the existence of alternative models, including convergent ones, which could also explain the peculiarities of the oPDUG. However, this possibility seems somewhat academic at the moment, since no convergent model has been proposed to describe nonrandom, asymptotically scale-free organization of the oPDUGs. In our earlier study 21, we attempted to develop a convergent PDUG model for a much simpler model—lattice proteins—of a protein universe, without apparent success. The analysis presented by Deeds et al. 21 suggests that inventing a satisfactory convergent model of oPDUG is a challenging task. As regards alternative divergent models, they are perhaps possible, but, as this study shows, the simplest memory-less model M0 performs reasonably well, especially in the high-k regime.

Specific results for M0, for finite-graph sizes, are the following. When M0 is fit to the oPDUG of A. tumefaciens, the parameters generate an ensemble of graphs in which the degree distributions of individual graphs are well fit by a Pareto law (at high k) with exponents in the neighborhood of 1.6. Additionally, the normalized clustering-coefficient distributions have a strong peak at C=1 and a strong but subdominant peak at C=0. These results are consistent with our observations of real organisms (see Biological Data).

Specific results for M0, in the infinite-graph limit, are the following. The ensemble of graphs with an asymptotically large number of vertices has two regimes of behavior separated by a sharp phase boundary. In the linear growth regime, the degree approaches a limit as N increases and for the Pareto-fit exponent of the ensemble-averaged degree distribution, (for definition of , see Appendix, The degree in the mathematics of graphs). In the linear-log regime, grows logarithmically with N and . In the superlinear regime, grows algebraically with N and . Fitting A. tumefaciens to M0 for N=1000 results in parameters and , placing this procaryote genome in the linear growth regime. If we assume that these parameters are independent of time, then for the Pareto-fit exponents at time N=1000, , and at time , , we have (Fig. 14). The structural proteome of A. tumefaciens has not reached its asymptotic behavior because the Pareto-fit exponent of the degree distribution is far from the long-time value. In this sense, A. tumefaciens is young. Fig. 12 confirms that the Pareto-fit exponent γ increases with the graph size.

Our secondary result is that memory mechanisms, as defined in M1: a model with memory, are unnecessary to model the oPDUGs. This result is the key step in distilling certain aspects of the previous model 8 to a form simple enough to extract analytical results about the long-time behavior. The simulation results in the Supplementary Material section shows that this distillation does not compromise accuracy.

Our third main finding, is that our models, both M0 and the memory-full model (M1) discussed in Supplementary Material , fail to quantitatively reproduce the orphan fraction of real procaryote oPDUGs. In both models, only 30% of the vertices are orphans, compared to 60% for the real oPDUG of A. tumefaciens (Fig. 9 and Supplementary Fig. S2). This is the most significant shortcoming of the model. We understand this shortcoming as follows. It is a simple fact that each oPDUG may have some orphan Dali domains that share a Dali fold with other domains, and other orphans that are the sole occupants of folds (singlet folds). Some structural orphans may have been generated by the neutral-fold-discovery mechanism and others by the divergent mechanism (see Model, Divergent versus convergent evolution). That these mechanisms may have distinct timescales may have consequences for the oPDUGs (see Model, Spontaneous versus fixed mutations). Our models do not distinguish between the two kinds of discoveries and we speculate that this is the source of the failure. Work is in progress to evaluate the validity of this conjecture.


Acknowledgments

We thank Eric Deeds for help at the initial stages of this work and Julius Lucks, Jason Donald, and Edo Kussell for helpful comments on the article.

This work was supported by the National Institutes of Health. C.B.R. also acknowledges support from the Harvard Merit Fellowship.

Appendix


The degree in the mathematics of graphs

The single-vertex degree is the number of edges emanating from a particular vertex. The graph degree, , of a graph G is the average-over-vertices, in the graph, of the single-vertex degree. The ensemble-averaged graph degree, , is the average of over all graphs in the ensemble, where the ensemble has been time-evolved for N time steps. These definitions can be found in Albert and Barabasi 13.


Ensemble-averaged graph-degree

Equation-of-motion for E(N): derivation

We consider the ensemble in which each graph —labeled with —is independently evolved under the rules of M0 (see Results). We consider , where and are the baby-parent and baby-neighbor edge retention probabilities, respectively. We would like an estimate of the degree of a typical vertex in a typical graph in this ensemble. Therefore, we compute the ensemble-averaged graph degree, , which we define as follows. Graph at time N is composed of N vertices, each labeled , indicating the time step during which that vertex was created and underwent divergence from the parent vertex; if we define a lattice of time points as , each interval of time between and t is a time step labeled with the terminal time point t. Each vertex t has some degree that we call , and for the entire graph we define the graph degree, , as

(13)
which is the average-over-vertices of the single-vertex degree. We now define the ensemble average for some property of a graph as the average over the M graphs in the ensemble at time t=N, using the notation
(14)
Thus, the ensemble-average of is
(15)
and to compute this, we first consider the total number of edges in a graph.

For graph at time t=N, is the total number of edges. The change in from time t=N to t=N+1 must be

(16)
because the only mechanism by which the edge number changes is by the addition of the baby vertex during time-step . The ensemble average of is
(17)
We define as the subensemble of in which the baby-parent () edge is retained during time-step . We relabel all graphs so that , with , labels the graphs in . We can write
(18)
where is the number of edges that are retained between the baby and the neighbors of the parent. We must average over .

We further subdivide into “structure groups” , where and is the number of graphs in each group. Each group contains graphs , where , and the function gives the graph label . In , the subgraphs are identical to each other in the sense that for any vertex labeled with birth date t in the first graph, the neighbor-vertex labels are the same as the neighbor-vertex labels for vertex t in the second graph. Note that different structure groups need not have different structure.

Because M is large, is also large, and within each structure group, there will be many graphs in which the parent vertex has the same label (giving its birth date). Thus, we further partition each structure group into “parent groups” , where and is the number of graphs in each group. Each parent group contains graphs , where , such that subgraphs are identical and baby is parented by the same vertex in each ; is the function that relates labels to label m

(19)
Finally, we use the fact that the average of over the subensemble of is the same as the average over all graphs in the entire ensemble (here, special attention must be paid to t=N versus t=N+1). We obtain
(20)
where in the second line we used the general relation , and in the third line we used the definition of the ensemble average of . In the continuous N approximation, this equation becomes
(21)
This is Eq. (6) from Ensemble-averaged graph degree.


Equation of motion for E(N): solution

The solutions of the equation

(22)
for arbitrary real parameters a,b are obtained using the method of integrating factors, resulting in
(23)
where and are the initial values. For our model, we have , , , and . With these parameter values, our solutions may be expressed in terms of (defined in Ensemble-averaged degree) to obtain the exact result
(24)
for any N. In the large-N limit, we obtain Eq. (7) from Ensemble-averaged graph degree.



Ensemble-averaged degree distribution

We again consider the ensemble as described in Appendix , Ensemble-averaged graph-degree, but restrict our attention to . We would like to compute a degree distribution, giving the fraction of vertices with degree k, that estimates the degree distribution of a typical graph in . Thus, we now derive an equation for how the ensemble-averaged degree distribution, called and defined below, changes between time points and .

For graph at time N, is the degree of vertex t and

(25)
where is the Kronecker δ with the above meaning. Thus, the total number of vertices with degree k is
(26)
where t runs over all vertices in the graph. Thus, for graph , the change in the number of vertices with degree k, from time to time , is
(27)
where