| Information Content of Molecular Structures Biophysical Journal, Volume 85, Issue 1, 1 July 2003, Pages 174-190 David C. Sullivan, Tiba Aynechi, Vincent A. Voelz and Irwin D. Kuntz Abstract For a completely enumerated set of conformers of a macromolecule or for exhaustive lattice walks of model polymers it is straightforward to use Shannon information theory to deduce the information content of the ensemble. It is also practicable to develop numerical measures of the information content of sets of exact distance constraints applied to specific conformational ensembles. We examine the effects of experimental uncertainties by considering “noisy” constraints. The introduction of noise requires additional assumptions about noise distribution and conformational clustering protocols that make the problem of measuring information content more complex. We make use of a standard concept in communication theory, the “noise sphere,” to link uncertainty in measurements to information loss. Most of our numerical results are derived from two-dimensional lattice ensembles. Expressing results in terms of information per degree of freedom removes almost all of the chain length dependence. We also explore off-lattice polyalanine chains that yield surprisingly similar results. Abstract | Full Text | PDF (367 kb) |
| An Information Theoretic Approach to Macromolecular Modeling: II. Force Fields Biophysical Journal, Volume 89, Issue 5, 1 November 2005, Pages 3008-3016 Tiba Aynechi and Irwin D. Kuntz Abstract In this article, we explore the information content of molecular force-field calculations. We make use of exhaustive lattice models of molecular conformations and reduced alphabet sequences to determine the relative resolving power of pairwise interaction-based force fields. We find that sequence-specific interactions that operate over longer distances offer greater amounts of information than nearest-neighbor or non-sequence-specific interactions. In a companion article in this issue, we explored the information content of sequence alignment procedures and the calculation of gap penalties. Both articles have implications for protein and nucleic-acid computations. Abstract | Full Text | PDF (141 kb) |
| Salt Dependence of Nucleic Acid Hairpin Stability Biophysical Journal, Volume 95, Issue 2, 15 July 2008, Pages 738-752 Zhi-Jie Tan and Shi-Jie Chen Abstract Single-stranded junctions/loops are frequently occurring structural motifs in nucleic acid structures. Due to the polyanionic nature of the nucleic acid backbone, metal ions play a crucial role in the loop stability. Here we use the tightly bound ion theory, which can account for the possible ion correlation and ensemble (fluctuation) effects, to predict the ion-dependence of loop and stem-loop (hairpin) free energies. The predicted loop free energy is a function of the loop length, the loop end-to-end distance, and the ion (Na and Mg in this study) concentrations. Based on the statistical mechanical calculations, we derive a set of empirical formulas for the loop thermodynamic parameters as functions of Na and Mg concentrations. For three specific types of loops, namely, hairpin, bulge, and internal loops, the predicted free energies agree with the experimental data. Further applications of these empirical formulas to RNA and DNA hairpin stability lead to good agreements with the available experimental data. Our results indicate that the ion-dependent loop stability makes significant contribution to the overall ion-dependence of the hairpin stability. Abstract | Full Text | PDF (1025 kb) |
Copyright © 2005 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 89, Issue 5, 2998-3007, 1 November 2005
doi:10.1529/biophysj.104.054072
Biophysical Theory and Modeling
Tiba Aynechi* and Irwin D. Kuntz†,
, 
* Graduate Group in Biophysics, University of California, San Francisco, California
† Department of Pharmaceutical Chemistry, University of California, San Francisco, California
Address reprint requests to Dr. Irwin D. Kuntz, Dept. of Pharmaceutical Chemistry, University of California at San Francisco, San Francisco, CA 94143-0446. Tel.: 415-476-1937; Fax: 415-502-1411.Structural biology now has the challenge of providing structural and functional information on a genomic scale. Current methods combine different experimental and computational procedures to deduce the structure and function of biomacromolecules. A partial list includes sequence analysis, crystallography, magnetic resonance, spectroscopy, homology modeling, and molecular dynamics. However, despite the quantitative nature of such undertakings, there is no unifying model of information content and error analysis for the field as a whole. Although there have been important specialized forays 1,2,3,4,5,6,7, there is a need to seek a broader approach that would permit the evaluation and comparison of such methods. A further related concern is the additivity of information when different techniques are combined. Previously, we demonstrated 8 that the basic tenets of information theory 9 can be used to quantify the information content of distance constraints. In this and a companion article 10, we apply the same general principles to simple exact models (SEMs) to draw inferences about two important tools of computational biology, sequence analysis, and force fields.
Sequence alignment is an integral part of comparative modeling protocols. Aside from ab initio methods 11, theoretical structure prediction is generally approached in two steps:
A standard way to gain insight into complex problems is through SEMs. In the protein-folding field, these models use simplified representations of sequences and structures to mimic sequence and structure interactions in real systems. Thus, self-avoiding two-dimensional lattice walks and simplified alphabets have long been used to evaluate and understand the principles of protein folding 13,14. The ability to exhaustively enumerate all states of the system affords the opportunity to describe the system’s behavior unambiguously, and it can provide a clear path relationship between assumptions and consequences. Thus, SEMs are well suited for formulating and evaluating general concepts: a task that may be much more difficult with real-world examples because of their heavy parameter dependence and need for approximations 15.
In this article, we combine the use of simplified systems with information theory to derive the costs of alignment procedures, scoring matrices, and gap penalties of idealized models. We then consider the applicability of the insights gained to the current approaches to sequence alignment. As noted above, our modeling choices are chosen to illuminate the underlying issues. We consider force fields in the companion article 10.
Our basic approach is to explore a simple exact model where it is possible to write out all occurrences of the set of interest (i.e., all possible sequences) and ask what the informational consequences are of performing an operation that combines some of the objects. The information required to select one object from a set of W objects is log2W9. The normal use of sequence-alignment procedures is considered to increase the information associated with a given probe sequence. That is, one queries a database of sequences and assigns properties (structure, function) to the probe sequence based on the statistically valid matches that are found. This information increase arises because a single sequence is placed into a cluster smaller than the full set of sequences from which it was indistinguishable before the alignment procedure. However, we can equally well treat alignment as a clustering procedure in which a number of sequences are grouped together as indistinguishable. Such clustering reduces the effective number of distinguishable objects compared to the full set of unique sequences. From this point of view, information is lost because a number of sequences that were treated as distinct from one another are now considered the same (i.e., similar sequence, function, or structure). In this context, gap penalties are a direct reflection of the price that must be paid for the information loss.
Of course, it is not feasible to write out all sequences for all proteins and nucleic acids. Nor can we advance a comprehensive model for the evolutionary and structural constraints that give rise to the sequences that form the current pan-genomic database. Rather, our strategy will be to uncover general properties by making use of model systems and simplified alphabets 14,16. However, we are also interested in exploring the implications of such models for the real world of sequences and conformations. This relationship is not a formal part of information theory, and will involve additional assumptions or hypotheses, the truth of which must be established by other methods. For example, it is straightforward to evaluate the informational consequences of the proposition that the known sequences are a random subset of all possible sequences; this proposal can be directly tested statistically, but information theory, alone, cannot determine its validity.
The information required to select an individual entity from a set of W objects is defined as
![]() | (1) |
![]() | (2) |
As mentioned previously, clustering can lead to a change in information. We relate Eqs. (1) to yield the information gain/loss of a clustering procedure as
![]() | (3) |
Sequences of proteins or nucleic acids of unknown structure and function are sources of information through association with other sequences whose functions/structures are already known. The most widely used associative process is alignment. Alignment algorithms can be divided into two categories, global and local. A global alignment 17 looks for the best overall similarity among sequences, whereas a local alignment 18 searches for similar sub-sequences between two proteins. Both of these algorithms make use of a variety of scoring matrices and gap penalties 19,20,21,22,23,24. Sequence-alignment problems are underdetermined, having multiple optimal solutions depending on the parameters used. Thus far, there has not been a quantitative analysis of the parameter dependence, one reason being the absence of a standard comparison metric. With an information theoretic approach, we are able to formalize the effects of parameters such as sequence length, alphabet size, etc. We consider the sets of sequences of length N, drawn from an alphabet of A characters. Assume that the characters are used with equal frequency (effects of character correlation can be readily included at a later stage). With this simplifying assumption, each sequence has equal weight and there are AN unique sequences. The information content of the set is simply Nlog2A. Alignment procedures require the definition of a template of length T<N. The template may contain gaps—that is, the string for the template may contain one or more positions that match any character. Alternatively, the template may be considered continuous and the sequences with which it is being compared can contain gaps. We ask how many sequences of an exhaustive list match a specific template. Most generally, because there is nothing of special interest for any given template, we are interested in the information content averaged over all templates of a certain type.
We begin with the case of gapless pairwise alignments and then move to multigapped alignments. We will use both exhaustive and stochastic data sets, along with simple alignment models, to provide insight into the informational issues associated with sequence comparisons.
Statistics from alignments are gathered under two scenarios:
For an A-letter alphabet, the total number of possible N-letter sequences is AN,
![]() | (4) |
![]() | (5) |
This formula counts exactly all occurrences of the template in complete (multiple occurrence) alignments. For a two-letter alphabet consisting of zeroes and ones, the templates are of the form 01, 001, 0001, …. Templates of higher symmetry, e.g., 000 or 010, have somewhat fewer hits (data not shown). Using the asymmetric templates gives the maximum number of hits, which also corresponds to the numerical results from the formulas.
For single-occurrence, ungapped alignments, we have an alternative approach using 1), the contiguous string; and 2), the standard formula for the probability of failure to match, pF. Given the probability of occurrence of the template in a single sequence, pT=(1/A)T, and the number of independent attempts (K+1), pF=(1−pT)K+1. The probability of a hit for a sequence, pH, is then (1−pF) and the total number of hits for the set is AN×pH,
![]() | (6) |
This formula assumes that multiple occurrences of a template occur with an equal probability, pT, in a given sequence. However, once a template appears once in a sequence, it fixes the positions it occupies and the probabilities of any subsequent overlapping templates will no longer be independent. As a result, this formula may either underestimate or overestimate the hit count depending on the template type. In the case of the templates used here (01, 0001,…), the formula underestimates. This effect is lessened as the template/sequence length ratio becomes larger, because of the diminishing possibility of overlaps.
However, Eq. (6) provides useful values for I for the full range of K for single occurrence of templates (see Results and Discussion, below).
For more general gap distributions, in which all templates of length T=N – K are aligned against a probe of length N, we need to consider the combinatorial arrangement of gaps of varying length. For gaps of minimum-length one, there will be C (N, K)=N!/K!(N–K)! ways to arrange the gaps in an N-long sequence. However, if we require the minimum-gap size, Gmin, to be greater than one, then the effective length of the sequence is reduced to Neffective=N−K×Gmin+K. There are AK sequences for each arrangement:
![]() |
Results for Gmin=1 are exact for complete alignments (see Results and Discussion).
We have also found a formulation leading to an exact solution for the number of gapped matches for single occurrences of the template. The number of hits to match a given template of length T, where K=N – T, against an exhaustive set of sequences becomes
![]() | (8) |
![]() | (9) |
The formulas above quantify the amount of information associated with successful alignments when an exhaustive basis set of all possible sequences is available. They also can be used to set bounds on gap penalty values (see Results and Discussion).
Gap penalties can be derived by examining the length distributions of gaps in systems where structural alignment is possible. This approach is based on a general affine model of gap penalties 25 and uses a geometric distribution to assess the probability distribution of gaps, yielding, in the Qian and Goldstein treatment 26, the formula for gap initiation (γI) and gap extension (γE) penalties of
![]() | (10) |
Here, Pg is the probability of opening a gap, and λ is the half-decay length of the gap length distribution. The values for Pg and λ are determined in a similar way to Qian and Goldstein 26. For a given sequence-length and template, we plot the distribution of gap lengths versus the probability for the observed hits (as an example, see Fig. 1). We then fit the data to an exponential of the form
![]() | (11) |
To compare our exhaustive reference-state distributions to previously determined values, we use our counting experiments to measure the distribution of gap lengths for sets of sequences and templates of varying length. We then use the Qian and Goldstein equation 10 to calculate gap initiation (γI) and extension(γE) penalties.
For every sequence, S, in the set, given a template T, we look for the first occurrence of the symbol in the first position of the template. Looking forward, we search for the first instance of the symbol in the second position of the template and so forth, until the last position in T. If a symbol is not observed in S in order of appearance in T, the search is terminated. Indices of each hit in the sequence are tabulated to determine the length of the gap among instances of each symbol present in the template.
Fig. 2 shows how the occurrences of a template T are sought in a sequence, S, consisting of an alphabet of size A. The final list contains all the occurrences of T in S by specifying the indices of the symbols in S. The positional indices for each occurrence are used to determine the distribution of gap lengths.
We turn to the question of how to compare results from the exhaustive list of sequences with those generated from a (sub)set of observed sequences. There are several issues. First, the set of observed sequences is not fixed but is continually updated with new sequences being added and old sequences being modified or even deleted. For our purposes, we will ignore such issues and simply take a snapshot of the existing data. A second concern is that the observed sequences show unequal utilization of the characters. Such variable weightings were part of Shannon’s initial formulation and Eq. (2) yields a single correction term equal to 0.12 bits/amino acid, based on the nonuniform composition of amino acids in real proteins 27. Higher-order terms dealing with joint probability of multiple characters can also be considered as corrections to the simple assumption of equal frequencies 28. In the same way, scoring matrices with partial weights for alternative characters reduce the effective alphabet below the limit set by equal utilization of all characters. A correction term can be generated for any scoring matrix of interest.
A harder question is the relationship of the observed sequences to the exhaustive set. Many hypotheses can be put forward about the mechanisms of evolution and the types of structural constraints imposed upon both nucleic acids and proteins. We do not propose in this article to select among them. Instead, we provide simple examples to illustrate how the mapping from exhaustive sequences to sequence subsets changes the information content of alignment operations and hence changes the values of gap penalties. These simple hypotheses are:
To examine the information content of a random subset of the exhaustive sequences, we generated sets of 10,000–100,000 random sequences of lengths N=10, 20, 30 for A=2. These were scanned with templates of various lengths and gap lengths. The number of hits was recorded with each of the sequences as a starting point and the probabilities of clustering were calculated. The information content for each alignment procedure was tabulated.
To generate a simple evolutionary model, we used the constraint that L of the N positions would not vary. This assumption produces a subset of sequences that are in exact correspondence to sequences from the exhaustive list for N′ where N′=N−L. Equations (5) can then be applied directly to this subset.
Of course, one random model and one simple evolutionary model just begin to explore the sequence constraints operating on the natural sequences. Presumably, one of the critical limits is that most of the experimental sequences arise from sequence subsets that provide stable three-dimensional (tertiary) structures for some range of physical variables. We have not attempted to construct such a model in this article, but others have approached the problem 29,30.
In a seminal article, Chothia and Lesk 31 provided the first quantitative relationship between sequence identity and structural similarity. In recent work, this relationship has been revisited in great detail 32. For our purposes, we use the methods above and those of Sullivan and Kuntz 33 to compare the information content of sequence and structural alignments, as follows. As a model of real proteins, we choose the backbone conformations for 100-mer compact polyalanine chains, a 20-letter alphabet, and a multiple-occurrence gapped alignment model. For a given homology level, we then compare the information of sequence and structural alignments. For our example, we select a similarity level of 90%. We use Eqs. (2) to determine the information from sequence alignments. From Chothia and Lesk, a 90% identity yields a root-mean-square deviation (RMSD) of ∼0.5Å. Therefore, we ask: What is the conformational probability of 100-mer compact polyalanine models falling within 0.5Å RMSD, as derived by Sullivan and Kuntz, to determine the information required for a corresponding structural alignment?
We have studied two alignment models, each treating ungapped and gapped templates. We have both analytic formulas and statistical results for the information. In addition, we have collected statistics on gap frequencies and the probability distribution of gap lengths. As noted earlier, Eqs. (5) describe the ungapped data exactly for multiple hits (Table 1) and within an average of 3% for single-occurrence hits (Table 2), respectively. Table 3,Table 4 show that Eqs. (8) provide an exact numerical result for gapped alignments for multiple and first-occurrence models.
| Table 1 Gapless alignments for exhaustive sequence sets—multiple-occurrences as described by Eq. (5), verified numerically from the counting data |
| Alphabet size (A) | Template length | Sequence length | K | AK | (K+1)AK | Number of hits (actual count) | ||
|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 20 | 17 | 131,072 | 2,359,296 | 2,359,296 | ||
| 2 | 4 | 20 | 16 | 65,536 | 1,114,112 | 1,114,112 | ||
| 2 | 5 | 20 | 15 | 32,768 | 524,288 | 524,288 | ||
| 2 | 6 | 20 | 14 | 16,384 | 245,760 | 245,760 | ||
| 2 | 7 | 20 | 13 | 8192 | 114,688 | 114,688 | ||
| 2 | 8 | 20 | 12 | 4096 | 53,248 | 53,248 | ||
| 2 | 9 | 20 | 11 | 2048 | 24,576 | 24,576 | ||
| 2 | 10 | 20 | 10 | 1024 | 11,264 | 11,264 | ||
| 2 | 11 | 20 | 9 | 512 | 5120 | 5120 | ||
| 2 | 12 | 20 | 8 | 256 | 2304 | 2304 | ||
| 2 | 13 | 20 | 7 | 128 | 1024 | 1024 | ||
| 2 | 14 | 20 | 6 | 64 | 448 | 448 | ||
| 2 | 15 | 20 | 5 | 32 | 192 | 192 | ||
| 2 | 16 | 20 | 4 | 16 | 80 | 80 | ||
| 2 | 17 | 20 | 3 | 8 | 32 | 32 | ||
| 2 | 18 | 20 | 2 | 4 | 12 | 12 | ||
| 2 | 19 | 20 | 1 | 2 | 4 | 4 | ||
| 2 | 20 | 20 | 0 | 1 | 1 | 1 | ||
| 3 | 3 | 12 | 9 | 19,683 | 196,830 | 196,830 | ||
| 3 | 4 | 12 | 8 | 6561 | 59,049 | 59,049 | ||
| 3 | 5 | 12 | 7 | 2187 | 17,496 | 17,496 | ||
| 3 | 6 | 12 | 6 | 729 | 5103 | 5103 | ||
| 3 | 7 | 12 | 5 | 243 | 1458 | 1458 | ||
| 3 | 8 | 12 | 4 | 81 | 405 | 405 | ||
| 3 | 9 | 12 | 3 | 27 | 108 | 108 | ||
| 3 | 10 | 12 | 2 | 9 | 27 | 27 | ||
| 3 | 11 | 12 | 1 | 3 | 6 | 6 | ||
| 3 | 12 | 12 | 0 | 1 | 1 | 1 | ||
| Table 2 Gapless alignments for exhaustive sequence sets—first occurrence as described by Eq. (6), verified numerically from the counting data |
| Alphabet size (A) | Template length | PT | Seq. length* | K | PH† | Eq. (6): Number of hits | Number of hits (actual) | ΔI | % Difference, number of hits | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 0.125 | 20 | 17 | 9.10E-01 | 953,790 | 1,019,920 | 19.96 | 6.48 | ||
| 2 | 4 | 0.0625 | 20 | 16 | 6.66E-01 | 698,541 | 782,497 | 19.58 | 10.73 | ||
| 2 | 5 | 0.03125 | 20 | 15 | 3.98E-01 | 417,637 | 458,495 | 18.81 | 8.91 | ||
| 2 | 6 | 0.015625 | 20 | 14 | 2.10E-01 | 220,618 | 234,280 | 17.84 | 5.83 | ||
| 2 | 7 | 0.007813 | 20 | 13 | 1.04E-01 | 109,042 | 112,896 | 16.78 | 3.41 | ||
| 2 | 8 | 0.003906 | 20 | 12 | 4.96E-02 | 52,018 | 53,008 | 15.69 | 1.87 | ||
| 2 | 9 | 0.001953 | 20 | 11 | 2.32E-02 | 24,314 | 24,552 | 14.58 | 0.97 | ||
| 2 | 10 | 0.000977 | 20 | 10 | 1.07E-02 | 11,209 | 11,263 | 13.46 | 0.48 | ||
| 2 | 11 | 0.000488 | 20 | 9 | 4.87E-03 | 5109 | 5120 | 12.32 | 0.22 | ||
| 2 | 12 | 0.000244 | 20 | 8 | 2.20E-03 | 2302 | 2304 | 11.17 | 0.10 | ||
| 2 | 13 | 0.000122 | 20 | 7 | 9.76E-04 | 1024 | 1024 | 10.00 | 0.04 | ||
| 2 | 14 | 6.1E-05 | 20 | 6 | 4.27E-04 | 448 | 448 | 8.81 | 0.02 | ||
| 2 | 15 | 3.05E-05 | 20 | 5 | 1.83E-04 | 192 | 192 | 7.58 | 0.01 | ||
| 2 | 16 | 1.53E-05 | 20 | 4 | 7.63E-05 | 80 | 80 | 6.32 | 0.00 | ||
| 2 | 17 | 7.63E-06 | 20 | 3 | 3.05E-05 | 32 | 32 | 5.00 | 0.00 | ||
| 2 | 18 | 3.81E-06 | 20 | 2 | 1.14E-05 | 12 | 12 | 3.58 | 0.00 | ||
| 2 | 19 | 1.91E-06 | 20 | 1 | 3.81E-06 | 4 | 4 | 2.00 | 0.00 | ||
| 2 | 20 | 9.54E-07 | 20 | 0 | 9.54E-07 | 1 | 1 | 0.00 | 0.00 | ||
| 3 | 3 | 0.037037 | 12 | 9 | 3.14E-01 | 167,064 | 176,957 | 17.43 | 5.59 | ||
| 3 | 4 | 0.012346 | 12 | 8 | 1.06E-01 | 56,215 | 57,835 | 15.82 | 2.80 | ||
| 3 | 5 | 0.004115 | 12 | 7 | 3.25E-02 | 17,246 | 17,442 | 14.09 | 1.12 | ||
| 3 | 6 | 0.001372 | 12 | 6 | 9.56E-03 | 5082 | 5102 | 12.32 | 0.39 | ||
| 3 | 7 | 0.000457 | 12 | 5 | 2.74E-03 | 1456 | 1458 | 10.51 | 0.11 | ||
| 3 | 8 | 0.000152 | 12 | 4 | 7.62E-04 | 405 | 405 | 8.66 | 0.03 | ||
| 3 | 9 | 5.08E-05 | 12 | 3 | 2.03E-04 | 108 | 108 | 6.75 | 0.01 | ||
| 3 | 10 | 1.69E-05 | 12 | 2 | 5.08E-05 | 27 | 27 | 4.75 | 0.00 | ||
| 3 | 11 | 5.65E-06 | 12 | 1 | 1.13E-05 | 6 | 6 | 2.58 | 0.00 | ||
| 3 | 12 | 1.88E-06 | 12 | 0 | 1.88E-06 | 1 | 1 | 0.00 | 0.00 | ||
| * Sample size=AN. † PH=1−PF. |
| Table 3 Gapped alignments for exhaustive sequence sets—multiple-occurrences as described by Eq. 7, verified numerically from the counting data |
| Alphabet size (A) | Template length | Sequence length* | K | Number of hits (actual) | ΔI | C(N,K) | (A)K | Eq. 7: C(N,K)(A)K | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 20 | 17 | 149,422,080 | 27.15 | 1140 | 131,072 | 149,422,080 | ||
| 2 | 4 | 20 | 16 | 317,521,920 | 28.24 | 4845 | 65,536 | 317,521,920 | ||
| 2 | 5 | 20 | 15 | 508,035,072 | 28.92 | 15,504 | 32,768 | 508,035,072 | ||
| 3 | 3 | 12 | 9 | 4,330,260 | 22.05 | 220 | 19,683 | 4,330,260 | ||
| 3 | 4 | 12 | 8 | 3,247,695 | 21.63 | 495 | 6561 | 3,247,695 | ||
| 3 | 5 | 12 | 7 | 1,732,104 | 20.72 | 792 | 2187 | 1,732,104 | ||
| * Sample size=AN. |
| Table 4 Gapped alignments for exhaustive sequence sets—first-occurrence model as described by Eq. (8), verified numerically from the counting data |
| Alphabet size (A) | Template length | Sequence length* | K | Number of hits (actual) | # Failures | ΔI | C(N,K)(A−1)K | ||
|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 20 | 17 | 1,048,365 | 1140 | 20.00 | 1140 | ||
| 2 | 4 | 20 | 16 | 1,047,225 | 4845 | 20.00 | 4845 | ||
| 2 | 5 | 20 | 15 | 1,042,380 | 15,504 | 19.99 | 15,504 | ||
| 2 | 6 | 20 | 14 | 1,026,876 | 38,760 | 19.97 | 38,760 | ||
| 2 | 7 | 20 | 13 | 988,116 | 77,520 | 19.91 | 77,520 | ||
| 2 | 8 | 20 | 12 | 910,596 | 125,970 | 19.80 | 125,970 | ||
| 2 | 9 | 20 | 11 | 784,626 | 167,960 | 19.58 | 167,960 | ||
| 2 | 10 | 20 | 10 | 616,666 | 184,756 | 19.23 | 184,756 | ||
| 2 | 11 | 20 | 9 | 431,910 | 167,960 | 18.72 | 167,960 | ||
| 2 | 12 | 20 | 8 | 263,950 | 125,970 | 18.01 | 125,970 | ||
| 2 | 13 | 20 | 7 | 137,980 | 77,520 | 17.07 | 77,520 | ||
| 2 | 14 | 20 | 6 | 60,460 | 38,760 | 15.88 | 38,760 | ||
| 2 | 15 | 20 | 5 | 21,700 | 15,504 | 14.41 | 15,504 | ||
| 2 | 16 | 20 | 4 | 6196 | 4845 | 12.60 | 4845 | ||
| 2 | 17 | 20 | 3 | 1351 | 1140 | 10.40 | 1140 | ||
| 2 | 18 | 20 | 2 | 211 | 190 | 7.72 | 190 | ||
| 2 | 19 | 20 | 1 | 21 | 20 | 4.39 | 20 | ||
| 3 | 3 | 12 | 9 | 435,185 | 112,640 | 16.78 | 112,640 | ||
| 3 | 4 | 12 | 8 | 322,545 | 126,720 | 16.95 | 126,720 | ||
| 3 | 5 | 12 | 7 | 195,825 | 101,376 | 16.63 | 101,376 | ||
| 3 | 6 | 12 | 6 | 94,449 | 59,136 | 15.85 | 59,136 | ||
| 3 | 7 | 12 | 5 | 35,313 | 25,344 | 14.63 | 25,344 | ||
| 3 | 8 | 12 | 4 | 9969 | 7920 | 12.95 | 7920 | ||
| 3 | 9 | 12 | 3 | 2049 | 1760 | 10.78 | 1760 | ||
| 3 | 10 | 12 | 2 | 289 | 264 | 8.04 | 264 | ||
| 3 | 11 | 12 | 1 | 25 | 24 | 4.58 | 24 | ||
| * Sample size=AN. |
One of our primary concerns is the implication of these equations for gap penalties. We can get estimates of these penalties by examining the equations directly, or we can calculate the distribution of gap lengths. Equations (5) contain terms sensitive to K (the total length of all gaps), as well as terms that depend on the size of the alphabet, A, and the length of the sequence, N. These formulas cannot be separated cleanly into gap initiation terms and gap extension terms. However, they are generally consistent with a gap penalty that costs information at the rate of log2A per unit of gap length (K). The full loss (initiation+extension) at K=1 is log2(A×N). For N=100, such a penalty would be equivalent to −6.0 for A=4 and −7.6 for A=20 in the units normally used for sequence-alignment programs (i.e., ln A). These values are model-dependent. We also note that, for equivalent coding, the nucleic acid model, A=4, would have an N of 300, yielding a penalty term of −7.1. These results are in reasonable agreement with the range of empirical gap initiation penalties reported by Qian and Goldstein 20 (see Fig. 3).
We can also examine the probability distribution of gap lengths 26. We gathered these data either from short exhaustive binary sequences or from stochastic samples of longer sequences with larger alphabets. There are two ways that we can count gap frequencies and gap lengths (see Methods, above). First, for the equations given above, we have used a first-occurrence model in which the gap-length data are taken from the initial successful match of a template to a sequence. Alternatively, we can identify all matches of a template with a specific sequence, and for each match, tabulate the gap-length information (multiple-occurrence) model. This model appears to be closer to the empirical data reported by Qian and Goldstein 26.
Our basic findings are:
| Table 5 Gap distributions and gap penalties for the first-occurrence model |
| N | Alphabet size | Template length | Total number of gaps | Total number of hits | Pg | λ | (−) γgap−I | (−) γgap-E | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 2 | 2 | 1,004 | 1,013 | 1.043 | 1.425 | 0.676 | 0.702 | ||
| 10 | 2 | 3 | 1,398 | 968 | 1.219 | 1.352 | 0.632 | 0.739 | ||
| 10 | 2 | 4 | 1,528 | 848 | 1.695 | 1.209 | 0.552 | 0.827 | ||
| 20 | 2 | 2 | 1,048,536 | 1,048,555 | 1.000 | 1.443 | 0.693 | 0.693 | ||
| 20 | 2 | 3 | 1,572,291 | 1,048,365 | 1.001 | 1.442 | 0.693 | 0.693 | ||
| 20 | 2 | 4 | 2,092,512 | 1,047,225 | 1.004 | 1.441 | 0.692 | 0.694 | ||
| 50 | 2 | 2 | 1,009,981 | 1,000,000 | 0.997 | 1.444 | 0.694 | 0.692 | ||
| 50 | 2 | 3 | 1,514,903 | 1,000,000 | 0.996 | 1.445 | 0.694 | 0.692 | ||
| 50 | 2 | 4 | 2,020,116 | 1,000,000 | 0.999 | 1.443 | 0.694 | 0.693 | ||
| 50 | 2 | 5 | 2,524,154 | 1,000,000 | 0.997 | 1.444 | 0.694 | 0.692 | ||
| 100 | 2 | 2 | 999,853 | 1,000,000 | 1.000 | 1.442 | 0.693 | 0.693 | ||
| 100 | 2 | 3 | 1,501,252 | 1,000,000 | 1.000 | 1.443 | 0.693 | 0.693 | ||
| 100 | 2 | 4 | 2,000,332 | 1,000,000 | 0.999 | 1.443 | 0.693 | 0.693 | ||
| 100 | 2 | 5 | 2,501,670 | 1,000,000 | 0.998 | 1.444 | 0.693 | 0.693 | ||
| 20 | 4 | 4 | 28,163,173 | 774,544 | 0.170 | 2.928 | 1.215 | 0.341 | ||
| 20 | 4 | 5 | 54,233,834 | 585,323 | 0.228 | 2.589 | 1.112 | 0.386 | ||
| 20 | 4 | 6 | 108,467,668 | 383,398 | 0.316 | 2.264 | 1.004 | 0.442 | ||
| 20 | 4 | 7 | 215,925,355 | 214,460 | 0.447 | 1.972 | 0.897 | 0.507 | ||
| 50 | 4 | 4 | 2,998,161 | 999,517 | 0.111 | 3.477 | 1.386 | 0.288 | ||
| 50 | 4 | 5 | 3,740,043 | 997,900 | 0.112 | 3.471 | 1.384 | 0.288 | ||
| 50 | 4 | 6 | 4,461,340 | 992,992 | 0.112 | 3.460 | 1.381 | 0.289 | ||
| 50 | 4 | 7 | 5,126,973 | 980,293 | 0.114 | 3.439 | 1.374 | 0.291 | ||
| 100 | 4 | 4 | 2,999,727 | 1,000,000 | 0.111 | 3.472 | 1.385 | 0.288 | ||
| 100 | 4 | 5 | 3,750,300 | 1,000,000 | 0.111 | 3.473 | 1.386 | 0.288 | ||
| 100 | 4 | 6 | 4,500,088 | 999,998 | 0.111 | 3.472 | 1.386 | 0.288 | ||
| 100 | 4 | 7 | 5,248,490 | 9 | ||||||