Article Outline

Article Information

PubMed

Related Articles

  • …more

Copyright © 2005 The Biophysical Society. All rights reserved.
Biophysical Journal, Volume 89, Issue 5, 2998-3007, 1 November 2005

doi:10.1529/biophysj.104.054072

Biophysical Theory and Modeling

An Information Theoretic Approach to Macromolecular Modeling: I. Sequence Alignments

Tiba Aynechi* and Irwin D. KuntzGo To Corresponding Author 

* Graduate Group in Biophysics, University of California, San Francisco, California
Department of Pharmaceutical Chemistry, University of California, San Francisco, California

Address reprint requests to Dr. Irwin D. Kuntz, Dept. of Pharmaceutical Chemistry, University of California at San Francisco, San Francisco, CA 94143-0446. Tel.: 415-476-1937; Fax: 415-502-1411.

Abstract

We are interested in applying the principles of information theory to structural biology calculations. In this article, we explore the information content of an important computational procedure: sequence alignment. Using a reference state developed from exhaustive sequences, we measure alignment statistics and evaluate gap penalties based on first-principle considerations and gap distributions. We show that there are different gap penalties for different alphabet sizes and that the gap penalties can depend on the length of the sequences being aligned. In a companion article, we examine the information content of molecular force fields.

Introduction

Structural biology now has the challenge of providing structural and functional information on a genomic scale. Current methods combine different experimental and computational procedures to deduce the structure and function of biomacromolecules. A partial list includes sequence analysis, crystallography, magnetic resonance, spectroscopy, homology modeling, and molecular dynamics. However, despite the quantitative nature of such undertakings, there is no unifying model of information content and error analysis for the field as a whole. Although there have been important specialized forays 1,2,3,4,5,6,7, there is a need to seek a broader approach that would permit the evaluation and comparison of such methods. A further related concern is the additivity of information when different techniques are combined. Previously, we demonstrated 8 that the basic tenets of information theory 9 can be used to quantify the information content of distance constraints. In this and a companion article 10, we apply the same general principles to simple exact models (SEMs) to draw inferences about two important tools of computational biology, sequence analysis, and force fields.

Sequence alignment is an integral part of comparative modeling protocols. Aside from ab initio methods 11, theoretical structure prediction is generally approached in two steps:

1. Given an amino acid sequence, find an appropriate structural template (using homology modeling and/or threading).
2. Refine the structural model to produce an energetically minimized or best-scoring conformation.
The first step requires sequence-alignment algorithms, which rely heavily on the use of empirical parameters such as gap penalties and scoring matrices 12. Because sequence space is poorly characterized, it is difficult to either evaluate or improve overall performance except in the context of specific training sets. What is needed is a unified picture of the fundamental issues.

A standard way to gain insight into complex problems is through SEMs. In the protein-folding field, these models use simplified representations of sequences and structures to mimic sequence and structure interactions in real systems. Thus, self-avoiding two-dimensional lattice walks and simplified alphabets have long been used to evaluate and understand the principles of protein folding 13,14. The ability to exhaustively enumerate all states of the system affords the opportunity to describe the system’s behavior unambiguously, and it can provide a clear path relationship between assumptions and consequences. Thus, SEMs are well suited for formulating and evaluating general concepts: a task that may be much more difficult with real-world examples because of their heavy parameter dependence and need for approximations 15.

In this article, we combine the use of simplified systems with information theory to derive the costs of alignment procedures, scoring matrices, and gap penalties of idealized models. We then consider the applicability of the insights gained to the current approaches to sequence alignment. As noted above, our modeling choices are chosen to illuminate the underlying issues. We consider force fields in the companion article 10.


Methods

Overview

Our basic approach is to explore a simple exact model where it is possible to write out all occurrences of the set of interest (i.e., all possible sequences) and ask what the informational consequences are of performing an operation that combines some of the objects. The information required to select one object from a set of W objects is log2W9. The normal use of sequence-alignment procedures is considered to increase the information associated with a given probe sequence. That is, one queries a database of sequences and assigns properties (structure, function) to the probe sequence based on the statistically valid matches that are found. This information increase arises because a single sequence is placed into a cluster smaller than the full set of sequences from which it was indistinguishable before the alignment procedure. However, we can equally well treat alignment as a clustering procedure in which a number of sequences are grouped together as indistinguishable. Such clustering reduces the effective number of distinguishable objects compared to the full set of unique sequences. From this point of view, information is lost because a number of sequences that were treated as distinct from one another are now considered the same (i.e., similar sequence, function, or structure). In this context, gap penalties are a direct reflection of the price that must be paid for the information loss.

Of course, it is not feasible to write out all sequences for all proteins and nucleic acids. Nor can we advance a comprehensive model for the evolutionary and structural constraints that give rise to the sequences that form the current pan-genomic database. Rather, our strategy will be to uncover general properties by making use of model systems and simplified alphabets 14,16. However, we are also interested in exploring the implications of such models for the real world of sequences and conformations. This relationship is not a formal part of information theory, and will involve additional assumptions or hypotheses, the truth of which must be established by other methods. For example, it is straightforward to evaluate the informational consequences of the proposition that the known sequences are a random subset of all possible sequences; this proposal can be directly tested statistically, but information theory, alone, cannot determine its validity.


Shannon information of a set W

The information required to select an individual entity from a set of W objects is defined as

(1)
IS is referred to as the source information 9. Given a metric set, M, that partitions the objects into subsets, the information content can be measured in bits using Shannon’s formulation 9,
(2)
in which pk is the population of cluster k expressed as a fraction of the ensemble, summed over all clusters. These clusters are subsets of the population that are indistinguishable under particular assumptions or constraints.

As mentioned previously, clustering can lead to a change in information. We relate Eqs. (1) to yield the information gain/loss of a clustering procedure as

(3)


Sequence alignment

Overview

Sequences of proteins or nucleic acids of unknown structure and function are sources of information through association with other sequences whose functions/structures are already known. The most widely used associative process is alignment. Alignment algorithms can be divided into two categories, global and local. A global alignment 17 looks for the best overall similarity among sequences, whereas a local alignment 18 searches for similar sub-sequences between two proteins. Both of these algorithms make use of a variety of scoring matrices and gap penalties 19,20,21,22,23,24. Sequence-alignment problems are underdetermined, having multiple optimal solutions depending on the parameters used. Thus far, there has not been a quantitative analysis of the parameter dependence, one reason being the absence of a standard comparison metric. With an information theoretic approach, we are able to formalize the effects of parameters such as sequence length, alphabet size, etc. We consider the sets of sequences of length N, drawn from an alphabet of A characters. Assume that the characters are used with equal frequency (effects of character correlation can be readily included at a later stage). With this simplifying assumption, each sequence has equal weight and there are AN unique sequences. The information content of the set is simply Nlog2A. Alignment procedures require the definition of a template of length T<N. The template may contain gaps—that is, the string for the template may contain one or more positions that match any character. Alternatively, the template may be considered continuous and the sequences with which it is being compared can contain gaps. We ask how many sequences of an exhaustive list match a specific template. Most generally, because there is nothing of special interest for any given template, we are interested in the information content averaged over all templates of a certain type.

We begin with the case of gapless pairwise alignments and then move to multigapped alignments. We will use both exhaustive and stochastic data sets, along with simple alignment models, to provide insight into the informational issues associated with sequence comparisons.

Statistics from alignments are gathered under two scenarios:

1. For every sequence in the data set, a single (first) occurrence of the template, T, is sufficient for assignment.
2. All possible occurrences of a template are sought in each sequence of the data set (multiple-occurrence model).


Gapless alignments

For an A-letter alphabet, the total number of possible N-letter sequences is AN,

(4)
In the simplest case, we consider those template sequences of length T whose elements are found in contiguous positions in probe sequences of length N. Defining K=N−T, the templates can be anchored in K+1 positions each with AK possible matches, leading to an estimate of (K+1)AK sequences if there are no duplicate sequences. Consequently, the information required to distinguish among the ungapped matches in an exhaustive set is
(5)

This formula counts exactly all occurrences of the template in complete (multiple occurrence) alignments. For a two-letter alphabet consisting of zeroes and ones, the templates are of the form 01, 001, 0001, …. Templates of higher symmetry, e.g., 000 or 010, have somewhat fewer hits (data not shown). Using the asymmetric templates gives the maximum number of hits, which also corresponds to the numerical results from the formulas.

For single-occurrence, ungapped alignments, we have an alternative approach using 1), the contiguous string; and 2), the standard formula for the probability of failure to match, pF. Given the probability of occurrence of the template in a single sequence, pT=(1/A)T, and the number of independent attempts (K+1), pF=(1−pT)K+1. The probability of a hit for a sequence, pH, is then (1−pF) and the total number of hits for the set is AN×pH,

(6)

This formula assumes that multiple occurrences of a template occur with an equal probability, pT, in a given sequence. However, once a template appears once in a sequence, it fixes the positions it occupies and the probabilities of any subsequent overlapping templates will no longer be independent. As a result, this formula may either underestimate or overestimate the hit count depending on the template type. In the case of the templates used here (01, 0001,…), the formula underestimates. This effect is lessened as the template/sequence length ratio becomes larger, because of the diminishing possibility of overlaps.

However, Eq. (6) provides useful values for I for the full range of K for single occurrence of templates (see Results and Discussion, below).


Gapped alignments

For more general gap distributions, in which all templates of length T=N – K are aligned against a probe of length N, we need to consider the combinatorial arrangement of gaps of varying length. For gaps of minimum-length one, there will be C (N, K)=N!/K!(NK)! ways to arrange the gaps in an N-long sequence. However, if we require the minimum-gap size, Gmin, to be greater than one, then the effective length of the sequence is reduced to Neffective=NK×Gmin+K. There are AK sequences for each arrangement:

Results for Gmin=1 are exact for complete alignments (see Results and Discussion).

We have also found a formulation leading to an exact solution for the number of gapped matches for single occurrences of the template. The number of hits to match a given template of length T, where K=NT, against an exhaustive set of sequences becomes

(8)
This equation (discovered empirically from the counting data) provides exact counts over the complete range of N, T. When converting to bits of information, the right-hand side of Eq. (8) generally cannot be reduced to a simpler form; however, when A>2 and T<95% of N, the highest order term sufficiently dominates so that the summation is no longer needed. Under these circumstances, the information is, to a good approximation,
(9)


Gap penalties

The formulas above quantify the amount of information associated with successful alignments when an exhaustive basis set of all possible sequences is available. They also can be used to set bounds on gap penalty values (see Results and Discussion).

Gap penalties can be derived by examining the length distributions of gaps in systems where structural alignment is possible. This approach is based on a general affine model of gap penalties 25 and uses a geometric distribution to assess the probability distribution of gaps, yielding, in the Qian and Goldstein treatment 26, the formula for gap initiation (γI) and gap extension (γE) penalties of

(10)

Here, Pg is the probability of opening a gap, and λ is the half-decay length of the gap length distribution. The values for Pg and λ are determined in a similar way to Qian and Goldstein 26. For a given sequence-length and template, we plot the distribution of gap lengths versus the probability for the observed hits (as an example, see Fig. 1). We then fit the data to an exponential of the form

(11)
where n is length of the gap and B is defined as B=Pg× exp(−1/λ)/[1−exp(−1/λ)].

Display large version of this figure
Figure 1
Calculation of λ and Pg from gap distributions. Sample distribution of gaps for the first-occurrence model, using a stochastic set of 50-mers with a six-letter alphabet, using a 10-mer template. The data is fit (solid line) to the form p(n)=B×exp(−n/λ), for this case λ=3.833, B=0.30298. According to the formulation of Qian and Goldstein 20, B=Pg×exp(−1/λ)/[1−exp(−1/λ)]. We calculate Pg by substituting λ back into the expression for B to obtain gap initiation and extension penalties according to Eq. (10).

To compare our exhaustive reference-state distributions to previously determined values, we use our counting experiments to measure the distribution of gap lengths for sets of sequences and templates of varying length. We then use the Qian and Goldstein equation 10 to calculate gap initiation (γI) and extension(γE) penalties.



Search algorithm methods

First-occurrence alignments

For every sequence, S, in the set, given a template T, we look for the first occurrence of the symbol in the first position of the template. Looking forward, we search for the first instance of the symbol in the second position of the template and so forth, until the last position in T. If a symbol is not observed in S in order of appearance in T, the search is terminated. Indices of each hit in the sequence are tabulated to determine the length of the gap among instances of each symbol present in the template.


Multiple-occurrence alignments

Fig. 2 shows how the occurrences of a template T are sought in a sequence, S, consisting of an alphabet of size A. The final list contains all the occurrences of T in S by specifying the indices of the symbols in S. The positional indices for each occurrence are used to determine the distribution of gap lengths.

Display large version of this figure
Figure 2
Search algorithm for finding all occurrences of a template, T, in a sequence S, using alphabet A.


Exhaustive versus incomplete sequence sets

Mapping

We turn to the question of how to compare results from the exhaustive list of sequences with those generated from a (sub)set of observed sequences. There are several issues. First, the set of observed sequences is not fixed but is continually updated with new sequences being added and old sequences being modified or even deleted. For our purposes, we will ignore such issues and simply take a snapshot of the existing data. A second concern is that the observed sequences show unequal utilization of the characters. Such variable weightings were part of Shannon’s initial formulation and Eq. (2) yields a single correction term equal to 0.12 bits/amino acid, based on the nonuniform composition of amino acids in real proteins 27. Higher-order terms dealing with joint probability of multiple characters can also be considered as corrections to the simple assumption of equal frequencies 28. In the same way, scoring matrices with partial weights for alternative characters reduce the effective alphabet below the limit set by equal utilization of all characters. A correction term can be generated for any scoring matrix of interest.

A harder question is the relationship of the observed sequences to the exhaustive set. Many hypotheses can be put forward about the mechanisms of evolution and the types of structural constraints imposed upon both nucleic acids and proteins. We do not propose in this article to select among them. Instead, we provide simple examples to illustrate how the mapping from exhaustive sequences to sequence subsets changes the information content of alignment operations and hence changes the values of gap penalties. These simple hypotheses are:

1. The observed sequences are a random subset of the exhaustive sequences.
2. The observed sequences are a particular evolutionary subset of the exhaustive sequences.
Again, our purpose is not to espouse these models, but to show how Eqs. (5) are modified in each case.

To examine the information content of a random subset of the exhaustive sequences, we generated sets of 10,000–100,000 random sequences of lengths N=10, 20, 30 for A=2. These were scanned with templates of various lengths and gap lengths. The number of hits was recorded with each of the sequences as a starting point and the probabilities of clustering were calculated. The information content for each alignment procedure was tabulated.

To generate a simple evolutionary model, we used the constraint that L of the N positions would not vary. This assumption produces a subset of sequences that are in exact correspondence to sequences from the exhaustive list for N′ where N′=NL. Equations (5) can then be applied directly to this subset.


Correlation of sequence alignment and conformational resolution

Of course, one random model and one simple evolutionary model just begin to explore the sequence constraints operating on the natural sequences. Presumably, one of the critical limits is that most of the experimental sequences arise from sequence subsets that provide stable three-dimensional (tertiary) structures for some range of physical variables. We have not attempted to construct such a model in this article, but others have approached the problem 29,30.

In a seminal article, Chothia and Lesk 31 provided the first quantitative relationship between sequence identity and structural similarity. In recent work, this relationship has been revisited in great detail 32. For our purposes, we use the methods above and those of Sullivan and Kuntz 33 to compare the information content of sequence and structural alignments, as follows. As a model of real proteins, we choose the backbone conformations for 100-mer compact polyalanine chains, a 20-letter alphabet, and a multiple-occurrence gapped alignment model. For a given homology level, we then compare the information of sequence and structural alignments. For our example, we select a similarity level of 90%. We use Eqs. (2) to determine the information from sequence alignments. From Chothia and Lesk, a 90% identity yields a root-mean-square deviation (RMSD) of ∼0.5Å. Therefore, we ask: What is the conformational probability of 100-mer compact polyalanine models falling within 0.5Å RMSD, as derived by Sullivan and Kuntz, to determine the information required for a corresponding structural alignment?




Results and discussion

We have studied two alignment models, each treating ungapped and gapped templates. We have both analytic formulas and statistical results for the information. In addition, we have collected statistics on gap frequencies and the probability distribution of gap lengths. As noted earlier, Eqs. (5) describe the ungapped data exactly for multiple hits (Table 1) and within an average of 3% for single-occurrence hits (Table 2), respectively. Table 3,Table 4 show that Eqs. (8) provide an exact numerical result for gapped alignments for multiple and first-occurrence models.

One of our primary concerns is the implication of these equations for gap penalties. We can get estimates of these penalties by examining the equations directly, or we can calculate the distribution of gap lengths. Equations (5) contain terms sensitive to K (the total length of all gaps), as well as terms that depend on the size of the alphabet, A, and the length of the sequence, N. These formulas cannot be separated cleanly into gap initiation terms and gap extension terms. However, they are generally consistent with a gap penalty that costs information at the rate of log2A per unit of gap length (K). The full loss (initiation+extension) at K=1 is log2(A×N). For N=100, such a penalty would be equivalent to −6.0 for A=4 and −7.6 for A=20 in the units normally used for sequence-alignment programs (i.e., ln A). These values are model-dependent. We also note that, for equivalent coding, the nucleic acid model, A=4, would have an N of 300, yielding a penalty term of −7.1. These results are in reasonable agreement with the range of empirical gap initiation penalties reported by Qian and Goldstein 20 (see Fig. 3).

Display large version of this figure
Figure 3
Distribution of gap initiation and extension penalties. Medium hashed bars designate gap initiation; solid bars, gap extension. Bars labeled Info Theory represent values derived from our gap distributions using Eq. (10). Dense hash bars indicate gap initiation+gap extension approximated using Eqs. (5). Data for the following were taken from Qian and Goldstein 20: BLOSUM62, BLOSUM30 38; PAM250, PAM350, PAM500 39; GCB 40; STR 41; JTT42; BC0030 43; OPTIMA 44; D-BL2545; and STROMA 20.

We can also examine the probability distribution of gap lengths 26. We gathered these data either from short exhaustive binary sequences or from stochastic samples of longer sequences with larger alphabets. There are two ways that we can count gap frequencies and gap lengths (see Methods, above). First, for the equations given above, we have used a first-occurrence model in which the gap-length data are taken from the initial successful match of a template to a sequence. Alternatively, we can identify all matches of a template with a specific sequence, and for each match, tabulate the gap-length information (multiple-occurrence) model. This model appears to be closer to the empirical data reported by Qian and Goldstein 26.

Our basic findings are:

1. Most of the gap-length distributions can be approximated by an exponential, but those arising from larger alphabets and longer templates clearly have a more complex character. The distributions can be numerically fit as multiple exponentials similar to those found by Qian and Goldstein for sequence alignments of proteins of known structures. They can also be fit with polynomial functions. It is not obvious if these expanded functions carry any physical significance.
2. For the first-occurrence model, the exponential decay increases strongly with alphabet size (Table 5). However, for the multiple-occurrence model, the exponential decay is independent of alphabet size, although it increases with N and decreases with T (Table 6). If we use the single-exponential approximation and the treatment of Qian and Goldstein (see Eq. (10)), we get the range of gap penalties shown in Fig. 3. Our values are consistently on the low end of the empirical range.