Colin Dewey Kevin Woods

Biological sequence alignment is a fundamental problem in computational biology. Section 2.2 introduced some scoring schemes and algorithms used for global pairwise alignment. Each scoring scheme is dependent on a set of parameters. As shown in Example 2.14, the optimal alignment can change significantly as these parameters are varied. We would like to know how the results obtained from alignment programs are affected by the values of the parameters and how confident we can be in a given optimal alignment. Such questions can be answered using parametric sequence alignment methods. Here we introduce the techniques of parametric sequence alignment and apply them to characterize different scoring schemes. Parametric alignment algorithms can be implemented almost as efficiently as algorithms for conventional sequence alignment. Hence, parametric algorithms may become powerful and important components of the computational biologist's toolbox.

The primary goal of biological sequence alignment is to match up positions in the input sequences that are homologous. Two sequence positions are homologous if the characters at those positions are derived from the same position in some ancestral sequence. It is important to note that two positions can be homologous even though the states of the positions are different. For example, there may exist sequences ct1 and ct2 such that position 5 in ct1 and position 9 in ct2 are homologous despite the fact that ct1 = A and ct2 = C. Alignments indicate that positions in two sequences are homologous by matching up the characters at those positions. An alignment is biologically correct if it matches up all positions that are truly homologous and no others. Because this chapter focuses on the global alignment problem for two sequences, we will ignore cases in which a duplication event causes one position to be homologous to multiple positions in the other sequence.

What does it mean for a global alignment to be "optimal" ? If our goal is to discover the biological truth, we should say that an "optimal" global alignment is one that is biologically correct. Having biologically correct alignments is critical, because other analyses, such as phylogenetic tree reconstruction, are heavily dependent on sequence alignments. Unfortunately, in most cases, we do not know what the biological truth is. To make a guess at the truth, we treat sequence alignment as an optimization problem, where we choose the alignment that is optimal with respect to some objective function. Although many types of objective functions are imaginable, the most common is to assign a score to each alignment, according to a scoring scheme such as the one described in Section 2.2. We then seek to find the alignment that maximizes this score. Which scoring scheme should we use to guess at the biologically correct alignment? Once we have chosen the parameter values for a given scoring scheme, how confident can we be that the resulting alignment is a good guess? It could be the case that, if we vary our chosen parameter values just slightly, we will obtain very different alignments. Even worse, it could be that no values for the parameters of our chosen scoring scheme will give the biologically correct alignment as being optimal.

The methods of parametric sequence alignment help to answer these questions, for specific input sequences, by analyzing a scoring scheme over all possible parameter values. Specifically, parametric sequence alignment subdivides the parameter space into regions such that parameter values in the same region give rise to the same optimal alignments. Such a subdivision tells us which of the exponentially many possible alignments can be optimal for some choice of parameter values and how many classes of optimal alignments there can be. In addition, if we find that the point in the parameter space corresponding to our current choice of values for the parameters is close to the boundary of a region in the subdivision, we may be less confident in our results. Lastly, we may wish to take a Bayesian approach and place a prior distribution on the parameters. In this case, we can determine what the most likely optimal alignments are by integrating over the different regions in the subdivision (note that this is different from finding the most likely alignment) [Pachter and Sturmfels, 2004a].

Parametric sequence alignment is feasible, because, although two given sequences have exponentially many possible alignments, there are only a few subsets of these alignments (corresponding to regions of the parameter space subdivision) that can be optimal for some choice of parameters. For two sequences of length at most n, it has been shown that, for a simple scoring scheme with 2 parameters allowed to vary, the number of regions in the subdivision is 0{n3) [Gusfield et al, 1994], This bound corresponds exactly with the bound given in Section 5.3 for the number of vertices of a sequence alignment Newton polytope with d = 2. In fact, for a scoring scheme with any number of parameters, the number of regions is bounded by a polynomial in n (see Section 5.3). The bound on the number of regions in the subdivision allows for the subdivision to be determined in just slightly more time than it takes to simply align the sequences in question. For the simple scoring scheme just mentioned, the subdivision can be found in 0{n3) time, as opposed to 0(n2) time for aligning the sequences with a fixed set of parameter values.

The concept of parametric sequence alignment is not a new one, and there are several programs available to perform parametric analysis [Gusfield, 1997].

Existing methods are restricted to the analysis of scoring schemes with at most 2 parameters allowed to vary at one time, whereas we may also like to analyze scoring schemes that have many parameters, such as the general 33-parameter scoring scheme described in Section 2.2. In this chapter, we apply the techniques of parametric inference described in Chapter 5 to the problem of sequence alignment and thus give a general method for parametric sequence analysis with scoring schemes involving any number of parameters.

In this section, we will describe how to efficiently compute a parametric sequence alignment of two sequences, ct1 and ct2, with lengths n and m, respectively. While the method we describe can be used to analyze a fully-parameterized scoring scheme (with the 33 parameters comprising the matrices shown in the equations (2.11) and (2.12)), we will concentrate on a simple 4-parameter scoring scheme. In this scoring scheme we have parameters M, X, S, and G, corresponding to the weights for matches, mismatches, spaces (symbols in the alignment), and gaps (contiguous sets of spaces), respectively. This scoring scheme is just a special case of the general scoring scheme with

Because we expect biologically correct alignments to have a large number of matches and a limited number of gaps, the parameters of this scoring scheme are normally set to reward matches and penalize gaps. We will refer to biologically reasonable parameter values as those that satisfy

An even simpler scoring scheme, which we will refer to as the 3-parameter scoring scheme, lacks a gap parameter (that is, it is the 4-parameter scoring scheme with G = 0). We shall present a method for parametric alignment with the 3-parameter scoring scheme, but it should be noted that this method easily generalizes to the 4-parameter and 33-parameter scoring schemes. With the 3-parameter scoring scheme, the weight of an alignment, h, is

WctV2 (h) = Mmh + Xxh + Ssh, where mh, xh, and sh denote the number of matches, mismatches, and spaces in h, respectively. We define the monomial fal,a2,h(0M ,0x ,0s )= 0Mh ®XXh ^S

Was this article helpful?

## Post a comment