## Bcde Ggggggg the alignment polygon.

Chapter 8 raised the question of whether all the lattice points inside the projected alignment polygon Pt come from alignments of the pair of sequences t. The construction in the proof of Proposition 9.4 gives instances for which some lattice points inside Pt do not come from any alignment.

For example, take Pt to be the projection of the alignment polygon corresponding to the pair of sequences in Figure 9.2. The points (1,1), (2,1) and (2,2) lie inside Pt. However, there is no alignment of these sequences with less than 3 insertions having at least one match, so these points do not correspond to alignments. Figure 9.3 shows exactly which lattice points in Pt come from alignments of this pair of sequences.

Next we will show that, even in the case of the binary alphabet, our quadratic upper bound on the number of inference functions of the 2-parameter model for sequence alignment is tight as well. Thus, the large alphabet £' from Proposition 9.4 is not needed to obtain Q(n2) slopes in the alignment polytopes.

Proposition 9.5 Consider the 2-parameter model for sequence alignment for two observed sequences of length n and let £' = {0,1} be the binary alphabet. The number of inference functions of this model is 6(n2). Fig. 9.3. The thick dots are the points (y, z) giving the number of insertions and matches in alignments of the sequences in Figure 9.2.

Proof We follow the same idea as in the proof of Proposition 9.4. We will construct a collection of pairs of binary sequences t = (ct^ct2) so that the total number of different slopes of the edges of the polygons NP(/r) is Q(n2). This will imply that the number of cones in /\TN(NP(/T)) is Q(n2), where t ranges over all pairs of binary sequences of length n.

Recall that PT denotes the projection of NP(/T) onto the yz-plane, and that it is a lattice polygon contained in [0, n]2.

We claim that for any positive integers u and v with u < v and 6v — 2u < n, there exists a pair t of binary sequences of length n such that PT has an edge of slope u/v. This will imply that the number of different slopes created by the edges of the polygons PT is Q(n2).

Thus, it only remains to prove the claim. Given positive integers u and v as above, let a := 2v, b := v — u. Assume first that n = 6v — 2u = 2a + 2b. Consider the sequences ct1 = 0a1b0b1a, ct2 = 1a 0b 1b 0a, where 0a indicates that the symbol 0 is repeated a times. Let t = (ct1,ct2). Then, it is not hard to see that the polygon PT for this pair of sequences has four vertices: vo = (0,0), v1 = (b, 3b), v2 = (a + b, a + b) and v3 = (n, 0). The slope of the edge between v\ and v2 is ^^ =

If n > 6v — 2u = 2a + 2b, we just append 0n-2a-2b to both sequences ct1 and ct2. In this case, the vertices of PT are (0,n — 2a — 2b), (b, n — 2a + b), (a + b, n — a — b), (n, 0) and (n — 2a — 2b, 0).

Note that if v — u is even, the construction can be done with sequences of length n = 3v — u, by taking a := v, b := ^f-. Figure 9.4 shows the alignment graph and the polygon Pr for a = 7, b = 2. □

In most cases, one is interested only in those inference functions that are biologically meaningful. In our case, meaningful values of the parameters occur when a, ^ > 0, which means that mismatches and spaces are penalized instead of rewarded. Sometimes one also requires that a < ^, which means that a mismatch should be penalized less than two spaces. It is interesting to observe that our constructions in the proofs of Propositions 9.4 and 9.5 not only show that the total number of inference functions is Q(n2), but also that the number of biologically meaningful ones is still Q(n2). This is because the different rays Fig. 9.4. A pair of binary sequences of length 18 giving the slope 3/7 in their alignment polytope.

created in our construction have a biologically meaningful direction in the parameter space.

Let us now relate the results from this section with the bounds given in the previous chapter. In Chapter 8 we saw that in the 2-parameter model for sequence alignment, if t is a pair of sequences of length n in an arbitrarily large alphabet, then the polygon PT can have 0(n2/3) vertices in the worst case. In Proposition 9.4 we have shown that the Minkowski sum of these polygons for all possible such observations t, namely QT PT, has 6(n2) vertices.

In the case where the alphabet for the sequences is binary (or more generally, finite), we conjectured in Chapter 8 that the polygon PT can only have @(s/n.) vertices. In Proposition 9.5 we have proved that the polygon QT PT, where t runs over all pairs of binary sequences of length n, has 6(n2) vertices as well.

An interpretation of Theorem 9.3 is that the ability to change the values of the parameters of a graphical model does not give as much freedom as it may appear. There is a very large number of possible ways to assign an explanation to each observation. However, only a tiny proportion of these come from a consistent method for choosing the most probable explanation for a certain choice of parameters. Even though the parameters can vary continuously, the number of different inference functions that can be obtained is at most polynomial in the number of edges of the model, assuming that the number of parameters is fixed.

In the case of sequence alignment, the number of possible functions that associate an alignment to each pair of sequences of length n is doubly-exponential in n. However, the number of functions that pick the alignment with highest score in the 2-parameter model, for some choice of the parameters a and 0, is only 6(n2). Thus, most ways of assigning alignments to pairs of sequences do not correspond to any consistent choice of parameters. If we use a model with more parameters, say d, the number of inference functions may be larger, but still polynomial in n, namely O(nd(d-1)).

Having shown that the number of inference functions of a graphical model is polynomial in the size of the model, an interesting next step would be to find an efficient way to precompute all the inference functions for given models. This would allow us to give the answer (the explanation) to a query (an observation) very quickly. It follows from this chapter that it is computationally feasible to precompute the polytope NP(f), whose vertices correspond to the inference functions. However, the difficulty arises when we try to describe a particular inference function efficiently. The problem is that the characterization of an inference function involves an exponential number of observations.