The Strand Symmetric Model

Marta Casanellas Seth Sullivant

This chapter is devoted to the study of strand symmetric Markov models on trees from the standpoint of algebraic statistics. A strand symmetric Markov model is one whose mutation probabilities reflect the symmetry induced by the double-stranded structure of DNA (see Chapter 4). In particular, a strand symmetric model for DNA must have the following equalities of probabilities in the root distribution:

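$$\pi_A = \pi_T \quad \text{and} \quad \pi_C = \pi_G,$$

and the following equalities of probabilities in the transition matrices $(\theta_{ij})$:

$$\theta_{AA} = \theta_{TT}, \quad \theta_{AC} = \theta_{TG}, \quad \theta_{AG} = \theta_{TC}, \quad \theta_{AT} = \theta_{TA},$$
$$\theta_{CA} = \theta_{GT}, \quad \theta_{CC} = \theta_{GG}, \quad \theta_{CG} = \theta_{GC}, \quad \theta_{CT} = \theta_{GA}.$$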

Important special cases of strand symmetric Markov models are the group-based phylogenetic models, including the Jukes-Cantor model and the Kimura 2- and 3-parameter models. The general strand symmetric model, or in this chapter just the strand symmetric model (SSM), has only these eight equalities of probabilities in the transition matrices and no further restriction on the transition probabilities. Thus, for each edge in the corresponding phylogenetic model, there are 6 free parameters: the eight equalities leave 8 distinct entries among the 16, and the two independent row-sum conditions remove 2 more.
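The parameter count can be made concrete with a minimal Python sketch. It builds a strand symmetric transition matrix from six free parameters, assuming the complement pairing A with T and C with G; the function name `ssm_matrix` and the numerical values are hypothetical and chosen only for illustration.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def ssm_matrix(a_row, c_row):
    """Build a strand symmetric transition matrix as a dict theta[(x, y)].

    a_row = (theta_AC, theta_AG, theta_AT) and c_row = (theta_CA, theta_CG, theta_CT)
    are the six free parameters; the diagonal entries follow from the row sums.
    """
    theta = {}
    theta[("A", "C")], theta[("A", "G")], theta[("A", "T")] = a_row
    theta[("A", "A")] = 1.0 - sum(a_row)
    theta[("C", "A")], theta[("C", "G")], theta[("C", "T")] = c_row
    theta[("C", "C")] = 1.0 - sum(c_row)
    # Rows T and G are forced by theta_xy = theta_{comp(x) comp(y)}.
    for x in "AC":
        for y in BASES:
            theta[(COMPLEMENT[x], COMPLEMENT[y])] = theta[(x, y)]
    return theta

# Illustrative (hypothetical) parameter values.
theta = ssm_matrix((0.10, 0.05, 0.02), (0.07, 0.03, 0.04))
distinct_values = {round(v, 12) for v in theta.values()}
```

Here the rows for T and G carry no parameters of their own, which is why only the A and C rows contribute to the count.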

Our motivation for the study of the SSM over the more commonly used group-based models comes from the fact that the SSM captures more biologically meaningful features of real DNA sequences that the group-based models fail to encode. For instance, in any group-based model, the stationary distribution of bases for a single species is always the uniform distribution. On the other hand, computational evidence [Yap and Pachter, 2004] suggests that the stationary distribution of bases for a single species is rarely uniform, but must always be strand symmetric. The SSM has the property that its stationary distributions can be general strand symmetric distributions. We express the SSM as an algebraic statistical model, so we make no mention of rate matrices in this chapter. In this sense, the SSM does not fit into the Felsenstein hierarchy (see Section 4.5), though we still feel it is an important model to study.
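As a small numerical illustration of the claim about stationary distributions, the following Python sketch approximates the stationary distribution of one strand symmetric transition matrix by power iteration. The matrix entries are hypothetical values chosen for illustration, and `stationary_distribution` is a helper name introduced here, not part of the model.

```python
BASES = "ACGT"
# Hypothetical strand symmetric transition matrix (rows and columns ordered A, C, G, T).
THETA = [
    [0.83, 0.10, 0.05, 0.02],  # from A
    [0.07, 0.86, 0.03, 0.04],  # from C
    [0.04, 0.03, 0.86, 0.07],  # from G (the C row under A<->T, C<->G)
    [0.02, 0.05, 0.10, 0.83],  # from T (the A row under A<->T, C<->G)
]

def stationary_distribution(theta, steps=2000):
    """Approximate the stationary distribution by repeatedly applying pi <- pi * theta."""
    pi = [1.0, 0.0, 0.0, 0.0]  # deliberately start far from strand symmetry
    for _ in range(steps):
        pi = [sum(pi[x] * theta[x][y] for x in range(4)) for y in range(4)]
    return dict(zip(BASES, pi))

pi = stationary_distribution(THETA)
```

For these values the limit is strand symmetric but not uniform: the A and T entries agree at roughly 0.21, and the C and G entries agree at roughly 0.29.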

For the standard group-based models (i.e. Jukes-Cantor and Kimura), the transition matrices and the entire parameterization can be simultaneously diagonalized by means of the Fourier transform of the group $\mathbb{Z}_2 \times \mathbb{Z}_2$ [Evans and Speed, 1993, Szekely et al., 1993]. In addition to the practical uses of the Fourier transform for group-based models (see for example [Semple and Steel, 2003]), this diagonalization makes it possible to compute phylogenetic invariants for these models, by reducing the problem to the claw tree $K_{1,3}$ [Sturmfels and Sullivant, 2005]. Our goal in this chapter is to extend the Fourier transform from group-based models to the strand symmetric model. This is carried out in Section 16.1.

In Section 16.2 we focus on the case of the 3-taxa tree. The computation of phylogenetic invariants for the SSM in the Fourier coordinates is still not complete, though we report on what is known about these invariants. In particular, we describe all invariants of degree three and four. Section 16.4 is concerned with extending known invariants from the 3-taxa tree to an arbitrary tree. In particular, we describe how to extend the given degree three and four invariants from Section 16.2 to an arbitrary binary tree. To do this, we introduce G-tensors and explore their properties in Section 16.3.

In Section 16.5, we extend the "gluing" results for phylogenetic invariants in [Allman and Rhodes, 2004a] and [Sturmfels and Sullivant, 2005]. Our exposition and inspiration come mainly from the work of Allman and Rhodes, and we deduce that the problem of determining defining phylogenetic invariants for the strand symmetric model reduces to finding phylogenetic invariants for the claw tree $K_{1,3}$. Here defining means a set of polynomials which generate the ideal of invariants up to radical; that is, defining invariants have the same zero set as the whole ideal of invariants. This result is achieved by proving some "block diagonal" versions of results which appear in the Allman and Rhodes paper. This line of attack is the heart of Sections 16.3 and 16.5.

16.1 Matrix-valued Fourier transform

In this section we introduce the matrix-valued group-based models and show that the strand symmetric model is one such model. Then we describe the matrix-valued Fourier transform and the resulting simplification in the parameterization of these models, with special emphasis on the strand symmetric model.

Let T be a rooted tree with n taxa. First, we describe the random variables associated to each vertex v in the tree in the matrix-valued group-based models. Each random variable takes on kl states where k is the cardinality of a finite abelian group G and l is a parameter of the model. The states of the random variable are 2-tuples $\binom{j}{i}$ where $j \in G$ and $i \in \{0, 1, \ldots, l-1\}$.

Associated to the root node R in the tree is the root distribution $R^j_i$; that is, $R^j_i$ is the probability that the random variable at the root is in state $\binom{j}{i}$. For each edge E of T, the double indexed set of parameters $E^{j_1 j_2}_{i_1 i_2}$ are the entries in the transition matrix associated to this edge. We use the convention that E is both the edge and the transition matrix associated to that edge, to avoid the need for introducing a third index on the matrices. Thus, $E^{j_1 j_2}_{i_1 i_2}$ is the conditional probability of making a transition from state $\binom{j_1}{i_1}$ to state $\binom{j_2}{i_2}$ along the edge E.

Definition 16.1 A phylogenetic model is a matrix-valued group-based model if for each edge, the matrix transition probabilities satisfy

$$E^{j_1 j_2}_{i_1 i_2} = E^{k_1 k_2}_{i_1 i_2}$$

when $j_1 - j_2 = k_1 - k_2$ (where the difference is taken in G) and the root distribution probabilities satisfy $R^j_i = R^k_i$ for all $j, k \in G$.

Example 16.2 Consider the strand symmetric model and make the identification of the states $A = \binom{0}{0}$, $G = \binom{0}{1}$, $T = \binom{1}{0}$, and $C = \binom{1}{1}$. One can check directly from the definitions that the strand symmetric model is a matrix-valued group-based model with $l = 2$ and $G = \mathbb{Z}_2$. □
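The direct check in Example 16.2 can be sketched in Python. We write each state as a pair (j, i) with j in $\mathbb{Z}_2$ and i in {0, 1}, taking A = (0, 0), G = (0, 1), T = (1, 0), C = (1, 1), and test that entries of a strand symmetric matrix agree whenever the index differences $j_1 - j_2$ and $k_1 - k_2$ agree in $\mathbb{Z}_2$ and the lower indices match. The matrix entries and the helper names `E` and `is_matrix_valued_group_based` are assumptions made for this illustration.

```python
from itertools import product

# Identification of Example 16.2, written as (j, i) with j in Z_2 and i in {0, 1}.
STATE = {"A": (0, 0), "G": (0, 1), "T": (1, 0), "C": (1, 1)}
BASE = {ji: b for b, ji in STATE.items()}

# A hypothetical strand symmetric matrix; only the A and C rows are specified.
ROWS = {
    "A": {"A": 0.83, "C": 0.10, "G": 0.05, "T": 0.02},
    "C": {"A": 0.07, "C": 0.86, "G": 0.03, "T": 0.04},
}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}
theta = {(x, y): p for x, row in ROWS.items() for y, p in row.items()}
# The T and G rows are copies of the A and C rows under complementation.
theta.update({(COMP[x], COMP[y]): p for (x, y), p in list(theta.items())})

def E(j1, j2, i1, i2):
    """Entry E^{j1 j2}_{i1 i2}: transition from state (j1, i1) to state (j2, i2)."""
    return theta[(BASE[(j1, i1)], BASE[(j2, i2)])]

def is_matrix_valued_group_based():
    """Check E^{j1 j2}_{i1 i2} = E^{k1 k2}_{i1 i2} whenever j1 - j2 = k1 - k2 in Z_2."""
    for i1, i2 in product(range(2), repeat=2):
        for j1, j2, k1, k2 in product(range(2), repeat=4):
            if (j1 - j2) % 2 == (k1 - k2) % 2 and E(j1, j2, i1, i2) != E(k1, k2, i1, i2):
                return False
    return True
```

Under this identification, complementation flips the group index j and fixes i, so it preserves the difference of the upper indices; this is the reason the check succeeds.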

To avoid even more cumbersome notation, we will restrict attention to binary trees T and to the strand symmetric model for DNA. While the results of Sections 16.2 and 16.4 are exclusive to the case of the SSM, our other results can be easily extended to arbitrary matrix-valued group-based models with the introduction of the more general Fourier transform, though we will not explain these generalizations here.

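We assume all edges of T are directed away from the root R. Given an edge E of T, let s(E) denote the initial vertex of E and t(E) the trailing vertex. Then the parameterization of the phylogenetic model is given as follows. The probability of observing states $\binom{j_1}{i_1}, \binom{j_2}{i_2}, \ldots, \binom{j_n}{i_n}$ at the leaves is

$$p^{j_1 j_2 \ldots j_n}_{i_1 i_2 \ldots i_n} = \sum_{\left(\binom{j_v}{i_v}\right) \in H} R^{j_R}_{i_R} \prod_{E} E^{j_{s(E)} j_{t(E)}}_{i_{s(E)} i_{t(E)}}$$

where the product is taken over all edges E of T and the sum is taken over the set

$$H = \left\{ \left(\tbinom{j_v}{i_v}\right)_{v \in \mathrm{IntV}(T)} \;:\; \tbinom{j_v}{i_v} \in G \times \{0, 1, \ldots, l-1\} \right\}.$$

Here IntV(T) denotes the interior or non-leaf vertices of T.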

Example 16.3 For the 3-leaf claw tree, the parameterization is given by the expression:

$$p^{lmn}_{ijk} = R^0_0 A^{0l}_{0i} B^{0m}_{0j} C^{0n}_{0k} + R^1_0 A^{1l}_{0i} B^{1m}_{0j} C^{1n}_{0k} + R^0_1 A^{0l}_{1i} B^{0m}_{1j} C^{0n}_{1k} + R^1_1 A^{1l}_{1i} B^{1m}_{1j} C^{1n}_{1k}.$$

The study of this particular tree will occupy a large part of the paper. □
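The claw tree parameterization of Example 16.3 can be evaluated directly, as in the following Python sketch. States are pairs (j, i), the only interior vertex is the root, and each edge matrix is assumed to depend on the upper indices only through their sum in $\mathbb{Z}_2$. All numerical values and the names `edge_matrix` and `claw_probability` are hypothetical.

```python
from itertools import product

STATES = list(product(range(2), repeat=2))  # states (j, i), j in Z_2, i in {0, 1}

def edge_matrix(f):
    """E[(j1, i1), (j2, i2)] = f[(j1 + j2) % 2][i1][i2], a matrix-valued group-based edge."""
    return {((j1, i1), (j2, i2)): f[(j1 + j2) % 2][i1][i2]
            for (j1, i1), (j2, i2) in product(STATES, repeat=2)}

# Illustrative values: for each i1, the entries f[d][i1][i2] sum to 1 over d and i2.
A = edge_matrix([[[0.83, 0.05], [0.04, 0.86]], [[0.02, 0.10], [0.07, 0.03]]])
B = edge_matrix([[[0.70, 0.10], [0.20, 0.60]], [[0.10, 0.10], [0.10, 0.10]]])
C = edge_matrix([[[0.90, 0.02], [0.05, 0.80]], [[0.03, 0.05], [0.05, 0.10]]])
# Root distribution with R^0_i = R^1_i, keyed by (j, i).
R = {(0, 0): 0.3, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.2}

def claw_probability(R, A, B, C, leaf_a, leaf_b, leaf_c):
    """Probability of the three leaf states, summing over the four root states."""
    return sum(R[root] * A[root, leaf_a] * B[root, leaf_b] * C[root, leaf_c]
               for root in STATES)

p = {leaves: claw_probability(R, A, B, C, *leaves) for leaves in product(STATES, repeat=3)}
total = sum(p.values())
```

The dictionary `p` holds all 64 probability coordinates; they sum to one, and flipping the group index at every leaf simultaneously leaves each coordinate unchanged.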

Because of the role of the group in determining the symmetry in the parameterization, the Fourier transform can be applied to make the parameterization simpler. We will not define the Fourier transform in general, only in the specific case of the group $\mathbb{Z}_2$. The Fourier transform applies to all of the probability coordinates, the transition matrices and the root distribution.

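Definition 16.4 The Fourier transform of the probability coordinates is

$$q^{j_1 j_2 \ldots j_n}_{i_1 i_2 \ldots i_n} = \sum_{k_1, k_2, \ldots, k_n \in \{0,1\}} (-1)^{k_1 j_1 + k_2 j_2 + \cdots + k_n j_n} \, p^{k_1 k_2 \ldots k_n}_{i_1 i_2 \ldots i_n}.$$

The Fourier transform of the transition matrix E is the matrix e with entries

$$e^{j_1 j_2}_{i_1 i_2} = \frac{1}{2} \sum_{k_1, k_2 \in \{0,1\}} (-1)^{k_1 j_1 + k_2 j_2} \, E^{k_1 k_2}_{i_1 i_2}.$$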

The Fourier transform of the root distribution is

$$r^j_i = \sum_{k \in \{0,1\}} (-1)^{kj} R^k_i.$$
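To see the effect of the transform on a single edge, the following Python sketch applies the formula for the Fourier transform of a transition matrix to a matrix-valued group-based edge over $\mathbb{Z}_2$. The entries of `f` are hypothetical values, and `edge_matrix` and `fourier_edge` are helper names introduced only for this illustration.

```python
from itertools import product

def edge_matrix(f):
    """E[j1, j2, i1, i2] = f[(j1 + j2) % 2][i1][i2], a matrix-valued group-based edge."""
    return {(j1, j2, i1, i2): f[(j1 + j2) % 2][i1][i2]
            for j1, j2, i1, i2 in product(range(2), repeat=4)}

def fourier_edge(E):
    """e[j1, j2, i1, i2] = 1/2 * sum over k1, k2 of (-1)^(k1*j1 + k2*j2) * E[k1, k2, i1, i2]."""
    return {(j1, j2, i1, i2): 0.5 * sum((-1) ** (k1 * j1 + k2 * j2) * E[k1, k2, i1, i2]
                                        for k1, k2 in product(range(2), repeat=2))
            for j1, j2, i1, i2 in product(range(2), repeat=4)}

# Illustrative values only: for each i1, the entries f[d][i1][i2] sum to 1 over d and i2.
f = [[[0.83, 0.05], [0.04, 0.86]], [[0.02, 0.10], [0.07, 0.03]]]
e = fourier_edge(edge_matrix(f))
off_diagonal = [e[j1, j2, i1, i2]
                for j1, j2, i1, i2 in product(range(2), repeat=4) if j1 != j2]
```

In this sketch the entries of `e` with $j_1 \neq j_2$ vanish up to rounding, so the transformed matrix is block diagonal with two 2 x 2 blocks, given entrywise by `f[0] + f[1]` and `f[0] - f[1]`.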