## Info

3 6|1245 59.5006

3 5|1246 59.7909 2 6|1345 59.9546 2 5|1346 60.3253 picked split 14 5 6 | 2 3 tree is 1 4 5 6 (2,3)

After the first step, we see that the split {{2, 3}, {1, 4, 5, 6}} is the best, so we join nodes 2 and 3 together in the tree and continue. Notice that the scores of the partitions roughly correspond to how close they are to being splits:

Partition Score

2 3 5 | 14 6 53.3838 picked split 1 2 3 | 4 5 6 tree is 4 5 6 (1,(2,3))

After the second step, we join node 1 to the {2, 3} cherry and continue:

Partition Score

5 6 | 1 234 6.5292 4 6 | 1 2 3 5 23.1477 4 5 | 1 2 3 6 23.3001 picked split 1 2 3 4 | 5 6 tree is 4 (1,(2,3)) (5,6)

We have found the last cherry, leaving us with 3 remaining groups which we join together to form an unrooted tree. □

19.5 Performance analysis

### 19.5.1 Building trees with simulated data

The idea of simulation is that we first pick a tree and simulate a model on that tree to obtain aligned sequence data. Then we build a tree using Algorithm 19.12 and other methods from that data and compare the answers to the original tree.

We used the program seq-gen [Rambaut and Grassly, 1997] to simulate data of various lengths for the tree in Figure 19.3 with the two sets of branch i/2

Fig. 19.3. The 8-taxa tree used for simulation with (a, b) = (0.01,0.07) and (0.02,0.19).

lengths given in Figure 19.3. This tree was chosen as a particularly difficult tree [Strimmer and von Haeseler, 1996, Ota and Li, 2000].

We simulated DNA data under the general reversible model (the most general model supported by seq-gen). Random numbers uniformly distributed between 1 and 2 were chosen on each run for the six rate matrix parameters (see Figure 4.7). The root frequencies were all set to 1/4.

Next, the data was collapsed to binary data (that is, A and G were identified, similarly C and T). We used binary data instead of DNA data because of numerical instability with SVD using the much larger matrices from the DNA data. It should be noted that Algorithm 19.12 performed better on binary data than on DNA data. This may be due to the instability, but it may also be because the rank conditions define the entire ideal for binary data.

We ran all tests using our Algorithm 19.12 as well as two algorithms from the PHYLIP package (see Section 2.5): neighbor-joining (i.e., Algorithm 2.41), and a maximum likelihood algorithm (dnaml). We used Jukes-Cantor distance estimation for neighbor-joining and the default settings for dnaml. All three algorithms took approximately the same amount of time, except for dnaml, which slowed down considerably for long sequences.

Figures 19.4 and 19.5 show the results of the simulations. Each algorithm was run 1000 times for each tree and sequence length. While SVD performed slightly worse than the others, it showed very comparable behavior. It should be noted that SVD constructs trees according to a much more general model than the other two methods, so it should be expected to have a higher variance.

19.5.2 Building trees with real data

For data, we use the October 2004 freeze of the ENCODE alignments. For detailed information on these, see Section 4.3, Chapters 21 and 22.

As in Chapter 21, we restricted our attention to 8 species: human, chimp, a = 0.01, b = 0.07