The five basic steps in phylogenetic analysis of sequences (Hillis et al., 1993) organized as shown in Figure 13.1 consist of:

1. Sequence alignment

2. Assessing phylogenetic signal

3. Choosing methods for phylogenetic analysis

4. Construction of optimal phylogenetic tree

The sequence under study must be aligned so that positional homologues may be analyzed. Most methods of sequence alignment are designed for pairwise comparison. Two approaches to sequence alignments are used: a global alignment (Needleman and Wunsch, 1970) and a local alignment (Smith and Waterman, 1981). The former compares similarity across the full stretch of sequences, whereas the latter searches for regions of similarity in parts of the sequences (Chapter 11). Modifications of these approaches can also be used to align multiple sequences. Because the multiple alignment is inefficient with sequences if INDELs are common and substitution rates are high, most studies restrict comparisons to regions in which alignments are relatively obvious. Unless the number of taxa is few and INDELs are

uncommon, it is not feasible to ensure that the optimum alignment has been achieved. For this reason, regions of questionable alignments are often removed from consideration prior to phylogenetic analysis. In general, substitutions are more frequent between nucleotide/amino acids that are biochemically similar (e.g., Table 11.1). In the case of DNA, the transitions between purines-purine and py-rimidinespyrimidine are usually more frequent than the transversion between purine s pyrimidine and pyrimidine s purine. Such biases will affect the estimated divergence between two sequences. Specification of relative rates of substitution among particular residues usually takes the form of a square matrix; the number of rows/columns is 4 for DNA and 20 for proteins and 61 for codons. The diagonal elements represent the cost of having the same nucleotide/amino acid in different sequences. The off diagonal elements of the matrix correspond to the relative costs of going from one nucleotide/amino acid to another. A clustal program such as ClustalW (Higgins et al., 1996), which aligns sequences according to an explicitly phylogenetic criterion, is the most commonly used program for the multiple alignment of biochemical sequences.

For most sequence data, some positions are highly conserved, whereas other positions are randomized with respect to phylogenetic history. Thus assessment of phylogenetic signal is often required. In general, pairwise comparisons of the sequences are performed to evaluate the potential phylogenetic importance of the data. For example, the transition/transversion ratios for sequence pairs of DNA can be compared to those expected for the sequences at equilibrium, given the observed base compositions. DNA sequences that are largely free of homoplasy (convergence, parallelism, and reversal) will have transition/transversion ratios greater than those for sequences that are saturated by change, but similar to those observed for closely related taxa which remain highly structured. Similarly, pairwise divergences have been used to assess the potential phylogenetic value of DNA sequences, by plotting percent divergence against time. In such plots, regions of sequences or classes of character state transformations that are saturated by change do not show a significant positive relationship with time. Both the transition/transversion ratio and sequence divergence are influenced by homoplasy, thus both can provide insights into the potential phylogenetic value of sequence data. Another way of detecting the presence of a phylogenetic signal in a given data set is to examine the shape of the tree-length distribution (Fitch, 1979). Data sets with skewed tree-length distributions are likely to be phylogenetically informative. The data sets that produce significantly skewed tree-length distributions are also likely to produce the correct tree topology in phylogenetic analysis.

The following considerations may guide a choice of methods for phylogenetic analysis, namely, (1) assumptions about evolution, (2) parameters of sequence evolution, (3) the primary goal of analysis, (4) size of the data set, and (5) the limitation of computer time. Numerous methods are available, and each makes different assumptions about the molecular evolutionary process. It is unrealistic to assume that any one method can solve all problems, given the complexity of genomes/proteomes and their evolution. Phylogenetic analyses of sequences can be conducted by analyzing discrete characters such as aligned nucleotides and amino acids of the sequences (i.e., character-based approach) or by making pairwise comparisons of whole sequences (i.e. distance-based approach). Deciding whether to use a distance-based or a character-based method depends on the assumptions and the goals of the study.

Distance-Based Approaches. Distance-based methods use the amount of the distance (dissimilarity) between two aligned sequences to derive phylogenetic trees. A distance method would reconstruct the true tree if all genetic divergence events were accurately recorded in the sequence. Distance matrix methods simply count the number of differences between two sequences. This number is referred to as the evolutionary distance, and its exact size depends on the evolutionary model used. The actual tree is then computed from the matrix of distance values by running a clustering algorithm that starts with the most similar sequences or by trying to minimize the total branch length of the tree. Both the type and the position of a mutation have been incorporated into the parameters of divergence values. The simplest of the divergence measures is the two-parameter model of Kimura (Kimura, 1980). This model is designed to estimate divergence as well as the more complicated algorithms, over a broad range of divergences, without the need for additional specifics about the evolutionary process.

Methods for clustering distance data can be divided into those that provide single topology by following a specific series of steps (single-tree algorithms) and those that use optimality criterion to compare alternative topologies and to select the tree (multiple-tree algorithms). The representative single-tree algorithms are (a) the unweighed pair group method using arithmetic means (UPGMA) and (b) the neighbor joining method (NJ). The NJ (Saitou and Nei, 1987) has become a popular approach for analyzing sequence distances because of its ability to handle unequal rates, its connection to minimum-length trees, and its ease of calculation with regard to both tree topology and branch length. Several criteria of optimality are used for building distance trees. Each approach permits unequal rates and assumes additivity. The Fitch - Margoliash method (Fitch and Margoliash, 1967) minimizes the deviation between the observed pairwise distances and the path-length distances for all pairs of taxa on a tree. The best phylogenetic tree maximizes the fit of observed (original) versus tree-derived (patristic) distances, as measured by % standard deviation. Branch lengths are determined by linear algebraic calculations of observed distances among three different taxa interconnected by a common node. Minimum evolution (Saitou and Imanishi, 1989) aims to find the shortest tree that is consistent with the path lengths measured in a manner similar to that of Fitch-Margoliash approach without using all possible pairwise distances and all possible associated tree path lengths. Rather, it chooses the location of internal tree nodes based on the distance to external nodes, and then it optimizes the internal branch length according to the minimum measured error between these observed points. The phylogenetic tree is the one with minimal total overall length.

Character-Based Approaches. The individual aligned sites of biochemical sequences are equivalent to characters. The actual nucleotide or amino acid occupying a site is the character state. The character-based approaches treat each substitution separately rather than reducing all of the individual variation to a single divergence value. The relationships among organisms by the distribution of observed mutations are determined by counting each mutation event. These methods are preferred for studying character evolution, for combining multiple data sets, and for inferring ancestral genotypes. All sequence information is retained through the analyses. No information is lost in the conversion to distances. The approaches include parsimony, maximum likelihood, and method of invariants.

The principle of parsimony searches for a tree that requires the smallest number of changes to explain the differences observed among the taxa under study (Czelus-nlak et al., 1990). The method minimizes the number of evolutionary events required to explain the original data. In practical terms, the parsimony tree is the shortest and the one with the fewest changes (least homoplasy). Parsimony remains the most popular character-based approach for sequence data due to its logical simplicity, its ease of interpretation, its prediction of both ancestral character states and amount of change along branches, the availability of efficient programs for its implementation, and its flexibility of conducting character analyses.

The maximum likelihood method (Felsenstein, 1981) calculates the probability of a data set, given the particular model of evolutionary change and specific topology. The likelihood of changes in the data can be determined by considering each site separately with respect to a particular topology and model of molecular evolution. Therefore the maximum likelihood method depends largely on the model chosen and on how well it reflects the evolutionary properties of the macromolecule being studied. In practice, the maximum likelihood is derived for each base position in an alignment. An individual likelihood is calculated in terms of the probability that the pattern of variation produced at a site by a particular substitution process with reference to the overall observed base frequencies. The likelihood becomes the sum of the probabilities of each possible reconstruction of substitutions under a particular substitution process. The likelihoods for all the sites are multiplied to give an overall likelihood of the tree. Because of questions about the accuracy of the models, coupled with the computational complexities of the approach, maximum likelihood methods have not received wide attention.

When large amounts of homoplasy exist among distantly related branches, one can avoid the abundant homoplasy while recognizing the phylogenetic signal by relying on a few specific patterns of nucleotide variation that represent the most conservative changes (Lake, 1987). For example, three possible topologies for four taxa called operator invariants are calculated. These calculations are based on the variable positions with two purines and two pyrimidines. Zero-value invariants represent cases in which random multiple mutation events have canceled each other out, and as such a chi-squared (x2) or binomial test is used to identify the correct topology as the one with an invariant significantly greater than zero. There are 36 patterns of transitions/transversions for four taxa with three possible topologies. Different methods of phylogenetic inference rely on various combinations of these components, with 12 used to calculate the three operator invariants of evolutionary parsimony. When more than four taxa are considered, all possible groups of four are typically analyzed and a composite tree constructed from the individual results.

Once a method and appropriate software have been selected, the best tree under the selected optimality criterion must be estimated (Saitou, 1996). The number of distinct tree topologies is very large (Felsenstein, 1978), even for a modest number of taxa (e.g., over 2.8 x 1074 distinct, labeled, bifurcating trees for 50 taxa). For relatively few taxa (up to as many as 20 or 30) it is possible to use exact algorithms (algorithmic approach) that will find an optimal tree. For greater numbers of taxa, one must rely on heuristic algorithms that approximate the exact solutions but may not give the optimal solution under all conditions. The exact algorithm searches exhaustively through all possible tree topologies for the best solution(s). This method is computationally simple for 9 or fewer taxa and becomes impractical for 13 or more taxa. An alternative exact algorithm known as the branch-and-bound algorithm (Hendy and Penny, 1982) can be applied to speed up the analysis of data set consisting of less than 10-11 taxa. If the exact algorithms are not feasible for a given data set, various heuristic approaches can be attempted. Heuristic procedures (Stagle, 1971) are a computer simulation and programming philosophy suited to finding a problem solution exploiting any empirical trial, strategy, or shortcut by which the computer acquires knowledge of the structure of the problem space beyond its pure abstract definition. The heuristic methodology does not, however, guarantee (in contrast to the algorithmic approach) the discovery of all possible goals in an absolute sense. Most heuristic techniques start by finding a reasonably good estimate of optimal tree(s) and then attempting to find a better solution by examining structurally related trees. The initial tree is usually found by a stepwise addition of taxa sequentially to a tree at the optimal place in the growing tree. The subsequent improvement is achieved by examining related topologies by a series of procedures known as branch swapping. All involve rearranging branches of the initial tree to search for a shorter alternative.

To assess confidence of phylogenetic results, it is generally recommended that subsets of taxa be examined from within the more complex topologies by focusing on specific major questions targeted before the analysis. Alternatively, one could limit the comparisons to just those topologies considered plausible for biological reasons. The reliability in phylogenetic results can be assessed by analytical methods or resampling methods. Analytical techniques (Felsenstein, 1988; Lake, 1987) for testing phylogenetic reliability operate by comparing the support for one tree to that for another under the assumption of randomly distributed data. Resampling techniques estimate the reliability of a phylogenetic result by bootstrapping or jackknif-ing the characters of the original data set. The bootstrapping approach (Felsenstein, 1985; Hillis and Bull, 1993) creates a new data set of the original size by sampling the available characters with replacement, whereas the jackknife approach (Penny and Hendy, 1986) randomly drops one or more data points or taxa at a time, creating smaller data sets by sampling without replacement. The ultimate criterion for determining phylogenetic reliability rests on tests of congruence among independent data sets representing both molecular and non-molecular information (Hillis, 1987). Different character types and data sets are unlikely to be exposed to the same evolutionary biases, and as such, congruent results supported by each are more likely to reflect convergence onto the single, correct tree. Therefore congruence analysis provides an important mechanism with which to evaluate the reliability of different methods for constructing phylogenetic trees. Given a common data set, those approaches leading to the congruent result should be preferred over those that do not. Studies of congruence can also provide insights into the limitations and assumption of different tree-constructing algorithms. Congruence analysis further permits an evaluation of the reliability of different weighting schemes of character transformations used in a phylogenetic analysis.

The catalog of software programs for phylogenetic analyses is available at http://evolution.gentics.washington.edu/philip/software.html.

Was this article helpful?

## Post a comment