A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In general, the input set of query sequences are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations (single amino acid or nucleotide changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (or indels) that appear as gaps in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
"Multiple sequence alignment" also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length are nearly impossible to align by hand, computational algorithms are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than pairwise alignment because they are more computationally complex to produce. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.
For n individual sequences, the method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise dynamic programming. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. To find the global optimum for n sequences this way has been shown to be an NP-complete problemWang L, Jiang T. (1994) On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.Just W. (2001). Computational complexity of multiple sequence alignment with SP-score. J Comput Biol 8(6):615-23.. Methods to reduce the search space by first performing pairwise dynamic programming on each pair of sequences in the query set and searching only the solution space near these results (effectively finding the intersection between local paths immediately surrounding each pairwise optimum solution) render the dynamic programming technique more efficient. The so-called "sum of pairs" method has been implemented in the software package MSA, but it is still impractical for many MSA applications that require the simultaneous alignment of dozens or even a few hundred sequences. Dynamic programming methods are now used only when an extremely high-quality alignment of a small number of sequences is needed, and as a benchmarking standard in evaluating new or refined heuristic techniques.
One major limitation of progressive methods is their heavy dependence on the initial assignment of relatedness and on the quality of the initial alignment. The methods are thus sensitive as well to the distribution of sequences in the query set; performance improves when relatedness among query sequences is a relatively smooth gradient rather than distantly separated clusters. Performance also degrades significantly when all of the sequences in the set are rather distantly related, because inaccuracies in the initial alignment are then more likely. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. Judicious choice of weighting can aid in evaluating relatedness and mitigate the effects of relatively poor initial alignments early in the progression.
Progressive alignment methods are efficient enough to implement on a large scale for many sequences and are often run on publicly accessible web servers so users need not locally install the applications of interest. A very popular progressive alignment method is the Clustal familyHiggins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44., especially the weighted variant ClustalWThompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680. to which access is provided by a large number of web portals including GenomeNet, EBI, and EMBNet. Different portals or implementations can vary in user interface and make different parameters accessible to the user. Clustal is used extensively for phylogenetic tree construction and as input for protein structure prediction by homology modeling.
Another common progressive alignment method called T-CoffeeNotredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17. is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-coffee uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple reqions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors.
Because progressive methods are heuristics that are not guaranteed to converge to a global optimum, alignment quality can be difficult to evaluate and their true biological significance can be obscure. A very recent semi-progressive method that improves alignment quality and does not use a lossy heuristic while still running in polynomial timeSze SH, Lu Y, Yang Q. (2006). A polynomial time solvable formulation of multiple sequence alignment. J Comput Biol 13(2):309-19. has been implemented in the program PSAlign.
A variety of subtly different iteration methods have been implemented and made available in software packages; reviews and comparisons have been useful but generally refrain from choosing a "best" techniqueHirosawa M, Totoki Y, Hoshida M, Ishikawa M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13-18.. The software package PRRN/PRRP uses a hill-climbing algorithm to optimize its MSA alignment scoreGotoh O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264(4):823-38. and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSAMount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.. PRRP performs best when refining an alignment previously constructed by a faster method.
Another iterative program, DIALIGN, takes an unusual approach of focusing narrowly on local alignments between sub-segments or sequence motifs without introducing a gap penaltyBrudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B. (2003) Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 4:66.. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment. DIALIGN is also available as a web portal at CHAOS/DIALIGN.
A third popular iteration-based method called MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequencesEdgar RC. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5), 1792-97.. The distance measure is updated between iteration stages (although, in its original form, MUSCLE contained only 2-3 iterations depending on whether refinement was enabled). A web portal and download site is available at MUSCLE.
Typical HMM-based methods work by representing an MSA as a form of directed acyclic graph known as a partial-order graph, which consists of a series of nodes representing possible entries in the columns of an MSA. In this representation a column that is absolutely conserved (that is, that all the sequences in the MSA share a particular character at a particular position) is coded as a single node with as many outgoing connections as there are possible characters in the next column of the alignment. In the terms of a typical hidden Markov model, the observed states are the individual alignment columns and the "hidden" states represent the presumed ancestral sequence from which the sequences in the query set are hypothesized to have descended. A efficient search variant of the dynamic programming method, known as the Viterbi algorithm, is generally used to successively align the growing MSA to the next sequence in the query set to produce a new MSAHughey R, Krogh A. (1996). Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 12(2):95-107.. This is distinct from progressive alignment methods because the alignment of prior sequences is updated at each new sequence addition. However, like progressive methods, this technique can be influenced by the order in which the sequences in the query set are integrated into the alignment, especially when the sequences are distantly related.
Several software programs are available in which variants of HMM-based methods have been implemented and which are noted for their scalability and efficiency, although properly using an HMM method is more complex than using more common progressive methods. The simplest is POA (Partial-Order Alignment, temporarily available for download via SourceForge); a similar but more generalized method is implemented in the package SAM (Sequence Alignment and Modeling System), which is also available as a web portal. SAM has been used as a source of alignments for protein structure prediction to participate in the CASP structure prediction experiment and to develop a database of predicted proteins in the yeast species S. cerevisiae. A database-search technique based on HMM methods is available in the program HMMer.
The technique of simulated annealing, by which an existing MSA produced by another method is refined by a series of rearrangements designed to find more optimal regions of alignment space than the one the input alignment already occupies. Like the genetic algorithm method, simulated annealing maximizes an objective function like the sum-of-pairs function. Simulated annealing uses a metaphorical "temperature factor" that determines the rate at which rearrangements proceed and the likelihood of each rearrangement; typical usage alternates periods of high rearrangement rates with relatively low likelihood (to explore more distant regions of alignment space) with periods of lower rates and higher likelihoods to more thoroughly explore local minima near the newly "colonized" regions. This approach has been implemented in the program MSASA (Multiple Sequence Alignment by Simulated Annealing)Kim J, Pramanik S, Chung MJ. (1994). Multiple sequence alignment using simulated annealing. Comput Appl Biosci 10(4):419-26..
Blocks analysis is a method of motif finding that restricts motifs to ungapped regions in the alignment. Blocks can be generated from an MSA or they can be extracted from unaligned sequences using a precalculated set of common motifs previously generated from known gene familiesHenikoff S, Henikoff JG. (1991). Automated assembly of protein blocks for database searching. Nucleic Acids Res 19:6565-72.. Block scoring generally relies on the spacing of high-frequency characters rather than on the calculation of an explicit substitution matrix. A server for locating motifs in unaligned sequences is located at BLOCKS.
Statistical pattern-matching has been implemented using both the expectation-maximization algorithm and the Gibbs sampler. One of the most common motif-finding tools, known as MEME, uses expectation maximization and hidden Markov methods to generate motifs that are then used as search tools by its companion program MAST. Both are available at MEME/MAST.
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Multiple sequence alignment".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world