A sequence alignment in bioinformatics is a way of arranging DNA, RNA, or protein primary sequences to emphasize their regions of similarity, which may indicate functional or evolutionary relationships between the genes or proteins in the query. Sequences are typically written with their characters (generally amino acids or nucleotides) in aligned columns into which gaps are inserted so that successive columns contain identical or similar characters.
If two sequences in an alignment share a common ancestor, then mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In protein sequence alignment, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. Absence of substitutions, or presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, can suggest that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than amino acids, conservation of base pairing can indicate a similar functional or structural role. Sequence alignment can also be used for non-biological sequences such as identifying similarities in the series of letters and words present in human language.
Very short or very similar sequences can be aligned by hand; however, most interesting problems require the alignment of lengthy, highly variable, or extremely numerous sequences that cannot be aligned solely by human effort. Instead human knowledge is primarily applied in contstructing algorithms for producing high-quality sequence alignments and occasionally adjusting the final results to reflect patterns difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable but can be more difficult to calculate due to the added challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, the most intuitive of which is dynamic programming.
Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a number of input and output formats, such as FASTA format and GenBank format; however, the use of specific tools authored by individual research laboratories can be complicated by limited file format compatibility. A general conversion program is available at READSEQ.
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (Note that this does not mean global alignments cannot end in gaps.) A general global alignment technique is called the Needleman-Wunsch algorithm and is based on dynamic programming. Local alignments are more useful for dissimilar sequences suspected to contain regions of similarity or similar sequence motifs within the larger sequence context. A general local alignment method based on dynamic programming is called the Smith-Waterman algorithm. With sufficiently similar sequences, there is no difference between local and global alignments.
Hybrid methods, known as semi-global or "glocal" methods, attempt to find the best possible alignment that includes the start of one or the other sequence and the end of one or the other sequence. This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlapBrudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S. (2003). Glocal alignment: finding rearrangements during alignment. Bioinformatics 19 Suppl 1:i54-62..
Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.
Dynamic programming can also be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. Although the BLAST family of database search methods includes tools for working with translated sequences, more general methods are presently only available as commercial software such as FrameSearch, distributed as part of the Accelrys GCG package.
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical matter rather than a theoretical one. Although dynamic programming is also extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or for sequences that are extremely long.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; it too uses a word search of length k, but evaluates only the most significant word matches rather than every word match as in FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type and is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.
Multiple sequence alignment (MSA) is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. MSAs are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problemsWang L, Jiang T. (1994) On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.. Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.
Many variations of the Clustal progressive implementationHiggins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44.Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680. are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-CoffeeNotredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17.; implementations can be found at ClustalW and T-Coffee.
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence Chothia C, Lesk AM. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823-6., structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure predictionZhang Y, Skolnick J. (2005). The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 102(4):1029-34. because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.
Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness. The field of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence homology suggests that the sequences in question have a comparatively young most recent common ancestor, while low homology suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change through mutation can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that rates of mutation are constant across sequence lineages. Therefore it does not account for possible difference among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein.) The basic molecular clock hypothesis is now known to be false because it relies on these assumed conditions, which obtain only rarely in nature.
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble MSAs and phylogenetic trees score and sort trees first and calculate an MSA from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal MSA, is NP-hardFelsenstein J. (2004). Inferring Phylogenies Sinauer Associates: Sunderland, MA..
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
Common software tools used for general sequence alignment tasks include ClustalW and T-coffee for alignment, and BLAST for database searching. A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software.
Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASEThompson JD, Plewniak F, Poch O. (1999). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15: 87-88.. The dataset consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASEThompson JD, Plewniak F, Poch O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 7(13):2682-90..
Bioinformatics | Computational phylogenetics
Sequenzalignment | Sequentie-alignering | Linjering | Bắt cặp trình tự | 序列比對
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Sequence alignment".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world