Structural alignment is a form of sequence alignment that attempts to establish equivalences between two or more polymer structures based on their shape and three-dimensional conformation. This process is most commonly applied to proteins but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence homology where the evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common primary structure; however, caution should be used in interpreting the results as demonstrating shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated sequences converge on a common tertiary structure.
Structural alignments can be performed in a pairwise manner or they can be multiple sequence alignments. Because structural alignments rely on information about all the query sequences' three-dimensional conformations, the method can only be used on sequences whose structures are known, usually by X-ray crystallography or NMR spectroscopy. It is possible to perform a structural alignment on structures produced by structure prediction methods; evaluating such prediction methods often requires a structural alignment between the model and the true known structure to assess the model's quality. Structural alignments are especially important in analyzing data from structural genomics and proteomics efforts, and they can be used as comparison points by which to evaluate alignments produced by purely bioinformatics methods[Zhang Y, Skolnick J. (2005). The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 102(4):1029-34.].
The output of a structural alignment consists of a superposition of their atomic coordinate sets with a minimal root mean square distance (RMSD) between the two structures. The RMSD of two aligned structures is taken as a measure of their divergence from one another. Structural alignment can be complicated by the existence of multiple protein domains within one or more of the input structures because changes in relative orientation of the domains between two structures to be aligned can artificially inflate the RMSD.
Data produced by structural alignment
The minimum information produced from a successful structural alignment is a set of superposed three-dimensional coordinates for each input structure. (Note that one input element may be fixed as a reference and therefore its superposed coordinates do not change.) The structural alignment also implies a corresponding one-dimensional
sequence alignment from which a sequence identity (the percentage of residues that are identical between the input structures) can be calculated. The corresponding residues in this alignment are used to determine the RMSD value and the percent of structural identity (PSI). Finally, most structural alignment implementations produce some form of quality or likelihood score for a given superposition. In the one-dimensional structural alignment below
The PSI can be easily calculated by normalizing the number of aligned residues by the length of the shortest structure () where N is the number of the corresponded residues that are within a given Cartesian cutoff distance (often 4 Å) and "norm" is the normalization factor. In the immunoglobulin example below the number of aligned residues within 4 Å is 57 and the norm of the set is 83, the PSI is therefore 68.67%.
Blue
---KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGP
|||||||||||| ||||||||||| |||||
Red LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRS
P
--RGKNI--
Blue SKLNDRADSRRSLWDQGNFPLIIKNLKIEDSD--TYICEVEDQKEEVQ
--
||||||||||||||| ||||||||
Red
---QGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVL
Types of comparisons
Because protein structures are composed of
amino acids whose
side chains are linked by a common protein backbone, a number of different possible subsets of the atoms that make up a protein macromolecule can be used in producing a structural alignment and calculating the corresponding RMSD values. When aligning structures with very different sequences, the side chain atoms generally are not taken into account because their identities differ between many aligned residues. For this reason it is common for structural alignment methods to use by default only the backbone atoms included in the
peptide bond. For simplicity and efficiency, often only the
alpha carbon positions are considered, since the peptide bond has a minimally variant planar conformation. Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions, in which case the RMSD reflects not only the conformation of the protein backbone but also the
rotameric states of the side chains. Other comparison criteria that reduce noise and bolster positive matches include
secondary structure assignment, contact maps or residue interaction patterns, measures of side chain packing, and measures of hydrogen bond retention
[Godzik A. (1996). The structural alignment of proteins: is there a unique answer? Protein Sci 5:1325-8.].
Structural superposition
The most basic possible comparison between protein structures makes no attempt to align the input structures and requires a precalculated alignment as input to determine which of the residues in the sequence are intended to be considered in the RMSD calculation. This method uses a simple least-squares fitting algorithm such as that implemented in the software package
ProFit[Martin ACR. http://www.bioinf.org.uk/software/profit/doc/ (implements McLachlan, AD (1982). Rapid Comparison of Protein Structres. Acta Cryst A38, 871-873.)]. Structural superposition is commonly used to compare multiple conformations of the same protein (in which case no alignment is really necessary, since the sequences are the same) and to evaluate the quality of alignments produced using only sequence information between two or more sequences whose structures are known. Structural superposition generally uses either the alpha carbons or the full set of backbone atoms in performing its RMSD calculations.
Algorithmic complexity
Both the optimal "threading" of a protein sequence onto a known structure
[Lathrop RH. (1994). The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng 7(9):1059-68.] and the production of an optimal multiple sequence alignment
[Wang L, Jiang T. (1994) On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.] have been shown to be
NP-complete. However, this does not imply that the structural alignment problem is NP-complete. On the basis of the argument that a true optimal solution is not biologically meaningful due to the experimental error inherent in protein structure determination, an approximate
polynomial-time algorithm for structural alignment that produces a family of "optimal" solutions within an approximation parameter for a given scoring function has been developed
[Kolodny R, Linial N. (2004). Approximate protein structural alignment in polynomial time PNAS 101(33): 12201-12206.]. However, at
for a globular protein of
n residues, the algorithm is still too expensive for practical use. As a consequence all current methods of structural alignment are
heuristics not guaranteed to converge to an optimum solution given their scoring function.
Representation of structures
Protein structures have to be represented in some coordinate independent space to make them comparable. This is typically achieved by constructing a sequence-to-sequence matrix or series of matrices that encompass comparative metrics rather than absolute distances relative to a fixed coordinate space. An intuitive representation is the
distance matrix , which is a two-dimensional
matrix containing all pairwise distances between some subset of the atoms in each structure (usually the alpha carbons). Note that the matrix increases in dimensionality as the number of structures to be simulatenously aligned increases. Reducing the protein to some coarse metric such as
secondary structure elements (SSEs) or structural fragments can also produce sensible alignments despite the loss of information inherent in discarding distance information due to the concurrent loss of noise inherent in structural data. Choosing an appropriate representation that will facilitate efficient computation is critical to the development of an efficient alignment mechanism.
Methods
Structural alignment techniques have been used in comparing individual structures or sets of structures as well as in the production of "all-to-all" comparison databases that measure the divergence between every pair of structures present in the
Protein Data Bank (PDB). Such databases are used to classify proteins by their
fold.
DALI
A common and popular structural alignment method is the DALI, or distance alignment matrix, method, which breaks the input structures into hexapeptide fragments and calculates a distance matrix by evaluating the contact patterns between successive fragments
[Holm L, Sander C (1996). Mapping the protein universe. Science 273: 595-603.].
Secondary structure features that involve residues that are contiguous in sequence appear on the matrix's
main diagonal; other diagonals in the matrix reflect spatial contacts between residues that are not near each other in the sequence. When these diagonals are parallel to the main diagonal, the features they represent are parallel; when they are perpendicaular, their features are antiparallel. This representation is expensive because the features in the square matrix are symmetrical (and thus redundant) about the main diagonal.
When two proteins' distance matrices share the same or similar features in approximately the same positions, they can be said to have similar folds with similar-length loops connecting their secondary structure elements. DALI's actual alignment process requires a similarity search after the two proteins' distance matrices are built; this is normally conducted via a series of overlapping submatrices of size 6x6. Submatrix matches are then reassembled into a final alignment via a standard score-maximization algorithm - the original version of DALI used a Monte Carlo simulation to maximize a structural similarity score that is a function of the distances between putative corresponding atoms. In particular, more distant atoms within corresponding features are exponentially downweighted to reduce the effects of noise introduced by loop mobility, helix torsions, and other minor structural variations. Because DALI relies on an all-to-all distance matrix, it can account for the possibility that structurally aligned features might appear in different orders within the two sequences being compared.
The DALI method has also been used to construct a database known as FSSP (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins) in which all known protein structures are aligned with each other to determine their structural neighbors and fold classification. A web implementation of DALI, known as DaliLite, is available at EBI DALI and the current version of the FSSP database is located at The Dali Database.
SSAP
The SSAP (Sequential Structure Alignment Program) method uses double dynamic programming to produce a structural alignment based on atom-to-atom vectors in structure space. Instead of the alpha carbons typically used in structural alignment, SSAP constructs its vectors from the
beta carbons for all residues except glycine, a method which thus takes into account the rotameric state of each residue as well as its location along the backbone. SSAP works by first constructing a series of inter-residue distance vectors between each residue and its nearest non-contiguous neighbors on each protein. A series of matrices are then constructed containing the vector differences between neighbors for each pair of residues for which vectors were constructed. Dynamic programming applied to each resulting matrix determines a series of optimal local alignments which are then summed into a "summary" matrix to which dynamic programming is applied again to determine the overall structural alignment.
SSAP originally produced only pairwise alignments but has since been extended to multiple alignments as well[Taylor WR, Flores TP, Orengo CA. (1994). Multiple protein structure alignment. Protein Sci 3(10):1858-70.]. It has been applied in an all-to-all fashion to produce a hierarchical fold classification scheme known as CATH (Class, Architecture, Topology, Homolgy)[Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. (1997) CATH: A hierarchical classification of protein domain structures. Structure 5(8): 1093-1108.], which has been used to construct a database available at CATH Protein Structure Classification.
Combinatorial extension
The combinatorial extension (CE) method is similar to DALI in that it too breaks each structure in the query set into a series of fragments that it then attempts to reassemble into a complete alignment. A series of pairwise combinations of fragments called aligned fragment pairs, or AFPs, are used to define a similarity matrix through which an optimal path is generated to identify the final alignment. Only AFPs that meet given criteria for local similarity are included in the matrix as a means of reducing the necessary search space and thereby increasing efficiency
[Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9) 739-747.]. A number of similarity metrics are possible; the original definition of the CE method included only structural superpositions and inter-residue distances but has since been expanded to include local environmental properties such as secondary structure, solvent exposure, hydrogen-bonding patterns, and
dihedral angles.
An alignment path is calculated as the optimal path through the similarity matrix by linearly progressing through the sequences and extending the alignment with the next possible high-scoring AFP pair. The initial AFP pair that nucleates the alignment can occur at any point in the sequence matrix. Extensions then proceed with the next AFP that meets given distance criteria restricting the alignment to low gap sizes. The size of each AFP and the maximum gap size are required input parameters but are usually set to empirically determined values of 8 and 30 respectively.
Like DALI and SSAP, CE has been used to construct an all-to-all fold classification database from the known protein structures in the PDB. Both the CE web portal and the database are accessible at the Combinatorial Extension website.
RNA structural alignment
Structural alignment techniques have traditionally been applied exclusively to proteins, as the primary biological
macromolecules that assume characteristic three-dimensional structures. However, large
RNA molecules also form characteristic tertiary structures mediated primarily by
hydrogen bonds formed between
base pairs as well as
base stacking. Functionally similar
noncoding RNA molecules can be especially difficult to extract from
genomics data because structure is more strongly conserved than sequence in RNA as well as in proteins, and the more limited alphabet of RNA decreases the
information content of any given
nucleotide at any given position.
A recent method for pairwise structural alignment of RNA sequences with low sequence identity has been published and implemented in the program FOLDALIGN[Havgaard JH, Lyngso RB, Stormo GD, Gorodkin J. (2005). Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics 21(9):1815-24.]. However, this method is not truly analogous to protein structural alignment techniques because it computationally predicts the structures of the RNA input sequences rather than requiring experimentally determined structures as input. Although computational prediction of the protein folding process has not been successful to date, RNA structures without pseudoknots can often be sensibly predicted using free energy-based scoring methods that account for base pairing and stacking. The program is available at FOLDALIGN.
Software
Choosing a software tool for structural alignment can be a challenge due to the large variety of available packages that differ significantly in algorithmic technique. Due to its integration with other European Bioinformatics Institute web-based tools, the EBI DALI DaliLite webserver has an advantage in the production of single structural alignments for researchers interested in using the alignments to guide experimental work (rather than studying alignment methods themselves). A more complete list of currently available and freely distributed structural alignment software can be found in structural alignment software.
See also
References
Further reading
- Bourne PE, Shindyalov IN. (2003): Structure Comparison and Alignment. In: Bourne, P.E., Weissig, H. (Eds): Structural Bioinformatics. Hoboken NJ: Wiley-Liss. ISBN 0-471-20200-2
- Yuan X, Bystroff C.(2004) "Non-sequential Structure-based Alignments Reveal Topology-independent Core Packing Arrangements in Proteins", Bioinformatics. Nov 5, 2004
- Krissinel E, Henrick K. (2004). Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. D60, 2256-2268
- Jung J, Lee B. (2000). Protein structure alignment using environmental profiles. Protein Eng 13:535-543.
- Ye Y, Godzik A. (2005). Multiple flexible structure alignment using partial order graphs Bioinformatics 21(10): 2362-2369 Abstract
- Kawabata T, Nishikawa K. (2000). Protein structure comparison using the Markov transition model of evolution. Proteins 41(1): 108-122
- Lessel U, Schomburg D. (1994). Similarities between protein 3-D structures. Protein Eng 7:1175-1187
- Kolodny R, Linial N. (2004). Approximate protein structural alignment in polynomial time Proc Natl Acad Sci 101(33): 12201-12206.
Protein methods