A suffix tree for a string S of n characters is a radix tree containing all n suffixes of S. Hence it is a substring index. It was first introduced as a position tree by Weiner in 1973 in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976, and further clarified by Ukkonen in 1995 who provided the first online construction. The suffix tree data structure can be built in time, and gives you all z occurrences of a pattern P of length m in time, which is optimal. You can also extract a lot of information about your text data with a suffix tree. It was one of the first linear-time solutions for the longest common substring problem.
Functionality
A suffix tree for a string
S of length
n can be built in
time, if the alphabet is constant or integer . Otherwise, the construction time depends on the implementation. Assume that a suffix tree has been built for the string
S, or that a
generalised suffix tree has been built for the set of strings
of total length
. Assume that the size of the alphabet
is
.
You can:
- Check if a string P of length m is a substring of any of the strings in D in time.
- Find all z occurrences of P in D in time.
- Search for a regular expression P in time expected sublinear on D.
- Search for a pattern P in D with x mismatches in time.
- Find the longest common substrings of all strings in D in time.
- Find the most frequently occurring substrings of a minimum length in time.
- Find the shortest strings from that do not occur in D, in time. (There might be such strings, so it takes time to report them explicitly).
Uses
Suffix trees are mainly used in
bioinformatics applications, where they are used for searching for patterns in
DNA or
protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in
data compression, where on the one hand it is used to find repeated data and on the other hand it can be used for the sorting stage of the
Burrows-Wheeler transform.
Variants of the
LZW compression schemes use it (
LZSS).
Description
Before a string is indexed with a suffix tree, it is usually padded with a unique terminator, to make sure no suffix is a substring of another. Then you know that each suffix will be uniquely represented by a leaf node. The suffix tree for the string
S is defined as the tree which has the property that edges spell non-empty strings, the paths from the root to the leaves have a one-to-one relationship with the suffixes of
S, and all internal nodes except the root have at least two children. There are
suffixes, and hence
leaf nodes. Since all internal non-root nodes are branching, there can be at most
such nodes, and
nodes in total. If each node and edge can be represented in
space, the entire tree can be represented in
space. The total length of the edges in the tree is
, but each edge can be stored as the position and length of a substring of
S. The worst-case space usage of a suffix tree is seen with a
fibonacci string, giving the full
nodes.
The main choice when making a suffix tree implementation is the parent--child relationships between nodes. The most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in the child list it is a part of. Suffix tree construction time is , and search time is . Hash maps and arrays may also be used, giving different running time properties.
Suffix links are a key feature for linear-time construction of the tree, since they allow changes to propagate to all suffixes quickly. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string , where has length 0 or 1, it has a suffix link to the internal node representing . See for example the nodes for ANA and NA in the figure above. Suffix links are used both in construction and in some algorithms running on the tree.
The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.
References
External links
Data structures | Trees (structure) | Algorithms on strings
Suffixbaum | Priesagų medis | Дерево суффиксов