article

A suffix tree for a string S of n characters is a radix tree containing all n suffixes of S. Hence it is a substring index. It was first introduced as a position tree by Weiner in 1973 in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976, and further clarified by Ukkonen in 1995 who provided the first online construction. The suffix tree data structure can be built in \Theta(n) time, and gives you all z occurrences of a pattern P of length m in O(m+z) time, which is optimal. You can also extract a lot of information about your text data with a suffix tree. It was one of the first linear-time solutions for the longest common substring problem.

Functionality


A suffix tree for a string S of length n can be built in \Theta(n) time, if the alphabet is constant or integer . Otherwise, the construction time depends on the implementation. Assume that a suffix tree has been built for the string S, or that a generalised suffix tree has been built for the set of strings D=\{S^1,S^2,...,S^d\} of total length n=|S^1|+|S^2|+...+|S^d|. Assume that the size of the alphabet \Sigma is \sigma.

You can:

  • Check if a string P of length m is a substring of any of the strings in D in O(m) time.
  • Find all z occurrences of P in D in O(m+z) time.
  • Search for a regular expression P in time expected sublinear on D.
  • Search for a pattern P in D with x mismatches in O(m \sigma x+z) time.
  • Find the longest common substrings of all strings in D in O(n) time.
  • Find the most frequently occurring substrings of a minimum length in O(n) time.
  • Find the shortest strings from \Sigma that do not occur in D, in O(n) time. (There might be O(n\sigma) such strings, so it takes O(n \sigma) time to report them explicitly).

Uses


Suffix trees are mainly used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in data compression, where on the one hand it is used to find repeated data and on the other hand it can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the LZW compression schemes use it (LZSS).

Description


Before a string is indexed with a suffix tree, it is usually padded with a unique terminator, to make sure no suffix is a substring of another. Then you know that each suffix will be uniquely represented by a leaf node. The suffix tree for the string S is defined as the tree which has the property that edges spell non-empty strings, the paths from the root to the leaves have a one-to-one relationship with the suffixes of S, and all internal nodes except the root have at least two children. There are n suffixes, and hence n leaf nodes. Since all internal non-root nodes are branching, there can be at most n-1 such nodes, and n+(n-1)+1=2n nodes in total. If each node and edge can be represented in \Theta(1) space, the entire tree can be represented in \Theta(n) space. The total length of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S. The worst-case space usage of a suffix tree is seen with a fibonacci string, giving the full 2n nodes.

The main choice when making a suffix tree implementation is the parent--child relationships between nodes. The most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in the child list it is a part of. Suffix tree construction time is O(n\sigma), and search time is O(m\sigma+z). Hash maps and arrays may also be used, giving different running time properties.

Suffix links are a key feature for linear-time construction of the tree, since they allow changes to propagate to all suffixes quickly. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string \chi\alpha, where \chi has length 0 or 1, it has a suffix link to the internal node representing \alpha. See for example the nodes for ANA and NA in the figure above. Suffix links are used both in construction and in some algorithms running on the tree.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

References


External links


Data structures | Trees (structure) | Algorithms on strings

Suffixbaum | Priesagų medis | Дерево суффиксов

 

This article is licensed under the GNU Free Documentation License. It uses material from the "Suffix tree".

Home Pageartsbusinesscomputersgameshealthhospitalshomekids & teensnewsphysiciansrecreationreferenceregionalscienceshoppingsocietysportsworld