String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.
Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are concatenation of elements of Σ. The Σ may be a usual human alphabet (for example, the letters A through Z in English). Other applications may use binary alphabet (Σ = {0,1}) or DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.
The various algorithms can be classified by the number of patterns each uses.
Let m be the length of the pattern and let n be the length of the searchable text.
| Algorithm | Preprocessing time | Matching time |
| Naïve string search algorithm | 0 (no preprocessing) | Θ(n m) |
| Rabin-Karp string search algorithm | Θ(m) | average Θ(n+m), worst Θ(n m) |
| Finite automaton | Θ(m |Σ|) | Θ(n) |
| Knuth-Morris-Pratt algorithm | Θ(m) | Θ(n) |
| Boyer-Moore string search algorithm | Θ(m) | average Θ(n/m), worst Θ(n) |
| Bitap algorithm (shift-or, shift-and, Baeza-Yates-Gonnet) | Θ(m+|Σ|) | Θ(n) |
| Shift Or Algorithm | Θ(m + |Σ|) | Θ(n) |
Naturally, the patterns can not be enumerated in this case. They are represented usually by a regular grammar or regular expression.
Other classification approaches are possible. One of the most common uses preprocessing as main criteria.
| Text not preprocessed | Text preprocessed | |
|---|---|---|
| Patterns not preprocessed | Elementary algorithms | Index methods |
| Patterns preprocessed | Constructed search engines | Signature methods |
The simplest and least efficient way to see where one string occurs inside another is to check each place it could be, one by one, to see if it's there. So first we see if there's a copy of the needle in the first few characters of the haystack; if not, we look to see if there's a copy of the needle starting at the second character of the haystack; if not, we look starting at the third character, and so forth. In the normal case, we only have to look at one or two characters for each wrong position to see that it is a wrong position, so in the average case, this takes O(n + m) steps, where n is the length of the haystack and m is the length of the needle; but in the worst case, searching for a string like "aaaab" in a string like "aaaaaaaaab", it takes O(nm) steps.
Faster search algorithms are based on preprocessing of the text. After building an index, for example a suffix tree or suffix array, these algorithms can find pattern quickly, using a binary search of the index.
Some search methods, for instance trigram search, are intended to find a "closeness" score between the search string and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.
Algorithms on strings | Search_algorithms
String-Matching-Algorithmus | Algoritmos de búsqueda en textos | Merkkijonohakualgoritmi | Algorithme de recherche de sous-chaîne
This article is licensed under the GNU Free Documentation License.
It uses material from the
"String searching algorithm".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world