The Rabin-Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp that seeks a pattern, i.e. a substring, within a text by using hashing. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. For text of length n and pattern of length m, its average and best case running time is O(n), but the (highly unlikely) worst case performance is O(nm), which is one of the reasons why it is not widely used. However, it has the unique advantage of being able to find any one of k strings or less in O(n) time on average, regardless of the size of k.
One of the simplest practical applications of Rabin-Karp is in detection of plagiarism. Say, for example, that a student is writing an English paper on Moby Dick. A cunning professor might locate a variety of source material on Moby Dick and automatically extract a list of all sentences in those materials. Then, Rabin-Karp can rapidly search through a particular paper for any instance of any of the sentences from the source materials. To avoid easily thwarting the system with minor changes, it can be made to ignore details such as case and punctuation by removing these first. Because the number of strings we're searching for, k, is very large, single-string searching algorithms are impractical.
1 function NaiveSearch(string sstring sub[1..m) 2 for i from 1 to n 3 for j from 1 to m 4 if s≠ sub[j 5 jump to next iteration of outer loop 6 return i 7 return not found
This algorithm works well in many practical cases, but can exhibit relatively long running times on certain examples, such as searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s, in which case it exhibits its worst-case Θ(mn) time.
The Knuth-Morris-Pratt algorithm reduces this to Θ(n) time using precomputation to examine each text character only once; the Boyer-Moore algorithm skips forward not by 1 character, but by as many as possible for the search to succeed, effectively decreasing the number of times we iterate through the outer loop, so that the number of characters examined can be as small as n/m in the best case. The Rabin-Karp algorithm focuses instead on speeding up lines 3-6, as we'll discuss in the next section.
However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good.
The algorithm is as shown:
1 function RabinKarp(string sstring sub[1..m) 2 hsub := hash(sub*) 3 hs := hash(s*) 4 for i from 1 to n 5 if hs = hsub 6 if s* = sub 7 return i 8 hs := hash(s*) 9 return not found
Lines 2, 3, 6 and 8 each require Ω(m) time. However, lines 2 and 3 are only executed once, and line 6 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 5 is executed n times, but only requires constant time. So the only problem is line 8.
If we naively recompute the hash value for the substring sthis would require Ω(m) time, and since this is done on each loop, the algorithm would require Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable hs already contains the hash value of s[i..i+m-1. If we can use this to compute the next hash value in constant time, then our problem will be solved.
We do this using what is called a rolling hash. A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time: s= s*" target="_blank" >+ s[i+m This simple function works, but will result in statement 6 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.
Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 6 might very well be executed n times, on every iteration of the loop. Because it requires Ω(m) time, the whole algorithm then takes a worst-case Ω(mn) time.
Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See hash function for a much more detailed discussion. The essential benefit achieved by such representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 × 1012 (97 is ASCII for 'a' and 101 is the base we are using), multiplying by the base and adding for the last a of "bra", i.e. 97 × 1010 = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited of the size of integer data type and the necessity of using modular arithmetic to scale down the hash results, for which see hash function article; meanwhile, those naive hash functions that would not produce large numbers quickly, like just adding ASCII values, are likely to cause many hash collisions and hence slow down the algorithm. Hence the described hash function is typically the preferred one in Rabin-Karp.
That is, if we want to find any of a large number, say k, fixed length patterns in a text, we can create a simple variant of Rabin-Karp that uses a hash table or any other set data structure to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for:
function RabinKarpSet(string s*, set of string subs, m) { set hsubs := emptySet for each sub in subs insert hash(sub*) into hsubs hs := hash(s*) for i from 1 to n if hs ∈ hsubs if s* = a substring with hash hs return i hs := hash(s*) return not found }
Here we assume all the substrings have a fixed length m, but this assumption can be eliminated. We simply compare the current hash value against the hash values of all the substrings simultaneously using a quick lookup in our set data structure, and then verify any match we find against all substrings with that hash value.
Other algorithms can search for a single pattern in O(n) time, and hence they can be used to search for k patterns in O(n k) time. In contrast, the variant Rabin-Karp above can find all k patterns in O(n+k) time in expectation, because a hash table checks whether a substring hash equals any of the pattern hashes in O(1) time.
Rabin-Karp-Algorithmus | Algorytm Karpa-Rabina | Алгоритм Рабина — Карпа | Rabin-Karps algoritm
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Rabin-Karp string search algorithm".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world