In information theory and computer science, the Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is useful in applications that need to determine how similar two strings are, such as spell checkers.
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since these three edits change one into the other, and there is no way to do it with fewer than three edits:
It can be considered a generalization of the Hamming distance, which is used for strings of the same length and only considers substitution edits. There are also further generalizations of the Levenshtein distance that consider, for example, exchanging two characters as an operation, like in the Damerau-Levenshtein distance algorithm.
int LevenshteinDistance(char str1char str2[1..lenStr2) // d is a table with lenStr1+1 rows and lenStr2+1 columns declare int d0..lenStr2 // i and j are used to iterate over str1 and str2 declare int i, j, cost for i from 0 to lenStr1 d0 := i for j from 0 to lenStr2 dj := j for i from 1 to lenStr1 for j from 1 to lenStr2 if str1= str2[j then cost := 0 else cost := 1 dj := minimum( dj + 1, // deletion d , j-1 + 1, // insertion dj-1 + cost // substitution ) return dlenStr2
Two examples of the resulting matrix (the minimum steps to be taken are highlighted):
| k | i | t | t | e | n | ||
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
| s | 1 | 2 | 3 | 4 | 5 | 6 | |
| i | 2 | 2 | 2 | 3 | 4 | 5 | |
| t | 3 | 3 | 2 | 2 | 3 | 4 | |
| t | 4 | 4 | 3 | 2 | 2 | 3 | |
| i | 5 | 5 | 4 | 3 | 2 | 3 | |
| n | 6 | 6 | 5 | 4 | 3 | 3 | |
| g | 7 | 7 | 6 | 5 | 4 | 4 |
| S | a | t | u | r | d | a | y | ||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
| S | 1 | 3 | 4 | 5 | 6 | 7 | |||
| u | 2 | 1 | 1 | 2 | 3 | 4 | 5 | 6 | |
| n | 3 | 2 | 2 | 2 | 3 | 4 | 5 | 6 | |
| d | 4 | 3 | 3 | 3 | 3 | 4 | 4 | 5 | |
| a | 5 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | |
| y | 6 | 5 | 5 | 5 | 5 | 5 | 5 | 4 |
The invariant maintained throughout the algorithm is that we can transform the initial segment str1into str2* operations. At the end, the bottom-right element of the array contains the answer.
j.
d* can be moved inside the main outer loop.
costs can be computed in parallel, and the algorithm can be adapted to perform the minimum function in phases to eliminate dependencies.
sinto t* operations. This invariant holds since:
- It is initially true on row and column 0 because
scan be transformed into the empty string t*" target="_blank" >to t[1..j by simply adding all j characters.
- The minimum is taken over three distances, each of which is feasible:
- If we can transform
sto t*" target="_blank" >afterwards to get t[1..j in k+1 operations.
- If we can transform
sto t*" target="_blank" >and then remove the original s[i at the end in k+1 operations.
- If we can transform
sto t*" target="_blank" >and then do a substitution of t* at the end if necessary, requiring k+cost operations.
- The operations required to transform
sinto t* holds our result.
This proof fails to validate that the number placed in dis in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j
is smaller than the minimum of the three, and use this to show one of the three is not minimal.
The implementations of the Levenshtein algorithm on this page are illustrative only. Applications will, in most cases, use implementations which use heap allocations sparingly, in particular when large lists of words are compared to each other. The following remarks indicate some of the variations on this and related topics:
1. Most implementations use one- or two-dimensional arays to store the distances of prefixes of the words compared. In most applications the size of these structures is previously known. This is the case, when, for instance the distance is relevant only if it is below a certian maximally allowed distance (this happens when words are selected from a dictionary to approximately match a given word). In this case the arrays can be preallocated and reused over the various runs of the algorithm over successive words.
2. Using maximally allowed distances reduces the search. To achieve this, include the maximal distance in the parameter list to the algorithm and keep track of the shortest found distance of prefixes of the words compared.
3. Deletion/Insertion and replacement of characters can be assigned different weights. The usual choice is to set them both to 1 (this is the choice for all of the algorithms on this page). Different values for these weights allows for more flexible search strategies in lists of words.
Below is a slightly faster version using only linear space.
const unsigned int cost_del = 1; const unsigned int cost_ins = 1; const unsigned int cost_sub = 1; unsigned int edit_distance( const std::string& s1, const std::string& s2 ) { unsigned int n1 = s1.length(); unsigned int n2 = s2.length(); unsigned int* p = new unsigned intn2+1 ; unsigned int* q = new unsigned intn2+1 ; unsigned int* r; p0 = 0; for( unsigned int j = 1; j <= n2; ++j ) pj = pj-1 + cost_ins; for( unsigned int i = 1; i <= n1; ++i ) { q0 = p0 + cost_del; for( unsigned int j = 1; j <= n2; ++j ) { unsigned int d_del = pj + cost_del; unsigned int d_ins = qj-1 + cost_ins; unsigned int d_sub = pj-1 + ( s1== s2[j-1 ? 0 : cost_sub ); qj = std::min( std::min( d_del, d_ins ), d_sub ); } r = p; p = q; q = r; } unsigned int tmp = pn2 ; delete* p; delete* q; return tmp; }
(defun levenshtein-distance (a b) "Computes the Levenshtein distance between strings a and b." (let ((al (length a)) (bl (length b))) ;; Swap a and b if a is longer. (when (> al bl) (multiple-value-setq (a b al bl) (values b a bl al))) (loop for i below bl for prev = (loop for i from 0 to al collect i) then (copy-list current) for current = (cons (1+ i) (make-list al :initial-element 0)) do (loop for j below al for (c0 . ccons) on current for (p1 p0) on prev for add = (1+ p1) for delete = (1+ c0) for change = (+ p0 (if (eql (elt a j) (elt b i)) 0 1)) do (rplaca ccons (min add delete change))) finally (return (car (last current))))))
(defun levenshtein-distance (str1 str2) "Return the edit distance between strings STR1 and STR2." (if (not (stringp str1)) (error "Argument was not a string: %s" str1)) (if (not (stringp str2)) (error "Argument was not a string: %s" str2)) (let* ((make-table (function (lambda (columns rows init) (make-vector rows (make-vector columns init))))) (tref (function (lambda (table x y) (aref (aref table y) x)))) (tset (function (lambda (table x y object) (let ((row (copy-sequence (aref table y)))) (aset row x object) (aset table y row) object)))) (length-str1 (length str1)) (length-str2 (length str2)) (d (funcall make-table (1+ length-str1) (1+ length-str2) 0))) (let ((i 0) (j 0)) (while (<= i length-str1) (funcall tset d i 0 i) (setq i (1+ i))) (while (<= j length-str2) (funcall tset d 0 j j) (setq j (1+ j)))) (let ((i 1)) (while (<= i length-str1) (let ((j 1)) (while (<= j length-str2) (let* ((cost (if (equal (aref str1 (1- i)) (aref str2 (1- j))) 0 1)) (deletion (1+ (funcall tref d (1- i) j))) (insertion (1+ (funcall tref d i (1- j)))) (substitution (+ (funcall tref d (1- i) (1- j)) cost))) (funcall tset d i j (min insertion deletion substitution))) (setq j (1+ j)))) (setq i (1+ i)))) (funcall tref d length-str1 length-str2)))
editDistance :: Eq a => -> [a -> Int editDistance s * = length s editDistance * t = length t editDistance (s:ss) (t:ts) = minimum [ (if s == t then 0 else 1) + editDistance ss ts, 1 + editDistance ss (t:ts), 1 + editDistance (s:ss) ts ]
def distance(a,b): "Calculates the Levenshtein distance between a and b." n, m = len(a), len(b) if n > m: # Make sure n <= m, to use O(min(n,m)) space a,b = b,a n,m = m,n current = range(n+1) for i in range(1,m+1): previous, current = current, *+**n for j in range(1,n+1): add, delete = previouscurrent[j-1+1 change = previous* if a!= b[i-1: change = change + 1 current* = min(add, delete, change) return current*
(define add1 (lambda (x) (+ x 1))) (define sub1 (lambda (x) (- x 1))) (define levenshtein-distance (lambda (s1 s2) (let* ((width (add1 (string-length s1))) (height (add1 (string-length s2))) (d (make-array (shape 0 height 0 width) 0))) (do-ec (:range x width) (array-set! d 0 x x)) (do-ec (:range y height) (array-set! d y 0 y)) (do-ec (:range x (string-length s1)) (:range y (string-length s2)) (array-set! d (add1 y) (add1 x) (min (add1 (array-ref d y (add1 x))) (add1 (array-ref d (add1 y) x)) (+ (array-ref d y x) (if (eqv? (string-ref s1 x) (string-ref s2 y)) 0 1))))) (displarray d) (array-ref d (sub1 height) (sub1 width)))))
Contributed by Stephan Maier. This is an implementation in Prolog which makes use only of ISO-Prolog with the exception of the append(?list, ?list, ?list)-rule which, however, is available on most prolog implementations. The implementation of the algorithm is recursive but the recursion only serves to simulate two loops over the length of the words compared. Efficiency of the implementation is contrasted with rather large parameter set for the rule 'lev(...)'.
% Implementation of the computation of the leveshtein distance of two words. % This implementation works on all ISO-compatible implemenations of Prolog % which implement append(?list, ?list, ?list) in the usual way. % levenshtein(+atom, +atom, -integer): Computes the Levenshtein distance between two given words % W1: Atom % W2: Atom % D : Distance levenshtein(W1, W2, D) :- atom_length(W1, 0), atom_length(W2, L), D is L, !. levenshtein(W1, W2, D) :- atom_length(W2, 0), atom_length(W1, L), D is L, !. levenshtein(W1, W2, D) :- atom_length(W1, L1), initialList(L1, StartRow), lev(W1, W2, *, StartRow, 1, 1, 0, D), !. % lev(+W1, +W2, +CR, +LR, +P1, +P2, +CD, -D): Rule which represents the actual calculation. This rule % works recursively simulating a loop over the positions (P1, P2) in the words W1 and W2. The outer % loop is over P2, the inner loop is over P1. To keep track of previously computed distances two lists % are used. % W1: Atom % W2: Atom % CR: Start of the row (current row) % LR: Remainder of last row (last row) % P1: Position in word W1 % P2: Position in word W2 % CD: Current value of distance; this is the last element of CR % D : Value of distance % Verify the end-condition for the recursion first: This is achieved when the P2-counter exceeds the length of W2 lev(_, W2, _, _, _, P2, CD, D) :- atom_length(W2, L2), P2 > L2, D is CD, !. % The initial rule after P2 has been increased by 1 lev(W1, W2, LR, 1, P2, _, D) :- CR = [P2, lev(W1, W2, CR, LR, 1, P2, P2, D), !. % This is the actual computational rule lev(W1, W2, CR, H2|LR, P1, P2, CD, D) :- % T1 is the distance at (P1, P2) obtained from removing the letter at P1 from W1 T1 is CD + 1, % T2 is the distance at (P1, P2) obtained from adding the letter at P2 from W2 T2 is H2 + 1, charAt(W1, P1, C1), charAt(W2, P2, C2), % T3 is the distance at (P1, P2) obtained from exchanging the letters at P1 from W1 and P2 from W2 ( C1 = C2 -> T3 is H1; T3 is H1 + 1 ), % The distance at (P1, P2) is the minimum of the previously computed distances min(T1, T2, T3, NCD), % Compute the next value for CR, increase the P1 position by i and the recursively call lev(...) append(CR, *, NCR), NP1 is P1 + 1, lev(W1, W2, NCR, *, NP1, P2, NCD, D), !. % This rule verifies the end of line condition for the loop over P1 lev(W1, W2, CR, _, P2, CD, D) :- NP2 is P2 + 1, lev(W1, W2, [, CR, 1, NP2, CD, D), !. % initialList(+N, -L): Returns a list L of the form 1, 2, ..., N. initialList(N, _) :- (\+ integer(N); N < 1) -> fail. initialList(1, L) :- L = 1, !. initialList(N, L) :- N1 is N - 1, L2 = *, initialList(N1, L1), append(L1, L2, L). % charAt(+A, ?N, -C): Returns the character at position N in the atom A % The position is 1-based % A: The atom % N: The position at which to extract the character % C: The character of A at position N charAt(A, N, C) :- P is N - 1, sub_atom(A, P, 1, _, C). % min(...): These rules compute the minimum of the given integer values % I1, I2, I3: Integer values % M: The minimum over the values min(I1, I2, M) :- integer(I1), integer(I2), ( I1 =< I2 -> M is I1; M is I2). min(I1, I2, I3, M) :- min(I1, I2, A), min(I2, I3, B), min(A, B, M).
class LevenshteinDistance { function LevenshteinDistance() { } private function compute(a:Array, b:Array) { if(typeof(a)
printf("\ndistance is: %d", dist("mama", "tatko"));
sub dist{
my($str1, $str2)=@_;
my($i, $j, $cost, @temp, @d);
for($i=0; $i
Uses linear space, and can be curried with just one string to factor arrays allocations
(* Written by Pierre Etchemaite
function levenshtein(string1, string2) local str1, str2, distance = {}, {}, {}; str1.len, str2.len = string.len(string1), string.len(string2); string.gsub(string1, "(.)", function(s) table.insert(str1, s); end); string.gsub(string2, "(.)", function(s) table.insert(str2, s); end); for i = 0, str1.len do distance** = i; end for i = 0, str2.len do distance** = i; end for i = 1, str1.len do for j = 1, str2.len do local tmpdist = 1; if(str1== str2[j-1) then tmpdist = 0; end distance** = math.min( distance*" target="_blank" >+ 1, distance*+1," target="_blank" >distance* + tmpdist); end end return distance**; end
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
s and t, the number of characters (not counting duplicates) found in s but not in t is a lower bound.
Algorithms on strings | Discrete mathematics
Levenshteinafstand | Levenshtein-Distanz | Distancia de Levenshtein | Distance de Levenshtein | Distanza di Levenshtein | レーベンシュタイン距離 | Levenshtein-distanse | Odległość Levenshteina | Distância Levenshtein | Расстояние Левенштейна | Levenšteinin etäisyys | Відстань Левенштейна | 編輯距離
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Levenshtein distance".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world