In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of macromolecules such as DNA, which can be viewed as strings of the letters A, C, G and T. Several definitions of edit distance exist, using different sets of string operations. One of the most common variants is called Levenshtein distance, named after the Soviet Russian computer scientist Vladimir Levenshtein. In this version, the allowed operations are the removal or insertion of a single character, or the substitution of one character for another. Levenshtein distance may also simply be called "edit distance", although several variants exist. Formal definition and properties Given two strings a and b on an alphabet (e.g. the set of ASCII characters, the set of bytes [0..255], etc.), the edit distance d(a, b) is the minimum-weight series of edit operations that transforms a into b. One of the simplest sets of edit operations is that defined by Levenshtein in 1966 Insertion of a single symbol. If a = uv, then inserting the symbol x produces uxv. This can also be denoted e->x, using e to denote the empty string. Deletion of a single symbol changes uxv to uv (x->e). Substitution of a single symbol x for a symbol y != x changes uxv to uyv (x->y). In Levenshtein's original definition, each of these operations has unit cost (except that substitution of a character by itself has zero cost), so the Levenshtein distance is equal to the minimum number of operations required to transform a to b. A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(x y) with the operations. Additional primitive operations have been suggested. A common mistake when typing text is transposition of two adjacent characters commonly occur, formally characterized by an operation that changes uxyv into uyxv. For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice-versa. Other variants of edit distance are obtained by restricting the set of operations. Longest common subsequence (LCS) distance is edit distance with insertion and deletion as the only two edit operations, both at unit cost. Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings. Jaro–Winkler distance can be obtained from an edit distance where only transpositions are allowed. Example The Levenshtein distance between "kitten" and "sitting" is 3. The minimal edit script that transforms the former into the latter is: kitten -> sitten (substitution of "s" for "k") sitten -> sittin (substitution of "i" for "e") sittin -> sitting (insertion of "g" at the end). LCS distance (insertions and deletions only) gives a different distance and minimal edit script: delete k at 0 insert s at 0 delete e at 4 insert i at 4 insert g at 6 for a total cost/distance of 5 operations. Properties Edit distance with non-negative cost satisfies the axioms of a metric, giving rise to a metric space of strings, when the following conditions are met: Every edit operation has positive cost; for every operation, there is an inverse operation with equal cost. With these properties, the metric axioms are satisfied as follows: d(a, a) = 0, since each string can be trivially transformed to itself using exactly zero operations. d(a, b) > 0 when a != b, since this would require at least one operation at non-zero cost. d(a, b) = d(b, a) by equality of the cost of each operation and its inverse. Triangle inequality: d(a, c) <= d(a, b) + d(b, c). Levenshtein distance and LCS distance with unit cost satisfy the above conditions, and therefore the metric axioms. Variants of edit distance that are not proper metrics have also been considered in the literature. Other useful properties of unit-cost edit distances include: LCS distance is bounded above by the sum of lengths of a pair of strings. LCS distance is an upper bound on Levenshtein distance. For strings of the same length, Hamming distance is an upper bound on Levenshtein distance. Regardless of cost/weights, the following property holds of all edit distances: When a and b share a common prefix, this prefix has no effect on the distance. Formally, when a = uv and b = uw, then d(a, b) = d(v, w). This allows speeding up many computations involving edit distance and edit scripts, since common prefixes and suffixes can be skipped in linear time. Computation Basic algorithm Using Levenshtein's original operations, the edit distance between a = a_1 ... a_n and b = b_1 ... b_m is given by d_{mn}, defined by the recurrence d_{i0} = \sum_{k=1}^{i} w_\mathrm{del}(b_{k}), \quad for\; 1 \leq i \leq m d_{0j} = \sum_{k=1}^{j} w_\mathrm{ins}(a_{k}), \quad for\; 1 \leq j \leq n d_{ij} = \begin{cases} d_{i-1, j-1} & \quad a_{j} = b_{i}\\ \min \begin{cases} d_{i-1, j} + w_\mathrm{del}(b_{i})\\ d_{i,j-1} + w_\mathrm{ins}(a_{j}) \\ d_{i-1,j-1} + w_\mathrm{sub}(a_{j}, b_{i}) \end{cases} & \quad a_{j} \neq b_{i}\end{cases} , \quad for\; 1 \leq i \leq m, 1 \leq j \leq n. This algorithm can be generalized to handle transpositions by adding an additional term in the recursive clause's minimization. The straightforward, recursive way of evaluating this recurrence takes exponential time. Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fisher, although it has a history of multiple invention. After completion of the Wagner–Fischer algorithm, a minimal sequence of edit operations can be read off as a backtrace of the operations used during the dynamic programming algorithm starting at d_{mn}. This algorithm has a time complexity of Θ(mn). When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory. However, this optimization makes it impossible to read off the minimal series of edit operations. Improved algorithm Improving on the Wagner–Fisher algorithm described above, Ukkonen describes a variant that takes two strings and a maximum edit distance s, and returns min(s, d). It achieves this by only computing and storing a part of the dynamic programming table around its diagonal. This algorithm takes time O(s×min(m,n)), where m and n are the lengths of the strings. Space complexity is O(s^2) or O(s), depending on whether the edit sequence needs to be read off.