Minimum Edit Distance with Dynamic Programming

  • 1. Question
  • 2. Analysis
  • 3. Algorithm
    • 3.1. Substitution
    • 3.2. Insertion
    • 3.3. Deletion
    • 3.4. Special Cases
    • 3.5. Equation
  • 4. Fill the Table
    • 4.1. Dimension
    • 4.2. Range
    • 4.3. Order
    • 4.4. Related Code
  • 5. Show Me the Code
  • 6. T(n) and S(n)
  • 7. Experience

1. Question

2. Analysis

Our task is to make two strings identical using three operations: (1) deleting a character, (2) inserting a character, and (3) substituting one character for another. The goal is the minimum edit distance, that is, the smallest number of such operations. The problem can be solved with dynamic programming. To use this strategy we should first identify its optimal substructure and its overlapping subproblems.

3. Algorithm

Assume that the lengths of string A and string B are \(m\) and \(n\). Let's first try to make the last characters of the two strings identical. If the last characters differ, we must perform an operation. If they are already the same, we only need to consider the remaining substrings: the \(1^{st}\) to the \((m - 1)^{th}\) characters of string A and the \(1^{st}\) to the \((n - 1)^{th}\) characters of string B.

3.1. Substitution

We can either substitute the last character of string A with the last character of string B, or vice versa. Let's consider the first case. The original strings in the question are:

fxpimu
  xwrs

We substitute "u" in string A with "s" in string B. Thus the strings now are:

fxpims
  xwrs

Since the last characters of the two strings are now identical, we only need to find the minimum edit distance of their substrings: the \(1^{st}\) to the \((m - 1)^{th}\) characters of string A and the \(1^{st}\) to the \((n - 1)^{th}\) characters of string B. Substituting the last character of string B with the last character of string A is similar.

3.2. Insertion

We can either insert the last character of string A at the end of string B, or vice versa. Let's first consider the case where the last character of string B is inserted at the end of string A. After the insertion the strings will be:

fxpimus
   xwrs

The lengths of the two strings are now \(m + 1\) and \(n\), and their last characters are identical, so we can drop those last characters and consider the minimum edit distance of what remains: the \(1^{st}\) to the \(m^{th}\) characters of string A (A grew to length \(m + 1\) after the insertion, so dropping its last character leaves \(m\) characters) and the \(1^{st}\) to the \((n - 1)^{th}\) characters of string B. The other case, where the last character of string A is inserted at the end of string B, is similar.

3.3. Deletion

Both insertion and substitution make the last characters of the two strings identical, but that is not the case for deletion. After deleting a character, the last character of the modified string may still differ from the last character of the other string. Concretely, if we delete the last character "u" of string A, the strings become:

fxpim
 xwrs

The last characters are still not identical. Therefore, we cannot drop the last character of string B, and we must still consider everything that remains of both strings: the \(1^{st}\) to the \((m - 1)^{th}\) characters of string A (its last character was deleted, so its length decreased by one) and the \(1^{st}\) to the \(n^{th}\) characters of string B. Deleting the last character of string B is similar.

3.4. Special Cases

When one of the strings is empty, the minimum edit distance is the length of the non-empty string. The empty string can be turned into the non-empty one by inserting the characters of the non-empty string one at a time.

And when both of the strings are empty, we need to do nothing.

According to the analysis above we can find out the optimal substructure of the task. The optimal solution of the original task depends on the solution of its subtasks.

The task also has many overlapping subproblems. To find the minimum edit distance of two strings, we need the minimum edit distances of their substrings, which in turn depend on shorter substrings. For example, the subproblem obtained by dropping the last character of A and the subproblem obtained by dropping the last character of B both depend on the subproblem obtained by dropping the last character of each. Without storing results, that shared subproblem would be computed more than once.
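To make the overlap visible, here is a minimal sketch of the plain recursion without any table. It is not part of the original solution: the names `naiveEdit` and `g_calls` are hypothetical and used only for illustration. The counter records how many times the function is entered, which is far more than the number of distinct (i, j) pairs.

```cpp
#include <algorithm>
#include <string>

// Hypothetical counter: number of times the recursive function is entered.
long long g_calls = 0;

// Naive recursion on prefix lengths: i characters of a, j characters of b.
// No results are stored, so shared subproblems are recomputed.
int naiveEdit(const std::string& a, const std::string& b, int i, int j) {
    ++g_calls;
    if (i == 0) return j;  // a's prefix is empty: j insertions
    if (j == 0) return i;  // b's prefix is empty: i deletions
    int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;  // a[i - 1] is the i-th character
    return std::min({naiveEdit(a, b, i - 1, j) + 1,
                     naiveEdit(a, b, i, j - 1) + 1,
                     naiveEdit(a, b, i - 1, j - 1) + cost});
}
```

Calling `naiveEdit("fxpimu", "xwrs", 6, 4)` visits only 7 × 5 = 35 distinct (i, j) pairs, yet `g_calls` ends up much larger than 35, which is exactly the repeated work that a table avoids.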

3.5. Equation

Let's say the minimum edit distance between the first \(i\) characters of string A and the first \(j\) characters of string B is edit[i][j], so the answer for the whole strings, with lengths \(m\) and \(n\), is edit[m][n]. With the optimal substructure, we can work out the equation for the task:
\[
edit[i][j]=
\begin{cases}
0 & i = 0, j = 0 \\
j & i = 0, j > 0 \\
i & i > 0, j = 0 \\
\min\{edit[i - 1][j] + 1,\ edit[i][j - 1] + 1,\ edit[i - 1][j - 1] + notIdentical\} & i > 0, j > 0
\end{cases}
\]

where notIdentical is defined as follows (A[i] and B[j] denote the \(i^{th}\) character of A and the \(j^{th}\) character of B, counting from 1):
\[
notIdentical=
\begin{cases}
0 & A[i] = B[j] \\
1 & A[i] \neq B[j]
\end{cases}
\]

Note that \(edit[i - 1][j] + 1\) covers two cases: (1) deleting the last character of string A, and (2) inserting the last character of string A at the end of string B. \(edit[i][j - 1] + 1\) covers (1) deleting the last character of string B, and (2) inserting the last character of string B at the end of string A. \(edit[i - 1][j - 1] + notIdentical\) covers (1) the last characters of the two strings being identical already (notIdentical is 0), and (2) a substitution being performed (notIdentical is 1).

The equation can be simplified as:
\[
edit[i][j]=
\begin{cases}
i == 0\;?\;j\;:\;i & i == 0\;||\;j == 0 \\
\min\{edit[i - 1][j] + 1,\ edit[i][j - 1] + 1,\ edit[i - 1][j - 1] + int(!(A[i] == B[j]))\} & i > 0, j > 0
\end{cases}
\]
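The recurrence can also be evaluated top-down, storing each edit[i][j] the first time it is computed. The following is a sketch under that assumption, not the author's implementation; `editMemo` and `editDistanceTopDown` are hypothetical names, and the value -1 is used here to mark cells that have not been computed yet.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Top-down evaluation of edit[i][j]; memo holds -1 for "not computed yet".
int editMemo(const std::string& a, const std::string& b, int i, int j,
             std::vector<std::vector<int>>& memo) {
    if (i == 0 || j == 0) return i == 0 ? j : i;  // base cases of the equation
    int& cell = memo[i][j];
    if (cell != -1) return cell;                  // already computed
    // a[i - 1] is the i-th character because C++ strings are 0-based.
    int notIdentical = (a[i - 1] != b[j - 1]) ? 1 : 0;
    cell = std::min({editMemo(a, b, i - 1, j, memo) + 1,
                     editMemo(a, b, i, j - 1, memo) + 1,
                     editMemo(a, b, i - 1, j - 1, memo) + notIdentical});
    return cell;
}

int editDistanceTopDown(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> memo(
        a.size() + 1, std::vector<int>(b.size() + 1, -1));
    return editMemo(a, b, static_cast<int>(a.size()),
                    static_cast<int>(b.size()), memo);
}
```

For the strings used in this article, `editDistanceTopDown("fxpimu", "xwrs")` evaluates to 5. The recursion depth grows with the sum of the two lengths, so for long inputs a table filled with loops is usually the safer choice.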

4. Fill the Table

Solving a dynamic programming task amounts to filling a table. What we need to know are:

  1. the dimension of the table
  2. the range to fill
  3. the filling order

4.1. Dimension

Since we use a 2-dimensional matrix edit[i][j] to store the solutions, the table is 2D.

4.2. Range

The range of the indices of the solution is \(0 \leq i \leq m\), \(0 \leq j \leq n\). Thus the whole table should be filled.

4.3. Order

To calculate edit[i][j], we should first calculate edit[i - 1][j], edit[i][j - 1] and edit[i - 1][j - 1]. Let's find out their relative positions in the table:
\[
\begin{matrix}
edit[i - 1][j - 1] & edit[i - 1][j] \\
edit[i][j - 1] & *edit[i][j]*
\end{matrix}
\]

All three lie above or to the left of edit[i][j], thus the order is from left to right, from the top to the bottom.
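As a small worked illustration (the strings "fxp" and "xw" are chosen here only for the example; they are prefixes of the two strings used earlier), filling the table row by row gives:
\[
\begin{array}{c|ccc}
 & \varepsilon & x & w \\
\hline
\varepsilon & 0 & 1 & 2 \\
f & 1 & 1 & 2 \\
x & 2 & 1 & 2 \\
p & 3 & 2 & 2
\end{array}
\]

Here \(\varepsilon\) stands for the empty prefix, so the first row and the first column are the special cases. Every other entry is computed from its upper, left, and upper-left neighbours, all of which are already filled when it is reached. The bottom-right entry, 2, corresponds to deleting "f" and substituting "p" with "w".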

4.4. Related Code

After considering how the table should be filled, we can start writing code with the equation.

for (int i = 0; i <= stringA.length(); i++) {  // from the top to the bottom
  for (int j = 0; j <= stringB.length(); j++) {  // from left to right
    if (i && j) {  // i > 0 and j > 0
      // (1) delete the i-th character of A, or
      // (2) insert the i-th character of A at the end of B
      int tmpEditTimes1 = editTimes[i - 1][j] + 1;

      // (1) delete the j-th character of B, or
      // (2) insert the j-th character of B at the end of A
      int tmpEditTimes2 = editTimes[i][j - 1] + 1;

      // (1) the i-th character of A equals the j-th character of B (cost 0), or
      // (2) substitution (cost 1)
      // stringA[i - 1] is the i-th character because indices start at 0
      int tmpEditTimes3 = editTimes[i - 1][j - 1] +
        int(!(stringA[i - 1] == stringB[j - 1]));

      // find out the smallest edit distance
      editTimes[i][j] = min(
        tmpEditTimes1,
        tmpEditTimes2,
        tmpEditTimes3);
    }
    else {  // i = 0 or j = 0 or both equal 0
      editTimes[i][j] = i == 0 ? j : i;
    }
  }
}

5. Show Me the Code

#include <iostream>
#include <string>
using namespace std;

int editTimes[2001][2001];

int min(int a, int b);
int min(int a, int b, int c);

int main(void) {
    // receive string A
    string stringA;
    getline(cin, stringA);

    // receive string B
    string stringB;
    getline(cin, stringB);

    // fill the table
    for (int i = 0; i <= stringA.length(); i++) {  // from the top to the bottom
        for (int j = 0; j <= stringB.length(); j++) {  // from left to right
            if (i && j) {  // i > 0 and j > 0
                // (1) delete the i-th character of A, or
                // (2) insert the i-th character of A at the end of B
                int tmpEditTimes1 = editTimes[i - 1][j] + 1;

                // (1) delete the j-th character of B, or
                // (2) insert the j-th character of B at the end of A
                int tmpEditTimes2 = editTimes[i][j - 1] + 1;

                // (1) the i-th character of A equals the j-th character of B (cost 0), or
                // (2) substitution (cost 1)
                // stringA[i - 1] is the i-th character because indices start at 0
                int tmpEditTimes3 = editTimes[i - 1][j - 1] +
                    int(!(stringA[i - 1] == stringB[j - 1]));

                // find out the smallest edit distance
                editTimes[i][j] = min(
                    tmpEditTimes1,
                    tmpEditTimes2,
                    tmpEditTimes3);
            }
            else {  // i = 0 or j = 0 or both equal 0
                editTimes[i][j] = i == 0 ? j : i;
            }
        }
    }

    // display the minimum edit distance
    cout << editTimes[stringA.length()][stringB.length()];

    return 0;
}

int min(int a, int b) {
    return a < b ? a : b;
}

int min(int a, int b, int c) {
    return min(min(a, b), c);
}
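
The program above stores the table in a fixed 2001 × 2001 global array, so it presumably assumes inputs of at most 2000 characters. As a hedged alternative, the same table fill can be wrapped in a function that sizes the table from the inputs. This is only a sketch: the function name `editDistance` is hypothetical, and `std::vector` and `std::min` with an initializer list replace the global array and the hand-written `min` overloads.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Bottom-up table fill; the table is sized (m + 1) x (n + 1) from the inputs.
int editDistance(const std::string& a, const std::string& b) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    std::vector<std::vector<int>> edit(m + 1, std::vector<int>(n + 1, 0));

    for (int i = 0; i <= m; ++i) {        // from the top to the bottom
        for (int j = 0; j <= n; ++j) {    // from left to right
            if (i == 0 || j == 0) {       // special cases: one prefix is empty
                edit[i][j] = (i == 0) ? j : i;
                continue;
            }
            int notIdentical = (a[i - 1] != b[j - 1]) ? 1 : 0;
            edit[i][j] = std::min({edit[i - 1][j] + 1,
                                   edit[i][j - 1] + 1,
                                   edit[i - 1][j - 1] + notIdentical});
        }
    }
    return edit[m][n];
}
```

With this form, `editDistance("fxpimu", "xwrs")` returns 5, and the function can be checked against small cases without going through standard input.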

6. T(n) and S(n)

The code that fills the table consists of two nested loops, and every statement inside the loops has a time complexity of \(O(1)\). Thus the time complexity is
\[T(m, n) = O(m) \cdot O(n) = O(mn)\]
where \(m\) and \(n\) are the lengths of the two strings, respectively.

We used a 2-dimensional array, of which \((m + 1) \cdot (n + 1)\) entries are filled. Thus the space complexity is
\[S(m, n) = O(m) \cdot O(n) = O(mn)\]
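
Each row of the table depends only on the row directly above it, so when only the final distance is needed, two rows are enough. The following sketches that common variation; it is not part of the original solution, and `editDistanceTwoRows` is a hypothetical name. The time stays \(O(mn)\), while the extra space drops to \(O(n)\).

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Keeps only the previous row (i - 1) and the current row (i) of the table.
int editDistanceTwoRows(const std::string& a, const std::string& b) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    std::vector<int> prev(n + 1), curr(n + 1);

    for (std::size_t j = 0; j <= n; ++j) {
        prev[j] = static_cast<int>(j);    // row i = 0: edit[0][j] = j
    }
    for (std::size_t i = 1; i <= m; ++i) {
        curr[0] = static_cast<int>(i);    // column j = 0: edit[i][0] = i
        for (std::size_t j = 1; j <= n; ++j) {
            int notIdentical = (a[i - 1] != b[j - 1]) ? 1 : 0;
            curr[j] = std::min({prev[j] + 1,                  // edit[i - 1][j] + 1
                                curr[j - 1] + 1,              // edit[i][j - 1] + 1
                                prev[j - 1] + notIdentical}); // diagonal
        }
        std::swap(prev, curr);            // the current row becomes the previous one
    }
    return prev[n];                       // after the last swap, prev holds row m
}
```

The trade-off is that the full table is no longer available afterwards, so the sequence of edit operations cannot be read back from it.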

7. Experience

To solve the problems of this chapter, we should:

  1. find the optimal substructure
  2. find the overlapping subproblems

Once we have found these two, we can be confident that the problem can be solved with dynamic programming.

The three things to determine when solving a dynamic programming problem are:

  1. the dimension of the table
  2. the range to be filled
  3. the filling order

Everything gets simple once these steps are done.



Reference: 动态规划之编辑距离问题 (The Edit Distance Problem with Dynamic Programming)

Original article: https://www.cnblogs.com/Chunngai/p/11691697.html

Time: 2024-08-30 04:17:56
