转:The Knuth-Morris-Pratt Algorithm in my own words

The Knuth-Morris-Pratt Algorithm in my own words

For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.

Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than na”ive string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here’s the partial match table for the pattern “abababca”:

1
2
3
char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.

Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

1
2
3
char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

1
2
3
bacbababaabcbab
 |
 abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

1
2
3
bacbababaabcbab
    |||||
    abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

1
2
3
4
5
// x denotes a skip

bacbababaabcbab
    xx|||
      abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

1
2
3
4
5
// x denotes a skip

bacbababaabcbab
      xx|
        abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.

Conclusion

So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.

Posted by Jake Boxer Dec 13th, 2009

时间: 2024-12-20 23:56:28

转:The Knuth-Morris-Pratt Algorithm in my own words的相关文章

模板: 字符串模式匹配 Knuth–Morris–Pratt Algorithm

Knuth–Morris–Pratt Algorithm KMP字符串模式匹配算法 模板题 Brief Introduction To be updated Algorithm To be updated Template Code #include <cstdio> #include <cstring> #define MAXN 1000005 int n,m,cnt,next[MAXN]; char s1[MAXN],s2[MAXN]; void kmp(){ next[0]=

KMP(Knuth–Morris–Pratt) Algorithm

文章 String S, 要查找的字符串 String W 重点:Get int T[] from W 用变量 c 来表示当前比较到的 W 中前面的字符,i 表示当前要处理的 T 中的 index,int[] T = new int[W.length];T[0] = -1; T[1] = 0;c=0; i=2 example:array:    a    b    a    b    c    d    a    b    a    b    a    bidx    :    0    1  

我所理解的 KMP(Knuth–Morris–Pratt) 算法

假设要在 haystack 中匹配 needle . 要理解 KMP 先需要理解两个概念 proper prefix 和 proper suffix,由于找到没有合适的翻译,暂时分别称真实前缀 和 真实后缀. 真实前缀(Proper prefix): 一个字符串中至少不包含一个尾部字符的前缀字符串.例如 "Snape" 的全部真实前缀是 “S”, “Sn”, “Sna”, and “Snap” . 真实后缀(Proper suffix): 一个字符串中至少不包含一个头部字符的后缀字符串

Go语言(golang)开源项目大全

转http://www.open-open.com/lib/view/open1396063913278.html内容目录Astronomy构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析器控制台用户界面加密数据处理数据结构数据库和存储开发工具分布式/网格计算文档编辑器Encodings and Character SetsGamesGISGo ImplementationsGraphics and AudioGUIs and Widget ToolkitsHardwareLangu

GO语言的开源库

Indexes and search engines These sites provide indexes and search engines for Go packages: godoc.org gowalker gosearch Sourcegraph Contributing To edit this page you must be a contributor to the go-wiki project. To get contributor access, send mail t

[转]Go语言(golang)开源项目大全

内容目录 Astronomy 构建工具 缓存 云计算 命令行选项解析器 命令行工具 压缩 配置文件解析器 控制台用户界面 加密 数据处理 数据结构 数据库和存储 开发工具 分布式/网格计算 文档 编辑器 Encodings and Character Sets Games GIS Go Implementations Graphics and Audio GUIs and Widget Toolkits Hardware Language and Linguistics 日志 机器学习 Math

“浅析kmp算法”

"浅析kmp算法" By 钟桓 9月 16 2014 更新日期:9月 16 2014 文章目录 1. 暴力匹配: 2. 真前缀和真后缀,部分匹配值 3. 如何使用部分匹配值呢? 4. 寻找部分匹配值 5. 拓展 5.1. 最小覆盖字串 6. 参考资料 首先,KMP是一个字符串匹配算法,什么是字符串匹配呢?简单地说,有一个字符串"BBC ABCDAB ABCDABCDABDE",我想知道这个字符串里面是否有"ABCDABD":我想,你的脑海中马上就

省选前模板复习

PREFACE 也许是OI生涯最后一场正式比赛了,说是省选前模板,其实都是非常基础的东西,穿插了英文介绍和部分代码实现 祝各位参加JXOI2019的都加油吧 也希望今年JX能翻身,在国赛中夺金 数学知识 见数学知识小结 字符串 KMP算法Knuth-Morris-Pratt Algorithm KMP算法,又称模式匹配算法,是用来在一个文本串(text string)s中找到所有模式串(pattern)w出现的位置. 它是通过当失配(mismatch)发生时,模式串本身能提供足够的信息来决定下一

KMP算法解析(转自图灵社区)

KMP算法是一个很精妙的字符串算法,个人认为这个算法十分符合编程美学:十分简洁,而又极难理解.笔者算法学的很烂,所以接触到这个算法的时候也是一头雾水,去网上看各种帖子,发现写着各种KMP算法详解的转载帖子上面基本都会附上一句:“我也看的头晕”——这种诉苦声一片的错觉仿佛人生苦旅中找到知音,让我几乎放弃了这个算法的理解,准备把它直接记在脑海里了事. 但是后来在背了忘忘了背的反复过程中发现一个真理:任何对于算法的直接记忆都是徒劳无功的,基本上忘得比记的要快.后来看到刘未鹏先生的这篇文章:知其所以然(

字符串匹配:从后缀自动机到KMP

后缀自动机(sam)上的字符串匹配 ==== 我们把相对较短的模式串构造成sam. 对于P="abcabcacab", T[1..i]的后缀,使得它是sam的最长前缀长度: T: b a b c b a b c a b c a a b c a b c a b c a c a b  c 1 1 2 3 1 1 2 3 4 5 6 7 1 2 3 4 5 6 7 5 6 7 8 9 10 4 如果最长前缀长度是|P|,则表示T[1..i]的后缀和P匹配. 内存使用 可能多个trans指针同