String Search Algorithm - KMP

/**
*    KMP is a well-known algorithm for finding a substring in a text. To appreciate what it offers, we should first get acquainted with the simple algorithm.
*/

/**
*    Simple algorithm
*
*    Workflow:
*        (@ct denotes the current position in the text being searched)
*
*        step 1: match the text starting at index @ct against the pattern.
*        step 2: if the characters match, step to the next character; otherwise, restart the match at @ct + 1.
*
*        The most straightforward algorithm looks for a match at successive values of the index @ct, the position in the string being searched, i.e. S[ct].
*        If @ct reaches the end of the string, there is no match, and the search is said to "fail". At each position @ct, the algorithm first checks whether
*        S[ct] and P[0] are equal. If they are, it checks the remaining characters; if not, it moves on to the next position, ct+1.
*
*        The code below puts all of these details together.
*/

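int work( const char *s, const char *pat)
{
        int    ct = 0;        // current start position in the text
        int    mi = 0;        // number of pattern characters matched so far
        while( s[ct]!='\0' )
        {
                // every new start position compares from the beginning of the pattern
                mi = 0;
                while( s[ct + mi]==pat[mi])
                {
                        mi++;
                        if( pat[mi]=='\0')
                            return ct;
                }
                ct++;
        }
        return -1;
}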

/**
*    It is simple and usually performs well. But things become terrible when we deal with certain special strings, such as:
*
*        S[m] = 00000000....00000001
*        P[n] = 000001
*
*    Here the result is really awful: at almost every position the first five characters match and only the last comparison fails.
*    If we take a character comparison as the basic operation, we get the time complexity. In the ideal case,
*        T(m,n) = O( m+n)
*
*    which is reasonable. But in the worst case,
*        T(m,n) = O( m*n)
*
*    Now it is time for KMP to show its excellent performance.
*/
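
/*
*    To make the worst case concrete, here is a minimal sketch that counts the character comparisons of the simple algorithm on the text and pattern shown above.
*    The names naive_count and zeros_then_one are hypothetical helpers, not part of the original code, and unlike work() the loop stops as soon as the
*    remaining text is shorter than the pattern.
*/

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: runs the simple search and counts every character
// comparison it performs. Returns the index of the first match, or -1.
long naive_count( const std::string &s, const std::string &pat, long &comparisons)
{
        assert( !pat.empty());
        comparisons = 0;
        for( std::size_t ct = 0; ct + pat.size() <= s.size(); ct++)
        {
                std::size_t    mi = 0;
                while( mi < pat.size())
                {
                        comparisons++;
                        if( s[ct + mi]!=pat[mi])
                                break;
                        mi++;
                }
                if( mi==pat.size())
                        return (long)ct;
        }
        return -1;
}

// Hypothetical helper: builds the awkward text from the comment above,
// i.e. (len - 1) zeros followed by a single 1.
std::string zeros_then_one( std::size_t len)
{
        assert( len > 0);
        return std::string( len - 1, '0') + "1";
}
```

/*
*    For a text of 20 characters (19 zeros and a final 1) and the pattern 000001, this sketch should report a match at index 14 after
*    (20-6+1)*6 = 90 comparisons, which is the m*n growth described above.
*/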

/**
*    KMP
*    KMP is an algorithm for searching for a substring in a string. In the case above the situation becomes awful because of one particular circumstance:
*    we are always told that we are wrong just when we have almost succeeded. There are so many traps.
*
*    Q: Can we do something about those traps?
*/

/*
*    Of course we can. Imagine that you encounter a trap while checking the substring that starts at S[ct], and you realize it when you compare S[ct+m] with P[m].
*    At this moment, besides the failure, we also know another thing:
*
*        S[ct]S[ct+1]....S[ct+m-1] = P[0]P[1]....P[m-1]
*
*    Have you noticed anything?
*    Every trap must be a substring of the pattern that starts at index 0, i.e. a prefix of the pattern. That means we can do something clever, since we know
*    in advance what we will face. Now pick one trap to analyze. In the case above it is
*
*        P[0]P[1]P[2]....P[m-1]
*
*    What we are interested in is whether a real match may still be hidden inside this trap.
*
*    Q: How can we know that?
*/

/*
*    If a real match is still hidden inside this trap, then for some n we must have
*
*        P[0]P[1]....P[n-1] = P[m-n]....P[m-2]P[m-1]    n<m
*    (this is the key point of the KMP algorithm.)
*
*    Of course, some false candidates may also satisfy this formula. But who cares, we only want to find the doubtful ones. With the help of this conclusion,
*    we can tell whether a doubtful substring exists and where it is. Once we know where the doubtful substring is, we can skip the positions that cannot match. That's all.
*/
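
/*
*    The condition above can be checked directly. The following is a brute-force sketch, assuming the trap is given as a std::string.
*    longest_border and next_start are hypothetical names, and this is not how the table is built in the example further down.
*/

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length n of the longest proper prefix of `trap`
// that is also a suffix of `trap` (0 if there is none).
std::size_t longest_border( const std::string &trap)
{
        std::size_t    m = trap.size();
        // try the longest candidate first: n = m-1, m-2, ..., 1
        for( std::size_t n = ( m > 0 ? m - 1 : 0); n > 0; n--)
        {
                // compare P[0..n-1] with P[m-n..m-1]
                if( trap.compare( 0, n, trap, m - n, n)==0)
                        return n;
        }
        return 0;
}

// Hypothetical helper: start position of the next candidate match after
// m characters matched at text position ct and character m mismatched.
std::size_t next_start( std::size_t ct, const std::string &pat, std::size_t m)
{
        assert( m > 0 && m <= pat.size());
        return ct + m - longest_border( pat.substr( 0, m));
}
```

/*
*    For the trap "abaab" (the first five characters of "abaabcac") the longest prefix that is also a suffix is "ab", so
*    next_start( 0, "abaabcac", 5) gives 3: start positions 1 and 2 are skipped.
*/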

/*
*    Here is an example of KMP.
*/

#include <stdio.h>
#include <string.h>    // strlen

typedef int    INDEX;
class KMP {
        public:
                KMP( );
                ~KMP( );
                bool set( char *s, char *patt );
                bool work( void);

                bool showPT( void);
                bool	show( void);

        private:
                /*
                 *    build the partial-match table for the pattern.
                 */
                bool createPT( void);
                bool computePTV( INDEX mmi);

                /*
                 *    text string
                 */
                char *str;
                /*
                 *    pattern string
                 */
                char *pattern;
                /*
                 *    the length of the pattern
                 */
                int    p_len;
                /*
                 *    pointer to the partial-match table
                 */
                int    *ptab;
                INDEX    ip;        //index in the text where the pattern was found, -1 if not found
};

KMP::KMP( )
{
        this->str = NULL;
        this->pattern = NULL;
        this->p_len = 0;
        this->ptab = NULL;
        this->ip = -1;
}

KMP::~KMP()
{
        if( this->ptab!=NULL)
        {
                delete []this->ptab;
                this->ptab = NULL;
        }
}

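bool KMP::set(char * s,char * patt)
{
        if( s==NULL || patt==NULL)
                return false;

        this->str = s;
        this->pattern = patt;
        this->ip = -1;

        // release the table of a previous pattern, if any
        delete []this->ptab;

        this->p_len = (int)strlen( this->pattern);
        this->ptab = new int[this->p_len];

        return true;
}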

bool KMP::work(void)
{
        // nothing to search until set() has stored a text and a non-empty pattern
        if( this->str==NULL || this->pattern==NULL || this->p_len==0 )
                return false;
/*
*    As analyzed above, to skip some traps we need a table that records
*    some information about them.
*/
        this->createPT( );

        // start position in the text of the current candidate match
        INDEX    cmi = 0;
        // number of characters matched so far; also the index into the pattern
        INDEX    si = 0 ;
/*
*    The search itself is almost the same as the simple algorithm. The
*    difference appears when we encounter a trap: based on the analysis
*    above, we can skip some start positions.
*/
        while( this->str[cmi]!='\0' )
        {
                /*
                 *    match the pattern against the text, as in the simple algorithm.
                 */
                while( this->str[cmi + si]==this->pattern[si] )
                {
                        si++;
                        if( this->pattern[si]=='\0')
                        {
                                this->ip = cmi;
                                return true;
                        }
                }
                /*
                 *    When we encounter a trap, we revise the start position of
                 *    the next search. @(cmi + si) is the text position where the
                 *    mismatch occurred. @( this->ptab[si]) is the length of the
                 *    longest prefix of the pattern that is also a suffix of the
                 *    matched part, so the next candidate starts that many
                 *    characters before the mismatch.
                 */
                cmi = cmi + si -this->ptab[si];
                /*
                 *    The characters of that prefix are already known to match,
                 *    so continue comparing right after them.
                 */
                if( si!=0)
                {
                        si = this->ptab[si];
                }
        }

        return false;
}
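
/*
*    To see the skipping at work, here is a self-contained sketch of the same search loop that records every start position it tries.
*    kmp_trace is a hypothetical helper written with std::string, and its table is filled by a simple brute-force loop rather than by createPT.
*/

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: searches `pat` in `text` like KMP::work and stores
// each candidate start position in `starts`. Returns the match index or -1.
long kmp_trace( const std::string &text, const std::string &pat, std::vector<long> &starts)
{
        assert( !pat.empty());
        // tab[0] = -1; tab[k] = length of the longest proper prefix of
        // pat[0..k-1] that is also its suffix (same meaning as ptab)
        std::vector<long>    tab( pat.size(), 0);
        tab[0] = -1;
        for( std::size_t k = 2; k < pat.size(); k++)
        {
                for( std::size_t n = k - 1; n > 0; n--)
                {
                        if( pat.compare( 0, n, pat, k - n, n)==0)
                        {
                                tab[k] = (long)n;
                                break;
                        }
                }
        }

        const long    tlen = (long)text.size();
        const long    plen = (long)pat.size();
        long    cmi = 0;    // start of the current candidate match
        long    si = 0;     // characters matched so far
        starts.clear();
        while( cmi < tlen)
        {
                starts.push_back( cmi);
                while( cmi + si < tlen && text[cmi + si]==pat[si])
                {
                        si++;
                        if( si==plen)
                                return cmi;
                }
                // shift the start position and keep the part already matched
                cmi = cmi + si - tab[si];
                if( si!=0)
                        si = tab[si];
        }
        return -1;
}
```

/*
*    For the text "abaabaabcac" and the pattern "abaabcac", this sketch tries only the start positions 0 and 3 and returns 3.
*/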

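/**
*    Fill the partial-match table: one entry for every prefix of the
*    pattern that may become a trap.
*/
bool KMP::createPT( void)
{
        //mismatch index
        INDEX    mmi = 1;
        /*
         *    handle every prefix of the pattern that can become a trap.
         */
        while( mmi<this->p_len)
        {
                this->computePTV( mmi);
                mmi++;
        }
        /*
         *    Strictly speaking, index 0 is not a trap, but this trick keeps
         *    the search loop concise: after a mismatch on the first
         *    character, the start position simply moves forward by one.
         */
        this->ptab[0] = -1;
        return true;
}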

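/**
*    Compute the table entry for a mismatch at index @mmi: the length of the
*    longest proper prefix of pattern[0..mmi-1] that is also its suffix.
*    This information is used to skip some traps.
*/
bool KMP::computePTV( INDEX mmi)
{
        //index of the last matched character
        INDEX    max_mi = mmi -1;
        //last index of the candidate prefix; the longest candidate is tried first
        INDEX    i = max_mi -1;
        //number of characters already compared, counted from the end
        INDEX    ss_len = 0;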
        while( i>=0 )
        {
                while( this->pattern[ max_mi - ss_len]==this->pattern[ i-ss_len] )
                {
                        if( i==ss_len)
                        {
                                this->ptab[ mmi] = ss_len + 1;
                                return true;
                        }

                        ss_len ++;
                }

                ss_len = 0;
                i--;
        }

        this->ptab[ mmi] = 0;
        return false;
}
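
/*
*    computePTV restarts the comparison for every candidate prefix, so building the whole table can take far more than p_len steps.
*    The table can also be built in one pass by reusing the entries that are already known. The sketch below shows this standard
*    construction; build_table is a hypothetical name and it is not part of the class above. It is meant to produce the same values
*    as createPT, including the -1 at index 0.
*/

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: one-pass construction of the partial-match table.
// tab[0] = -1; tab[i] = length of the longest proper prefix of pat[0..i-1]
// that is also its suffix.
std::vector<int> build_table( const std::string &pat)
{
        assert( !pat.empty());
        std::vector<int>    tab( pat.size(), 0);
        tab[0] = -1;
        // at the top of each iteration k equals tab[i]
        int    k = -1;
        for( std::size_t i = 0; i + 1 < pat.size(); i++)
        {
                // fall back to shorter candidates until pat[i] extends one of them
                while( k>=0 && pat[i]!=pat[(std::size_t)k])
                        k = tab[(std::size_t)k];
                k++;
                tab[i + 1] = k;
        }
        return tab;
}
```

/*
*    For the pattern "abaabcac" this sketch yields the table -1 0 0 1 1 2 0 1.
*/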

bool KMP::showPT(void)
{
        int    i = 0;
        for( i=0; i<this->p_len; i++)
        {
                printf( "%2c", this->pattern[i] );
        }
        printf("\n");

        for( i=0; i<this->p_len; i++)
        {
                printf("%2d", this->ptab[i]);
        }
        printf("\n");

        return true;
}

bool KMP::show(void)
{
        printf("text: %s\n", this->str);
        printf("pattern: %s\n", this->pattern);

        printf(" index = %d\n", this->ip);

        return true;
}

#define TEXT	"aababcaababcabcdabbabcdabbaababcabcdabbaababaababcabcdabbcaabaacaabaaabaabaababcabcdabbabcabcdaababcabcdabbabbcdabb"
#define PATTERN	"abaabcac"

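int main( )
{
        // writable copies: set() takes char *, and a string literal is const in C++
        char    text[] = TEXT;
        char    pattern[] = PATTERN;

        KMP    kmp;
        kmp.set( text, pattern);
        kmp.work( );
        kmp.showPT( );
        kmp.show( );

        return 0;
}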

Time: 2024-11-05 06:05:34
