Data structure - < naive matching algorithm and optimized KMP matching algorithm >

Concept of string

String: a finite sequence of zero or more characters, also known as a string

Some strings

	Empty string( null string),It is zero in length and can be directly represented by two pairs of quotation marks ""
 The so-called sequence shows that the adjacent characters of the string have the relationship between predecessor and successor.
There are still some concepts to explain.
 	A space string is a string that contains only spaces. Note the difference between it and empty string. The space string has content and length, and can have more than one space. Substring and main string. The subsequence composed of any number of continuous characters in the string is called	The substring of the string. Accordingly, the string containing the substring is called the main string.
The position of the substring in the main string is the sequence number of the first character of the substring in the main string,
 "over","end" ~ * lie "
 In fact, it can be considered as"lover " friend"
 believe" Substrings of these word strings

Naive matching algorithm



The question itself emphasizes c In language strstr function
 Concept: definition: strstr(str1,str2) Function to determine the string str2 Is it str1 Substring of.
If yes, the function returns str2 stay str1 The address that first appears in the;
Otherwise, return NULL. 

The first thing we think of is a simple and rough solution to match slowly. Start from the first character of the main string and match the string all the way back. If you encounter a different direct stop, then start to match backward from the second element of the main string. You can complete it by knowing that it is a substring and finding its starting position
Here's the code

int strStr (char * haystack, char * needle){
    int len1 = strlen(haystack);
    int len2 = strlen(needle);
    if(len2 == 0)
        //The first step is to find the length of two strings. If the length of the shorter string is zero, we directly return 0-false
    return 0;
    int i = 0,j = 0;
    //Pay attention to the conditions for judging the end of the while loop
    //The reason why we don't use the for loop is that we don't know when i or j will move when reading the string
    // Double pointer method i and j point to long string and short string respectively
    while (haystack[i] != '\0' && needle[j] != '\0') {
        if(haystack[i] == needle[j]) {
            //If equal, i and j move together
        } else {
            i = i - j + 1; //The long string returns to the position where the comparison started
            j = 0;//Substring starts from scratch
        if (j == len2)
            //Substring end
        return i-j;
        //The long string pointer minus the substring pointer is the initial letter, because the substring length is j
    return -1;//If the judgment fails, return - 1

This method is too slow! Finding the position of a substring requires too many steps. And many steps are not necessary at all.
This idea is very good!! Many great ideas are born in the process of perfecting and correcting the existing methods step by step. In the best case, the time complexity of this algorithm is O(n). That is, the n characters of the substring are exactly equal to the first n characters of the main string, and in the worst case, the time complexity is O(m*n). In contrast, the space complexity of this algorithm is O(1), that is, it consumes no space but time.

**Now let's get to our topic: how KMP algorithm optimizes these steps. In fact, the main idea of KMP is: "space for time".

KMP matching algorithm


  The longest equal prefix and suffix of a string:

**The formula given in the book is as follows**

Simple textbooks are boring. My understanding is as follows
String abcdab
Set of prefixes: {a,ab,abc,abcd,abcda}
Set of suffixes: {b,ab,dab,cdab,bcdab}
Then the longest equal prefix and suffix is ab, which is the essence of this formula!


The so-called KMP algorithm:
**It is to build a next array to judge the backtracking position of J pointer. In our simple algorithm, j is backtracked every time
The initial position of the substring, but kmp is to avoid unnecessary backtracking

**The string before each character has the longest prefix, and the length of the longest prefix is the key to our shift, so we use a next array to store the length of the longest prefix of the substring. Moreover, the value of the next array is only related to the substring itself.
Therefore, next[i]=j means that the length of the longest phase of the string before the character with subscript I is j.
We can calculate that the next array of substring t= "abcabcmn" is next [0] = - 1 (there is no separate string processing in the front)
The code is as follows, which needs to be understood


void getNext(int* next, char* s) {
    //Initialize next
    int j = 0;
    next[0] = j;
    for (int i = 1; i < strlen(s); i++) {	//Note that i starts with 1
        //If the front suffixes are different
        while (j > 0 && s[i] != s[j]) {
            //Then go back forward
            j = next[j - 1];
        //If the front suffix is the same
        if (s[i] == s[j]) {
            //i and j move backward at the same time (the increase of i is in the for loop)
        //Assign j (length of prefix) to next[i]
        next[i] = j;

int strStr(char * haystack, char * needle){
    int len1 = strlen(haystack);
    int len2 = strlen(needle);
    //When the need is an empty string, 0 is returned
    if (len2 == 0) {
        return 0;

    //Build next array
    int* next = malloc(sizeof(int) * len2);
    getNext(next, needle);
    //The starting position of the next record is 0, so it also starts from 0 here
    int j = 0;
    for (int i = 0; i < len1; i++) {	//Note that i starts from 0 when matching
        //If not
        while (j > 0 && haystack[i] != needle[j]) {
            //Return to the previous matching position j
            j = next[j - 1];
        //If match
        if (haystack[i] == needle[j]) {
            //i and j move backward at the same time (the increase of i is in the for loop)
        //When j is equal to the length of the need, the string need appears in the string haystack
        if (j == len2) {
            //Returns the first position of the need string
            return (i - len2 + 1);
    //If it is not found, it indicates that it does not exist and returns - 1
    return -1;

Tags: Algorithm data structure

Posted by AcousticJames on Sun, 15 May 2022 11:32:02 +0300