[string algorithm] AC automata

Two days after the national day, I even wanted to make a cooing sound. Cough, cough, these are not important! Recently, I studied AC automata and found that it is not as difficult as I thought.

Origin of AC automata

I know that many people are very excited when they first see this thing. (don't ask me why I know)

But AC automata is not an automatic program...

AC automata is called AC automata because the algorithm was originally called aho Corasick automaton, which was invented by a man named aho Corasick.

So AC automata is also called aho Corasick algorithm

The algorithm was produced in Bell Labs in 1975. It is a famous multi-mode matching algorithm.

Use of AC automata

Then some students may have questions. AC automata can't automatically AC. what's the role?

In fact, the usage of AC automata is similar to KMP, which is used to solve the problem of string matching; But the difference is that AC automata is more used to solve the matching problem of multiple strings, in other words, the KMP problem with multiple substrings to match.

For example, give some words acbs, asf,dsef;
Then give a long article (sentence), acbsdfgeasf;
Ask how many words appear in this article, or the total number of words, which is the problem to be solved by AC automata.

Implementation method of AC automata

AC automata is based on the structure of Trie and the idea of KMP.

In short, there are two steps to establish an AC automata:

  1. Basic Trie structure: all pattern strings form a Trie.
  2. KMP's idea: construct mismatch pointers for all nodes on the Trie tree.

Then we can use it for multi pattern matching.

Students who don't understand trie can Click here to learn

Students who can't understand KMP can Click here to learn

So let's realize AC automata step by step!

Define a dictionary tree

First, we need to define a dictionary tree. We use struct to realize the definition of each node:

struct node
{
    int next[27];
    int fail;
    int count;
    void init()
    {
        memset(next,-1,sizeof(next));
        fail=0;
        count=0;
    }
}s[1100001];

The next [] array that stores the drive values

The next [] array is used to store the position of the last character of each character in the s array in the normal Trie tree. For example, if we read a string APPLE, then:

s [1] stores A, its next [P] = 2, and the rest is - 1;
s [2] stores P, its next [P] = 3, and the rest is - 1;
s [3] stores P, its next [L] = 4, and the rest is - 1;
s [4] stores L, its next [E] = 5, and the rest is - 1;
s [5] stores E, and its next is - 1.

fail: failed pointer

fail is the failure pointer. The following structure will talk about how to construct it quickly, so what's the use?

Let's take an example. This example only shows the mismatch pointer of e:
Suppose we read she, shr, say and her, then we get a lovely dictionary tree:

Then we just construct a failure pointer:

For example, matching article: sher, we just started from s and walked to the left. After walking to e, we found that: ah, there is no way to continue walking. If we start another round of matching from h, it will be a great waste of time; At this time, we are like, can we use the previous matching information? tolerable! The prefix he of her is exactly the same as that of she, so when she matching fails, we jump to the back of he and continue matching, and find that R matches r! This is the use of the fail pointer. Is it very similar to the next array of KMP!

count at the end of the record

If I insert a word APPLE, insert it into the last e, and find that this word has no subsequent letter. At this time, we will add a 1 to the count of this e, indicating that there is a word ending with this E.

Initialized init()

Here we also define an initialization function init(), which is used to initialize when we start a new starting point.

Insert words into the dictionary tree

Let's explain it in combination with the procedure:

int ins()
{
    int len=strlen(str+1);
    int ind=0,i,j;

    for(int i=1;i<=len;i++)
    {
        j=str[i]-'a'+1;
        if(s[ind].next[j]==-1)
        {
            num++;
            s[num].init();
            s[ind].next[j]=num;
        }
        ind=s[ind].next[j];

    }
    s[ind].count++;
    return 0;
}

First, str array is the string we want to read in, and ind represents my current position in s [] array; Next, let's start the cycle - for each point:

If the next of his previous letter does not point to his letter, we will create a new point to store the letter and let the next of his previous letter point to it;

If there is a position that directly points to its letter, just jump over it!

Finally, don't forget to add 1 to the count at the end of each word.

a key!!! Quick construction of fail pointer

What's the use of the fail pointer

First, what is the use of the fail pointer? Let's continue with the previous example:

We found that the fail pointer of e on the left points to e on the far right of L, so what is the meaning of this pointer? When a point i points to a point J, let's start from J and go up l characters to the vertex, where the string from the vertex to j is s;

In this example, s is "he" and the length is l, that is, 2; Then start with i and go up L-1 characters to get a string ss. In this example, ss is also "he"!

At this time, we were surprised to find that s is the same as ss!!

We know that when the fail pointer of i points to j, the string s from vertex to j is the suffix of the string from vertex to i!

In this way, if i continues to match down and fails, i can start matching directly from his fail instead of starting from scratch! Save a lot of time! This is the essence of the fail pointer!

How to construct the fail pointer

Let's paste the code first:

int make_fail()
{
    int head=1,tail=1;
    int ind,ind_f;
    for(int i=1;i<=26;i++)
    {
        if(s[0].next[i]!=-1)
        {
            q[tail]=s[0].next[i];
            tail++;
        }
    }
    while(head<tail)
    {
        ind=q[head];
        for(int i=1;i<=26;i++)
        {
            if(s[ind].next[i]!=-1)
            {
                q[tail]=s[ind].next[i];
                tail++;

                ind_f=s[ind].fail;

                while(ind_f>0 && s[ind_f].next[i]==-1)
                ind_f=s[ind_f].fail;

                if(s[ind_f].next[i]!=-1)ind_f=s[ind_f].next[i];
                s[s[ind].next[i]].fail=ind_f;
            }
        }
        head++;
    }
    return 0;
}

First, we need to open a queue q to store the points to be processed;

Then we add all the points connected to the vertices to the queue, and then we operate on each number in the queue:

First, add all his sons to the end of the queue, and then as a responsible father node, we can't just throw our sons to the end of the queue, and do a good job - help our sons do a good job in the fail pointer——

First, if I am the parent node x, for the child node of the letter A, I will first look at the node y pointed to by my fail pointer and see if y has the letter a child node z. if so, it would be great. I will let the fail pointer of my child node point to z;

If not, start from y and continue to see if there are any child nodes of the letter a at the point he fail s to point to... Until you find the point that meets the conditions.

If there's no way, even if the fail can't be found all the way to node 0, there's no way. The fail of my letter a child node has to point to node 0 [because the initialization is 0, so there's no need to operate at this time]

Let's take a concrete chestnut to see:











So this operation can quickly construct the fail pointer!

KMP on tree

Let's look at the code first:

int find()
{
    int len=strlen(des+1);
    int j,ind=0;
    for(int i=1;i<=len;i++)
    {
        j=des[i]-'a'+1;
        while(ind>0 && s[ind].next[j]==-1)ind=s[ind].fail;

        if(s[ind].next[j]!=-1)
        {
            ind=s[ind].next[j];
            p=ind;
            while(p>0 && s[p].count!=-1)
            {
                ans=ans+s[p].count;
                s[p].count=-1;
                p=s[p].fail;
            }
        }
    }
    return 0;
}

Similarly, Ind indicates that I have matched the current point. If the current point does not continue to be the same as any child node of ind, then I will jump to the point pointed by the fail pointer of ind... I know that I find a match with the current point or jump to the root node, which is very same as KMP!

It should be noted that since this problem is to solve which points appear in the parent string, we have carried out a layer of Optimization:

while(p>0 && s[p].count!=-1)
 {
     ans=ans+s[p].count;
     s[p].count=-1;
     p=s[p].fail;
 }

When we match a string s [string from the root node to IND], we jump to its fail. Because the string ss from his fail to the root node must be the suffix of S, ss must also appear in the parent string. At this time, add its count and set it to - 1 to prevent subsequent repeated access!

Template question

[Luogu p3808]
Topic background
This is a simple AC automata template problem.
Used to detect correctness and algorithm constants.
In order to prevent card OJ, there are only two groups of data on the basis of ensuring correctness. Please do not submit maliciously.
Administrator's note: there are repeated words in the data of this question, and the repeated words should be calculated many times. Please pay attention
Title Description
Given n pattern strings and 1 text string, find how many pattern strings have appeared in the text string.
Input / output format
Input format:
An n in the first line indicates the number of pattern strings;
The following n lines have a pattern string for each line;
The next line is a text string.
Output format:
A number represents the answer
Sample input and output
Input sample #1: copy
2
a
aa
aa
Output example #1: copy
2
explain
subtask1[50pts]: ∑ length (mode string) < = 10 ^ 6, length (text string) < = 10 ^ 6, n = 1;
subtask2[50pts]: ∑ length (mode string) < = 10 ^ 6, length (text string) < = 10 ^ 6;

This is the template question. The template is given below:

#include<cstdio>
#include<cstdlib>
#include<cstring>
#include<algorithm>
using namespace std;
struct node
{
    int next[27];
    int fail;
    int count;
    void init()
    {
        memset(next,-1,sizeof(next));
        fail=0;
        count=0;
    }
}s[1100001];

int i,j,k,m,n,o,p,js,jl,jk,len,ans,num;
char str[1100000],des[1100000];
int q[1100000];

int ins()
{
    int len=strlen(str+1);
    int ind=0,i,j;

    for(int i=1;i<=len;i++)
    {
        j=str[i]-'a'+1;
        if(s[ind].next[j]==-1)
        {
            num++;
            s[num].init();
            s[ind].next[j]=num;
        }
        ind=s[ind].next[j];

    }
    s[ind].count++;
    return 0;
}

int make_fail()
{
    int head=1,tail=1;
    int inf,inf_f;
    for(int i=1;i<=26;i++)
    {
        if(s[0].next[i]!=-1)
        {
            q[tail]=s[0].next[i];
            tail++;
        }
    }
    while(head<tail)
    {
        inf=q[head];
        for(int i=1;i<=26;i++)
        {
            if(s[inf].next[i]!=-1)
            {
                q[tail]=s[inf].next[i];
                tail++;

                inf_f=s[inf].fail;

                while(inf_f>0 && s[inf_f].next[i]==-1)
                inf_f=s[inf_f].fail;

                if(s[inf_f].next[i]!=-1)inf_f=s[inf_f].next[i];
                s[s[inf].next[i]].fail=inf_f;
            }
        }
        head++;
    }
    return 0;
}

int find()
{
    int len=strlen(des+1);
    int j,ind=0;
    for(int i=1;i<=len;i++)
    {
        j=des[i]-'a'+1;
        while(ind>0 && s[ind].next[j]==-1)ind=s[ind].fail;

        if(s[ind].next[j]!=-1)
        {
            ind=s[ind].next[j];
            p=ind;
            while(p>0 && s[p].count!=-1)
            {
                ans=ans+s[p].count;
                s[p].count=-1;
                p=s[p].fail;
            }
        }
    }
    return 0;
}

int main()
{
    scanf("%d",&m);

    num=0;s[0].init();
    for(int i=1;i<=m;i++)
    {
        scanf("%s",str+1);
        ins();
    }

    scanf("%s",des+1);

    ans=0;

    make_fail();

    find();

    printf("%d",ans);
    return 0;
}

epilogue

Through this blog, I believe you must have learned AC automata! I hope you like this blog!!!

Posted by Shaba1 on Thu, 12 May 2022 13:36:24 +0300