Writing a search engine system in Java: this is really powerful...

Source: https://blog.csdn.net/m0_57315623

Preface

If we tried to build something like Baidu or Sogou on our small server, it certainly would not work: those engines search the whole web. What we do here is on-site search, which is feasible and is similar to searching for resources within a single website.

I How do search engines search

A search engine is like a little bee gathering honey every day: it crawls all kinds of web pages and then builds an index for us to search.

To get the documents we can either crawl them with Python or download a compressed documentation package. We use the package here because it is much faster. I wanted to build one for League of Legends, but I couldn't find a document set. If any of you find one later, feel free to share it.

I suggest you don't crawl arbitrary sites, or you may get into legal trouble. We could crawl the official website of our school freely, and that is what we practiced on at the time. Why use an index?

Because there is too much crawled data. Without an index we would have to traverse every document for every query, and the time complexity of that is too high.

So we need to build an index. There are two kinds: the forward index and the inverted index.

Take LOL as an example. The forward index is like this: when we mention the skills of Master Yi (the Wuju Bladesman), we think of:

  • Q skill: Alpha Strike
  • W skill: Meditate
  • E skill: Wuju Style
  • R skill: Highlander

So the lookup goes from a name to what belongs to it.

The inverted index asks the reverse question: who carries a sword in LOL?

  1. Tryndamere
  2. Master Yi
  3. Fiora

So the lookup goes from a characteristic to the heroes that have it.
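
To make the two directions concrete, here is a minimal sketch that stores both indexes in plain maps. The class name HeroIndexSketch, its method names and the sample traits (including the extra hero Ashe) are made up for illustration and are not part of the project code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: HeroIndexSketch and its method names are not from the article.
class HeroIndexSketch {
    // Forward index: hero name -> traits of that hero.
    private final Map<String, List<String>> forward = new HashMap<>();
    // Inverted index: trait -> names of the heroes that have it.
    private final Map<String, List<String>> inverted = new HashMap<>();

    // Registers a hero; assumes each hero is added once, with distinct traits.
    void add(String hero, String... traits) {
        forward.put(hero, Arrays.asList(traits));
        for (String trait : traits) {
            inverted.computeIfAbsent(trait, k -> new ArrayList<>()).add(hero);
        }
    }

    // Forward lookup: from a name to what belongs to it.
    List<String> traitsOf(String hero) {
        return forward.getOrDefault(hero, Collections.emptyList());
    }

    // Inverted lookup: from a trait to everyone who has it.
    List<String> heroesWith(String trait) {
        return inverted.getOrDefault(trait, Collections.emptyList());
    }

    public static void main(String[] args) {
        HeroIndexSketch index = new HeroIndexSketch();
        index.add("Tryndamere", "sword");
        index.add("Master Yi", "sword", "meditation");
        index.add("Ashe", "bow");
        System.out.println(index.traitsOf("Master Yi"));
        System.out.println(index.heroesWith("sword"));
    }
}
```

Both lookups are a single map access, which is the point of building the index: a query no longer needs to traverse every document.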

II Module division

1. Index module

1) Scan the downloaded documents, parse their contents, build the forward index and the inverted index, and save the index content to a file.

2) Load the saved index, and provide APIs for forward lookup and inverted lookup.

2. Search module

1) Call the index module to realize a complete search process.

Input: the user's query words

Output: the complete search results

3. Web module

We need to implement a simple web program, which can interact with users in the form of web pages. It includes front-end and back-end.
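
The search module is not implemented in this part, so the following is only a rough sketch of how a query could flow through an inverted index. Everything in it is an assumption for illustration: the class SearchFlowSketch, the simple tokenize method standing in for a real word segmenter, and the ranking by number of matched query words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these names come from the article.
class SearchFlowSketch {
    // Inverted index: word -> ids of the documents that contain it.
    private final Map<String, List<Integer>> inverted = new HashMap<>();

    // Stand-in tokenizer: lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Index module side: record which words occur in which document.
    void index(int docId, String text) {
        for (String w : tokenize(text)) {
            List<Integer> ids = inverted.computeIfAbsent(w, k -> new ArrayList<>());
            if (!ids.contains(docId)) {
                ids.add(docId);
            }
        }
    }

    // Search module side: returns document ids, those matching the most distinct query words first.
    List<Integer> search(String query) {
        Map<Integer, Integer> hits = new HashMap<>();
        for (String w : new LinkedHashSet<>(tokenize(query))) {
            for (int docId : inverted.getOrDefault(w, Collections.emptyList())) {
                hits.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> result = new ArrayList<>(hits.keySet());
        // More matched words first; ties are broken by the smaller document id.
        result.sort((a, b) -> {
            int byHits = hits.get(b) - hits.get(a);
            return byHits != 0 ? byHits : a - b;
        });
        return result;
    }

    public static void main(String[] args) {
        SearchFlowSketch s = new SearchFlowSketch();
        s.index(0, "ArrayList is a resizable array");
        s.index(1, "LinkedList is a list");
        s.index(2, "HashMap is a map");
        System.out.println(s.search("is list"));
    }
}
```

A real implementation would segment the query with the same tool used to build the index and would return titles, URLs and snippets rather than bare ids.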

III How to implement word segmentation

Principle of word segmentation:

1. Based on a dictionary

Try to enumerate all the words and put them in a dictionary file.

2. Based on statistics

A large amount of text is collected and labeled manually. From that data we learn which characters are relatively likely to appear together as a word.
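
As a rough illustration of the dictionary-based idea in item 1, here is a sketch of forward maximum matching, one common way to use such a dictionary. The article does not say which algorithm it has in mind, so treat this as an assumption. The class MaxMatchSketch and the tiny dictionary are invented, and unspaced English stands in for Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy forward-maximum-matching segmenter.
class MaxMatchSketch {
    // At each position, take the longest dictionary word that starts there.
    // Characters that start no dictionary word become one-character tokens.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            // Shrink the candidate until it is in the dictionary or only one character is left.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            result.add(text.substring(i, end));
            i = end;
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("java", "search", "engine", "sea"));
        System.out.println(segment("javasearchengine", dict, 6));
    }
}
```

The weakness of this approach is visible in the code: a word that is missing from the dictionary falls apart into single characters, which is why statistical methods and mature libraries are used in practice.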

There are also many third-party tools that implement word segmentation in Java.

For example, ansj (some of you may have heard of it, ha ha) is a third-party word segmentation library available from the Maven Central repository.

We take the latest version and add it as a dependency in pom.xml.

Try it out in the test package: the following test code shows how to use the library.

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import java.util.List;
public class TestAnsj {
    public static void main(String[] args) {
        String str = "Master Yi is an assassin and warrior hero with ultra-high mobility. He is good at quickly defeating his opponent by using fast attack. Master Yi generally fights in the wild and takes a single path. As the last descendant of limitless Kendo, Yi can quickly cut a lot of damage. At the same time, he can also use his skills to avoid fierce attack and avoid the enemy's fire gathering.";
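        // Segment the text and take the resulting list of terms
        List<Term> terms = ToAnalysis.parse(str).getTerms();
        // Print each segmented word on its own line
        for (Term term : terms) {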
            System.out.println(term.getName());
        }
    }
}

IV File reading

Copy the path of the downloaded documents into a String and store it as a constant.

This step traverses the directory to collect all the html files. We use recursion here: if the current entry is an ordinary file, it is added to the file list; if it is a directory, we recurse into it and keep collecting the files inside.

import java.io.File;
import java.util.ArrayList;

// Reads the documents
public class Parser {
    private static final String INPUT_PATH = "D:/test/docs/api";

    // The entry point of the whole Parser class
    public void run() {
        // 1. List all the documents (html) under the path
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        System.out.println(fileList);
        System.out.println(fileList.size());
        // 2. For each file listed above, open it, read its content, and parse it
        // 3. Save the index data structure built in memory to the specified file
    }

    // The first parameter is where the traversal starts; the second collects the result
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles returns only the entries directly inside this directory
        File[] files = rootPath.listFiles();
        if (files == null) {
            // The path is not a directory, or it could not be read
            return;
        }
        for (File f : files) {
            // If f is a directory, recurse into it;
            // if f is an ordinary file, add it to fileList
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else {
                fileList.add(f);
            }
        }
    }

    public static void main(String[] args) {
        // The main method runs the whole index-building process
        Parser parser = new Parser();
        parser.run();
    }
}

Let's try to run it. There are far too many files here, because every file is printed whatever its type. So the next step is to filter these files and keep only the useful ones.

else {
    // Keep only the files whose names end with .html
    if (f.getAbsolutePath().endsWith(".html")) {
        fileList.add(f);
    }
}

This code keeps only the files whose names end with .html. The following figure shows the results.
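
As an alternative to the hand-written recursion, the same listing can be sketched with Files.walk from java.nio.file. This is not the article's approach, only a swapped-in standard-library technique, and the class name HtmlFileLister is made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: lists .html files with Files.walk instead of manual recursion.
class HtmlFileLister {
    static List<Path> listHtmlFiles(Path root) throws IOException {
        // Files.walk visits the whole directory tree; the stream must be closed after use.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".html"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // The default path is the one used in the article; pass another one as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "D:/test/docs/api");
        System.out.println(listHtmlFiles(root).size());
    }
}
```

Unlike listFiles, which returns null when the path cannot be read, Files.walk reports such problems as an exception, so the caller decides how to handle them.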

4.1 Open the file and parse the content

Parsing is divided into three parts: the Title, the Url and the Content.

4.1.1 Parsing the Title

f.getName() returns the file name directly.

We use name.substring(0, f.getName().length() - 5). Why subtract 5 from the length of the file name? Because the suffix .html is exactly five characters long.

private String parseTitle(File f) {
    String name = f.getName();
    // Cut off the five-character ".html" suffix
    return name.substring(0, name.length() - 5);
}

4.1.2 Parsing the Url

The url here is the address of the online version of the document, the kind of address you see in the browser when you open a page. We take the absolute path of the local file, cut off the INPUT_PATH prefix to get the relative path, and then splice that onto the https address of the online documentation, so that the result opens the page directly.

private String parseUrl(File f) {
    // Base address of the online documentation
    String part1 = "https://docs.oracle.com/javase/8/docs/api/";
    // Path of the file relative to INPUT_PATH
    String part2 = f.getAbsolutePath().substring(INPUT_PATH.length());
    return part1 + part2;
}
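
One detail worth knowing: on Windows getAbsolutePath() uses backslashes, and the relative part starts with a separator, so the spliced address contains a doubled separator and backslashes. Browsers usually tolerate this. If a cleaner address is wanted, the splicing could be sketched as below; the class UrlSketch and the method toUrl are hypothetical names, and the sketch assumes the absolute path really starts with the input path.

```java
// Illustrative only: builds the online URL with normalized separators.
class UrlSketch {
    static String toUrl(String baseUrl, String inputPath, String absolutePath) {
        // Relative part of the path, for example "\java\util\ArrayList.html" on Windows.
        String relative = absolutePath.substring(inputPath.length());
        // Use '/' regardless of the platform's separator.
        relative = relative.replace('\\', '/');
        // Drop leading slashes so the trailing '/' of the base URL is not doubled.
        while (relative.startsWith("/")) {
            relative = relative.substring(1);
        }
        return baseUrl + relative;
    }

    public static void main(String[] args) {
        System.out.println(toUrl(
                "https://docs.oracle.com/javase/8/docs/api/",
                "D:/test/docs/api",
                "D:\\test\\docs\\api\\java\\util\\ArrayList.html"));
    }
}
```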

4.1.3 Parsing the Content

We use < and > as a switch while reading the data: text outside the tags is copied, and text inside a tag is skipped. Why is the value read as an int instead of a char? Because read() returns -1 at the end of the file, a value that no char can represent, so with an int we can tell when reading is finished. The code is as follows and is easy to understand.

private String parseContent(File f) throws IOException {
    // Read character by character, using < and > as the switch
    try (FileReader fileReader = new FileReader(f)) {
        // Switch: whether the current character should be copied
        boolean isCopy = true;
        // Holds the result
        StringBuilder content = new StringBuilder();
        while (true) {
            // read() returns an int, not a char:
            // at the end of the file it returns -1, which is the advantage of using int.
            // An IOException is passed on to the caller instead of being swallowed here;
            // otherwise a failed read would leave ret at 0 and the loop might never end.
            int ret = fileReader.read();
            if (ret == -1) {
                break;
            }
            char c = (char) ret;
            if (isCopy) {
                if (c == '<') {
                    // A tag starts: stop copying
                    isCopy = false;
                    continue;
                }
                // Replace line breaks with spaces; copy other characters directly
                if (c == '\n' || c == '\r') {
                    c = ' ';
                }
                content.append(c);
            } else {
                if (c == '>') {
                    // The tag ends: resume copying
                    isCopy = true;
                }
            }
        }
        return content.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    return "";
}
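
To try the switch logic without touching the disk, the same loop can be sketched against any Reader. The class ContentSketch and its method names are invented for this example. Reading the file as UTF-8 is also an assumption: FileReader in the code above uses the platform default charset instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: the same < > switch, applied to an arbitrary Reader.
class ContentSketch {
    static String stripTags(Reader reader) throws IOException {
        StringBuilder content = new StringBuilder();
        boolean isCopy = true;
        int ret;
        // read() returns -1 at the end of the input
        while ((ret = reader.read()) != -1) {
            char c = (char) ret;
            if (isCopy) {
                if (c == '<') {
                    isCopy = false;          // a tag starts: stop copying
                } else {
                    // line breaks become spaces, everything else is copied as is
                    content.append(c == '\n' || c == '\r' ? ' ' : c);
                }
            } else if (c == '>') {
                isCopy = true;               // the tag ends: resume copying
            }
        }
        return content.toString();
    }

    // Reads a file through a buffered reader with an explicit charset (UTF-8 is assumed here).
    static String parseContent(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return stripTags(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(stripTags(new StringReader("<p>Hello</p>\n<b>World</b>")));
    }
}
```

Note that this simple switch also keeps the text inside script and style elements and does not decode entities such as &amp;lt;, so it is only suitable for fairly plain pages.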

The complete code of this module is as follows:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

// Reads the documents
public class Parser {
    private static final String INPUT_PATH = "D:/test/docs/api";

    // The entry point of the whole Parser class
    public void run() {
        // 1. List all the documents (html) under the path
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        System.out.println(fileList);
        System.out.println(fileList.size());
        // 2. For each file listed above, open it, read its content, and parse it
        for (File f : fileList) {
            System.out.println("Start parsing " + f.getAbsolutePath());
            parseHTML(f);
        }
        // 3. Save the index data structure built in memory to the specified file
    }

    private String parseTitle(File f) {
        String name = f.getName();
        // Cut off the five-character ".html" suffix
        return name.substring(0, name.length() - 5);
    }

    private String parseUrl(File f) {
        // Base address of the online documentation
        String part1 = "https://docs.oracle.com/javase/8/docs/api/";
        // Path of the file relative to INPUT_PATH
        String part2 = f.getAbsolutePath().substring(INPUT_PATH.length());
        return part1 + part2;
    }

    private String parseContent(File f) throws IOException {
        // Read character by character, using < and > as the switch
        try (FileReader fileReader = new FileReader(f)) {
            // Switch: whether the current character should be copied
            boolean isCopy = true;
            // Holds the result
            StringBuilder content = new StringBuilder();
            while (true) {
                // read() returns an int, not a char:
                // at the end of the file it returns -1, which is the advantage of using int.
                // An IOException is passed on to the caller instead of being swallowed here;
                // otherwise a failed read would leave ret at 0 and the loop might never end.
                int ret = fileReader.read();
                if (ret == -1) {
                    break;
                }
                char c = (char) ret;
                if (isCopy) {
                    if (c == '<') {
                        // A tag starts: stop copying
                        isCopy = false;
                        continue;
                    }
                    // Replace line breaks with spaces; copy other characters directly
                    if (c == '\n' || c == '\r') {
                        c = ' ';
                    }
                    content.append(c);
                } else {
                    if (c == '>') {
                        // The tag ends: resume copying
                        isCopy = true;
                    }
                }
            }
            return content.toString();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return "";
    }

    private void parseHTML(File f) {
        // Parse out the title
        String title = parseTitle(f);
        // Parse the corresponding url
        String url = parseUrl(f);
        // Parse the corresponding text
        try {
            String content = parseContent(f);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // The first parameter is where the traversal starts; the second collects the result
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles returns only the entries directly inside this directory
        File[] files = rootPath.listFiles();
        if (files == null) {
            // The path is not a directory, or it could not be read
            return;
        }
        for (File f : files) {
            // If f is a directory, recurse into it;
            // if f is an ordinary .html file, add it to fileList
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else {
                if (f.getAbsolutePath().endsWith(".html")) {
                    fileList.add(f);
                }
            }
        }
    }

    public static void main(String[] args) {
        // The main method runs the whole index-building process
        Parser parser = new Parser();
        parser.run();
    }
}

Posted by djot on Tue, 10 May 2022 05:06:17 +0300