Using Elasticsearch 7.9.1 for full-text content retrieval of Word, PDF and TXT files

A brief introduction to the requirements

  1. File upload and download must be supported
  2. Documents must be searchable by keyword, and the search must cover the text inside the documents. Word, PDF and TXT files should be supported.

File upload and download are relatively simple. The hard part is retrieving the text inside the files as accurately as possible, which means a lot of things have to be taken into account. For this project I decided to implement it with Elasticsearch.

While preparing to look for a job I noticed that many interviewers ask about Elasticsearch, and at the time I didn't even know what it was, so I decided to try something new. I have to say that Elasticsearch releases move really fast: I was using 7.9.1 a few days ago, and 7.9.2 came out on the 25th.

Introduction to Elasticsearch

Elasticsearch is an open source full-text search engine. Roughly speaking, you send it a keyword via a REST request and it returns the matching content. It is that simple.

Elasticsearch is built on Lucene, the Apache Software Foundation's open source full-text search library. Calling Lucene directly is complex, so Elasticsearch wraps it in another layer and adds higher-level features such as distributed storage.

There are many plugins built around Elasticsearch. I mainly use two of them this time: kibana and elasticsearch-head.

  • kibana is mainly used to build requests; it provides a lot of auto-completion.
  • elasticsearch-head is mainly used to visualize Elasticsearch.

Development environment

First, install Elasticsearch, elasticsearch-head and kibana. All three work out of the box; just double-click to run them. Note that the kibana version must match the Elasticsearch version.

elasticsearch-head is a visual interface for Elasticsearch. Elasticsearch is operated entirely through a REST-style API, and with a visual interface you don't have to hand-craft a GET request for every query, which improves development efficiency.

elasticsearch-head is built on node.js. You may run into cross-origin (CORS) problems during installation: Elasticsearch listens on port 9200 by default, while elasticsearch-head runs on port 9100, so you need to adjust the Elasticsearch configuration file. I won't go into the details; a search engine will find them quickly.
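
For reference, the CORS-related settings commonly added to elasticsearch.yml look roughly like this (check the documentation for your version):

http.cors.enabled: true
http.cors.allow-origin: "*"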

After Elasticsearch is installed, open its port in a browser (http://localhost:9200 by default) and the interface shown below appears.

Core issues

There are two core problems to solve: file upload and keyword query.

File upload

For plain-text txt files, things are relatively simple: the content can be passed in directly. But pdf and word are special formats that contain a lot of irrelevant information, such as images and the internal markup of a pdf, so these files need to be preprocessed.

Since Elasticsearch 5.x there has been a feature called the ingest node, which can preprocess documents before they are indexed. As shown in the figure, when a PUT (indexing) request arrives, Elasticsearch first checks whether a pipeline is specified; if so, the document first passes through the ingest node for preprocessing and only then is indexed normally.

The Ingest Attachment Processor Plugin is a text extraction plugin. It essentially uses Elasticsearch's ingest node feature and provides the key preprocessor, attachment. Run the following command in the installation directory to install it.

./bin/elasticsearch-plugin install ingest-attachment

Define a text extraction pipeline

PUT /_ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}

The attachment processor is configured to read from the content field, so when writing to Elasticsearch the document body must be placed in the content field; the remove processor then drops the raw base64 content once the text has been extracted.

The operation results are shown in the figure below:

Establish document structure mapping

We need to define a document structure mapping to describe how text files are stored after they have gone through the preprocessor. When the mapping is PUT, the index is created automatically, so we first create an index named docwrite for testing.

PUT /docwrite
{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "type":{
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

The attachment field is added by Elasticsearch automatically: it is attached by the pipeline named attachment after it extracts the text from the uploaded document. It is an object field with multiple sub-fields, including the extracted text (attachment.content) and some document metadata.

Similarly, the ik_max_word analyzer is specified for the file name field, so that Elasticsearch performs Chinese word segmentation when building the full-text index for it.

Test

With the two steps above in place, let's run a simple test. Because Elasticsearch is a document database based on JSON, attachment content must be base64-encoded before it is inserted into Elasticsearch. First convert a pdf file into base64 text, for example with an online tool such as PDF to Base64.

The test document is shown in the figure below:

Then I uploaded a fairly large pdf file with the following request. The important part is to specify the pipeline we just created; the result is shown in the figure.
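
For reference, the indexing request looks roughly like the following; the id, file name and base64 payload are placeholders, and the fields match the docwrite mapping defined above:

PUT /docwrite/_doc/1?pipeline=attachment
{
  "id": "1",
  "name": "test.pdf",
  "type": "pdf",
  "content": "<base64-encoded file content>"
}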

Older versions of Elasticsearch had document types inside an index; types are deprecated and will be removed in later versions, and the default type is now _doc.

Then we use a GET request to check whether the document was uploaded successfully. You can see that it has been parsed successfully.
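
As a quick check, fetching the document by id (the id used in the example request above) returns the stored document, including the attachment field produced by the pipeline:

GET /docwrite/_doc/1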

If no pipeline is specified, the content will not be parsed.

From the results we can see that our pdf file has passed through the pipeline we defined and then entered the docwrite index.

Keyword query

Keyword query means the input text should be segmented into words to some extent. For example, the string "database computer network my computer" should be split into the three keywords "database", "computer network" and "my computer", and each keyword is then queried separately.

Elasticsearch ships with its own default analyzer, which supports all Unicode characters, but it splits Chinese text into individual characters. For example, the four characters of "进口红酒" (imported red wine) are split into "进" (import), "口" (mouth), "红" (red) and "酒" (wine), so the query results will also include things like "进口" (import), "口红" (lipstick) and "红酒" (red wine).

That is not what we want. We want it split only into "进口" (imported) and "红酒" (red wine), and then the corresponding results queried. This requires Chinese word segmentation support.

The IK analyzer

IK is a popular open source Chinese word segmentation plugin. We install it first; note that the command below cannot be used as-is, you need to pick the release that matches your Elasticsearch version from the project's releases page.

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/...

The IK analyzer provides two modes; a quick comparison follows the list below.

  1. ik_max_word splits Chinese text into as many words as possible.
  2. ik_smart splits according to common usage; for example, "进口红酒" (imported red wine) is split into "进口" (imported) and "红酒" (red wine).
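
A quick way to compare the two modes, assuming the IK plugin is installed, is the _analyze API:

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}

Running the same request with "ik_max_word" instead of "ik_smart" shows the finer-grained segmentation.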

When querying, we tell the query to use the IK analyzer on the search text. For example, searching the inserted test document in ik_smart mode gives the results shown in the figure.

GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "Experiment 1",
        "analyzer": "ik_smart"
      }
    }
  }
}

We can also enable highlighting in Elasticsearch to mark up the matched text, so that the matches in the returned text are wrapped in tags, as shown in the figure.
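
In the query DSL, highlighting can be added to the search above roughly like this (same field names as in the mapping):

GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "Experiment 1",
        "analyzer": "ik_smart"
      }
    }
  },
  "highlight": {
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "attachment.content": {}
    }
  }
}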

Code

The code is developed in IDEA with Maven. First import the dependency; its version must match the Elasticsearch version.

Import dependency

Elasticsearch provides two Java APIs; we use the well-encapsulated high-level REST client.

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.9.1</version>
</dependency>
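
The snippets below assume a RestHighLevelClient instance named client has already been created. A minimal sketch, assuming Elasticsearch runs locally on the default port:

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

// connect to the local Elasticsearch instance
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));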

File upload

First, create a FileObj class corresponding to the structure above:

public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // file type: pdf, word or txt
    String content; // full file content, base64 encoded
    // getters and setters omitted (they are used below)
}

As described above, the file first has to be read as a byte array and then converted to base64 encoding.

public FileObj readFile(String path) throws IOException {
    // read the file
    File file = new File(path);

    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));

    // read the whole file into a byte array (java.nio.file.Files)
    byte[] bytes = Files.readAllBytes(file.toPath());

    // convert the file content to base64
    String base64 = Base64.getEncoder().encodeToString(bytes);
    fileObj.setContent(base64);

    return fileObj;
}

java.util.Base64 already provides the ready-made Base64.getEncoder().encodeToString() method for this.

Next, you can upload the file using Elasticsearch's API.

Uploading uses an IndexRequest object. fastjson converts fileObj to JSON before uploading, and IndexRequest.setPipeline() specifies the pipeline defined above, so the file is preprocessed by the pipeline and then enters the fileindex index.

public void upload(FileObj file) throws IOException {
    IndexRequest indexRequest = new IndexRequest("fileindex");

    // index the document and let the attachment pipeline extract the text
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment");

    IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(indexResponse);
}

File query

The file query uses a SearchRequest object. First, specify that our keyword should be segmented by the IK analyzer in ik_smart mode:

SearchRequest searchRequest = new SearchRequest("fileindex");
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);

Then, through the returned response object, we can iterate over each hit and read the returned content.

SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
SearchHits hits = searchResponse.getHits();

Iterator<SearchHit> iterator = hits.iterator();
while (iterator.hasNext()) {
    SearchHit hit = iterator.next();
    System.out.println(hit.getSourceAsString());
}

A very useful feature of Elasticsearch is highlighting, so we can set up a highlighter to mark the matched text in the query.

HighlightBuilder highlightBuilder = new HighlightBuilder();
HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("attachment.content");
highlightBuilder.field(highlightContent);
highlightBuilder.preTags("<em>");
highlightBuilder.postTags("</em>");
srb.highlighter(highlightBuilder);

Here the matched text is wrapped in <em></em> tags, so the returned results contain the correspondingly marked-up fragments.
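
A minimal sketch of reading those highlighted fragments inside the loop over hits (HighlightField is org.elasticsearch.search.fetch.subphase.highlight.HighlightField, Text is org.elasticsearch.common.text.Text):

// read the highlighted fragments for the content field, if any
HighlightField highlight = hit.getHighlightFields().get("attachment.content");
if (highlight != null) {
    for (Text fragment : highlight.fragments()) {
        System.out.println(fragment.string());
    }
}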

Multi-file test

A simple demo is now written, but the effect still needs to be tested with multiple files. This is one of my test folders; it contains files of various types.

After uploading all the files in this folder, I use the elasticsearch-head visual interface to check the imported files.

Search code:

    /**
     * Queries the index with the entered keywords and returns the corresponding results
     * @throws IOException
     */
    @Test
    public void fileSearchTest() throws IOException {
        ElasticOperation elo = eloFactory.generate();

        elo.search("Database State Council computer network");
    }

Run our demo, and the query results are shown in the figure.

The demo code is all in: github link

There are still some problems

1. File length

Testing shows that for files with more than 100,000 characters of text, Elasticsearch keeps only the first 100,000 characters and truncates the rest. This corresponds to the attachment processor's indexed_chars setting, which defaults to 100,000; for longer texts the limit needs to be raised or removed in the pipeline definition.
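
For example, redefining the pipeline with indexed_chars set to -1 (no limit) would look roughly like this; whether that is acceptable depends on how large the documents can get:

PUT /_ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "indexed_chars": -1,
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}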

2. Some problems in the code

In my code, every file is read entirely into memory before it is processed, which will no doubt cause problems: a file larger than memory, or several large files at once, would be trouble, and in a real production environment file upload would take up a considerable share of the server's memory and bandwidth. This should be optimized further according to the actual requirements.

