A detailed explanation of multilingual and Chinese word segmentation and retrieval in Elasticsearch

1. Natural language and query recall

When dealing with human natural language, we sometimes want a search to find relevant content even when the query does not exactly match the original text:

  • Quick brown fox and fast brown fox / jumping fox and jumped foxes

Some optimizations that can be applied:

  • Token normalization: remove diacritics so that, for example, rôle also matches role
  • Stemming: remove the differences between singular and plural forms and between tenses (see the sketch after this list)
  • Synonyms: include synonyms in the index or in the query
  • Misspellings: tolerate misspellings and homophones
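
A minimal sketch of how stemming improves recall: the built-in english analyzer can be inspected with the _analyze API, and it reduces jumping, jumped and foxes to their roots:

POST _analyze
{
  "analyzer": "english",
  "text": ["Jumping fox and jumped foxes"]
}
#Expected tokens: jump, fox, jump, fox - the differences in tense and number
#(and the stopword "and") are gone, so all variants match the same terms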

2. The challenge of mixed multilingual content

Some specific multilingual scenarios

  • Different indexes use different languages
  • Different fields in the same index use different languages
  • Different languages are mixed within a single field of one document

Some challenges of mixed-language content

  • Stemming: an Israeli document may contain Hebrew, Arabic, Russian and English at the same time
  • Incorrect document frequencies: in predominantly English articles, German terms score artificially high because they are rare
  • The language of the user's query needs to be detected, for example with a compact language detector,
    e.g. in order to query the index that matches that language (a mapping sketch follows this list)
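
One common remedy (a minimal sketch, assuming a hypothetical blogs index) is to index the same text into subfields with different built-in language analyzers, and let the detected query language decide which subfield to search:

PUT /blogs
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "de": { "type": "text", "analyzer": "german" }
        }
      }
    }
  }
}

#If the language detector classifies the query as German, search the German subfield
POST /blogs/_search
{
  "query": { "match": { "content.de": "Fuchs" } }
}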

3. The challenge of word segmentation

English word segmentation: should You're be split into one token or two? What about Half-baked?
Chinese word segmentation

  • Segmentation standards: in the Harbin Institute of Technology standard, surname and given name are separated, while HanLP keeps them together; different standards should be adopted for different scenarios
  • Ambiguity (combinational ambiguity, overlapping ambiguity, true ambiguity)
    For example: 中华人民共和国 (People's Republic of China) / 美国会通过对台售武法案 (readable both as "the US will pass the Taiwan arms sale bill" and as "the US Congress passes...") / 上海仁和服装厂 (Shanghai Renhe Garment Factory)
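
Elasticsearch's default standard analyzer has no notion of Chinese words at all; a quick _analyze check shows that it simply splits CJK text into single characters:

POST _analyze
{
  "analyzer": "standard",
  "text": ["中华人民共和国"]
}
#Returns one token per character: 中 / 华 / 人 / 民 / 共 / 和 / 国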

4. The evolution of Chinese word segmentation - dictionary method

Dictionary lookup - the most intuitive approach to word segmentation (proposed by Professor Liang Nanyuan of Beihang University)

  • Scan the sentence from left to right, emitting a word whenever a dictionary entry matches; when a compound word matches, take the longest match
  • Strings not found in the dictionary are split into single characters

The minimum-word-count theory of segmentation - Dr. Wang Xiaolong of Harbin Institute of Technology turned the dictionary-lookup method into a theory

  • A sentence should be segmented into the smallest possible number of words
  • It is helpless when a split is ambiguous (e.g. 发展中国家 "developing country" / 上海大学城书店 "Shanghai University Town bookstore")
  • Attempts to resolve such ambiguity with various grammar rules were not successful

5. The evolution of Chinese word segmentation - statistical machine learning methods

Statistical language models - proposed around 1990 by Dr. Guo Jin of the Department of Electronic Engineering, Tsinghua University

  • Solved the ambiguity problem and reduced the error rate of Chinese word segmentation by an order of magnitude. Segmentation becomes a probability problem: dynamic programming with the Viterbi algorithm quickly finds the most probable segmentation

Statistics-based machine learning algorithms

  • Commonly used algorithms include HMM, CRF, SVM and deep learning; the HanLP segmentation tool, for example, is based on the CRF algorithm. The basic idea is to label each Chinese character: the model considers context as well as word frequency and has good learning ability, so it performs well on ambiguous words and out-of-vocabulary words. With the rise of deep learning, neural-network segmenters have also appeared; some implement segmentation with a bidirectional LSTM + CRF, which is in essence sequence labeling. The character accuracy of such segmenters is reported to reach 97.5%

6. The current state of Chinese word segmenters

Chinese word segmentation is based on statistical language models; after decades of development, it can be regarded as a solved problem today
The main quality differences between segmenters lie in the data they use and in the precision of their engineering
Common segmenters combine a machine learning algorithm with a dictionary, which improves segmentation accuracy on the one hand and domain adaptability on the other

7. Some Chinese word segmenters

  • HanLP - a natural language processing toolkit for production environments
  • IK analyzer

7.1 HanLP

./elasticsearch-plugin install https://github.com/KennFalcon/elasticsearc...

7.2 IK Analysis

./elasticsearch-plugin install https://github.com/medcl/elasticsearch-ana...

7.3 Pinyin

./elasticsearch-plugin install https://github.com/medcl/elasticsearch-ana...

7.4 Chinese word segmentation demo

  • Test the effect of different segmenters
  • At index time segment into short, fine-grained tokens; at query time prefer long words (see the mapping sketch after the HanLP example below)
  • Pinyin segmentation
#Install the IK plug-in
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
#Install the HanLP plug-in
bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip
#Alternatively, install manually into a Docker container (repeat for each node, e.g. es7_01 and es7_02)
#Create the plug-in directory inside the container
docker exec -it es7_01 mkdir /usr/share/elasticsearch/plugins/hanlp
#Copy the plug-in zip from the host into the container
docker cp elasticsearch-analysis-hanlp-7.1.0.zip es7_01:/usr/share/elasticsearch/plugins/hanlp
#Unzip the plug-in and remove the archive inside the container
docker exec -it es7_01 bash
cd /usr/share/elasticsearch/plugins/hanlp
unzip elasticsearch-analysis-hanlp-7.1.0.zip
rm -f elasticsearch-analysis-hanlp-7.1.0.zip
#ik_max_word: finest-grained IK segmentation
#ik_smart: coarsest-grained IK segmentation
#hanlp: HanLP default segmentation
#hanlp_standard: standard segmentation
#hanlp_index: index-oriented segmentation
#hanlp_nlp: NLP segmentation
#hanlp_n_short: N-shortest-path segmentation
#hanlp_dijkstra: shortest-path (Dijkstra) segmentation
#hanlp_crf: CRF segmentation (deprecated since HanLP 1.6.6)
#hanlp_speed: extreme-speed dictionary segmentation

#Analyze a Chinese sentence ("Several Cambridge Analytica executives told undercover reporters that they had ensured Donald Trump's victory in the presidential election")
POST _analyze
{
  "analyzer": "hanlp_standard",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}
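
A minimal mapping sketch of the "index short, query long" principle (assuming a hypothetical ik_demo index): the fine-grained ik_max_word is used at index time and the coarse-grained ik_smart at search time, so documents carry many short tokens while queries match longer words:

PUT /ik_demo
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}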

#Pinyin
PUT /artists/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}


#Analyze the names of the four heavenly kings (Andy Lau, Jacky Cheung, Aaron Kwok, Leon Lai)
GET /artists/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}
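
With keep_first_letter enabled and keep_full_pinyin disabled, each whitespace-separated name is reduced to its pinyin initials - for example 刘德华 becomes ldh - so an artist can be found by typing a first-letter abbreviation.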
