Core technology and algorithm of Python natural language processing -- Chinese syntactic analysis based on PCFG

This chapter of the book is a little thin, but the author explains that the book is an introductory, practice-oriented NLP text and that syntactic analysis is a more advanced NLP topic, so it is not covered in depth. I am also an NLP beginner; after finishing this book I plan to move on to statistical natural language processing.
Because the chapter has little hands-on content and no particularly obscure code, this article is mostly about Windows configuration and similar issues.

1, JDK installation and configuration

Stanford Parser is a Java implementation of statistical (probabilistic) parsing, so a JDK has to be downloaded and installed first.

I downloaded JDK 1.8, because that was the version I used when I last worked with Java two or three years ago.

A senior colleague with many years of experience told me the JDK is best installed in the default directory, the Program Files folder on drive C. In practice it makes no difference as long as the paths are configured correctly, so I installed it on drive D. The installer contains both the JDK and the JRE, and I put them in two separate folders. The installation itself is a simple click-through wizard. My JDK ended up in D:\java\JDK.

After installation, configure the environment variables: right-click Computer > Properties > Advanced System Settings > Environment Variables. The dialog distinguishes between user variables and system variables:

  • User variable: a variable that is visible only to one user, namely the account you logged in with. For example, if your brother creates his own account on this computer because he does not want to share some of his files with you, he cannot use the user variables you configured.
  • System variable: a variable that every user on this computer can use. This is the case where you and your brother share everything.

Either place works. If several users on the computer need the JDK, configure it in the system variables; otherwise the user variables are fine too.

The easiest way to configure the JDK is to add its bin directory (<install root>\JDK\bin) to the Path environment variable. With my installation path, the entry to add is:

D:\java\JDK\bin

However, this has a drawback: when Java is combined with other tools, JAVA_HOME is required in many cases (our program is one of them), so the standard approach is to configure JAVA_HOME, CLASSPATH and Path. Here is how to configure them.

  1. Create a new variable named JAVA_HOME whose value is the JDK installation directory. For example, mine is:
D:\java\JDK

  2. Create a new variable with the name CLASSPATH and the value:
.;%JAVA_HOME%\lib\tools.jar;%JAVA_HOME%\lib\dt.jar;

This can be copied as is, because once JAVA_HOME is configured, that variable already holds your JDK path.

  3. Double-click Path and add two lines at the end of its value:
%JAVA_HOME%\bin
%PATH%


Note: JAVA_HOME, CLASSPATH and the Path entries must all be configured in the same group, either all as user variables or all as system variables. I configured all of them as user variables. If they are split across the two groups, the JDK may not be recognized.
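The configuration described above can also be sanity-checked from Python. The following is only a minimal sketch: the helper name check_java_env is made up for this article, and it only looks at JAVA_HOME and PATH as the running process sees them (Windows has already expanded %JAVA_HOME% by then). It does not check CLASSPATH.

```python
import os


def check_java_env(env):
    """Return a list of problems found in a JDK environment-variable setup.

    `env` is a mapping such as os.environ. An empty list means the
    JAVA_HOME / PATH pair looks consistent. (Hypothetical helper, not
    part of the book's code.)
    """
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems

    bin_dir = os.path.join(java_home, "bin")
    if not os.path.isdir(bin_dir):
        problems.append("no bin directory under JAVA_HOME: " + bin_dir)

    # PATH entries are separated by os.pathsep (';' on Windows, ':' elsewhere).
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    normalized = {os.path.normcase(os.path.normpath(p)) for p in entries}
    if os.path.normcase(os.path.normpath(bin_dir)) not in normalized:
        problems.append("JAVA_HOME bin directory is not on PATH: " + bin_dir)
    return problems


# Check the environment of the current process and report the result.
for message in check_java_env(os.environ) or ["JDK environment looks consistent"]:
    print(message)
```

Passing the environment in as a mapping keeps the check easy to try with a hand-built dictionary before touching the real settings.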

After configuring, click OK in every dialog to save, then open cmd and enter these two commands one after the other:

java -version
javac

If java -version prints the Java version information and javac prints its usage text, the configuration succeeded; otherwise, go back and reconfigure.

Note: if something went wrong, close cmd and open a new one after reconfiguring. A cmd window that is already open keeps the old environment, so running the command again only shows the previous result.
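The same check can be run from Python, which is also how the parser code later reaches Java. This is a small sketch using only the standard library; the function name java_version_banner is an assumption made for illustration.

```python
import shutil
import subprocess


def java_version_banner(executable="java"):
    """Return the text printed by `java -version`, or None if java is not found."""
    path = shutil.which(executable)  # searches the PATH of the current process
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # `java -version` writes its banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()


banner = java_version_banner()
print(banner if banner is not None else "java was not found on PATH")
```

The script inherits the environment of the console or IDE that launched it, so the restart advice above applies here too: start it from a freshly opened window after changing the variables.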

2, PCFG file download

The download page is https://nlp.stanford.edu/software/lex-parser.shtml#Download . Two files need to be downloaded in total:

  1. The upper download link gives stanford-parser-4.0.0.zip, which contains the Stanford Parser jar and the trained Stanford Parser models jar required in this project.
  2. The lower download link gives the PCFG model.

I unzipped both files into the project directory, using two folders:

The PCFG model from the lower link is stored in the pcfg folder, and the Stanford Parser jar together with the trained models jar is stored in the stanford_parser folder.

Note: keep track of these file paths, because the paths in the code below must be changed to match them.
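Before running the parser it can help to confirm that the files are where the code expects them. The sketch below uses the three paths from my project layout; the helper name missing_files is made up for this article, and the paths should be replaced with your own.

```python
import os


def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


root = "./"  # project root, as in the code of this article
required = [
    root + "stanford_parser/stanford-parser.jar",
    root + "stanford_parser/stanford-parser-4.0.0-models.jar",
    root + "pcfg/edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
]

# Print the absolute location of every file that could not be found.
for path in missing_files(required):
    print("missing:", os.path.abspath(path))
```

Printing the absolute path makes it easier to spot a wrong working directory, which is a common cause of "file not found" errors with relative paths.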

3, Code

import jieba
from nltk.parse import stanford
import os

string = '他骑自行车去了菜市场。'  # "He went to the vegetable market by bike."
seg_list = jieba.cut(string, cut_all=False, HMM=True)
seg_str = " ".join(seg_list)

root = "./"  # root directory
parser_path = root + 'stanford_parser/stanford-parser.jar'
model_path = root + 'stanford_parser/stanford-parser-4.0.0-models.jar'

# Specify JDK path
if not os.environ.get("JAVA_HOME"):
    JAVA_HOME = 'D:/java/JDK'
    os.environ['JAVA_HOME'] = JAVA_HOME

# PCFG model path
pcfg_path = root + "pcfg/edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz"

parser = stanford.StanfordParser(
    path_to_jar=parser_path,  # Specify the jar package of stanford parser
    path_to_models_jar=model_path,  # Specify the trained model jar package
    model_path=pcfg_path  # Specify the java class path of the parsing algorithm to be called
)

sentence = parser.raw_parse(seg_str)
print("sentence:", sentence)
for line in sentence:
    print(line)
    print(line.leaves())
    line.draw()
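As a side note on the " ".join(seg_list) step in the listing above: the parser expects one string in which the segmented words are separated by single spaces. The sketch below shows that format without needing jieba. The token list is segmented by hand purely for illustration (jieba's actual output may differ), and to_parser_input is a hypothetical helper name.

```python
def to_parser_input(tokens):
    """Join already-segmented tokens into the space-separated form the parser reads."""
    # Drop empty or whitespace-only tokens so no double spaces appear.
    return " ".join(t.strip() for t in tokens if t.strip())


# Hand-segmented tokens, an assumption for illustration only.
tokens = ["他", "骑", "自行车", "去", "了", "菜市场", "。"]
print(to_parser_input(tokens))  # prints: 他 骑 自行车 去 了 菜市场 。
```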

The points to watch in the code are as follows:

  1. In the line that builds seg_str, there is a space inside the quotation marks of " ".join(seg_list), because Stanford Parser expects a sentence that has already been segmented, with the words separated by spaces.
  2. The import statement printed in the book contains a typo. My copy is the April 2018 edition (March 2020 printing); later printings may no longer have this error. The book reads:
from nltk.parser import staford

It should be:

from nltk.parse import stanford
  3. Path and file name problems. The paths in the book's example are:
parser_path = "./stanford-parser.jar"
model_path = './stanford-parser-3.8.0-models.jar'
pcfg_path = "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz"

However, because I created two separate folders and installed a different version, the paths in my project are:

parser_path = './stanford_parser/stanford-parser.jar'
model_path = './stanford_parser/stanford-parser-4.0.0-models.jar'
pcfg_path = "./pcfg/edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz"

Adjust the paths to match your own layout. You can use the following code to test whether a file exists:

print(os.path.exists(path))
  4. Finally, the JAVA_HOME problem. NLTK needs to find Java, so the code sets JAVA_HOME itself when the variable is not already present in the environment.
  5. When running the above code, you should see a warning:
DeprecationWarning: The StanfordParser will be deprecated
Please use nltk.parse.corenlp.CoreNLPParser instead.
  model_path=pcfg_path  # Specify the java class path of the parsing algorithm to be called

This means the StanfordParser interface is going to be removed, so it is better to switch to the recommended nltk.parse.corenlp.CoreNLPParser and consult its documentation to get rid of the warning.
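I have not run the replacement myself, so the following is only a rough sketch of what the switch might look like, under several assumptions: a Stanford CoreNLP distribution with the Chinese models jar is downloaded separately, a CoreNLP server is started from inside that folder, and port 9001 is free. The two function names are made up for this article. Which grammar the server uses depends on its properties file, so the trees may differ from the chinesePCFG output, and the server may re-segment the text it receives.

```python
def corenlp_server_command(port=9001, properties="StanfordCoreNLP-chinese.properties"):
    """Build the argument list that starts a CoreNLP server.

    Assumption: the command is run from inside the unpacked CoreNLP folder,
    for example with subprocess.Popen(command, cwd=corenlp_dir).
    """
    return [
        "java", "-Xmx4g", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-serverProperties", properties,
        "-port", str(port),
        "-timeout", "15000",
    ]


def parse_with_corenlp(tokens, url="http://localhost:9001"):
    """Parse a list of segmented words through an already running CoreNLP server."""
    # Imported here so the helper above can be used without NLTK installed.
    from nltk.parse.corenlp import CoreNLPParser

    parser = CoreNLPParser(url=url)
    # parse() takes a list of tokens and yields nltk Tree objects.
    return list(parser.parse(tokens))


# Show the server command; parse_with_corenlp() only works once it is running.
print(corenlp_server_command())
```

With the server up, parse_with_corenlp(seg_str.split()) would play the role of parser.raw_parse(seg_str) in the code above.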

Since this is a narrow application problem and I have not studied the library in depth, I only scratch the surface here. When I study statistical natural language processing later, I will come back to syntactic analysis.

The relevant output is as follows:
For each parse tree, the output of print(line) comes first, followed by the output of print(line.leaves()).

4, Summary

This article walks through the downloads, installation, configuration and code for PCFG-based Chinese parsing from Chapter 6 of Core technology and algorithm of Python natural language processing.

5, Reference

[1] Tu Ming, Liu Xiang, Liu Shuchun. Core technology and algorithm of Python natural language processing [M]. Beijing: China Machine Press, 2018.4: 116.

Tags: Python NLP

Posted by forcerecon on Tue, 24 May 2022 12:42:37 +0300