Analysis of named entity recognition task code based on LSTM/BLSTM/CNNBLSTM -- 1

The original code comes from GitHub: https://github.com/OustandingMan/LSTM-CRF

However, the corpus-reading part is not boilerplate; I wrote the data-loading code myself.

 

My understanding of deep learning:

  1. Data processing: convert the raw data into a format the network can read
  2. Network construction: essentially a series of "function" calls
  3. Training: feed the data to the neural network; this may take some time
  4. Testing: evaluate the trained model with chosen metrics to quantify how good the model actually is
  5. Parameter tuning: experiment with the network's hyperparameters (in practice, try different settings and see which ones improve the test results)

(I have only just started working with neural networks, so these points are just my own opinions. If you see things differently, feel free to discuss.)

 

The following is a simple explanation of the code (my understanding + Baidu)

 

The project mainly includes:

main.py: the main program

model.py: neural network model definition

pretreatment.py: data preprocessing

There are other .py files in the repo; if you are interested, you can download them yourself.

 

1. main.py

(1) Import of related packages

#coding:utf-8
import random
import numpy as np

import codecs as cs
import tensorflow as tf
from pretreatment import process_data, process_test  ## data processing
import model as tagger  ## alias the model module, i.e. give it a new name

import sys
reload(sys)
sys.setdefaultencoding('utf8')  ## Python 2.7 -- everyone knows the encoding headache
 

(2) Default value setting of each parameter

Before the main function executes, the flags are parsed first. By defining flags you can either pass parameters on the command line when running the program or fall back to the defaults; in other words, every parameter is initialized to a default value before the program runs.

At https://blog.csdn.net/weiqi_fan/article/details/72722510 the interpretation is:

Before executing the main function, TensorFlow first parses the flags: the parameters needed by tf.app.run() are set via flags, so we can either initialize the flags directly before the program runs or supply command-line arguments at run time to pass parameters in.

#FLAGS is an object that stores the parsed command-line parameters
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('train_data',DATA_DIR+r'train_data.txt','train data file')
tf.app.flags.DEFINE_string('test_data',DATA_DIR+r'test_data.txt','test data file')
tf.app.flags.DEFINE_string('valid_data',DATA_DIR+r'dev_data.txt','validation data file')
tf.app.flags.DEFINE_string('log_dir',LOG_DIR,'the log dir')
tf.app.flags.DEFINE_string('model_dir',MODEL_DIR,'models dir')
tf.app.flags.DEFINE_string('model','LSTM','model type:LSTM/BLSTM/CNNBLSTM')
tf.app.flags.DEFINE_string('restore_model','None','path of the model to restored')
tf.app.flags.DEFINE_string('emb_file',EMB_DIR+'/data_vec.txt','embeddings file')
tf.app.flags.DEFINE_integer('emb_dim',100,'embedding size')
tf.app.flags.DEFINE_string('output_dir',OUTPUT_DIR,'output dir')
tf.app.flags.DEFINE_float('lr',0.002,'learning rate')
tf.app.flags.DEFINE_float('dropout',0.,'dropout rate of input layer')
tf.app.flags.DEFINE_boolean('fine_tuning',True,'whether fine-tuning the embeddings')
tf.app.flags.DEFINE_boolean('eval_test',True,'whether evaluate the test data')
tf.app.flags.DEFINE_integer("max_len", MAX_LEN,'max num of tokens per query')
tf.app.flags.DEFINE_integer("nb_classes", 7, 'Tagset size')
tf.app.flags.DEFINE_integer("hidden_dim", 80, 'hidden unit number')
tf.app.flags.DEFINE_integer("batch_size", 200, 'num example per mini batch')
tf.app.flags.DEFINE_integer("train_steps", 50, 'trainning steps')
tf.app.flags.DEFINE_integer("display_step", 1, 'number of test display step')
tf.app.flags.DEFINE_float("l2_reg", 0.0001, 'L2 regularization weight')
tf.app.flags.DEFINE_boolean('log', True, 'Whether to record the TensorBoard log.')

Example:

tf.app.flags.DEFINE_string('model','LSTM','model type:LSTM/BLSTM/CNNBLSTM')

Parameter name: model

Parameter default value: LSTM

Parameter type: string (from DEFINE_string)

Parameter comment: model type:LSTM/BLSTM/CNNBLSTM

The three arguments in the parentheses correspond to the parameter name, its default value, and its comment, respectively.

main.py can be run in either of two ways:

python main.py --model BLSTM

python main.py

In the first case the model parameter takes the value BLSTM rather than LSTM; it is overridden via the command-line argument.

In the second case the model parameter keeps its default value LSTM (when no value is passed, the default is used).

 

tf.app.flags.DEFINE_xxx() adds an optional argument to the command line, and tf.app.flags.FLAGS retrieves the value of the corresponding command-line parameter.
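For readers without TensorFlow 1.x installed, the same name/default/help pattern can be reproduced with the standard library's argparse. This is only an illustrative stand-in, not the code the post uses:

```python
# A minimal stand-in for the tf.app.flags pattern, using only the standard
# library's argparse (an assumption for illustration -- the post itself
# uses TensorFlow 1.x flags).
import argparse

parser = argparse.ArgumentParser()
# name, default value, and help text -- mirroring DEFINE_string('model', 'LSTM', ...)
parser.add_argument('--model', type=str, default='LSTM',
                    help='model type:LSTM/BLSTM/CNNBLSTM')
parser.add_argument('--lr', type=float, default=0.002, help='learning rate')

# Passing an argument overrides the default, like `python main.py --model BLSTM`
flags = parser.parse_args(['--model', 'BLSTM'])
print(flags.model)  # BLSTM
print(flags.lr)     # 0.002 (default kept because --lr was not given)
```

Running with an empty argument list instead would leave `flags.model` at its default value, LSTM.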

 

(3) Program entry point

if __name__ == '__main__':
    tf.app.run()

The above code appears in many TensorFlow programs. Its source is as follows:

def run(main=None):
    f = flags.FLAGS
    f._parse_flags()
    main = main or sys.modules['__main__'].main
    sys.exit(main(sys.argv))

Effect:

First the flags are parsed, then the main function is executed.

Various explanations found online say (links not pasted one by one):

1) In a TensorFlow program, tf.app.run() is called under the main guard to start the program.

2) As the source code shows, the flag parameters are loaded first and then the main function is executed, where the parameters are those defined via tf.app.flags.

3) It starts TensorFlow from the main program.

 

(4) The main function

def main(_):
    np.random.seed(1337)
    random.seed(1337)  ## fix the random seeds so every run starts from the same initial values

    ## read the data
    train, valid, test, dict1, max_len, label = process_data(
        FLAGS.train_data, FLAGS.valid_data, FLAGS.test_data)

    train_x, train_y, train_len = train
    valid_x, valid_y, valid_len = valid
    test_x, test_y, test_len = test

    FLAGS.max_len = max_len

    idx2label = {}
    for i in range(len(label)):
        idx2label[i] = label[i]

    nb_classes = len(label)  ## total number of labels
    FLAGS.nb_classes = nb_classes
    print FLAGS.nb_classes

    nb_words = len(dict1)  ## dictionary size
    FLAGS.nb_words = nb_words
    FLAGS.in_dim = FLAGS.nb_words + 1

    ## read the word vectors; each vector has dimension 100
    emb_mat, idx_map = read_emb_from_file(FLAGS.emb_file, dict1)
    FLAGS.emb_dim = max(emb_mat.shape[1], FLAGS.emb_dim)

    if FLAGS.model == 'LSTM':
        MODEL_type = tagger.LSTM_NER
    elif FLAGS.model == 'BLSTM':
        MODEL_type = tagger.Bi_LSTM_NER
    elif FLAGS.model == 'CNNBLSTM':
        MODEL_type = tagger.CNN_Bi_LSTM_NER

    num_feature = 5  ## number of feature columns
    model = MODEL_type(nb_words, FLAGS.emb_dim, emb_mat, FLAGS.hidden_dim,
                       FLAGS.nb_classes, FLAGS.dropout, FLAGS.batch_size,
                       FLAGS.max_len, num_feature, FLAGS.l2_reg, FLAGS.fine_tuning)

    pred_test, test_loss, test_acc = model.run(
        train_x, train_y, train_len,
        valid_x, valid_y, valid_len,
        test_x, test_y, test_len, FLAGS
    )

    print "test loss:%f,accuracy:%f" % (test_loss, test_acc)
    pred_test = [pred_test[i][:test_len[i]] for i in xrange(len(pred_test))]
    pred_test_label = convert_id_to_word(pred_test, idx2label)

Question:

def main(_):

Why is there an underscore "_" in the parentheses? If you know, please leave a message and let me know.
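My own take, based on the run() source quoted earlier: tf.app.run() always calls main(sys.argv), so main must accept one positional argument, and `_` is the conventional Python name for an argument you do not intend to use. A minimal sketch of the mechanism:

```python
import sys

# Simplified sketch of what tf.app.run() does: it always invokes main with
# sys.argv, so the main function must accept one positional argument.
def run(main):
    return main(sys.argv)

# "_" is just the conventional Python name for an argument we ignore.
def main(_):
    return 'finished'

print(run(main))  # finished
```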

 

batch_size: the number of sentences fed in at a time. It is set to 200 in the program, i.e., 200 sentences per mini-batch.
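A minimal sketch of what mini-batching means in practice (illustration only, not the repo's code): split the N training examples into chunks of batch_size and feed one chunk per training step.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size):
    # Yield successive (x, y) chunks of at most batch_size examples each.
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

# Toy data: 10 examples split into batches of 4 (the post uses batch_size=200)
x = np.arange(10)
y = np.arange(10) * 2
sizes = [len(bx) for bx, by in iterate_minibatches(x, y, 4)]
print(sizes)  # [4, 4, 2] -- the last batch is smaller
```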

 

The word vector of a single word (obtained from the word2vec training chosen for this task):

emb_mat[0]:
 [-0.6621169  -0.8313406   0.54108036  0.20365714 -0.10455681  0.3607058
  0.4160717  -0.41833383 -0.39466462  0.07336954 -0.49563563 -0.08958961
  0.49159595 -0.16177754 -0.06243266 -0.14767128  0.1618933   0.19220804
 -0.10710583  0.29073772  0.7336261  -0.7396908  -0.6750916   0.02076059
  0.13271192 -0.28970304  0.12586454  0.35763028  0.22733922 -0.09528491
 -0.08213616 -0.10439471  0.2566883  -0.08572228 -0.00877656 -0.01470754
  0.09599475  0.08488567 -0.22608955  0.29944983 -0.1588087   0.16511342
 -0.5654839   0.02626183 -0.00412571  0.08016261 -0.66539353 -0.04139498
  0.31533444  0.1254148  -0.05564674  0.42645916 -0.5808047  -0.3405478
  0.36155587  0.18621838 -0.05239308  0.10274373 -0.36228842 -0.27298418
  0.33608422 -0.2988138  -0.5349861  -0.38662362  0.28941253  0.09757522
 -0.28427303  0.0545605  -0.07313918 -0.31062493  0.36393994  0.10052888
  0.3193981  -0.16685288 -0.19736792 -0.1944135   0.45230377  0.23940851
  0.17697854  0.19814879 -0.19274928  0.6112448  -0.20306586 -0.11211285
 -0.48181373  0.4691558   0.14557801  0.25496432 -0.28298065 -0.3830366
 -0.6511909  -0.1889271  -0.05878077 -0.20141794  0.32011527 -0.06556274
  0.05855491  0.07617607 -0.08813886 -0.20229647]
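read_emb_from_file is not shown here either. A hypothetical sketch of reading a word2vec-style text file (each line: a word followed by its 100 floats) into an embedding matrix; the function name and signature below are my own, not the repo's:

```python
import numpy as np

def read_emb_from_text(lines, word_dict, emb_dim=100):
    # Hypothetical reader: one embedding-matrix row per word in word_dict,
    # filled from "word v1 v2 ... v_dim" lines; unseen words stay all-zero.
    emb_mat = np.zeros((len(word_dict), emb_dim), dtype=np.float32)
    for line in lines:
        parts = line.split()
        word, vec = parts[0], np.array(parts[1:], dtype=np.float32)
        if word in word_dict:
            emb_mat[word_dict[word]] = vec
    return emb_mat

# Toy file contents: one known word with a 100-dim vector of 0.1s
lines = ['hello ' + ' '.join(['0.1'] * 100)]
word_dict = {'hello': 0, 'world': 1}
mat = read_emb_from_text(lines, word_dict)
print(mat.shape)  # (2, 100)
```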

model = MODEL_type(nb_words, FLAGS.emb_dim, emb_mat, FLAGS.hidden_dim,
                   FLAGS.nb_classes, FLAGS.dropout, FLAGS.batch_size,
                   FLAGS.max_len, num_feature, FLAGS.l2_reg, FLAGS.fine_tuning)

The corresponding values of the arguments are:

 4832  100  4835  80  7  0.0  200  150  5  0.0001  True

 

What the data looks like after reading:

print len(train_x), type(train_x), len(train_y), type(train_y), len(train_len), type(train_len)
print 'train_x[0]', train_x[0]
print 'train_y[0]', train_y[0]
print 'train_len', train_len

46658 <type 'numpy.ndarray'> 46658 <type 'numpy.ndarray'> 46658 <type 'numpy.ndarray'>

train_x[0]
[[4834 4834    1    2    3]
 [4834    1    2    3    4]
 [   1    2    3    4    5]
 [   2    3    4    5    6]
 [   3    4    5    6    7]
 [   4    5    6    7    8]
 [   5    6    7    8    9]
 [   6    7    8    9   10]
 [   7    8    9   10   11]
 [   8    9   10   11   12]
 [   9   10   11   12   13]
 [  10   11   12   13   14]
 [  11   12   13   14   15]
 [  12   13   14   15   16]
 [  13   14   15   16   17]
 [  14   15   16   17   18]
 [  15   16   17   18   19]
 [  16   17   18   19   20]
 [  17   18   19   20   21]
 [  18   19   20   21   22]
 [  19   20   21   22   23]
 [  20   21   22   23   24]
 [  21   22   23   24   25]
 [  22   23   24   25   26]
 [  23   24   25   26   27]
 [  24   25   26   27   28]
 [  25   26   27   28   29]
 [  26   27   28   29   30]
 [  27   28   29   30   31]
 [  28   29   30   31   20]
 [  29   30   31   20   32]
 [  30   31   20   32   33]
 [  31   20   32   33   34]
 [  20   32   33   34   35]
 [  32   33   34   35   13]
 [  33   34   35   13   14]
 [  34   35   13   14   20]
 [  35   13   14   20   36]
 [  13   14   20   36   37]
 [  14   20   36   37   36]
 [  20   36   37   36   38]
 [  36   37   36   38   39]
 [  37   36   38   39   40]
 [  36   38   39   40   41]
 [  38   39   40   41   31]
 [  39   40   41   31   42]
 [  40   41   31   42 4834]
 [  41   31   42 4834 4834]
 [   0    0    0    0    0]
..........
 [   0    0    0    0    0]]

train_y[0] [0 1 1 1 2 2 3 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]

train_len [48 19 47 ... 66 53 70]
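Each row of train_x[0] is a 5-token context window centered on one character (num_feature = 5), with id 4834 apparently used as the sentence-boundary padding and all-zero rows padding the sentence up to max_len. A sketch of how such windows can be built (my own reconstruction for illustration, not the repo's code):

```python
import numpy as np

def context_windows(token_ids, win=5, pad_id=4834, max_len=None):
    # Surround the sentence with pad ids, then take a win-sized slice
    # centered on each position; optionally pad with zero rows up to max_len.
    half = win // 2
    padded = [pad_id] * half + list(token_ids) + [pad_id] * half
    rows = [padded[i:i + win] for i in range(len(token_ids))]
    if max_len is not None:
        rows += [[0] * win] * (max_len - len(rows))
    return np.array(rows)

# A 4-token sentence padded to max_len=6; the first row is
# [4834 4834 1 2 3], matching the pattern seen in train_x[0] above.
print(context_windows([1, 2, 3, 4], max_len=6))
```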

pred_test, test_loss, test_acc = model.run()  ## the actual training of the neural network happens inside model.run

That's all for main.py in this post; the rest will continue to be updated.

Tags: NLP

Posted by FraggleRock on Mon, 16 May 2022 16:11:02 +0300