Knowledge graph tracing 1.1 -- starting from NER

Overview of this article: these are learning notes from reproducing the BERT-NER-pytorch project (from an open-source knowledge-graph project collection). As a newcomer myself, I hope they are a useful reference for other beginners.

Reference: for the Transformer inside the BERT model, Jay Alammar's animated illustrations are a must-share. Why didn't I come across such good material earlier?

1, Preparation:

1. Data set

Organizing and processing datasets is a big part of deep learning work; it is often said that an algorithm engineer spends 80% of the time dealing with data. Let's introduce the dataset:

  • Access: download directly from the GitHub repo (the GitZip browser extension lets you grab just the data folder).

The raw data consists of only three files.

The train set has 45,000 sentences and the test set 3,442; we need to split off a validation set ourselves (a small split sketch appears after the table below).

  • Raw data format: each line is one Chinese character and its tag. Let's look at the first 17 lines of msra_train_bio (176,042 lines in total):
中    B-ORG
共    I-ORG
中    I-ORG
央    I-ORG
致    O
中    B-ORG
国    I-ORG
致    I-ORG
公    I-ORG
党    I-ORG
十    I-ORG
一    I-ORG
大    I-ORG
的    O
贺    O
词    O
各    O
  • tags (only three entity types: organization, person and location):
O
B-ORG
I-PER
B-PER
I-LOC
I-ORG
B-LOC

ps: as you can see, the BIO tagging scheme is used. Of course, it can be changed if we want!
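Since num_labels = len(params.tag2idx) shows up later when the model is built, this tag list is typically turned into an index mapping first. A minimal sketch (the variable names are illustrative, not the project's exact code):

# build tag <-> index mappings from the 7 tags listed above (illustrative sketch)
tags = ["O", "B-ORG", "I-PER", "B-PER", "I-LOC", "I-ORG", "B-LOC"]
tag2idx = {tag: idx for idx, tag in enumerate(tags)}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}
print(len(tag2idx))        # 7 -- this becomes num_labels
print(tag2idx["B-ORG"])    # 1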

  • The dataset after splitting (3 directories):
    Dataset           Sentences
    training set         42,000
    validation set        3,000
    test set              3,442

Splitting gives the three directories 👆.
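For reference, once each sentence and its tag sequence sit on parallel lines (the processed form shown in the next bullet), carving out the 3,000-sentence validation set is just a shuffle and a slice. A rough sketch under my own assumptions (file names, seed and directory layout are illustrative, not the project's preprocessing script):

import os
import random

# read the parallel files produced from the raw training data (names are assumed)
with open("sentences.txt", encoding="utf-8") as fs, open("tags.txt", encoding="utf-8") as ft:
    pairs = list(zip(fs.read().splitlines(), ft.read().splitlines()))

random.seed(2022)
random.shuffle(pairs)
splits = {"val": pairs[:3000], "train": pairs[3000:]}   # 3,000 validation / 42,000 training

for name, subset in splits.items():
    os.makedirs(name, exist_ok=True)
    with open(os.path.join(name, "sentences.txt"), "w", encoding="utf-8") as fs, \
         open(os.path.join(name, "tags.txt"), "w", encoding="utf-8") as ft:
        for sent, tag in subset:
            fs.write(sent + "\n")
            ft.write(tag + "\n")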

  • Format after data processing (first few examples; the original files contain Chinese characters separated by spaces, the sentences are shown here in translation):

sentences.txt file:

How to solve the long-standing contradictions in the football field and revitalize the glory of Jinmen football in the past has become a topic of discussion in Tianjin football circles.
The county pays attention to the promotion of agricultural technology and the improvement of farmers' scientific and technological education and agricultural technology level.
The key to innovation is the production, dissemination and use of knowledge and information.

The corresponding tags.txt file:

O O O O O O O O O O O O O O O O O O O O O B-LOC I-LOC O O O O O O O O B-LOC I-LOC O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O O O O O O O O O
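For the curious, going from the character-per-line BIO file to this sentences.txt / tags.txt form could look roughly like the sketch below. The sentence-boundary rule (a blank line) and the file names are my assumptions; the project ships its own preprocessing code, so treat this only as an illustration:

# illustrative sketch: turn a char-per-line BIO file into parallel sentences.txt / tags.txt
def bio_to_parallel(bio_path, sent_path, tag_path):
    chars, tags, sentences = [], [], []
    with open(bio_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                       # blank line = sentence boundary (assumed)
                if chars:
                    sentences.append((" ".join(chars), " ".join(tags)))
                    chars, tags = [], []
                continue
            char, tag = line.split()
            chars.append(char)
            tags.append(tag)
    if chars:                                  # flush the last sentence
        sentences.append((" ".join(chars), " ".join(tags)))
    with open(sent_path, "w", encoding="utf-8") as fs, open(tag_path, "w", encoding="utf-8") as ft:
        for sent, tag in sentences:
            fs.write(sent + "\n")
            ft.write(tag + "\n")

bio_to_parallel("msra_train_bio", "sentences.txt", "tags.txt")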

2. Prepare the model and the pretrained parameters

After several attempts, it turns out that getting the pretrained model parameters is not difficult.
When I reproduced the code there were no PyTorch parameters directly available, so it all came down to obtaining them.
In train.py, the model is created with this line:

model = BertForTokenClassification.from_pretrained()

Clicking into the from_pretrained() method (in modeling.py under the pytorch_pretrained_bert directory),
we see that the model and its parameters can be obtained via a path name or a URL (there was a small episode during training: I gave up on the checkpoint provided by the author and downloaded the compressed package myself, more on that later):

PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}
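As mentioned above, I eventually downloaded the compressed package myself instead of relying on the automatic download. That route looks roughly like this (paths and num_labels are illustrative; the archive contains bert_config.json and pytorch_model.bin):

import tarfile
import urllib.request

from pytorch_pretrained_bert import BertForTokenClassification

# fetch the archive by hand and unpack it into a local directory (names are made up)
url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz"
urllib.request.urlretrieve(url, "bert-base-chinese.tar.gz")
with tarfile.open("bert-base-chinese.tar.gz") as tar:
    tar.extractall("pt_things")

# then simply point from_pretrained at the extracted directory
model = BertForTokenClassification.from_pretrained("pt_things", num_labels=7)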

ps: originally you had to fetch the TensorFlow model parameters first and then convert them to PyTorch. It cost me quite some effort: I downloaded convert_tf_checkpoint_to_pytorch.py from the Transformers library into the TF checkpoint directory and, after some fiddling, got the PyTorch .bin file. Fortunately it is now a one-line affair.
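For the record, that old TF-to-PyTorch route looked roughly like the following. The flag names are those of the convert_tf_checkpoint_to_pytorch.py script as I remember them, and the checkpoint directory is Google's chinese_L-12_H-768_A-12 release; double-check both against the version you download:

import subprocess

# run the conversion script next to the downloaded TF checkpoint (sketch; verify the flags)
subprocess.run([
    "python", "convert_tf_checkpoint_to_pytorch.py",
    "--tf_checkpoint_path", "chinese_L-12_H-768_A-12/bert_model.ckpt",
    "--bert_config_file", "chinese_L-12_H-768_A-12/bert_config.json",
    "--pytorch_dump_path", "pt_things/pytorch_model.bin",
], check=True)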

3. train.py file interpretation

The dataset handling and the hyper-parameter setup are straightforward and not the core of this article, so I won't record much about them here; if you want the details, see the GitHub project.
This file contains quite a lot, including
① parse, params setting, and the logger
② dataloader, model, optimizer and train_and_evaluate

  • Argument parsing (these are the command-line arguments parsed when the .py file is run):
parser = argparse.ArgumentParser()
parser.add_argument('--data_dir', default='NER-BERT-pytorch-data-msra', help="Directory containing the dataset")
parser.add_argument('--bert_model_dir', default='pt_things', help="Directory containing the BERT model in PyTorch")
......

Then, in main, the arguments are read in:
args = parser.parse_args()
and later accessed as attributes, e.g. args.data_dir.

  • logger log

The related handling lives in utils.py; in train.py you just call:

# create the logger
utils.set_logger(os.path.join(args.model_dir, 'train.log'))
# When recording is required:
logging.info("device: {}, n_gpu: {}, 16-bits training: {}".format(params.device, params.n_gpu, args.fp16))
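utils.set_logger is the project's own helper; a typical implementation (my paraphrase, not the repo's exact code) simply attaches a file handler and a console handler to the root logger, so every logging.info(...) call goes both to train.log and to the terminal:

import logging

def set_logger(log_path):
    """Send logging output both to log_path and to the terminal (sketch)."""
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        file_handler = logging.FileHandler(log_path)
        file_handler.setFormatter(logging.Formatter("%(asctime)s:%(levelname)s: %(message)s"))
        logger.addHandler(file_handler)
        stream_handler = logging.StreamHandler()
        stream_handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(stream_handler)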
  • model

Simple, just two lines:

model = BertForTokenClassification.from_pretrained(args.bert_model_dir, num_labels=len(params.tag2idx))
model.to(params.device)

This covers both building the model and loading its parameters. During training, right after the model config is printed, two messages appear:

Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

(At first I thought loading the model parameters had failed, so I went looking for a solution.

I suspected the parameters in the pytorch_model.bin file did not match the model, and re-downloaded the compressed package from the link in the source code, but the two messages were still there.

It turned out, via a Google forum, that these two messages actually mean the parameters were loaded successfully.)
Mystery solved: the model is the BERT backbone plus an NER head. The pretrained parameters initialize the backbone (embeddings and encoder), while the task-specific classifier for NER is exactly the part that gets trained on our own data!
Reading the model source code: a detailed look at each layer.
Viewing the model (BertForTokenClassification),
it has three parts:
① BertModel
BertEmbeddings: contains several sub-layers

    (word_embeddings): Embedding(21128, 768)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
BertEncoder: contains 12 layers; each one is a BertLayer:
    (attention): BertAttention  # the most important part, worth revisiting when there is a chance
    (intermediate): BertIntermediate
    (output): BertOutput
BertPooler

② Dropout
③ Linear (the classification head)
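Putting the three parts together, the forward pass of BertForTokenClassification is essentially "BERT backbone -> dropout -> linear head". The sketch below paraphrases that idea with generic PyTorch code; it is not the library's exact source:

import torch.nn as nn

class TokenClassifierSketch(nn.Module):
    """Rough shape of BertForTokenClassification: ① BertModel ② Dropout ③ Linear."""
    def __init__(self, bert, hidden_size=768, num_labels=7, dropout_prob=0.1):
        super().__init__()
        self.bert = bert                                       # ① the pretrained backbone
        self.dropout = nn.Dropout(dropout_prob)                # ②
        self.classifier = nn.Linear(hidden_size, num_labels)   # ③ trained from scratch

    def forward(self, input_ids, attention_mask=None, labels=None):
        sequence_output = self.bert(input_ids, attention_mask=attention_mask)[0]  # (batch, seq_len, 768)
        logits = self.classifier(self.dropout(sequence_output))                   # (batch, seq_len, num_labels)
        if labels is None:
            return logits
        loss_fn = nn.CrossEntropyLoss()
        return loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))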

  • optimizer

full_finetuning: whether to fine-tune all of BERT or only train the classifier head on top (a sketch follows below).
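The full_finetuning switch decides which parameters the optimizer sees. The usual pattern (and roughly what such projects do; the params fields and the details here are my assumptions, not the repo's exact code) is: fine-tune everything with weight-decay grouping, or only train the classifier head:

from torch.optim import Adam

# sketch of the full_finetuning switch; params.full_finetuning / params.learning_rate are assumed names
if params.full_finetuning:
    no_decay = ["bias", "LayerNorm.weight"]          # no weight decay on biases / LayerNorm
    named = list(model.named_parameters())
    grouped = [
        {"params": [p for n, p in named if not any(nd in n for nd in no_decay)], "weight_decay": 0.01},
        {"params": [p for n, p in named if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
else:
    # only the classification head on top of BERT gets updated
    grouped = [{"params": [p for n, p in model.named_parameters() if n.startswith("classifier")]}]

optimizer = Adam(grouped, lr=params.learning_rate)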

4. Single-machine multi-GPU parallelism and fp16

  • Multi-GPU

Assign the model and the data to multiple GPUs.
① Specify the visible GPUs:
os.environ["CUDA_VISIBLE_DEVICES"] = '1,2,3,0'
The virtual device indices 0, 1, 2, 3 then correspond to physical cards 1, 2, 3, 0 (at the time a senior labmate's main job was running on physical card 0, so that card, with the least free memory, was avoided as the primary device).

ps: by the time I took the screenshot, the senior labmate's job was no longer running.

② Before preparing the model and data, add this line:
params.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
The else 'cpu' fallback is presumably just so the code does not crash on a machine without a GPU.
Note: if you write 'cuda:<num>', the number is the index within CUDA_VISIBLE_DEVICES, so 'cuda:0' corresponds to physical card 1 here!

③ Set random seeds (one for the CPU and one for every GPU):

# Set random seeds for repeatable tests
random.seed(args.seed)
torch.manual_seed(args.seed)    # seed for the CPU
if params.n_gpu > 0:
    torch.cuda.manual_seed_all(args.seed)  # set random seed for all GPUs

ps: on a single GPU, just drop the "_all" and use torch.cuda.manual_seed instead;

④ Assign multiple GPUs to the model:

model = BertForTokenClassification.from_pretrained(args.bert_model_dir, num_labels=len(params.tag2idx))
model.to(params.device)
if params.n_gpu > 1 and args.multi_gpu:
    model = torch.nn.DataParallel(model)

⑤ Finally, assign multiple cards to the data:

# Initialize the DataLoader
data_loader = DataLoader(args.data_dir, args.bert_model_dir, params, token_pad_idx=0)
# Load training data and validation data
train_data = data_loader.load_data('train')
val_data = data_loader.load_data('val')
......
Before the train() function is called:
# data iterator for training
train_data_iterator = data_loader.data_iterator(train_data, shuffle=True)
# Train for one epoch on training set
train(model, train_data_iterator, optimizer, scheduler, params)

Inside data_iterator(), each batch is moved to the assigned device:

# shift tensors to GPU if available
batch_data, batch_tags = batch_data.to(self.device), batch_tags.to(self.device)
yield batch_data, batch_tags

Here self.device is the device the class received via its parameters (params.device).

  • fp16

Reference: a CSDN article on the principle of fp16 acceleration.
fp16 stores each number in 2 bytes.
Advantages: lower memory use (the main one) + faster computation.
Disadvantage: additions easily overflow or underflow.
(Worth a dedicated experiment when there is a chance; a toy illustration follows.)
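A toy illustration of the overflow / precision issue (nothing to do with the project's actual fp16 setup):

import torch

# fp16 tops out around 65504 and keeps only ~3 significant decimal digits
a = torch.tensor(60000.0, dtype=torch.float16)
print(a + a)    # tensor(inf, dtype=torch.float16)  -> overflow

b = torch.tensor(1.0, dtype=torch.float16)
c = torch.tensor(1e-4, dtype=torch.float16)
print(b + c)    # tensor(1., dtype=torch.float16)   -> the small addend is swallowed

# this is why mixed-precision training keeps fp32 master weights and scales the loss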

5. Progress bar tool

Here each epoch works out to about 1,400 batches,
so the progress bar is placed inside each epoch:

from tqdm import trange

t = trange(params.train_steps)
for i in t:
    # fetch the next training batch
    batch_data, batch_tags = next(data_iterator)
    # ......
    loss = model(...)   # forward pass; returns the loss when labels are passed in
    # ......
    t.set_postfix(loss='{:05.3f}'.format(loss_avg()))

Result display:

6. Evaluation metrics library

Some of the available evaluation metrics (just the tip of the iceberg):

from metrics import f1_score
from metrics import accuracy_score
from metrics import classification_report

Template:

metrics = {}
f1 = f1_score(true_tags, pred_tags)
accuracy=accuracy_score(true_tags, pred_tags)
metrics['loss'] = loss_avg()
metrics['f1'] = f1
metrics['accuracy']=accuracy
metrics_str = "; ".join("{}: {:05.2f}".format(k, v) for k, v in metrics.items())
logging.info("- {} metrics: ".format(mark) + metrics_str)

Result display:

How do accuracy and recall differ? Roughly: accuracy is the fraction of all tokens that get the right tag, while recall is the fraction of the true entities that the model actually finds (and precision is the fraction of predicted entities that are correct); F1 combines precision and recall.
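A toy example of why high token accuracy and low F1 can coexist on NER data (the counts are made up):

# 100 tokens, 96 of them tagged O; a model that predicts O everywhere
true_tags = ["B-PER", "I-PER", "B-LOC", "I-LOC", "O"] + ["O"] * 95
pred_tags = ["O"] * 100

accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(accuracy)    # 0.96 -- looks great, yet every entity was missed, so recall and F1 are 0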

7. Problems in multiple experiments

My F1 score has stayed below 50 even though accuracy hovers around 97% (likely because the vast majority of tokens are O, so token accuracy says little about entity extraction). I now need to switch to a liver-related extraction task, so I will set this part aside and come back to finish it later.
