Everything-is-Embedding, Part 1 - Word2vec: model explained & code implementation

Word2vec is a model proposed by Google in 2013 that trains word vectors on a large corpus. It is used in many scenarios, such as information extraction and similarity computation. Embedding became popular across many fields starting from word2vec, so it makes a fitting opening for this series. This post aims to give a fairly comprehensive overview of word2vec, covering the model structure, derivation, training tricks, and an implementation based on the TF Estimator API. The full code is here: https://github.com/DSXiangLi/Embedding

Model overview

The structure of the word2vec model is relatively simple: in order to train on large-scale data, the model complexity is reduced and the nonlinear hidden layer is removed. Depending on the input and output forms, there are two variants: CBOW and SG.

Let's first simplify the problem to a 1-vs-1 bigram setting: word \(i\) is the context and word \(j\) is the target. \(V\) is the vocabulary size, \(N\) is the word-vector length, \(D\) is the set of training word pairs, and the input \(x_i \in R^{1 \times V}\) is a one-hot vector.

The model trains two weight matrices: \(W \in R^{V \times N}\) is the input matrix, each row being the word vector of an input word, and \(W^{'} \in R^{V \times N}\) is the output matrix, each row being the word vector of an output word. The co-occurrence information of word \(i\) and word \(j\) is expressed by the inner product of their word vectors, and the probability of each word is obtained by a softmax as follows:

\[\begin{align} h =v_{wI} &= W^T x_i \\ v_{w^{'}j} &= W^{'T} x_j \\ u_j &= v_{w^{'}j}^T h \\ y_j = p(w_j|w_I) &= \frac{exp(u_j)}{\sum_{j^{'}=1}^Vexp(u_{j^{'}})}\\ \end{align} \]

For each training sample, the goal of the model is to maximize the conditional probability \(p(w_j|w_I)\), so the log loss function is as follows:

\[\begin{align} E & = - \log P(w_j|w_I) \\ & = -u_{j^*} + \log\sum_{j^{'}=1}^V \exp(u_{j^{'}}) \end{align} \]
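
As a quick illustration, here is a minimal numpy sketch of this bigram forward pass; the vocabulary size, embedding size, and word indices are toy values chosen only for demonstration.

import numpy as np

V, N = 10, 4                        # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))         # input matrix, row i is the input vector of word i
W_out = rng.normal(size=(V, N))     # output matrix, row j is the output vector of word j

i, j = 2, 5                         # input word i, target word j
h = W[i]                            # hidden vector = input word embedding (one-hot lookup)
u = W_out @ h                       # scores u_j = v'_{wj}^T h for every word in the vocabulary
p = np.exp(u) / np.exp(u).sum()     # softmax over the vocabulary
loss = -np.log(p[j])                # E = -log p(w_j | w_I)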

CBOW : Continuous bag of words

CBOW extends the bigram input context to the 2*window_size words around the target word, i.e. it uses the context before and after the center word to predict the center word.

Compared with bigram, CBOW only adds one operation: after the 2*window_size input words are mapped to word vectors, average pooling is applied to obtain a 1*N input vector, so the only difference is in how h is computed. Let \(C = 2 \cdot \text{window\_size}\):
\[\begin{align} h & = \frac{1}{C}W^T(x_1 + x_2 + ... + x_C) \\ & = \frac{1}{C}(v_{w1} + v_{w2} + ... + v_{wC})^T \\ E &= -\log \, p(w_O|w_{I,1}...w_{I,C}) \\ & = -u_{j^*} + \log\sum_{j^{'}=1}^V \exp(u_{j^{'}}) \end{align} \]
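
Continuing the toy numpy sketch above, CBOW only changes how h is formed: average pooling over the C context word vectors instead of a single lookup.

context = [1, 3, 4, 7]              # indices of the 2*window_size context words
h = W[context].mean(axis=0)         # h = (1/C) * (v_w1 + ... + v_wC)
u = W_out @ h
p = np.exp(u) / np.exp(u).sum()
loss = -np.log(p[j])                # predict the center word j from the averaged context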

SG : Skip Gram

SG extends the bigram output target to the 2*window_size words around the input word, i.e. it uses the center word to predict the occurrence probabilities of the surrounding words.

Compared with bigram, the difference in SG is that the output is no longer a single multinomial distribution but C of them:

\[\begin{align} E &= -\log \, p(w_{O,1},w_{O,2},...w_{O,C}|w_I) \\ & = -\sum_{c=1}^C u_{j_c^*} + C\cdot \log\sum_{j^{'}=1}^V \exp(u_{j^{'}}) \end{align} \]
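
For intuition, here is a small sketch of how (input, output) training pairs could be generated from a tokenized sentence for the two modes; the helper and its window_size argument are illustrative only, not taken from the repo.

def gen_pairs(tokens, window_size=2, mode='SG'):
    """Yield (input, output) training pairs for skip-gram or CBOW."""
    for pos, center in enumerate(tokens):
        left = max(0, pos - window_size)
        context = tokens[left:pos] + tokens[pos + 1:pos + 1 + window_size]
        if mode == 'SG':
            for ctx in context:
                yield center, ctx       # the center word predicts each surrounding word
        else:
            yield context, center       # the surrounding words jointly predict the center word

# Example: list(gen_pairs(['the', 'quick', 'brown', 'fox'], window_size=1, mode='SG'))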

Model derivation: how is word embedding obtained?

Let's derive via back-propagation how word vectors are learned from the model structure above. To simplify, start with bigram, where \(\eta\) is the learning rate.

First, update the hidden->output word vectors \(W^{'}\):

\[\begin{align} \frac{\partial E}{\partial v_{w^{'}j}} &= \frac{\partial E}{\partial u_j}\frac{\partial u_j}{\partial v_{w^{'}j}}\\ & = (p(w_j|w_i) - I(j=j^*))\cdot h \\ & = e_j\cdot h \\ v_{w^{'}j}^{(new)} &= v_{w^{'}j}^{(old)} - \eta \cdot e_j \cdot h \\ \end{align} \]

\(e_j\) is the prediction-probability error of word j, so the update of \(W^{'}\) can be understood as follows: if word j is over-estimated, \(\eta \cdot e_j \cdot h\) is subtracted from \(v_{w^{'}j}\), which reduces the inner product (similarity) of h and \(v_{w^{'}j}\); if it is under-estimated, \(\eta \cdot e_j \cdot h\) is added to \(v_{w^{'}j}\) to increase the inner-product similarity. The larger the error, the larger the update.

Then update the input->hidden word vectors \(W\):

\[\begin{align} \frac{\partial E}{\partial h} &= \sum_{j=1}^V\frac{\partial E}{\partial u_j}\frac{\partial u_j}{\partial h}\\ & = \sum_{j=1}^V e_j \cdot v_{w^{'}j}\\ v_{w_I}^{(new)} &= v_{w_I}^{(old)} - \eta \cdot \sum_{j=1}^V e_j \cdot v_{w^{'}j} \\ \end{align} \]

The word vector \(v_{w_I}\) of each input word is updated with the error-weighted average of the output word vectors of all words. Same logic as above: over-estimated words are subtracted, under-estimated words are added, and the contributions are weighted by the size of the error before updating the input word vector.
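
Putting the two updates together, one SGD step on the bigram toy sketch from above might look like this (eta is the learning rate):

eta = 0.05
h = W[i]                            # bigram: hidden vector is the input word's embedding
u = W_out @ h
p = np.exp(u) / np.exp(u).sum()
e = p.copy()
e[j] -= 1.0                         # e_j = p(w_j|w_I) - I(j == j*), the prediction error

grad_h = W_out.T @ e                # dE/dh = sum_j e_j * v'_{wj}, computed before W_out moves
W_out -= eta * np.outer(e, h)       # hidden->output: v'_{wj} <- v'_{wj} - eta * e_j * h
W[i] = W[i] - eta * grad_h          # input->hidden: only word i's vector is updated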

So the learning process alternates back and forth: the input word vectors update the output word vectors, and the output word vectors update the input word vectors, until a steady state is reached.

The only change when extending bigram to CBOW is that when updating the input->hidden word vectors, instead of updating the vector of one word at a time, the word vectors of all C context words are updated simultaneously, each by the same 1/C-scaled amount:

\[v_{w_{I,c}}^{(new)} = v_{w_{I,c}}^{(old)} - \frac{1}{C}\eta \cdot \sum_{j=1}^V e_j \cdot v_{w^{'}j} \]

When extending bigram to SG, the only change is that when updating the hidden->output word vectors, it is no longer the prediction error of a single word j but the sum of the prediction errors of the C output words:

\[v_{w^{'}j}^{(new)} = v_{w^{'}j}^{(old)} - \eta \cdot \sum_{c=1}^C e_{c,j} \cdot h \]

Model training

Although the model structure is already optimized and the nonlinear hidden layer has been removed, training is still inefficient. The bottleneck is that word2vec is essentially a multi-class classification task with as many classes as there are words in the vocabulary, so computing \(p(w_j|w_I) = \frac{\exp(u_j)}{\sum_{j^{'}=1}^V \exp(u_{j^{'}})}\) requires normalizing over the whole vocabulary every time, at cost \(O(V \cdot N)\). Even if a batch contains only one training sample, the whole hidden->output embedding matrix needs to be updated. There are two solutions to this problem.

Hierarchical Softmax

If softmax is regarded as a depth-1 tree in which each word is a leaf node, then because of the normalization the complexity of computing the probability of one word is \(O(V)\). Hierarchical softmax simply turns the 1-layer tree into a multi-layer one. Without increasing the embedding size (the tree has V leaf nodes and V-1 inner nodes), it reduces the complexity of computing each word's probability to \(O(\log V)\), by using the path from the root to the leaf node to compute the word's probability directly. The author builds the tree as a Huffman tree, whose advantage is that high-frequency words get shorter root-to-leaf paths than low-frequency words, which further speeds up training. See this blog for details on Huffman coding.

Take, for example, a Huffman tree in which the leaf for "Horse" is reached from the root by going left at inner node 0, right at inner node 1, and left at inner node 2 (original figure and image source omitted):

\[P(Horse) = P(0,left)\cdot P(1,right)\cdot P(2,left) \]

How is P(0,left) above calculated?

Each inner node has its own embedding \(v_{n(w,j)}\), the embedding of the j-th node on the path of word w, and the inner product between the input and output words becomes the inner product between the input word and the nodes. The probability of each word is computed as follows:

\[p(w=w_o) = \prod_{j=1}^{L(w)-1}\sigma([n(w,j+1) = ch(n(w,j))] \cdot {v_{n(w,j)}}^{T} h) \]

I have to say this formula seems to be written for fear that anyone might actually understand it >_<

What is \([n(w,j+1) = ch(n(w,j))]\)? ch denotes the left child, and \([\cdot]\) only indicates whether the path goes left or right:

\[[\cdot] = \begin{cases} 1 & \quad \text{if left}\\ -1 & \quad \text{if right} \end{cases} \]

Therefore:

\[\begin{align} p(n,left) &= \sigma(v_n^T\cdot h )\\ p(n, right) &= \sigma(-v_n^T\cdot h )= 1- \sigma(v_n^T\cdot h ) \end{align} \]
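
As a tiny worked sketch of the P(Horse) example (reusing the toy numpy setup above, with random placeholder embeddings for the three inner nodes on the path):

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = W[i]                                # input word vector from the earlier sketch
node_vec = rng.normal(size=(3, N))      # embeddings of inner nodes 0, 1, 2 on the path
p_horse = (sigmoid(node_vec[0] @ h)     # P(0, left)  = sigma( v_0^T h)
           * sigmoid(-node_vec[1] @ h)  # P(1, right) = sigma(-v_1^T h)
           * sigmoid(node_vec[2] @ h))  # P(2, left)  = sigma( v_2^T h)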

Compared with the model derivation above, only the hidden->output part changes, and the loss function becomes:

\[E = -\log P(w=w_j|w_I) = - \sum_{j=1}^{L(w)-1}\log \sigma([\cdot]v_j^T h) \]

Only the embeddings on the path of each output word are updated, and the prediction task becomes, for each inner node on the path, whether to go left or right.

A simple implementation of Huffman-tree hierarchical softmax is as follows.

import heapq
import tensorflow as tf

# INVALID_INDEX is a padding marker (defined elsewhere in the repo) used to pad paths/codes

class TreeNode(object):
    total_node = 0
    def __init__(self, frequency, char = None , word_index = None, is_leaf = False):
        self.frequency = frequency
        self.char = char # word character
        self.word_index = word_index # word look up index
        self.left = None
        self.right = None
        self.is_leaf = is_leaf
        self.counter(is_leaf)

    def counter(self, is_leaf):
        # node_index will be used for embedding_lookup
        self.node_index = TreeNode.total_node
        if not is_leaf: TreeNode.total_node += 1

    def __lt__(self, other):
        return self.frequency < other.frequency

    def __repr__(self):
        if self.is_leaf:
            return 'Leaf Node char = [{}] index = {} freq = {}'.format(self.char, self.word_index, self.frequency)
        else:
            return 'Inner Node [{}] freq = {}'.format(self.node_index, self.frequency)

class HuffmanTree(object):
    def __init__(self, freq_dic):
        self.nodes = []
        self.root = None
        self.max_depth = None
        self.freq_dic = freq_dic
        self.all_paths = {}
        self.all_codes = {}
        self.node_index = 0

    @staticmethod
    def merge_node(left, right):
        parent = TreeNode(left.frequency + right.frequency)
        parent.left = left
        parent.right = right
        return parent

    def build_tree(self):
        """
        Build huffman tree with word being leaves
        """
        TreeNode.total_node = 0 # reset so repeated builds (e.g. train_and_evaluate) get consistent node_index values

        heap_nodes = []
        for word_index, (char, freq) in enumerate(self.freq_dic.items()):
            tmp = TreeNode( freq, char, word_index, is_leaf=True )
            heapq.heappush(heap_nodes, tmp )

        while len(heap_nodes)>1:
            node1 = heapq.heappop(heap_nodes)
            node2 = heapq.heappop(heap_nodes)
            heapq.heappush(heap_nodes, HuffmanTree.merge_node(node1, node2))

        self.root = heapq.heappop(heap_nodes)

    @property
    def num_node(self):
        return self.root.node_index + 1

    def traverse(self):
        """
        Compute all node to leaf path and direction: list of node_id, list of 0/1
        """
        def dfs_helper(root, path, code):
            if root.is_leaf :
                self.all_paths[root.word_index] = path
                self.all_codes[root.word_index] = code
                return
            if root.left :
                dfs_helper(root.left, path + [root.node_index], code + [0])
            if root.right :
                dfs_helper(root.right, path + [root.node_index], code + [1])

        dfs_helper(self.root, [], [] )

        self.max_depth = max([len(i) for i in self.all_codes.values()])



class HierarchySoftmax(HuffmanTree):
    def __init__(self, freq_dic):
        super(HierarchySoftmax, self).__init__(freq_dic)

    def convert2tensor(self):
        # padded to max_depth and convert to tensor
        with tf.name_scope('hstree_code'):
            self.code_table = tf.convert_to_tensor([ code + [INVALID_INDEX] * (self.max_depth - len(code)) for word, code
                                                     in sorted( self.all_codes.items(),  key=lambda x: x[0] )],
                                                   dtype = tf.float32)
        with tf.name_scope('hstree_path'):
            self.path_table = tf.convert_to_tensor([path + [INVALID_INDEX] * (self.max_depth - len(path)) for word, path
                                                    in sorted( self.all_paths.items(), key=lambda x: x[0] )],
                                                   dtype = tf.int32)

    def get_loss(self, input_embedding_vector, labels, output_embedding, output_bias, params):
        """
        :param input_embedding_vector: [batch * emb_size]
        :param labels: word index [batch * 1]
        :param output_embedding: entire embedding matrix []
        :return:
            loss
        """
        loss = []
        labels = tf.unstack(labels, num = params['batch_size']) # list of [1]
        inputs = tf.unstack(input_embedding_vector, num = params['batch_size']) # list of [emb_size]

        for label, input in zip(labels, inputs):

            path = self.path_table[tf.squeeze(label)]#  (max_depth,)
            code = self.code_table[tf.squeeze(label)] # (max_depth,)

            path = tf.boolean_mask(path, tf.not_equal(path, INVALID_INDEX)) # (real_path_length,)
            code = tf.boolean_mask(code, tf.not_equal(code, INVALID_INDEX) ) # (real_path_length,)

            output_embedding_vector = tf.nn.embedding_lookup(output_embedding, path) # real_path_length * emb_size
            bias = tf.nn.embedding_lookup(output_bias, path) # (real_path_length,)

            logits = tf.matmul(tf.expand_dims(input, axis=0), tf.transpose(output_embedding_vector) ) + bias # (1,emb_size) *(emb_size, real_path_length)
            loss.append(tf.nn.sigmoid_cross_entropy_with_logits(labels = code, logits = tf.squeeze(logits) ))

        loss = tf.reduce_mean(tf.concat(loss, axis = 0), axis=0, name = 'hierarchy_softmax_loss') # batch -> scalar

        return loss
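
A hedged usage sketch of the class above; the word-frequency dictionary here is made up purely for illustration.

freq_dic = {'the': 100, 'quick': 30, 'brown': 20, 'fox': 15, 'jumps': 10}

hstree = HierarchySoftmax(freq_dic)
hstree.build_tree()        # build the Huffman tree from word frequencies
hstree.traverse()          # record each word's root-to-leaf path and left/right code
hstree.convert2tensor()    # pad paths/codes to max_depth and convert to tensors
# hstree.get_loss(...) is then called inside model_fn, as shown further below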

Negative Sampling

Negative sampling is more intuitive to understand. Since the goal of the model is to train high-quality word embeddings, namely the input word embeddings, it does not really matter whether every batch updates all the output word embeddings; we can update only K of them at a time. The original positive sample is kept, and K negative samples are drawn for training. The model only needs to distinguish positive samples from negative samples, which bypasses the normalization over V words and successfully simplifies the multi-class problem into a binary classification problem. The authors suggest K=5~20 for small datasets and K=2~5 for large ones.

Corresponding to the model derivation above, the hidden->output part changes and the loss function becomes:

\[E = -log\sigma(v_j^Th) - \sum_{w_j \in neg} log\sigma(-v_{w_j}^Th) \]

Only the embeddings of the positive word and the K negative samples are updated per iteration:

\[v_{w^{'}j}^{(new)} = v_{w^{'}j}^{(old)} - \eta \cdot e_j \cdot h \,\,\,\, \text{where } w_j \in \{w_O\} \cup neg \]

In the input->hidden part, only these K sampled embeddings are weighted to update the input embedding:

\[v_{w_I}^{(new)} = v_{w_I}^{(old)} - \eta \cdot \sum_{j=1}^K e_j \cdot v_{w^{'}j} \]

TensorFlow has several candidate-sampling implementations; the two commonly used ones are tf.nn.sampled_softmax_loss and tf.nn.nce_loss, which call the same sampling function. The difference is that sampled_softmax_loss computes a softmax (mutually exclusive single-label classification), while nce_loss computes a logistic loss (non-exclusive multi-label classification). Both implementations differ slightly from the paper's negative sampling; for details see Notes on Noise Contrastive Estimation and Negative Sampling. Roughly speaking, NCE is more suitable for skip-gram and sampled softmax for CBOW; for more concrete differences you will have to experiment yourself.
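
The negative_sampling function called inside model_fn below is not shown in this post; here is a minimal sketch of what it might look like using the TF1 candidate-sampling losses. The params keys 'loss' and 'neg_sample_size' are assumptions of this sketch, not necessarily the names used in the repo, and mode could be used to switch to a full softmax at evaluation time.

def negative_sampling(mode, output_embedding, bias, labels, input_embedding_vector, params):
    """Sketch of a candidate-sampling loss; the repo's real implementation may differ."""
    if params.get('loss') == 'nce':
        losses = tf.nn.nce_loss(weights=output_embedding,
                                biases=bias,
                                labels=labels,                  # (batch, 1) target word indices
                                inputs=input_embedding_vector,  # (batch, emb_size)
                                num_sampled=params['neg_sample_size'],
                                num_classes=params['vocab_size'])
    else:
        losses = tf.nn.sampled_softmax_loss(weights=output_embedding,
                                            biases=bias,
                                            labels=labels,
                                            inputs=input_embedding_vector,
                                            num_sampled=params['neg_sample_size'],
                                            num_classes=params['vocab_size'])
    return tf.reduce_mean(losses, name='negative_sampling_loss')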

Subsampling

Another key point of the paper is subsampling. For very high-frequency words, more training samples do not further improve performance, so these samples can be down-sampled. t is the frequency threshold and \(f(w_i)\) is the frequency of word \(w_i\) in the corpus; every word with frequency above t is discarded with the following probability:

\[p(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \]
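
A small sketch of how this discard rule can be applied when building the training corpus; the helper and its arguments are illustrative only.

import numpy as np

def keep_word(word, freq, t=1e-5, rng=np.random.default_rng()):
    """Decide whether to keep one occurrence of `word` during subsampling.
    `freq[word]` is the word's relative frequency in the corpus, `t` the threshold."""
    p_drop = max(0.0, 1.0 - np.sqrt(t / freq[word]))   # discard probability from the formula above
    return rng.random() >= p_drop

# tokens = [w for w in tokens if keep_word(w, freq)]   # applied once per pass over the corpus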

Model implementation

In my actual experience, the trickier part of word2vec is not the model itself but the input pipeline and the loss function, so the implementation also tries to keep the dataset, model_fn, and training parts as separate as possible. Only the core of model_fn is shown below.

def avg_pooling_embedding(embedding, features, params):
    """
    :param features: (batch, 2*window_size)
    :param embedding: (vocab_size, emb_size)
    :return: 
        input_embedding : average pooling of context embedding
    """
    input_embedding= []
    samples = tf.unstack(features, params['batch_size'])
    for sample in samples:
        sample = tf.boolean_mask(sample, tf.not_equal(sample, INVALID_INDEX), axis=0) # (real_size,)
        tmp = tf.nn.embedding_lookup(embedding, sample) # (real_size, emb_size)
        input_embedding.append(tf.reduce_mean(tmp, axis=0)) # (emb_size, )

    input_embedding = tf.stack(input_embedding, name = 'input_embedding_vector') # batch * emb_size
    return input_embedding
    
def model_fn(features, labels, mode, params):
    if params['train_algo'] == 'HS':
        # If Hierarchy Softmax is used, initialize a huffman tree first
        hstree = HierarchySoftmax( params['freq_dict'] )
        hstree.build_tree()
        hstree.traverse()
        hstree.convert2tensor()

    if params['model'] == 'CBOW':
        features = tf.reshape(features, shape = [-1, 2 * params['window_size']])
        labels = tf.reshape(labels, shape = [-1,1])
    else:
        features = tf.reshape(features, shape = [-1,])
        labels = tf.reshape(labels, shape = [-1,1])

    with tf.variable_scope( 'initialization' ):
        w0 = tf.get_variable( shape=[params['vocab_size'], params['emb_size']],
                              initializer=tf.truncated_normal_initializer(), name='input_word_embedding' )
        if params['train_algo'] == 'HS':
            w1 = tf.get_variable( shape=[hstree.num_node, params['emb_size']],
                                  initializer=tf.truncated_normal_initializer(), name='hierarchy_node_embedding' )
            b1 = tf.get_variable( shape = [hstree.num_node],
                                  initializer=tf.random_uniform_initializer(), name = 'bias')
        else:
            w1 = tf.get_variable( shape=[params['vocab_size'], params['emb_size']],
                                  initializer=tf.truncated_normal_initializer(), name='output_word_embedding' )
            b1 = tf.get_variable( shape=[params['vocab_size']],
                                  initializer=tf.random_uniform_initializer(), name='bias')
        add_layer_summary( w0.name, w0)
        add_layer_summary( w1.name, w1 )
        add_layer_summary( b1.name, b1 )

    with tf.variable_scope('input_hidden'):
        # batch_size * emb_size
        if params['model'] == 'CBOW':
            input_embedding_vector = avg_pooling_embedding(w0, features, params)
        else:
            input_embedding_vector = tf.nn.embedding_lookup(w0, features, name = 'input_embedding_vector')
        add_layer_summary(input_embedding_vector.name, input_embedding_vector)

    with tf.variable_scope('hidden_output'):
        if params['train_algo'] == 'HS':
            loss = hstree.get_loss( input_embedding_vector, labels, w1, b1, params)
        else:
            loss = negative_sampling(mode = mode,
                                     output_embedding = w1,
                                     bias = b1,
                                     labels = labels,
                                     input_embedding_vector =input_embedding_vector,
                                     params = params)

    optimizer = tf.train.AdagradOptimizer( learning_rate = params['learning_rate'] )
    update_ops = tf.get_collection( tf.GraphKeys.UPDATE_OPS )

    with tf.control_dependencies( update_ops ):
        train_op = optimizer.minimize( loss, global_step= tf.train.get_global_step() )

    return tf.estimator.EstimatorSpec( mode, loss=loss, train_op=train_op )
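
Finally, a hedged sketch of how this model_fn could be wired into a TF Estimator; the param values, the 'negative_sampling' flag (anything other than 'HS' falls through to negative sampling here), and the dataset_input_fn name are placeholders, and the real training script lives in the linked repo.

params = {
    'model': 'CBOW', 'train_algo': 'negative_sampling',   # any value != 'HS' uses negative sampling
    'vocab_size': 10000, 'emb_size': 100, 'window_size': 2,
    'batch_size': 128, 'learning_rate': 0.01,
    'neg_sample_size': 5, 'freq_dict': {},                # freq_dict is only needed for 'HS'
}

estimator = tf.estimator.Estimator(model_fn=model_fn, params=params, model_dir='./checkpoint')
# estimator.train(input_fn=lambda: dataset_input_fn(params), max_steps=100000)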

You are welcome to leave messages, comments, and to roast the code.

Ref

  1. [Word2Vec A] Tomas Mikolov et al., 2013, Efficient Estimation of Word Representations in Vector Space
  2. [Word2Vec B] Tomas Mikolov et al., 2013, Distributed Representations of Words and Phrases and their Compositionality
  3. Yoav Goldberg, Omer Levy, 2014, word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
  4. Xin Rong, 2016, word2vec Parameter Learning Explained
  5. [Candidate Sampling]https://www.tensorflow.org/extras/candidate_sampling.pdf
  6. [Negative Sampling]Chris Dyer, 2014, Notes on Noise Contrastive Estimation and Negative Sampling
  7. https://github.com/chao-ji/tf-word2vec
  8. https://github.com/akb89/word2vec
  9. https://ruder.io/word-embeddings-softmax/index.html#negativesampling
  10. https://blog.csdn.net/lilong117194/article/details/82849054
