Word2vec is a model proposed by Google in 2013 that trains word vectors on a large-scale corpus. It is applied in many scenarios, such as information extraction and similarity calculation. Starting from word2vec, embeddings became popular across all kinds of fields, so word2vec makes an appropriate opening. This post aims to give a fairly comprehensive overview of word2vec, covering the model structure, derivation, training, and the implementation details based on the TF Estimator framework. Full code is here: https://github.com/DSXiangLi/Embedding

## Model overview

The structure of the word2vec model is relatively simple: to make training on large-scale data feasible, model complexity is reduced by removing the nonlinear hidden layer. Depending on the input and output forms, it comes in two variants: CBOW and SG.

Let's first simplify the problem to a 1v1 bigram problem: word i is the context and word j is the target. V is the vocabulary size, N is the length of the word vector, D is the set of training word pairs, and the input \(x_i \in R^{1 \times V}\) is a one-hot vector.

The model trains two weight matrices: \(W \in R^{V \times N}\) is the input matrix, where each row is the word vector of an input word, and \(W^{'} \in R^{V \times N}\) is the output matrix, where each row is the word vector of an output word. The co-occurrence information of word i and word j is expressed through the inner product of their word vectors, and the probability of each word is obtained by softmax as follows

$$
\begin{align}
h &= W^Tx_i = v_{w_I}^T \\
u_j &= v_{w_j}^{'T}h \\
p(w_j|w_I) &= \frac{exp(u_j)}{\sum_{j^{'}=1}^Vexp(u_{j^{'}})}
\end{align}
$$

For each training sample, the goal of the model is to maximize the conditional probability \(p(w_j|w_I)\), so the log loss function is as follows

$$
\begin{align}
E &= -log \, p(w_j|w_I) \\
&= -u_j + log\sum_{j^{'}=1}^Vexp(u_{j^{'}})
\end{align}
$$
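As a quick sanity check, the bigram forward pass can be sketched in numpy. The vocabulary size, embedding dimension, and word indices below are toy values of my own choosing, not anything from the repo:

```python
import numpy as np

V, N = 5, 3                        # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input matrix, row i is v_{w_i}
W_out = rng.normal(size=(V, N))    # output matrix, row j is v'_{w_j}

def forward(i, j):
    """Return p(.|w_i) over the vocabulary and the log loss for target w_j."""
    h = W[i]                                 # hidden layer = input word vector
    u = W_out @ h                            # scores u_j = v'_{w_j}^T h
    p = np.exp(u) / np.exp(u).sum()          # softmax over the whole vocabulary
    loss = -u[j] + np.log(np.exp(u).sum())   # E = -u_j + log sum_j' exp(u_j')
    return p, loss

p, loss = forward(1, 3)
```

On these toy values it is easy to verify that the loss computed from the scores equals \(-log \, p(w_j|w_I)\).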

### CBOW: Continuous Bag of Words

CBOW expands the bigram input context to the 2*window_size words around the target word, using the context before and after the center word to predict the center word.

Compared with bigram, CBOW only adds one operation: after the 2*window_size input words are mapped to word vectors, average pooling produces a single 1*N input vector, so the difference is only in the calculation of h. Assume \(C = 2 \cdot \text{window\_size}\)

$$
\begin{align}
h & = \frac{1}{C}W^T(x_1 + x_2 + ... + x_C) \\
& = \frac{1}{C}(v_{w1} + v_{w2} + ... + v_{wC})^T \\
E &= -log \, p(w_O|w_{I,1}...w_{I,C}) \\
& = -u_j^* + log\sum_{j^{'}=1}^Vexp(u_{j^{'}})
\end{align}
$$
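The pooling step is a one-liner in numpy (sizes and context indices are again toy values):

```python
import numpy as np

V, N, C = 5, 3, 4               # toy vocab size, embedding dim, C = 2*window_size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))     # input embedding matrix

context = [0, 1, 3, 4]          # indices of the C context words
h = W[context].mean(axis=0)     # h = (v_{w1} + ... + v_{wC}) / C
```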

### SG: Skip-Gram

SG expands the bigram output target to the 2*window_size words around the input word, using the center word to predict the occurrence probability of the surrounding words.

Compared with bigram, the difference in SG is that the output is no longer one multinomial distribution but C of them, one per context position

$$
\begin{align}
E &= -log \, p(w_{O,1},...,w_{O,C}|w_I) \\
&= -\sum_{c=1}^C u_{j_c^*} + C \cdot log\sum_{j^{'}=1}^Vexp(u_{j^{'}})
\end{align}
$$

## Model derivation: how is word embedding obtained?

Let's derive from back propagation how word vectors are learned in the model structure above. To simplify, consider bigram first, with \(\eta\) as the learning rate.

First, update the hidden->output word vectors \(W^{'}\)

$$
v_{w_j}^{'(new)} = v_{w_j}^{'(old)} - \eta \cdot e_j \cdot h \quad \text{where} \; e_j = y_j - t_j
$$

\(e_j\) is the prediction probability error of word j, so the update of \(W^{'}\) can be understood as follows: if word j is overestimated, subtract \(\eta \cdot e_j \cdot h\) from \(v_{w_j}^{'}\) to reduce the vector inner product (similarity) of h and \(v_{w_j}^{'}\); if it is underestimated, add \(\eta \cdot e_j \cdot h\) onto \(v_{w_j}^{'}\) to increase the inner-product similarity. The greater the error, the greater the update.

Then the input->hidden word vectors W are updated

$$
v_{w_I}^{(new)} = v_{w_I}^{(old)} - \eta \cdot \sum_{j=1}^V e_j \cdot v_{w_j}^{'}
$$

The word vector \ (v_{wI} \) corresponding to each input word is updated with the vector obtained by the weighted average of the output word vectors of all words according to the prediction error. The same as the above logic, the overestimated word is subtracted, the underestimated word is added, and then the input word vector is updated by weighting according to the error size.

Therefore, during learning the input word vectors update the output word vectors, the output word vectors update the input word vectors, and going back and forth the model reaches a steady state.
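The two update rules can be put together into a minimal numpy sketch of one SGD step for the bigram case; the learning rate, sizes, and training pair below are assumed toy values:

```python
import numpy as np

def sgd_step(W, W_out, i, j, eta=0.1):
    """One SGD update for the training pair (input word i, target word j)."""
    h = W[i].copy()
    u = W_out @ h
    y = np.exp(u) / np.exp(u).sum()          # predicted distribution
    e = y.copy()
    e[j] -= 1.0                              # e_j = y_j - t_j, per-word error
    EH = e @ W_out                           # error-weighted sum of output vectors
    W_out -= eta * np.outer(e, h)            # hidden->output: v'_j -= eta*e_j*h
    W[i] -= eta * EH                         # input->hidden: v_{w_I} -= eta*EH
    return -np.log(y[j])                     # log loss before the update

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
W_out = rng.normal(size=(5, 3))
losses = [sgd_step(W, W_out, 1, 3) for _ in range(20)]
```

Repeating the step on the same pair shows the two matrices co-adapting: the loss shrinks over the iterations.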

The only change in extending bigram to CBOW is that when updating the input->hidden word vectors, instead of updating a single word's vector each time, the word vectors of all C context words are updated simultaneously with the same magnitude (each receiving 1/C of the gradient).

When extending bigram to SG, the only change is that when updating the hidden->output word vectors, it is no longer the prediction error of a single word j, but the sum of the prediction errors over the C context positions.

## Model training

Although the model structure has been optimized by removing the nonlinear hidden layer, training is still not efficient. The bottleneck is that word2vec is essentially a multi-class classification task over the entire vocabulary, so computing \(p(w_j|w_I) = \frac{exp(u_j)}{\sum_{j^{'}=1}^Vexp(u_{j^{'}})}\) requires the probability of the whole vocabulary every time, at \(O(V \cdot N)\) cost. Even if a batch has only one training sample, the whole hidden->output embedding matrix needs to be updated. There are two solutions to this problem.

### Hierarchical Softmax

If softmax is regarded as a one-layer tree with each word as a leaf node, then because normalization is required, the complexity of computing each word's probability is \(O(V)\). Hierarchical softmax simply turns the one layer into multiple layers. Without increasing the embedding size (the tree has V leaf nodes and V-1 inner nodes), it reduces the complexity of computing each word's probability to \(O(logV)\), computing the probability directly from the path from root to leaf. The author builds the tree as a Huffman tree, whose advantage is that the root-to-leaf path of high-frequency words is shorter than that of low-frequency words, which further speeds up training; see any introduction to Huffman coding for details.

For example, in the following figure (picture source), the path to the leaf Horse goes left, right, left:

$$ \begin{align} P(Horse) &= P(0,left)\cdot P(1,right)\cdot P(2,left) \end{align} $$

How is the \(P(0,left)\) above calculated?

Each inner node has its own embedding \(v_{n(w,j)}^{'}\), the embedding of the j-th node on the path of word w, and the inner product between the input and output word becomes the inner product between the input word and the nodes. The probability of each word is calculated as follows

$$
p(w=w_O) = \prod_{j=1}^{L(w)-1}\sigma\big([n(w,j+1) = ch(n(w,j))] \cdot v_{n(w,j)}^{'T}h\big)
$$

I have to say, this formula looks like it was written to keep people from understanding it >_<

What is \([n(w,j+1) = ch(n(w,j))]\)? ch denotes the left child, and \([\cdot]\) simply judges whether the path goes left or right: it evaluates to 1 for left and -1 for right.

Therefore

$$
P(Horse) = \sigma(v_{n(w,0)}^{'T}h) \cdot \sigma(-v_{n(w,1)}^{'T}h) \cdot \sigma(v_{n(w,2)}^{'T}h)
$$

Corresponding to the model derivation above, the hidden->output part changes and the loss function becomes the following

$$
E = -\sum_{j=1}^{L(w)-1}log \, \sigma\big([n(w,j+1) = ch(n(w,j))] \cdot v_{n(w,j)}^{'T}h\big)
$$

The embedding of every node on the output word's path is updated, and the prediction task becomes: at each inner node on the path, should we go left or right?
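The Horse path above can be checked numerically; the node embeddings and h below are random toy values, with +1/-1 encoding left/right as in the indicator function:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
h = rng.normal(size=3)              # input word vector
node_emb = rng.normal(size=(3, 3))  # v'_n for the 3 inner nodes on the path
signs = np.array([1.0, -1.0, 1.0])  # left, right, left

# P(Horse) = P(0,left) * P(1,right) * P(2,left)
p_horse = np.prod(sigmoid(signs * (node_emb @ h)))
```

Because \(\sigma(x) + \sigma(-x) = 1\), the left/right probabilities at every inner node sum to 1, so the leaf probabilities are normalized without ever touching the full vocabulary.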

A simple implementation of the Huffman tree and hierarchical softmax is as follows

```python
import heapq

import tensorflow as tf


class TreeNode(object):
    total_node = 0

    def __init__(self, frequency, char=None, word_index=None, is_leaf=False):
        self.frequency = frequency
        self.char = char  # word character
        self.word_index = word_index  # word look up index
        self.left = None
        self.right = None
        self.is_leaf = is_leaf
        self.counter(is_leaf)

    def counter(self, is_leaf):
        # node_index will be used for embedding_lookup
        self.node_index = TreeNode.total_node
        if not is_leaf:
            TreeNode.total_node += 1

    def __lt__(self, other):
        return self.frequency < other.frequency

    def __repr__(self):
        if self.is_leaf:
            return 'Leaf Node char = [{}] index = {} freq = {}'.format(self.char, self.word_index, self.frequency)
        else:
            return 'Inner Node [{}] freq = {}'.format(self.node_index, self.frequency)


class HuffmanTree(object):
    def __init__(self, freq_dic):
        self.nodes = []
        self.root = None
        self.max_depth = None
        self.freq_dic = freq_dic
        self.all_paths = {}
        self.all_codes = {}
        self.node_index = 0

    @staticmethod
    def merge_node(left, right):
        parent = TreeNode(left.frequency + right.frequency)
        parent.left = left
        parent.right = right
        return parent

    def build_tree(self):
        """
        Build huffman tree with word being leaves
        """
        TreeNode.total_node = 0  # avoid train_and_evaluate has different node_index
        heap_nodes = []
        for word_index, (char, freq) in enumerate(self.freq_dic.items()):
            tmp = TreeNode(freq, char, word_index, is_leaf=True)
            heapq.heappush(heap_nodes, tmp)
        while len(heap_nodes) > 1:
            node1 = heapq.heappop(heap_nodes)
            node2 = heapq.heappop(heap_nodes)
            heapq.heappush(heap_nodes, HuffmanTree.merge_node(node1, node2))
        self.root = heapq.heappop(heap_nodes)

    @property
    def num_node(self):
        return self.root.node_index + 1

    def traverse(self):
        """
        Compute all node to leaf path and direction: list of node_id, list of 0/1
        """
        def dfs_helper(root, path, code):
            if root.is_leaf:
                self.all_paths[root.word_index] = path
                self.all_codes[root.word_index] = code
                return
            if root.left:
                dfs_helper(root.left, path + [root.node_index], code + [0])
            if root.right:
                dfs_helper(root.right, path + [root.node_index], code + [1])

        dfs_helper(self.root, [], [])
        self.max_depth = max([len(i) for i in self.all_codes.values()])


class HierarchySoftmax(HuffmanTree):
    def __init__(self, freq_dic):
        super(HierarchySoftmax, self).__init__(freq_dic)

    def convert2tensor(self):
        # padded to max_depth and convert to tensor
        # INVALID_INDEX is a padding constant defined elsewhere in the repo
        with tf.name_scope('hstree_code'):
            self.code_table = tf.convert_to_tensor(
                [code + [INVALID_INDEX] * (self.max_depth - len(code))
                 for word, code in sorted(self.all_codes.items(), key=lambda x: x[0])],
                dtype=tf.float32)
        with tf.name_scope('hstree_path'):
            self.path_table = tf.convert_to_tensor(
                [path + [INVALID_INDEX] * (self.max_depth - len(path))
                 for word, path in sorted(self.all_paths.items(), key=lambda x: x[0])],
                dtype=tf.int32)

    def get_loss(self, input_embedding_vector, labels, output_embedding, output_bias, params):
        """
        :param input_embedding_vector: [batch * emb_size]
        :param labels: word index [batch * 1]
        :param output_embedding: entire embedding matrix []
        :return: loss
        """
        loss = []
        labels = tf.unstack(labels, num=params['batch_size'])  # list of [1]
        inputs = tf.unstack(input_embedding_vector, num=params['batch_size'])  # list of [emb_size]

        for label, input in zip(labels, inputs):
            path = self.path_table[tf.squeeze(label)]  # (max_depth,)
            code = self.code_table[tf.squeeze(label)]  # (max_depth,)

            path = tf.boolean_mask(path, tf.not_equal(path, INVALID_INDEX))  # (real_path_length,)
            code = tf.boolean_mask(code, tf.not_equal(code, INVALID_INDEX))  # (real_path_length,)

            output_embedding_vector = tf.nn.embedding_lookup(output_embedding, path)  # real_path_length * emb_size
            bias = tf.nn.embedding_lookup(output_bias, path)  # (real_path_length,)

            logits = tf.matmul(tf.expand_dims(input, axis=0),
                               tf.transpose(output_embedding_vector)) + bias  # (1, emb_size) * (emb_size, real_path_length)

            loss.append(tf.nn.sigmoid_cross_entropy_with_logits(labels=code, logits=tf.squeeze(logits)))

        loss = tf.reduce_mean(tf.concat(loss, axis=0), axis=0, name='hierarchy_softmax_loss')  # batch -> scalar

        return loss
```

### Negative Sampling

Negative sampling is more intuitive to understand. Since the goal of the model is to train high-quality word embeddings, i.e. the input word embeddings, it doesn't matter whether each batch updates all output word embeddings; we can update only K of them at a time. The positive sample is retained, and K groups of negative samples are drawn for training. The model then only needs to distinguish positive from negative samples, which bypasses the normalization over all V words and successfully reduces the multi-class problem to a binary classification problem. The author suggests K=5~20 for small datasets and K=2~5 for large ones.
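A minimal numpy sketch of the resulting binary objective for one training pair; the sizes, the positive index, and the unigram^(3/4) noise distribution used for drawing negatives are assumptions following the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(h, W_out, pos, negs):
    """E = -log sigma(v'_pos . h) - sum_k log sigma(-v'_neg_k . h)"""
    loss = -np.log(sigmoid(W_out[pos] @ h))
    for k in negs:
        loss -= np.log(sigmoid(-(W_out[k] @ h)))
    return loss

rng = np.random.default_rng(0)
V, N, K = 100, 8, 5
h = rng.normal(size=N)
W_out = rng.normal(size=(V, N))

# negatives drawn from the unigram distribution raised to the 3/4 power,
# the noise distribution suggested in the paper
freq = rng.integers(1, 100, size=V).astype(float)
probs = freq ** 0.75 / (freq ** 0.75).sum()
negs = rng.choice(V, size=K, replace=False, p=probs)

loss = neg_sampling_loss(h, W_out, pos=7, negs=negs)
```

Only the K+1 rows of W_out touched here carry gradient, which is exactly why the update cost no longer scales with V.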

Corresponding to the model derivation above, the hidden->output part changes and the loss function becomes

$$
E = -log \, \sigma(v_{w_O}^{'T}h) - \sum_{w_k \in W_{neg}}log \, \sigma(-v_{w_k}^{'T}h)
$$

Only the K sampled embeddings (plus the positive one) are updated per iteration.

In the input->hidden part, only the error-weighted vectors of these K sampled words are used to update the input embedding.

tensorflow has several candidate-sampling implementations. The two commonly used ones are tf.nn.sampled_softmax_loss and tf.nn.nce_loss, which call the same sampling function. The difference is that sampled_softmax_loss computes a softmax (exclusive single-label classification), while nce_loss computes a logistic loss (non-exclusive multi-label classification). Both implementations differ slightly from the paper's negative sampling; for details see Notes on Noise Contrastive Estimation and Negative Sampling. A rough comparison is that NCE is more suitable for skip-gram and sampled softmax for CBOW, though the more concrete differences still need to be verified in practice.

### Subsampling

Another key point of the paper is subsampling. For high-frequency words, extra training samples no longer improve performance, so these samples can be downsampled. t is the word-frequency threshold and \(f(w_i)\) is the frequency of the word in the corpus; every word with frequency above t is discarded with the following probability

$$
P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}
$$
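A quick sketch of the paper's discard probability \(P(w_i) = 1 - \sqrt{t/f(w_i)}\), with t = 1e-5 as suggested in the paper and made-up frequencies:

```python
import math

def discard_prob(freq, t=1e-5):
    """P(discard w) = 1 - sqrt(t / f(w)), applied only when f(w) > t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# a word taking 1% of the corpus is dropped ~97% of the time,
# while a word at the threshold frequency is never dropped
very_frequent = discard_prob(0.01)
at_threshold = discard_prob(1e-5)
```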

## Model implementation

From hands-on experience, the more complex part of word2vec is not the model... it's the input pipeline and the loss function. So in the implementation I also tried to decouple the dataset and model_fn from the training part as much as possible. Only the core of model_fn is given below.

```python
# INVALID_INDEX, add_layer_summary and negative_sampling are helpers
# defined elsewhere in the repo
def avg_pooling_embedding(embedding, features, params):
    """
    :param features: (batch, 2*window_size)
    :param embedding: (vocab_size, emb_size)
    :return: input_embedding : average pooling of context embedding
    """
    input_embedding = []
    samples = tf.unstack(features, params['batch_size'])
    for sample in samples:
        sample = tf.boolean_mask(sample, tf.not_equal(sample, INVALID_INDEX), axis=0)  # (real_size,)
        tmp = tf.nn.embedding_lookup(embedding, sample)  # (real_size, emb_size)
        input_embedding.append(tf.reduce_mean(tmp, axis=0))  # (emb_size,)

    input_embedding = tf.stack(input_embedding, name='input_embedding_vector')  # batch * emb_size

    return input_embedding


def model_fn(features, labels, mode, params):
    if params['train_algo'] == 'HS':
        # If Hierarchy Softmax is used, initialize a huffman tree first
        hstree = HierarchySoftmax(params['freq_dict'])
        hstree.build_tree()
        hstree.traverse()
        hstree.convert2tensor()

    if params['model'] == 'CBOW':
        features = tf.reshape(features, shape=[-1, 2 * params['window_size']])
        labels = tf.reshape(labels, shape=[-1, 1])
    else:
        features = tf.reshape(features, shape=[-1, ])
        labels = tf.reshape(labels, shape=[-1, 1])

    with tf.variable_scope('initialization'):
        w0 = tf.get_variable(shape=[params['vocab_size'], params['emb_size']],
                             initializer=tf.truncated_normal_initializer(),
                             name='input_word_embedding')
        if params['train_algo'] == 'HS':
            w1 = tf.get_variable(shape=[hstree.num_node, params['emb_size']],
                                 initializer=tf.truncated_normal_initializer(),
                                 name='hierarchy_node_embedding')
            b1 = tf.get_variable(shape=[hstree.num_node],
                                 initializer=tf.random_uniform_initializer(),
                                 name='bias')
        else:
            w1 = tf.get_variable(shape=[params['vocab_size'], params['emb_size']],
                                 initializer=tf.truncated_normal_initializer(),
                                 name='output_word_embedding')
            b1 = tf.get_variable(shape=[params['vocab_size']],
                                 initializer=tf.random_uniform_initializer(),
                                 name='bias')
        add_layer_summary(w0.name, w0)
        add_layer_summary(w1.name, w1)
        add_layer_summary(b1.name, b1)

    with tf.variable_scope('input_hidden'):
        # batch_size * emb_size
        if params['model'] == 'CBOW':
            input_embedding_vector = avg_pooling_embedding(w0, features, params)
        else:
            input_embedding_vector = tf.nn.embedding_lookup(w0, features, name='input_embedding_vector')
        add_layer_summary(input_embedding_vector.name, input_embedding_vector)

    with tf.variable_scope('hidden_output'):
        if params['train_algo'] == 'HS':
            loss = hstree.get_loss(input_embedding_vector, labels, w1, b1, params)
        else:
            loss = negative_sampling(mode=mode,
                                     output_embedding=w1,
                                     bias=b1,
                                     labels=labels,
                                     input_embedding_vector=input_embedding_vector,
                                     params=params)

    optimizer = tf.train.AdagradOptimizer(learning_rate=params['learning_rate'])
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

    with tf.control_dependencies(update_ops):
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
```

You are welcome to leave comments, ask questions, and roast the code.

## Ref

- [Word2Vec A] Tomas Mikolov et al, 2013, Efficient Estimation of Word Representations in Vector Space
- [Word2Vec B] Tomas Mikolov et al, 2013, Distributed Representations of Words and Phrases and their Compositionality
- Yoav Goldberg, Omer Levy, 2014, word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
- Xin Rong, 2016, word2vec Parameter Learning Explained
- [Candidate Sampling]https://www.tensorflow.org/extras/candidate_sampling.pdf
- [Negative Sampling]Chris Dyer, 2014, Notes on Noise Contrastive Estimation and Negative Sampling
- https://github.com/chao-ji/tf-word2vec
- https://github.com/akb89/word2vec
- https://ruder.io/word-embeddings-softmax/index.html#negativesampling
- https://blog.csdn.net/lilong117194/article/details/82849054