Recommended system practice 0x0c FM series

Logistic regression (LR)

Before introducing FM series, I would like to briefly introduce logistic regression. Generally speaking, logistic regression model can comprehensively use more information, such as users, items, context and other different characteristics to produce more comprehensive results. In addition, logistic regression regards the recommendation problem as a classification problem. Sort the items by predicting the probability of positive samples. The positive samples here can be users watching a video, users clicking on a commodity, or users playing a music, etc. The logistic regression model transforms the recommendation problem into the problem of CTR (click through rate) prediction.


Generally speaking, the recommendation process of logistic regression model is divided into the following steps:

  1. The user's age, gender and other information, commodity name, attribute and other information, as well as context and other information are converted into numerical feature vectors.
  2. Taking logistic regression as the optimization goal, the logistic regression model is trained by using sample data to adjust the internal parameters of the model.
  3. In the model service stage, input the feature vector into the model to obtain the probability of positive feedback such as user "click".
    4 sort the items according to the probability of positive feedback to get the recommendation list.

The logistic regression here also uses the gradient descent algorithm. Here I recommend one article The mathematical principle of logistic regression is specially introduced. Interested readers can continue to read. Another special thing to note is that logistic regression is a classification model, not a regression model.


  • It has specific mathematical meaning as support. Because the CTR model conforms to Bernoulli distribution, the use of logistic regression as CTR model conforms to the logical law.
  • Strong interpretability, able to locate various factors by weight and give the interpretable reasons of the results.
  • Actual engineering needs. Because of the characteristics of easy parallelization, simple model and low training cost, logistic regression has been widely recognized.


  • The expression ability is not strong, so it is impossible to carry out operations such as feature crossover and feature screening


POLY2 is the simplest feature crossover algorithm. It can be known by directly combining the features and looking at its mathematical form

\[\mathrm{POLY2}(w,x)=\sum_{j_1=1}^{n-1}\sum_{j_2=j_1+1}^{n}w_{h(j_1,j_2)}x_{j_1}x_{j_2} \]

Directly cross the features in pairs, and give weight to the cross feature combination. POLY2 is still a linear model, and the training method is no different from the logistic regression model.


  1. For many Internet data, one hot coding is usually used. The non selective feature crossover makes the feature vector more sparse, and the weight is lack of effective training, or even unable to converge.
  2. It is difficult to directly calculate the weight by one order of magnitude

Factorization Machines(FM)

In order to solve the limitation of POLY2, FM model uses two vector inner products instead of a single weight coefficient. FM model learns an implicit weight vector for each feature, and uses the inner product of the two implicit vectors as the weight of the cross feature when doing feature cross. As follows:

\[\mathrm{FM}(w,x)=\sum_{j_1=1}^{n-1}\sum_{j_2=j_1+1}^{n}(w_{j_1}w_{j_2})x_{j_1}x_{j_2} \]

The characteristic implicit vector introduced by FM is similar to the implicit vector in matrix decomposition. By introducing the feature hidden vector, the weight parameter of \ (n^2 \) level in POLY2 is reduced to \ (nk \), which greatly reduces the training cost.

In addition, due to the existence of feature hidden vectors, the model has the ability to calculate the weight of feature combination, such as a training sample of furniture and vegetables (table and tomato), so there is no need to appear table and tomato at the same time to learn this feature combination. In addition, when new samples appear, online services can also be carried out through the calculated feature hidden vector.

Similarly, FM can also use gradient descent method for learning without losing timeliness and flexibility. Let's take a look at how the PyTorch version of FM is implemented.

class FactorizationMachine(nn.Module):
    def __init__(self, reduce_sum=True):
        super(FactorizationMachine, self).__init__()
        self.reduce_sum = reduce_sum

    def forward(self, x):
        :param x: float tensor of size (batch_size, num_fields, embed_dim)
        square_of_sum = torch.sum(x, dim=1) ** 2
        sum_of_square = torch.sum(x ** 2, dim=1)
        ix = square_of_sum - sum_of_square
        if self.reduce_sum:
            ix = torch.sum(ix, dim=1, keepdim=True)
        return 0.5 * ix
import torch.nn.functional as F
from base import BaseModel
import torch as torch
import torch.nn as nn

from model.layers import *

class FM(BaseModel):

    def __init__(self, field_dims=None, embed_dim=None):
        self.linear = FeaturesLinear(field_dims)
        self.embedding = FeaturesEmbedding(field_dims, embed_dim) = FactorizationMachine(reduce_sum=True)

    def forward(self, x):
        x = self.linear(x) +
        x = torch.sigmoid(x.squeeze(1))
        return x

Field-aware Factorization Machine(FFM)

In order to solve the problem of data characteristic coefficient, FFM is further improved on the basis of FM and introduces the concept of domain, namely field, into the model. The features of the same domain are individually one-hot. Therefore, in FFM, each one-dimensional feature will learn an implicit variable for each domain of other features. The implicit variable is not only related to the feature, but also related to the domain.

\[\mathrm{FFM}(w,x)=\sum_{j_1=1}^{n-1}\sum_{j_2=j_1+1}^{n}(w_{j_1,f_2}w_{j_2,f_1})x_{j_1}x_{j_2} \]

According to my understanding, the concept of feature domain is actually introduced in the hope that each feature can be targeted and have a more appropriate weight for other features, that is, the weight distribution between learning domains as feature hidden variables. However, at the same time, the computational complexity has increased from \ (nk \) to \ (n^2k \), which needs to be weighed between effect and engineering investment in practical application.

Let's take a look at the relevant codes:

class FieldAwareFactorizationMachine(nn.Module):
    def __init__(self, field_dims, embed_dim):
        self.num_fields = len(field_dims)
        self.embeddings = nn.ModuleList([
            nn.Embedding(sum(field_dims), embed_dim) for _ in range(self.num_fields)
        self.offsets = np.array((0, *np.cumsum(field_dims)[:-1]), dtype=np.long)
        for embedding in self.embeddings:

    def forward(self, x):
        x = x + x.new_tensor(self.offsets).unsqueeze(0)
        xs = [self.embeddings[i](x) for i in range(self.num_fields)]
        ix = list()
        for i in range(self.num_fields-1):
            for j in range(i+1, self.num_fields):
                ix.append(xs[j][:, j] * xs[i][:, j])
        ix = torch.stack(ix, dim=1)
        return ix
from model.layers import *

class FFM(nn.Module):

    def __init__(self, field_dims, embed_dim):
        self.linear = FeaturesLinear(field_dims)
        self.ffm = FieldAwareFactorizationMachine(field_dims, embed_dim)

    def forward(self, x):
        ffm_term = torch.sum(torch.sum(self.ffm(x), dim=1), dim=1, keepdim=True)
        x = self.linear(x) + ffm_term
        return x.squeeze(1)

reference resources

[machine learning] logistic regression (very detailed)

Posted by salathe on Tue, 03 May 2022 06:15:44 +0300