A Little Machine Learning Knowledge Every Day -- Solutions to the Class Imbalance Problem

Preface

Class imbalance is a common problem in machine learning, and it can directly hurt a model's training results. This post covers several common ways to alleviate class imbalance. Throughout, the class with few samples is called the positive class, and the class with many samples the negative class.

1, Change threshold

1. Theoretical introduction

Take logistic regression as an example. Its decision rule can be written as: predict a positive example if

$\frac{y}{1-y} > \frac{m^{+}}{m^{-}}$

where $y$ is the predicted probability of the positive class and $m^{+}$, $m^{-}$ are the numbers of positive and negative training samples. This is in effect a "rescaling" of the classes (traditional logistic regression implicitly assumes $\frac{m^{+}}{m^{-}} = 1$):

$\frac{y^{\prime}}{1-y^{\prime}} = \frac{y}{1-y} \times \frac{m^{-}}{m^{+}}$

In other words, traditional logistic regression predicts positive when the output exceeds 0.5 and negative otherwise. When the classes are imbalanced and positive samples are scarce ($m^{+} < m^{-}$), the rescaled rule effectively lowers the probability threshold to $\frac{m^{+}}{m^{+}+m^{-}} < 0.5$, so the model flags the rare positive class more readily. For example, with $m^{+}/m^{-} = 1/4$, the rule $\frac{y}{1-y} > \frac{1}{4}$ is equivalent to predicting positive whenever the output exceeds 0.2.
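
For a quick numerical check (a minimal sketch that is not part of the original code; the function name rescale_probability and the sample counts are made up for illustration), the rescaled probability $y^{\prime}$ can be computed directly from the formula above:

def rescale_probability(y, m_pos, m_neg):
    # Solve y'/(1-y') = (y/(1-y)) * (m_neg/m_pos) for y'
    return y * m_neg / (y * m_neg + (1 - y) * m_pos)

# With 100 positive and 400 negative training samples, a raw output of 0.3
# rescales to about 0.63 > 0.5, so the sample is predicted positive.
print(rescale_probability(0.3, m_pos=100, m_neg=400))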

2. Code implementation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

data = load_breast_cancer() # Load the sample data (breast cancer detection): y=0 is malignant (the minority class) and y=1 is benign (the majority class)
X = pd.DataFrame(data.data,columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test =train_test_split(X,y,test_size=0.3,random_state=0)
model = LogisticRegression().fit(X_train,y_train)
y_pred0 = model.predict(X_test)  # Direct output category

res = model.predict_proba(X_test) # First output the probability, scale it, and then convert it to category
y_pred1 = []
r =  y_train.tolist().count(0)/y_train.tolist().count(1)        # Suppose 0 is a positive class and 1 is a negative class, where r stands for m+ / m-
for i in range(res.shape[0]):
    if res[i][0] / res[i][1] > r:
        y_pred1.append(0)
    else:
        y_pred1.append(1)
print('Similarity ratio of two outputs:%.3f'%(accuracy_score(y_pred0,y_pred1)))
Similarity ratio of two outputs: 0.994

It can be seen that the predictions before and after adjustment are almost identical, because the dataset used here is not severely imbalanced.

2, Sampling method

1. Theoretical introduction

Although the idea of rescaling is simple, it does not always work well in practice. This is mainly because the class distribution of the real data may differ from that of the training data, so a threshold chosen from the training samples may not suit the real samples. In fact, the most intuitive and common way to deal with sample imbalance is to change the number of samples of one class so that the classes become balanced. There are two ways to do this: the first is to reduce the majority (negative) samples, called undersampling; the other is to add minority (positive) samples, called oversampling.

1.1 Undersampling

The idea of undersampling is to reduce the majority (negative) samples so that the numbers of positive and negative samples are balanced. There are three common ways to implement it:
(1) Random undersampling: randomly select a subset of the majority samples to participate in model training.
(2) Cluster the majority samples first (e.g. with KMeans), then select some samples from each cluster to participate in model training, similar to stratified sampling. The main idea is to pick a representative subset of the majority class.
(3) Split the negative samples into k parts, train k models, and then ensemble the k models (EasyEnsemble).
Because it picks samples at random, random undersampling may discard important information. The cluster-then-sample approach tries to keep representative samples, but some information is still lost during training. EasyEnsemble uses an ensemble-learning mechanism: the negative samples are divided into several subsets, one per learner, so each learner is undersampled locally but no important information is lost globally. Its disadvantage is the extra computational cost.

1.2 Oversampling

The idea of oversampling is to add minority (positive) samples so that the numbers of positive and negative samples are balanced. There are two common ways to implement it:
(1) Random oversampling: repeatedly draw samples at random from the minority class until the positive and negative samples are balanced, similar to bootstrap sampling.
(2) SMOTE oversampling: create new positive samples by linear combination (interpolation) of existing ones. Roughly: first randomly pick a positive sample $i$; then find the $k$ nearest positive neighbours of $i$ (e.g. with a kNN search, $k=3$); randomly choose one of those neighbours $j$; finally combine $i$ and $j$ linearly to obtain a new sample $z = \lambda i + (1-\lambda) j$, where $\lambda \in (0,1)$.
Random oversampling introduces many repeated samples and therefore overfits easily; SMOTE generates new samples as linear combinations of positive samples, so it may amplify the influence of noise on the model.

2. Code implementation

2.1 Undersampling

X and y are the same as above (class 1 is the majority (negative) class and class 0 is the minority (positive) class).
(1) Random undersampling

index = np.where(y==1)[0]  # Indices of all negative (majority) samples
index_sample = np.random.choice(index, size=len(np.where(y==0)[0]), replace=False)  # Randomly draw as many negative samples as there are positive ones; replace=False means sampling without replacement
X_sample = X.iloc[index_sample.tolist()+np.where(y==0)[0].tolist() , :]  # Final X; don't forget to add the indices of the positive samples
y_sample = y[index_sample.tolist()+np.where(y==0)[0].tolist()]   # Final y

(2) Clustering undersampling (stratified sampling)

from sklearn.cluster import KMeans
index = np.where(y==1)[0]  # Find the index of all negative class (majority) samples
y_cluster = y[index]
X_cluster = X.iloc[index,:]
model = KMeans(n_clusters=3).fit(X_cluster)
print(model.labels_)
[2 1 0 1 0 1 2 1 2 1 1 1 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 2 2 2 2 1 0 1 0 1
 0 0 1 1 1 0 1 2 0 0 1 0 1 2 1 2 2 1 2 1 1 0 0 1 1 0 1 2 2 2 1 0 0 0 2 1 2
 1 1 1 1 2 0 2 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 2 1 1 0 2 2 0 2 2 0 2
 1 1 1 0 2 2 2 1 1 2 0 1 1 0 1 1 0 2 1 0 2 1 0 1 1 2 2 1 1 1 1 1 0 1 2 2 1
 1 1 2 0 2 0 1 0 1 1 1 0 2 2 1 2 1 1 0 1 1 0 1 0 1 1 1 2 1 1 0 1 1 1 0 2 0
 0 1 0 1 2 1 1 1 1 1 1 2 0 0 1 1 1 2 2 1 2 2 2 0 2 2 1 0 1 1 1 1 2 1 0 0 1
 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 0 2 1 0 1 1 1 2 1 2 0 0 0 1 0 1 1 2 1 2 2 2
 1 2 0 1 2 2 1 1 2 1 2 1 1 1 0 2 1 2 2 2 0 1 0 1 2 1 0 1 2 2 1 1 2 2 2 2 1
 2 1 1 2 1 1 2 1 1 2 1 0 0 1 0 2 1 2 2 1 1 1 0 0 2 0 0 2 1 2 1 1 1 2 0 1 0
 0 1 2 2 1 2 2 0 0 0 1 0 0 1 0 1 0 0 0 2 1 2 0 0]
 
index_sample = []
# Then select some samples from each cluster
for i in range(3):
    temp_index = index[np.where(model.labels_==i)[0]]
    index_sample += np.random.choice(temp_index, size=int(len(np.where(y==0)[0])/3), replace=False).tolist()
X_sample = X.iloc[index_sample+np.where(y==0)[0].tolist() , :]  # Final X; don't forget to add the indices of the positive samples (index_sample is already a list, so no .tolist() needed)
y_sample = y[index_sample+np.where(y==0)[0].tolist()]   # Final y

(3) EasyEnsemble

index = np.where(y==1)[0]  # Find the index of all negative samples
X_sample = []
y_sample = []
Len = len(index) // 10   # Size of each negative-sample subset
for i in range(10):  # 10 is the number of subsets the negative samples are split into, i.e. the number of models to train
    X_temp = np.vstack((X.iloc[index,:].values[i*Len:(i+1)*Len], X.iloc[np.where(y==0)[0]].values))
    y_temp = y[index][i*Len:(i+1)*Len].tolist() + y[np.where(y==0)[0]].tolist()
    X_sample.append(X_temp)
    y_sample.append(y_temp)

Here X_sample and y_sample each contain 10 groups of data, and within each group the positive and negative samples are roughly balanced (controlled by the parameter 10 above). We then train one model per group and ensemble the 10 models, so no important information is lost globally, as sketched below.
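
For completeness, here is a minimal sketch of that ensembling step (not in the original post): one LogisticRegression per group, combined by majority vote. X_test from the first section is reused purely to show the mechanics; since the groups above were built from the full X and y, this is not a clean evaluation.

from sklearn.linear_model import LogisticRegression

models = []
for X_i, y_i in zip(X_sample, y_sample):   # One learner per under-sampled group
    models.append(LogisticRegression(max_iter=1000).fit(X_i, y_i))  # max_iter raised to avoid convergence warnings

votes = np.array([m.predict(X_test.values) for m in models])  # Shape (10, n_test): each row is one model's predictions
y_pred_ensemble = (votes.mean(axis=0) >= 0.5).astype(int)     # Majority vote: predict 1 when at least half the models predict 1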

2.2 Oversampling

All X and y are the same as above
(1) Random oversampling

index = np.where(y==0)[0]  # Indices of all positive (minority) samples
index_sample = np.random.choice(index, size=100)  # Draw 100 positive samples with replacement; change 100 to the number of new samples you want
X_sample = X.iloc[index_sample]
y_sample = [0] * 100
X_sample =np.vstack((X.values, X_sample.values))  # Final X, full sample plus new sample
y_sample = y.tolist() + y_sample  # Final y

(2) SMOTE sampling

from sklearn.neighbors import KNeighborsClassifier
X_sample, y_sample = [], []
index = np.where(y==0)[0]  # Indices of all positive (minority) samples
model = KNeighborsClassifier(n_neighbors=4).fit(X.iloc[index,:], y[index])  # Nearest-neighbour search among the positive samples only
smote_index = np.random.choice(index, size=100, replace=False) # Change 100 to the number of samples you want to add
for ind in smote_index:
    neig_index = model.kneighbors(X.iloc[ind,:].values.reshape(1,-1), return_distance=False)[0]  # Positions within the positive subset
    i = index[neig_index[0]]  # The sample itself, mapped back to an index in the full X
    j = index[np.random.choice(neig_index[1:], size=1)[0]]  # One of its positive neighbours, mapped back as well
    z = 0.5*X.iloc[i,:].values + 0.5*X.iloc[j,:].values # Interpolate to produce a new sample (lambda fixed at 0.5 here; in general lambda is drawn from (0,1))
    X_sample.append(z.tolist())
    y_sample.append(0)
    
X_sample =np.vstack((X.values, np.array(X_sample)))  # Final X, full sample plus new sample
y_sample = y.tolist() + y_sample  # Final y
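
As an aside, the third-party imbalanced-learn package (an assumption about your environment; it is not used anywhere above) ships ready-made implementations of random over/undersampling and SMOTE, so the manual loops can be replaced with something like:

from imblearn.over_sampling import SMOTE

# Oversample the minority class until both classes are the same size
X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)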

3, Change sample weight

1. Theoretical introduction

The third method is to change the sample weights. Intuitively, the fewer samples a class has, the larger its weight should be, so we can use the reciprocal of the class frequency as the sample weight:

$weight_i = \frac{|N|}{|N_i|}$ is the weight of a class-$i$ sample, where $|N|$ is the total number of samples and $|N_i|$ is the number of samples in class $i$. This way, classes with few samples receive higher weights.

2. Code implementation

sklearn's estimators accept a sample_weight argument at training time, which we can pass directly to fit(). Let's take logistic regression as an example.

# Inverse-frequency weights: weight_i = |N| / |N_i|, so the minority class gets the larger weight
weight = np.array([len(y_train)/len(y_train[y_train==0]), len(y_train)/len(y_train[y_train==1])])
sample_weight = np.where(y_train==0, weight[0], weight[1])  # Per-sample weight determined by the sample's class
model = LogisticRegression().fit(X_train, y_train, sample_weight=sample_weight)
y_pred = model.predict(X_test)
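
Alternatively, scikit-learn can derive inverse-frequency weights for us via the class_weight option on the estimator, which avoids building sample_weight by hand (shown here as an equivalent shortcut, not as the method used above):

# class_weight='balanced' weights each class by n_samples / (n_classes * n_samples_in_class), i.e. proportionally to |N| / |N_i|
model = LogisticRegression(class_weight='balanced').fit(X_train, y_train)
y_pred = model.predict(X_test)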

Summary

This post introduced several common methods for dealing with class imbalance. Each has its own advantages and disadvantages, and they can also be combined. There are other options as well, such as choosing models that are less sensitive to class proportions, or recasting the classification problem as anomaly detection (worth considering when the classes are extremely imbalanced, since the positive class is then so rare that the methods above tend to fail), and so on; a rough sketch of the anomaly-detection idea follows.
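
As a rough illustration of that last idea (a sketch only; the post does not implement it, and IsolationForest is just one possible choice), one can fit an anomaly detector on the majority class alone and treat flagged outliers as the rare class:

from sklearn.ensemble import IsolationForest

# Fit on the majority (negative, y=1) class only and flag outliers as the rare class
iso = IsolationForest(random_state=0).fit(X_train[y_train == 1])
outlier_flags = iso.predict(X_test)               # +1 for inliers, -1 for outliers
y_pred_rare = (outlier_flags == -1).astype(int)   # 1 marks a sample suspected to belong to the rare class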
