Introduction to the sklearn Library for Machine Learning in Python

1. Introduction

sklearn (scikit-learn) is a machine learning toolkit for the Python language and is usually the first tool reached for in a machine learning project. It comes with a large number of built-in datasets for practicing the various machine learning algorithms, and it integrates a very comprehensive set of tools: data preprocessing, feature selection, feature dimensionality reduction, classification / regression / clustering models, model evaluation, and more.

2. sklearn data types

Machine learning ultimately operates on numbers, but the raw data may arrive in different forms, such as matrices, text, images, video, or audio.

3. sklearn overview

Datasets

  • sklearn.datasets
  1. Small datasets (loaded locally): datasets.load_xxx()
  2. Large datasets (downloaded online): datasets.fetch_xxx()
  3. Locally generated (synthetic) datasets: datasets.make_xxx()
Dataset - Description
load_iris() - Iris dataset: 3 classes, 4 features, 150 samples
load_boston() - Boston house-price dataset: 13 features, 506 samples
load_digits() - Handwritten digits dataset: 10 classes, 64 features, 1797 samples
load_breast_cancer() - Breast cancer dataset: 2 classes, 30 features, 569 samples
load_diabetes() - Diabetes dataset: 10 features, 442 samples
load_wine() - Wine dataset: 3 classes, 13 features, 178 samples
load_files() - Load a custom text classification dataset
load_linnerud() - Linnerud physical-exercise dataset: 3 features, 20 samples
load_sample_image() - Load a single sample image
load_svmlight_file() - Load data in svmlight format
make_blobs() - Generate a multi-class, single-label dataset
make_biclusters() - Generate a dataset with constant block-diagonal structure for biclustering
make_checkerboard() - Generate a checkerboard-structured array for biclustering
make_circles() - Generate a 2D binary classification dataset (concentric circles)
make_classification() - Generate a multi-class, single-label dataset
make_friedman1() - Generate a regression dataset using polynomial and sine transforms
make_gaussian_quantiles() - Generate an isotropic Gaussian dataset labeled by quantiles
make_hastie_10_2() - Generate a 10-dimensional binary classification dataset
make_low_rank_matrix() - Generate a low-rank matrix with bell-shaped singular values
make_moons() - Generate a 2D binary classification dataset (interleaving half-moons)
make_multilabel_classification() - Generate a multi-class, multi-label dataset
make_regression() - Generate a dataset for regression tasks
make_s_curve() - Generate an S-curve dataset
make_sparse_coded_signal() - Generate a signal as a sparse combination of dictionary elements
make_sparse_spd_matrix() - Generate a sparse symmetric positive-definite matrix
make_sparse_uncorrelated() - Generate a random regression problem with sparse uncorrelated design
make_spd_matrix() - Generate a random symmetric positive-definite matrix
make_swiss_roll() - Generate a Swiss-roll dataset

Example code for reading datasets:

from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
features = iris.data
target = iris.target
print(features.shape,target.shape)
print(iris.feature_names)

boston = datasets.load_boston()
boston_features = boston.data
boston_target = boston.target
print(boston_features.shape,boston_target.shape)
print(boston.feature_names)

digits = datasets.load_digits()
digits_features = digits.data
digits_target = digits.target
print(digits_features.shape,digits_target.shape)

img = datasets.load_sample_image('flower.jpg')
print(img.shape)
plt.imshow(img)
plt.show()

data,target = datasets.make_blobs(n_samples=1000,n_features=2,centers=4,cluster_std=1)
plt.scatter(data[:,0],data[:,1],c=target)
plt.show()

data,target = datasets.make_classification(n_classes=4,n_samples=1000,n_features=2,n_informative=2,n_redundant=0,n_clusters_per_class=1)
print(data.shape)
plt.scatter(data[:,0],data[:,1],c=target)
plt.show()

x,y = datasets.make_regression(n_samples=10,n_features=1,n_targets=1,noise=1.5,random_state=1)
print(x.shape,y.shape)
plt.scatter(x,y)
plt.show()

Data preprocessing

  • sklearn.preprocessing
Function - Description
preprocessing.scale() - Standardization (function form)
preprocessing.MinMaxScaler() - Min-max scaling
preprocessing.StandardScaler() - Standardization (zero mean, unit variance)
preprocessing.MaxAbsScaler() - Maximum-absolute-value scaling
preprocessing.RobustScaler() - Scaling robust to outliers
preprocessing.QuantileTransformer() - Transform features using quantile information
preprocessing.PowerTransformer() - Map to a normal distribution using a power transform
preprocessing.Normalizer() - Normalization (scale samples to unit norm)
preprocessing.OrdinalEncoder() - Encode categorical features as integer values
preprocessing.LabelEncoder() - Encode target labels as integer values
preprocessing.MultiLabelBinarizer() - Multi-label binarization
preprocessing.OneHotEncoder() - One-hot encoding
preprocessing.KBinsDiscretizer() - Discretize continuous data into bins
preprocessing.FunctionTransformer() - Apply a custom feature-processing function
preprocessing.Binarizer() - Feature binarization
preprocessing.PolynomialFeatures() - Create polynomial features
preprocessing.Imputer() - Fill in missing values (replaced by impute.SimpleImputer in recent versions)

Data preprocessing code

import numpy as np
from sklearn import preprocessing

#Standardization: transform the data so that each feature has zero mean and unit variance (standard normal scaling)
x = np.array([[1,-1,2],[2,0,0],[0,1,-1]])
x_scale = preprocessing.scale(x)
print(x_scale.mean(axis=0),x_scale.std(axis=0))

std_scale = preprocessing.StandardScaler().fit(x)
x_std = std_scale.transform(x)
print(x_std.mean(axis=0),x_std.std(axis=0))

#Scale data to a given range (0-1)
mm_scale = preprocessing.MinMaxScaler()
x_mm = mm_scale.fit_transform(x)
print(x_mm.mean(axis=0),x_mm.std(axis=0))

#Scale the data to the range [-1, 1]; suitable for sparse data
mb_scale = preprocessing.MaxAbsScaler()
x_mb = mb_scale.fit_transform(x)
print(x_mb.mean(axis=0),x_mb.std(axis=0))

#For data with outliers
rob_scale = preprocessing.RobustScaler()
x_rob = rob_scale.fit_transform(x)
print(x_rob.mean(axis=0),x_rob.std(axis=0))

#Normalization: scale each sample to unit norm
nor_scale = preprocessing.Normalizer()
x_nor = nor_scale.fit_transform(x)
print(x_nor.mean(axis=0),x_nor.std(axis=0))

#Feature binarization: convert numeric features to boolean values (0/1)
bin_scale = preprocessing.Binarizer()
x_bin = bin_scale.fit_transform(x)
print(x_bin)

#Convert categorical features or labels to one-hot encoding
ohe = preprocessing.OneHotEncoder()
x1 = ([[0,0,3],[1,1,0],[1,0,2]])
x_ohe = ohe.fit(x1).transform([[0,1,3]])
print(x_ohe)


import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(6).reshape(3,2)
poly = PolynomialFeatures(2)
x_poly = poly.fit_transform(x)
print(x)
print(x_poly)

import numpy as np
from sklearn.preprocessing import FunctionTransformer

#Custom feature conversion function
transformer = FunctionTransformer(np.log1p)

x = np.array([[0,1],[2,3]])
x_trans = transformer.transform(x)
print(x_trans)

import numpy as np
from sklearn import preprocessing

x = np.array([[-3,5,15],[0,6,14],[6,3,11]])
kbd = preprocessing.KBinsDiscretizer(n_bins=[3,2,2],encode='ordinal').fit(x)
x_kbd = kbd.transform(x)
print(x_kbd)

from sklearn.preprocessing import MultiLabelBinarizer

#Multi label binarization
mlb = MultiLabelBinarizer()
x_mlb = mlb.fit_transform([(1,2),(3,4),(5,)])
print(x_mlb)
  • sklearn.svm
Function - Description
svm.OneClassSVM() - Unsupervised outlier detection
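As a quick illustration of the entry above, here is a minimal sketch (toy data and assumed parameters) of unsupervised outlier detection with OneClassSVM; predict() returns 1 for inliers and -1 for outliers.

import numpy as np
from sklearn.svm import OneClassSVM

#Toy "normal" training points near the origin (illustrative data only)
x_train = np.array([[0, 0], [0.1, 0.2], [-0.2, 0.1], [0.2, -0.1]])

oc_svm = OneClassSVM(kernel='rbf', nu=0.1, gamma='scale')
oc_svm.fit(x_train)

#A point near the training data should be an inlier (1), a far point an outlier (-1)
print(oc_svm.predict([[0.1, 0.1], [5, 5]]))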

The preprocessing classes above share the following methods:

Method - Description
xxx.fit() - Fit the data
xxx.fit_transform() - Fit the data, then transform it
xxx.get_params() - Get the estimator's parameters
xxx.inverse_transform() - Apply the inverse transformation
xxx.set_params() - Set the estimator's parameters
xxx.transform() - Transform the data
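A short sketch of this shared interface, using StandardScaler on a small toy array (the data is purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

scaler = StandardScaler()
scaler.fit(x)                                  #fit the data (learn mean and std)
x_scaled = scaler.transform(x)                 #transform the data
x_back = scaler.inverse_transform(x_scaled)    #inverse transformation

print(scaler.get_params())                     #get the estimator's parameters
print(np.allclose(x, x_back))                  #the inverse recovers the original values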

Feature selection

Most of the time, the dataset we use for model training contains many features that are either redundant or only weakly correlated with the target. Carefully selecting a subset of "good" features to train on not only reduces training time but can also improve model performance.

For example, suppose a dataset contains four features (nose-wing length, eye-corner length, forehead width, and blood type). When using this dataset for face recognition, we would drop the blood-type feature first, because it is useless for the goal of recognizing faces.

  • sklearn.feature_selection
Function - Description
feature_selection.SelectKBest() (with scoring functions such as feature_selection.chi2, feature_selection.f_regression, feature_selection.mutual_info_regression) - Select the K highest-scoring features
feature_selection.VarianceThreshold() - Unsupervised feature selection by variance threshold
feature_selection.RFE() - Recursive feature elimination
feature_selection.RFECV() - Recursive feature elimination with cross-validation
feature_selection.SelectFromModel() - Feature selection based on a fitted model

Feature selection implementation code

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest,chi2

digits = load_digits()
data = digits.data
target = digits.target
print(data.shape)
data_new = SelectKBest(chi2,k=20).fit_transform(data,target)
print(data_new.shape)

from sklearn.feature_selection import VarianceThreshold

x = [[0,0,1],[0,1,0],[1,0,0],[0,1,1],[0,1,0],[0,1,1]]
vt = VarianceThreshold(threshold=(0.8*(1-0.8)))
x_new = vt.fit_transform(x)
print(x)
print(x_new)

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
x,y = iris.data,iris.target

lsvc = LinearSVC(C=0.01,penalty='l1',dual=False).fit(x,y)
model = SelectFromModel(lsvc,prefit=True)
x_new = model.transform(x)

print(x.shape)
print(x_new.shape)

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold,cross_val_score
from sklearn.feature_selection import RFECV
from sklearn.datasets import load_iris

iris = load_iris()
x,y = iris.data,iris.target

svc = SVC(kernel='linear')
rfecv = RFECV(estimator=svc,step=1,cv=StratifiedKFold(2),scoring='accuracy',verbose=1,n_jobs=1).fit(x,y)
x_rfe = rfecv.transform(x)
print(x_rfe.shape)

clf = SVC(gamma="auto", C=0.8)   
scores = (cross_val_score(clf, x_rfe, y, cv=5))
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()*2))


Feature dimensionality reduction

When facing a dataset with a huge number of features, besides feature selection we can also apply feature dimensionality-reduction algorithms to reduce the number of features. The difference is that feature selection keeps a subset of the original features, whereas dimensionality reduction generates new features from the original ones.

Many people are tempted to argue about whether feature selection or dimensionality reduction is "better". Detached from a concrete problem, such a comparison means little; each algorithm has the situations where it excels.

  • sklearn.decomposition
Function - Description
decomposition.PCA() - Principal component analysis
decomposition.KernelPCA() - Kernel principal component analysis
decomposition.IncrementalPCA() - Incremental principal component analysis
decomposition.MiniBatchSparsePCA() - Mini-batch sparse principal component analysis
decomposition.SparsePCA() - Sparse principal component analysis
decomposition.FactorAnalysis() - Factor analysis
decomposition.TruncatedSVD() - Truncated singular value decomposition
decomposition.FastICA() - Fast algorithm for independent component analysis
decomposition.DictionaryLearning() - Dictionary learning
decomposition.MiniBatchDictionaryLearning() - Mini-batch dictionary learning
decomposition.dict_learning() - Dictionary learning as matrix factorization
decomposition.dict_learning_online() - Online dictionary learning as matrix factorization
decomposition.LatentDirichletAllocation() - Latent Dirichlet allocation with online variational Bayes
decomposition.NMF() - Non-negative matrix factorization
decomposition.SparseCoder() - Sparse coding

Feature dimensionality reduction code implementation

"""Data dimensionality reduction"""

import numpy as np
from sklearn.decomposition import PCA

x = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
pca1 = PCA(n_components=2)
pca2 = PCA(n_components='mle')
pca1.fit(x)
pca2.fit(x)
x_new1 = pca1.transform(x)
x_new2 = pca2.transform(x)
print(x_new1.shape)
print(x_new2.shape)

import numpy as np
from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt
import math

#Kernel PCA is suitable for nonlinear dimensionality reduction of data
x = []
y = []
N = 500

for i in range(N):
    deg = np.random.randint(0,360)
    if np.random.randint(0,2)%2 == 0:
        x.append([6*math.sin(deg),6*math.cos(deg)])
        y.append(1)
    else:
        x.append([15*math.sin(deg),15*math.cos(deg)])
        y.append(0)
        
y = np.array(y)
x = np.array(x)

kpca = KernelPCA(kernel='rbf',n_components=14)
x_kpca = kpca.fit_transform(x)
print(x_kpca.shape)

from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
from scipy import sparse
X, _ = load_digits(return_X_y=True)

#Incremental principal component analysis: applicable to big data
transform = IncrementalPCA(n_components=7,batch_size=200)
transform.partial_fit(X[:100,:])

x_sparse = sparse.csr_matrix(X)
x_transformed = transform.fit_transform(x_sparse)
print(x_transformed.shape)

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.decomposition import MiniBatchSparsePCA

x,_ = make_friedman1(n_samples=200,n_features=30,random_state=0)
transformer = MiniBatchSparsePCA(n_components=5,batch_size=50,random_state=0)
transformer.fit(x)
x_transformed = transformer.transform(x)
print(x_transformed.shape)

from sklearn.datasets import load_digits
from sklearn.decomposition import FactorAnalysis

x,_ = load_digits(return_X_y=True)
transformer = FactorAnalysis(n_components=7,random_state=0)
x_transformed = transformer.fit_transform(x)
print(x_transformed.shape)
  • sklearn.manifold
Function - Description
manifold.LocallyLinearEmbedding() - Locally linear embedding (nonlinear dimensionality reduction)
manifold.Isomap() - Isomap manifold learning
manifold.MDS() - Multidimensional scaling
manifold.TSNE() - t-distributed stochastic neighbor embedding
manifold.SpectralEmbedding() - Spectral embedding for nonlinear dimensionality reduction
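The manifold estimators are not demonstrated elsewhere in this article, so here is a brief sketch (assumed parameters, a subset of the digits data) that embeds the 64-dimensional samples into 2 dimensions with Isomap and TSNE:

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

x, _ = load_digits(return_X_y=True)
x = x[:500]   #use a subset to keep the example fast

#Isomap: nonlinear dimensionality reduction via geodesic distances
iso = Isomap(n_components=2)
x_iso = iso.fit_transform(x)
print(x_iso.shape)

#t-SNE: embedding mainly useful for visualization
tsne = TSNE(n_components=2, init='pca', random_state=0)
x_tsne = tsne.fit_transform(x)
print(x_tsne.shape)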

Classification model

A classification model learns from a dataset and, after training, can distinguish between the categories it has seen, much like a child learning to recognize objects.

  • sklearn.tree
Function - Description
tree.DecisionTreeClassifier() - Decision tree classifier

Decision tree classification

from sklearn.datasets import load_iris
from sklearn import tree
import matplotlib.pyplot as plt

x,y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x,y)
tree.plot_tree(clf)
plt.show()
  • sklearn.ensemble
Function - Description
ensemble.BaggingClassifier() - Bagging ensemble learning
ensemble.AdaBoostClassifier() - Boosting ensemble learning (AdaBoost)
ensemble.RandomForestClassifier() - Random forest classifier
ensemble.ExtraTreesClassifier() - Extremely randomized trees classifier
ensemble.RandomTreesEmbedding() - Embedding using completely random trees
ensemble.GradientBoostingClassifier() - Gradient boosted trees classifier
ensemble.VotingClassifier() - Voting classifier

BaggingClassifier

#Bagging of decision trees with sklearn to improve classification performance. X and Y are the features (flower measurements) and the target (flower species) of the iris dataset, respectively

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

#Load iris dataset
iris=datasets.load_iris()
X=iris.data
Y=iris.target

#Generate K-fold cross validation data
kfold=KFold(n_splits=9)

#Decision tree and cross validation
cart=DecisionTreeClassifier(criterion='gini',max_depth=2)
cart=cart.fit(X,Y)
result=cross_val_score(cart,X,Y,cv=kfold)  #K-fold cross validation method is used to verify the effect of the algorithm
print('CART decision tree result:',result.mean())

#Bagging method and cross validation
model=BaggingClassifier(base_estimator=cart,n_estimators=100) #n_estimators=100 to establish 100 classification models
result=cross_val_score(model,X,Y,cv=kfold)  #K-fold cross validation method is used to verify the effect of the algorithm
print('Results after bagging:',result.mean())

AdaBoostClassifier

#Use the AdaBoost classifier in sklearn to boost a decision tree and improve classification accuracy. The load_breast_cancer() method loads the breast cancer dataset; the features (cell-nucleus measurements) and the target (benign/malignant) are assigned to X and Y, respectively

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

#Load data
dataset_all=datasets.load_breast_cancer()
X=dataset_all.data
Y=dataset_all.target

#Set up 10-fold cross validation
kfold=KFold(n_splits=10)

#Decision tree and cross validation
dtree=DecisionTreeClassifier(criterion='gini',max_depth=3)

#Boosting (AdaBoost) and cross validation
model=AdaBoostRegressor if False else AdaBoostClassifier(base_estimator=dtree,n_estimators=100)
result=cross_val_score(model,X,Y,cv=kfold)
print("AdaBoost result:",result.mean())

RandomForestClassifier, ExtraTreesClassifier

#Compare the random forest and extremely randomized trees classifiers from sklearn; the dataset is randomly generated by a data generator


from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

#make_blobs: sklearn's built-in data generator for random test samples. n_samples is the number of generated samples, n_features the number of features per sample, centers the number of classes, and random_state the random seed
x,y=make_blobs(n_samples=1000,n_features=6,centers=50,random_state=0)
plt.scatter(x[:,0],x[:,1],c=y)
plt.show()

#Constructing random forest model
clf=RandomForestClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)  #n_estimators is the number of weak learners: too few and the model underfits; too many and the computational cost grows with diminishing returns
scores=cross_val_score(clf,x,y)
print('RandomForestClassifier result:',scores.mean())

#Construct limit forest model
clf=ExtraTreesClassifier(n_estimators=10,max_depth=None,min_samples_split=2,random_state=0)
scores=cross_val_score(clf,x,y)
print('ExtraTreesClassifier result:',scores.mean())
#Extremely randomized trees usually perform a bit better than random forests because the randomness in choosing split points is increased further: for each candidate feature a threshold is drawn at random, and the best of these random thresholds is used as the splitting rule. This reduces the variance of the model and generally improves overall performance

GradientBoostingClassifier

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_blobs


#make_blobs randomly generates a test dataset (parameters as described above)
x,y=make_blobs(n_samples=1000,n_features=6,centers=50,random_state=0)
plt.scatter(x[:,0],x[:,1],c=y)
plt.show()

x_train, x_test, y_train, y_test = train_test_split(x,y)

# Model training, using GBDT algorithm
gbr = GradientBoostingClassifier(n_estimators=3000, max_depth=2, min_samples_split=2, learning_rate=0.1)
gbr.fit(x_train, y_train.ravel())

y_gbr = gbr.predict(x_train)
y_gbr1 = gbr.predict(x_test)
acc_train = gbr.score(x_train, y_train)
acc_test = gbr.score(x_test, y_test)
print(acc_train)
print(acc_test)

VotingClassifier

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#VotingClassifier fits several classification models at once and takes the majority vote of their predictions (hard voting) or the average of their predicted probabilities (soft voting) as the final result
x,y = datasets.make_moons(n_samples=500,noise=0.3,random_state=42)

plt.scatter(x[y==0,0],x[y==0,1])
plt.scatter(x[y==1,0],x[y==1,1])
plt.show()

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

voting_hard = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC()),
    ('dt_clf', DecisionTreeClassifier(random_state=10)),], voting='hard')

voting_soft = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC(probability=True)),
    ('dt_clf', DecisionTreeClassifier(random_state=10)),
], voting='soft')

voting_hard.fit(x_train,y_train)
print(voting_hard.score(x_test,y_test))

voting_soft.fit(x_train,y_train)
print(voting_soft.score(x_test,y_test))
  • sklearn.linear_model
Function - Description
linear_model.LogisticRegression() - Logistic regression
linear_model.Perceptron() - Linear perceptron
linear_model.SGDClassifier() - Linear classifier trained with SGD
linear_model.PassiveAggressiveClassifier() - Passive-aggressive (incremental learning) classifier

LogisticRegression

import numpy as np
from sklearn import linear_model,datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x = iris.data
y = iris.target

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(x_train,y_train)

prepro = logreg.score(x_test,y_test)
print(prepro)

Perceptron

from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

x,y = load_digits(return_X_y=True)
clf = Perceptron(tol=1e-3,random_state=0)
clf.fit(x,y)
clf.score(x,y)

SGDClassifier

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

x = np.array([[-1,-1],[-2,-1],[1,1],[2,1]])
y = np.array([1,1,2,2])

clf = make_pipeline(StandardScaler(),SGDClassifier(max_iter=1000,tol=1e-3))
clf.fit(x,y)
print(clf.score(x,y))
print(clf.predict([[-0.8,-1]]))

PassiveAggressiveClassifier

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

x,y = make_classification(n_features=4,random_state=0)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

clf = PassiveAggressiveClassifier(max_iter=1000,random_state=0,tol=1e-3)
clf.fit(x_train,y_train)
print(clf.score(x_test,y_test))
  • sklearn.svm
Function - Description
svm.SVC() - Support vector classification
svm.NuSVC() - Nu-support vector classification
svm.LinearSVC() - Linear support vector classification

SVC

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

x = [[2,0],[1,1],[2,3]]
y = [0,0,1]

clf = SVC(kernel='linear')
clf.fit(x,y)
print(clf.predict([[2,2]]))

NuSVC

from sklearn import svm
import numpy as np

x = np.array([[0],[1],[2],[3]])
y = np.array([0,1,2,3])

clf = svm.NuSVC()
clf.fit(x,y)
print(clf.predict([[4]]))

LinearSVC

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data
y = iris.target

plt.scatter(X[y==0, 0], X[y==0, 1], color='red')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue')
plt.show()

svc = LinearSVC(C=10**9)
svc.fit(X, y)
print(svc.score(X,y))
  • sklearn.neighbors
Function - Description
neighbors.NearestNeighbors() - Unsupervised nearest-neighbor search
neighbors.NearestCentroid() - Nearest-centroid classifier
neighbors.KNeighborsClassifier() - K-nearest-neighbors classifier
neighbors.KDTree() - KD-tree for nearest-neighbor search
neighbors.KNeighborsTransformer() - Transform data into a weighted graph of K nearest neighbors

NearestNeighbors

import numpy as np
from sklearn.neighbors import NearestNeighbors

samples = [[0,0,2],[1,0,0],[0,0,1]]
neigh = NearestNeighbors(n_neighbors=2,radius=0.4)
neigh.fit(samples)

print(neigh.kneighbors([[0,0,1.3]],2,return_distance=True))
print(neigh.radius_neighbors([[0,0,1.3]],0.4,return_distance=False))

NearestCentroid

from sklearn.neighbors import NearestCentroid
import numpy as np

x = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
y = np.array([1,1,1,2,2,2])

clf = NearestCentroid()
clf.fit(x,y)
print(clf.predict([[-0.8,-1]]))

KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

x,y = [[0],[1],[2],[3]],[0,0,1,1]

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(x,y)
print(neigh.predict([[1.1]]))

KDTree

import numpy as np
from sklearn.neighbors import KDTree
rng = np.random.RandomState(0)
x = rng.random_sample((10,3))
tree = KDTree(x,leaf_size=2)
dist,ind = tree.query(x[:1],k=3)
print(ind)

KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier
 
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
 
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
  • sklearn.discriminant_analysis
Function - Description
discriminant_analysis.LinearDiscriminantAnalysis() - Linear discriminant analysis
discriminant_analysis.QuadraticDiscriminantAnalysis() - Quadratic discriminant analysis

LDA

from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

iris = datasets.load_iris()
X = iris.data[:-5]
pre_x = iris.data[-5:]
y = iris.target[:-5]
print ('first 10 raw samples:', X[:10])
clf = LDA()
clf.fit(X, y)
X_r = clf.transform(X)
pre_y = clf.predict(pre_x)
#Dimensionality reduction results
print ('first 10 transformed samples:', X_r[:10])
#Predicted class labels
print ('predict value:', pre_y)

QDA

from sklearn import datasets
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

x = iris.data
y = iris.target

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

clf = QDA()
clf.fit(x_train,y_train)
print(clf.score(x_test,y_test))

  • sklearn.gaussian_process
Function - Description
gaussian_process.GaussianProcessClassifier() - Gaussian process classification
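A minimal sketch (iris data, an assumed RBF kernel) of Gaussian process classification:

from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

x, y = load_iris(return_X_y=True)

#Fit a GP classifier with a scaled RBF kernel
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0).fit(x, y)
print(gpc.score(x, y))
print(gpc.predict_proba(x[:2, :]))   #class probabilities for the first two samples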
  • sklearn.naive_bayes
Function - Description
naive_bayes.GaussianNB() - Gaussian naive Bayes
naive_bayes.MultinomialNB() - Multinomial naive Bayes
naive_bayes.BernoulliNB() - Bernoulli naive Bayes

GaussianNB

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
clf = GaussianNB()
clf = clf.fit(iris.data,iris.target)

y_pre = clf.predict(iris.data)

MultinomialNB

from sklearn import datasets
from sklearn.naive_bayes import MultinomialNB

iris = datasets.load_iris()
clf = MultinomialNB()
clf = clf.fit(iris.data, iris.target)
y_pred=clf.predict(iris.data)

BernoulliNB

from sklearn import datasets
from sklearn.naive_bayes import BernoulliNB

iris = datasets.load_iris()
clf = BernoulliNB()
clf = clf.fit(iris.data, iris.target)
y_pred=clf.predict(iris.data)

Regression model

  • sklearn.tree
Function - Description
tree.DecisionTreeRegressor() - Decision tree regressor
tree.ExtraTreeRegressor() - Extremely randomized tree regressor

DecisionTreeRegressor, ExtraTreeRegressor

"""regression"""

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor,ExtraTreeRegressor
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
import numpy as np

boston = load_boston()
x = boston.data
y = boston.target

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3)

dtr = DecisionTreeRegressor()
dtr.fit(x_train,y_train)

etr = ExtraTreeRegressor()
etr.fit(x_train,y_train)

yetr_pred = etr.predict(x_test)
ydtr_pred = dtr.predict(x_test)

print(dtr.score(x_test,y_test))
print(r2_score(y_test,ydtr_pred))

print(etr.score(x_test,y_test))
print(r2_score(y_test,yetr_pred))

  • sklearn.ensemble
Function - Description
ensemble.GradientBoostingRegressor() - Gradient boosting regression
ensemble.AdaBoostRegressor() - AdaBoost regression
ensemble.BaggingRegressor() - Bagging regression
ensemble.ExtraTreesRegressor() - Extremely randomized trees regression
ensemble.RandomForestRegressor() - Random forest regression

GradientBoostingRegressor

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.datasets import make_regression

X, y = make_regression(1000, 2, noise=10)

gbr = GBR()
gbr.fit(X, y)
gbr_preds = gbr.predict(X)

AdaBoostRegressor

from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression

x,y = make_regression(n_features=4,n_informative=2,random_state=0,shuffle=False)
regr = AdaBoostRegressor(random_state=0,n_estimators=100)
regr.fit(x,y)
regr.predict([[0,0,0,0]])

BaggingRegressor

from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
from sklearn.svm import SVR

x,y = make_regression(n_samples=100,n_features=4,n_informative=2,n_targets=1,random_state=0,shuffle=False)
br = BaggingRegressor(base_estimator=SVR(),n_estimators=10,random_state=0).fit(x,y)
br.predict([[0,0,0,0]])

ExtraTreesRegressor

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor

x,y = load_diabetes(return_X_y=True)
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=0)

etr = ExtraTreesRegressor(n_estimators=100,random_state=0).fit(x_train,y_train)
print(etr.score(x_test,y_test))

RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

x,y = make_regression(n_features=4,n_informative=2,random_state=0,shuffle=False)

rfr = RandomForestRegressor(max_depth=2,random_state=0)
rfr.fit(x,y)
print(rfr.predict([[0,0,0,0]]))
  • sklearn.linear_model
Function - Description
linear_model.LinearRegression() - Linear regression (ordinary least squares)
linear_model.Ridge() - Ridge regression
linear_model.Lasso() - Lasso regression (L1 regularization)
linear_model.ElasticNet() - Elastic net regression
linear_model.MultiTaskLasso() - Multi-task Lasso
linear_model.MultiTaskElasticNet() - Multi-task elastic net
linear_model.Lars() - Least-angle regression
linear_model.OrthogonalMatchingPursuit() - Orthogonal matching pursuit
linear_model.BayesianRidge() - Bayesian ridge regression
linear_model.ARDRegression() - Bayesian ARD (automatic relevance determination) regression
linear_model.SGDRegressor() - Regression with stochastic gradient descent
linear_model.PassiveAggressiveRegressor() - Passive-aggressive (incremental learning) regression
linear_model.HuberRegressor() - Huber regression

Ridge, Lasso

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

np.random.seed(0)
x = np.random.randn(10,5)
y = np.random.randn(10)
clf1 = Ridge(alpha=1.0)
clf2 = Lasso()
clf2.fit(x,y)
clf1.fit(x,y)
print(clf1.predict(x))
print(clf2.predict(x))
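For completeness, a short additional sketch (synthetic data from make_regression, assumed parameters) covering two more estimators from the table above, LinearRegression and SGDRegressor:

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

#Ordinary least squares
lr = LinearRegression().fit(x, y)
print(lr.score(x, y))

#Linear regression fitted with stochastic gradient descent
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0).fit(x, y)
print(sgd.score(x, y))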
  • sklearn.svm
Function - Description
svm.SVR() - Support vector regression
svm.NuSVR() - Nu-support vector regression
svm.LinearSVR() - Linear support vector regression
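No example is given above for support vector regression, so here is a minimal sketch with a tiny toy dataset (illustrative values and parameters):

import numpy as np
from sklearn.svm import SVR, LinearSVR

x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 0.8, 1.9, 3.1, 3.9])

#Kernelized support vector regression
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(x, y)
print(svr.predict([[2.5]]))

#Linear support vector regression
lsvr = LinearSVR(C=1.0, max_iter=10000).fit(x, y)
print(lsvr.predict([[2.5]]))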
  • sklearn.neighbors
Function - Description
neighbors.KNeighborsRegressor() - K-nearest-neighbors regression
neighbors.RadiusNeighborsRegressor() - Radius-based nearest-neighbors regression
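A brief sketch (toy data) of K-nearest-neighbors regression:

from sklearn.neighbors import KNeighborsRegressor

x, y = [[0], [1], [2], [3]], [0.0, 0.0, 1.0, 1.0]

#Predict by averaging the targets of the 2 nearest neighbors
knr = KNeighborsRegressor(n_neighbors=2).fit(x, y)
print(knr.predict([[1.5]]))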
  • sklearn.kernel_ridge
Function - Description
kernel_ridge.KernelRidge() - Kernel ridge regression
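A minimal sketch (synthetic noisy sine data, assumed hyperparameters) of kernel ridge regression:

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
x = rng.rand(100, 1)
y = np.sin(4 * x).ravel() + 0.1 * rng.randn(100)

#Ridge regression in an RBF kernel feature space
krr = KernelRidge(alpha=1.0, kernel='rbf', gamma=2.0).fit(x, y)
print(krr.score(x, y))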
  • sklearn.gaussian_process
Function - Description
gaussian_process.GaussianProcessRegressor() - Gaussian process regression

GaussianProcessRegressor

from sklearn.datasets import make_friedman2
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct,WhiteKernel

x,y = make_friedman2(n_samples=500,noise=0,random_state=0)

kernel = DotProduct()+WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel,random_state=0).fit(x,y)
print(gpr.score(x,y))
  • sklearn.cross_decomposition
Function - Description
cross_decomposition.PLSRegression() - Partial least squares regression

PLSRegression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

boston = datasets.load_boston()

x = boston.data
y = boston.target

x_df = pd.DataFrame(x,columns=boston.feature_names)
y_df = pd.DataFrame(y)

pls = PLSRegression(n_components=2)

x_train,x_test,y_train,y_test = train_test_split(x_df,y_df,test_size=0.3,random_state=1)

pls.fit(x_train,y_train)
print(pls.predict(x_test))

Clustering model

  • sklearn.cluster
Function - Description
cluster.DBSCAN() - Density-based clustering
mixture.GaussianMixture() - Gaussian mixture model
cluster.AffinityPropagation() - Affinity propagation clustering
cluster.AgglomerativeClustering() - Hierarchical (agglomerative) clustering
cluster.Birch() - BIRCH clustering (balanced iterative reducing and clustering using hierarchies)
cluster.KMeans() - K-means clustering
cluster.MiniBatchKMeans() - Mini-batch K-means clustering
cluster.MeanShift() - Mean-shift clustering
cluster.OPTICS() - Clustering by ordering points to identify the clustering structure
cluster.SpectralClustering() - Spectral clustering
cluster.SpectralBiclustering() - Spectral biclustering
cluster.ward_tree() - Ward clustering tree
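The clustering estimators are not demonstrated elsewhere in this article, so here is a short sketch (synthetic blobs, assumed parameters) showing two of them, KMeans and DBSCAN:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

x, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

#K-means: partition the samples into 3 clusters
km = KMeans(n_clusters=3, random_state=0).fit(x)
print(km.labels_[:10])
print(km.cluster_centers_)

#DBSCAN: density-based clustering; label -1 marks noise points
db = DBSCAN(eps=0.5, min_samples=5).fit(x)
print(set(db.labels_))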
  • Model method
Method - Description
xxx.fit() - Train the model
xxx.get_params() - Get the model's parameters
xxx.predict() - Predict on new input data
xxx.score() - Evaluate the model (classification / regression / clustering)
xxx.set_params() - Set the model's parameters

Model evaluation

  • Classification model evaluation
Function - Description
metrics.accuracy_score() - Accuracy
metrics.average_precision_score() - Average precision
metrics.log_loss() - Logarithmic loss
metrics.confusion_matrix() - Confusion matrix
metrics.classification_report() - Classification report: precision, recall, F1 score
metrics.roc_curve() - Receiver operating characteristic (ROC) curve
metrics.auc() - Area under the ROC curve
metrics.roc_auc_score() - AUC value
  • Regression model evaluation
Function - Description
metrics.mean_squared_error() - Mean squared error
metrics.median_absolute_error() - Median absolute error
metrics.r2_score() - Coefficient of determination (R²)
  • Cluster model evaluation
Function - Description
metrics.adjusted_rand_score() - Adjusted Rand index
metrics.silhouette_score() - Silhouette coefficient
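A minimal sketch (iris data, an assumed train/test split) tying together a few of the evaluation functions listed above for classification and clustering:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn import metrics

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

#Classification evaluation
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

#Clustering evaluation
labels = KMeans(n_clusters=3, random_state=0).fit_predict(x)
print(metrics.adjusted_rand_score(y, labels))
print(metrics.silhouette_score(x, labels))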

Model optimization

Function - Description
model_selection.cross_val_score() - Cross-validation score
model_selection.LeaveOneOut() - Leave-one-out cross-validation
model_selection.LeavePOut() - Leave-P-out cross-validation
model_selection.GridSearchCV() - Grid search with cross-validation
model_selection.RandomizedSearchCV() - Randomized search with cross-validation
model_selection.validation_curve() - Validation curve
model_selection.learning_curve() - Learning curve
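A brief sketch (iris data, an assumed parameter grid) of grid search and cross-validation from the table above:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)

#Search over C and kernel with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(x, y)
print(grid.best_params_, grid.best_score_)

#Cross-validate the best estimator found by the search
scores = cross_val_score(grid.best_estimator_, x, y, cv=5)
print(scores.mean())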

Write at the end

Learning machine learning and programming from articles alone makes it easy to run into a lot of bugs, which wastes a beginner's time and undermines their confidence in learning and mastering machine learning.

Using the sklearn library, you can get started with machine learning very quickly. How long does it take? The course below gives the answer: you can get started with machine learning in three days. This article may not be the friendliest introduction for beginners, but at least that online course can get you up and running in three days.

The course explains the whole machine learning workflow in detail, walks you through the code for each step, and finally consolidates the concepts and code through hands-on projects.
