1. Introduction
sklearn (scikit-learn) is a Python machine learning toolkit and is usually the first tool reached for in a machine learning project. It ships with a large number of built-in datasets for practicing the various machine learning algorithms, and it integrates a very comprehensive set of tools: data preprocessing, feature selection, feature dimensionality reduction, classification / regression / clustering models, model evaluation, and more.
2. sklearn data types
The data ultimately processed by machine learning algorithms are all numeric, but the raw data may arrive in different forms, such as matrices, text, images, video, or audio.
3. sklearn overview
- Data sets
- Data preprocessing
- Feature selection
- Feature dimensionality reduction
- Classification model
- Regression model
- Clustering model
- Model evaluation
- Model optimization
- Write at the end
- Excellent references
Data sets
- sklearn.datasets
- Small bundled datasets (loaded locally): datasets.load_xxx( )
- Larger datasets (downloaded on demand): datasets.fetch_xxx( )
- Generated datasets (constructed locally): datasets.make_xxx( )
Data set | Description |
---|---|
load_iris( ) | Iris dataset: 3 classes, 4 features, 150 samples |
load_boston( ) | Boston house-price dataset: 13 features, 506 samples (removed in recent scikit-learn releases) |
load_digits( ) | Handwritten digits dataset: 10 classes, 64 features, 1797 samples |
load_breast_cancer( ) | Breast cancer dataset: 2 classes, 30 features, 569 samples |
load_diabetes( ) | Diabetes dataset: 10 features, 442 samples |
load_wine( ) | Wine dataset: 3 classes, 13 features, 178 samples |
load_files( ) | Load a custom text classification dataset |
load_linnerud( ) | Linnerud physical-exercise dataset: 3 features, 20 samples |
load_sample_image( ) | Load a single sample image |
load_svmlight_file( ) | Load data in svmlight format |
make_blobs( ) | Generate a multi-class, single-label dataset (isotropic Gaussian blobs) |
make_biclusters( ) | Generate an array with constant block structure for biclustering |
make_checkerboard( ) | Generate an array with checkerboard structure for biclustering |
make_circles( ) | Generate a 2-D binary classification dataset (concentric circles) |
make_classification( ) | Generate a multi-class, single-label classification dataset |
make_friedman1( ) | Generate the Friedman #1 regression problem (polynomial and sine transforms) |
make_gaussian_quantiles( ) | Generate isotropic Gaussian samples labeled by quantile |
make_hastie_10_2( ) | Generate the 10-dimensional binary classification problem of Hastie et al. |
make_low_rank_matrix( ) | Generate a mostly low-rank matrix with bell-shaped singular values |
make_moons( ) | Generate a 2-D binary classification dataset (two interleaving half circles) |
make_multilabel_classification( ) | Generate a multi-class, multi-label dataset |
make_regression( ) | Generate a dataset for regression tasks |
make_s_curve( ) | Generate an S-curve dataset |
make_sparse_coded_signal( ) | Generate a signal as a sparse combination of dictionary elements |
make_sparse_spd_matrix( ) | Generate a sparse symmetric positive-definite matrix |
make_sparse_uncorrelated( ) | Generate a random regression problem with sparse uncorrelated design |
make_spd_matrix( ) | Generate a random symmetric positive-definite matrix |
make_swiss_roll( ) | Generate a Swiss-roll dataset |
Sample code for loading and generating datasets:

    from sklearn import datasets
    import matplotlib.pyplot as plt

    # Small bundled datasets are loaded locally with datasets.load_xxx()
    iris = datasets.load_iris()
    features = iris.data
    target = iris.target
    print(features.shape, target.shape)
    print(iris.feature_names)

    # Note: load_boston() is deprecated and removed in recent scikit-learn releases
    boston = datasets.load_boston()
    boston_features = boston.data
    boston_target = boston.target
    print(boston_features.shape, boston_target.shape)
    print(boston.feature_names)

    digits = datasets.load_digits()
    digits_features = digits.data
    digits_target = digits.target
    print(digits_features.shape, digits_target.shape)

    # Load a single sample image
    img = datasets.load_sample_image('flower.jpg')
    print(img.shape)
    plt.imshow(img)
    plt.show()

    # Locally generated datasets: datasets.make_xxx()
    data, target = datasets.make_blobs(n_samples=1000, n_features=2, centers=4, cluster_std=1)
    plt.scatter(data[:, 0], data[:, 1], c=target)
    plt.show()

    data, target = datasets.make_classification(n_classes=4, n_samples=1000, n_features=2,
                                                 n_informative=2, n_redundant=0,
                                                 n_clusters_per_class=1)
    print(data.shape)
    plt.scatter(data[:, 0], data[:, 1], c=target)
    plt.show()

    x, y = datasets.make_regression(n_samples=10, n_features=1, n_targets=1, noise=1.5, random_state=1)
    print(x.shape, y.shape)
    plt.scatter(x, y)
    plt.show()
Data preprocessing
- sklearn.preprocessing
Function | Description |
---|---|
preprocessing.scale( ) | Standardize an array directly |
preprocessing.MinMaxScaler( ) | Scale features to a given range (min-max scaling) |
preprocessing.StandardScaler( ) | Standardize features to zero mean and unit variance |
preprocessing.MaxAbsScaler( ) | Scale features by their maximum absolute value |
preprocessing.RobustScaler( ) | Scale data containing outliers using robust statistics |
preprocessing.QuantileTransformer( ) | Transform features using quantile information |
preprocessing.PowerTransformer( ) | Map features to a normal distribution with a power transform |
preprocessing.Normalizer( ) | Normalize each sample to unit norm |
preprocessing.OrdinalEncoder( ) | Encode categorical features as integers |
preprocessing.LabelEncoder( ) | Encode target labels as integers |
preprocessing.MultiLabelBinarizer( ) | Multi-label binarization |
preprocessing.OneHotEncoder( ) | One-hot encoding of categorical features |
preprocessing.KBinsDiscretizer( ) | Discretize continuous features into bins |
preprocessing.FunctionTransformer( ) | Apply a custom feature-processing function |
preprocessing.Binarizer( ) | Binarize features by thresholding |
preprocessing.PolynomialFeatures( ) | Create polynomial features |
preprocessing.Imputer( ) | Impute missing values (replaced by impute.SimpleImputer in newer versions) |
Data preprocessing code

    import numpy as np
    from sklearn import preprocessing

    # Standardization: rescale the data to zero mean and unit variance
    x = np.array([[1, -1, 2], [2, 0, 0], [0, 1, -1]])
    x_scale = preprocessing.scale(x)
    print(x_scale.mean(axis=0), x_scale.std(axis=0))

    std_scale = preprocessing.StandardScaler().fit(x)
    x_std = std_scale.transform(x)
    print(x_std.mean(axis=0), x_std.std(axis=0))

    # Scale each feature to a given range (default 0 to 1)
    mm_scale = preprocessing.MinMaxScaler()
    x_mm = mm_scale.fit_transform(x)
    print(x_mm.mean(axis=0), x_mm.std(axis=0))

    # Scale each feature to the range -1 to 1 by its maximum absolute value (useful for sparse data)
    mb_scale = preprocessing.MaxAbsScaler()
    x_mb = mb_scale.fit_transform(x)
    print(x_mb.mean(axis=0), x_mb.std(axis=0))

    # Robust scaling for data containing outliers
    rob_scale = preprocessing.RobustScaler()
    x_rob = rob_scale.fit_transform(x)
    print(x_rob.mean(axis=0), x_rob.std(axis=0))

    # Normalization: scale each sample to unit norm
    nor_scale = preprocessing.Normalizer()
    x_nor = nor_scale.fit_transform(x)
    print(x_nor.mean(axis=0), x_nor.std(axis=0))

    # Feature binarization: convert numeric features to boolean values
    bin_scale = preprocessing.Binarizer()
    x_bin = bin_scale.fit_transform(x)
    print(x_bin)

    # One-hot encoding of categorical features
    ohe = preprocessing.OneHotEncoder()
    x1 = [[0, 0, 3], [1, 1, 0], [1, 0, 2]]
    x_ohe = ohe.fit(x1).transform([[0, 1, 3]])
    print(x_ohe)

    # Polynomial feature generation
    from sklearn.preprocessing import PolynomialFeatures
    x = np.arange(6).reshape(3, 2)
    poly = PolynomialFeatures(2)
    x_poly = poly.fit_transform(x)
    print(x)
    print(x_poly)

    # Custom feature transformation function
    from sklearn.preprocessing import FunctionTransformer
    transformer = FunctionTransformer(np.log1p)
    x = np.array([[0, 1], [2, 3]])
    x_trans = transformer.transform(x)
    print(x_trans)

    # Discretize continuous features into bins
    x = np.array([[-3, 5, 15], [0, 6, 14], [6, 3, 11]])
    kbd = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(x)
    x_kbd = kbd.transform(x)
    print(x_kbd)

    # Multi-label binarization
    from sklearn.preprocessing import MultiLabelBinarizer
    mlb = MultiLabelBinarizer()
    x_mlb = mlb.fit_transform([(1, 2), (3, 4), (5,)])
    print(x_mlb)
- sklearn.svm
Function | Description |
---|---|
svm.OneClassSVM( ) | Unsupervised outlier detection |
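As a quick illustration of the entry above, here is a minimal sketch of unsupervised outlier detection with svm.OneClassSVM; the tiny 2-D toy points and the nu value are made up purely for demonstration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy data: a cluster of "normal" points plus two obvious outliers (made-up values)
x = np.array([[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0], [0.05, 0.05],
              [5.0, 5.0], [-6.0, 4.0]])
clf = OneClassSVM(kernel='rbf', nu=0.3, gamma='scale')
clf.fit(x)
print(clf.predict(x))  # +1 for inliers, -1 for points flagged as outliers
```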
The preprocessing classes listed earlier (like most sklearn transformers) share the following methods:
preprocessing.xxx method | Description |
---|---|
xxx.fit( ) | Fit to the data |
xxx.fit_transform( ) | Fit to the data, then transform it |
xxx.get_params( ) | Get the estimator's parameters |
xxx.inverse_transform( ) | Undo the transformation |
xxx.set_params( ) | Set the estimator's parameters |
xxx.transform( ) | Transform the data |
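These methods all follow the same fit / transform pattern. A minimal sketch of that pattern, using StandardScaler on the iris data purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x, _ = load_iris(return_X_y=True)
scaler = StandardScaler()
scaler.fit(x)                              # learn the column means and standard deviations
x_std = scaler.transform(x)                # apply the learned scaling
x_back = scaler.inverse_transform(x_std)   # undo the scaling
print(scaler.get_params())                 # inspect the transformer's parameters
print(x_std.mean(axis=0).round(3))
print(x_back[:1])
```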
Feature selection
Much of the time, the dataset used to train a model contains many features that are either redundant or barely correlated with the target. Carefully selecting a subset of "good" features to train on not only shortens training time but can also improve model performance.
For example, suppose a dataset contains four features (nose-wing length, eye-corner length, forehead width, and blood type). When using it for face recognition we would drop the blood-type feature beforehand, because it carries no information about the face-recognition target.
- sklearn.feature_selection

Function | Description |
---|---|
feature_selection.SelectKBest( ) with feature_selection.chi2, feature_selection.f_regression, feature_selection.mutual_info_regression | Select the K highest-scoring features under the chosen scoring function |
feature_selection.VarianceThreshold( ) | Unsupervised feature selection by variance |
feature_selection.RFE( ) | Recursive feature elimination |
feature_selection.RFECV( ) | Recursive feature elimination with cross-validation |
feature_selection.SelectFromModel( ) | Feature selection based on a fitted model's importance weights |
Feature selection implementation code

    from sklearn.datasets import load_digits
    from sklearn.feature_selection import SelectKBest, chi2

    # Univariate selection: keep the 20 features with the highest chi-squared scores
    digits = load_digits()
    data = digits.data
    target = digits.target
    print(data.shape)
    data_new = SelectKBest(chi2, k=20).fit_transform(data, target)
    print(data_new.shape)

    # Remove low-variance features
    from sklearn.feature_selection import VarianceThreshold
    x = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
    vt = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
    x_new = vt.fit_transform(x)
    print(x)
    print(x_new)

    # Model-based selection with an L1-penalized linear SVM
    from sklearn.svm import LinearSVC
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel
    iris = load_iris()
    x, y = iris.data, iris.target
    lsvc = LinearSVC(C=0.01, penalty='l1', dual=False).fit(x, y)
    model = SelectFromModel(lsvc, prefit=True)
    x_new = model.transform(x)
    print(x.shape)
    print(x_new.shape)

    # Recursive feature elimination with cross-validation (RFECV)
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.feature_selection import RFECV
    svc = SVC(kernel='linear')
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
                  scoring='accuracy', verbose=1, n_jobs=1).fit(x, y)
    x_rfe = rfecv.transform(x)
    print(x_rfe.shape)
    clf = SVC(gamma="auto", C=0.8)
    scores = cross_val_score(clf, x_rfe, y, cv=5)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Feature dimensionality reduction
When facing a dataset with a huge number of features, besides feature selection we can also apply feature-dimensionality-reduction algorithms to cut the number of features. The difference is that feature selection picks a subset of the original features, whereas dimensionality reduction generates new features from the original ones.
Many people are tempted to rank feature selection against dimensionality reduction, but such a comparison divorced from a concrete problem means little; each technique has the situations it is best suited to.
- sklearn.decomposition
Function | Description |
---|---|
decomposition.PCA( ) | Principal component analysis |
decomposition.KernelPCA( ) | Kernel principal component analysis |
decomposition.IncrementalPCA( ) | Incremental principal component analysis |
decomposition.MiniBatchSparsePCA( ) | Mini-batch sparse principal component analysis |
decomposition.SparsePCA( ) | Sparse principal component analysis |
decomposition.FactorAnalysis( ) | Factor analysis |
decomposition.TruncatedSVD( ) | Truncated singular value decomposition |
decomposition.FastICA( ) | Fast algorithm for independent component analysis |
decomposition.DictionaryLearning( ) | Dictionary learning |
decomposition.MiniBatchDictionaryLearning( ) | Mini-batch dictionary learning |
decomposition.dict_learning( ) | Solve the dictionary-learning matrix factorization problem |
decomposition.dict_learning_online( ) | Solve the dictionary-learning problem online |
decomposition.LatentDirichletAllocation( ) | Latent Dirichlet allocation with online variational Bayes |
decomposition.NMF( ) | Non-negative matrix factorization |
decomposition.SparseCoder( ) | Sparse coding |
Feature dimensionality reduction code implementation

    """Feature dimensionality reduction"""
    import numpy as np
    from sklearn.decomposition import PCA

    # Principal component analysis; n_components='mle' chooses the dimension automatically
    x = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    pca1 = PCA(n_components=2)
    pca2 = PCA(n_components='mle')
    pca1.fit(x)
    pca2.fit(x)
    x_new1 = pca1.transform(x)
    x_new2 = pca2.transform(x)
    print(x_new1.shape)
    print(x_new2.shape)

    # Kernel PCA is suitable for non-linear dimensionality reduction
    import math
    from sklearn.decomposition import KernelPCA
    x = []
    y = []
    N = 500
    for i in range(N):
        deg = np.random.randint(0, 360)
        if np.random.randint(0, 2) % 2 == 0:
            x.append([6 * math.sin(deg), 6 * math.cos(deg)])
            y.append(1)
        else:
            x.append([15 * math.sin(deg), 15 * math.cos(deg)])
            y.append(0)
    y = np.array(y)
    x = np.array(x)
    kpca = KernelPCA(kernel='rbf', n_components=14)
    x_kpca = kpca.fit_transform(x)
    print(x_kpca.shape)

    # Incremental PCA: suitable for large datasets processed in batches
    from sklearn.datasets import load_digits
    from sklearn.decomposition import IncrementalPCA
    from scipy import sparse
    X, _ = load_digits(return_X_y=True)
    transform = IncrementalPCA(n_components=7, batch_size=200)
    transform.partial_fit(X[:100, :])
    x_sparse = sparse.csr_matrix(X)
    x_transformed = transform.fit_transform(x_sparse)
    print(x_transformed.shape)

    # Mini-batch sparse PCA
    from sklearn.datasets import make_friedman1
    from sklearn.decomposition import MiniBatchSparsePCA
    x, _ = make_friedman1(n_samples=200, n_features=30, random_state=0)
    transformer = MiniBatchSparsePCA(n_components=5, batch_size=50, random_state=0)
    transformer.fit(x)
    x_transformed = transformer.transform(x)
    print(x_transformed.shape)

    # Factor analysis
    from sklearn.decomposition import FactorAnalysis
    x, _ = load_digits(return_X_y=True)
    transformer = FactorAnalysis(n_components=7, random_state=0)
    x_transformed = transformer.fit_transform(x)
    print(x_transformed.shape)
- sklearn.manifold
Function | Description |
---|---|
manifold.LocallyLinearEmbedding( ) | Locally linear embedding |
manifold.Isomap( ) | Isomap (isometric mapping) manifold learning |
manifold.MDS( ) | Multidimensional scaling |
manifold.TSNE( ) | t-distributed stochastic neighbor embedding (t-SNE) |
manifold.SpectralEmbedding( ) | Spectral embedding for non-linear dimensionality reduction |
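No example is given above for sklearn.manifold, so here is a minimal sketch using Isomap and t-SNE on the digits data; the subsample size and the default-ish parameters are arbitrary choices made only to keep the runtime short.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap

x, y = load_digits(return_X_y=True)
x = x[:500]  # subsample only to speed up the demo

# Isomap: non-linear dimensionality reduction through isometric mapping
x_iso = Isomap(n_components=2).fit_transform(x)

# t-SNE: embed the 64-dimensional digits into 2 dimensions for visualization
x_tsne = TSNE(n_components=2, random_state=0).fit_transform(x)
print(x_iso.shape, x_tsne.shape)
```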
Classification model
A classification model learns from a dataset and, after training, can tell apart the kinds of samples it has seen, much like a child learning to recognize objects.
- sklearn.tree

Function | Description |
---|---|
tree.DecisionTreeClassifier( ) | Decision tree classifier |
Decision tree classification

    from sklearn.datasets import load_iris
    from sklearn import tree
    import matplotlib.pyplot as plt

    x, y = load_iris(return_X_y=True)
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(x, y)
    tree.plot_tree(clf)
    plt.show()
- sklearn.ensemble
Function | Description |
---|---|
ensemble.BaggingClassifier( ) | Bagging ensemble classifier |
ensemble.AdaBoostClassifier( ) | AdaBoost (boosting) ensemble classifier |
ensemble.RandomForestClassifier( ) | Random forest classifier |
ensemble.ExtraTreesClassifier( ) | Extremely randomized trees (extra-trees) classifier |
ensemble.RandomTreesEmbedding( ) | Embedding based on completely random trees |
ensemble.GradientBoostingClassifier( ) | Gradient boosting classifier |
ensemble.VotingClassifier( ) | Voting classifier |
BaggingClassifier

    # Bagging a decision tree to improve classification on the iris dataset.
    # X and Y hold the flower features and the flower species respectively.
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import datasets

    # Load the iris dataset
    iris = datasets.load_iris()
    X = iris.data
    Y = iris.target

    # K-fold cross-validation splitter
    kfold = KFold(n_splits=9)

    # Single decision tree evaluated with cross-validation
    cart = DecisionTreeClassifier(criterion='gini', max_depth=2)
    cart = cart.fit(X, Y)
    result = cross_val_score(cart, X, Y, cv=kfold)
    print('CART tree result:', result.mean())

    # Bagging: n_estimators=100 builds 100 trees on bootstrap samples
    # (in scikit-learn >= 1.2 the argument is named estimator instead of base_estimator)
    model = BaggingClassifier(base_estimator=cart, n_estimators=100)
    result = cross_val_score(model, X, Y, cv=kfold)
    print('Result after bagging:', result.mean())
AdaBoostClassifier

    # AdaBoost over a decision tree to improve classification accuracy.
    # load_breast_cancer() loads the breast cancer dataset; the features and the
    # benign/malignant target are assigned to X and Y respectively.
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import datasets

    # Load the data
    dataset_all = datasets.load_breast_cancer()
    X = dataset_all.data
    Y = dataset_all.target

    # K-fold cross-validation splitter
    kfold = KFold(n_splits=10)

    # Base decision tree
    dtree = DecisionTreeClassifier(criterion='gini', max_depth=3)

    # Boosting evaluated with cross-validation
    model = AdaBoostClassifier(base_estimator=dtree, n_estimators=100)
    result = cross_val_score(model, X, Y, cv=kfold)
    print("Boosting result:", result.mean())
RandomForestClassifier, ExtraTreesClassifier

    # Compare random forests and extremely randomized trees on randomly generated data
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    import matplotlib.pyplot as plt

    # make_blobs is a built-in sample generator: n_samples is the number of samples,
    # n_features the number of features per sample, centers the number of classes,
    # and random_state the random seed
    x, y = make_blobs(n_samples=1000, n_features=6, centers=50, random_state=0)
    plt.scatter(x[:, 0], x[:, 1], c=y)
    plt.show()

    # Random forest model. n_estimators is the number of trees: too small and the model
    # underfits; too large and the extra computation brings little further improvement
    clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
    scores = cross_val_score(clf, x, y)
    print('RandomForestClassifier result:', scores.mean())

    # Extremely randomized trees model
    clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
    scores = cross_val_score(clf, x, y)
    print('ExtraTreesClassifier result:', scores.mean())

    # Extra-trees often do at least as well as random forests here because the split
    # thresholds are also drawn at random for each candidate feature and the best of these
    # random thresholds is used, which further reduces the variance of the model
GradientBoostingClassifier

    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import make_blobs

    # make_blobs is a built-in sample generator (see the previous example)
    x, y = make_blobs(n_samples=1000, n_features=6, centers=50, random_state=0)
    plt.scatter(x[:, 0], x[:, 1], c=y)
    plt.show()

    x_train, x_test, y_train, y_test = train_test_split(x, y)

    # Model training with the GBDT algorithm
    gbr = GradientBoostingClassifier(n_estimators=3000, max_depth=2, min_samples_split=2, learning_rate=0.1)
    gbr.fit(x_train, y_train.ravel())
    y_gbr = gbr.predict(x_train)
    y_gbr1 = gbr.predict(x_test)
    acc_train = gbr.score(x_train, y_train)
    acc_test = gbr.score(x_test, y_test)
    print(acc_train)
    print(acc_test)
VotingClassifier

    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.ensemble import VotingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # VotingClassifier combines several classifiers: hard voting takes the majority
    # predicted class, soft voting averages the predicted probabilities
    x, y = datasets.make_moons(n_samples=500, noise=0.3, random_state=42)
    plt.scatter(x[y == 0, 0], x[y == 0, 1])
    plt.scatter(x[y == 1, 0], x[y == 1, 1])
    plt.show()

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    voting_hard = VotingClassifier(estimators=[
        ('log_clf', LogisticRegression()),
        ('svm_clf', SVC()),
        ('dt_clf', DecisionTreeClassifier(random_state=10))],
        voting='hard')
    voting_soft = VotingClassifier(estimators=[
        ('log_clf', LogisticRegression()),
        ('svm_clf', SVC(probability=True)),
        ('dt_clf', DecisionTreeClassifier(random_state=10))],
        voting='soft')

    voting_hard.fit(x_train, y_train)
    print(voting_hard.score(x_test, y_test))
    voting_soft.fit(x_train, y_train)
    print(voting_soft.score(x_test, y_test))
- sklearn.linear_model
Function | Description |
---|---|
linear_model.LogisticRegression( ) | Logistic regression |
linear_model.Perceptron( ) | Perceptron linear model |
linear_model.SGDClassifier( ) | Linear classifiers trained with stochastic gradient descent |
linear_model.PassiveAggressiveClassifier( ) | Passive-aggressive classifier for incremental (online) learning |
LogisticRegression

    from sklearn import linear_model, datasets
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    logreg = linear_model.LogisticRegression(C=1e5)
    logreg.fit(x_train, y_train)
    prepro = logreg.score(x_test, y_test)
    print(prepro)
Perceptron

    from sklearn.datasets import load_digits
    from sklearn.linear_model import Perceptron

    x, y = load_digits(return_X_y=True)
    clf = Perceptron(tol=1e-3, random_state=0)
    clf.fit(x, y)
    print(clf.score(x, y))
SGDClassifier

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    x = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
    y = np.array([1, 1, 2, 2])
    clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
    clf.fit(x, y)
    print(clf.score(x, y))
    print(clf.predict([[-0.8, -1]]))
PassiveAggressiveClassifier

    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    x, y = make_classification(n_features=4, random_state=0)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    clf = PassiveAggressiveClassifier(max_iter=1000, random_state=0, tol=1e-3)
    clf.fit(x_train, y_train)
    print(clf.score(x_test, y_test))
- sklearn.svm

Function | Description |
---|---|
svm.SVC( ) | Support vector classification |
svm.NuSVC( ) | Nu-support vector classification |
svm.LinearSVC( ) | Linear support vector classification |
SVC

    from sklearn.svm import SVC

    x = [[2, 0], [1, 1], [2, 3]]
    y = [0, 0, 1]
    clf = SVC(kernel='linear')
    clf.fit(x, y)
    print(clf.predict([[2, 2]]))
NuSVC

    import numpy as np
    from sklearn import svm

    x = np.array([[0], [1], [2], [3]])
    y = np.array([0, 1, 2, 3])
    clf = svm.NuSVC()
    clf.fit(x, y)
    print(clf.predict([[4]]))
LinearSVC

    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.svm import LinearSVC

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red')
    plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue')
    plt.show()

    svc = LinearSVC(C=10**9)
    svc.fit(X, y)
    print(svc.score(X, y))
- sklearn.neighbors

Function | Description |
---|---|
neighbors.NearestNeighbors( ) | Unsupervised nearest-neighbor search |
neighbors.NearestCentroid( ) | Nearest-centroid classifier |
neighbors.KNeighborsClassifier( ) | K-nearest-neighbors classifier |
neighbors.KDTree( ) | KD-tree for fast nearest-neighbor lookup |
neighbors.KNeighborsTransformer( ) | Transform data into a weighted graph of its k nearest neighbors |
NearestNeighbors

    from sklearn.neighbors import NearestNeighbors

    samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
    neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
    neigh.fit(samples)
    print(neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=True))
    print(neigh.radius_neighbors([[0, 0, 1.3]], 0.4, return_distance=False))
NearestCentroid

    import numpy as np
    from sklearn.neighbors import NearestCentroid

    x = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    y = np.array([1, 1, 1, 2, 2, 2])
    clf = NearestCentroid()
    clf.fit(x, y)
    print(clf.predict([[-0.8, -1]]))
KNeighborsClassifier

    from sklearn.neighbors import KNeighborsClassifier

    x, y = [[0], [1], [2], [3]], [0, 0, 1, 1]
    neigh = KNeighborsClassifier(n_neighbors=3)
    neigh.fit(x, y)
    print(neigh.predict([[1.1]]))
KDTree

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.RandomState(0)
    x = rng.random_sample((10, 3))
    tree = KDTree(x, leaf_size=2)
    dist, ind = tree.query(x[:1], k=3)
    print(ind)
KNeighborsClassifier (multi-class)

    from sklearn.neighbors import KNeighborsClassifier

    X = [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
    y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    neigh = KNeighborsClassifier(n_neighbors=3)
    neigh.fit(X, y)
    print(neigh.predict([[1.1]]))
- sklearn.discriminant_analysis
Function | Description |
---|---|
discriminant_analysis.LinearDiscriminantAnalysis( ) | Linear discriminant analysis |
discriminant_analysis.QuadraticDiscriminantAnalysis( ) | Quadratic discriminant analysis |
LDA

    from sklearn import datasets
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

    iris = datasets.load_iris()
    X = iris.data[:-5]
    pre_x = iris.data[-5:]
    y = iris.target[:-5]
    print('first 10 raw samples:', X[:10])
    clf = LDA()
    clf.fit(X, y)
    X_r = clf.transform(X)
    pre_y = clf.predict(pre_x)
    # Dimensionality-reduced result
    print('first 10 transformed samples:', X_r[:10])
    # Predicted target classes
    print('predict value:', pre_y)
QDA

    from sklearn import datasets
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    clf = QDA()
    clf.fit(x_train, y_train)
    print(clf.score(x_test, y_test))
- sklearn.gaussian_process
Function | Description |
---|---|
gaussian_process.GaussianProcessClassifier( ) | Gaussian process classification |
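No example is given above for Gaussian process classification, so here is a minimal sketch on the iris data; the RBF kernel and its length scale are arbitrary choices made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

x, y = load_iris(return_X_y=True)
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0)
gpc.fit(x, y)
print(gpc.score(x, y))            # training accuracy
print(gpc.predict_proba(x[:2]))   # class probabilities for the first two samples
```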
- sklearn.naive_bayes
Function | Description |
---|---|
naive_bayes.GaussianNB( ) | Gaussian naive Bayes |
naive_bayes.MultinomialNB( ) | Multinomial naive Bayes |
naive_bayes.BernoulliNB( ) | Bernoulli naive Bayes |
GaussianNB

    from sklearn import datasets
    from sklearn.naive_bayes import GaussianNB

    iris = datasets.load_iris()
    clf = GaussianNB()
    clf = clf.fit(iris.data, iris.target)
    y_pre = clf.predict(iris.data)
MultinomialNB

    from sklearn import datasets
    from sklearn.naive_bayes import MultinomialNB

    iris = datasets.load_iris()
    clf = MultinomialNB()
    clf = clf.fit(iris.data, iris.target)
    y_pred = clf.predict(iris.data)
BernoulliNB

    from sklearn import datasets
    from sklearn.naive_bayes import BernoulliNB

    iris = datasets.load_iris()
    clf = BernoulliNB()
    clf = clf.fit(iris.data, iris.target)
    y_pred = clf.predict(iris.data)
Regression model
- sklearn.tree
Function | Description |
---|---|
tree.DecisionTreeRegressor( ) | Decision tree regressor |
tree.ExtraTreeRegressor( ) | Extremely randomized tree regressor |
DecisionTreeRegressor, ExtraTreeRegressor

    """Regression"""
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
    from sklearn.metrics import r2_score

    # Note: load_boston() is deprecated and removed in recent scikit-learn releases
    boston = load_boston()
    x = boston.data
    y = boston.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    dtr = DecisionTreeRegressor()
    dtr.fit(x_train, y_train)
    etr = ExtraTreeRegressor()
    etr.fit(x_train, y_train)

    yetr_pred = etr.predict(x_test)
    ydtr_pred = dtr.predict(x_test)
    print(dtr.score(x_test, y_test))
    print(r2_score(y_test, ydtr_pred))
    print(etr.score(x_test, y_test))
    print(r2_score(y_test, yetr_pred))
- sklearn.ensemble
Function | Description |
---|---|
ensemble.GradientBoostingRegressor( ) | Gradient boosting regression |
ensemble.AdaBoostRegressor( ) | AdaBoost (boosting) regression |
ensemble.BaggingRegressor( ) | Bagging regression |
ensemble.ExtraTreesRegressor( ) | Extra-trees regression |
ensemble.RandomForestRegressor( ) | Random forest regression |
GradientBoostingRegressor

    from sklearn.ensemble import GradientBoostingRegressor as GBR
    from sklearn.datasets import make_regression

    X, y = make_regression(1000, 2, noise=10)
    gbr = GBR()
    gbr.fit(X, y)
    gbr_preds = gbr.predict(X)
AdaBoostRegressor

    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.datasets import make_regression

    x, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)
    regr = AdaBoostRegressor(random_state=0, n_estimators=100)
    regr.fit(x, y)
    print(regr.predict([[0, 0, 0, 0]]))
BaggingRegressor

    from sklearn.ensemble import BaggingRegressor
    from sklearn.datasets import make_regression
    from sklearn.svm import SVR

    x, y = make_regression(n_samples=100, n_features=4, n_informative=2, n_targets=1,
                           random_state=0, shuffle=False)
    br = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=0).fit(x, y)
    print(br.predict([[0, 0, 0, 0]]))
ExtraTreesRegressor

    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import ExtraTreesRegressor

    x, y = load_diabetes(return_X_y=True)
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
    etr = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)
    print(etr.score(x_test, y_test))
RandomForestRegressor

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.datasets import make_regression

    x, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)
    rfr = RandomForestRegressor(max_depth=2, random_state=0)
    rfr.fit(x, y)
    print(rfr.predict([[0, 0, 0, 0]]))
- sklearn.linear_model
Function | Description |
---|---|
linear_model.LinearRegression( ) | Linear regression |
linear_model.Ridge( ) | Ridge regression |
linear_model.Lasso( ) | L1-regularized linear regression (Lasso) |
linear_model.ElasticNet( ) | Elastic net regression |
linear_model.MultiTaskLasso( ) | Multi-task Lasso |
linear_model.MultiTaskElasticNet( ) | Multi-task elastic net |
linear_model.Lars( ) | Least-angle regression |
linear_model.OrthogonalMatchingPursuit( ) | Orthogonal matching pursuit |
linear_model.BayesianRidge( ) | Bayesian ridge regression |
linear_model.ARDRegression( ) | Bayesian ARD (automatic relevance determination) regression |
linear_model.SGDRegressor( ) | Regression with stochastic gradient descent |
linear_model.PassiveAggressiveRegressor( ) | Passive-aggressive regressor for incremental learning |
linear_model.HuberRegressor( ) | Huber regression (robust to outliers) |

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.linear_model import Lasso

    np.random.seed(0)
    x = np.random.randn(10, 5)
    y = np.random.randn(10)
    clf1 = Ridge(alpha=1.0)
    clf2 = Lasso()
    clf2.fit(x, y)
    clf1.fit(x, y)
    print(clf1.predict(x))
    print(clf2.predict(x))
- sklearn.svm
Function | Description |
---|---|
svm.SVR( ) | Support vector machine regression |
svm.NuSVR( ) | Nu support vector regression |
svm.LinearSVR( ) | Linear support vector regression |
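A minimal sketch of support vector regression; the make_regression data and the C / epsilon values are arbitrary choices made only for illustration.

```python
from sklearn.svm import SVR, LinearSVR
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(x, y)      # kernelized SVR
lsvr = LinearSVR(C=1.0, max_iter=10000).fit(x, y)           # linear SVR
print(svr.score(x, y), lsvr.score(x, y))
```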
- sklearn.neighbors
Function | Description |
---|---|
neighbors.KNeighborsRegressor( ) | K-nearest neighbor regression |
neighbors.RadiusNeighborsRegressor( ) | Radius based nearest neighbor regression |
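A minimal sketch of the two nearest-neighbor regressors on a tiny made-up 1-D dataset:

```python
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor

x = [[0], [1], [2], [3]]
y = [0.0, 0.0, 1.0, 1.0]
knr = KNeighborsRegressor(n_neighbors=2).fit(x, y)
rnr = RadiusNeighborsRegressor(radius=1.5).fit(x, y)
print(knr.predict([[1.5]]))  # average of the 2 nearest targets
print(rnr.predict([[1.5]]))  # average of the targets within radius 1.5
```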
- sklearn.kernel_ridge
Function | Description |
---|---|
kernel_ridge.KernelRidge( ) | Kernel ridge regression |
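A minimal sketch of kernel ridge regression on synthetic noisy sine data; the kernel, alpha, and gamma values are illustrative only.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
x = rng.rand(100, 1) * 6                       # inputs in [0, 6)
y = np.sin(x).ravel() + 0.1 * rng.randn(100)   # noisy sine targets
krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.5)
krr.fit(x, y)
print(krr.score(x, y))
```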
- sklearn.gaussian_process
Function | Description |
---|---|
gaussian_process.GaussianProcessRegressor( ) | Gaussian process regression |
GaussianProcessRegressor

    from sklearn.datasets import make_friedman2
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

    x, y = make_friedman2(n_samples=500, noise=0, random_state=0)
    kernel = DotProduct() + WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(x, y)
    print(gpr.score(x, y))
- sklearn.cross_decomposition
Function | Description |
---|---|
cross_decomposition.PLSRegression( ) | Partial least squares regression |

    import pandas as pd
    from sklearn import datasets
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    # Note: load_boston() is deprecated and removed in recent scikit-learn releases
    boston = datasets.load_boston()
    x = boston.data
    y = boston.target
    x_df = pd.DataFrame(x, columns=boston.feature_names)
    y_df = pd.DataFrame(y)
    pls = PLSRegression(n_components=2)
    x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.3, random_state=1)
    pls.fit(x_train, y_train)
    print(pls.predict(x_test))
Clustering model
- sklearn.cluster
Function | Description |
---|---|
cluster.DBSCAN( ) | Density-based clustering |
mixture.GaussianMixture( ) | Gaussian mixture model (in sklearn.mixture) |
cluster.AffinityPropagation( ) | Affinity propagation clustering |
cluster.AgglomerativeClustering( ) | Agglomerative (hierarchical) clustering |
cluster.Birch( ) | BIRCH clustering (balanced iterative reducing and clustering using hierarchies) |
cluster.KMeans( ) | K-means clustering |
cluster.MiniBatchKMeans( ) | Mini-batch k-means clustering |
cluster.MeanShift( ) | Mean-shift clustering |
cluster.OPTICS( ) | OPTICS (ordering points to identify the clustering structure) |
cluster.SpectralClustering( ) | Spectral clustering |
cluster.SpectralBiclustering( ) / cluster.SpectralCoclustering( ) | Spectral biclustering / co-clustering |
cluster.ward_tree( ) | Ward hierarchical clustering tree |
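The clustering classes follow the same estimator API as the models above. A minimal sketch with KMeans and DBSCAN on blob data generated only for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

x, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x)
db = DBSCAN(eps=0.5, min_samples=5).fit(x)
print(km.cluster_centers_)   # coordinates of the 4 cluster centers
print(set(db.labels_))       # -1 marks points treated as noise by DBSCAN
```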
- Model method
Method | Description |
---|---|
xxx.fit( ) | Train the model |
xxx.get_params( ) | Get the model's parameters |
xxx.predict( ) | Predict on new input data |
xxx.score( ) | Score the model (classification / regression / clustering) |
xxx.set_params( ) | Set the model's parameters |
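A minimal sketch of this shared estimator API, using KNeighborsClassifier on the iris data purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
model = KNeighborsClassifier()
model.set_params(n_neighbors=5)   # set model parameters
model.fit(x, y)                   # train the model
print(model.get_params())         # get model parameters
print(model.predict(x[:3]))       # predict on new inputs
print(model.score(x, y))          # mean accuracy on the given data
```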
Model evaluation
- Classification model evaluation
Function | Description |
---|---|
metrics.accuracy_score( ) | Accuracy |
metrics.average_precision_score( ) | Average precision from prediction scores |
metrics.log_loss( ) | Logarithmic (cross-entropy) loss |
metrics.confusion_matrix( ) | Confusion matrix |
metrics.classification_report( ) | Classification report: precision, recall, F1 score |
metrics.roc_curve( ) | Receiver operating characteristic (ROC) curve |
metrics.auc( ) | Area under a curve (e.g. the ROC curve) from points |
metrics.roc_auc_score( ) | Area under the ROC curve (AUC) from prediction scores |
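A minimal sketch of these classification metrics, assuming a scaled logistic-regression pipeline on the breast cancer data chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_score = clf.predict_proba(x_test)[:, 1]   # probability of the positive class
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
print(metrics.roc_auc_score(y_test, y_score))
```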
- Regression model evaluation
Function | Description |
---|---|
metrics.mean_squared_error( ) | Mean squared error |
metrics.median_absolute_error( ) | Median absolute error |
metrics.r2_score( ) | Coefficient of determination (R²) |
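A minimal sketch of the regression metrics on synthetic data (the make_regression settings are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score

x, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
reg = LinearRegression().fit(x, y)
y_pred = reg.predict(x)
print(mean_squared_error(y, y_pred))
print(median_absolute_error(y, y_pred))
print(r2_score(y, y_pred))
```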
- Cluster model evaluation
Function | Description |
---|---|
metrics.adjusted_rand_score( ) | Adjusted Rand index |
metrics.silhouette_score( ) | Silhouette coefficient |
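A minimal sketch of the clustering metrics; adjusted_rand_score compares against known labels, while silhouette_score needs only the data and the cluster assignments. The blob data are generated only for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

x, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)
print(adjusted_rand_score(y_true, labels))   # agreement with the true labels
print(silhouette_score(x, labels))           # cohesion vs. separation of the clusters
```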
Model optimization
Function | Description |
---|---|
model_selection.cross_val_score( ) | Cross-validation score |
model_selection.LeaveOneOut( ) | Leave-one-out cross-validation |
model_selection.LeavePOut( ) | Leave-P-out cross-validation |
model_selection.GridSearchCV( ) | Grid search over parameters with cross-validation |
model_selection.RandomizedSearchCV( ) | Randomized parameter search with cross-validation |
model_selection.validation_curve( ) | Validation curve |
model_selection.learning_curve( ) | Learning curve |
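A minimal sketch of cross-validation and grid search on the iris data; the SVC parameter grid is an arbitrary illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

x, y = load_iris(return_X_y=True)

# Plain cross-validation of a fixed model
print(cross_val_score(SVC(kernel='rbf'), x, y, cv=5).mean())

# Grid search over a small parameter grid with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 0.01]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(x, y)
print(grid.best_params_, grid.best_score_)
```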
Write at the end
Learning machine learning and programming from articles alone makes it easy to run into a lot of bugs, which wastes a great deal of a beginner's time and chips away at the confidence needed to learn and master machine learning.
With the sklearn library you can get started with machine learning very quickly. How many days does it take? The course below gives an answer: yes, you can get started with machine learning in three days. This article may not be the friendliest material for complete beginners, but at least that online course can get you started in three days.
The course walks through the whole machine learning workflow in detail, codes up examples for each step along the way, and finally reinforces the material and the code through hands-on projects.