Basic concepts of decision trees and their advantages and disadvantages
A decision tree is a graphical method that applies probability analysis intuitively: based on the known probabilities of various outcomes, a tree of decisions is built to compute the probability that the expected value of the net present value is greater than or equal to zero, in order to evaluate project risk and judge feasibility. Because the decision branches are drawn like the branches of a tree, the method is called a decision tree.
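To make the expected-value idea concrete, here is a minimal numeric sketch; the probabilities and NPV outcomes below are hypothetical, not from any real project:

```python
## A minimal sketch of the expected-value computation at a decision node.
## The probabilities and NPV outcomes are hypothetical, for illustration only.
outcomes = [(0.6, 120.0),   # 60% chance the project yields an NPV of 120
            (0.4, -50.0)]   # 40% chance the project yields an NPV of -50

expected_npv = sum(p * v for p, v in outcomes)
print(expected_npv)  # 0.6*120 + 0.4*(-50) = 52.0 > 0, so this branch looks feasible
```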
The main advantages of decision trees:
- Highly interpretable: the model can generate understandable rules.
- Feature importance can be read directly off the tree.
- The computational complexity of the model is low.
The main disadvantages of decision trees:
- The model is prone to overfitting and needs to be handled with pruning techniques.
- It does not make good use of continuous features.
- Its predictive power is limited and cannot match that of stronger supervised models.
- The variance is high, and a slight change in the data distribution can easily result in a completely different tree structure.
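Since the model trained later in this example uses `criterion='entropy'`, a minimal sketch of how an entropy-based split is scored may help; the label arrays below are made up for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

## Hypothetical parent node and a candidate split into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 1])
right  = np.array([0, 1, 1, 1])

## Information gain = parent entropy - weighted average child entropy
gain = entropy(parent) - (len(left) / len(parent) * entropy(left)
                          + len(right) / len(parent) * entropy(right))
print(gain)  # ~0.19 bits: at each node the split with the largest gain is chosen
```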
An example of a decision tree based on the penguin dataset
- Step1: Function library import and data reading
```python
## Basic function libraries
import numpy as np
import pandas as pd

## Plotting function libraries
import matplotlib.pyplot as plt
import seaborn as sns
```
This time we choose the penguin dataset (palmerpenguins) to practice the method. The dataset contains 8 variables in total: 7 feature variables and 1 target categorical variable. There are 344 samples, and the target variable is the penguin species, one of three species (Adélie, Chinstrap, and Gentoo). The seven features are the island, culmen (bill) length, culmen depth, flipper length, body mass, sex, and year.
```python
## We use the read_csv function that comes with Pandas to read the data into DataFrame format
data = pd.read_csv('penguins_raw.csv')

## For simplicity we keep only four features; interested readers can study the meaning and usage of the other features
data = data[['Species', 'Culmen Length (mm)', 'Culmen Depth (mm)',
             'Flipper Length (mm)', 'Body Mass (g)']]
```
- Step2: Simple view of data information
```python
## Use .info() to view the overall information of the data
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 5 columns):
Species               344 non-null object
Culmen Length (mm)    342 non-null float64
Culmen Depth (mm)     342 non-null float64
Flipper Length (mm)   342 non-null float64
Body Mass (g)         342 non-null float64
dtypes: float64(4), object(1)
memory usage: 13.6+ KB
```
```python
## For a quick look at the data we can use .head() for the first rows and .tail() for the last rows
data.head()
```
Here we find NaNs in the dataset. NaNs generally represent missing values, which may come from errors in data collection or processing. Here we fill the missing values with -1; other strategies such as median filling or mean filling also exist, and interested readers can try them (a sketch of the median option follows the next code block).
```python
data = data.fillna(-1)  # Fill missing values with -1
data.tail()
```
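As mentioned above, median filling is another option. A minimal sketch, assuming it is applied to the raw data before the `fillna(-1)` step; it is not used in the rest of this example:

```python
## Sketch: fill each numeric feature's NaNs with that column's median.
## Assumes it runs on the raw data, i.e. before the fillna(-1) step above.
numeric_cols = ['Culmen Length (mm)', 'Culmen Depth (mm)',
                'Flipper Length (mm)', 'Body Mass (g)']
data_median = data.copy()
data_median[numeric_cols] = data_median[numeric_cols].fillna(data_median[numeric_cols].median())
```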
```python
## The corresponding category labels are the three penguin species: 'Adelie Penguin', 'Gentoo penguin', and 'Chinstrap penguin'.
data['Species'].unique()
```

```
array(['Adelie Penguin (Pygoscelis adeliae)',
       'Gentoo penguin (Pygoscelis papua)',
       'Chinstrap penguin (Pygoscelis antarctica)'], dtype=object)
```
```python
## Use the value_counts function to view the number of samples in each category
pd.Series(data['Species']).value_counts()
```

```
Adelie Penguin (Pygoscelis adeliae)          152
Gentoo penguin (Pygoscelis papua)            124
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64
```
```python
## Compute some descriptive statistics of the features
data.describe()
```
- Step3: Visual description
```python
## Scatter visualization of feature and label combinations
sns.pairplot(data=data, diag_kind='hist', hue='Species')
plt.show()
```
```python
'''For convenience we convert the labels to numbers:
   'Adelie Penguin (Pygoscelis adeliae)'       ------ 0
   'Gentoo penguin (Pygoscelis papua)'         ------ 1
   'Chinstrap penguin (Pygoscelis antarctica)' ------ 2
'''
def trans(x):
    if x == data['Species'].unique()[0]:
        return 0
    if x == data['Species'].unique()[1]:
        return 1
    if x == data['Species'].unique()[2]:
        return 2

data['Species'] = data['Species'].apply(trans)
```
Draw boxplots of each feature for the different penguin species; the boxplots also show how the class distributions differ across features.
```python
for col in data.columns:
    if col != 'Species':
        sns.boxplot(x='Species', y=col, saturation=0.5, palette='pastel', data=data)
        plt.title(col)
        plt.show()
```
```python
## Select the first three features to draw a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

data_class0 = data[data['Species'] == 0].values
data_class1 = data[data['Species'] == 1].values
data_class2 = data[data['Species'] == 2].values

## 'Adelie'(0), 'Gentoo'(1), 'Chinstrap'(2)
## Column 0 is the label, so the first three features are columns 1-3
ax.scatter(data_class0[:, 1], data_class0[:, 2], data_class0[:, 3], label=data['Species'].unique()[0])
ax.scatter(data_class1[:, 1], data_class1[:, 2], data_class1[:, 3], label=data['Species'].unique()[1])
ax.scatter(data_class2[:, 1], data_class2[:, 2], data_class2[:, 3], label=data['Species'].unique()[2])
plt.legend()
plt.show()
```
- Step4: Use the decision tree model to train and predict on a binary classification task
```python
## To properly evaluate model performance, we split the data into training and test sets,
## train the model on the training set, and verify its performance on the test set.
from sklearn.model_selection import train_test_split

## Select samples of classes 0 and 1 (excluding samples of class 2)
data_target_part = data[data['Species'].isin([0, 1])][['Species']]
data_features_part = data[data['Species'].isin([0, 1])][['Culmen Length (mm)', 'Culmen Depth (mm)',
                                                         'Flipper Length (mm)', 'Body Mass (g)']]

## Test set size is 20%: an 80%/20% split
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part,
                                                    test_size=0.2, random_state=2020)
```
```python
## Import the decision tree model from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

## Define the decision tree model
clf = DecisionTreeClassifier(criterion='entropy')

## Train the decision tree model on the training set
clf.fit(x_train, y_train)
```

```
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
```
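The split above reserves 20% of the samples for verification; a minimal evaluation sketch using standard sklearn metrics (not part of the original listing) could look like this:

```python
## Sketch: evaluate the trained tree on the training and test sets
from sklearn import metrics

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

print('Train accuracy:', metrics.accuracy_score(y_train, train_predict))
print('Test accuracy:', metrics.accuracy_score(y_test, test_predict))

## The confusion matrix shows which class is mistaken for which
print(metrics.confusion_matrix(y_test, test_predict))
```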
```python
## Visualization of the trained tree
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("penguins")  # writes the rendered tree to a file (PDF by default)
```