Implementing a clustering algorithm with scikit-learn and visualizing the result with a scatter plot

1, Background

Even a good memory is no match for a worn pen, so here I record the algorithms and ideas used in the project.

2, Problem description

Recently I took on a data-integration project for Company A, for which we developed a personnel profiling system and two algorithm models, one of which classifies drawings. Before the final meeting, Party A (Company A) provided some sample data so that we could build a demo to show to Party A's senior management.

3, Data sample

The data provided by Party A includes project data, personnel data, and drawing data. Since the goal here is only to classify drawings, the following uses only the drawing data.

The following is an example of a drawing name:

          

 

4, Ideas

  1. Segment each drawing name into words, remove the stop words, and load a user-defined dictionary; the resulting tokens are the features for analysis.

  2. Calculate the TF-IDF (term frequency-inverse document frequency) of each feature and use it as the value of that feature.

  3. For the clustering algorithm, I initially preferred DBSCAN. Unlike KMeans, it does not need the number of clusters in advance; the clustering is determined by tuning the neighborhood radius and the minimum number of points. However, theory gave way to reality: the segmented drawing names are too similar to one another, and the sample data is too small to tell them apart, so no combination of neighborhood radius and minimum number of points gave a satisfactory result. Since the goal was a demo for the client, I settled on KMeans in the end.
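To make step 2 concrete, here is a minimal pure-Python sketch of a TF-IDF calculation. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which as far as I know are the defaults of scikit-learn's TfidfVectorizer. The token lists are made-up stand-ins for segmented drawing names, not Party A's data, and the function name `tfidf` is only for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-segmented drawing names (stand-ins for jieba output)
docs = [
    ["floor", "plan", "building", "a"],
    ["floor", "plan", "building", "b"],
    ["wiring", "diagram", "building", "a"],
]


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

A term that appears in every name ("building" here) gets the lowest weight, while a term unique to one name ("wiring") gets the highest. This also hints at why very similar drawing names are hard to separate: most of their weight sits on shared terms.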

5, Code implementation

  1. Model code

    # -*- coding:UTF-8 -*-
    import jieba
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.manifold import TSNE

    # Read the drawing data
    data = pd.read_csv(r'xxx[File path].csv', encoding="utf8")
    file_userdict = 'userdict.txt'
    # Load the custom dictionary so that jieba does not split these terms
    jieba.load_userdict(file_userdict)
    # Stop words
    stop_words = ["-", "(", ")", ".", "pdf"]
    # TF-IDF vectorizer: tokenizer is the word segmenter, stop_words are the tokens to drop
    tf = TfidfVectorizer(tokenizer=jieba.lcut, stop_words=stop_words)
    # Compute the TF-IDF matrix
    X = tf.fit_transform(data["Drawing name"])
    # Convert the sparse result to a dense NumPy array
    res_matrix = X.toarray()
    # KMeans with the target number of clusters set to 2
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(X)
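To show roughly what the `kmeans.fit(X)` call does, here is a minimal pure-Python sketch of the standard k-means iteration (alternating assignment and update steps). It is not scikit-learn's implementation, which by default also uses k-means++ initialization. The points and the fixed initial centers below are made up for illustration, and the function name `kmeans_sketch` is hypothetical.

```python
import math


def kmeans_sketch(points, centers, max_iter=100):
    """Plain k-means iterations; returns (labels, centers)."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers


# Hypothetical 2D points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = kmeans_sketch(points, centers=[points[0], points[3]])
print(labels)  # → [0, 0, 0, 1, 1, 1]
print(centers)
```

In the real model the points are the TF-IDF row vectors rather than 2D coordinates, and the number of centers is what `n_clusters=2` sets.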

  2. Result visualization code

    # Cluster label assigned to each drawing name
    labels = kmeans.labels_
    # Use t-SNE to reduce the TF-IDF vectors to 2D. Fit on the features only,
    # so that the cluster labels do not influence the embedding
    tsne = TSNE()
    embedded = pd.DataFrame(tsne.fit_transform(res_matrix))
    # Split the 2D points by cluster label
    d1 = embedded[labels == 0]
    d2 = embedded[labels == 1]
    print(d1)
    print(d2)
    # Plot cluster 0 as red dots and cluster 1 as green circles
    plt.plot(d1[0], d1[1], 'r.', d2[0], d2[1], 'go')
    plt.show()

  3. Classification effect display

     

 

Posted by nrussell on Wed, 11 May 2022 08:39:12 +0300