Sprite recognition (CNN convolutional neural network training)
glidedsky is an Internet skills certification website whose challenges are all crawler problems. One of them, Crawler Sprite Image 2, requires recognizing digits from images, so this project imitates the MNIST example and trains a model with a CNN (convolutional neural network).
```
# Based on tensorflow 2.0
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow==2.0.0
```
```
# Project composition
├─glidedsky
│   model.h5              # Model file
│   predict.py            # Model invocation
│   train.py              # Model training
│
├─data_source
│  │  data.h5             # Dataset file
│  │  make_data_set.py    # Generate dataset
│  │  spider.py           # Crawler
│  │
│  └─imgs                 # Collected pictures
│
├─logs                    # Training visualization logs
│
├─test                    # Test pictures
```
Data acquisition: spider.py
For data acquisition, first find a page that contains all digits from 0 to 9, then use a crawler to collect the digit images so they can later serve as a dataset for deep learning.
- Each request returns a different sprite image, but the numbers on the page are fixed, so it is enough to keep requesting the same page
- Only 10 pictures (one per digit) are kept from each request, so that the sample data stays uniformly distributed
- The collection process is time-consuming and boring; the original plan was to collect 1 million images, but in the end only 450,000 were collected, which should be enough
```python
import re
import os
import uuid
import base64
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

Cookie = 'your cookies'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': Cookie,
    'Host': 'www.glidedsky.com',
    'Referer': 'http://www.glidedsky.com/level/web/crawler-basic-2?page=1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}


def get_img(text):
    """
    Extract the base64-encoded sprite image from the page source.
    :param text: page HTML containing the sprite image
    :return: PIL Image
    """
    img_str = re.findall('base64,(.*?)"', text)[0]
    img_fp = BytesIO(base64.b64decode(img_str.encode('utf-8')))
    img = Image.open(img_fp)
    return img


def crawler(url):
    text = requests.get(url, headers=headers).text
    img = get_img(text)
    rows = BeautifulSoup(text, 'lxml').find_all('div', class_="col-md-1")
    num_labels = list(str(123171140339373274129338158411319368))
    num_imgs = []
    for row in rows:
        for div in row.find_all('div'):
            css_name = div.get('class')[0].split(' ')[0]
            # Parse the CSS rules to find where each digit sits inside the sprite image
            tag_x = re.findall(f'\.{css_name} \{{ background-position-x:(.*?)px \}}', text)
            tag_y = re.findall(f'\.{css_name} \{{ background-position-y:(.*?)px \}}', text)
            width = re.findall(f'\.{css_name} \{{ width:(.*?)px \}}', text)
            height = re.findall(f'\.{css_name} \{{ height:(.*?)px \}}', text)
            tag_x = abs(int(tag_x[0]))
            tag_y = abs(int(tag_y[0]))
            width = int(width[0])
            height = int(height[0])
            box = (tag_x, tag_y, tag_x + width, tag_y + height)
            num_imgs.append(img.crop(box))
    # Keep at most one picture per digit from each page to keep the samples balanced
    save_list = [str(i) for i in range(10)]
    for num_img, num_label in zip(num_imgs, num_labels):
        if num_label in save_list:
            file_name = f'./imgs/{num_label}_{uuid.uuid1()}.png'
            num_img = num_img.resize((20, 20))
            num_img.save(file_name)
            save_list.remove(num_label)


os.makedirs('./imgs', exist_ok=True)
urls = []
for _ in range(90000):
    url = f'http://www.glidedsky.com/level/web/crawler-sprite-image-2?page=999'
    urls.append(url)
pool = ThreadPoolExecutor(max_workers=20)
for result in pool.map(crawler, urls):
    ...
```
Make dataset: make_data_set.py
All pictures are resized to a uniform 20 * 20 and converted to grayscale values; the corresponding labels are converted to one-hot codes, the data is randomly split into a training set and a test set with sklearn, and everything is finally saved as an h5 dataset file.
- The test set must not overlap with the training set, so that the generalization ability of the model can be evaluated
- Reason for not preprocessing when saving the dataset: saved directly, the data file is about 190 MB; if the normalized values are saved instead, it grows to about 1.9 GB
- h5 is the 5th generation of the Hierarchical Data Format (HDF5), a file format and library used to store scientific data
- One-hot code means exactly one position is "hot". For example, the ten digits 0-9 can each be represented by a list of length 10: 2 is [0,0,1,0,0,0,0,0,0,0], 9 is [0,0,0,0,0,0,0,0,0,1], and so on; the original value can be recovered with np.argmax() (see the sketch below).
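A minimal standalone sketch of the one-hot round trip described above (not part of the project files, just an illustration):

```python
import numpy as np

label = 2
# Encode: a length-10 list with a 1 at the label's index
one_hot = [1 if i == label else 0 for i in range(10)]
print(one_hot)            # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# Decode: the index of the maximum value recovers the label
print(np.argmax(one_hot))  # 2
```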
```python
import os
import h5py
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

images = []
labels = []
for path in os.listdir('./imgs'):
    # The digit is encoded in the file name, e.g. 3_<uuid>.png
    label = int(path.split('_')[0])
    label_one_hot = [0 if i != label else 1 for i in range(10)]
    labels.append(label_one_hot)
    img = Image.open('./imgs/' + path).resize((20, 20)).convert('L')
    img_arr = np.reshape(img, 20 * 20)
    images.append(img_arr)

# Split training set and test set
train_images, test_images, train_labels, test_labels = train_test_split(
    images, labels, test_size=0.1, random_state=0)

with h5py.File('./data.h5', 'w') as f:
    f.create_dataset('train_images', data=np.array(train_images))
    f.create_dataset('train_labels', data=np.array(train_labels))
    f.create_dataset('test_images', data=np.array(test_images))
    f.create_dataset('test_labels', data=np.array(test_labels))
```
Model training: train.py
Build a convolutional neural network model, feed it the data (400,000 training samples and 40,000 test samples), and train it.
- The dataset consists of grayscale images of black digits on a white background, stored as 20 * 20 matrices with values between 0 and 255 (black 0, white 255). Preprocessing inverts them to white digits on a black background and divides by 255.0 to complete normalization; normalizing the data helps improve the accuracy of the model
- One full pass over the training set is one epoch; there is no universal formula for choosing the number of epochs, it has to be found by experiment
- When compiling the model, you need to specify parameters such as the optimizer, the loss function, and metrics
- During model training you can specify callback functions, e.g. for saving the model or logging (see the sketch after this list)
- A trained model can be loaded and training can be continued from it
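The training script below only registers a TensorBoard callback. As an illustration of the "save the model during training" case mentioned above, here is a minimal sketch using Keras' built-in ModelCheckpoint callback; the file path and monitored metric are assumptions and are not part of the original project:

```python
import tensorflow as tf

# Hypothetical example: save the best weights seen so far after every epoch.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='./checkpoint.h5',   # assumed path, not used by this project
    monitor='val_accuracy',       # requires validation data to be passed to fit()
    save_best_only=True,
    verbose=1
)

# model.fit(train_images, train_labels,
#           validation_split=0.1,
#           epochs=5,
#           callbacks=[checkpoint_cb])
```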
```python
import os
import h5py
import tensorflow as tf
from tensorflow.keras import layers, models


class Train:
    def __init__(self):
        # Final model storage path
        self.modelpath = './model.h5'
        # Define the model, or load an existing one and continue training
        if os.path.exists(self.modelpath):
            self.model = tf.keras.models.load_model(self.modelpath)
            print(f"{self.modelpath} loaded successfully. Continue training...")
        else:
            self.model = models.Sequential([
                # First convolutional layer: 32 kernels of size 3 * 3; 20 * 20 is the size of the training images
                layers.Conv2D(32, (3, 3), activation='relu', input_shape=(20, 20, 1)),
                layers.MaxPooling2D((2, 2)),
                # Second convolutional layer: 64 kernels of size 3 * 3
                layers.Conv2D(64, (3, 3), activation='relu'),
                layers.MaxPooling2D((2, 2)),
                # Third convolutional layer: 64 kernels of size 3 * 3
                layers.Conv2D(64, (3, 3), activation='relu'),
                layers.Flatten(),
                layers.Dense(64, activation='relu'),
                layers.Dense(10, activation='softmax'),
            ])
        self.model.summary()

        # Read data
        with h5py.File('./data_source/data.h5', 'r') as f:
            self.train_images = f['train_images'][()]
            self.train_labels = f['train_labels'][()]
            self.test_images = f['test_images'][()]
            self.test_labels = f['test_labels'][()]
        train_count, test_count = 400000, 40000
        self.train_images = self.train_images[:train_count].reshape((train_count, 20, 20, 1))
        self.train_labels = self.train_labels[:train_count]
        self.test_images = self.test_images[:test_count].reshape((test_count, 20, 20, 1))
        self.test_labels = self.test_labels[:test_count]
        # Preprocessing: invert to white digits on black, then normalize to [0, 1]
        self.train_images = 1 - self.train_images / 255.0
        self.test_images = 1 - self.test_images / 255.0

    def train(self):
        # Visualization: tensorboard --logdir=D:\GitHub\antman\glidedsky\logs
        TensorBoardcallback = tf.keras.callbacks.TensorBoard(
            log_dir='logs',
            histogram_freq=1,
            write_graph=True,
            write_images=True,
            update_freq=1
        )
        self.model.compile(optimizer='Adam',
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])
        self.model.fit(self.train_images, self.train_labels, epochs=5,
                       callbacks=[TensorBoardcallback])
        self.model.save(self.modelpath)

    def test(self):
        self.model = tf.keras.models.load_model(self.modelpath)
        test_loss, test_acc = self.model.evaluate(self.test_images, self.test_labels)
        print("Accuracy: %.4f, evaluated on %d pictures" % (test_acc, len(self.test_labels)))


if __name__ == "__main__":
    app = Train()
    app.train()
    app.test()
```
Model invocation: predict.py
Input of the model: a four-dimensional array of normalized images (size 20 * 20, converted to white digits on a black background); output: a list of one-hot-style probability vectors, one per image.
- The input of the model must be preprocessed in exactly the same way as the training data
- The index of the maximum value in the output vector is the label, i.e. the digit it represents
```python
import numpy as np
import tensorflow as tf
from PIL import Image


class Predict(object):
    def __init__(self):
        self.cnn = tf.keras.models.load_model('./model.h5')

    def predict(self, image_path):
        # Read the picture as grayscale and preprocess it the same way as the training data
        img = Image.open(image_path).resize((20, 20)).convert('L')
        img_arr = 1 - np.reshape(img, (20, 20, 1)) / 255.0
        x = np.array([img_arr])

        # API reference: https://keras.io/models/model/
        y = self.cnn.predict(x)

        # Because x contains only one picture, take y[0]
        # np.argmax() gets the index of the maximum value, i.e. the digit represented
        print(image_path)
        print(y[0])
        print(' -> Predict digit', np.argmax(y[0]))


if __name__ == "__main__":
    app = Predict()
    app.predict('./test/0.png')
    app.predict('./test/3.png')
    app.predict('./test/4.png')
    app.predict('./test/7.png')
    app.predict('./test/9.png')
```
Passing the crawler challenge
Call the model directly in the crawler. Because the predictions are probabilistic, a single run may fail; running it a few times is enough to pass the challenge. If you are interested, refer to the complete code in the glidedsky clearance notes. A rough sketch of wiring the model into the crawler is shown below.
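As an illustration only (the actual clearance code lives in the notes referenced above), a hedged sketch of how the trained model might be plugged into the crawler loop; crawl_page is a hypothetical helper standing in for the sprite-cropping logic from spider.py, and the page range is assumed:

```python
import numpy as np
import tensorflow as tf

cnn = tf.keras.models.load_model('./model.h5')


def recognize(num_img):
    """Predict a single cropped digit image, preprocessed like the training data."""
    img = num_img.resize((20, 20)).convert('L')
    x = np.array([1 - np.reshape(img, (20, 20, 1)) / 255.0])
    return np.argmax(cnn.predict(x)[0])


def page_sum(numbers):
    """numbers: list of digit-image lists, one inner list per on-page number."""
    total = 0
    for digit_imgs in numbers:
        digits = ''.join(str(recognize(d)) for d in digit_imgs)
        total += int(digits)
    return total


# crawl_page(page) is a hypothetical helper that crops and groups the digit
# images of one challenge page (see spider.py for the cropping logic).
# total = sum(page_sum(crawl_page(page)) for page in range(1, 1001))  # assumed page range
```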