Python3, database table design and data storage for website construction~

Building your own website is one of the signs of success as a coder.
Other signs of success? A "Bai Fumei" on your left arm, a skewer of barbecue in your right hand, and a Santana under your feet.
Um.

Crawling data

Some readers will ask: Uncle Fish, aren't you designing database tables? Why are we crawling data again?
Ha ha~
Buy two, get one free.
We crawl the data so it can go straight into the database, which saves us from having to make up test data~
The site we're crawling today is this one:

url = https://arxiv.org/list/cs/recent

It hosts the masterworks of great researchers from around the world, and they're free!
If you can't read them, Google Translate is your friend~!
Let's take a look at what this website looks like.

As usual, straight to the code:

# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""

import requests
import csv
import traceback
from bs4 import BeautifulSoup
from PaperWeb.Servers.utils.requests import get_http_session

def gen_paper(skip):
    try:
        # skip is the paging offset; show=100 returns 100 entries per page
        url = f'https://arxiv.org/list/cs/pastweek?skip={skip}&show=100'
        # Set headers
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'

        }
        # Set a proxy. If you haven't bought one, you don't need to fill this in
        proxies = {
            'http': "http://127.0.0.1:1092",
            'https': "http://127.0.0.1:1092"
        }
        # Use an http session pool to avoid exhausting connections
        r = get_http_session().get(url, headers=headers, proxies=proxies)
        # If no proxy is used, drop the proxies=proxies argument
        # r = get_http_session().get(url, headers=headers)
        #Judge return status
        if r.status_code !=200:
            return False,'Failed to request web page'
        #lxml is used because of its fast parsing speed
        soup = BeautifulSoup(r.text,'lxml')

        #Get the information of the paper url
        all_dt = soup.find_all('dt')
        #Obtain the title and author of the paper
        all_dd = soup.find_all('dd')

        # zip dt and dd together and loop over the pairs
        for dt, dd in zip(all_dt, all_dd):
            # class is a Python keyword, so BeautifulSoup uses class_ instead
            # Get the paper url from the href of the <a> tag inside that class
            api_url = dt.find(class_='list-identifier').find('a').get('href')
            root_url = 'https://arxiv.org'
            full_url = root_url + api_url

            """
            First use contents,Is obtained directly title The content of, text Get a lot of content, many and messy
            """

            #Get the title, get it through the contents under the class tag, and return to the list
            title = dd.find(class_ = 'list-title mathjax').contents
            #Judge the content. If the length of titles is > = 3, intercept part 2
            if len(title) >=3:
                title = title[2]
            else:
                #text, str is returned,
                title = dd.find(class_ = 'list-title mathjax').text

            #Paper author
            authors = dd.find(class_='descriptor').text
            #Because there are many authors in a line, it needs to be segmented
            authors = authors.split(':')[1].replace('\n','')

            #generator 
            yield title,full_url,authors
    except Exception as e:
        # Write the error information to the log
        log(traceback.format_exc(), 'error', 'gen_paper.log')
        error_str = f'title:[{title}], full_url:[{full_url}], authors:[{authors}]'
        log(error_str,'error','gen_paper.log')
        # #Printing error
        # print(e)
        # traceback.print_exc()

def log(content, level, filepath):
    """
    :param content: the error message to write
    :param level: error level ('error' or 'fail')
    :param filepath: path of the log file
    :return:
    """
    if level == 'error':
        with open(filepath, 'a') as f:
            f.write(content)
    elif level == 'fail':
        with open(filepath, 'a') as f:
            f.write(content)

def run_main():
    '''
    Save the results to a csv file
    :return:
    '''
    # Define an empty list
    resl = []
    # Loop over all the data: about 1273 papers in total, in steps of 100
    for i in range(0, 1273, 100):
        # i here is the skip offset
        # gen_paper yields (title, full_url, authors), so unpack in that order
        for title, full_url, authors in gen_paper(i):
            # Add to the list
            resl.append([title, full_url, authors])
            print(title, 'done')

    # Open the file once, after crawling, and save everything
    with open('paper.csv', 'w') as f:
        cw = csv.writer(f)
        # Write the data to the csv
        for row in resl:
            # Write one line per paper
            cw.writerow(row)


if __name__ == '__main__':
    run_main()

I won't analyze the code in much detail here; if anything is unclear, the comments should help.
Personally I think, well~, not every line has a comment, but it's nowhere near having no comments at all!
If you still don't follow, read the earlier article "Python 3, multi-threaded crawling of a Bilibili UP host's video danmaku and comments"; whatever can be explained is explained there!
Look, this is the proxy part again:

        # Set a proxy. If you haven't bought one, you don't need to fill this in
        proxies = {
            'http': "http://127.0.0.1:1092",
            'https': "http://127.0.0.1:1092"
        }
        # Use an http session pool to avoid exhausting connections; pass in three arguments
        r = get_http_session().get(url, headers=headers, proxies=proxies)

Proxies cost money. If you don't make a living from crawlers and are just a hobbyist, you can grab a free proxy; it's unstable, but it will do~
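
By the way, get_http_session() is imported from the project's own PaperWeb.Servers.utils.requests module, which isn't shown in this post. If you don't have it, here is a minimal sketch of what such a pooled, retrying session helper could look like; the function name is kept the same, but the retry and pool numbers are my own assumptions:

# Minimal sketch of a get_http_session helper -- the real
# PaperWeb.Servers.utils.requests module isn't shown in this post,
# so the retry/pool settings here are assumptions, not the author's code.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_http_session(retries=3, backoff=0.5, pool_size=10):
    # A requests.Session with connection pooling and automatic retries
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=backoff,
                  status_forcelist=(500, 502, 503, 504))
    adapter = HTTPAdapter(max_retries=retry,
                          pool_connections=pool_size,
                          pool_maxsize=pool_size)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session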
Operation results:

Why show the results?
So you can get a taste of what the crawl actually produces, and judge for yourself whether it's any good.
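
If you want a taste yourself, a few lines with the csv module will print the first rows of paper.csv (assuming the crawl above has already produced the file):

# Peek at the first few rows crawled into paper.csv
import csv

with open('paper.csv', 'r') as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)   # [title, url, authors]
        if i >= 4:   # only show the first 5 rows
            break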

Create database

Create database table

First, write the table-creation SQL into the PyCharm project.
As follows:
create.sql

-- Papers table

CREATE TABLE `papers` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `title` varchar(500) NOT NULL DEFAULT '' COMMENT 'Paper title',
    `url` varchar(200) NOT NULL DEFAULT '' COMMENT 'Paper url',
    `authors` varchar(200) NOT NULL DEFAULT '' COMMENT 'Paper authors',
    `create_time` int(20) NOT NULL DEFAULT 0 COMMENT 'Creation time',
    `update_time` int(20) NOT NULL DEFAULT 0 COMMENT 'Update time',
    `is_delete` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'Soft-delete flag',
    PRIMARY KEY (`id`),
    KEY `title_IDX` (`title`) USING BTREE,
    KEY `authors_IDX` (`authors`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Papers table';

This is just basic MySQL table creation. If it's new to you, see Xiaoyu's article "Basic usage of SQL I".
If you still don't get it, just look at your bank card balance, your left hand and your right hand~
I guess!
After writing the SQL, don't forget to actually execute it against the database, otherwise how would the table ever get created~
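
If you'd rather run it from Python than from a MySQL client, here is a hedged sketch that feeds create.sql to pymysql; it assumes the same local server, root/123456 credentials and a papers database that the rest of this post uses:

# Sketch: execute create.sql through pymysql.
# Assumes a local MySQL server with the credentials used elsewhere in this post.
import pymysql

def run_sql_file(path='create.sql'):
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='123456', db='papers', charset='utf8')
    try:
        with open(path, 'r') as f:
            sql = f.read()
        with conn.cursor() as cursor:
            cursor.execute(sql)   # a single CREATE TABLE statement
        conn.commit()
    finally:
        conn.close()

if __name__ == '__main__':
    run_sql_file()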

Connect database

Now that the database exists, we connect to it through pymysql so we can actually use it~
On to the code:
mysql.py

# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""

import pymysql
import logging

'''
Create the MySQL database connection
'''

class MySQL(object):
    # Basic connection settings for the database
    def __init__(self,host='localhost',port = 3306,user='root',password = '123456',db='test'):
        # cursorclass = pymysql.cursors.DictCursor makes queries return dicts; the default return type is tuple
        self.conn = pymysql.connect(
            host = host,
            port = port,
            user=user,
            password=password,
            db=db,
            charset = 'utf8',
            cursorclass = pymysql.cursors.DictCursor
        )
        #Define logger
        self.log = logging.getLogger(__name__)

    def execute(self,sql,kwargs = None):
        # Execute a SQL statement on the connection
        try:
            #Get cursor
            cursor = self.conn.cursor()
            #Get execution method through cursor
            cursor.execute(sql,kwargs)
            #Commit, so that inserts, deletes and other writes take effect
            self.conn.commit()
            return cursor

        except Exception as e:
            #Record error messages
            self.log.error(f'mysql execute error:{e}',exc_info=True)
            raise e

    def query(self,sql,kwargs = None):
        #Create query method
        try:
            cursor = self.execute(sql,kwargs)

            if cursor:
                # Returns all the contents found in the query
                return cursor.fetchall()

            else:
                raise Exception(f'sql error:{sql}')
        except Exception as e:
            self.log.error(e)
            raise e
        finally:
            #Close cursor
            cursor.close()

    def insert(self,sql,kwargs = None):
        #Data insertion
        try:
            cursor  = self.execute(sql,kwargs)
            if cursor:
                #Get the id of the last row
                row_id = cursor.lastrowid
                return row_id
            else:
                raise Exception(f'sql error:{sql}')
        except Exception as e:
            self.log.error(e)
            raise e
        finally:
            cursor.close()

    def escape_string(self,_):
        #Escape special characters in the data so they don't break the SQL string
        return pymysql.escape_string(_)

db = MySQL(user='root',password='123456',db='papers')


The pymysql library is used here. If you haven't used it before, see Xiaoyu's article "Python 3 links to MySQL database".
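
With that in place, the rest of the project just imports db and calls query or insert. A quick usage sketch (the table and columns are the papers table created earlier; the inserted row is only dummy data):

# Usage sketch for the db helper above (dummy data, for illustration only)
from PaperWeb.libs.mysql import db

# query() returns a list of dicts, thanks to DictCursor
rows = db.query("SELECT id, title, authors FROM papers LIMIT 5")
for row in rows:
    print(row['id'], row['title'])

# insert() returns the id of the last inserted row;
# %s placeholders let pymysql handle the quoting
new_id = db.insert(
    "INSERT INTO papers (title, url, authors, create_time, update_time) "
    "VALUES (%s, %s, %s, %s, %s)",
    ('a demo title', 'https://arxiv.org/abs/xxxx.xxxxx', 'some author', 0, 0)
)
print('inserted row id:', new_id)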

Data storage

Now we store the crawled data in the database.
Again, straight to the code:

csv_to_mysql.py

# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""

import csv
import time
from PaperWeb.libs.mysql import db

def get_csv_info(path='paper.csv'):
    #Open csv file
    csvfile = open(path,'r')
    reader = csv.reader(csvfile)
    for item in reader:
        yield item

def get_insert_sql():
    #Insert data into database
    items = []
    #Get time
    _time = int(time.time())
    for item in get_csv_info():
        #Transcoding characters in csv files
        item = [db.escape_string(_) for _ in item]
        #Build one VALUES tuple per row, matching the columns of the papers table
        items.append(f"('{item[0]}','{item[1]}','{item[2]}',{_time},{_time})")
    #Get values value
    values = ','.join(items)

    #Build and execute the INSERT statement
    sql = f'''
        INSERT INTO
         papers (`title`,`url`,`authors`,`create_time`,`update_time`)
        VALUES
         {values}
    '''
    row_id = db.insert(sql)
    print(f'Insert finished, last row id: {row_id}')


if __name__ == '__main__':
    get_insert_sql()

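Building the VALUES string by hand with escape_string works, but if you'd rather let the driver do the quoting, here is an alternative sketch using a parameterized executemany; same paper.csv and papers table, just not the author's original approach:

# Alternative sketch: parameterized bulk insert, letting pymysql do the escaping
import csv
import time
import pymysql

def insert_csv(path='paper.csv'):
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='123456', db='papers', charset='utf8')
    _time = int(time.time())
    sql = ("INSERT INTO papers (title, url, authors, create_time, update_time) "
           "VALUES (%s, %s, %s, %s, %s)")
    with open(path, 'r') as f:
        rows = [(r[0], r[1], r[2], _time, _time) for r in csv.reader(f) if r]
    try:
        with conn.cursor() as cursor:
            cursor.executemany(sql, rows)
        conn.commit()
    finally:
        conn.close()

if __name__ == '__main__':
    insert_csv()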

After running it, go to the database and check the results!
If they're not perfect, Xiaoyu simply won't believe it~
And if they are perfect, then the late night Xiaoyu spent grinding on this code until 12 wasn't wasted!
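
If you'd rather check from Python than from a MySQL client, a quick sanity query against the db helper from mysql.py does the trick:

# Sanity check: count the rows that landed in the papers table
from PaperWeb.libs.mysql import db

result = db.query("SELECT COUNT(*) AS cnt FROM papers")
print('rows in papers:', result[0]['cnt'])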

Evening class ended at 10:30.
I had some coffee, walked the baby,
and once the baby was asleep, I started on this code.

Easter egg

After the website is built, Xiaoyu has two goals:
1. Write some blog posts on Python data analysis;
2. Share Xiaoyu's experience with every round of the Alibaba interview, the questions asked, and how the offer finally came through.

Haha, consider this an early spoiler~

Why write about the Alibaba interview experience?
Because recently a friend preparing for an Alibaba interview asked me a bunch of questions about it. Rather than going over the interview one on one, it's better to write it all up and share it with more people!

That way I save more time and can get more things done!!

OK, finally, let's end today's technology sharing with a sentence from Wang Er:

If not forced by life, who would willingly make themselves this talented!
