Building your own website is one of the signs of success as a coder.
Then there are the other signs of success: a Bai Fumei on your left arm, a string of barbecue skewers in your right hand, and a Santana under your feet.
Um
Crawling data
Some of you will ask: Uncle Fish, weren't you designing database tables? Why are you back to crawling?
Ha ha~
Buy two get one free
We crawl the data so that we can store it straight into the database and skip the whole business of making up data by hand~
The site we're crawling today is this one:
url = https://arxiv.org/list/cs/recent
It's a collection of masterpieces by the great gods abroad, and they're free!
If you can't read them, Google Translate is your friend~!
Let's see what this website looks like
Old rules: straight to the code.
```python
# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""
import requests
import csv
import traceback
from bs4 import BeautifulSoup
from PaperWeb.Servers.utils.requests import get_http_session


def gen_paper(skip):
    # Pre-set these so the error logging in except never hits an undefined name
    title, full_url, authors = '', '', ''
    try:
        # The skip parameter of the url pages through the list, 100 items at a time
        url = f'https://arxiv.org/list/cs/pastweek?skip={skip}&show=100'
        # Set headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
        }
        # Set the proxy. If you have not bought one, you do not need to fill this in
        proxies = {
            'http': "http://127.0.0.1:1092",
            'https': "http://127.0.0.1:1092"
        }
        # Use the http session pool to avoid exhausting connections
        r = get_http_session().get(url, headers=headers, proxies=proxies)
        # If you are not using a proxy, comment out proxies=proxies:
        # r = get_http_session().get(url, headers=headers)

        # Check the return status
        if r.status_code != 200:
            return False, 'Failed to request web page'

        # lxml is used because it parses quickly
        soup = BeautifulSoup(r.text, 'lxml')
        # The <dt> tags hold the paper url information
        all_dt = soup.find_all('dt')
        # The <dd> tags hold the paper title and authors
        all_dd = soup.find_all('dd')

        # Use zip to loop over the dt and dd contents together
        for dt, dd in zip(all_dt, all_dd):
            # Because class is a keyword, BeautifulSoup uses class_ instead;
            # get the url from the href of the <a> tag under that class
            api_url = dt.find(class_='list-identifier').find('a').get('href')
            root_url = 'https://arxiv.org'
            full_url = root_url + api_url

            # First try contents, which returns the title directly as a list;
            # text returns a lot of extra, messy content
            title = dd.find(class_='list-title mathjax').contents
            # If the list has 3 or more items, the title is the element at index 2
            if len(title) >= 3:
                title = title[2]
            else:
                # Fall back to text, which returns a str
                title = dd.find(class_='list-title mathjax').text

            # Paper authors
            authors = dd.find(class_='descriptor').text
            # The line starts with "Authors:", so split that prefix off
            authors = authors.split(':')[1].replace('\n', '')

            # generator
            yield title, full_url, authors
    except Exception as e:
        # Write the error information to the log
        log(traceback.format_exc(), 'error', 'gen_paper.log')
        error_str = f'full_url:[{full_url}], authors:[{authors}], title:[{title}]'
        log(error_str, 'error', 'gen_paper.log')
        # # Print the error
        # print(e)
        # traceback.print_exc()


def log(content, level, filepath):
    """
    :param content: the error message to record
    :param level: error level
    :param filepath: log file path
    :return:
    """
    if level == 'error':
        with open(filepath, 'a') as f:
            f.write(content)
    elif level == 'fail':
        with open(filepath, 'a') as f:
            f.write(content)


def run_main():
    """
    Save the results to a csv file
    :return:
    """
    # Define an empty list
    resl = []
    # Loop over all the data: 1273 papers in total, stepping by 100
    for i in range(0, 1273, 100):
        # i here is skip
        for title, full_url, authors in gen_paper(i):
            # Add to the list
            resl.append([title, full_url, authors])
            print(title, 'done')

    # Open the file and save the data (newline='' keeps csv from adding blank rows)
    with open('paper.csv', 'w', newline='') as f:
        cw = csv.writer(f)
        # Write the data to the csv, one row at a time
        for row in resl:
            cw.writerow(row)


if __name__ == '__main__':
    run_main()
```
I won't analyse the code in much detail here; if anything is unclear, check the comments.
Personally I'd say, well~, although not every line has a comment, it's not as if every line has no comment either!
If you still don't follow, read the article "Python 3: multi-threaded crawling of a Bilibili UP master's video danmaku and comments"; whatever can be explained is explained there!
Look, this is the proxy setup:
```python
# Set the proxy. If you have not bought one, you do not need to fill this in
proxies = {
    'http': "http://127.0.0.1:1092",
    'https': "http://127.0.0.1:1092"
}
# Use the http session pool to avoid exhausting connections; pass in three parameters
r = get_http_session().get(url, headers=headers, proxies=proxies)
```
Proxies cost money. If you don't make a living from crawling and are just an enthusiast, you can grab a free proxy. It's unstable, but it will do~
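By the way, the get_http_session helper imported at the top comes from the PaperWeb project and isn't shown in this post. Here's a minimal sketch of what such a helper might look like, assuming it's just a shared requests.Session with connection pooling and retries; the retry settings below are this sketch's own assumptions, not the project's actual code.

```python
# A minimal sketch of an http session helper (assumption: a shared requests.Session
# with pooling and retries; the real PaperWeb helper may be implemented differently).
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

_session = None


def get_http_session():
    # Reuse one Session so connections are pooled instead of re-created per request
    global _session
    if _session is None:
        _session = requests.Session()
        retry = Retry(total=3, backoff_factor=0.5,
                      status_forcelist=[500, 502, 503, 504])
        adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
        _session.mount('http://', adapter)
        _session.mount('https://', adapter)
    return _session
```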
Operation results:
Why show the results?
So you can have a taste of what got crawled and judge for yourself whether it smells good or not.
Create database
Create database table
One way is to keep the table-creation SQL as a file inside the PyCharm project.
As follows:
create.sql
```sql
-- Paper table
CREATE TABLE `papers` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `title` varchar(500) NOT NULL DEFAULT '' COMMENT 'Paper title',
    `url` varchar(200) NOT NULL DEFAULT '' COMMENT 'Paper url',
    `authors` varchar(200) NOT NULL DEFAULT '' COMMENT 'Paper authors',
    `create_time` int(20) NOT NULL DEFAULT 0 COMMENT 'Creation time',
    `update_time` int(20) NOT NULL DEFAULT 0 COMMENT 'Update time',
    `is_delete` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'Delete flag',
    PRIMARY KEY (`id`),
    KEY `title_IDX` (`title`) USING BTREE,
    KEY `authors_IDX` (`authors`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Paper table'
```
This is just a basic MySQL CREATE TABLE statement. If you're not familiar with it, you can read Xiaoyu's article "Basic usage of SQL I".
If you don't understand it, just look at your bank card balance, your left hand and right hand~
I guess so!
After writing it, don't forget to execute it against the database, otherwise how would the table ever get created~
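You can run create.sql in any MySQL client. If you'd rather do it from the project, here is a minimal sketch, assuming create.sql sits next to the script, the papers database already exists, and the connection settings match the ones used in mysql.py below:

```python
# A minimal sketch for running create.sql from the project
# (assumptions: the file sits next to this script, the papers database exists,
# and the connection settings match mysql.py below).
import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='123456', db='papers', charset='utf8')
try:
    with open('create.sql', 'r') as f:
        # Drop the leading "--" comment lines and keep the CREATE TABLE statement
        statement = '\n'.join(line for line in f if not line.lstrip().startswith('--'))
    with conn.cursor() as cursor:
        cursor.execute(statement)
    conn.commit()
finally:
    conn.close()
```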
Connect database
Once the database is created, we connect to it through pymysql so that we can actually use it~
On with the code.
mysql.py
```python
# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""
import pymysql
import logging

'''
Create the mysql database connection
'''


class MySQL(object):
    # Basic connection information for the database
    def __init__(self, host='localhost', port=3306, user='root', password='123456', db='test'):
        # cursorclass=pymysql.cursors.DictCursor makes queries return dicts
        # --> the default return value is a tuple
        self.conn = pymysql.connect(
            host=host,
            port=port,
            user=user,
            password=password,
            db=db,
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor
        )
        # Define the logger
        self.log = logging.getLogger(__name__)

    def execute(self, sql, kwargs=None):
        # Run a statement over the mysql connection
        try:
            # Get a cursor
            cursor = self.conn.cursor()
            # Execute the statement through the cursor
            cursor.execute(sql, kwargs)
            # Commit inserts, deletes and other writes
            self.conn.commit()
            return cursor
        except Exception as e:
            # Record the error message
            self.log.error(f'mysql execute error:{e}', exc_info=True)
            raise e

    def query(self, sql, kwargs=None):
        # Query method
        cursor = None
        try:
            cursor = self.execute(sql, kwargs)
            if cursor:
                # Return everything the query found
                return cursor.fetchall()
            else:
                raise Exception(f'sql error:{sql}')
        except Exception as e:
            self.log.error(e)
            raise e
        finally:
            # Close the cursor if it was opened
            if cursor:
                cursor.close()

    def insert(self, sql, kwargs=None):
        # Data insertion
        cursor = None
        try:
            cursor = self.execute(sql, kwargs)
            if cursor:
                # Get the id of the last inserted row
                row_id = cursor.lastrowid
                return row_id
            else:
                raise Exception(f'sql error:{sql}')
        except Exception as e:
            self.log.error(e)
            raise e
        finally:
            if cursor:
                cursor.close()

    def escape_string(self, _):
        # Escape special characters in the data before it goes into the sql string
        return pymysql.escape_string(_)


db = MySQL(user='root', password='123456', db='papers')
```
The pymysql library is used here. If you haven't used it before, take a look at Xiaoyu's article "Python 3 connects to a MySQL database".
Data storage
We store the crawled data in the database.
Again, go directly to the code
csv_to_mysql.py
```python
# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""
import csv
import time

from PaperWeb.libs.mysql import db


def get_csv_info(path='paper.csv'):
    # Open the csv file and yield its rows one by one
    csvfile = open(path, 'r')
    reader = csv.reader(csvfile)
    for item in reader:
        yield item


def get_insert_sql():
    # Insert the csv data into the database
    items = []
    # Get the current time
    _time = int(time.time())
    for item in get_csv_info():
        # Escape the characters coming out of the csv file
        item = [db.escape_string(_) for _ in item]
        # Build one value tuple per row, in the same order as the table columns
        items.append(f"('{item[0]}','{item[1]}','{item[2]}',{_time},{_time})")
    # Join all the value tuples
    values = ','.join(items)
    # Execute the sql insert statement
    sql = f'''
        INSERT INTO papers (`title`, `url`, `authors`, `create_time`, `update_time`)
        VALUES {values}
    '''
    row_id = db.insert(sql)
    print(f'{row_id} data insertion completed')


if __name__ == '__main__':
    get_insert_sql()
```
After execution, you can go to the database and see the results!
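If you don't feel like opening a MySQL client, here's a quick sanity check you can run, assuming the db object from mysql.py and the papers table above:

```python
# A quick sanity check (assumption: the db object from mysql.py and the papers table above).
from PaperWeb.libs.mysql import db

# DictCursor is configured in mysql.py, so every row comes back as a dict
rows = db.query("SELECT COUNT(*) AS total FROM papers")
print(rows[0]['total'], 'papers stored')

# Peek at a few rows
for paper in db.query("SELECT title, authors, url FROM papers LIMIT 3"):
    print(paper['title'], '|', paper['authors'], '|', paper['url'])
```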
If you say it's not perfect, Xiaoyu thinks that's impossible~
If it is perfect, then all of Xiaoyu's grinding away at this until midnight wasn't wasted!
After class in the evening, at 10:30,
I had some coffee and took the baby for a walk,
and once the baby was asleep, I started on this code.
Easter egg
After the website is built, Xiaoyu has two goals:
1. Be able to write some blog posts on Python data analysis;
2. Share Xiaoyu's experience of going through every round of Ali's interviews and finally landing an offer, along with the interview questions.
Haha, this is an early spoiler~
Why write about the Ali interview experience?
Because recently a friend who was preparing for an Ali interview asked me quite a few questions about it. Rather than going over the Ali interview one person at a time, it's better to write it all up and share it with more people!
In this way, I can save more time and do more things!!
OK, finally, let's wrap up today's tech share with a line from Wang Er:
If not forced by life, who would be willing to be this talented!