Use Python to analyze your CSDN personal blog data
Get the title, link, publication time, view count, collection count and other data for every one of your blog posts, sort them by view count, and store them in a single table (the script writes a CSV file, which can also be opened in Excel). When using it, you only need to enter your personal blog ID. From data acquisition to analysis and storage, it uses requests, BeautifulSoup, pandas and other third-party libraries: a complete Python crawler exercise.
Contents

- Web page analysis
  - Blog list analysis
  - Single blog analysis
- Environment configuration
- Code implementation
  - config configuration
  - Run the code
- Execution process
- Code download
Web page analysis
Blog list analysis
By analyzing the HTML source of my blog list page, we can extract the link to each article.
The URL of my blog list is: https://blog.csdn.net/xiaoma_2018/article/list/1?t=1
Note that everyone's blog ID is different, so the crawler asks you to enter your own blog ID and the number of list pages; this keeps the script general-purpose.
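To make the mapping from blog ID and page count to list-page URLs concrete, here is a minimal sketch; the helper name build_list_urls is my own and only illustrates the URL pattern used by the crawler later in this post.

    # Minimal sketch: build the list-page URLs for a given blog ID and page count.
    # build_list_urls is an illustration only, not part of the final script.
    def build_list_urls(blog_id: str, pages: int) -> list:
        base = "https://blog.csdn.net/" + blog_id + "/article/list/"
        return [base + str(page) + "?t=1" for page in range(1, pages + 1)]

    print(build_list_urls("xiaoma_2018", 2))
    # ['https://blog.csdn.net/xiaoma_2018/article/list/1?t=1',
    #  'https://blog.csdn.net/xiaoma_2018/article/list/2?t=1']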
Single blog analysis
By analyzing the HTML source of a single blog post, we can obtain information such as the article link, article title, publication time, view count and collection count.
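As a rough illustration, these fields can be pulled out with BeautifulSoup CSS selectors. The class names below are the ones used in the full script later in this post, but CSDN may change its page layout at any time, so treat them as assumptions to verify against the current page source.

    # Sketch: extract title, views, collections and publish time from one post.
    import requests
    from bs4 import BeautifulSoup

    def parse_one_post(link, user_agent="Mozilla/5.0"):
        """Return (title, views, collections, time) for a single CSDN post URL."""
        html = requests.get(link, headers={"User-Agent": user_agent}).text
        soup = BeautifulSoup(html, features="html.parser")
        title = soup.select('.title-article')[0].text
        views = soup.select('.bar-content .read-count')[0].text
        collections = soup.select('.bar-content .get-collection')[0].text
        posted = soup.select('.bar-content .time')[0].text
        # the full script additionally strips non-digit characters from the counts
        return title, views, collections, posted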
Environment configuration
This crawler was developed and run with PyCharm 2020.1.1 and Python 3.7.5.
The third-party dependency libraries used are as follows:
They can be exported by running: pip freeze > requirements.txt
    beautifulsoup4==4.9.1
    pandas==1.1.1
    requests==2.24.0
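With this list saved as requirements.txt, all three dependencies can be installed in one step by running pip install -r requirements.txt.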
Code implementation
The main idea of the code is as follows (a short sketch of this flow appears after the list):

- Read in the blog ID and the number of list pages
- Crawl the links of all blog posts
- Crawl the data of each blog post
- Store the data
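Put together, these four steps map onto the functions defined in the full script further below; a minimal sketch of the overall flow, using those function names, looks like this:

    # High-level flow; getBlogList, parseEachBlog, parseData, delTempFile and
    # results are defined in the full script in the "Run the code" section below.
    blogID = input("Enter the blog ID you want to crawl: ")
    pages = input("Enter the number of blog list pages: ")

    linklist = getBlogList(blogID, pages)    # step 2: collect all post links
    for link in linklist:
        results.append(parseEachBlog(link))  # step 3: fetch data for each post
    parseData()                              # step 4: sort by views and save to CSV
    delTempFile()                            # clean up the temporary HTML files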
config configuration
To keep the request headers and file paths used by the crawler in one place (and make it easy to crawl different blogs), these parameters are defined in a separate config.py file, shown below:
    '''
    @Func  The request header information and file path information used by the crawler
    @File  config.py
    '''
    Host = "blog.csdn.net"  # Host parameter for the request headers
    User_Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"
    Source = 'html.txt'       # temporarily stores the HTML source of the blog list pages
    EachSource = 'each.txt'   # temporarily stores the HTML source of each blog post
    OUTPUT = "Blog information.csv"  # the CSV file the blog information is written to
The User_Agent value should be set according to your own browser before running the crawler; the other parameters can be left at their default values.
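A quick way to check the configuration is to send a single request with the configured User_Agent; this sanity check is my own addition, not part of the crawler:

    # Optional sanity check: verify that a request with the configured
    # User_Agent gets a normal response from the site.
    import requests
    from config import Host, User_Agent

    resp = requests.get("https://" + Host, headers={"User-Agent": User_Agent})
    print(resp.status_code)  # expect 200 if the headers are accepted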
Run the code
    '''
    @Func  Crawl CSDN blog post data with Python and write it to a CSV table;
           the re module is used to match article URLs with a regular expression
    '''
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import os
    import re
    from config import Host, User_Agent, Source, EachSource, OUTPUT

    results = []  # stores the data of all blog posts


    def parseEachBlog(link):
        """Fetch one blog post and return [title, link, views, collections, time]."""
        headers = {"Referer": link, "User-Agent": User_Agent}
        r = requests.post(link, headers=headers)
        html = r.text
        with open(EachSource, 'w', encoding='UTF-8') as f:
            f.write(html)
        soup = BeautifulSoup(open(EachSource, 'r', encoding='UTF-8'), features="html.parser")
        readcontent = soup.select('.bar-content .read-count')
        collection = soup.select('.bar-content .get-collection')
        readcounts = re.sub(r'\D', "", str(readcontent[0]))   # keep only the digits
        collections = re.sub(r'\D', "", str(collection[0]))
        blogname = soup.select('.title-article')[0].text
        time = soup.select('.bar-content .time')[0].text
        eachBlog = [blogname, link, readcounts, collections, time]
        return eachBlog


    def getBlogList(blogID, pages):
        """Collect the links of all blog posts from the list pages."""
        listhome = "https://" + Host + "/" + blogID + "/article/list/"
        pagenums = []  # page numbers converted to strings
        for i in range(1, int(pages) + 1):
            pagenums.append(str(i))
        for number in pagenums:
            url = listhome + number + "?t=1"
            headers = {"Referer": url, "Host": Host, "User-Agent": User_Agent}
            response = requests.post(url, headers=headers)
            html = response.text
            with open(Source, 'a', encoding='UTF-8') as f:
                f.write(html)
        # Extract the links of all blog posts
        soup = BeautifulSoup(open(Source, 'r', encoding='UTF-8'), features="html.parser")
        hrefs = []
        re_pattern = "^https://blog.csdn.net/" + blogID + r"/article/details/\d+$"
        for a in soup.find_all('a', href=True):
            if a.get_text(strip=True):
                href = a['href']
                if re.match(re_pattern, href):
                    if hrefs.count(href) == 0:  # skip duplicate links
                        hrefs.append(href)
        return hrefs


    def parseData():
        """Sort the results by view count and write them to a CSV file."""
        results.sort(key=lambda result: int(result[2]), reverse=True)  # sort by views
        dataframe = pd.DataFrame(data=results)
        dataframe.columns = ['Article title', 'Article link', 'Views', 'Collections', 'Release time']
        dataframe.to_csv(OUTPUT, index=False, sep=',')


    def delTempFile():
        """Remove the temporary HTML files."""
        if os.path.exists(Source):
            os.remove(Source)
        if os.path.exists(EachSource):
            os.remove(EachSource)


    if __name__ == '__main__':
        blogID = input("Enter the blog ID you want to crawl: ")
        pages = input("Enter the number of blog list pages: ")
        print("Getting all blog links...")
        linklist = getBlogList(blogID, pages)
        print("Start getting data...")
        for i in linklist:
            print("Currently fetching: %s" % (i))
            results.append(parseEachBlog(i))
        print("Finished getting data...")
        # Parse and store the data as a .csv file
        print("Start parsing and storing data...")
        parseData()
        print("Deleting temporary files...")
        delTempFile()
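The script saves the data as a CSV file. If you prefer an actual Excel workbook, as mentioned in the introduction, a small post-processing step with pandas can convert it; this conversion is my own addition and assumes the optional openpyxl package is installed.

    # Optional: convert the CSV output to an Excel workbook (pip install openpyxl).
    import pandas as pd
    from config import OUTPUT

    df = pd.read_csv(OUTPUT)
    df.to_excel("Blog information.xlsx", index=False)
    print(df.head())  # the top posts, already sorted by views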
Execution process
Taking my own blog ID as an example, here is the execution process and the result. My blog list currently has two pages.
Start execution
End execution
Results
Code download
From the initial idea to the implementation, and then to writing up this blog post, the whole process has been a lot of fun, so I am summarizing and sharing it here.
The complete crawler code has been uploaded to my Gitee repository.
Download address: https://gitee.com/lyc96/analysis-of-personal-blogs
Welcome to follow my official account: Python crawler data analysis and mining
It records every bit of my Python learning;
Reply with [open source code] to get the source code of more open-source projects for free;
The official account shares Python knowledge and free tools every day;
This article has also been published on [Open Source China] and [Tencent Cloud Community].