Do you really read blogs? Let's see what's going on

 

 

Using Python to analyze your CSDN personal blog data

 

Fetch the titles, links, publication times, view counts, collection counts, and other data for all of your personal blog posts, sort them by number of views, and store everything in a single CSV table. To use it, just enter your personal blog ID and the number of list pages. From data acquisition to parsing and storage, it uses requests, BeautifulSoup, pandas and other third-party libraries: a complete Python crawler exercise.

 

 

Contents

  • Web page analysis

    • Blog list analysis

    • Single blog analysis

  • Environment configuration

  • Code implementation

    • config.py configuration

    • Run code

  • Execution process

  • Code download

 

Web page analysis

Blog list analysis


By analyzing the HTML source of my blog list page, we can extract the link to each article.
My blog list URL is: https://blog.csdn.net/xiaoma_2018/article/list/1?t=1
Note that everyone's blog ID is different, so to keep the crawler general it asks you to enter your personal blog ID and the number of list pages.
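
Based on that URL pattern, the list-page URLs for a given blog ID can be built as follows (a minimal sketch; the blog ID and page count are just example values):

# Build the list-page URLs from a blog ID and a page count (illustrative values)
blog_id = "xiaoma_2018"  # replace with your own CSDN blog ID
pages = 2                # how many list pages your blog currently has

list_urls = [
    "https://blog.csdn.net/" + blog_id + "/article/list/" + str(page) + "?t=1"
    for page in range(1, pages + 1)
]
print(list_urls)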

Single blog analysis


By analyzing the HTML source of a single blog post, we can obtain data such as the article link, article title, publication time, view count, collection count, and so on.
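
For example, the title, view count, and publication time can be pulled out of a post's HTML with BeautifulSoup CSS selectors (a minimal sketch using the same selectors as the full code below; the class names depend on CSDN's current page layout and may change):

import re
from bs4 import BeautifulSoup

def parse_post(html):
    """Extract title, view count, and publication time from one post's HTML source."""
    soup = BeautifulSoup(html, features="html.parser")
    title = soup.select('.title-article')[0].text
    views = re.sub(r'\D', "", str(soup.select('.bar-content .read-count')[0]))  # keep only the digits
    posted = soup.select('.bar-content .time')[0].text
    return title, views, posted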

Environment configuration

The crawler was developed and run with PyCharm 2020.1.1 and Python 3.7.5.
The third-party dependencies are listed below.
They were exported by running pip freeze > requirements.txt and can be installed with pip install -r requirements.txt:

 

beautifulsoup4==4.9.1
pandas==1.1.1
requests==2.24.0

 

 

 

Code implementation

The main idea of the code is:

  1. Read in the blog ID and the number of list pages

  2. Crawl the links to all blog posts

  3. Crawl the data of each blog post

  4. Store the data

config.py configuration

To make it easy to adapt the crawler to different blogs, the request header and file path parameters are kept in a separate config.py file, shown below:

 

 
'''
@Func Request header and file path settings used by the crawler
@File config.py
'''
Host = "blog.csdn.net"  # Host parameter for the request headers
User_Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"
Source = 'html.txt'      # Temporary file for the blog list HTML source
EachSource = 'each.txt'  # Temporary file for each individual post's HTML source
OUTPUT = "Blog information.csv"  # CSV file the blog information is written to

 

 

The User_Agent value must be set to match your own browser before the crawler can be used; the other parameters can be left at their default values.
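
A quick way to confirm the configured headers are accepted is to request the blog host and check the status code (a minimal sketch; this check is not part of the original project):

import requests
from config import Host, User_Agent

headers = {"Host": Host, "User-Agent": User_Agent}
response = requests.get("https://" + Host, headers=headers)
print(response.status_code)  # 200 means the request went through with these headers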

Run code

 

'''
@Func Crawl CSDN blog post data with Python and write it to a CSV table
      Uses the re module to match URL addresses with regular expressions
'''
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
from config import Host, User_Agent, Source, EachSource, OUTPUT

results = [] # Store all data

def parseEachBlog(link):
    headers = {"Referer": link, "User-Agent": User_Agent}  # the Referer is the post's own URL
    r = requests.get(link, headers=headers)
    html = r.text
    with open(EachSource, 'w', encoding='UTF-8') as f:
        f.write(html)
    soup = BeautifulSoup(open(EachSource, 'r', encoding='UTF-8'), features="html.parser")
    readcontent = soup.select('.bar-content .read-count')
    collection = soup.select('.bar-content .get-collection')
    readcounts = re.sub(r'\D', "", str(readcontent[0]))   # keep only the digits (view count)
    collections = re.sub(r'\D', "", str(collection[0]))   # keep only the digits (collection count)
    blogname = soup.select('.title-article')[0].text
    time = soup.select('.bar-content .time')[0].text
    eachBlog = [blogname, link, readcounts, collections, time]
    return eachBlog

def getBlogList(blogID, pages):
    listhome = "https://" + Host + "/" + blogID + "/article/list/"
    pagenums = []  # page numbers converted to strings
    for i in range(1, int(pages)+1):
        pagenums.append(str(i))

    for number in pagenums:
        url = listhome + number + "?t=1"
        headers = {"Referer": url, "Host": Host, "User-Agent": User_Agent}
        response = requests.get(url, headers=headers)
        html = response.text
        with open(Source, 'a', encoding='UTF-8') as f:
            f.write(html)
    # Collect the links to all blog posts
    soup = BeautifulSoup(open(Source, 'r', encoding='UTF-8'), features="html.parser")
    hrefs = []
    re_pattern = "^https://blog.csdn.net/" + blogID + r"/article/details/\d+$"
    for a in soup.find_all('a', href=True):
        if a.get_text(strip=True):
            href = a['href']
            if re.match(re_pattern, href) and href not in hrefs:
                hrefs.append(href)
    return hrefs

def parseData():
    results.sort(key=lambda result:int(result[2]), reverse=True) # Sort by views
    dataframe = pd.DataFrame(data=results)
    dataframe.columns = ['Article title', 'Article link', 'Views', 'Collection volume', 'Release time']
    dataframe.to_csv(OUTPUT, index=False, sep=',')

def delTempFile():
    if os.path.exists(Source):
        os.remove(Source)
    if os.path.exists(EachSource):
        os.remove(EachSource)

if __name__ == '__main__':
    blogID = input("Enter the blog ID you want to crawl: ")
    pages = input("Enter the number of blog list pages: ")
    print("Getting all blog links...")
    linklist = getBlogList(blogID, pages)
    print("Start getting data...")
    for i in linklist:
        print("Currently crawling: %s" % (i))
        results.append(parseEachBlog(i))
    print("Finished getting data.")
    # Parse the results and store them as a .csv file
    print("Start parsing and storing data...")
    parseData()
    print("Deleting temporary files...")
    delTempFile()
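
Once the script finishes, the resulting CSV can be inspected with pandas (a minimal sketch; the column names match those written by parseData above):

import pandas as pd

df = pd.read_csv("Blog information.csv")
print(df.head())  # posts with the most views come first
print("Total views:", df['Views'].astype(int).sum())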

 

 

Execution process

Taking my own blog ID as an example, here is the execution process and the result. My blog list currently has two pages.
Start of execution

End of execution

The resulting data

 

Code download

From the initial idea to the implementation, and finally to writing it up in this post, the whole process has been a lot of fun, so I am summarizing and sharing it here.
The complete crawler code has been uploaded to my Gitee repository.

Download address: https://gitee.com/lyc96/analysis-of-personal-blogs

 

Welcome to follow my official account: Python crawler data analysis and mining

It records every bit of my Python learning;

Reply [open source code] to get more open source project source code for free;

The official account shares Python knowledge and [free] tools every day;

This article has also been published on [Open Source China] and [Tencent Cloud Community];

 
