Python Crawlers from Zero: Batch-Crawling Web E-books

Good news for fiction fans: learn this and you no longer have to put up with the ads on free novel sites, or pay the various reading platforms. Isn't batch-crawling the books yourself much sweeter?

It's also good news for anyone who loves studying: you can crawl and save all kinds of material, which makes it much easier to study efficiently.

Those two points are just side benefits, though. The real payoff is that once you learn crawlers well, you can use them at work or take on freelance scraping jobs to earn extra income.

Hands-on practice: crawling an e-book with a Python crawler

1. Fetch the page content

import requests        # Import the requests library
'''
Fetch the page content
'''
if __name__ == '__main__':          # Main entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # Target URL to crawl
    req = requests.get(url=target)  # Send a GET request
    req.encoding = 'utf-8'          # Set the response encoding
    print(req.text)                 # Print the HTML
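
In practice, some sites reject requests that don't look like they come from a browser, and network errors are common while crawling. A minimal hardening sketch (the User-Agent string and the timeout value are my own choices, not part of the original walkthrough):

import requests

target = 'https://www.xsbiquge.com/78_78513/108078.html'
headers = {'User-Agent': 'Mozilla/5.0'}  # Pretend to be a browser; any common UA string works
req = requests.get(url=target, headers=headers, timeout=10)  # Fail fast instead of hanging forever
req.raise_for_status()                   # Raise an error on a 4xx/5xx response
req.encoding = 'utf-8'
print(req.text[:200])                    # Peek at the first 200 characters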

2. Use BeautifulSoup to parse the page content

import requests        # Import the requests library
from bs4 import BeautifulSoup  # Import the BeautifulSoup library

'''
Use BeautifulSoup to parse the page content
and extract the e-book's body text
'''
if __name__ == '__main__':          # Main entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # Target URL to crawl
    req = requests.get(url=target)  # Send the request and fetch the HTML
    req.encoding = 'utf-8'          # Set the response encoding
    html = req.text                 # Save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # Parse the page with the lxml parser
    texts = bs.find('div', id='content')  # Grab the <div id="content"> element
    print(texts)                    # Print it
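
Note that the 'lxml' parser is a third-party dependency (installed with pip install lxml). If it isn't available, BeautifulSoup's built-in parser is a drop-in replacement, just slower:

from bs4 import BeautifulSoup

html = '<div id="content">Chapter text...</div>'  # Stand-in HTML, just for illustration
bs = BeautifulSoup(html, 'html.parser')           # Standard-library parser, no extra install
print(bs.find('div', id='content').text)          # -> Chapter text...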

3. Split the data, strip the whitespace, and extract the text

import requests        # Import the requests library
from bs4 import BeautifulSoup  # Import the BeautifulSoup library

'''
Use BeautifulSoup to parse the page content
and extract the e-book's body text.
In the last line, texts.text extracts all the text, strip() removes the
surrounding line breaks, and split() breaks the text into paragraphs on
runs of four non-breaking spaces (U+00A0), because every paragraph
begins with them.
'''
if __name__ == '__main__':          # Main entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # Target URL to crawl
    req = requests.get(url=target)  # Send the request and fetch the HTML
    req.encoding = 'utf-8'          # Set the response encoding
    html = req.text                 # Save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # Parse the page with the lxml parser
    texts = bs.find('div', id='content')  # Grab the <div id="content"> element
    print(texts.text.strip().split('\xa0' * 4))  # Split into paragraphs and print
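
To see what that last line does, here is a tiny self-contained example (the sample string is made up; on the real page the paragraphs are separated by runs of U+00A0 non-breaking spaces):

raw = '\xa0' * 4 + 'First paragraph.' + '\xa0' * 4 + 'Second paragraph.'
paragraphs = raw.strip().split('\xa0' * 4)  # ['First paragraph.', 'Second paragraph.']
print('\n'.join(paragraphs))                # One paragraph per line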

4. View the chapter list

import requests        # Import the requests library
from bs4 import BeautifulSoup  # Import the BeautifulSoup library

'''
View the chapter list:
use BeautifulSoup to parse the chapter index page
and list every chapter entry
'''
if __name__ == '__main__':          # Main entry point
    target = 'https://www.xsbiquge.com/78_78513/'  # Target URL to crawl: the chapter index of Yuan Zun
    req = requests.get(url=target)      # Send the request and fetch the HTML
    req.encoding = 'utf-8'              # Set the response encoding
    html = req.text                     # Save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')    # Parse the page with the lxml parser
    chapters = bs.find('div', id='list')  # Grab the <div id="list"> element
    chapters = chapters.find_all('a')   # Find every <a> tag inside the list
    for chapter in chapters:
        print(chapter)                  # Print each chapter entry
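
Many novel sites show a short "latest chapters" block above the full list, so the same chapter can appear twice among the <a> tags. I haven't verified whether this page does, but if you see duplicates, deduplicating by href is a simple fix:

seen = set()
unique_chapters = []
for a in chapters:
    href = a.get('href')
    if href not in seen:           # Keep only the first occurrence of each link
        seen.add(href)
        unique_chapters.append(a)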

5. Get chapter names and chapter links

import requests        # Import the requests library
from bs4 import BeautifulSoup  # Import the BeautifulSoup library

'''
View the chapter list:
use BeautifulSoup to parse the chapter index page
and extract each chapter's name and link
'''
if __name__ == '__main__':          # Main entry point
    server = 'https://www.xsbiquge.com'
    target = 'https://www.xsbiquge.com/78_78513/'  # Target URL to crawl: the chapter index of Yuan Zun
    req = requests.get(url=target)      # Send the request and fetch the HTML
    req.encoding = 'utf-8'              # Set the response encoding
    html = req.text                     # Save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')    # Parse the page with the lxml parser
    chapters = bs.find('div', id='list')  # Grab the <div id="list"> element
    chapters = chapters.find_all('a')   # Find every <a> tag inside the list
    for chapter in chapters:
        url = chapter.get('href')       # Read the chapter link from the href attribute
        print(chapter.string)           # Print the chapter name
        print(server + url)             # Join the site root with the href to get each chapter's full link
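
Plain string concatenation works here because every href on this index page is root-relative. The standard library's urljoin handles both relative and absolute hrefs, which is a bit more robust (a sketch, not from the original walkthrough):

from urllib.parse import urljoin

server = 'https://www.xsbiquge.com'
href = '/78_78513/108078.html'  # An href as it appears in the page
print(urljoin(server, href))    # -> https://www.xsbiquge.com/78_78513/108078.html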

6. Put it all together and download the e-book

import requests        # Import the requests library
from bs4 import BeautifulSoup  # Import the BeautifulSoup library
from tqdm import tqdm  # Progress bar for the download loop


'''
Put it all together:
fetch the chapter index, then download every chapter's text
and append it to a single .txt file
'''
def get_content(target):
    req = requests.get(url=target)  # Send the request and fetch the HTML
    req.encoding = 'utf-8'          # Set the response encoding
    html = req.text                 # Save the page's HTML in the html variable
    bf = BeautifulSoup(html, 'lxml')  # Parse the page with the lxml parser
    texts = bf.find('div', id='content')  # Grab the <div id="content"> element
    content = texts.text.strip().split('\xa0' * 4)  # Split the body text into paragraphs
    return content


if __name__ == '__main__':          # Main entry point
    server = 'https://www.xsbiquge.com'  # The e-book site's root address
    book_name = 'Yuan Zun.txt'      # Output file name
    target = 'https://www.xsbiquge.com/78_78513/'  # Target URL to crawl: the chapter index of Yuan Zun
    req = requests.get(url=target)      # Send the request and fetch the HTML
    req.encoding = 'utf-8'              # Set the response encoding
    html = req.text                     # Save the page's HTML in the html variable
    chapter_bs = BeautifulSoup(html, 'lxml')      # Parse the page with the lxml parser
    chapters = chapter_bs.find('div', id='list')  # Grab the <div id="list"> element
    chapters = chapters.find_all('a')   # Find every <a> tag inside the list
    for chapter in tqdm(chapters):      # tqdm shows download progress
        chapter_name = chapter.string   # Chapter name
        url = server + chapter.get('href')  # Join the site root with the href to get the chapter link
        content = get_content(url)      # Download and split the chapter text
        with open(book_name, 'a', encoding='utf-8') as f:
            f.write(chapter_name)       # Write the chapter title
            f.write('\n')
            f.write('\n'.join(content))  # Write the chapter paragraphs
            f.write('\n')

The download may feel slow: a whole book takes around ten minutes. I'll share faster approaches later; if you have better suggestions, feel free to share them as well.
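
One easy speed-up is to reuse a single connection with requests.Session instead of opening a new one for every chapter; going further, a thread pool downloads several chapters at once. A minimal sketch under those assumptions (the function and variable names are mine; keep max_workers small so you don't hammer the site):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # One connection pool reused across all requests

def fetch_chapter(url):
    req = session.get(url, timeout=10)
    req.encoding = 'utf-8'
    div = BeautifulSoup(req.text, 'lxml').find('div', id='content')
    return div.text.strip().split('\xa0' * 4)

urls = ['https://www.xsbiquge.com/78_78513/108078.html']  # Chapter links collected in step 5
with ThreadPoolExecutor(max_workers=4) as pool:
    chapters_text = list(pool.map(fetch_chapter, urls))   # map() preserves chapter order
print(len(chapters_text))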


Posted by Moocat on Wed, 25 May 2022 15:53:29 +0300