Fiction fans are in luck: you no longer have to put up with web ads or pay money to various novel platforms. Isn't it great to crawl the books in batches yourself?
It's good news for people who love studying, too. All kinds of material can be crawled and saved, which makes learning much more efficient.
Those two points are just side benefits, though. The real payoff is that once you learn web crawling well, you can use it at work or take on freelance orders to earn extra income.
Python crawler practice: crawling an e-book
1. Get web information
import requests  # import the requests library

'''
Get the web page information
'''
if __name__ == '__main__':  # main entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # target URL to crawl
    req = requests.get(url=target)  # send a GET request
    req.encoding = 'utf-8'  # set the encoding
    print(req.text)  # print the result
2. Use BeautifulSoup to parse the page content
import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

'''
Use BeautifulSoup to parse the page content
and extract the e-book text
'''
if __name__ == '__main__':  # main entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # target URL to crawl
    req = requests.get(url=target)  # send the request and fetch the HTML
    req.encoding = 'utf-8'  # set the encoding
    html = req.text  # save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # parse the page with lxml
    texts = bs.find('div', id='content')  # get the contents of <div id="content">
    print(texts)  # print the result
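Note: the 'lxml' parser used above requires the third-party lxml package (pip install lxml); if it isn't installed, BeautifulSoup's built-in 'html.parser' can be substituted.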
3. Split the data, strip the whitespace, and extract the text
import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

'''
Use BeautifulSoup to parse the page and extract the e-book text.
texts.text extracts all the text, strip() then removes the line breaks,
and split('\xa0' * 4) splits the data on the four non-breaking spaces
that begin each paragraph.
'''
if __name__ == '__main__':  # main entry point
    target = 'https://www.xsbiquge.com/78_78513/108078.html'  # target URL to crawl
    req = requests.get(url=target)  # send the request and fetch the HTML
    req.encoding = 'utf-8'  # set the encoding
    html = req.text  # save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # parse the page with lxml
    texts = bs.find('div', id='content')  # get the contents of <div id="content">
    print(texts.text.strip().split('\xa0' * 4))  # print the result
4. View the chapter list
import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

'''
View the chapter list:
use BeautifulSoup to parse the page content
and extract the chapter information
'''
if __name__ == '__main__':  # main entry point
    target = 'https://www.xsbiquge.com/78_78513/'  # target URL to crawl: the chapter directory of "Yuan Zun"
    req = requests.get(url=target)  # send the request and fetch the HTML
    req.encoding = 'utf-8'  # set the encoding
    html = req.text  # save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # parse the page with lxml
    chapters = bs.find('div', id='list')  # get the contents of <div id="list">
    chapters = chapters.find_all('a')  # find the <a> tags inside the list
    for chapter in chapters:
        print(chapter)  # print the chapter list
5. Get the chapter names and chapter links
import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library

'''
Get the chapter names and chapter links
'''
if __name__ == '__main__':  # main entry point
    server = 'https://www.xsbiquge.com'  # e-book site address
    target = 'https://www.xsbiquge.com/78_78513/'  # target URL to crawl: the chapter directory of "Yuan Zun"
    req = requests.get(url=target)  # send the request and fetch the HTML
    req.encoding = 'utf-8'  # set the encoding
    html = req.text  # save the page's HTML in the html variable
    bs = BeautifulSoup(html, 'lxml')  # parse the page with lxml
    chapters = bs.find('div', id='list')  # get the contents of <div id="list">
    chapters = chapters.find_all('a')  # find the <a> tags inside the list
    for chapter in chapters:
        url = chapter.get('href')  # get the chapter link from the href attribute
        print('<' + chapter.string + '>')  # print the chapter name
        print(server + url)  # join the site address with each href to get the full chapter link
6. Integrate the data and download the e-book
import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library
from tqdm import tqdm  # progress bar

'''
Integrate everything: walk the chapter list,
fetch each chapter's text, and write it to a file
'''
def get_content(target):
    req = requests.get(url=target)  # send the request and fetch the HTML
    req.encoding = 'utf-8'  # set the encoding
    html = req.text  # save the page's HTML in the html variable
    bf = BeautifulSoup(html, 'lxml')  # parse the page with lxml
    texts = bf.find('div', id='content')  # get the contents of <div id="content">
    content = texts.text.strip().split('\xa0' * 4)  # split into paragraphs on the four leading non-breaking spaces
    return content

if __name__ == '__main__':  # main entry point
    server = 'https://www.xsbiquge.com'  # e-book site address
    book_name = 'Yuan Zun.txt'  # output file
    target = 'https://www.xsbiquge.com/78_78513/'  # target URL to crawl: the chapter directory of "Yuan Zun"
    req = requests.get(url=target)  # send the request and fetch the HTML
    req.encoding = 'utf-8'  # set the encoding
    html = req.text  # save the page's HTML in the html variable
    chapter_bs = BeautifulSoup(html, 'lxml')  # parse the page with lxml
    chapters = chapter_bs.find('div', id='list')  # get the contents of <div id="list">
    chapters = chapters.find_all('a')  # find the <a> tags inside the list
    for chapter in tqdm(chapters):
        chapter_name = chapter.string  # chapter name
        url = server + chapter.get('href')  # join the site address with the href to get the chapter link
        content = get_content(url)
        with open(book_name, 'a', encoding='utf-8') as f:
            f.write('<' + chapter_name + '>')  # write the chapter title
            f.write('\n')
            f.write('\n'.join(content))  # write the chapter text, one paragraph per line
            f.write('\n')
The download may feel a bit slow: a whole book takes around ten minutes. Faster approaches will be shared later, and if you have better suggestions, feel free to share them too.
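For example, one possible speed-up (a minimal sketch of my own, not part of the original walkthrough) is to fetch chapters concurrently with a thread pool instead of one at a time, then write the file once all chapters are in. It reuses the same site URLs and parsing logic as step 6; the worker count of 8 is an arbitrary assumption, and a polite crawler should keep it modest.

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# Sketch: fetch the chapters in parallel, then write them out in order.

server = 'https://www.xsbiquge.com'  # e-book site address
target = 'https://www.xsbiquge.com/78_78513/'  # chapter directory of "Yuan Zun"
book_name = 'Yuan Zun.txt'  # output file

def get_content(url):
    req = requests.get(url=url)  # fetch one chapter page
    req.encoding = 'utf-8'
    texts = BeautifulSoup(req.text, 'lxml').find('div', id='content')
    return texts.text.strip().split('\xa0' * 4)  # split into paragraphs

req = requests.get(url=target)  # fetch the chapter directory
req.encoding = 'utf-8'
chapters = BeautifulSoup(req.text, 'lxml').find('div', id='list').find_all('a')
names = [chapter.string for chapter in chapters]  # chapter names
urls = [server + chapter.get('href') for chapter in chapters]  # full chapter links

# executor.map returns results in submission order, so chapters stay ordered
with ThreadPoolExecutor(max_workers=8) as executor:
    contents = list(executor.map(get_content, urls))

with open(book_name, 'w', encoding='utf-8') as f:
    for name, content in zip(names, contents):
        f.write('<' + name + '>\n')  # chapter title
        f.write('\n'.join(content) + '\n')  # chapter text, one paragraph per line

Because ThreadPoolExecutor.map preserves input order, the chapters end up in the file in the right sequence even though they were downloaded concurrently.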