foreword
Sometimes we like to browse beautiful pictures on a website, or we need some good-looking pictures as material for daily work and study, or we just want to pass the time and relax when we are bored. Click the links one by one to browse them? In this era of data explosion that seems a bit time-consuming, so let's look at a different way of doing it. Feast your eyes...
import library
Importing the third-party libraries the crawler needs is the first step:
from bs4 import BeautifulSoup
import requests
import os
import re
I will not introduce these libraries, or the other knowledge points involved later, one by one here. I will explain the basics in detail later in the "First Acquaintance with Crawlers" series; this article is a practical one, so that everyone can get a feel for things first. I will publish the practical articles in this column and keep them structured, so they are easy to refer back to in the future.
find url
urlHead = 'https://photo.fengniao.com/'
url = 'https://photo.fengniao.com/pic_43591143.html'
request url
def getHtmlurl(url):  # request the url and return the page text
    try:
        r = requests.get(url)
        r.raise_for_status()
        # fix garbled characters by switching to the page's apparent encoding
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
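If the site is picky about who it serves, requests also lets you send a browser-like User-Agent header and set a timeout. This is only a sketch of a possible variant; the function name, the header value, and the timeout are my own illustrative additions, not part of the original code:

def getHtmlurl_with_headers(url):
    # hypothetical variant of getHtmlurl(): same logic, plus an illustrative User-Agent and timeout
    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder value, adjust as needed
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""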
Parse and save
def getpic(html):  # get the image address, download the picture, then return the url of the next page
    # tell BeautifulSoup to use the html.parser parser
    soup = BeautifulSoup(html, 'html.parser')
    # all_img = soup.find('div', class_='imgBig').find_all('img')
    all_img = soup.find('a', class_='downPic')
    img_url = all_img['href']
    reg = r'<h3 class="title overOneTxt">(.*?)</h3>'  # regular expression for the picture title
    # alternative: r'<a\sclass=".*?"\starget=".*?"\shref=".*?">(.*)</a>'
    reg_ques = re.compile(reg)  # compile the regular expression so it runs faster
    image_name = reg_ques.findall(html)  # match the title with the regular expression
    urlNextHtml = soup.find('a', class_='right btn')
    urlNext = urlHead + urlNextHtml['href']
    print('downloading:' + img_url)
    root = r'E:\Python Experimental location\picture\cache'
    path = os.path.join(root, image_name[0] + '.jpg')
    try:
        # create the directory if it does not exist, then download the picture if it is not there yet
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(img_url)
            with open(path, 'wb') as f:
                f.write(r.content)
            print("The picture is downloaded successfully")
        else:
            print("File already exists")
    except:
        print("Crawl failed")
    return urlNext
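To see what the BeautifulSoup lookups and the regular expression actually return, here is a tiny hand-written snippet that imitates the structure getpic() expects; the tags and values are made up purely for illustration:

from bs4 import BeautifulSoup
import re

sample = '''
<a class="downPic" href="https://example.com/full/001.jpg">download</a>
<h3 class="title overOneTxt">Sunset over the lake</h3>
<a class="right btn" href="pic_43591144.html">next page</a>
'''

soup = BeautifulSoup(sample, 'html.parser')
print(soup.find('a', class_='downPic')['href'])    # address of the full-size image
print(re.findall(r'<h3 class="title overOneTxt">(.*?)</h3>', sample))  # ['Sunset over the lake']
print(soup.find('a', class_='right btn')['href'])  # relative link to the next page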
structured function
def main():
    html = getHtmlurl(url)  # request the current page
    print(html)             # print the raw page, handy for debugging
    return getpic(html)     # download the picture and return the url of the next page
main function
# main function: download pictures page by page
# note: range(1, 100) runs 99 times, not 100
if __name__ == '__main__':
    for i in range(1, 100):
        url = main()  # each call downloads one picture and moves url to the next page
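If you want the loop to be a little more robust and polite, one possible variant (just a sketch; the one-second delay and the stop condition are my own additions) is to pause between requests and stop once a page can no longer be handled:

import time

if __name__ == '__main__':
    for i in range(1, 100):
        try:
            url = main()      # each call downloads one picture and returns the next page url
        except Exception:
            break             # stop if a page can no longer be fetched or parsed
        time.sleep(1)         # wait a second between requests to go easy on the server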
general idea
1. Find the URL
2. Request the URL
3. Parse the web page
4. Save the data
Don't underestimate these steps. To really understand them you still need some practice. Crawling today involves many things that need attention, such as anti-crawling measures, request delays, and proxies. These are all things we need to understand. Remember not to just copy some code from the internet and run it as-is; that is a very good way to get your own IP blocked.
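As one illustration of the proxy point, requests can route traffic through a proxy via its proxies argument. A minimal sketch with placeholder addresses (these are not real proxies):

import requests

proxies = {
    'http': 'http://127.0.0.1:8888',   # placeholder proxy address
    'https': 'http://127.0.0.1:8888',  # placeholder proxy address
}
r = requests.get('https://photo.fengniao.com/', proxies=proxies, timeout=10)
print(r.status_code)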
Let's see how it works!
If you remove a few things from this code, you can use it yourself. If you need it, feel free to leave a message!