Crawling pictures with Python: the results speak for themselves!

foreword
Sometimes we browse websites just to look at beautiful pictures, sometimes we need good-looking pictures as material for work or study, and sometimes we simply want to pass the time when we are bored. Clicking through the links one by one to browse them? In this era of data explosion that feels rather time-consuming, so let's look at a quicker way to do it. Feast your eyes on what follows...

import libraries
Importing the third-party libraries the crawler needs is the first step:

from bs4 import BeautifulSoup
import requests
import os
import re

I will not introduce these libraries, or the other concepts that appear later, one by one here. I will cover the basics in detail later in my "Getting Started with Crawlers" series; this post is a hands-on walkthrough, so everyone can see the pieces in action first. I will publish the practical articles in this column and keep them structured, so they are easy to refer back to later.

find url

urlHead = 'https://photo.fengniao.com/'  # site root, used to build the next-page URL
url = 'https://photo.fengniao.com/pic_43591143.html'  # starting photo detail page
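
A quick note on these two variables: url is the detail page of a single photo, and urlHead is the site root that is later glued onto the relative "next page" link. If you prefer something sturdier than plain string concatenation, the standard library's urllib.parse.urljoin resolves relative links for you; a minimal sketch (the href value below is made up for illustration):

from urllib.parse import urljoin

# urljoin resolves a relative href against the site root,
# regardless of leading or trailing slashes.
next_url = urljoin(urlHead, 'pic_43591144.html')   # hypothetical href value
print(next_url)   # https://photo.fengniao.com/pic_43591144.html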

request url

def getHtmlurl(url):  # fetch the page source for a given url
    try:
        r = requests.get(url)
        r.raise_for_status()  # raise an error for non-2xx responses
        # Use the encoding detected from the page content to avoid garbled text
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

Parse and save

def getpic(html):  # Get the image address, download the picture, then return the next page's URL
    # Specify the parser of BeautifulSoup as: html.parser
    soup = BeautifulSoup(html, 'html.parser')
    # all_img = soup.find('div', class_='imgBig').find_all('img')

    all_img = soup.find('a', class_='downPic')
    img_url = all_img['href']

    reg = r'<h3 class="title overOneTxt">(.*?)</h3>'  # regular expression that captures the photo title
    reg_ques = re.compile(reg)  # Compile the regular expression to run faster
    image_name = reg_ques.findall(html)  # match regular expression

    urlNextHtml = soup.find('a', class_='right btn')
    urlNext = urlHead + urlNextHtml['href']

    print('downloading:' + img_url)
    root = r'E:\Python Experimental location\picture\cache'  # raw string so the backslashes are not treated as escapes
    path = os.path.join(root, image_name[0] + '.jpg')  # join with the proper path separator
    try:  # Create the folder if needed, then download the image if it is not already there
        if not os.path.exists(root):
            os.makedirs(root)  # makedirs also creates intermediate folders
        if not os.path.exists(path):
            r = requests.get(img_url)
            with open(path, 'wb') as f:  # the with statement closes the file automatically
                f.write(r.content)
            print("The picture was downloaded successfully")
        else:
            print("File already exists")
    except (requests.RequestException, OSError):
        print("Crawl failed")
    return urlNext
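
One caveat with the code above: the photo title scraped from the page is used directly as the file name, and titles can contain characters that Windows does not allow in paths. If that becomes a problem, a small helper along these lines could be applied to image_name[0] before building path; safe_filename is my own name, not part of the original code:

import re

def safe_filename(name):
    # Replace characters that Windows forbids in file names with underscores.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

# e.g. path = os.path.join(root, safe_filename(image_name[0]) + '.jpg')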

tying it together

def main():
    html = getHtmlurl(url)  # fetch the page source of the current photo page
    print(html)             # optional: dump the source for debugging
    return getpic(html)     # download the picture and return the next page's URL

main function

# main function
# Download 100 pictures!!!
if __name__ == '__main__':
    for i in range(100):
        url = main()  # main() returns the next page's URL, which is used on the next pass
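
This works because url is a module-level variable: main() reads it, and the loop writes the next page's URL back into it. If you would rather avoid the global, here is a sketch of an equivalent loop that passes the URL explicitly (crawl is my own name, not from the original post):

def crawl(start_url, count=100):
    # Follow the "next" link count times, downloading one picture per page.
    current = start_url
    for _ in range(count):
        html = getHtmlurl(current)
        if not html:              # stop if a request failed
            break
        current = getpic(html)    # getpic returns the next page's URL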

general idea
1. Find the URL
2. Request the URL
3. Parse the web page
4. Save the data

Don't underestimate these steps; understanding them in detail still takes practice. With today's crawlers there is a lot to pay attention to, such as anti-crawling measures, request delays, and proxies. These are all things we need to understand. Don't just copy some code off the internet and run it blindly; that is a good way to get your own IP blocked.
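
For instance, a slightly more polite version of the request function might send a browser-like User-Agent, set a timeout, and pause between pages; the header value and sleep interval below are illustrative choices, not values from the original post:

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}   # pretend to be a normal browser

def polite_get(url):
    # One request with a timeout; sleep afterwards so we don't hammer the site.
    r = requests.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    time.sleep(1)   # throttle: roughly one request per second
    return r.text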

Let's see how it works!

If you strip out a few things, you can adapt this code for your own use. Leave a comment if you need it!

Tags: Python regex

Posted by psychotomus on Tue, 24 May 2022 22:50:47 +0300