Crawler: how to use a crawler to download pictures from a website

If you spend much time online, you have probably run into galleries of pretty girls now and then. Lately these small sites keep collapsing and disappearing, so it's better to find a way to save the photos locally!
First, put together a collection of small sites, and then you can start!

The complete code is at the end of the article!!

In this update, a history module is added so you can pause the download at any time and resume later!! The os library is used to generate paths automatically: whenever a folder fills up with 490 pictures, a new folder is created and storage switches over automatically, which makes it convenient to upload everything to Baidu Cloud in one click!!

History module:
def history(link):
    global picnamelist
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    path_out = os.getcwd()
    h_path = filenames(path_out)  # folder that currently has room (NewPics0, NewPics1, ...)
    path_name = h_path + '/' + pic_name

    with open('history.txt', 'a+', encoding='utf8') as f:
        f.seek(0, 0)  # 'a+' opens at end of file, rewind before reading
        picnamelist = f.readlines()
        if pic_name + '\n' not in picnamelist:
            f.writelines(pic_name + '\n')
            download_img(link, path_name)
        else:
            print('Picture %s already exists, skipped!' % pic_name)
Automatic new-folder module:
def filenames(path_root):
    global picCount
    os.chdir(path_root)
    # a new folder every 490 pictures: NewPics0, NewPics1, ...
    root_wwj = "NewPics" + str(picCount // 490)
    if not os.path.exists(path_root + "/" + root_wwj):
        os.mkdir(root_wwj)
    return root_wwj
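For example, while picCount is anywhere from 0 to 489 every picture lands in NewPics0; at picture 490 the function starts returning NewPics1, and so on.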

An exception back-off module has also been added: when the site starts rejecting requests for being too frequent, the crawler automatically waits 10 seconds and then moves on to a new link.
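The article doesn't include that module's code, so here is a minimal sketch of the idea, assuming a hypothetical helper named get_with_backoff that wraps requests.get (it is not part of the posted script):

import time
import requests

def get_with_backoff(url, headers, retries=3, wait=10):
    # Hypothetical helper: retry a GET request, sleeping `wait` seconds
    # after each failure or non-200 response before trying again.
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            if r.status_code == 200:
                return r
            print('Got status %d, waiting %d seconds...' % (r.status_code, wait))
        except requests.RequestException as e:
            print('Request failed (%s), waiting %d seconds...' % (e, wait))
        time.sleep(wait)
    return None  # caller should skip this link and move on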

All right, let's get to the point

Step 1

First we write a function that builds the URLs of the site's listing pages. A listing URL is usually composed of a page number plus other fixed parts, so we separate those out. The site I found has a picture section on its homepage; the section is divided into title pages, and each title holds about 10 photos.
So the first step is to work out the URL pattern of the picture section's listing pages.
The pattern is:
Page 1: www.xxxx.../tp.html
Page 2: www.xxxx.../tp-2.html
Page 3: www.xxxx.../tp-3.html
...

def getHTML(pages):
    for i in range(1, pages + 1):
        url = 'https://'  # base listing URL, redacted here; see the full code below
        if i > 1:
            url = url + '-' + str(i) + '.html'
        else:
            url = url + '.html'
        print('------Downloading pictures on page %d!------' % i)
        htmlTex(url)

Step 2

We have now got the listing-page links. From each listing page we need to extract the links into the individual titles, which hold the actual pictures. We use an xpath expression to pull out the title link paths, as shown in the code below.

Before requesting a url, we disguise the request headers; here I pretend to be a mobile browser:

headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }

Then requests.get(url, headers=headers) fetches the text content of the page, which is passed to the next function to grab the links into the titles:

def htmlTex(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf8'
    htmlTXT = r.text
    getLinks(htmlTXT)

Step 3

The getLinks function receives the page source, extracts the title links according to the xpath path, then requests each title link in turn to get the text of the page holding that title's pictures, which it passes to the getPicLINK function.

def getLinks(txt):
    global name
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    links = LISTtree.xpath('//div[@class="text-list-html"]/div/ul/li/a/@href')
    for link in links:
        name = link[link.rfind('/') + 1:link.rfind('.')]
        url = 'https:' + link
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
            }
            r = requests.get(url, headers=headers)
            r.encoding = 'utf8'
            htmlTXT = r.text
            getPicLINK(htmlTXT)
        except Exception:
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print('Error in getLinks, skipping this title')
            continue

getPicLINK receives that page text, extracts the picture addresses with xpath, and then downloads them:

def getPicLINK(txt):
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    # the pictures are lazy-loaded, so the real address is in @data-original
    link_list1 = LISTtree.xpath('//main/div[@class="content"]/img[@class="videopic lazy"]/@data-original')
    for link in link_list1:
        try:
            history(link)
        except Exception:
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print("piclink error, skipping")
            continue

The history function judges whether a picture already exists. The principle: each downloaded picture's name is appended to a txt file, and before every download the names in the file are read back to check whether the picture has been downloaded before.

def history(link):
    global picnamelist
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    path_name = 'pics2/' + pic_name
    with open('pics2/history.txt', 'a+', encoding='utf8') as f:
        f.seek(0, 0)  # 'a+' opens at end of file, rewind before reading
        picnamelist = f.readlines()
        if pic_name + '\n' not in picnamelist:
            f.writelines(pic_name + '\n')
            download_img(link, path_name)
        else:
            print('Picture %s already exists, skipped!' % pic_name)
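One design note: history.txt is reopened and re-read for every single picture, which is fine for a small collection but gets slower as the list grows. A minimal sketch of a faster alternative (hypothetical, not the posted code) loads the history into a set once at start-up:

import os

downloaded = set()
if os.path.exists('pics2/history.txt'):
    with open('pics2/history.txt', encoding='utf8') as f:
        downloaded = set(line.strip() for line in f)

def history_fast(link):
    # Hypothetical replacement for history(): O(1) membership test via a set.
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    if pic_name in downloaded:
        print('Picture %s already exists, skipped!' % pic_name)
        return
    downloaded.add(pic_name)
    with open('pics2/history.txt', 'a', encoding='utf8') as f:
        f.write(pic_name + '\n')
    download_img(link, 'pics2/' + pic_name)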

Download function:

picCount = 0
def download_img(link, picName):
    global picCount
    pic_name = picName
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(link, headers=headers)
    print("Downloading: %s" % pic_name[pic_name.rfind('/') + 1:])
    with open(pic_name, 'wb') as f:
        f.write(r.content)
    picCount = picCount + 1
    print("Picture %d downloaded! %s" % (picCount, getNowTime()))


A lot of people have been asking for the code, so here is the complete script:

import requests
from lxml import html
import time
import traceback
import os

name = ''
picnamelist = []
url_h = 'https://www.bpr5.com'  # site root
key_v = '/tupian/list-%E7%B2%BE%E5%93%81%E5%A5%97%E5%9B%BE'  # picture-section path
# e.g. https://www.bpr5.com/tupian/list-%E7%B2%BE%E5%93%81%E5%A5%97%E5%9B%BE.html

def getNowTime():
    year = time.localtime().tm_year
    month = time.localtime().tm_mon
    day = time.localtime().tm_mday
    hour = time.localtime().tm_hour
    minute = time.localtime().tm_min
    second = time.localtime().tm_sec
    return str(year) + "-" + str(month) + "-" + str(day) + " " + str(hour) + ":" + str(minute) + ":" + str(second)

def getHTML(pages):
    global url_h
    global key_v
    for i in range(1, pages + 1):
        url = url_h + key_v
        if i > 1:
            url = url + '-' + str(i) + '.html'
        else:
            url = url + '.html'
        print('------Downloading pictures on page %d!------' % i)
        htmlTex(url)

def htmlTex(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf8'
    htmlTXT = r.text
    getLinks(htmlTXT)

def getLinks(txt):
    global name
    global url_h
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    links = LISTtree.xpath('//div[@class="text-list-html"]/div/ul/li/a/@href')
    for link in links:
        name = link[link.rfind('/') + 1:link.rfind('.')]
        url = url_h + link
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
            }
            r = requests.get(url, headers=headers)
            r.encoding = 'utf8'
            htmlTXT = r.text
            getPicLINK(htmlTXT)
        except Exception:
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print('Error in getLinks, skipping this title')
            continue



def getPicLINK(txt):
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    # the pictures are lazy-loaded, so the real address is in @data-original rather than @src
    link_list1 = LISTtree.xpath('//main/div[@class="content"]/img[@class="videopic lazy"]/@data-original')
    for link in link_list1:
        try:
            history_a(link)
        except Exception:
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print("piclink error, skipping")
            continue

def history_a(link):
    global picnamelist
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    path_out = os.getcwd()
    h_path = filenames(path_out)  # folder that currently has room (NewPics0, NewPics1, ...)
    path_name = h_path + '/' + pic_name
    with open('history_aaa.txt', 'a+', encoding='utf8') as f:
        f.seek(0, 0)  # 'a+' opens at end of file, rewind before reading
        picnamelist = f.readlines()
        if pic_name + '\n' not in picnamelist:
            f.writelines(pic_name + '\n')
            download_img(link, path_name)
        else:
            print('Picture %s already exists, skipped!' % pic_name)

picCount = 0
def download_img(link, picName):
    global picCount
    pic_name = picName
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(link, headers=headers)
    print("Downloading: %s" % pic_name[pic_name.rfind('/') + 1:])
    with open(pic_name, 'wb') as f:
        f.write(r.content)
    picCount = picCount + 1
    print("Picture %d downloaded! %s" % (picCount, getNowTime()))


def filenames(path_root):
    global picCount
    os.chdir(path_root)
    # a new folder every 490 pictures: NewPics0, NewPics1, ...
    root_wwj = "NewPics" + str(picCount // 490)
    if not os.path.exists(path_root + "/" + root_wwj):
        os.mkdir(root_wwj)
    return root_wwj






if __name__ == '__main__':
    pages = int(input('How many pages to crawl: '))
    getHTML(pages)
    print("Complete")



