If collecting pictures is your hobby, you have probably noticed that small websites often collapse and disappear without warning. It's better to find a way to save the photos locally while you still can!
First, prepare a list of small websites to crawl, and then you can start!
The complete code is at the end of the article!
This update adds a history module, so you can pause the download at any time and resume later. It also uses the os library to generate paths automatically: once a folder fills up with 490 pictures, a new folder is created automatically, which makes it easy to import everything into Baidu Cloud in one batch.
History module
def history(link):
    global picnamelist
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    path_out = os.getcwd()
    h_path = filenames(path_out)  # pick (or create) the current NewPicsN folder
    path_name = h_path + '/' + pic_name
    with open('history.txt', 'a+', encoding='utf8') as f:
        f.seek(0, 0)  # a+ positions at the end of the file; rewind before reading
        picnamelist = f.readlines()
        if pic_name + '\n' not in picnamelist:
            f.writelines(pic_name + '\n')  # record the name, then download
            download_img(link, path_name)
        else:
            print('Picture %s already exists, skipped!' % pic_name)
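Because history.txt is opened in a+ mode and every picture name is appended as it downloads, the record survives restarts: you can stop the script at any moment, run it again later, and everything already recorded is skipped instead of re-downloaded.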
Automatic new folder module
def filenames(path_root):
    global picCount  # the global download counter (defined next to download_img)
    os.chdir(path_root)
    # one folder per 490 pictures: NewPics0, NewPics1, NewPics2, ...
    root_wwj = "NewPics" + str(picCount // 490)
    if not os.path.exists(path_root + "/" + root_wwj):
        os.mkdir(root_wwj)
        os.chdir(path_root)
    return root_wwj
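A quick check of the rollover arithmetic: while picCount is 0 through 489, picCount // 490 is 0 and every picture lands in NewPics0; once the 490th download completes, the quotient becomes 1 and the next call creates NewPics1.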
Also added is an exception back-off module: when the website starts complaining that requests are too frequent, the crawler automatically waits 10 seconds and then launches the next link (a sketch of the idea follows).
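The article doesn't show this module's code, so here is a minimal sketch of the idea, assuming the waiting simply wraps requests.get; the helper name fetch_with_backoff, the retry count, and the timeout are my own assumptions, not the author's:

import time
import requests

def fetch_with_backoff(url, headers, retries=3, wait=10):
    # Hypothetical helper (not from the original article): retry a request
    # after sleeping 10 seconds whenever the site rejects it.
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=15)
            r.raise_for_status()  # treat 4xx/5xx (e.g. rate limiting) as errors
            return r
        except requests.RequestException:
            print('Request failed, waiting %d seconds before retrying...' % wait)
            time.sleep(wait)
    return None  # give up after the last retry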
All right, let's get to the point
Step one
We first write a function that builds the website's URLs. Since a URL is usually composed of a page number plus fixed parts, we split it apart. Under the homepage of the site I found there is a picture column; the column consists of title pages, and each title holds about 10 photos.
So the steps are:
Step 1: enter the picture column's title pages and walk through them page by page.
The pattern is:
Page 1: www.xxxx.../tp.html
Page 2: www.xxxx.../tp-2.html
Page 3: www.xxxx.../tp-3.html
...
def getHTML(pages):
    for i in range(1, pages + 1):
        url = 'https://'  # base URL of the column (redacted; the full code builds it from url_h + key_v)
        if i > 1:
            url = url + '-' + str(i) + '.html'  # page 2 onward: ...-2.html, ...-3.html
        else:
            url = url + '.html'  # page 1 has no page number
        print('------Downloading pictures of page %d!------' % i)
        htmlTex(url)
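For example, getHTML(3) requests tp.html, tp-2.html and tp-3.html in turn, handing each page's HTML to htmlTex.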
Step two
We got the listing-page link above; it is the link that changes as you page through. From each listing page we now need to extract the links that lead into the individual titles, where the pictures actually live, and we use an XPath expression to pick out those title links.
But first, let's disguise the request headers for the URL; here I chose to masquerade as a mobile browser:
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
}
Then requests.get(url, headers=headers) fetches the text content of the page, which is passed to the next function to grab the links into the titles:
def htmlTex(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf8'
    htmlTXT = r.text
    getLinks(htmlTXT)
Step three
The getLinks function receives the page source, strips the HTML comment markers (the site appears to hide its list markup inside comments), and pulls out the title links via the specified XPath path. By requesting each title link in turn, we obtain the text of the page containing that title's pictures and pass it to the getPicLINK function:
def getLinks(txt):
    global name
    content = txt.replace('<!--', '').replace('-->', '')  # un-hide the commented-out markup
    LISTtree = html.etree.HTML(content)
    links = LISTtree.xpath('//div[@class="text-list-html"]/div/ul/li/a/@href')
    for link in links:
        name = link[link.rfind('/') + 1:link.rfind('.')]
        url = 'https:' + link
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
            }
            r = requests.get(url, headers=headers)
            r.encoding = 'utf8'
            # print(r.text)
            htmlTXT = r.text
            getPicLINK(htmlTXT)
        except Exception:
            # log the error and skip to the next title
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print(getNowTime() + "-- " + traceback.format_exc() + '\n')
            continue
getPicLINK receives the page text, extracts the picture addresses with XPath, and then downloads them. Note that the XPath reads the data-original attribute rather than src: the img tag carries a lazy class, so the site lazy-loads its images and the real address sits in data-original:
def getPicLINK(txt):
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    link_list1 = LISTtree.xpath('//main/div[@class="content"]/img[@class="videopic lazy"]/@data-original')
    for link in link_list1:
        try:
            history(link)
        except Exception:
            # log the error and skip to the next picture
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print("piclink error, skipping")
            continue
The history function checks whether a picture has already been downloaded. The principle: each time a picture is downloaded, its name is appended to a text file; before each download, the names in that file are read back to decide whether the picture already exists:
def history(link):
    global picnamelist
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    path_name = 'pics2/' + pic_name  # this early version expects an existing pics2 folder
    with open('pics2/history.txt', 'a+', encoding='utf8') as f:
        f.seek(0, 0)
        picnamelist = f.readlines()
        if pic_name + '\n' not in picnamelist:
            f.writelines(pic_name + '\n')
            download_img(link, path_name)
        else:
            print('Picture %s already exists, skipped!' % pic_name)
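One design note: picnamelist is re-read from disk on every call, which is fine for a few thousand pictures; if the history grows large, loading it once into a Python set at startup would make the membership test much cheaper. That is an optional improvement, not part of the original code.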
Download function:
picCount = 0

def download_img(link, picName):
    global picCount
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(link, headers=headers)
    print("Downloading: %s" % picName[picName.rfind('/') + 1:])
    with open(picName, 'wb') as f:
        f.write(r.content)
    picCount = picCount + 1
    print("Picture %d downloaded! %s" % (picCount, getNowTime()))
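download_img holds the whole image in memory via r.content, which is fine for photos. For larger files a streamed download is a common alternative; here is a minimal sketch, assuming requests' stream=True and iter_content (the function name download_img_stream is mine, not the author's):

import requests

def download_img_stream(link, pic_name, headers):
    # Hypothetical streamed variant: write the response in chunks
    # instead of holding the whole file in memory at once.
    with requests.get(link, headers=headers, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(pic_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)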
If you found the article useful, a like and a follow are appreciated.
A lot of people have asked for the code, so here it is:
import requests
from lxml import html
import time
import traceback
import os

i = 0
name = ''
picnamelist = []
url_h = 'https://www.bpr5.com'
key_v = '/tupian/list-%E7%B2%BE%E5%93%81%E5%A5%97%E5%9B%BE'
# https://www.bpr5.com/tupian/list-%E7%B2%BE%E5%93%81%E5%A5%97%E5%9B%BE.html


def getNowTime():
    # timestamp string for the console and the error log
    year = time.localtime().tm_year
    month = time.localtime().tm_mon
    day = time.localtime().tm_mday
    hour = time.localtime().tm_hour
    minute = time.localtime().tm_min
    second = time.localtime().tm_sec
    return str(year) + "-" + str(month) + "-" + str(day) + " " + str(hour) + ":" + str(minute) + ":" + str(second)


def getHTML(pages):
    global url_h
    global key_v
    for i in range(1, pages + 1):
        url = url_h + key_v
        if i > 1:
            url = url + '-' + str(i) + '.html'  # page 2 onward carries a page number
        else:
            url = url + '.html'
        print('------Downloading pictures of page %d!------' % i)
        htmlTex(url)


def htmlTex(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf8'
    htmlTXT = r.text
    getLinks(htmlTXT)


def getLinks(txt):
    global name
    global url_h
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    links = LISTtree.xpath('//div[@class="text-list-html"]/div/ul/li/a/@href')
    for link in links:
        name = link[link.rfind('/') + 1:link.rfind('.')]
        url = url_h + link
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
            }
            r = requests.get(url, headers=headers)
            r.encoding = 'utf8'
            # print(r.text)
            htmlTXT = r.text
            getPicLINK(htmlTXT)
        except Exception:
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print('error getLinks')
            continue


def getPicLINK(txt):
    # //div[@class="box list channel max-border list-text-my"]/ul/li[@class]/a/@href
    # //main/div[@class="content"]/img[@class="videopic lazy"]/@src
    content = txt.replace('<!--', '').replace('-->', '')
    LISTtree = html.etree.HTML(content)
    link_list1 = LISTtree.xpath('//main/div[@class="content"]/img[@class="videopic lazy"]/@data-original')
    for link in link_list1:
        try:
            history_a(link)
        except Exception:
            with open("errorLog.txt", "a+", encoding="utf8") as f1:
                f1.write(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print(getNowTime() + "-- " + traceback.format_exc() + '\n')
            print("piclink error, skipping")
            continue


def history_a(link):
    global picnamelist
    global name
    pic_name = name + link[link.rfind('/') + 1:]
    path_out = os.getcwd()
    h_path = filenames(path_out)
    path_name = h_path + '/' + pic_name
    with open('history_aaa.txt', 'a+', encoding='utf8') as f:
        f.seek(0, 0)
        picnamelist = f.readlines()
        if pic_name + '\n' not in picnamelist:
            f.writelines(pic_name + '\n')
            download_img(link, path_name)
        else:
            print('Picture %s already exists, skipped!' % pic_name)


picCount = 0

def download_img(link, picName):
    global picCount
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36'
    }
    r = requests.get(link, headers=headers)
    print("Downloading: %s" % picName[picName.rfind('/') + 1:])
    with open(picName, 'wb') as f:
        f.write(r.content)
    picCount = picCount + 1
    print("Picture %d downloaded! %s" % (picCount, getNowTime()))


def filenames(path_root):
    global picCount
    os.chdir(path_root)
    root_wwj = "NewPics" + str(picCount // 490)
    if not os.path.exists(path_root + "/" + root_wwj):
        os.mkdir(root_wwj)
        os.chdir(path_root)
    return root_wwj


if __name__ == '__main__':
    pages = int(input('How many pages to crawl: '))
    getHTML(pages)
    print("complete")