P station (Pixiv) crawler: analyzing the process of batch-crawling original PNG images

P station website: https://www.pixiv.net/

1. If you want to crawl original images in batches, you must first be able to find the download URL of a single original image. One step at a time.

  • Right-click the image and choose Inspect Element to find the image's address
  • The img tag contains the image address
  • But these are all links ending in .jpg. In my last blog post, where I crawled a Pixiv artist's works, I concluded that the .jpg links are the thumbnails shown on the page: only a few hundred KB, and blurry when zoomed in. What I want to crawl is the original high-definition image, which is definitely not the .jpg link
  • Going by the experience from my last post, I changed the .jpg link to .png. The result: I hit a wall and couldn't access it
  • But I noticed the href address of the a tag above it
  • Hovering over the link below shows a preview image, but no preview appears for the link ending in .png
  • Then I copied that image address and opened it directly; as expected, it was inaccessible
  • After that, I clicked the image to enter the large preview, and hovering over the a tag's link now showed a preview
  • Copying that .png link and opening it worked
  • But this is only a superficial success. Why did the first copy fail, while the .png address opens after clicking the image?
  • The obvious suspect is cookies: I was using the same Chrome browser, which saves cookies automatically
  • So I opened a new incognito window and pasted the .png link there to test
  • Having verified the idea above: when crawling the images, the requests need to carry cookies to be accessed normally

2. After all that analysis, it's time to start crawling the original PNG

  • The first step is to crawl a single original PNG image

  • I use PyCharm

  • Press F12; the image entry under the Network tab contains the details of the PNG request

  • The request method shown on the right is GET

    import requests

    url = 'https://i.pximg.net/img-original/img/2020/09/01/00/04/33/84073765_p0.png'
    respon = requests.get(url)
    print(respon)    # not a 200 response: direct access is blocked
    
  • As proved above, direct access is bound to fail

  • Checking the Network tab just now, there is no cookie information in this request's headers

  • Then I tried putting just the user-agent into headers, but access got nowhere. I had to put the whole set of request headers in, and as a result the several header names starting with a colon raised errors

  • I didn't know why at first; those colon-prefixed entries (:authority, :method, :path, :scheme) are HTTP/2 pseudo-headers that the browser displays but that cannot be sent as ordinary header fields, so they have to be deleted

  • The final headers simply drop those first four colon-prefixed entries

    headers = {'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
               'accept-encoding': 'gzip, deflate, br',
               'accept-language': 'zh-CN,zh-HK;q=0.9,zh;q=0.8,en-US;q=0.7,en;q=0.6',
               'referer': 'https://www.pixiv.net/artworks/84073765',
               'sec-fetch-dest': 'image',
               'sec-fetch-mode': 'no-cors',
               'sec-fetch-site': 'cross-site',
               'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
               }
    
  • As expected, the request succeeded. Next, download the image and check its size (a full sketch follows after this list)

  • The download succeeded; the image is 1.5 MB

  • Here I uploaded the original image, and it looks wonderfully detailed

  • In fact, this already meets most people's needs: you can skip the Python crawler entirely, open the .png address in the browser (after clicking through from the artwork page), and save the image locally
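
  • Putting the pieces together, here is a minimal sketch of the whole single-image download (a sketch, reusing the url and headers built above; the local filename is my own choice):

    import requests

    url = 'https://i.pximg.net/img-original/img/2020/09/01/00/04/33/84073765_p0.png'
    respon = requests.get(url, headers=headers)       # headers: the dict built above
    with open('84073765_p0.png', 'wb') as f:          # arbitrary local filename
        f.write(respon.content)
    print(respon.status_code, len(respon.content))    # expect 200 and roughly 1.5 MB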

3. Next comes batch crawling: the thinking and the implementation process

  • I opened someone's home page at random; it has 48 images (the page is zoomed out here)

  • The goal: batch-crawl all 48 images

  • Open the Network tab and select XHR. XHR (XMLHttpRequest) requests are how the page exchanges data with the server

  • Look at the preview of each request and work out the pattern

  • While crawling the first image I had already run many different experiments and found the pattern

      https://i.pximg.net/img-original/img/2020/09/02/06/59/02/84100538_p0.png
      https://i.pximg.net/img-original/img/2020/02/06/07/58/50/79311204_p0.png
      https://i.pximg.net/img-original/img/2020/01/16/07/52/59/78925639_p0.png
      https://i.pximg.net/img-original/img/2020/01/14/00/02/54/78888979_p0.png
    
  • I found that the .png links of different originals share the same prefix; only the part between img/ and _p0 differs. My guess: that part is the image's upload time

  • In the XHR list I found the thumbnail link of the preview image

  • Under works there are 48 such entries, each containing a url and assorted information. Copy a url out to look at it

  • It is a thumbnail ending in .jpg

      https://i.pximg.net/c/250x250_80_a2/img-master/img/2016/10/19/00/30/00/59535035_p0_square1200.jpg
    
  • But I noticed the thumbnail link contains exactly the data after img/ that I need

  • This is still just a guess, so it needs an experiment; if the experiment fetches the image, the guess is right, and I won't belabor it (see the sketch below)
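
  • So the mapping from thumbnail to original is just a string rewrite. A minimal sketch of the guess, using the same regex and URL template as the full script below:

      import re

      thumb = 'https://i.pximg.net/c/250x250_80_a2/img-master/img/2016/10/19/00/30/00/59535035_p0_square1200.jpg'
      stem = re.findall('img/(.*?)_p0', thumb)[0]    # '2016/10/19/00/30/00/59535035'
      print('https://i.pximg.net/img-original/img/{}_p0.png'.format(stem))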

  • The next step is to fetch that preview data: find the request URL from earlier

  • Then visit that URL. From the single-image experience, request headers are a must

  • This request carries cookies. Accessing Pixiv's main site without logging in works fine, so just copy the cookie straight into headers. And per the experience above, the colon-prefixed headers need no further testing; delete them outright

  • After a successful request, print the returned data. It comes back as JSON; the built-in pprint prints it nicely formatted (a quick sketch follows)
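
  • A minimal sketch of that inspection step (url and headers here are the same ajax URL and cookie-carrying headers used in the full script below):

      import pprint
      import requests

      res = requests.get(url, headers=headers)      # the ajax 'works' url, with cookies
      pprint.pprint(res.json()['body']['works'])    # 48 entries keyed by illustration id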

  • Next, parse the JSON layer by layer and take out each desired url. A regular expression extracts the part of the url between img/ and _p0, which is then spliced onto the main link. Put the spliced string links into a list and iterate over it to download; the per-image download is the same as the single-image download earlier

  • One more point: the referer url in the headers changes for each image. The eight digits at the end are different for every image and have to be substituted in

  • The figure above shows the headers used earlier for the single image

  • The figure above shows the headers used for batch crawling: leave the referer blank, then, while iterating over the addresses, take the last eight-digit string of each address and splice those digits onto the referer base link

  • The source code of the program follows

import requests
import re

urls = []     # thumbnail urls pulled from the ajax response
stems = []    # the img/..._p0 middle part of each url, e.g. '2020/09/02/06/59/02/84100538'
url = 'https://www.pixiv.net/ajax/user/14764274/profile/illusts?ids%5B%5D=84100538&ids%5B%5D=79311204&ids%5B%5D=78925639&ids%5B%5D=78888979&ids%5B%5D=78869339&ids%5B%5D=78847944&ids%5B%5D=78827574&ids%5B%5D=78813145&ids%5B%5D=78775945&ids%5B%5D=78757259&ids%5B%5D=78429877&ids%5B%5D=78171992&ids%5B%5D=78112792&ids%5B%5D=78042247&ids%5B%5D=77996669&ids%5B%5D=77477076&ids%5B%5D=77462788&ids%5B%5D=76868075&ids%5B%5D=76225608&ids%5B%5D=75609483&ids%5B%5D=74349649&ids%5B%5D=74115208&ids%5B%5D=72597113&ids%5B%5D=71801827&ids%5B%5D=71525803&ids%5B%5D=71492859&ids%5B%5D=70654730&ids%5B%5D=69957223&ids%5B%5D=69941488&ids%5B%5D=69205005&ids%5B%5D=69122865&ids%5B%5D=67580235&ids%5B%5D=64461877&ids%5B%5D=63596968&ids%5B%5D=63322918&ids%5B%5D=62800320&ids%5B%5D=62779411&ids%5B%5D=62622498&ids%5B%5D=61962458&ids%5B%5D=61718041&ids%5B%5D=61423992&ids%5B%5D=60960268&ids%5B%5D=60807191&ids%5B%5D=60354728&ids%5B%5D=60155743&ids%5B%5D=59889253&ids%5B%5D=59850937&ids%5B%5D=59535035&work_category=illustManga&is_first_page=1&lang=zh'
headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh-HK;q=0.9,zh;q=0.8,en-US;q=0.7,en;q=0.6',
            'cache-control': 'max-age=0',
            'sec-fetch-dest': 'document',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-site': 'same-origin',
            'sec-fetch-user': '?1',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36',
            'cookie': '__utma=235335808.1027262340.1599112377.1599112377.1599112377.1; __utmc=235335808; __utmz=235335808.1599112377.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=58696328_nxjk3s7AKvYOULokEuGhZlYvcD09zmNa; device_token=b0a0e8e719881df600fe69ed34f005fb; __cfduid=d24606e62e53770a2b1a57cafec9fb8a31599112396; first_visit_datetime_pc=2020-09-03+14%3A53%3A16; c_type=22; privacy_policy_agreement=2; a_type=0; b_type=1; p_ab_id=0; p_ab_id_2=1; p_ab_d_id=43847122; __utmv=235335808.|3=plan=normal=1^5=gender=male=1^6=user_id=58696328=1^11=lang=zh=1; _fbp=fb.1.1599112398396.1467823854; yuid_b=NWKJGXE; _ga=GA1.2.1027262340.1599112377; _gid=GA1.2.1720450123.1599112401; ki_r=; limited_ads=%7B%22responsive%22%3A%22%22%7D; tags_sended=1; categorized_tags=kP7msdIeEU; ki_t=1599112404094%3B1599112404094%3B1599114359909%3B1%3B7; tag_view_ranking=RTJMXD26Ak~kP7msdIeEU~ugBhCty2is~OT4SuGenFI~3T65lX9DdZ~O7hVbWB9-R~y3Lt8R6sIh~Lt-oEicbBr~3mLXnunyNA~pNfuh5ybtG~eVxus64GZU~pzzjRSV6ZO~CrFcrMFJzz~LJo91uBPz4~d9UpgqVAEz~AYK00_a66q~ahXZD0-3Je~5H5jwYRKk2~lH5YZxnbfC~kGYw4gQ11Z~75zhzbk0bS~kBUaOn2Z6G~2acjSVohem~kqu7T68WD3; __utmt=1; __utmb=235335808.39.9.1599112393691; _gat_UA-1830249-138=1'}

referer_base = 'https://www.pixiv.net/artworks/{}'    # per-image referer; {} is the artwork id
headers2 = {'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh-HK;q=0.9,zh;q=0.8,en-US;q=0.7,en;q=0.6',
            'referer': '',
            'sec-fetch-dest': 'image',
            'sec-fetch-mode': 'no-cors',
            'sec-fetch-site': 'cross-site',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
           }

res = requests.get(url, headers=headers)
works = res.json()['body']['works']          # dict of 48 works keyed by illustration id
for i in works:
    urls.append(works[i]['url'])             # thumbnail url ending in .jpg
for a in urls:
    stems.append(re.findall('img/(.*?)_p0', a)[0])   # keep only the date/id stem

url_img = 'https://i.pximg.net/img-original/img/{}_p0.png'
for i, stem in enumerate(stems):
    # the last eight characters of the stem are the artwork id;
    # splice them onto the referer base link
    headers2['referer'] = referer_base.format(stem[-8:])
    rrr = requests.get(url_img.format(stem), headers=headers2)
    print(headers2)
    print(url_img.format(stem))
    # the ./new1 directory must already exist
    with open('./new1/%d-image.png' % i, 'wb') as f:
        f.write(rrr.content)
    print("%d-image   Download successful\n" % i)

  • Run results: batch crawling succeeded

Tags: Python, Python crawler

Posted by le007 on Thu, 19 May 2022 17:47:59 +0300