A crawler program may be interrupted partway through by a website or network failure. If we simply run it again from the beginning, data that has already been crawled will be fetched and written to the database a second time. How do we avoid that?

Incremental crawler

An incremental crawler fetches only the data that has been added or updated on a website since the last run. There are three places where the check for already-crawled content can be made:

  1. Before sending a request, check whether the URL has already been crawled.
  2. After parsing the content, check whether this content has already been crawled.
  3. When writing to the storage medium, check whether the content is already stored there.

It is not hard to see that the core of incremental crawling is deduplication. As for which step the deduplication should happen at, each option has its own advantages and disadvantages; in practice the ideas are combined according to the actual situation.

  • The first idea suits websites that keep adding new pages, such as new chapters of a novel or the day's latest news.
  • The second idea suits websites whose existing page content is updated.
  • The third idea acts as the last line of defense.
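
As a rough sketch of where the three checks could sit in a crawl loop (fetch_page, parse_items, fingerprint, already_stored and save_item are hypothetical helpers, not part of the case below):

seen_urls = set()          # checkpoint 1: URLs already requested
seen_fingerprints = set()  # checkpoint 2: content already parsed

def crawl(start_urls):
    queue = list(start_urls)
    while queue:
        url = queue.pop()
        if url in seen_urls:                  # 1. skip URLs already requested
            continue
        seen_urls.add(url)
        page = fetch_page(url)
        for item in parse_items(page):
            fp = fingerprint(item)
            if fp in seen_fingerprints:       # 2. skip content already parsed
                continue
            seen_fingerprints.add(fp)
            if not already_stored(item):      # 3. last line of defense before writing
                save_item(item)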

Combining them achieves the most thorough deduplication. Deduplication methods:

  1. Store the URLs generated during crawling in a Redis set. On the next crawl, before sending a request, check whether the URL already exists in that set; only send the request if it does not.
  2. Compute a unique fingerprint for each crawled page's content and store it in a Redis set. On the next crawl, before persisting the data, check whether its fingerprint already exists in the Redis set, and decide on that basis whether to store it (see the sketch below).
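
A minimal sketch of both methods, assuming a Redis server running locally and MD5 as the fingerprint (the key names are illustrative):

import hashlib
import redis

r = redis.Redis()

def md5(value):
    return hashlib.md5(value.encode('utf-8')).hexdigest()

def url_is_new(url):
    # sadd returns 1 if the member was added and 0 if it already existed,
    # so one call both checks and records the URL
    return r.sadd('crawler:seen_urls', md5(url)) == 1

def content_is_new(content):
    # fingerprint the page content before deciding whether to persist it
    return r.sadd('crawler:seen_content', md5(content)) == 1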

Case: Maoyan actor information crawling

Website

https://maoyan.com/films/celebrity/29410

Requirements

(1) The crawling strategy in this case is to start from several actors as seed tasks; subsequent actors are obtained from each actor's related filmmakers.
(2) Related filmmakers may repeat, so they need to be deduplicated.
(3) A custom scheduler records the tasks that have already been downloaded.

Code implementation

import hashlib
import threading
import requests
from lxml import etree
from queue import Queue
import redis
class MaoYanActor(threading.Thread):
    """Worker thread that crawls Maoyan actor pages taken from a shared queue."""
    def __init__(self, q_actors, name):
        super().__init__()
        self.name = name
        self.q_actors = q_actors            # shared queue of actor page URLs
        self.redis_ = redis.Redis()         # Redis connection used for URL deduplication
        self.s = requests.session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.2.6.1000',
            'Host': 'maoyan.com',
            'Connection': 'keep-alive',
        }
        self.s.headers.update(self.headers)

    def get_md5(self, value):
        # MD5 fingerprint of a string, used as the deduplication key
        return hashlib.md5(bytes(value, encoding='utf-8')).hexdigest()

    def get_xpath(self, url):
        # fetch a page with the session and return its parsed element tree
        rep = self.s.get(url).text
        return etree.HTML(rep)
    def url_seen(self, url):
        # sadd returns 1 if the fingerprint was newly added and 0 if it already
        # existed, so one call both checks and records the URL
        fp = self.get_md5(url)
        add = self.redis_.sadd('maoyan_actor:actor_urls', fp)
        return add == 0

    def parse_actor(self, tree):
        # extract the actor's name; persisting further fields is omitted here
        actor_name = tree.xpath('//div[@class="shortInfo"]/p[1]/text()')[0]
        print(f'{self.name} parsed actor: {actor_name}')
    def run(self):
        while True:
            if self.q_actors.empty():
                break
            # get the next actor URL from the shared queue
            url = self.q_actors.get()
            print(f"{self.name}==========@{url.split('/')[-1]}")
            if not self.url_seen(url):
                # request the page and parse it into an element tree
                tree = self.get_xpath(url)
                # parse and output the current actor's information
                self.parse_actor(tree)
                # collect the related filmmakers' URLs and enqueue them as new tasks
                relation_actors = tree.xpath('//div[@class="rel-item"]/a/@href')
                for actor in relation_actors:
                    self.q_actors.put('https://maoyan.com' + actor)

if __name__ == '__main__':
    # seed actor page URLs
    start_urls = [
        'https://maoyan.com/films/celebrity/29410',
        'https://maoyan.com/films/celebrity/789',
        'https://maoyan.com/films/celebrity/28383',
        'https://maoyan.com/films/celebrity/33280',
        'https://maoyan.com/films/celebrity/28449'
    ]
    q_actors = Queue()
    for url in start_urls:
        q_actors.put(url)
    name_list=['Thread 1','Thread 2','Thread 3','Thread 4']
    for name in name_list:
        print(name)
        m=MaoYanActor(q_actors,name)
        m.start()
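
Because the fingerprints live in a Redis set rather than in the crawler's memory, the deduplication state survives restarts: if the crawler is interrupted, running it again will skip every actor page already recorded in maoyan_actor:actor_urls, which is exactly the incremental behaviour described above. Note also that a single sadd call both checks for and records a URL, so no separate existence query is needed.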
