37. Integrate a Bloom filter into scrapy-redis


Python distributed crawler builds a search engine with Scrapy: integrating a Bloom filter into scrapy-redis to judge whether a URL has already been seen

A detailed explanation of the Bloom filter

Basic concepts

To judge whether an element is in a set, the usual approach is to store all the elements and then compare against them; linked lists, trees and other data structures all follow this idea. But as the set grows, it needs more and more storage and lookups get slower. A hash table improves on this: a hash function maps an element to a position in a bit array, and checking whether that position is 1 tells you whether the element is in the set. This is the basic idea behind the Bloom filter.

The problem with hashing is collisions. Even assuming the hash function is good, if the bit array has m positions and we want to keep the collision rate down to, say, 1%, such a hash table can only hold about m/100 elements, which is hardly space-efficient. The fix is simple: use several independent hash functions. If any one of them says the element is not in the set, it definitely is not; if all of them say it is, the answer may still be wrong, but intuitively the probability of that is low.
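To make the numbers concrete, the standard approximation for the false positive rate with m bits, k hash functions and n inserted elements is p ≈ (1 - e^(-kn/m))^k. A minimal sketch (an illustration, not part of the package used later) that evaluates it:

import math

def false_positive_rate(m, k, n):
    # standard approximation: p ~= (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k * n / m)) ** k

# 10 bits per element with k = 7 hash functions
print(false_positive_rate(m=10_000_000, k=7, n=1_000_000))   # roughly 0.008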

Advantages

Compared with other data structures, a Bloom filter has big advantages in both space and time: storage is compact and insertion/query time is constant. The hash functions are independent of each other, which makes the filter easy to parallelize in hardware. A Bloom filter also never stores the elements themselves, which is an advantage where confidentiality requirements are strict.

A Bloom filter can represent the complete universe of elements, which no other data structure can do;

When k and m are the same, the intersection, union and difference of two Bloom filters that use the same set of hash functions can be computed with bitwise operations, as sketched below.
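With Redis bitmaps this can even be done server-side with the BITOP command. A small sketch, assuming two filters built with the same m and the same seeds; the key names are hypothetical:

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

# Union: OR the underlying bitmaps into a new key.
r.bitop('OR', 'BloomFilter_union_0', 'BloomFilterA_0', 'BloomFilterB_0')

# Intersection: AND them instead.
r.bitop('AND', 'BloomFilter_intersection_0', 'BloomFilterA_0', 'BloomFilterB_0')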

Disadvantages

The disadvantages of a Bloom filter are as obvious as its advantages. False positives are one of them: as the number of stored elements grows, the false positive rate grows with it. (If the number of elements is small, a plain hash table is sufficient anyway.)

In addition, elements generally cannot be deleted from a Bloom filter. It is tempting to turn the bit array into an array of counters, incrementing the counter of each position on insert so that it can be decremented on delete. However, deleting safely is not that easy: first, you must be sure the element really is in the filter, which the filter alone cannot guarantee; second, counter overflow also causes problems.
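For reference, this counter-based variant is known as a counting Bloom filter. A toy in-memory sketch of the idea (not the Redis-backed package used later in this article):

import mmh3

class CountingBloomFilter:
    # Each position holds a small counter instead of a single bit.
    def __init__(self, m=1024, seeds=(5, 7, 11, 13)):
        self.m = m
        self.seeds = seeds
        self.counters = [0] * m

    def add(self, value):
        for seed in self.seeds:
            self.counters[mmh3.hash(value, seed) % self.m] += 1

    def remove(self, value):
        # Only safe if value really was added; otherwise other elements get corrupted.
        if self.is_exist(value):
            for seed in self.seeds:
                self.counters[mmh3.hash(value, seed) % self.m] -= 1

    def is_exist(self, value):
        return all(self.counters[mmh3.hash(value, seed) % self.m] > 0 for seed in self.seeds)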

BloomFilter_imooc: a Redis-based Bloom filter implemented in Python

Downloading BloomFilter_imooc

Download address: https://github.com/liyaopinne...

Dependencies:

The package is a Redis-based Bloom filter for Python and depends on mmh3 (Python bindings for MurmurHash3).

Install the dependency:

  pip install mmh3
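mmh3 exposes MurmurHash3 as mmh3.hash(value, seed), which returns a signed 32-bit integer; this is why the filter code below folds negative hashes back into a positive range. A quick sanity check after installing:

import mmh3

# The same value hashed with different seeds gives independent-looking results.
print(mmh3.hash('www.jobbole.com', 543))
print(mmh3.hash('www.jobbole.com', 460))   # may be negative (signed 32-bit)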

1. Install the dependencies required by BloomFilter_imooc

2. Download and extract the BloomFilter_imooc package, then copy the py_bloomfilter.py file into your Scrapy project directory

py_bloomfilter.py (Bloom filter) source code:

import mmh3
import redis
import math
import time

class PyBloomFilter():
    # 100 built-in random seeds
    SEEDS = [543, 460, 171, 876, 796, 607, 650, 81, 837, 545, 591, 946, 846, 521, 913, 636, 878, 735, 414, 372,
             344, 324, 223, 180, 327, 891, 798, 933, 493, 293, 836, 10, 6, 544, 924, 849, 438, 41, 862, 648, 338,
             465, 562, 693, 979, 52, 763, 103, 387, 374, 349, 94, 384, 680, 574, 480, 307, 580, 71, 535, 300, 53,
             481, 519, 644, 219, 686, 236, 424, 326, 244, 212, 909, 202, 951, 56, 812, 901, 926, 250, 507, 739, 371,
             63, 584, 154, 7, 284, 617, 332, 472, 140, 605, 262, 355, 526, 647, 923, 199, 518]

    # capacity: estimated number of elements to deduplicate
    # error_rate: acceptable false positive rate
    # conn: the Redis connection client
    # key: name prefix of the keys stored in Redis
    def __init__(self, capacity=1000000000, error_rate=0.00000001, conn=None, key='BloomFilter'):
        self.m = math.ceil(capacity*math.log2(math.e)*math.log2(1/error_rate))      # total number of bits required
        self.k = math.ceil(math.log1p(2)*self.m/capacity)                           # number of hash functions used (note: math.log1p(2) is ln 3; the theoretical optimum would use ln 2)
        self.mem = math.ceil(self.m/8/1024/1024)                                    # memory required, in MB
        self.blocknum = math.ceil(self.mem/512)                                     # number of 512 MB blocks (a Redis string is capped at 512 MB); the block is chosen by the first character's ASCII code, so there are at most 256 blocks
        self.seeds = self.SEEDS[0:self.k]
        self.key = key
        self.N = 2**31-1
        self.redis = conn
        # print(self.mem)
        # print(self.k)

    def add(self, value):
        name = self.key + "_" + str(ord(value[0])%self.blocknum)
        hashs = self.get_hashs(value)
        for hash in hashs:
            self.redis.setbit(name, hash, 1)

    def is_exist(self, value):
        name = self.key + "_" + str(ord(value[0])%self.blocknum)
        hashs = self.get_hashs(value)
        exist = True
        for hash in hashs:
            exist = exist & self.redis.getbit(name, hash)
        return exist

    def get_hashs(self, value):
        hashs = list()
        for seed in self.seeds:
            hash = mmh3.hash(value, seed)
            if hash >= 0:
                hashs.append(hash)
            else:
                hashs.append(self.N - hash)
        return hashs

pool = redis.ConnectionPool(host='127.0.0.1', port=6379, db=0)
conn = redis.StrictRedis(connection_pool=pool)

# Usage example
# if __name__ == "__main__":
#     bf = PyBloomFilter(conn=conn)           # connect to Redis through the connection pool
#     bf.add('www.jobbole.com')               # add a value to the Bloom filter in Redis
#     bf.add('www.luyin.org')                 # add a value to the Bloom filter in Redis
#     print(bf.is_exist('www.zhihu.com'))     # check whether the value is in the filter: prints 1 if it is, 0 if not
#     print(bf.is_exist('www.luyin.org'))     # check whether the value is in the filter: prints 1 if it is, 0 if not
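To get a feel for how much Redis memory a given configuration implies, the class can be instantiated without touching Redis, because m, k, mem and blocknum are all computed in __init__ and conn is only used by add/is_exist. A quick sizing check (adjust the import to wherever you copied the file):

from bloomfilter.py_bloomfilter import PyBloomFilter   # assumed module path

bf = PyBloomFilter(capacity=10_000_000, error_rate=0.0001, conn=None)
print(bf.m)         # total number of bits in the filter
print(bf.k)         # number of hash seeds actually used
print(bf.mem)       # approximate memory needed, in MB
print(bf.blocknum)  # number of 512 MB Redis string blocks used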

Next, integrate py_bloomfilter.py (the Bloom filter) into scrapy-redis's dupefilter.py deduplicator, so that URLs that have already been crawled are not handed to the downloader again, while URLs that have not been crawled are.

The modified dupefilter.py in scrapy-redis:

import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings
from bloomfilter.py_bloomfilter import conn, PyBloomFilter   # import the Bloom filter

logger = logging.getLogger(__name__)

# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key Where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

        # Integrate the Bloom filter
        self.bf = PyBloomFilter(conn=conn, key=key)     # connect to Redis through the connection pool

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)

        # Bloom filter check instead of the default Redis set
        if self.bf.is_exist(fp):    # the fingerprint is already in the Bloom filter
            return True
        else:
            self.bf.add(fp)         # otherwise record the fingerprint and treat the request as new
            return False

        # This returns the number of values added, zero if already exists.
        # added = self.server.sadd(self.key, fp)
        # return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
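For the modified duplicate filter to be used at all, the Scrapy project's settings.py has to route scheduling and deduplication through scrapy-redis. A typical configuration sketch (the Redis host and port below are assumptions for a local setup):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # use the scrapy-redis scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # the class modified above
SCHEDULER_PERSIST = True                                     # keep the queue and fingerprints between runs
REDIS_HOST = '127.0.0.1'                                     # assumed local Redis instance
REDIS_PORT = 6379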

The spider file:

#!/usr/bin/env python
# -*- coding:utf8 -*-

from scrapy_redis.spiders import RedisCrawlSpider    # import the RedisCrawlSpider class from scrapy_redis
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class jobboleSpider(RedisCrawlSpider):               # custom spider class, inheriting RedisCrawlSpider
    name = 'jobbole'                                 # spider name
    allowed_domains = ['www.luyin.org']              # domains allowed to crawl
    redis_key = 'jobbole:start_urls'                 # the Redis key that holds the start URLs

    rules = (
        #Configure crawl list page rules
        # Rule(LinkExtractor(allow=('ggwa/.*')), follow=True),

        #Configure rules for fetching content pages
        Rule(LinkExtractor(allow=('.*')), callback='parse_job', follow=True),
    )

    def parse_job(self, response):  # callback function. Note: the CrawlSpider template already defines a parse method internally, so do not name your callback parse
        # In a real project you would fill item containers here (e.g. via ItemLoader);
        # for this demo we just extract and print the page title.
        neir = response.css('title::text').extract()
        print(neir)

Starting the crawler

cd into the Redis installation directory and run: redis-cli -h 127.0.0.1 -p 6379   # connect to the Redis client

After connecting with redis-cli, run: lpush jobbole:start_urls http://www.luyin.org to push a start URL for the crawler into Redis.

Start crawling
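Note that the spider process itself must also be running; it blocks and waits on jobbole:start_urls until a URL is pushed. From the Scrapy project directory it is started in the usual way:

  scrapy crawl jobbole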

Redis status description: once the start URL is pushed, keys such as jobbole:start_urls, the scrapy-redis request queue and the Bloom filter bitmaps can be seen in the Redis client.
