redis Basics - bloom filter

Bloom filter



In 1970, an elder named bloom studied the problem of judging whether elements in massive elements exist, that is, how much bitmap capacity and how many hash functions are needed. It published a paper and proposed this container as Bloom filter.


working principle


First, the essence of Bloom filter is a bit array and several hash functions.



There are three elements in the set. What should I do to save them into the bloom filter? The first is the element a. here we calculate it three times. b. So is the C element.

After the element has been saved, now I want to judge whether an element exists in this container. I need to use the same three functions for calculation.

For example, for element d, I calculate with the first function f1 and find that this position is 1, no problem. The second position is also 1, and the third position is also 1.

If the subscript position value obtained through three calculations is 1, in this case, can you be sure that the d element must be in this container? In fact, it can't. For example, in this figure, these three positions are set to 1 when saving a, b and c, so even if the d element has not been saved before, it will get three 1s and return true.

Therefore, this is a very important feature of Bloom filter. Because hash collision is inevitable, it will have a certain misjudgment rate. We call this False Positive Probability (FPP) when the element that does not exist in the bloom filter is misjudged to exist.

Let's look at another element, the e element. We need to determine whether it exists in the container. Similarly, we need to use these three functions to calculate. The first position is 1, the second position is 1, and the third position is 0.

Is the e element not necessarily in this container? You can be sure it doesn't exist. If the e element has been stored in the bloom filter at that time, these three positions must be 1, and there can be no 0.


Summary: features of Bloom filter:

From the perspective of container

  1. If the bloom filter determines that the element exists in the set, it does not necessarily exist
  2. If the bloom filter judges that it does not exist, there must be no electric package

From the perspective of elements

  1. If there is a certain element in bloom, it is judged that the filter must exist
  2. If the element does not actually exist, the bloom filter may judge that it exists

Implementation of Guava


Google's Guava provides a ready-made bloom filter.

<dependency>
	<groupId>com.google.guava</groupId>
	<artifactId>guava</artifactId>
	<version>21.0</version>
<dependency>

To create a bloom filter:


BloomFilter<String> bf -BloomFilter. create(Funnels.stringFunnel(Charsets. UTF_8),insertions);

The method of storing elements provided by the bloom filter is put().
The method provided by the bloom filter to determine whether an element exists is mightcontext().


if(bf.mightContain(data)){
	if(sets.contains(data)){
		//When judging the actual existence, hit
		right++;
		continue;
	}
	//When judgment exists but does not exist, it is wrong
	wrongt++;
}

The error rate of Bloom filter is set to 0.03 by default, which can also be specified when creating.

public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
	return create(funnel, expectedInsertions, 0.03D); 
}

The capacity of the bitmap is calculated based on the number of elements and the misjudgment rate.

long numBits = optimalNumOfBits(expectedInsertions, fpp);

According to the size of the bit group, we further calculate the number of hash functions.

int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);

Storing 1 million elements only takes up 0.87M of memory and generates 5 hash functions.


Application of Bloom filter in the project


Working position of Bloom filter:




To determine whether the value of the database exists, the first step is to load all the data in the database. Before going to Redis to query, first query in the bloom filter. If bf says no, there must be no database and there is no need to check.
If bf you say yes, you can go through the previous process.


/**
 * Bloom filter concurrency test between redis and database
 */

@RunWith(SpringJUnit4ClassRunner.class)
@SpringBootTest
@EnableAutoConfiguration
public class BloomTestsConcurrency {
    @Resource
    private RedisTemplate redisTemplate;

    @Autowired
    private UserService userService;

    private static final int THREAD_NUM = 1000; // The number of concurrent threads should not be set too large for Windows machines

    static BloomFilter<String> bf;

    static List<User> allUsers;

    @PostConstruct
    public void init() {
        // Get the data from the database and load it into the bloom filter
        long start = System.currentTimeMillis();
        allUsers = userService.getAllUser();
        if (allUsers == null || allUsers.size() == 0) {
            return;
        }
        // When creating bloom filter, the default misjudgment rate is 0.03, i.e. 3%
        bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), allUsers.size());
        // The lower the misjudgment rate, the longer the array length, and the more hash functions are required
        // bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), allUsers.size(), 0.0001);
        // Store data into bloom filter
        for (User user : allUsers) {
            bf.put(user.getAccount());
        }
        long end = System.currentTimeMillis();
        System.out.println("Query and load"+allUsers.size()+"The total time of data to bloom filter is:"+(end -start ) +"millisecond");
    }

    @Test
    public void cacheBreakDownTest() {
        long start = System.currentTimeMillis();
        allUsers = userService.getAllUser();
        CyclicBarrier cyclicBarrier = new CyclicBarrier(THREAD_NUM);
        ExecutorService executorService = Executors.newFixedThreadPool(THREAD_NUM);
        for (int i = 0; i < THREAD_NUM; i++){
            executorService.execute(new BloomTestsConcurrency().new MyThread(cyclicBarrier, redisTemplate, userService));
        }

        executorService.shutdown();
        //Determine whether all threads have finished running
        while (!executorService.isTerminated()) {

        }

        long end = System.currentTimeMillis();
        System.out.println("Concurrent number:"+THREAD_NUM + "´╝îTotal time spent creating new threads and filtering:"+(end -start ) +"MS, end of presentation");
    }

    public class MyThread implements Runnable {
        private CyclicBarrier cyclicBarrier;
        private RedisTemplate redisTemplate;
        private UserService userService;

        public MyThread(CyclicBarrier cyclicBarrier, RedisTemplate redisTemplate, UserService userService) {
            this.cyclicBarrier = cyclicBarrier;
            this.redisTemplate = redisTemplate;
            this.userService = userService;
        }

        @Override
        public void run() {
            //All sub threads wait. When all sub threads are created, they will execute the following code concurrently
            try {
                cyclicBarrier.await();
            } catch (InterruptedException e) {
                e.printStackTrace();
            } catch (BrokenBarrierException e) {
                e.printStackTrace();
            }

            // 1.1 (test: the bloom filter judges that it does not exist and intercepts -- if there is no bloom filter, it will cause cache penetration)
            // Randomly generate a string, which does not exist in redis
            String randomUser = UUID.randomUUID().toString();
            // 1.2 (test: the bloom filter judges the existence of Redis, and takes the value from Redis cache. If Redis is empty, query the database and write it to Redis)
            // Get an existing user from the List
            // String randomUser = allUsers.get(new Random().nextInt(allUsers.size())).getAccount();
            String key = "Key:" + randomUser;

            Date date1 = new Date();
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

            // If this user does not exist in the bloom filter, return directly and block the flow
            if (!bf.mightContain(randomUser)) {
                System.out.println(sdf.format(date1)+" The bloom filter does not exist, illegal request");
                return;
            }

            // Query the cache, and directly return the cached data if it exists in the cache
            ValueOperations<String, String> operation =
                    (ValueOperations<String, String>) redisTemplate.opsForValue();
            Object cacheUser = operation.get(key);
            if (cacheUser != null) {
                Date date2 = new Date();
                System.out.println(sdf.format(date2)+" Hit redis cache");
                return;
            }

            // TODO prevents concurrent repeated write cache and locks
            synchronized (randomUser) {
                // If the cache does not exist, query the database
                List<User> user = userService.getUserByAccount(randomUser);
                if (user == null || user.size() == 0) {
                    // HikariPool-1 - Connection is not available, request timed out after
                    System.out.println(" Redis Cache does not exist, query database does not exist, cache penetration occurs!!!");
                    return;
                }
                // Write the data queried by mysql database into redis
                Date date3 = new Date();
                System.out.println(sdf.format(date3)+" Query and write from database Reids");
                operation.set("Key:" + user.get(0).getAccount(), user.get(0).getAccount());
            }
        }
    }
}

Shortcomings and variants of Bloom filter


If the database is deleted, the data of Bloom filter should also be deleted. However, there is no deletion method in bloom filter. Why doesn't bloom filter provide a way to delete? Or, what happens if the elements of the bloom filter are deleted?




For example, if we delete a, the three positions should be changed to 0. But when judging whether element b exists or not, because a position becomes 0, element b also judges that it does not exist. Because of hash collision, elements can only be stored and cannot be deleted.

If we want to realize the function of deletion, what should we do? Similar to HashMap's chain address method, we can add a counter at each subscript position. For example, if this position is hit twice, the counter is 2. When deleting element a, first change the counter to 1. When the b element is deleted, the counter becomes 0, and then the bit corresponding to the subscript is set to 0.

In fact, in the decades since bloom filter was proposed, there have been many variants of Bloom filter. This bf that provides deletion function through counter is called Counting Bloom Filter.

/**
 * https://github.com/Baqend/Orestes-Bloomfilter
 */
public class CountingBloomFilterTest {
    public static void main(String[] args) {
        CountingBloomFilter<String> cbf = new FilterBuilder(1000,
                0.01).buildCountingBloomFilter();
        cbf.add("http://xiecongcong.com");
        cbf.add("http://alibaba.com");
        cbf.add("http://baidu.com");

        cbf.remove("http://baidu.com");

        System.out.println(cbf.contains("http://xiecongcong.com")); //true
        System.out.println(cbf.contains("http://alibaba.com")); //true
        System.out.println(cbf.contains("http://baidu.com")); //false
    }
}

Other application scenarios of Bloom filter


What is the problem solved by Bloom filter? How to quickly determine whether an element exists in a large number of elements. Therefore, in addition to solving the problem of cache penetration, we have many other uses.

For example, we don't need to repeat the crawling of data crawlers and crawled URLs. How can we judge whether a url has been crawled in billions of URLs?

There are also our email servers and accounts that send spam. We call them spamers. In so many email accounts, how can we judge whether an account is a spamer?

In some scenes, we can use bloom filter.

Tags: Redis

Posted by Joe689 on Thu, 05 May 2022 04:12:51 +0300