Bloom filters: a powerful tool for avoiding cache penetration in Redis

When working with Redis, it is easy to run into problems such as deduplicating massive amounts of data, or avoiding cache penetration. Both of these problems can be solved with a Bloom filter.
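As a sketch of how a Bloom filter guards against cache penetration, consider the following hypothetical `CachePenetrationGuard` class. Plain `HashMap`s stand in for Redis and the database here (in a real system these would be a Redis client and a DAO); only keys the filter might contain are allowed to reach the cache and backing store:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Bloom filter in front of a cache and a database.
// The two Maps are stand-ins for Redis and the DB, not real integrations.
public class CachePenetrationGuard {

    private final BloomFilter<String> keyFilter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
    private final Map<String, String> cache = new HashMap<>();    // stand-in for Redis
    private final Map<String, String> database = new HashMap<>(); // stand-in for the DB

    public void save(String key, String value) {
        database.put(key, value);
        keyFilter.put(key); // register every persisted key in the filter
    }

    public String get(String key) {
        // A key the filter has definitely never seen cannot be in the DB,
        // so the request never touches the cache or the database.
        if (!keyFilter.mightContain(key)) {
            return null;
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = database.get(key);
        if (fromDb != null) {
            cache.put(key, fromDb); // backfill the cache on a hit
        }
        return fromDb;
    }

    public static void main(String[] args) {
        CachePenetrationGuard guard = new CachePenetrationGuard();
        guard.save("user:1", "Alice");
        System.out.println(guard.get("user:1"));   // present in DB, served and cached
        System.out.println(guard.get("user:999")); // almost certainly rejected by the filter
    }
}
```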

Bloom Filter concept

A Bloom filter is a long binary vector combined with a series of random mapping (hash) functions. A Bloom filter can be used to test whether an element is a member of a set. Its main advantage is that its space efficiency and query time far exceed those of general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

Bloom Filter principle

The basic idea of a Bloom filter is: when an element is added to the set, K hash functions map it to K points in a bit array, and those points are set to 1. When querying, we only need to check whether these points are all 1 to know whether the element is in the set: if any of them is 0, the queried element definitely does not exist; if all of them are 1, it exists with high probability.

Its difference from a single-hash-function bitmap is that it uses K hash functions, so each entry corresponds to K bits, which greatly reduces the probability of collisions.
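The idea above can be sketched directly: one bit array plus k positions per element. The k positions are derived here via simple double hashing, an illustrative simplification; real implementations such as Guava's use stronger hash mixing:

```java
import java.util.BitSet;

// Minimal illustrative Bloom filter: k hash positions per element over one bit array.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // size of the bit array
    private final int k; // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k positions from two base hashes (double-hashing scheme, illustration only).
    private int position(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    public void put(Object element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i)); // set all k positions to 1
        }
    }

    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false; // any 0 means the element was definitely never added
            }
        }
        return true; // all 1s: probably present (may still be a false positive)
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20, 5);
        filter.put("hello");
        System.out.println(filter.mightContain("hello")); // always true for an added element
        System.out.println(filter.mightContain("world")); // almost certainly false here
    }
}
```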

Shortcomings

Because the Bloom filter achieves high efficiency in both time and space, it gives up accuracy of judgment and ease of deletion:

  • False positives: the queried element does not exist in the container, but the k positions it hashes to all happen to be 1, producing a false positive.

    • Workaround: maintain a separate list of elements that may be misjudged
  • Difficult deletion: an element placed in the container sets k positions of the bit array to 1. When deleting it, those positions cannot simply be reset to 0, because that may affect the judgment of other elements that share them
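One common workaround for the deletion problem is a counting Bloom filter: each bit is replaced by a small counter, so deletion becomes a decrement instead of a destructive reset. The sketch below is an illustration of that idea, not part of Guava's API:

```java
// Counting Bloom filter sketch: counters instead of bits make deletion possible,
// at the cost of extra space. Illustration only.
public class CountingBloomFilter {
    private final int[] counters;
    private final int m; // number of counters
    private final int k; // number of hash functions

    public CountingBloomFilter(int m, int k) {
        this.counters = new int[m];
        this.m = m;
        this.k = k;
    }

    // Derive k positions from two base hashes (double-hashing, illustration only).
    private int position(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    public void put(Object element) {
        for (int i = 0; i < k; i++) {
            counters[position(element, i)]++;
        }
    }

    // Deleting decrements the k counters instead of clearing bits, so positions
    // shared with other elements keep a positive count and are not disturbed.
    public void remove(Object element) {
        for (int i = 0; i < k; i++) {
            int p = position(element, i);
            if (counters[p] > 0) {
                counters[p]--;
            }
        }
    }

    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (counters[position(element, i)] == 0) {
                return false;
            }
        }
        return true;
    }
}
```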

Bloom Filter implementation

Bloom filters have many implementations and optimizations. Here we introduce the Bloom filter provided by Guava.

When using a Bloom filter, two parameters cannot be avoided: the estimated number of elements n and the acceptable false-positive rate fpp.

The implementation then comes down to choosing the hash functions and the size of the bit array.

For a given scenario, we estimate the amount of data to be stored as n and the expected false-positive rate as fpp. From these we calculate the required size m of the bit array and the number k of hash functions, and then select the hash functions.

Bit array size selection

According to the estimated data volume n and the false-positive rate fpp, the required size m of the bit array is:

m = -n × ln(fpp) / (ln 2)²

Hash function selection

According to the estimated data volume n and the length m of the bit array, the number k of hash functions is:

k = (m / n) × ln 2
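Both formulas can be computed directly; the sketch below mirrors the optimalNumOfBits/optimalNumOfHashFunctions calculations that appear in the Guava source quoted further down:

```java
// Compute the optimal bit-array size m and hash-function count k from the
// expected insertions n and the desired false-positive rate fpp.
public class BloomSizing {

    // m = -n * ln(fpp) / (ln 2)^2
    static long optimalNumOfBits(long n, double fpp) {
        return (long) (-n * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    // k = max(1, round((m / n) * ln 2))
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        for (double fpp : new double[]{0.03, 0.01, 0.001}) {
            long m = optimalNumOfBits(n, fpp);
            int k = optimalNumOfHashFunctions(n, m);
            System.out.printf("fpp=%.3f -> m=%d bits (~%.2f MB), k=%d%n",
                    fpp, m, m / 8.0 / 1024 / 1024, k);
        }
    }
}
```

Note how a stricter fpp drives both the bit array and the hash count up: roughly 7.3 bits and 5 hashes per element at fpp = 0.03, versus about 14.4 bits and 10 hashes at fpp = 0.001.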

Implementation

First, you need to introduce the Guava dependency:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>23.0</version>
</dependency>

The following is the core code of our test, which consists of two main steps:

  • Insert one million integers into the filter, then verify that all one million pass the filter
  • Then query 10,000 integers that were never inserted and count how many slip through as false positives
public class TestBloomFilter {

    private static int total = 1000000;
    private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total);
//    private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total, 0.001);

    public static void main(String[] args) {
        // Initialize the filter with 1,000,000 values
        for (int i = 0; i < total; i++) {
            bf.put(i);
        }

        // Check the values already in the filter; none should be reported missing
        for (int i = 0; i < total; i++) {
            if (!bf.mightContain(i)) {
                System.out.println("A value in the filter was not recognized");
            }
        }

        // Count how many of 10,000 values not in the filter are falsely reported as present
        int count = 0;
        for (int i = total; i < total + 10000; i++) {
            if (bf.mightContain(i)) {
                count++;
            }
        }
        System.out.println("Number of false positives: " + count);
    }

}

Result:

The output shows that all one million numbers in the filter are recognized, while of the 10,000 numbers not in the filter, 320 are reported as present. That is a false-positive rate of about 3%, in line with the default fpp of 0.03.

Bloom filter create function

In the example above we used the create method to construct the filter. Let's take a look at the source code of create:

 @VisibleForTesting
    static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions, double fpp, BloomFilter.Strategy strategy) {
        Preconditions.checkNotNull(funnel);
        Preconditions.checkArgument(expectedInsertions >= 0L, "Expected insertions (%s) must be >= 0", expectedInsertions);
        Preconditions.checkArgument(fpp > 0.0D, "False positive probability (%s) must be > 0.0", fpp);
        Preconditions.checkArgument(fpp < 1.0D, "False positive probability (%s) must be < 1.0", fpp);
        Preconditions.checkNotNull(strategy);
        if (expectedInsertions == 0L) {
            expectedInsertions = 1L;
        }

        long numBits = optimalNumOfBits(expectedInsertions, fpp);
        int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);

        try {
            return new BloomFilter(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
        } catch (IllegalArgumentException var10) {
            throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", var10);
        }
    }

Although there are four overloaded create methods, they all ultimately delegate to this last one. The meaning of each parameter is as follows:

  • funnel: the data type to insert (usually obtained from the Funnels utility class)
  • expectedInsertions: the number of values expected to be inserted
  • fpp: the acceptable false-positive rate (default 0.03)
  • strategy: the hashing strategy used by the Bloom filter

Generally speaking, the larger the tolerated error rate, the less space and time are required; the smaller the error rate, the more space and time the filter needs.
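To see this trade-off concretely, one can serialize Guava filters built with the same capacity but different fpp values and compare their sizes (a small sketch using BloomFilter.writeTo):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Compare the space cost of filters with the same capacity but different
// false-positive rates: a stricter fpp needs a larger bit array.
public class FppSpaceTradeoff {

    static int serializedSize(double fpp) throws IOException {
        BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), 1_000_000, fpp);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        bf.writeTo(out); // serialized form is dominated by the bit array
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        for (double fpp : new double[]{0.1, 0.03, 0.001}) {
            System.out.printf("fpp=%.3f -> %d bytes%n", fpp, serializedSize(fpp));
        }
    }
}
```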

Tags: Java Redis

Posted by fahrvergnuugen on Tue, 10 May 2022 21:15:55 +0300