When Redis is used, it is easy to encounter problems such as massive data requiring duplicate checking and how to avoid cache penetration. In fact, these problems can be realized through BloomFilter bloom filter
Bloom Filter concept
Bloom filter is a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element exists in a collection. The main advantage is that the spatial efficiency and query time are far more than general algorithms. The disadvantage is that it has a certain false recognition rate and is difficult to delete.
Bloom Filter principle
The basic idea of Bloom filter is: when an element is added to the set, map the element into K points in a bit group through K hash functions, and set them to 1 When searching, we only need to check whether these points are 1 to know whether there is it in the set; If there is any 0, the detected element must not exist; If they are all 1, there is a high probability.
Its difference from the single hash function bit map is that it uses K hash functions, and each string corresponds to K bits, which greatly reduces the probability of conflict.
shortcoming
Because Bloom Filter has high efficiency in time and space, it abandons the accuracy of judgment and the convenience of deletion

There is a misjudgment. The queried element does not exist in the container, but it happens to be 1 in the k positions after the hash, resulting in a misjudgment.

 Solution: create a list to store elements that may be misjudged

Difficulty in deleting: an element placed in the container is mapped to 1 in k positions of the bit array. When deleting, it cannot be simply set to 0, which may affect the judgment of other elements
Bloom Filter implementation
Bloom Filter has many implementations and optimizations. Here we introduce a Bloom Filter provided by Guava.
When using Bloom filter, the two points that cannot be bypassed are the estimated data volume n and the expected misjudgment rate fpp
In the process of implementation, it is the selection of hash function and the size of bit array.
For a certain scenario, we need to estimate the amount of data to be stored as n and the expected misjudgment rate as fpp. Then we need to calculate the size m of the Bit array we need and the number k of hash functions, and select the hash function
Bit array size selection
According to the estimated data amount n and the misjudgment rate fpp, the size of the bit array is:
Hash function selection
According to the estimated data quantity n and the length m of the bit array, the number k of a hash function can be obtained:
realization
First, you need to introduce the guave package
<dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>23.0</version> </dependency>
The following is the core code we tested, which is mainly divided into two steps:
 Insert one million numbers into the filter, and then verify whether the one million numbers can pass the filter
 Then look for another 10000 numbers to test the number of fish that have escaped the net
public class TestBloomFilter { private static int total = 1000000; private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total); // private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total, 0.001); public static void main(String[] args) { // Initialize 1000000 pieces of data into the filter for (int i = 0; i < total; i++) { bf.put(i); } // Match the values already in the filter. Is there anything that doesn't match for (int i = 0; i < total; i++) { if (!bf.mightContain(i)) { System.out.println("Some escaped"); } } // How many of the 10000 values that do not match in the filter int count = 0; for (int i = total; i < total + 10000; i++) { if (bf.mightContain(i)) { count++; } } System.out.println("Number of accidental injuries:" + count); } }
result:
Through the running results, we can know that when traversing the one million numbers in the filter, they are recognized, while among the 10000 numbers not in the filter, 320 are recognized as in the filter. The misjudgment rate is three percentage points.
Bloom filter create function
Through the above example, we use the Create method to create filters. Let's take a look at the source code of create:
@VisibleForTesting static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions, double fpp, BloomFilter.Strategy strategy) { Preconditions.checkNotNull(funnel); Preconditions.checkArgument(expectedInsertions >= 0L, "Expected insertions (%s) must be >= 0", expectedInsertions); Preconditions.checkArgument(fpp > 0.0D, "False positive probability (%s) must be > 0.0", fpp); Preconditions.checkArgument(fpp < 1.0D, "False positive probability (%s) must be < 1.0", fpp); Preconditions.checkNotNull(strategy); if (expectedInsertions == 0L) { expectedInsertions = 1L; } long numBits = optimalNumOfBits(expectedInsertions, fpp); int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits); try { return new BloomFilter(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy); } catch (IllegalArgumentException var10) { throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", var10); } }
Although the four create methods are overloaded, they are the fourth one in the end. The meaning of each parameter is as follows:
 funnel: data type (usually calling Funnels tool class)
 Expected inserts: the number of values expected to be inserted
 fpp: error rate (default 0.03)
 strategy: application of hash algorithm Bloom Filter
Generally speaking, the larger the error rate, the smaller the space and time required, and the smaller the error rate, the larger the space and time required.