Bloom filter is an intelligent data structure that optimizes memory usage. Its primary purpose is determining whether specific data is present in a set. The algorithm may provide a “Firm No” or a “Probably Yes” response, prioritizing storage and query performance over perfect accuracy. This approach makes Bloom filter a valuable tool for managing large data sets while conserving resources.
1. How does the Bloom filter work
1.1 Hash Functions and Bit Array
Bloom filter is an array of bits (0 & 1) where “0” means no elements have been added and “1” means maybe we have added the element. Bloom filter starts with a fixed-size bit array where all the bits are set to zero. The array size depends on some factors discussed in the best practices below.
1.2 Adding Elements to the Bloom Filter
When an element is added to a bloom filter, it will go through a hash function computation. This hash function will return an index of the bit array; after generating the index of the bit array bloom filter will set this indice to “1”.
Let’s say we want to hash the word “abodyify.com”:
hash(“abodyify.com”) % 10 = 3
The hash value return index of “3”. Now bloom filter will set the index “3” to “1”. See the illustration below:
Note: ( % 10 ) The number 10 here means the array’s length. If you have an array of 64 bits, then it is going to be % 64. and so on.
1.3 Checking Element Existence in the Bloom Filter
To find out if we have the elements on the array, all we need to do to repeat the same hash function again.
The hash function return index of “9”. Now bloom filter will check if index “9” is equal to “1”. See the illustration below:
Now let’s search for “abodyify.com”:
hash(“abodyify.com”) % 10 = 3
The hash function return index of “3”. Now the bloom filter will check if index “3” equals “1”. See the illustration below:
Because for “abodyify.com” the bit is equal to “1”, the Bloom filter is telling us that “Maybe” we added “abodyify.com” to the set.
I used the word “Maybe” because the Bloom filter is a probabilistic data structure, and their positive response are not always accurate.
Note: Did you know that a negative response from a bloom filter means we can trust with 100% certainty that the element was never added? However, it’s important to note that a positive response doesn’t guarantee that the element exists, a bloom filter can’t provide a completely accurate positive response.
1.4 Why Bloom filter Positive Response is Not Accurate
The reason why we cannot trust the positive response is due to collision. In out example above we added the word “abodyify.com” to the set and it on index “3”.
Now we want to search for the word “programming”:
hash(“programming”) % 10 = 3
As we can see the hash return index “3”, and the bloom filter will check the bit array and it will find index “3” is set to “1”
Now bloom filter will see index “3” is set to “1” bit. It will tell us maybe “programming” is added, but in reality, we never added the word “programming” and it gets bit “1” because it collides with the word “abodyify.com”
As a result of this collision, the Bloom filter will return a “false positive”
2. Bloom Filter Advantages and Use Cases
2.1 Space Efficient Storage of Large Sets
The main benefit of the Bloom filter is memory efficiency. Other data structures, such as Hash Tables, require memory proportional to the number of stored elements. On the other hand, the Bloom filter requires less memory, even with a large volume of data.
The bloom filter uses a probabilistic method which makes it space-efficient. This is because it allows collisions and stores data in the same bit, resulting in less memory usage. However, this can decrease its accuracy. Therefore, there is a trade-off between using less memory and obtaining accurate results.
2.2 Fast Query Operations
The bloom filter is fast, making it suitable for quick element existence checks. Due to the bitwise operations involved in Bloom filter operations, such queries can be performed with constant time complexity, regardless of the size of the dataset.
2.3 Examples of Bloom Filter Applications
1. Akamai Web Page Cache: Akamai uses a bloom filter to cache web pages that have been requested twice. 75% of web pages are requested once, so it is not worth it to cache, Akamai uses a bloom filter to avoid caching these pages.
2. Internet Cache Protocol: The internet cache protocol (ICP) uses Bloom filters to reduce the amount of web traffic on the internet. A web request will first be checked against the local cache (using a Bloom filter) before being sent to the internet.
3. Medium’s Read Next feature: Medium uses a Bloom filter to track what stories a user has read. When recommending what to read next, Medium will exclude stories that the user has already read. This makes the Read Next feature more useful because it’s always suggesting new content.
4. NoSQL Databases: Some NoSQL databases like Cassandra and HBase use Bloom filters to decrease the disk lookups for non-existent rows or columns. Avoiding disk lookups for non-existent rows or columns saves I/O operations and increases speed.
5. Networking: Bloom filters can also be used in the networking field to manage a list of IP addresses. They can detect whether an IP address is on a blacklist to prevent spam emails.
6. Password Checker: Some websites may use Bloom filters to prevent users from setting commonly used passwords. They have a Bloom filter of the most common passwords, and when users try to set a password, it’s checked against the filter. If the password is in the filter, the users are required to set a different password.
3. Limitations and Trade-offs
3.1 False Positive Rate and Probability of False Positives
Bloom filters have a probability of false positives, where an element not in the filter may mistakenly appear as if it is present—the false positive rate increases as the filter becomes more space-efficient.
3.2 Handling Deletions and Dynamic Sets
Bloom filters are primarily designed for insertion and search queries and do not support direct element deletions. Removing an element from a Bloom filter can lead to the inadvertent removal of other elements due to potential hash collisions. Consequently, they are better suited for static or infrequently modified datasets.
4. Best Practices
4.1 Choosing the Optimal Number of Hash Functions and Bit Array Size
1. Trade-off between False Positive Rate and Memory Usage: The number of hash functions used in a Bloom filter affects the probability of false positives. Multiple hash functions generally reduce the false positive rate but increase computational overhead. Finding the right balance depends on the desired level of accuracy and available memory resources.
As we can see in the image above, adding more hashing will decrease the chances of false positives because we will store the value on three different bits in the filter, which will reduce the collision rate.
2. Calculating Optimal Bit Array Size: The size of the bit array impacts both the false positive rate and memory usage. A larger bit array decreases the false positive rate but requires more memory. Various formulas and rules of thumb can help calculate an optimal bit array size based on the expected number of elements and desired false positive rate.
4.2 Managing False Positive Rates and Desired Precision
1. Understanding the Probability of False Positives: Bloom filters have an inherent probability of false positives due to potential hash collisions. It’s crucial to comprehend and manage this probability, especially in applications where false positives are costly or undesirable.
2. Tuning Parameters for Desired Precision: The false positive rate can be controlled by adjusting parameters such as the size of the bit array, the number of hash functions, and the number of elements being stored. Experimentation and analysis can help fine-tune these parameters to achieve the desired level of precision for specific use cases.
4.3. Handling Deletions and Dynamic Sets
1. Limitations of Deletions: Bloom filters do not support direct deletions of elements once added. Removing an element would require modifying bits in the bit array, potentially impacting other elements. As a result, Bloom filters are more suitable for scenarios where the dataset is static or rarely modified.
2. Dynamic Bloom Filters: Various variations, such as counting Bloom filters or scalable Bloom filters, address the limitation of deletions and support dynamic datasets. These adaptations allow for element removal and resizing of the filter to accommodate changing data requirements.
5. Floom filter implementation with Nodejs and RedisConnect to Redis server
1. Connect to Redis server
We need to install Redis npm package
const Redis = require('redis') const client = Redis.createClient() client.connect()
2. Create a Bloom filter
Use the function reserve to create a new bloom filter.
The function requires three parameters:
- Filter name (in our example, we choose “mybloom”)
- Error rate (in our example, we choose 0.01)
- Array length (in our example, we choose 1000)
Error rate explanation:
- 1% error rate requires 7 hash functions and 10.08 bits per item.
- 0.1% error rate requires 10 hash functions and 14.4 bits per item.
- 0.01% error rate requires 14 hash functions and 20.16 bits per item.
Note: It will throw an error if the bloom filter is already created.
await client.bf.reserve('mybloom', 0.01, 1000)
3. Add Element
await client.bf.add('mybloom', 'abodyify.com')
4. Search Element
It will return true or false value
const a1 = await client.bf.exists('mybloom', 'abodyify.com')