Suppose you are creating an account on Geekbook, you want to enter a cool username, you entered it and received the message "Username is already taken." You added your date of birth along with your username, but still no luck. You have now also added your university list number and received a "Username already taken". It’s really frustrating, isn’t it?
But have you ever wondered how quickly Geekbook checks for username availability by searching millions of registered users. There are many ways to make this work —
- Linear Search : bad idea!
- Binary search : keep all usernames in alphabetical order and compare the entered username with the average in the list. If it matches, then the username is used, otherwise it is determined whether the entered username will come before or after the middle and, if it comes after, ignore all usernames up to the middle (inclusive). Now search after the middle and repeat this process until you get a match or the end of the search without a match. This method is better and more promising, but still requires several steps.
But there must be something better!
Bloom Filter — it’s a data structure that can do the job.
To understand Bloom filters, you need to know what is hashing . The hash function takes input and outputs a unique, fixed-length identifier that is used to identify the input.
What is Bloom Filter?
Bloom Filter — it is a compact probabilistic data structure that is used to test whether an item is a member of a set. For example, a username availability check is given by a membership problem, where the set is a list of all registered usernames. The price we pay for efficiency is that it is probabilistic in nature, which mean s there may be some false positives. A false positive value may mean that the given username is already taken, but in fact it is not.
Interesting Bloom Filters properties
- Unlike a standard hash table, a fixed size Bloom filter can represent a collection of arbitrarily large numbers of elements.
- Adding an element will never fail. However, the false positive rate increases steadily as more items are added, until all bits in the filter are set to 1, at which point all queries are successful.
- Bloom filters never generate a false negative result, i.e. informing you that the username does not exist when it does exist.
- Removing items from a filter is not possible because if we remove one item by clearing bits on the indices generated k hash functions, this can remove several other elements. Example — if we remove the "geeks" (in the example below) by clearing bits at 1, 4, and 7, we can end up removing "nerd" as well, since bit at index 4 becomes 0 and the Bloom filter states that "nerd "Is not present.
Bloom filter operation
Empty Bloom filter — it is a bitmap of m bits, all set to zero, for example:
We need k number of hash functions to compute hashes for this input. When we want to add an element to the filter, the bits at k indices h1 (x), h2 (x), ... hk (x) are set, where the indices are calculated using hash functions.
Example. Suppose we want to introduce geeks into the filter, we use 3 hash functions and a bitmap of length 10, all of which are initially set to 0. First, we compute the hashes as follows:
h1 (“geeks” )% 10 = 1 h2 (“geeks”)% 10 = 4 h3 (“geeks”)% 10 = 7
Note: These results are random for explanation purposes only.
Now we will set the bits on indices 1, 4 and 7 to 1
We want to enter "nerd" again, we will also compute hashes
h1 (“nerd”)% 10 = 3 h2 (“nerd”)% 10 = 5 h3 (“nerd” )% 10 = 4
Set bits in indices 3, 5 and 4 to 1
Now if we want to check if there are geeks in the filter or not. We’ll do the same process, but this time in reverse. We calculate the appropriate hashes using h1, h2 and h3 and check if all of these indices are set to 1 in the bitmap. If all the bits are set, then we can say that the geeks are probably present . If any of the bits of these indices are 0, then there are definitely no geeks .
False positive in Bloom filters
The question is, why did we say “probably present” , why this uncertainty. Let’s take a look at this with an example. Suppose we want to check if a "cat" is present or not. We will compute hashes using h1, h2 and h3
h1 (“cat”)% 10 = 1 h2 (“cat”)% 10 = 3 h3 (“cat”)% 10 = 7
If we check the bitmap, the bits of these indices will be set to 1, but we know that "cat" was never added to the filter. Bit at indices 1 and 7 was set when we added geeks, and bit 3 was set, we added nerd.
Thus, since the bits in the computed indices have already been set by some other element, the Bloom filter mistakenly asserts that there is a "cat" and generates a false positive result. Depending on the application, this can be a huge disadvantage or relatively good.
We can control the likelihood of a false positive by controlling the size of the Bloom filter. More space mean s less false positives. If we want to reduce the likelihood of a false positive, we have to use more hash functions and a larger bit array. This would add delay in addition to clause and membership checking.
False positive probability: Let m be the size of the bitmap, k be the number of hash functions and n will be the number of expected items to be inserted into the filter, then the probability of a false positive p can be calculated as:
Bitmap size: if the expected number of n elements is known and the desired false positive probability the result is p, then the size of the bitmap m can be calculated as:
Optimal number of hash functions: The number of hash functions k must be a positive integer. If m — this is the size of the bitmap, and n — the number of elements to insert, then k can be calculated as:
Space efficiency
If we want to store a large list of elements in a set for the purposes of set membership, we can store it in HashMap , linked list . All of these methods require storing the element itself, which is not very memory efficient. For example, if we want to store geeks in a hashmap, we must store the actual geeks string as a key-value pair {some_key: "geeks"}.
Bloom Filters do not store an item at all. As we have seen, they use a bitmap that allows hash collisions. This would not be compact without hash collision.
Choosing a hash function
Hash function used in filters Bloom must be independent and evenly distributed. They should be as fast as possible. Fast simple non-cryptographic hashes that are reasonably independent include murmur , a number of FNV hash functions, and Jenkins .
Hash generation is the main operation in Bloom filters. Cryptographic hash functions provide stability and assurance, but are expensive to calculate. As the number of hash functions k increases, the Bloom filter becomes slow. Although non-cryptographic hash functions do not guarantee, but provide significant performance improvements.
Basic implementation of the Bloom Filter class in Python3. Save as bloomfilter.py
|
Let’s check out the Bloom filter. Save this file as bloom_test.py
|
Exit
Size of bit array: 124 False positive Probability: 0.05 Number of hash functions: 4 ’war’ is definitely not present! ’gloomy’ is definitely not present! ’humanity’ is definitely not present! ’abundant’ is probably present! ’bloom’ is probably present! ’coherent’ is probably present! ’cohesive’ is probably present! ’bluff’ is definitely not present! ’bolster’ is probably present! ’hate’ is definitely not present! ’racism’ is definitely not present! ’bonus’ is probably present! ’abounds’ is probably present! ’genial’ is probably present! ’pythonengineering’ is definitely not present! ’nuke’ is definitely not present! ’hurt’ is definitely not present! ’twitter’ is a false positive! ’ cheater’ is definitely not present! ’generosity’ is probably present! ’facebook’ is definitely not present! ’abundance’ is probably present!
Applying Bloom Filters
- Medium uses Bloom Filters to recommend posts to users by filtering posts that the user has viewed.
- Quora implemented a generic Bloom filter in the backend channel to fil