Bloom Filters — Python introduction and implementation

Michael Zippo 18.07.2021

Suppose you are creating an account on Geekbook and want a cool username. You enter it and get the message "Username is already taken." You add your date of birth to the username, but still no luck. You then add your university roll number as well and again get "Username is already taken." It's really frustrating, isn't it?
But have you ever wondered how Geekbook checks username availability so quickly while searching millions of registered users? There are many ways to do this:

Linear search: bad idea!
Binary search: store all usernames in alphabetical order and compare the entered username with the middle one in the list. If it matches, the username is taken. Otherwise, determine whether the entered username comes before or after the middle one; if it comes after, ignore all usernames up to and including the middle one. Search the remaining half and repeat until you get a match or the search ends without one. This method is better and more promising, but it still requires several steps.
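The binary search described above can be sketched in Python with the standard bisect module. The function name is_taken and the sample usernames are made up for illustration and are not part of the original article.

```python
from bisect import bisect_left

def is_taken(sorted_usernames, name):
    """Return True if name occurs in the alphabetically sorted list."""
    # bisect_left performs the repeated halving and returns the position
    # where name would be inserted to keep the list sorted.
    i = bisect_left(sorted_usernames, name)
    return i < len(sorted_usernames) and sorted_usernames[i] == name

# Sample data (assumed for illustration only).
usernames = sorted(["alice", "bob", "geeks", "nerd"])

print(is_taken(usernames, "geeks"))  # True
print(is_taken(usernames, "zippo"))  # False
```

Each lookup takes about log2(n) comparisons, so millions of usernames still need roughly 20 or more steps per check, and the whole sorted list has to be kept in memory.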
But there must be something better!

A Bloom filter is a data structure that can do the job.

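To understand Bloom filters, you need to know what hashing is. A hash function takes an input and outputs a fixed-length identifier that is used to identify the input. The same input always produces the same identifier, but the identifier is not guaranteed to be unique: two different inputs can produce the same one (a collision).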
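As a rough illustration of hashing, the sketch below uses SHA-256 from Python's hashlib. The choice of SHA-256 and the helper name hash_to_index are assumptions made for this example; the article does not say which hash functions to use.

```python
import hashlib

def hash_to_index(item, m):
    """Map a string to an index in the range 0..m-1 (hypothetical helper)."""
    # SHA-256 returns a 32-byte digest regardless of the input length.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % m

short_digest = hashlib.sha256(b"a").digest()
long_digest = hashlib.sha256(b"a much longer input string").digest()
print(len(short_digest), len(long_digest))  # 32 32

# The same input always maps to the same index.
print(hash_to_index("geeks", 10) == hash_to_index("geeks", 10))  # True
```

Because the result is reduced modulo m, many different inputs necessarily share an index, which is exactly why a Bloom filter can report false positives.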

What is a Bloom filter?

A Bloom filter is a compact probabilistic data structure used to test whether an item is a member of a set. For example, checking username availability is a set membership problem, where the set is the list of all registered usernames. The price we pay for efficiency is that the structure is probabilistic, which means there may be some false positives. A false positive means the filter reports that a given username is already taken when in fact it is not.

Interesting properties of Bloom filters

Unlike a standard hash table, a Bloom filter of fixed size can represent a set with an arbitrarily large number of elements.
Adding an element never fails. However, the false positive rate increases steadily as more elements are added, until all bits in the filter are set to 1, at which point every query returns a positive result.
Bloom filters never generate a false negative, i.e. they never tell you that a username does not exist when it actually does.
Removing elements from the filter is not possible, because clearing the bits at the indices generated by the k hash functions for one element can also remove other elements. Example: if we remove "geeks" (in the example below) by clearing the bits at indices 1, 4 and 7, we can end up removing "nerd" as well, since the bit at index 4 becomes 0 and the Bloom filter then reports that "nerd" is not present.
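The removal problem can be reproduced with a few lines of Python. The indices 1, 4 and 7 for "geeks" come from the article; the text only says that "nerd" uses index 4, so the other two indices for "nerd" (3 and 5) are assumed here for illustration.

```python
m = 10
bits = [0] * m

geeks_idx = [1, 4, 7]  # from the article's example
nerd_idx = [3, 4, 5]   # assumed: only the shared index 4 is given in the text

def add(indices):
    for i in indices:
        bits[i] = 1

def might_contain(indices):
    # An item is reported present only if all of its bits are set.
    return all(bits[i] for i in indices)

add(geeks_idx)
add(nerd_idx)
before = might_contain(nerd_idx)

# Naive "removal" of geeks: clear its three bits.
for i in geeks_idx:
    bits[i] = 0
after = might_contain(nerd_idx)

print(before, after)  # True False
```

After the naive removal the filter gives a false negative for "nerd", which breaks the guarantee stated in the previous property.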

Bloom filter operation

An empty Bloom filter is a bit array of m bits, all set to zero.

We need k hash functions to compute the hashes for a given input. When we want to add an element x to the filter, the bits at the k indices h1(x), h2(x), ... hk(x) are set to 1, where the indices are calculated using the hash functions.
Example: suppose we want to add "geeks" to the filter, using 3 hash functions and a bit array of length 10, with all bits initially set to 0. First, we compute the hashes as follows:
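A minimal sketch of this add operation in Python might look as follows. The class name, the method names and the way the k hash functions are derived (SHA-256 salted with the function number) are assumptions made for this example, not the article's implementation. The lookup method follows the usual definition: an element is reported as possibly present only if all k of its bits are set.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # empty filter: all bits are zero

    def _indices(self, item):
        # Assumption: simulate k hash functions by salting SHA-256
        # with the function number 0..k-1.
        for seed in range(self.k):
            data = f"{seed}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter(m=1000, k=3)
bf.add("geeks")
print(bf.might_contain("geeks"))  # True
```

A list of integers is used for readability; a real implementation would more likely pack the bits into a bytearray or a dedicated bit array type to save memory.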

h1("geeks") % 10 = 1
h2("geeks") % 10 = 4
h3("geeks") % 10 = 7

Note: these outputs are arbitrary values chosen for explanation only.
Now we set the bits at indices 1, 4 and 7 to 1.
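The resulting state of the filter can be written out directly. This small sketch hard-codes the three illustrative indices from the example instead of calling real hash functions.

```python
m = 10
bit_array = [0] * m  # empty filter of length 10

# Indices taken from the illustrative hash values for "geeks".
for index in (1, 4, 7):
    bit_array[index] = 1

print(bit_array)  # [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
```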
