Suppose you are creating an account on Geekbook, you want to enter a cool username, you entered it and received the message “Username is already taken.” You added your date of birth along with your username, but still no luck. You have now also added your university list number and received a “Username already taken”. It`s really frustrating, isn`t it?
But have you ever wondered how quickly Geekbook checks for username availability by searching millions of registered users. There are many ways to make this work —
Bloom Filter — it`s a data structure that can do the job.
To understand Bloom filters, you need to know what is hashing . The hash function takes input and outputs a unique, fixedlength identifier that is used to identify the input.
What is Bloom Filter?
Bloom Filter — it is a compact probabilistic data structure that is used to test whether an item is a member of a set. For example, a username availability check is given by a membership problem, where the set is a list of all registered usernames. The price we pay for efficiency is that it is probabilistic in nature, which means there may be some false positives. A false positive value may mean that the given username is already taken, but in fact it is not.
Interesting Bloom Filters properties
Bloom filter operation
Empty Bloom filter — it is a bitmap of m bits, all set to zero, for example:
We need k number of hash functions to compute hashes for this input. When we want to add an element to the filter, the bits at k indices h1 (x), h2 (x), … hk (x) are set, where the indices are calculated using hash functions.
Example. Suppose we want to introduce geeks into the filter, we use 3 hash functions and a bitmap of length 10, all of which are initially set to 0. First, we compute the hashes as follows:
h1 (“geeks” )% 10 = 1 h2 (“geeks”)% 10 = 4 h3 (“geeks”)% 10 = 7
Note: These results are random for explanation purposes only.
Now we will set the bits on indices 1, 4 and 7 to 1
We want to enter “nerd” again, we will also compute hashes
h1 (“nerd”)% 10 = 3 h2 (“nerd”)% 10 = 5 h3 (“nerd” )% 10 = 4
Set bits in indices 3, 5 and 4 to 1
Now if we want to check if there are geeks in the filter or not. We`ll do the same process, but this time in reverse. We calculate the appropriate hashes using h1, h2 and h3 and check if all of these indices are set to 1 in the bitmap. If all the bits are set, then we can say that the geeks are probably present . If any of the bits of these indices are 0, then there are definitely no geeks .
False positive in Bloom filters
The question is, why did we say “probably present” , why this uncertainty. Let`s take a look at this with an example. Suppose we want to check if a “cat” is present or not. We will compute hashes using h1, h2 and h3
h1 (“cat”)% 10 = 1 h2 (“cat”)% 10 = 3 h3 (“cat”)% 10 = 7If we check the bitmap, the bits of these indices will be set to 1, but we know that "cat" was never added to the filter. Bit at indices 1 and 7 was set when we added geeks, and bit 3 was set, we added nerd.
Thus, since the bits in the computed indices have already been set by some other element, the Bloom filter mistakenly asserts that there is a "cat" and generates a false positive result. Depending on the application, this can be a huge disadvantage or relatively good.
We can control the likelihood of a false positive by controlling the size of the Bloom filter. More space means less false positives. If we want to reduce the likelihood of a false positive, we have to use more hash functions and a larger bit array. This would add delay in addition to clause and membership checking.
False positive probability: Let m be the size of the bitmap, k be the number of hash functions and n will be the number of expected items to be inserted into the filter, then the probability of a false positive p can be calculated as:
Bitmap size: if the expected number of n elements is known and the desired false positive probability the result is p, then the size of the bitmap m can be calculated as:
Optimal number of hash functions: The number of hash functions k must be a positive integer. If m — this is the size of the bitmap, and n — the number of elements to insert, then k can be calculated as:
Space efficiency
If we want to store a large list of elements in a set for the purposes of set membership, we can store it in HashMap , linked list . All of these methods require storing the element itself, which is not very memory efficient. For example, if we want to store geeks in a hashmap, we must store the actual geeks string as a keyvalue pair {some_key: "geeks"}.
Bloom Filters do not store an item at all. As we have seen, they use a bitmap that allows hash collisions. This would not be compact without hash collision.Choosing a hash function
Hash function used in filters Bloom must be independent and evenly distributed. They should be as fast as possible. Fast simple noncryptographic hashes that are reasonably independent include murmur , a number of FNV hash functions, and Jenkins .
Hash generation is the main operation in Bloom filters. Cryptographic hash functions provide stability and assurance, but are expensive to calculate. As the number of hash functions k increases, the Bloom filter becomes slow. Although noncryptographic hash functions do not guarantee, but provide significant performance improvements.Basic implementation of the Bloom Filter class in Python3. Save as bloomfilter.py
# Python 3 program for building Bloom Filter
# First install mmh3 and the third party bitarray module
# pip install mmh3
# pip install bitarray
import
math
import
mmh3
from
bitarray
import
bitarray
class
BloomFilter (
object
):
" ""
Bloom filter class using murmur3 hash function
"" "
def
__ init __ (
self
, items_count, fp_prob):
"" "
items_count: int
The number of items that supposed to be saved in Bloom filter
fp_prob: float
False n Decimal positive probability
"" "
# False probability in decimal
self
. fp_prob
=
fp_prob
# Bitmap size to use
self
. size
=
self
. get_size (items_count, fp_prob )
# number of hash functions to use
self
. hash_count
=
self
. get_hash_count (
self
.size, items_count)
# Bitmap of specified size
self
. bit_array
=
bitarray (
self
. size)
# initialize all bits as 0
self
. bit_array.setall (
0
)
def
add (
self
, item):
"" "
Add element to filter
"" "
digests
=
[]
for
i
in
range
(
self
. hash_count):
# create a digest for this element.
# I`m working as a seed for the mmh3.hash () function
# With a different seed, the generated digest is different
digest
=
mmh3.
hash
(item, i)
%
self
. size
digests.append (digest)
# set the bit_array to True
self
. bit_array [digest]
=
True
def
check (
self
, item):
" ""
Check element in the filter
"" "
for
i
in
range
(
self
. hash_count):
digest
=
mmh3.
hash
(item, i)
%
self
. size
if
self
. bit_array [digest]
=
=
False
:
# if the bit is False, it no
# in filter
# otherwise it is likely to exist
return
False
return
True
@ classmethod
def
get_size (
self
, n, p ):
"" "
Return the bitmap size (m) to use with
following formula
m =  ( n * lg (p)) / (lg (2) ^ 2)
n: int
the number of items expected to keep ь in the filter
p: float
False positive decimal probability
"" "
m
=

(n
*
math.log (p))
/
(math.log (
2
)
*
*
2
)
return
int
(m)
@ classmethod
def
get_hash_count (
self
, m, n):
"" "
Return the hash function (k) to use using
following formula
k = (m / n) * LG ( 2)
m: int
bitmap size
n: int
the number of items to be saved in the filter
"" "
k
=
(m
/
n)
*
math.log (
2
)
return
int
(k)
Let`s check out the Bloom filter. Save this file as bloom_test.py
from
bloomfilter
import
BloomFilter
from
random
import
shuffle
n
=
20
# no items to add
p
=
0.05
# false probability
bloomf
=
BloomFilter (n, p)
(
" Size of bit array: {} "
.
format
(bloomf.size))
(
"False positive Probability: {}"
.
format
(bloomf.fp_prob))
(
"Number of hash functions: {}"
.
format
(bloomf.hash_count))
# words to be added
word_present
=
[
`abound`
,
`abounds`
,
`abundance`
,
`abundant`
,
` accessable`
,
`bloom`
,
`blossom`
,
` bolster`
,
`bonny`
,
` bonus`
,
`bonuses`
,
` coherent`
,
`cohesive`
,
`colorful`
,
` comely`
,
`comfort `
,
` gems`
,
`generosity`
,
` generous`
,
`generously`
,
`genial`
]
# no word added
word_absent
=
[
`bluff`
,
`cheater`
,
` hate`
,
`war`
,
` humanity`
,
` racism`
,
`hurt`
,
`nuke`
,
` gloomy`
,
` facebook`
,
`pythonengineering`
,
` twitter`
]
for
item
in
word_present:
bloomf.add (item)
shuffle (word_present)
shuffle (word_absent)
test_words
=
word_present [:
10
]
+
word_absent
shuffle (test_words)
for
word
in
test_words:
if
bloomf.check (word):
if
word
in
word_absent:
(
"` {} `is a false positive!"
.
format
(word) )
else
:
(
" `{}` is probably present! "
.
format
(word))
else
:
(
"` {} `is definitely not present!"
.
format
(word))
Exit
Size of bit array: 124 False positive Probability: 0.05 Number of hash functions: 4 `war` is definitely not present! `gloomy` is definitely not present! `humanity` is definitely not present! `abundant` is probably present! `bloom` is probably present! `coherent` is probably present! `cohesive` is probably present! `bluff` is definitely not present! `bolster` is probably present! `hate` is definitely not present! `racism` is definitely not present! `bonus` is probably present! `abounds` is probably present! `genial` is probably present! `pythonengineering` is definitely not present! `nuke` is definitely not present! `hurt` is definitely not present! `twitter` is a false positive! ` cheater` is definitely not present! `generosity` is probably present! `facebook` is definitely not present! `abundance` is probably present!
Applying Bloom Filters