Introduction to Bloom Filters: Achieving Efficiency and Accuracy
Imagine you’re running a library and need a quick way to check if a book is available. You could use a hash table, where each book’s title or ISBN acts like a magic key for instant lookups. Sounds great, right? But what if your library is massive, and memory is tight?
Enter the Bloom filter — a clever, space-saving trick! Instead of storing the entire catalog, it uses a bit array and hash functions to tell you if a book might be there.

- A Bloom filter starts as an empty bit array with m bits, all set to 0. To use it, we need k distinct hash functions.
- Each hash function generates k hash values corresponding to specific positions in an m-bit array, where the values determine which bits should be set to 1 when adding an element.
- This process ensures that the element’s presence is encoded across multiple positions in the array.
Adding elements to Bloom Filter
Let’s work through the process
- We have a bloom filter with a bit array of size m = 10. We also have two hash functions H1(X) and H2(X).
- Now, lets say we want to add a book to the bloom filter. We hash the book using both the hash functions.
H1(book) % 10 = position1
H2(book) % 10 = position2 - These two positions are set to 1 in the bloom filter’s bit array.

A Bloom filter allows false positives but never false negatives, and here’s why!
False Positive
A Bloom filter might indicate that an element is present when it’s not. This happens because multiple elements can set the same bits to 1, causing a collision. This overlap can make the filter mistakenly say an element is present when those bits were set by other elements.

Let’s consider checking if “the secret” is in the bloom filter
We hash with H1 and H2:
H1(“the secret”) % 10 = 3
H2(“the secret”) % 10 = 5
Although “the secret” was never added to the bloom filter, it incorrectly appears to be present because the bits at position 3 and 5 were set to 1 when adding two other books. This happens because multiple elements can set the same bits in the array, causing collisions.
Never False Negative
When you add an element, all the bits chosen by the hash functions are set to 1. When checking for that element later, if even one of those bits is 0, the filter knows the element was never added.

Lets consider checking if “the alchemist” is in the bloom filter
We hash with H1 and H2:
H1(“the alchemist”) % 10 = 3
H2(“the alchemist”) % 10 = 6
The bit at position 3 is 1 (set during an earlier addition).
The bit at position 6 is 0 (indicating it was never set).
Since one of the bits is 0, the Bloom filter confirms that “the alchemist” is definitely not present in the set. This ensures no false negatives, and we can be certain that the book hasn’t been added.
When creating a Bloom filter, several factors must be considered to ensure it performs efficiently and meets your requirements. Here’s what you should keep in mind:
- The size of the Bloom filter and the number of hash functions depend on how many elements you plan to store.
- Set an acceptable false positive rate (e.g., 1% or 0.1%) based on your application. This will influence the filter size and the number of hash functions.
- The number of hash functions impacts the trade-off between false positives and performance. Using too few hash functions leads to high false positive rates, while too many waste computational resources.
- Hash functions should be uniformly distributed and independent. Complex hash functions reduce false positives but increase computational cost.
- For applications with uncertain data sizes, consider using scalable Bloom filters that adjust their size dynamically.
- Standard Bloom filters do not support deletion. If deletion is required, use counting Bloom filters, which store counters instead of bits.
What's an Optimal bloom filter size?
The size of a Bloom filter (m, the number of bits in the bit array) depends on three key factors:
- Expected Number of Elements (n): The number of items you expect to store in the Bloom filter.
- Desired False Positive Rate (p): The acceptable probability of false positives (e.g., p=0.01 for 1%).


The number of hash functions (k), which impacts the trade-off between false positives and performance can be calculated.

Larger m: Reduces false positive rate but increases memory usage.
Larger k: Reduces false positives to a point but increases computational cost.
Applications
- Bloom filters are used in databases like Google Bigtable, Apache HBase, Apache Cassandra and PostgreSQL to minimize disk lookups for non-existent rows or columns. By avoiding unnecessary disk access, they significantly enhance the performance of query operations.
- Apache Spark uses Bloom filters to speed up data retrieval by excluding irrelevant partitions, optimizing read and write operations in large datasets.
- Netflix likely uses Bloom filters to efficiently check if a user has watched a movie or show, preventing repeated recommendations.
- The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
To wrap it up, Bloom filters are like a shortcut for managing huge amounts of data! They let you quickly check if something’s already been seen — whether it’s making sure a webpage is cached, filtering out spam emails, or sorting Bitcoin transactions. They’re an awesome, space-saving tool that boosts speed and efficiency in big data systems.
Bloom filters have a lot of advanced concepts and can be complex to create, but this was a simple introduction to help you get started.
Here is a simple bloom filter implementation in python.
Thanks for reading! Happy coding!