# Building a high-speed Bloom Filter with Upstash Redis

> Build a Bloom filter on Upstash Redis in ~50 lines of TypeScript using BITFIELD and the Upstash Redis SDK. Sizing math, working code examples, and a measured 1.45% false-positive rate.

TL;DR This article walks through building a Bloom filter on Upstash Redis in about 50 lines of TypeScript. The filter uses `BITFIELD` for both writes and reads, talks to Upstash through the `@upstash/redis` client, and fits a million items at 1% accuracy in roughly 1.2 MB, which is 30 to 100 times less memory than the same data in a normal Redis set. Each add or check takes one network round trip.

## Key takeaways

- A Bloom filter is the right tool when you want to ask "have I seen this before?" cheaply and you can tolerate the occasional false "yes."
- One million items at 1% accuracy fits in about 1.2 MB. The same data in a normal Redis set of 40-byte emails takes about 80 MB.
- A 1% accuracy filter needs about 9.6 bits per item. A 0.01% one needs about 19.2 bits per item. Item length does not matter.
- Adding and checking each take one network round trip when you batch the bit operations into a single `BITFIELD` command.
- A Bloom filter will never tell you "no" when the answer is "yes." It cannot remove items, and it can occasionally answer "yes" when the real answer is "no."

## What is a Bloom filter?

A Bloom filter is a small data structure that answers one question very efficiently: have I seen this thing before? It is allowed to be wrong sometimes, but only in one direction. If it says "no, I haven't seen it," that answer is always correct. If it says "yes, I think I've seen it," it might be lying about 1% of the time (or whatever accuracy you pick).

Think of it as a long row of light switches, all off to start. Every time you add a new item (say, an email address), you run that email through a few hash functions, and each hash tells you a position in the row. You flip every one of those switches on. To check whether you've seen an email before, you compute the same positions and look at the switches. If any of them are still off, you've definitely never added that email. If they're all on, you've probably added it, though there is a small chance some unrelated emails happened to flip those exact switches.

That tiny chance of a false "yes" is the only price we pay, and we get to pick how small it is up front. Here is what the standard sizing math (from the [RedisBloom `BF.RESERVE` reference](https://redis.io/docs/latest/commands/bf.reserve/)) costs us:

| Accuracy target | Bits per item | Number of hash functions |
| --- | --- | --- |
| 1 in 10 wrong | 4.8 | 3 |
| 1 in 100 wrong | 9.6 | 7 |
| 1 in 1,000 wrong | 14.4 | 10 |
| 1 in 10,000 wrong | 19.2 | 13 |

Notice that the cost is per item, not per byte. A filter for one million 64-byte SHA-256 hashes uses the same memory as one for one million short integers. That is the whole point: a Bloom filter compresses any kind of input down to a constant number of bits.

## Which Redis commands do we need?

Upstash supports the full set of bit-level Redis commands: `SETBIT` and `GETBIT` for one bit at a time, `BITCOUNT` for counting set bits, and `BITFIELD` for reading or writing several bits in a single round trip. The bitfield is the most important for this build, because the whole point of a Bloom filter is to touch several bits per operation.

A quick note on what you would do if you were running [Redis Stack](https://redis.io/docs/latest/operate/oss_and_stack/) or another distribution with the RedisBloom module: there are dedicated commands like `BF.RESERVE`, `BF.ADD`, and `BF.EXISTS` that wrap all this for you. Upstash does not ship that module, so we use `BITFIELD` directly. We give up auto-resizing (which slows queries down anyway as the filter grows) and a built-in hash function (a few lines of code), and in return we get a filter that runs on any Redis-compatible store and behaves exactly the way we sized it.

## How do we size a Bloom filter?

We give the filter two numbers up front: how many items we expect to add, and how often we're willing to be wrong. From those two numbers we derive the size of the bit array and the number of hash functions. Two helper functions:

```tsx id="code-cow6xlhy"
export function optimalBits(n: number, p: number): number {
  return Math.ceil((-n * Math.log(p)) / Math.LN2 ** 2);
}

export function optimalHashes(m: number, n: number): number {
  return Math.max(1, Math.round((m / n) * Math.LN2));
}
```

Running this for 100,000 items would mean:

```tsx id="code-2aq6ldvr"
p=0.01    m=958506 bits   k=7   bits/item=9.585
p=0.001   m=1437759 bits  k=10  bits/item=14.378
p=0.0001  m=1917012 bits  k=13  bits/item=19.170
```

A filter for one million items at 1% accuracy is about 1.2 MB and lives in a single Redis string. Upstash limits any single string to 100 MB on Free and Pay-as-you-go ([source](https://upstash.com/docs/redis/troubleshooting/max_record_size_exceeded)), which gives us space for roughly 83 million items in one filter before we'd need to shard across keys.

## How do we add an item?

Adding an item means flipping seven (or however many) bits at once in a single Redis string. There are two ways to do this: send seven separate `SETBIT` commands in seven network round trips, or send one `BITFIELD` command with seven sub-operations (which is a lot more efficient).

There's also a shortcut we can use for the hash functions. Instead of running seven different hash algorithms over the input, we compute one 128-bit hash, split it into two halves, and combine them to generate the seven positions. This is the standard [double-hashing trick from Kirsch and Mitzenmacher](https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf) and gives almost the same accuracy as seven independent hashes at a fraction of the cost.

```tsx id="code-jfsnanl5"
import { Redis } from "@upstash/redis";

function fnv1a64(bytes: Uint8Array, offset: bigint): bigint {
  const PRIME = 1099511628211n;
  const MASK = 0xffffffffffffffffn;
  let h = offset;
  for (let i = 0; i < bytes.length; i++) {
    h ^= BigInt(bytes[i]);
    h = (h * PRIME) & MASK;
  }
  return h;
}

const enc = new TextEncoder();
const A = 0xcbf29ce484222325n;
const B = 0x84222325cbf29ce4n;

function bitPositions(item: string, m: number, k: number): number[] {
  const bytes = enc.encode(item);
  const h1 = fnv1a64(bytes, A);
  const h2 = fnv1a64(bytes, B);
  const mBig = BigInt(m);
  const out: number[] = [];
  for (let i = 0; i < k; i++) {
    out.push(Number((h1 + BigInt(i) * h2) % mBig));
  }
  return out;
}

export async function bloomAdd(
  redis: Redis,
  key: string,
  bits: number,
  hashes: number,
  item: string,
): Promise<boolean> {
  const positions = bitPositions(item, bits, hashes);
  let cmd = redis.bitfield(key);
  for (const pos of positions) cmd = cmd.set("u1", pos, 1);
  const old = await cmd.exec(); // returns the old value of each bit
  return old.some((b) => b === 0); // true if any bit flipped
}
```

There's a small trick built into the return value. When we ask Redis to set a bit, it tells us what that bit was *before* we flipped it. If even one of the seven bits was zero before the call, the item was almost certainly new (we just set a fresh bit). If every bit was already one, the item might be a duplicate, or it might just be unlucky.

One thing to watch out for with the `@upstash/redis` client: by default it batches commands together to save round trips, but that batching does not play nicely with the chainable `bitfield()` builder. The fix is to disable the auto-batcher when we create the client:

```tsx id="code-jblbznoj"
const redis = new Redis({
  url: process.env.UPSTASH_REDIS_REST_URL!,
  token: process.env.UPSTASH_REDIS_REST_TOKEN!,
  enableAutoPipelining: false,
});
```

## How do we check if an item is in the filter?

Checking if an item is present works the same way as adding, except we read the bits instead of writing them. We compute the same bit positions and use `BITFIELD` to read them all in one go. If every bit is 1, the item is probably in the filter. If even one bit is 0, the item is definitely not there.

```tsx id="code-j85ngn78"
export async function bloomExists(
  redis: Redis,
  key: string,
  bits: number,
  hashes: number,
  item: string,
): Promise<boolean> {
  const positions = bitPositions(item, bits, hashes);
  let cmd = redis.bitfield(key);
  for (const pos of positions) cmd = cmd.get("u1", pos);
  const result = await cmd.exec();
  return result.every((b) => b === 1);
}
```

To check that this actually behaves like a real Bloom filter, I ran it against a live Upstash Redis database: insert 5,000 distinct items into a filter sized for 5,000 at 1% accuracy, then check 2,000 different items that were never added. The filter answered "yes" 29 times, for a measured false-positive rate of 1.45%. That is slightly above the 1% target, which is expected when we derive the hashes from a single FNV-1a value instead of running seven independent hash functions. Swapping FNV for [xxhash3](https://github.com/Cyan4973/xxHash) brings the rate closer to the theoretical bound.

The filter also produced zero wrong "no" answers, which is the one guarantee Bloom filters always make. Once a bit is set during an add, it stays set for the lifetime of the filter, so any item we added really did set its bits to 1 and will keep returning "yes" when we ask.

## How do you count how many items are in the filter?

With our custom filter we don't store an exact count of items added, but we can estimate it from the fraction of bits that are set. Redis counts set bits with `BITCOUNT`, and a known formula (the Swamidass-Baldi estimator) converts that count back into an item-count estimate.

```tsx id="code-hjj6mrlv"
export async function bloomCardinality(
  redis: Redis,
  key: string,
  bits: number,
  hashes: number,
): Promise<number> {
  const setBits = await redis.bitcount(key, 0, -1);
  if (setBits === 0) return 0;
  if (setBits >= bits) return Infinity;
  return Math.round(-(bits / hashes) * Math.log(1 - setBits / bits));
}
```

The estimate is good while the filter is lightly to moderately filled. Past about 80% of bits set, the math becomes very sensitive to noise and the result gets unreliable. If we need an accurate count at high fill, we can run a [HyperLogLog](https://upstash.com/blog/redis-hyperloglog) alongside the filter. `PFADD` and `PFCOUNT` work on Upstash and give us about 0.81% accuracy in 12 KB no matter how many items we add.

## When should we use a Bloom filter vs a Redis Set or HyperLogLog?

Three probabilistic options, three different jobs:

| Structure | What it answers | Memory for 1M items | False positives | False negatives | Cardinality |
| --- | --- | --- | --- | --- | --- |
| Redis Set (`SADD`/`SISMEMBER`) | Exact membership + count | \~40-80 MB | None | None | Exact (`SCARD`) |
| Bloom filter (`BITFIELD`) | Approximate membership | \~1.2 MB at 1% | Tunable | Never | Approximate (BITCOUNT + estimator) |
| HyperLogLog (`PFADD`/`PFCOUNT`) | Approximate count only | \~12 KB | N/A | N/A | 0.81% std. error |

Three good places to reach for a Bloom filter, where the memory savings over a normal Redis set are 30 to 100 times:

- **Skipping pointless database queries.** Before hitting your database, ask the Bloom filter whether the key probably exists. A "no" lets you skip the query entirely. A "yes" means you still need to query, but you have already filtered out most of the misses. This is the original [Cassandra and HBase use case](https://cassandra.apache.org/doc/latest/cassandra/managing/operating/bloom_filters.html).
- **Deduplicating URLs in a web crawler.** A filter for 100 million URLs at 1% accuracy fits in about 120 MB. Storing those same URLs in a normal Redis set, at 50 bytes each, would cost around 5 GB.
- **Rejecting obviously duplicate usernames or coupon codes.** Run the cheap Bloom check first, then only query the real database for the items it flags as "maybe seen."

Reach for a normal Redis set instead if you need to list the members, count them exactly, or delete some of them. Reach for HyperLogLog if you only need to know how many distinct items you have seen, not whether a specific one is in the set.

## What are the limitations of a Bloom filter on Upstash Redis?

Four to keep in mind:

- **You cannot delete items.** Clearing a bit could break the lookup for some other item that also happened to set that same bit. The standard workarounds are to use a [counting Bloom filter](https://en.wikipedia.org/wiki/Counting_Bloom_filter) (which stores small counters instead of single bits) or to rebuild the filter from scratch on a schedule.
- **There is no perfectly atomic "add only if new."** The real `BF.ADD` was designed to give you that guarantee. The `BITFIELD`-based version runs as one atomic Redis command, but its "this is a new item" answer is probabilistic. If you need a hard guarantee, pair the filter with a regular `SET key value NX` for the small set of items the filter flags as possibly seen.
- **An empty filter still costs full memory.** The bit array is allocated up front, so a 10 million bit filter is a 1.25 MB string in Redis whether you have added zero items or a million. Do not size much higher than you actually need.
- **One Redis key per filter.** Upstash caps any single value at 100 MB, which means a single filter holds at most about 80 million items at 1% accuracy. Beyond that, shard: pick which filter an item belongs to by hashing it, and store N filters under N keys.

## What about expiring the filter?

The filter lives in a normal Redis string, so all the usual key commands work on it. `EXPIRE bloom:emails 86400` will auto-delete it after a day. `PERSIST` removes the expiration. `RENAME` moves it. This is the easiest way to do rolling deduplication: name the filter after the hour or day ("have I seen this event in the last hour?"), let `EXPIRE` clean up old buckets, and you avoid any cleanup job entirely.

## Should I just use a Redis with the RedisBloom module?

If you genuinely need RedisBloom's more advanced features (auto-resizing filters, restoring a filter from a dump file, or its specific atomicity guarantees), then sure, run a Redis that has the module loaded. But for the common case of "I want a fixed-size Bloom filter that I configure once when I deploy," the 50-line version we built here is great. It runs against Upstash Redis from any environment (`@upstash/redis` over HTTP for serverless, or ioredis/node-redis over TCP for traditional servers), and it hit 1.45% false positives against a 1% target in the test above.

## FAQ

**Can I use this from Cloudflare Workers or Vercel Edge?** Yes. For serverless runtimes use `@upstash/redis`, which is HTTP-based and runs in any V8 isolate. For long-running Node.js or Bun servers, the same code works with ioredis or node-redis over Upstash's TCP endpoint. The `BigInt` arithmetic in the FNV-1a hash is supported in every modern runtime.

**Does&#32;`BITFIELD`&#32;count as one command for billing?** Yes. A single `BITFIELD` call with k `SET u1` subcommands is one Upstash command, not k. That is the main reason to use `BITFIELD` over k individual `SETBIT` calls both for latency and for the per-command billing model.

**How do I serialize a filter for offline analysis?** `GET bloom:key` returns the raw byte string; `SET bloom:key <bytes>` restores it. Pair with `STRLEN` to confirm the byte count matches `Math.ceil(m / 8)`.

