expireover: Add bloom filter for fast history existence checks#339

Open: kev009 wants to merge 1 commit into InterNetNews:main from kev009:expireover-token-cache

@kev009
Contributor

@kev009 kev009 commented May 7, 2026

expireover checks every article in the overview database against the history file to detect orphaned entries. This requires a per-article HISlookup, which does random pread() calls into the DBZ index and history text file. On large spools (1B+ articles), this takes months.

Add a bloom filter that is built from a single sequential HISwalk of the history file at startup. The bloom filter acts as a positive-only cache in OVhisthasmsgid: bloom hits skip the slow HISlookup, bloom misses fall through to HISlookup for correctness. False positives are benign (an orphaned overview entry survives one extra cycle).
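
A minimal sketch of the positive-only cache pattern described above. All names here (`struct existence_check`, `article_exists`, `probe_in_set`) are hypothetical and are not INN's API; in the patch the real check lives in OVhisthasmsgid. The toy probes search a NULL-terminated string array purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical probe type: returns true if msgid is (maybe) present. */
typedef bool (*probe_fn)(const void *ctx, const char *msgid);

struct existence_check {
    probe_fn bloom_maybe_contains; /* fast, in memory, may false-positive */
    const void *bloom_ctx;
    probe_fn slow_lookup;          /* authoritative, e.g. a history lookup */
    const void *slow_ctx;
    unsigned long slow_calls;      /* how often we fell through */
};

/* Toy probe for illustration: ctx is a NULL-terminated array of strings. */
static bool
probe_in_set(const void *ctx, const char *msgid)
{
    const char *const *set = ctx;

    for (; *set != NULL; set++)
        if (strcmp(*set, msgid) == 0)
            return true;
    return false;
}

static bool
article_exists(struct existence_check *c, const char *msgid)
{
    assert(c->slow_lookup != NULL);

    /* Filter hit: treat the article as present and skip the slow path.
       A false positive only means an orphan is kept a little longer. */
    if (c->bloom_maybe_contains != NULL
        && c->bloom_maybe_contains(c->bloom_ctx, msgid))
        return true;

    /* Filter miss (or filter disabled): ask the authoritative source,
       which also covers entries added after the filter was built. */
    c->slow_calls++;
    return c->slow_lookup(c->slow_ctx, msgid);
}
```

Leaving `bloom_maybe_contains` as NULL models the disabled case, where every check goes to the slow lookup.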

The bloom filter is controlled by the new inn.conf parameter expirebloomfp, which specifies the false positive rate as a reciprocal (default 10000 = 0.01%). Setting it to 0 disables the bloom filter. Memory usage is approximately 20 bits per article (48 MB for 20M articles, 2.4 GB for 1B articles).
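
As a rough check on those figures, assuming the filter is sized with the standard optimal bloom filter formulas (the description only states the approximate result), with p the false positive rate, n the number of articles, m the number of bits, and k the number of probes per key:

```latex
\begin{align*}
p &= \frac{1}{10000} = 10^{-4} \\
\frac{m}{n} &= -\frac{\ln p}{(\ln 2)^2} = \frac{9.2103}{0.4805} \approx 19.17 \text{ bits per article} \\
k &= \frac{m}{n}\,\ln 2 = -\log_2 p \approx 13.3 \\
n = 2 \times 10^{7}: \quad \frac{19.17 \cdot 2 \times 10^{7}}{8} &\approx 4.8 \times 10^{7} \text{ bytes} \approx 48 \text{ MB} \\
n = 10^{9}: \quad \frac{19.17 \cdot 10^{9}}{8} &\approx 2.4 \times 10^{9} \text{ bytes} \approx 2.4 \text{ GB}
\end{align*}
```

Under that assumption the quoted 48 MB and 2.4 GB are consistent with the default setting of 10000.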

Changes:

  • Add lib/bloom.c and include/inn/bloom.h (bloom filter implementation using enhanced double hashing, Kirsch & Mitzenmacher 2006)
  • Extend HISwalk callback signature to include the message-ID HASH (HISwalk has had zero callers since it was added in 2001)
  • Set hisv6_walk ignore=true so corrupt lines don't abort the walk
  • Add OVTOKENCACHE to OVctl for passing the bloom filter to OVhisthasmsgid
  • Add expirebloomfp to innconf
  • Add unit tests (lib/bloom-t.c) and integration tests (lib/bloom-hiswalk-t.c)
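
For readers unfamiliar with the technique named in the first item, here is a rough sketch of a bloom filter whose k probe positions are derived from two base hashes with the enhanced double hashing update rule (x += y, then y += i). It is not the code in lib/bloom.c: the struct layout, the function names, and the seeded FNV-1a hash are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, not the one in include/inn/bloom.h. */
struct bloom {
    uint8_t *bits;  /* bit array, nbits long */
    uint64_t nbits; /* m: number of bits */
    unsigned k;     /* number of probes per key */
};

static bool
bloom_init(struct bloom *b, uint64_t nbits, unsigned k)
{
    /* nbits is capped so that x + y below cannot overflow. */
    assert(nbits > 0 && nbits <= UINT64_MAX / 2 && k > 0);
    b->bits = calloc((size_t) ((nbits + 7) / 8), 1);
    b->nbits = nbits;
    b->k = k;
    return b->bits != NULL;
}

static void
bloom_free(struct bloom *b)
{
    free(b->bits);
    b->bits = NULL;
}

/* Seeded 64-bit FNV-1a, standing in for whatever hash the patch uses. */
static uint64_t
fnv1a(const void *data, size_t len, uint64_t seed)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL ^ seed;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Visit the k probe positions; set them, or test them when set is false. */
static bool
bloom_visit(struct bloom *b, const void *key, size_t len, bool set)
{
    uint64_t x = fnv1a(key, len, 0) % b->nbits;
    uint64_t y = fnv1a(key, len, 0x9e3779b97f4a7c15ULL) % b->nbits;

    for (unsigned i = 0; i < b->k; i++) {
        if (i > 0) {
            /* Enhanced double hashing step. */
            x = (x + y) % b->nbits;
            y = (y + i) % b->nbits;
        }
        uint8_t mask = (uint8_t) (1u << (x & 7));
        if (set)
            b->bits[x >> 3] |= mask;
        else if ((b->bits[x >> 3] & mask) == 0)
            return false; /* definitely never added */
    }
    return true; /* added, or a false positive when querying */
}

static void
bloom_add(struct bloom *b, const void *key, size_t len)
{
    (void) bloom_visit(b, key, len, true);
}

static bool
bloom_maybe_contains(struct bloom *b, const void *key, size_t len)
{
    return bloom_visit(b, key, len, false);
}
```

A filter built this way never reports an added key as missing, which is the property the positive-only cache depends on. A key that was never added may still be reported as present, at roughly the configured false positive rate.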

@kev009 kev009 force-pushed the expireover-token-cache branch from c290835 to d0c379e Compare May 7, 2026 05:26
@Julien-Elie
Contributor

What a nice contribution. Thanks Kevin for your work to improve INN!
With the change to the history library, is it ok for you to not bump a major release as the function was not previously called?

@kev009
Contributor Author

kev009 commented May 7, 2026

@Julien-Elie thanks. Yes, I went back and checked all the repo history. This was sketched out in 2001, but I can find no evidence of it being used by any consumers, so I think it is safe in principle. We could also bump to be safe.

freebsd-git pushed a commit to freebsd/freebsd-ports that referenced this pull request May 7, 2026
@kev009
Contributor Author

kev009 commented May 8, 2026

Some empirical results from my current tradspool, where the run time went from weeks down to minutes:

expireover start Fri May  8 04:15:22 UTC 2026: ( -z/var/log/news/expire.rm -Z/var/log/news/expire.lowmark)
    Article lines processed 80197480
    Articles dropped             627
    Overview index dropped       627
expireover end Fri May  8 04:33:37 UTC 2026

@Julien-Elie
Contributor

Very impressive!

@Julien-Elie Julien-Elie self-assigned this May 20, 2026
@Julien-Elie Julien-Elie added this to the 2.7.4 milestone May 20, 2026
@Julien-Elie Julien-Elie added enhancement New feature or request C: overview Related to overview methods P: medium Medium priority labels May 20, 2026
@Julien-Elie
Contributor

Your patch looks good to me, thanks for it!
I agree with the proposal to enable it by default.

Just a question: is the Kirsch & Mitzenmacher 2006 algorithm fine to use for us? No copyright or license to mention?

I hope other people will test the patch and confirm the time benefits. The news spool of my testing news server uses mostly CNFS storage, and I unfortunately do not see any change in time with or without the bloom filter. expireover terminates in about 45 seconds :)
I just note that user CPU time and CPU usage are higher but system CPU time is lower with the bloom filter.

expirebloomfp set to 0:

% time expireover -N
Article lines processed  4332717
Articles dropped               0
Overview index dropped         0
expireover -N  3,50s user 2,92s system 14% cpu 45,223 total

expirebloomfp set to 10000:

expireover -N  6,92s user 1,44s system 18% cpu 45,497 total

expirebloomfp set to 100000:

expireover -N  9,30s user 1,43s system 22% cpu 47,960 total

Julien-Elie added a commit that referenced this pull request May 22, 2026
Julien-Elie added a commit that referenced this pull request May 22, 2026
Julien-Elie added a commit that referenced this pull request May 22, 2026
Julien-Elie added a commit that referenced this pull request May 22, 2026
Comment thread expire/expireover.c
spools (1B+ articles). The bloom filter is used as a positive-only
cache: hits skip the slow history lookup, misses fall through to
HISlookup for correctness (handles articles added after the walk). */
if (innconf->expirebloomfp > 0 && !always_stat) {
Contributor Author

@Julien-Elie I wonder if we should add another chicken bit here and only do it for tradspool/timehash, etc.?

Contributor Author

@kev009 kev009 May 23, 2026

But the times with the default fp look reasonable, so maybe we need to see more data points to decide whether, e.g., CNFS should completely opt out.

Contributor

The times are similar so maybe it's not worth adding extra complexity at this point to probe whether articles are in a self-expiring storage method and the -N flag is not used.

FWIW, without the -N flag, the run time of expireover is still similar on my news spool (about 46 seconds for 416 000 articles).

@Julien-Elie
Contributor

Julien-Elie commented May 23, 2026

@kev009 There is an issue when INN is built with --enable-tagged-hash: the test suite fails.

% ./runtests -o lib/bloom-hiswalk
1..8
zsh: segmentation fault  ./runtests -o lib/bloom-hiswalk

Seems to occur at the inn_msync_page(where, tab->reclen, MS_ASYNC) call in lib/dbz.c according to gdb.

Hmm, maybe this is a pre-existing bug not linked to the new feature?

@Julien-Elie Julien-Elie reopened this May 23, 2026