Add BM25SRetriever: pure-Python BM25 with no Java/Pyserini dependency#116

Draft
Copilot wants to merge 2 commits into main from copilot/add-bm25s-integration

Copilot AI commented Apr 23, 2026

The existing BM25Retriever requires Pyserini (~7GB install) and a JVM. This PR adds BM25SRetriever, backed by bm25s, a pure-Python BM25 implementation that runs significantly faster and has a total install size of ~479MB.

New: BM25SRetriever

  • rankify/retrievers/bm25s_retriever.py — new retriever; builds a bm25s index from a JSONL or TSV corpus on first use, persists it to disk, and loads it on subsequent runs. No Java, no JVM, no Lucene.
  • Corpus formats supported: JSONL ({"id", "title", "text"}) and TSV (id\ttext\ttitle, same layout as psgs_w100.tsv)
  • Optional PyStemmer support via stemmer_lang parameter
  • Self-contained _has_answers — no pyserini import required
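
As an illustration of the two corpus layouts listed above, here is a minimal standard-library sketch of a loader that normalizes both into {"id", "title", "text"} records, plus a simplified answer check. The names load_corpus and has_answers are hypothetical and are not taken from the PR; the substring match in particular is an assumed stand-in and may differ from what _has_answers actually does.

```python
import csv
import json
from pathlib import Path


def load_corpus(path):
    """Return a list of {"id", "title", "text"} dicts from a JSONL or TSV file."""
    path = Path(path)
    docs = []
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                row = json.loads(line)
                docs.append({
                    "id": str(row["id"]),
                    "title": row.get("title", ""),
                    "text": row["text"],
                })
    elif path.suffix == ".tsv":
        # Column order follows psgs_w100.tsv: id, text, title.
        with path.open(encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) < 2 or row[0] == "id":
                    continue  # skip malformed rows and the header row
                title = row[2] if len(row) > 2 else ""
                docs.append({"id": row[0], "title": title, "text": row[1]})
    else:
        raise ValueError(f"Unsupported corpus format: {path.suffix}")
    return docs


def has_answers(text, answers):
    """Simplified check (assumption): case-insensitive, whitespace-normalized substring match."""
    normalized = " ".join(text.lower().split())
    return any(
        " ".join(answer.lower().split()) in normalized
        for answer in answers
        if answer.strip()
    )
```

Normalizing both formats into one record shape means the indexing step only has to handle a single structure, regardless of which file type the corpus came from.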

Integration

  • retriever.py — adds "bm25s" to METHOD_MAP; all retriever imports wrapped in try/except so missing optional deps (pyserini, faiss, gensim, etc.) no longer block imports of unrelated retrievers
  • __init__.py — exports BM25SRetriever; same graceful import handling
  • bm25_retriever.py / diver_bm25_retriever.py — pyserini imports made lazy; raise a descriptive ImportError pointing to BM25SRetriever if pyserini is absent
  • pyproject.toml: bm25s>=0.2.0 added to [retriever] optional deps
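
The import handling described in this list can be sketched roughly as follows. This is an assumed pattern, not the PR's code: try_import, build_method_map, and require are hypothetical helper names, and the actual retriever.py may wrap each import in its own try/except instead.

```python
import importlib


def try_import(module_name, attr_name):
    """Return module.attr, or None if the module (or one of its deps) is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


def build_method_map(candidates):
    """Register only the methods whose class could actually be imported."""
    method_map = {}
    for method, (module_name, attr_name) in candidates.items():
        cls = try_import(module_name, attr_name)
        if cls is not None:
            method_map[method] = cls
    return method_map


def require(module_name, hint):
    """Import lazily at the point of use and fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"{module_name} is not installed. {hint}") from exc


# Stand-in example using a stdlib module and a deliberately missing one.
available = build_method_map({
    "present": ("json", "JSONDecoder"),
    "absent": ("no_such_module_xyz_123", "Thing"),
})
```

With this shape, a missing optional dependency only removes its own entry from the map, and the descriptive error is raised only when a caller actually asks for the backend that needs it.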

Usage

from rankify.retrievers import Retriever

# First run: builds and persists the index
retriever = Retriever(
    method="bm25s",
    n_docs=10,
    corpus_path="/path/to/corpus.jsonl",  # or .tsv
    index_folder="/path/to/index_dir",
)

# Subsequent runs: loads pre-built index, no corpus_path needed
retriever = Retriever(method="bm25s", n_docs=10, index_folder="/path/to/index_dir")
# `documents` is assumed to be a list of rankify Document objects prepared beforehand
results = retriever.retrieve(documents)

Copilot AI changed the title from "[WIP] Add BM25s as a replacement for Pyserini" to "Add BM25SRetriever: pure-Python BM25 with no Java/Pyserini dependency" Apr 24, 2026
Copilot AI requested a review from abdoelsayed2016 April 24, 2026 00:04

Development

Successfully merging this pull request may close these issues.

Can we use BM25s which is a pure python package and faster than Pyserini and removes Java Dependency
