🚀 LLM-Guided Embedding Refinement

Boost zero-shot classification and retrieval with test-time query optimization

📖 Overview

This repository contains code to improve zero-shot classification and retrieval using embedding models through test-time optimization of query embedding representations.

At test time, given a user query, the query embedding is optimized with gradient descent based on targeted feedback from a stronger model. The method uses scores from an LLM or reranker over a small sampled set of candidate documents, then updates the query representation so the embedding space better reflects the task-specific intent of the query.

🎯 Key Features

✨ Test-time optimization — No retraining required
🔄 Flexible architecture — Works with various text embedding models and LLM rerankers
📊 Proven results — Consistent gains across multiple benchmarks

📄 Paper

Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, Assaf Toledo

💡 How It Works

Embedding models are efficient and scalable, but in challenging zero-shot settings they may miss nuanced task constraints. This library explores a test-time refinement procedure that adapts the query representation using external guidance from a generative LLM, without retraining the embedding model.

This approach improves ranking quality across multiple search and classification benchmarks, with consistent gains on tasks such as:

📚 Literature search
🎯 Intent detection
🔑 Key-point matching
📋 Instruction-following retrieval

🔄 Workflow

Step-by-step process:

📝 Embed the original query and candidate documents
🔍 Retrieve top candidates by embedding similarity
🎯 Score a sampled subset using an LLM, cross-encoder reranker, or gold labels
⚡ Optimize the query embedding to better align with supervision
🔄 Re-score the corpus using the refined query embedding
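
The five steps above can be sketched end to end in plain Python. This is a toy illustration, not the repository's implementation: the function names, the dot-product similarity, and the loss (cross-entropy between a softmax over teacher scores and a softmax over query-document similarities) are all assumptions made for the example, and the learning rate is chosen for tiny 2-D vectors rather than real embeddings.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def refine_query(query, docs, teacher_scores, lr=0.1, num_steps=100):
    """Gradient descent on the query vector only; document vectors stay fixed.

    Assumed loss: cross-entropy between softmax(teacher_scores) and
    softmax(query . doc). Its gradient with respect to the query is
    sum_i (student_i - target_i) * doc_i.
    """
    target = softmax(teacher_scores)
    q = list(query)
    for _ in range(num_steps):
        student = softmax([dot(q, d) for d in docs])
        grad = [
            sum((s - t) * d[k] for s, t, d in zip(student, target, docs))
            for k in range(len(q))
        ]
        q = [qk - lr * gk for qk, gk in zip(q, grad)]
    return q


def rank(query, corpus):
    """Return corpus indices sorted by descending dot-product similarity."""
    return sorted(range(len(corpus)), key=lambda i: dot(query, corpus[i]), reverse=True)


def refine_and_rerank(query, corpus, teacher, top_k=4, sample_size=3, seed=42):
    # Steps 1-2: embeddings are given; retrieve top candidates by similarity.
    top = rank(query, corpus)[:top_k]
    # Step 3: score a sampled subset with the stronger model (hypothetical callable).
    sampled = random.Random(seed).sample(top, min(sample_size, len(top)))
    scores = [teacher(i) for i in sampled]
    # Step 4: optimize the query embedding against those scores.
    refined = refine_query(query, [corpus[i] for i in sampled], scores)
    # Step 5: re-score the full corpus with the refined query.
    return refined, rank(refined, corpus)
```

Here `teacher` stands in for the LLM, cross-encoder reranker, or gold labels: any callable that maps a document index to a relevance score. Only the query vector is updated, so corpus embeddings can be computed once and reused.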

🚀 Quick Start

📦 Installation

1️⃣ Create and activate a Python environment

python -m venv .venv
source .venv/bin/activate

2️⃣ Install dependencies

pip install -r requirements.txt

3️⃣ Configure inference for LiteLLM or OpenAI

The reranker and HyDE components use an OpenAI-compatible chat-completions API.

Option A — LiteLLM gateway or proxy

Set BASE_URL to your LiteLLM endpoint and API_KEY to the corresponding key:

export BASE_URL="http://localhost:4000"
export API_KEY="your-litellm-api-key"

Use LiteLLM explicitly by prefixing the model name with LiteLLM/, for example:

python run_experiments.py \
  --reranker_model "LiteLLM/mistralai/Mistral-Small-3.2-24B-Instruct-2506"

Option B — OpenAI API

Set your OpenAI API key:

export OPENAI_API_KEY="your-openai-api-key"

Use OpenAI explicitly by prefixing the model name with OpenAI/, for example:

python run_experiments.py \
  --reranker_model "OpenAI/gpt-4.1-mini" \
  --hyde_model "OpenAI/gpt-4.1-mini"

Important notes

If no service prefix is provided, this repository defaults to LiteLLM for LLM inference. This fails unless a suitable endpoint is defined in your environment.
--reranker_model controls the LLM used for relevance feedback during optimization.
--hyde_model controls the LLM used to generate hypothetical documents for HyDE (Hypothetical Document Embeddings). It is optional and not required for basic query optimization.
LiteLLM/OpenAI setup is only required when using LLM-based reranking or HyDE. It is not needed for --optimize_with_gold or cross-encoder rerankers such as cross-encoder/ms-marco-MiniLM-L-6-v2.
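
The prefix convention described above can be illustrated with a short sketch. It is not the repository's code: `split_service_prefix` and `missing_env` are hypothetical helpers, and exact, case-sensitive prefix matching is an assumption. The environment variable names are the ones from the setup instructions.

```python
KNOWN_SERVICES = ("LiteLLM", "OpenAI")
DEFAULT_SERVICE = "LiteLLM"

# Environment variables named in the setup instructions.
REQUIRED_ENV = {
    "LiteLLM": ("BASE_URL", "API_KEY"),
    "OpenAI": ("OPENAI_API_KEY",),
}


def split_service_prefix(model_name):
    """Return (service, model_id); fall back to LiteLLM when no known prefix is present."""
    prefix, sep, rest = model_name.partition("/")
    if sep and prefix in KNOWN_SERVICES:
        return prefix, rest
    return DEFAULT_SERVICE, model_name


def missing_env(service, environ):
    """List the variables the service needs that are absent or empty in `environ`."""
    return [name for name in REQUIRED_ENV[service] if not environ.get(name)]
```

Passing `os.environ` as `environ` before launching a run gives an early, readable error instead of a failed request. The sketch covers LLM model names only; cross-encoder rerankers need no service prefix.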

▶️ Basic Usage

Run all default models on all datasets:

python run_experiments.py

Run a specific model on selected datasets:

python run_experiments.py \
  --embedding_models "Qwen/Qwen3-Embedding-0.6B" \
  --datasets "Clinc150" "NFCorpus"

Run experiments in parallel:

python run_experiments.py --parallel 3

📊 Supported Datasets

The repository currently supports the following datasets through dataset_loaders.py:

Dataset	Description	Reference
🎓 RealScholarQuery	Real-world academic search queries over arXiv CS papers	He et al., 2025
🔑 ArgKP-21	Key-point matching from 2021 KPA shared task	Friedman et al., 2021
📋 FollowIR	Information retrieval from TREC relevance narratives	Weller et al., 2025
💬 Clinc150	Intent classification with 150 intents across 10 domains	Larson et al., 2019
🏦 Banking77	Banking domain with 77 fine-grained intent categories	Casanueva et al., 2020
🏥 NFCorpus	Medical literature retrieval with lay queries	Boteva et al., 2016

🛠️ Custom Usage

🎮 Main Entry Points

Script	Purpose
`embedding_adaptation.py`	Core script for single experiment runs
`run_experiments.py`	Batch runner for multiple experiments

🔧 Command Examples

Preview commands without execution

python run_experiments.py --dry_run

Enable HyDE (Hypothetical Document Embeddings)

python run_experiments.py \
  --hyde_model "meta-llama/Llama-3.1-8B-Instruct"

Run single experiment with custom parameters

python embedding_adaptation.py \
  --embedding_model "Qwen/Qwen3-Embedding-0.6B" \
  --dataset "NFCorpus" \
  --reranker_model "mistralai/Mistral-Small-3.2-24B-Instruct-2506" \
  --lr 1e-4 \
  --num_steps 100 \
  --total_scores 20 \
  --scores_from_top 20

⚙️ Configuration Parameters

🎯 Single Experiment Parameters (`embedding_adaptation.py`)

These parameters configure individual experiment runs. They can also be passed to run_experiments.py and will be forwarded to each experiment.

Parameter	Description	Default
`--embedding_model`	Embedding model to use (single model)	`Qwen/Qwen3-Embedding-0.6B`
`--dataset`	Dataset to use for experiment	`RealScholarQuery`
`--reranker_model`	Reranker model for feedback	`mistralai/Mistral-Small-3.2-24B-Instruct-2506`
`--hyde_model`	LLM for hypothetical document generation (optional)	None
`--lr`	Learning rate for query embedding optimization	1e-4
`--num_steps`	Number of optimization steps	100
`--total_scores`	Total documents to sample for reranking signal	20
`--scores_from_top`	Documents sampled from top results (by embedding similarity)	20
`--optimize_with_gold`	Use gold labels instead of reranker scores	False
`--embedder_batch_size`	Batch size for embedding inference	10
`--reranker_batch_size`	Batch size for reranker inference	10
`--random_seed`	Random seed for reproducibility	42
`--save_tensors`	Save query trajectory tensors for analysis	False
`--experiment_name`	Custom experiment name (auto-generated if not provided)	None
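
The interaction between `--total_scores` and `--scores_from_top` can be pictured with the sketch below. It encodes one plausible reading of the two parameters, which is an assumption and not confirmed by the repository: the first `scores_from_top` documents come from the head of the similarity ranking, and any remaining budget is filled at random from the rest. The function name `sample_for_feedback` is hypothetical.

```python
import random


def sample_for_feedback(ranked_ids, total_scores=20, scores_from_top=20, seed=42):
    """Pick document ids to send to the reranker.

    Assumption: the first `scores_from_top` ids come from the head of the
    similarity ranking and the remainder are drawn at random from the tail.
    """
    if scores_from_top > total_scores:
        raise ValueError("scores_from_top cannot exceed total_scores")
    head = list(ranked_ids[:scores_from_top])
    tail = list(ranked_ids[scores_from_top:])
    # Never ask for more random ids than the tail contains.
    n_random = min(total_scores - len(head), len(tail))
    rng = random.Random(seed)
    return head + rng.sample(tail, n_random)
```

Under this reading, the defaults (20 and 20) send exactly the top 20 candidates to the reranker, with no random documents from lower in the ranking.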

🔄 Batch Runner Parameters (`run_experiments.py` only)

These parameters are specific to the batch runner and control how multiple experiments are executed.

Parameter	Description	Default
`--parallel`	Number of concurrent experiments to run	1
`--experiment_prefix`	Custom prefix for experiment names	None
`--embedding_models`	List of embedding models to test (space-separated)	See defaults in script
`--datasets`	List of datasets to test (space-separated)	See defaults in script
`--dry_run`	Preview commands without executing	False
`--continue_on_error`	Continue running experiments even if one fails	False
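
As a rough picture of what the batch runner does, the sketch below expands a list of models and datasets into one `embedding_adaptation.py` command per combination, similar in spirit to what `--dry_run` previews. The helper `build_commands` is hypothetical, and the real script may order or format its commands differently.

```python
import itertools
import shlex


def build_commands(embedding_models, datasets, extra_args=()):
    """Build one embedding_adaptation.py command line per (model, dataset) pair."""
    commands = []
    for model, dataset in itertools.product(embedding_models, datasets):
        argv = [
            "python", "embedding_adaptation.py",
            "--embedding_model", model,
            "--dataset", dataset,
            *extra_args,  # forwarded unchanged to every experiment
        ]
        commands.append(shlex.join(argv))
    return commands
```

With one model and two datasets this yields two commands; anything in `extra_args` (for example `["--lr", "1e-4"]`) is appended to each of them.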

📁 Output Structure

Each experiment run creates a directory under output/<experiment_name>/ containing:

output/
└── <experiment_name>/
    ├── *_results.csv              # Per-topic evaluation metrics
    ├── *_raw_scores.parquet       # Raw document-level scores
    ├── tensors/                   # Query trajectory tensors (if enabled)
    └── config.json                # Experiment configuration metadata
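
For quick inspection, an experiment directory with this layout can be read using only the standard library. The sketch below is a convenience example, not part of the repository: `load_experiment` is a hypothetical helper, and it makes no assumption about which columns the results files contain. The `*_raw_scores.parquet` files need a Parquet reader and are left out here.

```python
import csv
import json
from pathlib import Path


def load_experiment(experiment_dir):
    """Return (config, rows) for one experiment directory.

    `config` is the parsed config.json; `rows` holds every row of every
    *_results.csv file as a dict keyed by the CSV header.
    """
    experiment_dir = Path(experiment_dir)
    config = json.loads((experiment_dir / "config.json").read_text(encoding="utf-8"))
    rows = []
    for csv_path in sorted(experiment_dir.glob("*_results.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return config, rows
```

All CSV values come back as strings, so numeric metrics need an explicit `float(...)` before aggregation.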

⏱️ Runtime Estimates

The total runtime of the experiments consists of two main components:

Document embeddings — Embedding all documents in the corpus. In a deployment environment this would typically be done offline.
Per-query computations — Query embedding, refinement optimization, and LLM teacher feedback.

With a GPU, per-query latency is well under a second, as shown in the paper. Running all the queries in a dataset therefore takes just a few minutes, assuming an efficient model endpoint for obtaining the LLM feedback scores.

Thus, much of the experiment runtime is devoted to the one-time cost of computing the corpus document embeddings. For convenience, it is possible to run this initial step separately using the script embed_all_documents.py.

Runtime estimates on a single A100-80GB GPU:

Using a small embedding model, like Qwen/Qwen3-Embedding-0.6B: Running the full experiment on all datasets, including embedding all corpus documents and optimizing all the queries, should take about an hour.
Using larger embedding models, like Qwen/Qwen3-Embedding-8B:
- Embedding corpus documents with 7B/8B models can range from a few minutes for small datasets (e.g., ArgKP-21) to 3-4 hours for larger datasets like RealScholarQuery or FollowIR.
- Note that for datasets with long documents (e.g., FollowIR), it may be necessary to use a small --embedder_batch_size to avoid running out of GPU memory.

📝 Notes

💾 Caching: Embeddings, reranker outputs, and generated texts are cached under /cache to avoid repeated computation.
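
The general idea of such a cache can be sketched as follows. This is a generic illustration and not the repository's `file_system_cache.py`: the class name, the JSON storage format, and the SHA-256 key hashing are all assumptions.

```python
import hashlib
import json
from pathlib import Path


class FileCache:
    """Tiny JSON-on-disk cache keyed by a hash of the request."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, namespace, key):
        # Serialize with sorted keys so equal dicts always hash identically.
        digest = hashlib.sha256(json.dumps(key, sort_keys=True).encode("utf-8")).hexdigest()
        return self.root / namespace / f"{digest}.json"

    def get_or_compute(self, namespace, key, compute):
        """Return the cached value for `key`, calling `compute()` only on a miss."""
        path = self._path(namespace, key)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        value = compute()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(value), encoding="utf-8")
        return value
```

A separate namespace per artifact type (embeddings, reranker outputs, generated texts) keeps the entries apart, and including the model name in the key prevents one model's results from being returned for another.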

🔬 Research Focus: This implementation is optimized for research and experimentation. For production deployment, consider replacing the file-system cache with a scalable vector-store solution.

📚 Citation

If you use this code in your research, please cite our paper:

@article{gera2026taskadaptive,
  title={Task-Adaptive Embedding Refinement via Test-time LLM Guidance},
  author={Gera, Ariel and Ashury-Tahan, Shir and Bloch, Gal and Eytan, Ohad and Toledo, Assaf},
  year={2026},
  journal={arXiv:2605.12487},
  url={https://arxiv.org/abs/2605.12487},
}
