This repository presents the open-source resources associated with the paper Diagnosing and Mitigating Context Rot in Long-horizon Search. We release the complete experimental infrastructure for both Web Search and Local Search settings, including agent scaffolds with seven context management strategies, diagnostic analysis tools, and evaluation benchmarks for studying context rot in long-horizon search agents.
- [2026/06] Our paper is available on arXiv.
Extensive context has become the norm as Large Language Models (LLMs) are increasingly deployed in long-horizon tasks. The concern that increasing context length degrades model capabilities, known as context rot, has become a central issue for these applications. In this paper, we focus on deep search scenarios, aiming to investigate the rot phenomenon and its mitigation strategies. By evaluating four flagship open-source models across three benchmarks, we reveal a prevalent but unnoticed rot phenomenon: extensive context causes models to directly give up or prematurely provide uncertain answers, and this issue is exacerbated as the context grows. Through pruning experiments, we demonstrate the relationship between the accumulated context and the rot phenomenon. Furthermore, we investigate mitigating this issue through context management and post-hoc rejection sampling. For context management, we systematically evaluate seven different methods across three categories, based on performance, cost, and impact on context rot, providing clear guidance for strategy selection and usage. For rejection sampling, we develop a rot-aware filtering strategy and demonstrate its effectiveness across three aggregation methods. Finally, we show that these two approaches can be combined for further performance improvements.
-
Create an environment
conda create -n websearch python=3.11 -y conda activate websearch
-
Install dependencies
pip install openai qwen-agent transformers requests tqdm tiktoken pandas dashscope soundfile jinja2
-
Configure model and tool services
export TOKENIZER_PATH="/path/to/local-tokenizer" export MODEL_NAME="model-name-served-by-agent-endpoint" export MODEL_API_KEY="your-model-api-key" export AGENT_URL="https://your-openai-compatible-agent-endpoint/v1" export SERPER_API_KEY="your-serper-key" export SUMMARY_API_KEY="your-summary-api-key" export SUMMARY_API_BASE="https://your-openai-compatible-summary-endpoint/v1" export SUMMARY_MODEL_NAME="/path/or/name/of/summary-model" export MAX_LLM_CALL_PER_RUN=100 export CONTEXT_LENGTH=$((198 * 1024))
SERPER_API_KEYis used by the search tool and can be obtained from Serper.SUMMARY_*is used by the visit/page-summary tool.MODEL_NAME,MODEL_API_KEY, andAGENT_URLconfigure the main agent model endpoint.TOKENIZER_PATHis only used locally for token counting and context-length control.To reduce search API cost, Web Search caches historical search/visit tool-call results. When the same query is requested again, the cached result is returned directly. The cache is stored under
ContextRot/websearch/cache/serper_cache.sqliteby default, and you can override it withSERPER_CACHE_DB_PATH. -
Prepare input data
websearch/src/main.pysupports built-in dataset names:xbench-deepsearchbrowsecomp
The built-in dataset files are stored under
ContextRot/data/:ContextRot/data/xbench-deepsearch.jsonContextRot/data/browsecomp.json
You can also pass a custom JSON or JSONL file with
--input. Each record should contain:{"question": "Question text", "answer": "Reference answer"}
Run from the Web Search source directory:
cd ContextRot/websearch/src
python main.py \
--dataset browsecomp \
--model "$MODEL_NAME" \
--tokenizer_path "$TOKENIZER_PATH" \
--agent react \
--max_workers 4By default, outputs are written to ContextRot/websearch/output/<dataset>/<model_name>/<agent>.json, for example ContextRot/websearch/output/browsecomp/Qwen3.5-397B-A17B-FP8/react.json. Use --output /path/to/output.jsonl to override this path.
Add --num_samples N to run only the first N samples. If this argument is omitted, the full dataset is evaluated.
Current Web Search agent choices:
reactdiscardsummary_semanticsummary_lengthsummary_turnkeep_k_latestkeep_k_latest_wo_anykeep_k_latest_wo_reasoningfold
summary_semantic uses summary_agent_semantic_based.py and requires an additional classifier LLM to decide when summarization should be triggered. Configure it with:
export CLASSIFIER_API_BASE="https://your-openai-compatible-classifier-endpoint/v1"
export CLASSIFIER_API_KEY="your-classifier-api-key"
export CLASSIFIER_MODEL_NAME="gpt-oss-120b"In our experiments, we use gpt-oss-120b as the classifier LLM.
Use the analysis utilities on generated trajectory files:
terminal_state.py is used to classify the terminal state of each trajectory.
struggle_behavior.py is used to analyze the struggle pattern of agent trajectories at the process level.
export OPENAI_MODEL="your-llm-for-analysis"
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
python ContextRot/websearch/analysis/terminal_state.py \
--input /path/to/output.jsonl \
--agent-type react
python ContextRot/websearch/analysis/struggle_behavior.py \
--input /path/to/output.jsonl \
--label-csv /path/to/terminal_state.csvterminal_state.py needs the tokenizer path configured in the Web Search setup section, and OPENAI_MODEL / OPENAI_API_KEY / OPENAI_BASE_URL for the voting-based labeler. Set --agent-type to the agent used for generation. struggle_behavior.py uses the same LLM environment variables plus a label CSV produced by terminal_state.py, and generates a CSV file. In our experiments, we use gpt-oss-120b as the analysis LLM.
-
Create an environment
conda create -n foldagent python=3.11 -y conda activate foldagent
-
Install dependencies
pip install \ accelerate codetiming datasets dill hydra-core liger-kernel "numpy<2.0.0" \ pandas peft "pyarrow>=19.0.0" pybind11 pylatexenc pre-commit "ray[default]" \ "tensordict>=0.8.0,<=0.10.0,!=0.9.0" torchdata transformers wandb \ "packaging>=20.0" uvicorn fastapi latex2sympy2_extended math_verify tensorboard openai
-
Start the local search server
Start this on the machine that hosts retrieval:
cd ContextRot/localsearch/src/envs python search_server.py \ --model Qwen/Qwen3-Embedding-8B \ --corpus Tevatron/browsecomp-plus-corpus \ --corpus-embedding-dataset miaolu3/browsecomp-plus \ --host 0.0.0.0 \ --port 8000 -
Configure evaluation workers
export LOCAL_SEARCH_URL="http://<search-server-ip>:8000" export OPENAI_API_KEY="your-api-key" export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1" export TOKENIZER_PATH="/path/to/tokenizer"
Run evaluation from localsearch/:
cd ContextRot/localsearch
python src/eval_bc.py \
--dataset browsecomp-plus \
--model_name your-model-name \
--workflow fold \
--num_workers 32 \
--context_length 131072 \
--max_turn 100 \
--max_session 100Current Local Search workflows:
reactdiscardsummary_semanticsummary_lengthsummary_turnkeep_k_latestkeep_k_latest_wo_anykeep_k_latest_wo_reasoningfold
Use the analysis utilities on generated trajectory files:
terminal_state.py classifies the terminal state of each trajectory.
struggle_score.py analyzes struggle patterns in agent trajectories at the process level.
export TOKENIZER_PATH="/path/to/tokenizer"
export OPENAI_MODEL="your-llm-for-analysis"
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
python ContextRot/localsearch/analysis/terminal_state.py \
--json /path/to/output.json \
--agent-type react
python ContextRot/localsearch/analysis/struggle_score.py \
--json /path/to/output.json \
--label-csv /path/to/terminal_state.csvWe sincerely thank the authors of FoldAgent and DeepResearch. The Local Search code in this repository is based on FoldAgent, and the Web Search code is based on DeepResearch.
Please cite the paper if this repository or the paper is helpful to your work.
@misc{xia2026diagnosingmitigatingcontextrot,
title={Diagnosing and Mitigating Context Rot in Long-horizon Search},
author={Shijie Xia and Yikun Wang and Zhen Huang and Pengfei Liu},
year={2026},
eprint={2606.29718},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2606.29718},
}