Skip to content

Fix client-disconnect session leaks in PyTorch MP engine#4655

Open
grimoire wants to merge 3 commits into
InternLM:mainfrom
grimoire:fix-session-end
Open

Fix client-disconnect session leaks in PyTorch MP engine#4655
grimoire wants to merge 3 commits into
InternLM:mainfrom
grimoire:fix-session-end

Conversation

@grimoire

@grimoire grimoire commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fix session leaks when API clients disconnect or Ctrl+C during PyTorch MP serving.

This patch makes MP stream startup cancellation-safe and makes serve-side terminal session cleanup idempotent, so dropped requests no longer leave backend sessions or API session mappings alive forever.

Bug to fix

client Ctrl+C
  -> HTTP streaming task cancelled
  -> safe_run starts cleanup
  -> async_end waits for MP stream init event
  -> stream init event may never be set if cancellation happened before stream id reply
  -> END_SESSION is not sent
  -> backend session remains alive

Changes

  • Add MP stream startup barrier via SessionState.init_done.
  • Ensure async_end() waits until backend ADD_MESSAGE has been enqueued before sending END_SESSION.
  • Make ZMQ and Ray MP streaming startup cancellation-safe with shielded startup tasks.
  • Add abandoned stream cleanup for ZMQ/Ray stream tasks.
  • Make serve/API session cleanup idempotent on normal completion, cancellation, prompt errors, and client disconnect.
  • Add ZMQ backend-death handling so pending RPC futures do not hang forever.
  • Keep expected client disconnect logs at INFO; real exception paths still log tracebacks.

Requirement

Copilot AI review requested due to automatic review settings June 6, 2026 16:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets leaked backend sessions and stale session mappings when API clients disconnect (or cancel) during PyTorch MP serving, by making MP streaming startup cancellation-safe and making serve-side cleanup idempotent.

Changes:

  • Add a stream-startup barrier (SessionState.init_done) and make MP streaming startup robust to cancellation (ZMQ/Ray), including abandoned stream cleanup and backend-death wakeups.
  • Make session cleanup idempotent across engine/generator/API wrapper paths, and tighten AsyncEngine.generate() session removal behavior under cancellations/errors.
  • Fix TP-local Q/KV head metadata usage for FlashAttention/FlashMLA and correct KV metadata handling for last-chunk spec-decode input rewriting; add regression tests.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_lmdeploy/serve/test_session_cleanup.py New regression tests for idempotent session-map cleanup across normal exit/cancel paths.
tests/pytorch/spec_decode/test_spec_agent.py Update regression test to ensure long-context KV metadata stays aligned after input rewriting.
tests/pytorch/paging/test_state_manager.py New tests for reserved state-cache row and num_state_caches=None behavior.
tests/pytorch/engine/test_zmq_rpc.py Add tests for stream-start barrier, cancel safety, backend-death wakeups, and idempotent stream output handling.
tests/pytorch/engine/test_ray_mp_engine.py Add tests for Ray stream startup cancellation and idempotent result retrieval after drop.
tests/pytorch/config/test_model_config.py New tests for TP-local head count computation and dist_config preservation.
lmdeploy/serve/openai/api_server.py Add streaming/non-streaming request wrapper that ensures generator/session cleanup on disconnect/cancel.
lmdeploy/serve/managers/session_manager.py Add request-exit-driven session removal and make SessionManager.remove() idempotent/stale-safe.
lmdeploy/serve/core/async_engine.py Make safe_run cancellation-safe; ensure session removal happens consistently on prompt/cancel/error paths.
lmdeploy/pytorch/spec_decode/spec_agent.py Keep aggregate KV metadata unchanged for last-chunk input rewriting.
lmdeploy/pytorch/paging/state_manager.py Ensure reserved state row is excluded from allocatable IDs; handle num_state_caches=None.
lmdeploy/pytorch/models/utils/cudagraph.py Use TP-local head counts for FlashAttention/FlashMLA metadata.
lmdeploy/pytorch/envs.py Add LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE env flag.
lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py Add backend-death handling, cancellation-safe stream startup, abandoned stream drop, and streaming startup barrier plumbing.
lmdeploy/pytorch/engine/mp_engine/zmq_engine.py Wire backend liveness callbacks/sentinel; make port-wait robust to early backend exit.
lmdeploy/pytorch/engine/mp_engine/ray_engine.py Add cancellation-safe Ray stream startup and abandoned-stream drop support.
lmdeploy/pytorch/engine/mp_engine/base.py Replace is_exists with init_done barrier; make async_end() wait for startup completion.
lmdeploy/pytorch/engine/mp_engine/base_worker.py Add EngineOutputGather.discard() to drop abandoned stream buffers.
lmdeploy/pytorch/config.py Add dist_config to ModelConfig; implement get_num_qkv_head_by_tp() and preserve dist_config in from_hf_config().
lmdeploy/pytorch/backends/cuda/op_backend.py Use TP-local head counts when building FlashAttention/FlashMLA metadata.
lmdeploy/pytorch/backends/cuda/graph_runner.py Add “fake capture” path to bypass actual CUDA graph capture for debugging/padding behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lmdeploy/pytorch/engine/mp_engine/ray_engine.py
Comment thread lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py
Comment thread lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py
Comment thread lmdeploy/pytorch/engine/mp_engine/ray_engine.py
Comment thread lmdeploy/serve/openai/api_server.py Outdated
@grimoire grimoire force-pushed the fix-session-end branch 2 times, most recently from 9de66db to 38c9bc7 Compare June 8, 2026 08:45
@lvhan028 lvhan028 added the Bug:P0 label Jun 8, 2026
Comment on lines +557 to +559
except (asyncio.CancelledError, GeneratorExit):
remove_session_once()
raise

@lvhan028 lvhan028 Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be merged to except Exception branch for the sake of metrics_processor

@grimoire grimoire Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I kept the CancelledError/GeneratorExit branch separate from the generic Exception branch so cancellation still propagates correctly instead of being converted into an error GenOut.

But the metrics concern is valid: generate() has already incremented num_total_reqs, so prompt-processing cancellation should also update failed metrics. I added metrics_processor.increase_failed_requests('cancel') before cleanup/re-raise, and added a unit test to verify num_total_reqs, num_cancelled_reqs, and num_uncompleted_reqs stay balanced after prompt cancellation.

Powered by codex

Comment thread lmdeploy/serve/openai/api_server.py
Comment thread lmdeploy/serve/core/async_engine.py
grimoire added 2 commits June 8, 2026 20:59
# Conflicts:
#	lmdeploy/serve/anthropic/endpoints/messages.py
#	tests/test_lmdeploy/serve/anthropic/test_endpoints.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants