
Improve engine health monitoring and wakeup scheduling #4645

Open
lvhan028 wants to merge 1 commit into InternLM:main from lvhan028:fix-health



lvhan028 (Collaborator) commented Jun 4, 2026

Motivation

  • Health checks were too aggressive for PyTorch MP/Ray deployments: short probe timeouts and overlapping polls caused false unhealthy results while the engine was busy or recovering from sleep/wakeup.
  • Synchronous POST /wakeup also blocked the FastAPI event loop during long warmup, starving background health probes.
  • Scheduler tick advanced only after a full forward completed, so long prefills looked stalled to health logic.

Modification

lmdeploy/serve/core/health.py

  • Add env overrides: LMDEPLOY_HEALTH_POLL_INTERVAL, LMDEPLOY_HEALTH_PROBE_TIMEOUT, LMDEPLOY_HEALTH_UNHEALTHY_AFTER.
  • Change defaults to 12s poll interval, 10s probe timeout, 90s unhealthy/stale threshold.
  • Warn if poll_interval <= probe_timeout.
  • On pending probe result: log and skip snapshot update (keep last successful state).
  • Expire stale probes only when last snapshot was healthy (not sleeping).
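The health.py changes listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `HealthConfig`, `load_health_config`, `env_float`, and `apply_probe_result` are hypothetical names, and only the environment variable names and default values come from the description.

```python
import logging
import os
from dataclasses import dataclass

logger = logging.getLogger('health_sketch')


def env_float(name: str, default: float) -> float:
    """Return the float stored in env var ``name``, or ``default`` if unset or invalid."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default


@dataclass
class HealthConfig:  # hypothetical container, not taken from the PR
    poll_interval: float
    probe_timeout: float
    unhealthy_after: float


def load_health_config() -> HealthConfig:
    """Build the config from the defaults in the PR description plus env overrides."""
    cfg = HealthConfig(
        poll_interval=env_float('LMDEPLOY_HEALTH_POLL_INTERVAL', 12.0),
        probe_timeout=env_float('LMDEPLOY_HEALTH_PROBE_TIMEOUT', 10.0),
        unhealthy_after=env_float('LMDEPLOY_HEALTH_UNHEALTHY_AFTER', 90.0),
    )
    # A poll interval no longer than the probe timeout lets probes overlap.
    if cfg.poll_interval <= cfg.probe_timeout:
        logger.warning('poll_interval (%s) <= probe_timeout (%s); probes may overlap',
                       cfg.poll_interval, cfg.probe_timeout)
    return cfg


def apply_probe_result(last_snapshot: dict, result: dict) -> dict:
    """Keep the last snapshot while a probe is pending; otherwise adopt the new result."""
    if result.get('status') == 'pending':
        logger.info('health probe still pending; keeping last snapshot')
        return last_snapshot
    return result
```

With no overrides set, `load_health_config()` yields the 12s / 10s / 90s defaults, and a `pending` result leaves the previous snapshot untouched.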

lmdeploy/serve/core/async_engine.py

  • Return status='pending' instead of unhealthy when a previous health probe is still running.
  • Make wakeup() async and run engine.wakeup() via asyncio.to_thread() to avoid blocking the API event loop.
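To illustrate why offloading the blocking call matters, here is a small self-contained sketch. `FakeBackend`, `heartbeat`, and `demo` are hypothetical stand-ins (the real engine and health monitor are not reproduced here); only the use of `asyncio.to_thread()` mirrors the change described above.

```python
import asyncio
import time


class FakeBackend:
    """Hypothetical stand-in for a backend whose wakeup() blocks during warmup."""

    def wakeup(self, tags=None):
        time.sleep(0.2)  # simulate a long, blocking warmup
        return tags


async def wakeup(backend, tags=None):
    """Run the blocking wakeup in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(backend.wakeup, tags)


async def heartbeat(counter: list, stop: asyncio.Event):
    """Stand-in for a background health probe that needs loop time to run."""
    while not stop.is_set():
        counter[0] += 1
        await asyncio.sleep(0.01)


async def demo() -> int:
    stop = asyncio.Event()
    counter = [0]
    task = asyncio.create_task(heartbeat(counter, stop))
    await wakeup(FakeBackend(), ['weights'])  # 'weights' is an illustrative tag
    stop.set()
    await task
    return counter[0]


if __name__ == '__main__':
    print('heartbeats during wakeup:', asyncio.run(demo()))
```

If `wakeup()` called `backend.wakeup(tags)` directly instead, the heartbeat task would not run at all until the warmup returned.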

lmdeploy/serve/openai/api_server.py

  • await async wakeup() in the /wakeup handler.

lmdeploy/pytorch/engine/inputs_maker.py
lmdeploy/pytorch/engine/engine_loop.py
lmdeploy/pytorch/paging/scheduler.py

  • Call scheduler.tick() after each forward_async() (one tick per forward dispatch).
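The intended tick semantics can be sketched as below, assuming a scheduler that simply counts ticks. `SchedulerSketch` and `DispatchSketch` are hypothetical classes; only the names `tick()`, `scheduler_tick`, and `forward_async()` come from the PR text, and their real signatures may differ.

```python
import asyncio


class SchedulerSketch:
    """Hypothetical minimal scheduler exposing a progress counter."""

    def __init__(self):
        self.scheduler_tick = 0

    def tick(self):
        # One step per forward dispatch.
        self.scheduler_tick += 1


class DispatchSketch:
    """Hypothetical stand-in for the inputs dispatch path."""

    def __init__(self, scheduler: SchedulerSketch):
        self.scheduler = scheduler

    async def forward_async(self, inputs):
        # Placeholder for handing inputs to the executor; it returns once the
        # work is dispatched, not once the forward pass has finished.
        await asyncio.sleep(0)

    async def send_next_inputs(self, inputs):
        await self.forward_async(inputs)
        # Tick right after dispatch so health logic sees progress
        # even while a long prefill is still executing.
        self.scheduler.tick()


async def demo() -> int:
    scheduler = SchedulerSketch()
    dispatcher = DispatchSketch(scheduler)
    for batch in ([1, 2], [3], [4, 5, 6]):
        await dispatcher.send_next_inputs(batch)
    return scheduler.scheduler_tick
```

Three dispatches advance the counter three times, regardless of how long each forward pass later takes to complete.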

Copilot AI review requested due to automatic review settings June 4, 2026 08:21
@lvhan028 lvhan028 requested a review from RunningLeon June 4, 2026 08:21
@lvhan028 lvhan028 added the Bug:P0 label Jun 4, 2026

Copilot AI (Contributor) left a comment

Pull request overview

This PR tunes engine health monitoring and scheduling progress reporting to reduce false “unhealthy” states during long/overlapping probes and long prefills, and makes the /wakeup endpoint non-blocking for the FastAPI event loop.

Changes:

  • Add environment-variable overrides and updated defaults for health monitor polling/timeout/staleness behavior, including “pending” probe handling.
  • Make AsyncEngine.wakeup() asynchronous by offloading the blocking backend wakeup to a worker thread; update the OpenAI-compatible /wakeup route to await it.
  • Advance scheduler_tick once per forward_async() dispatch (moved to inputs dispatch path) so health logic sees progress even during long prefills.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| lmdeploy/serve/openai/api_server.py | Await the now-async engine wakeup() in the /wakeup handler. |
| lmdeploy/serve/core/health.py | Add env overrides and adjust probe/poll/staleness logic, including skipping snapshot updates on pending probes. |
| lmdeploy/serve/core/async_engine.py | Return pending when probes overlap; make wakeup() async via asyncio.to_thread(). |
| lmdeploy/pytorch/paging/scheduler.py | Clarify tick() semantics as one step per forward dispatch. |
| lmdeploy/pytorch/engine/inputs_maker.py | Call scheduler.tick() after each forward_async() dispatch. |
| lmdeploy/pytorch/engine/engine_loop.py | Remove the old scheduler.tick() location to avoid double-counting / wrong timing. |


Comment on lines +25 to +33
def _env_override_float(env_var: str, value: float) -> float:
    """Return ``value`` unless ``env_var`` is set, then parse and return it."""
    env_value = os.getenv(env_var)
    if env_value is None:
        return value
    try:
        return float(env_value)
    except ValueError:
        return value
Comment thread lmdeploy/serve/core/async_engine.py
         await self.engine.sleep(level)

-    def wakeup(self, tags: list[str] | None = None):
+    async def wakeup(self, tags: list[str] | None = None):
             logger.warning(f'some tag in {tags} not in sleeping tags {self.sleeping_tags}')
             return
-        self.engine.wakeup(tags)
+        await asyncio.to_thread(self.engine.wakeup, tags)

Collaborator


quote:

await asyncio.to_thread(self.engine.wakeup, tags) runs PyTorch Engine.wakeup() off the event-loop thread. That path calls EngineLoop.resume_from_sleep() (lmdeploy/pytorch/engine/engine.py:553), which sets loop-owned asyncio.Events (lmdeploy/pytorch/engine/engine_loop.py:181-188). Setting an asyncio.Event from another thread is not thread-safe: it can leave the engine loop stuck after /wakeup, or fail under asyncio debug mode. I’d split the blocking backend wakeup/warmup from the event-loop resume, or resume via loop.call_soon_threadsafe().
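A minimal sketch of the second suggestion, assuming the blocking work runs in a worker thread while the `asyncio.Event` belongs to the event loop. `blocking_wakeup` and `demo` are hypothetical names and do not reflect the actual `Engine.wakeup()` or `EngineLoop.resume_from_sleep()` code.

```python
import asyncio
import time


def blocking_wakeup(loop: asyncio.AbstractEventLoop, resumed: asyncio.Event):
    """Runs in a worker thread: do the blocking work, then signal the loop safely."""
    time.sleep(0.05)  # stand-in for the blocking backend wakeup/warmup
    # asyncio.Event is not thread-safe, so do not call resumed.set() here.
    # Instead, ask the owning loop to run it on its own thread.
    loop.call_soon_threadsafe(resumed.set)


async def demo() -> bool:
    loop = asyncio.get_running_loop()
    resumed = asyncio.Event()  # loop-owned event, as in the engine loop
    await asyncio.to_thread(blocking_wakeup, loop, resumed)
    # The waiter is woken by the loop itself, not by the worker thread.
    await asyncio.wait_for(resumed.wait(), timeout=1.0)
    return resumed.is_set()
```

The first suggestion (splitting the two phases) would look similar: await the blocking part via `asyncio.to_thread()`, then call `resumed.set()` directly from the coroutine once control is back on the event-loop thread.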

