fix(agents): RPC calls no longer hang during connection churn#1746
Merged
Conversation
usePartySocket replaces the underlying socket object whenever connection
options change (async query refresh, enabled toggle, path change). The
RPC layer didn't account for that:
- a call issued against a stale `agent` reference (e.g. captured by a
mount-time effect) was buffered inside the permanently-closed old
socket and its promise never settled
- a call transmitted just before replacement lost its response — RPC
responses are connection-bound — with no rejection either
- the old socket's close event was unobservable after the swap, so
"connection closed" cleanup never ran for it
useAgent:
- call(), agent.stub, and agent.setState now route through a ref to the
live socket, so stale references keep working after a replacement.
- RPC requests are only handed to a socket once it's OPEN. Until then
they're queued by the hook and flushed on the next open — including on
a replacement socket. Queued requests were never transmitted, so
flushing can't double-execute anything server-side.
- Pending calls are tagged with the socket they were transmitted on.
When a socket closes or is replaced, only calls sent on *that* socket
are rejected ("Connection closed"); queued calls survive, and calls in
flight on a newer socket are no longer spuriously rejected by a stale
close event from an old one.
- Destination guard: queued calls only follow the connection to the same
agent instance. If the hook is re-pointed at a different address
(agent, name, basePath, or path props change) before a queued call was
transmitted, the call is rejected instead of executing against an
instance it wasn't composed for.
- Non-streaming calls get a default 30s timeout as a backstop so lost
responses reject instead of hanging. Configurable via the new
`defaultCallTimeout` option (0 disables); an explicit per-call
`timeout` always wins (`timeout: 0` opts out). Streaming calls are
exempt.
AgentClient:
- Pending calls track whether their request was actually transmitted.
On a transient disconnect, transmitted calls are rejected (their
response can never arrive) while buffered calls stay pending —
PartySocket re-sends them on reconnect. A permanent close rejects
everything.
- Same `defaultCallTimeout` backstop as the hook.
Both now log a console.warn when an RPC response arrives with no
matching pending call (e.g. after a timeout) instead of silently
discarding it.
New rpc-robustness react test suite covers mount-time calls, stale
references after replacement, queued-call flush, in-flight rejection on
replacement, the destination guard, default-timeout behavior (including
timeout: 0 and streaming exemption), and the dropped-response warning;
client.test.ts gains the AgentClient equivalents.
Fixes #1738
Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detectedLatest commit: 0238e3e The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1738.
usePartySocketreplaces the underlying socket object whenever connection options change (async query refresh,enabledtoggle, path change). The RPC layer didn't account for that, producing three ways for acall()promise to never settle:agentreference (e.g. captured by a mount-time effect — the exact pattern in the issue) was buffered inside the permanently-closed old socket and never transmitteduseAgentcall(),agent.stub, andagent.setStateroute through a ref to the live socket, so stale references keep working after a replacement.Connection closed; queued calls survive, and calls in flight on a newer socket are no longer spuriously rejected by a stale close event from an old one.agent,name,basePath, or path props change) before a queued call was transmitted, the call rejects instead of executing against an instance it wasn't composed for. Credential-only changes (query refresh) still flush normally.defaultCallTimeoutoption (0disables); an explicit per-calltimeoutalways wins (timeout: 0opts out). Streaming calls are exempt — long-lived streams are legitimate.AgentClientdefaultCallTimeoutbackstop.Both now
console.warnwhen an RPC response arrives with no matching pending call (e.g. after a timeout) instead of silently discarding it.Upstream
Companion partysocket PR (cloudflare/partykit#403) fixes the underlying buffered-message loss and unobservable close at the library level — synchronous close events,
send()returning transmit status,drainQueuedMessages(), and destination-guarded buffer transfer in the hooks. This PR is self-contained and doesn't depend on it: the hook-level queue here never relies on PartySocket's internal buffer. The agents react suite was also smoke-tested against the new partysocket build (96/96 passing), so the future dependency bump is safe.Test plan
rpc-robustness.test.tsxsuite (9 tests): mount-time call resolves; staleagentreference works after socket replacement; queued calls flush onto a replacement socket; in-flight calls on a replaced socket reject promptly; destination guard rejects queued calls when the agent address changes; default timeout fires;timeout: 0opts out; streaming calls exempt from the backstop; dropped-response warningAgentClienttests inclient.test.ts(6 tests): transmitted vs. buffered rejection on transient disconnect, buffered calls surviving reconnect, default timeout,timeout: 0, streaming exemption, dropped-response warningpnpm run checkandnx affected -t test(9 projects) greenChangeset included (
agentsminor).Made with Cursor