bridge: add guest-side reconnect loop for live migration#2698

Open
shreyanshjain7174 wants to merge 1 commit into microsoft:main from shreyanshjain7174:bridge-reconnect-v2

Conversation

@shreyanshjain7174
Contributor

Fixes #2669

Problem

During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover — ListenAndServe returns with an I/O error, and the GCS has no way to re-establish communication with the new host. This leaves the guest unable to process any further container lifecycle operations after migration.

What this does

Adds a reconnect loop around the bridge lifecycle in cmd/gcs/main.go. When the bridge connection drops (detected by ListenAndServe returning), the GCS re-dials the host on the vsock command port and creates a fresh Bridge + Mux. The Host state (containers, processes, cgroups) persists across reconnections since it lives outside the bridge.

A Publisher is added to internal/guest/bridge/publisher.go to solve the goroutine lifetime mismatch: container wait goroutines are spawned during CreateContainer and outlive the bridge that created them. When a container exits, its wait goroutine calls Publisher.Publish() which routes the notification through whichever bridge is currently active. If no bridge is connected (during the reconnect gap), the notification is dropped — the host-side shim recovers by re-querying container state after reconnecting.

The defer ordering in ListenAndServe is fixed so quitChan closes before responseChan becomes invalid, preventing a panic when PublishNotification races with bridge teardown. responseChan is buffered to absorb in-flight responses during shutdown.

Design

The approach follows the existing runWithRestartMonitor pattern already used in cmd/gcs/main.go for chronyd — a loop with exponential backoff that retries forever.

Key design decisions:

  • Bridge is ephemeral: a fresh Bridge + Mux is created per iteration. Channels and handler closures are scoped to each ListenAndServe call, so there is no stale state to reset.
  • Host persists: hcsv2.Host holds containers in mutex-guarded maps. It is created once and reused across all bridge iterations. Container state survives the bridge drop.
  • Publisher drops, doesn't queue: if a container exits during the reconnect gap (~seconds), the notification is dropped. This is safe because the host-side shim calls WaitForProcess which blocks on the container's actual exit status — the notification is a convenience, not the source of truth.
  • Graceful shutdown preserved: if the host sends a shutdown request, ShutdownRequested() returns true and the loop breaks instead of reconnecting.

Reference: the reconnect concept comes from Kevin Parsons' live migration POC. This implementation pares the POC down to the minimum: just the main loop plus the Publisher (~90 lines of new code).

Changes

| File | Change |
| --- | --- |
| `cmd/gcs/main.go` | Reconnect `for {}` loop with exponential backoff (1s–30s, retry forever) |
| `internal/guest/bridge/bridge.go` | `Publisher` field on `Bridge`, `ShutdownRequested()` method, fixed defer ordering, buffered `responseChan`, priority-select guard in `PublishNotification` |
| `internal/guest/bridge/bridge_v2.go` | Container wait goroutine always uses `Publisher.Publish()` |
| `internal/guest/bridge/publisher.go` | New: mutex-guarded bridge reference swap (~40 lines) |
| `internal/guest/bridge/publisher_test.go` | Tests for `Publisher` nil-bridge drop and bridge-set-publish |

Testing

Tested on a two-node Hyper-V live migration setup using the TwoNodeInfra test module:

  • `Invoke-FullLmTestCycle -Verbose`: deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, and verifies 100% completion on both nodes. The container `lcow-test` migrated successfully with its pod sandbox intact.
  • Post-migration `crictl exec`: created a fresh LCOW pod on Node_1 with the custom GCS (deployed via `rootfs.vhd`), started a container that writes a file, and exec'd `cat /tmp/dummy.txt` to verify bridge communication works end-to-end.
  • `go build`, `go vet`, and `gofmt` run clean on all modified packages.

During live migration the vsock connection between the host and the GCS
breaks when the VM moves to the destination node. The GCS bridge drops
and cannot recover, leaving the guest unable to communicate with the
new host.

This adds a reconnect loop in cmd/gcs/main.go that re-dials the bridge
after a connection loss. On each iteration a fresh Bridge and Mux are
created while the Host state (containers, processes) persists across
reconnections.

A Publisher abstraction is added to bridge/publisher.go so that container
wait goroutines spawned during CreateContainer can route exit notifications
through the current bridge. When the bridge is down between reconnect
iterations, notifications are dropped with a warning — the host-side shim
re-queries container state after reconnecting.

The defer ordering in ListenAndServe is fixed so that quitChan closes
before responseChan becomes invalid, and responseChan is buffered to
prevent PublishNotification from panicking on a dead bridge.

Tested with Invoke-FullLmTestCycle on a two-node Hyper-V live migration
setup (Node_1 -> Node_2). Migration completes at 100% and container
exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
@shreyanshjain7174 shreyanshjain7174 marked this pull request as ready for review April 21, 2026 17:28
@shreyanshjain7174 shreyanshjain7174 requested a review from a team as a code owner April 21, 2026 17:28
