Firecracker snapshots: implementation + benchmark harness (impl of #376) #381

Merged
tinder-maxwellelliott merged 6 commits into master from claude/confident-jones-ec04f9
Jun 22, 2026

@tinder-maxwellelliott

Implements the Firecracker snapshot design from the RFC in #376: instant starts of bazel-diff on large monorepos by restoring a microVM whose Bazel server already has the build graph loaded and external repos fetched, so the PR-time path only re-analyses changed packages.

What's here

CLI hooks (//cli, Kotlin — RFC §4, Phase 1)

  • bazel-diff fingerprint — snapshot cache key over the inputs that affect the build graph (bazel version, MODULE.bazel.lock, .bazelrc, bazel-diff version, query-affecting flag set). FingerprintInteractor is pure + unit-tested.
  • bazel-diff warmup — record-side entrypoint: generate-hashes for the base revision + writes base_hashes.json/fingerprint.json + clean-exit "safe to snapshot" contract. Extends GenerateHashesCommand, so baked base hashes are byte-identical to a cold run.
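
A minimal sketch of how a cache key over those inputs could be derived. It is written in Go for illustration only: the real FingerprintInteractor is Kotlin, and the map keys, sample values, and hashing scheme below are assumptions, not the PR's implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint returns a hex SHA-256 over the given inputs.
// Keys are visited in sorted order so map iteration order cannot change the
// result, and each key and value is length-prefixed so adjacent fields
// cannot be confused with one another.
func fingerprint(inputs map[string]string) string {
	keys := make([]string, 0, len(inputs))
	for k := range inputs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		v := inputs[k]
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(v), v)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Hypothetical values; only the input categories come from the PR text.
	inputs := map[string]string{
		"bazel_version":      "8.0.0",
		"module_lock_sha256": "abc123",
		"bazelrc_sha256":     "def456",
		"bazel_diff_version": "0.0.0-example",
		"query_flags":        "--keep_going",
	}
	before := fingerprint(inputs)
	inputs["bazelrc_sha256"] = "changed"
	after := fingerprint(inputs)
	// A changed .bazelrc must yield a different snapshot key.
	fmt.Println(before != after)
}
```

Keeping the function free of I/O, as here, is what makes this kind of logic easy to unit-test.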

Go orchestrator (tools/firecracker/, bazel-diff-snap)

  • Dependency-free (stdlib only): Firecracker REST API over a unix socket → static CI binary, no module downloads.
  • record / consume; consume is fail-safe — fingerprint match + nearest-ancestor resolution, exit 2 → cold fallback. A stale snapshot is never silently trusted (RFC §5.2).
  • Two drivers: local (no VM, runs anywhere — used by unit tests + portable cold/warm proxy) and firecracker (real microVM via the API + virtio-net/TAP; Linux + /dev/kvm).
  • Store layout, resolver, fingerprint match, and the API client are unit-tested.

Benchmark + validation harness (tools/firecracker/bench/)

  • gen_project.py — synthetic large-Bazel-project generator (layered genrule DAG, bounded depth, no external toolchains, two git revisions).
  • bench.py — cold-vs-warm analysis-time benchmark; asserts warm output is byte-identical to cold.
  • Dockerfile + scripts to run it on Linux at scale.
  • build_guest_image.sh / setup_tap.sh — build the guest rootfs + kernel and the host TAP for the firecracker driver.
  • .github/workflows/firecracker-e2e.yml — the RFC §5.3 correctness canary (snapshot-consumed vs cold equality) on a Linux+KVM runner.

Validation

  • Analysis-time win (cold vs warm), Linux, ~149.5k targets: cold consume 11.7s vs warm 4.9s (58% faster), warm output byte-identical to cold. On a real monorepo (bzlmod resolution + minutes of cold start) the absolute win is far larger.
  • §5.3 correctness canary, real KVM (EC2 c6g.metal): full record → snapshot → restore → consume passed (--- PASS ... 84.8s); firecracker-consumed impacted set == cold/local set (522 == 522 targets).
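
For reference, the 58% figure is the reduction in wall time relative to cold, (11.7 - 4.9) / 11.7, which corresponds to roughly a 2.4x speedup. The sketch below reproduces that arithmetic and the byte-identity check; the function names are illustrative and this is not the code in bench.py.

```go
package main

import (
	"bytes"
	"fmt"
)

// savedPercent reports how much shorter warm is than cold, as a percentage
// of the cold time.
func savedPercent(cold, warm float64) float64 {
	return (cold - warm) / cold * 100
}

// sameOutput reports whether the cold and warm outputs are byte-identical.
func sameOutput(cold, warm []byte) bool {
	return bytes.Equal(cold, warm)
}

func main() {
	cold, warm := 11.7, 4.9 // seconds, from the validation run above
	fmt.Printf("%.0f%% less wall time, %.1fx speedup\n",
		savedPercent(cold, warm), cold/warm)

	// Placeholder payloads standing in for the two impacted-target outputs.
	a := []byte("//pkg:a\n//pkg:b\n")
	b := []byte("//pkg:a\n//pkg:b\n")
	fmt.Println("byte-identical:", sameOutput(a, b))
}
```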

Notes for reviewers

  • The firecracker driver requires Linux + /dev/kvm (bare metal or a host with nested virt). The local driver + the cold/warm benchmark run anywhere (incl. macOS).
  • The record step's in-guest warmup needs guest network egress to fetch BCR deps; the EC2 run used host NAT + a guest DNS resolver. firecracker-e2e.yml will need the same (NAT/DNS) — or a pre-seeded bazel repo cache baked into the image — to pass fully unattended. Happy to fold that in as a follow-up.
  • Deliberate deviation from the RFC: the orchestrator speaks the Firecracker REST API directly over the unix socket (stdlib net/http) instead of firecracker-go-sdk, to keep a zero-dependency static binary.

🤖 Generated with Claude Code

tinder-maxwellelliott and others added 6 commits June 18, 2026 11:32
Implements + validates the Firecracker snapshot design (PR #376):

CLI hooks (//cli, RFC Phase 1):
- `fingerprint` subcommand + FingerprintInteractor (pure, unit-tested):
  snapshot cache key over bazel version, MODULE.bazel.lock, .bazelrc,
  bazel-diff version, and the query-affecting flag set.
- `warmup` subcommand: record-side entrypoint = generate-hashes for the base
  revision + writes base_hashes.json/fingerprint.json + clean-exit contract.
  Extends GenerateHashesCommand so base hashes are byte-identical to a cold run.

Go orchestrator (tools/firecracker/, stdlib-only static binary):
- `bazel-diff-snap record|consume`; Firecracker REST API over a unix socket.
- local driver (no VM, runs anywhere) + firecracker driver (Linux+KVM).
- consume is fail-safe: fingerprint match + nearest-ancestor resolution,
  exit 2 -> cold fallback. Pure logic + API client unit-tested.

Benchmark harness (tools/firecracker/bench/):
- gen_project.py: synthetic large-Bazel-project generator (layered genrule DAG,
  bounded depth, no external toolchains, two git revisions).
- bench.py: cold-vs-warm analysis-time benchmark; asserts warm == cold.
- Dockerfile + scripts to run on Linux at scale.

Validated on Linux at ~149.5k targets: cold consume 11.7s vs warm 4.9s (58%
faster), warm output byte-identical to cold; orchestrator local record/consume
produced a bounded impacted-target cone.

Real microVM record/consume requires Linux + /dev/kvm (the self-hosted CI host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Close the firecracker driver's networking gap and add an end-to-end
validation path so the snapshot flow can be proven on real KVM.

Driver:
- fcapi: add networkInterface payload + addNetworkInterface
  (PUT /network-interfaces/{id}).
- fcDriver: attach a TAP-backed virtio-net NIC before InstanceStart, bake a
  static `ip=` directive into the kernel cmdline (matches the guest image's
  MAC->IP fcnet convention), and check the host TAP exists before a restore.
- main: --tap-device/--guest-ip/--host-ip/--netmask/--guest-mac flags.

Bug fixes in the previously-unrun driver:
- bootArgs now passes `root=/dev/vda rw` (Firecracker does not synthesize it),
  so a disk-backed guest actually boots.
- record now checks out the base SHA in the guest before warmup, mirroring
  localDriver.record and consume, so baked base hashes are for the base rev.

Image + host setup:
- bench/build_guest_image.sh: build kernel + rootfs.base.ext4 with JDK, bazel,
  git, bazel-diff, the workspace, and a standalone (non-socket-activated) sshd
  that survives snapshot restore.
- bench/setup_tap.sh: privileged host TAP setup (driver stays privilege-free).

Validation:
- fc_integration_test.go (build tag `fcintegration`): drives fcDriver
  record+consume against a real microVM, env-configured.
- .github/workflows/firecracker-e2e.yml: workflow_dispatch job that builds the
  image, boots a real microVM on an x86_64 + /dev/kvm runner, and asserts the
  snapshot-consumed impacted set is byte-identical to the cold/local set
  (RFC §5.3).
- README: networking flags, build/setup docs, and validation notes — incl. the
  known aarch64 16KB-page-host post-restore userspace freeze.

Unit-tested (fcapi network call, netConfig boot args, ensureTapExists);
go test + vet clean for both default and fcintegration tag sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Found while running the §5.3 canary on real Linux (in a nested-virt VM): the
guest image build and the firecracker driver had several real bugs that would
also bite on a Linux+KVM CI host. Fixes:

build_guest_image.sh (the firecracker-ci minimized base exposed these):
- run apt in the chroot with the sandbox off + create /tmp, apt spool/log dirs
  (else "Couldn't create temporary file" / repos "not signed")
- mount /dev/pts in the chroot (JDK postinst calls posix_openpt)
- create /usr/share/man/manN (JDK update-alternatives man symlinks)
- chown the baked /work to root (git "detected dubious ownership" -> exit 128)
- actually switch sshd from socket-activation to a standalone always-on
  ssh.service (the README claimed this but the script never did it; socket sshd
  doesn't reliably serve connections after a snapshot restore)

driver_firecracker.go:
- add waitForGuest() and poll guest ssh after instanceStart (record) and after
  snapshot resume (consume) before issuing commands — the driver previously
  raced the guest's boot / resume and the first ssh would fail.

fc_integration_test.go:
- ssh ConnectTimeout + BatchMode so the readiness poll fails fast and never
  hangs on a prompt.

With these, the canary boots -> NIC/TAP -> standalone sshd -> guest exec ->
git checkout -> java all work. (Local nested-virt runs are then gated only by
~70x JVM slowdown under Apple's L2 nested virtualization, which does not exist
on a bare-metal Linux+KVM host; bazel's server can't start within its timeout
in the nested guest. Run the canary on real hardware via firecracker-e2e.yml.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The HTTP client used a 60s timeout for all Firecracker API calls, but
/snapshot/create and /snapshot/load dump/load the guest's full memory to/from
disk and take well over a minute for a multi-GB VM. On a real KVM host the
canary's record step hit "PUT /snapshot/create: context deadline exceeded"
mid-write. Raise the timeout to 15m (other calls are instant over the socket).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Go orchestrator (tools/firecracker): 30% -> 66% statement coverage.
- Extract the in-guest command builders (warmupCommand/consumeScript) and make
  waitForGuest's poll interval injectable, so the record/consume logic is unit-
  testable without a microVM.
- Add tests: main.go runRecord/runConsume end-to-end via the local driver with a
  fake bazel-diff (+ makeDriver branches, multiFlag, arg validation, cold
  fallback); store newEntry/writeMetadata/path accessors/mustAbs; readBazelLabel;
  fingerprint error paths; fcClient resume; driver helpers (copyFile, baseRootfs,
  netConfig, waitForGuest, sshGuest args, boot error, teardown).
- Fix TestEnsureTapExists to skip the lo check off-Linux (filepath.Glob returns a
  nil error on no match, so it was wrongly running + failing on macOS).
  Remaining 0% is the genuinely VM/ssh-bound record/consume/boot path, covered by
  the fcintegration canary.

Kotlin CLI hooks (//cli, protects the >=90% bazel coverage gate):
- Extract WarmupCommand.writeFingerprint() out of call() so the fingerprint
  emission is testable without the bazel-backed generate-hashes run.
- Add FingerprintGathererTest, FingerprintCommandTest, WarmupCommandTest.
- New-file coverage: FingerprintInteractor 100%, FingerprintCommand 91%,
  FingerprintGatherer 92%, WarmupCommand 76% (remainder is super.call()).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tinder-maxwellelliott merged commit dfa0a8a into master Jun 22, 2026
15 checks passed
tinder-maxwellelliott deleted the claude/confident-jones-ec04f9 branch June 22, 2026 19:03