Firecracker snapshots: implementation + benchmark harness (impl of #376) #381
Merged
Implements + validates the Firecracker snapshot design (PR #376):
CLI hooks (//cli, RFC Phase 1):
- `fingerprint` subcommand + FingerprintInteractor (pure, unit-tested): snapshot cache key over bazel version, MODULE.bazel.lock, .bazelrc, bazel-diff version, and the query-affecting flag set.
- `warmup` subcommand: record-side entrypoint = generate-hashes for the base revision + writes base_hashes.json/fingerprint.json + clean-exit contract. Extends GenerateHashesCommand so base hashes are byte-identical to a cold run.
Go orchestrator (tools/firecracker/, stdlib-only static binary):
- `bazel-diff-snap record|consume`; Firecracker REST API over a unix socket.
- local driver (no VM, runs anywhere) + firecracker driver (Linux+KVM).
- consume is fail-safe: fingerprint match + nearest-ancestor resolution, exit 2 -> cold fallback. Pure logic + API client unit-tested.
Benchmark harness (tools/firecracker/bench/):
- gen_project.py: synthetic large-Bazel-project generator (layered genrule DAG, bounded depth, no external toolchains, two git revisions).
- bench.py: cold-vs-warm analysis-time benchmark; asserts warm == cold.
- Dockerfile + scripts to run on Linux at scale.
Validated on Linux at ~149.5k targets: cold consume 11.7s vs warm 4.9s (58% faster), warm output byte-identical to cold; orchestrator local record/consume produced a bounded impacted-target cone. Real microVM record/consume requires Linux + /dev/kvm (the self-hosted CI host).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
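The cache-key and fail-safe lookup described in this commit can be sketched roughly as follows. This is an illustrative Go sketch, not the PR's code: the real fingerprint lives in the Kotlin `FingerprintInteractor`, and the names `fingerprint` and `resolve`, the hashing scheme, the input values, and the choice to skip ancestors whose snapshot fingerprint differs are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint hashes named inputs in sorted key order so the result does not
// depend on map iteration order. Keys and values are length-prefixed so that
// adjacent fields cannot be confused with each other. (Hypothetical scheme.)
func fingerprint(inputs map[string]string) string {
	keys := make([]string, 0, len(inputs))
	for k := range inputs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%d:%s=%d:%s\n", len(k), k, len(inputs[k]), inputs[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// resolve walks ancestors nearest-first and returns the first commit whose
// recorded snapshot fingerprint equals want. ok == false means "no usable
// snapshot": the caller should exit 2 and fall back to a cold run.
func resolve(ancestors []string, snapshots map[string]string, want string) (sha string, ok bool) {
	for _, a := range ancestors {
		if fp, found := snapshots[a]; found && fp == want {
			return a, true
		}
	}
	return "", false
}

func main() {
	// All values below are made up for illustration.
	want := fingerprint(map[string]string{
		"bazel_version":      "7.4.1",
		"module_lock_sha":    "abc123",
		"bazelrc_sha":        "def456",
		"bazel_diff_version": "9.0.0",
		"flags":              "--keep_going",
	})
	snapshots := map[string]string{"c2": "stale-fingerprint", "c3": want}

	sha, ok := resolve([]string{"c1", "c2", "c3"}, snapshots, want)
	fmt.Println(sha, ok) // c3 true

	_, ok = resolve([]string{"c1"}, snapshots, want)
	fmt.Println(ok) // false, so the orchestrator would exit 2 (cold fallback)
}
```

Treating any mismatch as "no snapshot" keeps the decision one-directional: a miss only costs time on the cold path, while a stale hit could produce wrong hashes.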
Close the firecracker driver's networking gap and add an end-to-end
validation path so the snapshot flow can be proven on real KVM.
Driver:
- fcapi: add networkInterface payload + addNetworkInterface
(PUT /network-interfaces/{id}).
- fcDriver: attach a TAP-backed virtio-net NIC before InstanceStart, bake a
static `ip=` directive into the kernel cmdline (matches the guest image's
MAC->IP fcnet convention), and check the host TAP exists before a restore.
- main: --tap-device/--guest-ip/--host-ip/--netmask/--guest-mac flags.
Bug fixes in the previously-unrun driver:
- bootArgs now passes `root=/dev/vda rw` (Firecracker does not synthesize it),
so a disk-backed guest actually boots.
- record now checks out the base SHA in the guest before warmup, mirroring
localDriver.record and consume, so baked base hashes are for the base rev.
Image + host setup:
- bench/build_guest_image.sh: build kernel + rootfs.base.ext4 with JDK, bazel,
git, bazel-diff, the workspace, and a standalone (non-socket-activated) sshd
that survives snapshot restore.
- bench/setup_tap.sh: privileged host TAP setup (driver stays privilege-free).
Validation:
- fc_integration_test.go (build tag `fcintegration`): drives fcDriver
record+consume against a real microVM, env-configured.
- .github/workflows/firecracker-e2e.yml: workflow_dispatch job that builds the
image, boots a real microVM on an x86_64 + /dev/kvm runner, and asserts the
snapshot-consumed impacted set is byte-identical to the cold/local set
(RFC §5.3).
- README: networking flags, build/setup docs, and validation notes — incl. the
known aarch64 16KB-page-host post-restore userspace freeze.
Unit-tested (fcapi network call, netConfig boot args, ensureTapExists);
go test + vet clean for both default and fcintegration tag sets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Found while running the §5.3 canary on real Linux (in a nested-virt VM): the guest image build and the firecracker driver had several real bugs that would also bite on a Linux+KVM CI host. Fixes:
build_guest_image.sh (the firecracker-ci minimized base exposed these):
- run apt in the chroot with the sandbox off + create /tmp, apt spool/log dirs (else "Couldn't create temporary file" / repos "not signed")
- mount /dev/pts in the chroot (JDK postinst calls posix_openpt)
- create /usr/share/man/manN (JDK update-alternatives man symlinks)
- chown the baked /work to root (git "detected dubious ownership" -> exit 128)
- actually switch sshd from socket-activation to a standalone always-on ssh.service (the README claimed this but the script never did it; socket sshd doesn't reliably serve connections after a snapshot restore)
driver_firecracker.go:
- add waitForGuest() and poll guest ssh after instanceStart (record) and after snapshot resume (consume) before issuing commands. The driver previously raced the guest's boot / resume and the first ssh would fail.
fc_integration_test.go:
- ssh ConnectTimeout + BatchMode so the readiness poll fails fast and never hangs on a prompt.
With these, the canary boots -> NIC/TAP -> standalone sshd -> guest exec -> git checkout -> java all work. (Local nested-virt runs are then gated only by ~70x JVM slowdown under Apple's L2 nested virtualization, which does not exist on a bare-metal Linux+KVM host; bazel's server can't start within its timeout in the nested guest. Run the canary on real hardware via firecracker-e2e.yml.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The HTTP client used a 60s timeout for all Firecracker API calls, but /snapshot/create and /snapshot/load dump/load the guest's full memory to/from disk and take well over a minute for a multi-GB VM. On a real KVM host the canary's record step hit "PUT /snapshot/create: context deadline exceeded" mid-write. Raise the timeout to 15m (other calls are instant over the socket).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Go orchestrator (tools/firecracker): 30% -> 66% statement coverage.
- Extract the in-guest command builders (warmupCommand/consumeScript) and make waitForGuest's poll interval injectable, so the record/consume logic is unit-testable without a microVM.
- Add tests: main.go runRecord/runConsume end-to-end via the local driver with a fake bazel-diff (+ makeDriver branches, multiFlag, arg validation, cold fallback); store newEntry/writeMetadata/path accessors/mustAbs; readBazelLabel; fingerprint error paths; fcClient resume; driver helpers (copyFile, baseRootfs, netConfig, waitForGuest, sshGuest args, boot error, teardown).
- Fix TestEnsureTapExists to skip the lo check off-Linux (filepath.Glob returns a nil error on no match, so it was wrongly running + failing on macOS).
Remaining 0% is the genuinely VM/ssh-bound record/consume/boot path, covered by the fcintegration canary.
Kotlin CLI hooks (//cli, protects the >=90% bazel coverage gate):
- Extract WarmupCommand.writeFingerprint() out of call() so the fingerprint emission is testable without the bazel-backed generate-hashes run.
- Add FingerprintGathererTest, FingerprintCommandTest, WarmupCommandTest.
- New-file coverage: FingerprintInteractor 100%, FingerprintCommand 91%, FingerprintGatherer 92%, WarmupCommand 76% (remainder is super.call()).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements the Firecracker snapshot design from the RFC in #376 — instant starts of bazel-diff on large monorepos by restoring a microVM whose Bazel server already has the build graph loaded and external repos fetched, so the PR-time path only re-analyses changed packages.
What's here
CLI hooks (`//cli`, Kotlin; RFC §4, Phase 1)
- `bazel-diff fingerprint`: snapshot cache key over the inputs that affect the build graph (bazel version, `MODULE.bazel.lock`, `.bazelrc`, bazel-diff version, query-affecting flag set). `FingerprintInteractor` is pure + unit-tested.
- `bazel-diff warmup`: record-side entrypoint = `generate-hashes` for the base revision + writes `base_hashes.json`/`fingerprint.json` + clean-exit "safe to snapshot" contract. Extends `GenerateHashesCommand`, so baked base hashes are byte-identical to a cold run.
Go orchestrator (`tools/firecracker/`, `bazel-diff-snap`)
- `record`/`consume`; `consume` is fail-safe: fingerprint match + nearest-ancestor resolution, exit `2` → cold fallback. A stale snapshot is never silently trusted (RFC §5.2).
- `local` (no VM, runs anywhere; used by unit tests + portable cold/warm proxy) and `firecracker` (real microVM via the API + virtio-net/TAP; Linux + `/dev/kvm`).
Benchmark + validation harness (`tools/firecracker/bench/`)
- `gen_project.py`: synthetic large-Bazel-project generator (layered genrule DAG, bounded depth, no external toolchains, two git revisions).
- `bench.py`: cold-vs-warm analysis-time benchmark; asserts warm output is byte-identical to cold.
- `Dockerfile` + scripts to run it on Linux at scale.
- `build_guest_image.sh` / `setup_tap.sh`: build the guest rootfs + kernel and the host TAP for the firecracker driver.
- `.github/workflows/firecracker-e2e.yml`: the RFC §5.3 correctness canary (snapshot-consumed vs cold equality) on a Linux+KVM runner.
Validation
- `record → snapshot → restore → consume` passed (`--- PASS ... 84.8s`); firecracker-consumed impacted set == cold/local set (522 == 522 targets).
Notes for reviewers
- The `firecracker` driver requires Linux + `/dev/kvm` (bare metal or a host with nested virt). The `local` driver + the cold/warm benchmark run anywhere (incl. macOS).
- The `record` step's in-guest `warmup` needs guest network egress to fetch BCR deps; the EC2 run used host NAT + a guest DNS resolver. `firecracker-e2e.yml` will need the same (NAT/DNS), or a pre-seeded bazel repo cache baked into the image, to pass fully unattended. Happy to fold that in as a follow-up.
- Uses the Go standard library (`net/http`) instead of `firecracker-go-sdk`, to keep a zero-dependency static binary.
🤖 Generated with Claude Code