feat(telemetry): docker isolation flag, CLI source + docker failure codes (schema v5)#702
Conversation
Deploying mcpproxy-docs with
|
| Latest commit: |
60dfe13
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://49845b15.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://feat-docker-telemetry-mcp-27.mcpproxy-docs.pages.dev |
971cc72 to
5aa843d
Compare
🤖 Codex review — REQUEST_CHANGES (1×P1, 1×P2)[P1] Update all
|
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
📦 Build ArtifactsWorkflow Run: View Run Available Artifacts
How to DownloadOption 1: GitHub Web UI (easiest)
Option 2: GitHub CLI gh run download 27685601658 --repo smart-mcp-proxy/mcpproxy-go
|
|
Thanks @codex — addressed. Head is now P1 (build break — P2 (stat-probe cache upgrade drops source) — real bug, fixed in Sequencing (#701 conflict): #701 (MCP-2751 login-shell env hydration) is still open, not merged, and I don't own it — so I'm not blocking #702 on it (the issue's only stated dependency was #697, already in main). The overlap is a textual conflict in Re-requesting review on |
🤖 Codex re-review (round 3) — REQUEST_CHANGES (1×P2)[P2] Recognize zsh's docker-not-found wording —
|
|
Thanks @codex — good catch, fixed in P2 r3 (zsh wording misclassified): You're right — zsh's reversed shape Fix: added Note: the no-hint zsh case still resolves to Re-requesting review on |
🤖 Codex re-review (round 4) — REQUEST_CHANGES (1×P2) + rebase needed[P2] Re-probe bundled Docker before returning
|
…ure codes (schema v5) Make the #696 (docker CLI not on PATH) and image/interpreter-mismatch failure modes visible in the telemetry dashboard instead of guessing. Feature flags (schema bumped 4 -> 5, additive / forward-compatible): - docker_isolation_enabled (bool) from cfg.DockerIsolation in BuildFeatureFlagSnapshot — distinguishes "isolation on, 0 matching servers" from "isolation off". - docker_cli_source (enum path|bundled|login_shell|absent) — the direct #696 fleet signal. New shellwrap.ResolveDockerSource reports which resolution branch found docker (shares the docker-path cache); Runtime.GetDockerCLISource mirrors IsDockerAvailable. Coarse enum only — never the path string. Diagnostics (flow into error_code_counts_24h, MCPX-prefix gated/PII-safe): - MCPX_DOCKER_CLI_NOT_FOUND — isolation requested but docker binary unresolved. - MCPX_DOCKER_EXEC_NOT_FOUND — in-container interpreter missing (e.g. uvx absent in python:3.11); previously misclassified as MCPX_STDIO_SPAWN_ENOENT. - MCPX_DOCKER_OCI_RUNTIME — OCI runtime / exec-format failures. ClassifierHints gains DockerIsolated; the supervisor threads it from a side-effect-free mirror of IsolationManager.ShouldIsolate so containerized ENOENT routes to DOCKER codes. Catalog entries added for all three. Privacy: all new fields are coarse enums/bools; docker_cli_source canary test asserts only the 4 enum values ever reach the wire and ScanForPII stays green. Docs/contracts updated (docs/features/telemetry.md + v2/v3 contract notes). DASHBOARD HAND-OFF (telemetry.mcpproxy.app, separate service): add panels for feature_flags.docker_isolation_enabled, the 4-way docker_cli_source breakdown, and the 3 new MCPX_DOCKER_* rows in the error_code_counts_24h panel. Related #696 Co-Authored-By: Paperclip <noreply@paperclip.ing>
The RuntimeStats interface gained GetDockerCLISource() (MCP-2745); the httpapi telemetry payload test's fake stub must implement it or the package fails to typecheck under the full-repo CI lint. Related #696 Co-Authored-By: Paperclip <noreply@paperclip.ing>
ResolveDockerPath's MCP-2744 stat-probe override (upgrading a live cached negative to a permanent success when the well-known-path probe finds docker) updated dockerPath/dockerPathErr but not dockerPathSource. A stale 'absent' from the prior failed resolution then leaked into docker_cli_source telemetry (schema v5) even though docker WAS resolved via the bundled path. Set dockerPathSource = DockerSourceBundled in that branch. Regression test asserts a cached negative upgraded via the stat probe reports 'bundled', not 'absent' (fails without the one-line fix). Related #696 Co-Authored-By: Paperclip <noreply@paperclip.ing>
…-found (Codex P2 r3) zsh's login-shell wording is reversed — 'zsh:1: command not found: docker' (name AFTER the colon), not bash/sh's 'docker: command not found'. The CLI-not- found case only matched the bash shape, so the common macOS #696 scenario fell through to the generic 'command not found' branch and recorded MCPX_DOCKER_EXEC_NOT_FOUND, conflating host-CLI-missing (#696) with in-container exec ENOENT. Match 'command not found: docker' in both the isolation-spawn helper and the unhinted classifyDocker fallback. Added a classifier table case for the zsh wording (verified red→green: routed to EXEC_NOT_FOUND without the fix, CLI_NOT_FOUND with it). Related #696 Co-Authored-By: Paperclip <noreply@paperclip.ing>
…e path (Codex P2 r4) ResolveDockerSource honored a live cached negative by returning 'absent' WITHOUT running the well-known-path stat probe that ResolveDockerPath uses during the negative-TTL window — so docker_cli_source reported 'absent' while ResolveDockerPath returned the bundled path (schema-v5 telemetry mismatch). Root-cause the recurring divergence (this is the 2nd source-tracking drift between the two functions) by extracting a single shared resolveDockerPathLocked helper that is the only writer of the cache + dockerPathSource. ResolveDockerPath and ResolveDockerSource now drive the identical cache path, including the MCP-2744 override, so they cannot disagree on the same cache state. Regression test asserts ResolveDockerSource reports 'bundled' (not 'absent') when a well-known docker appears during the negative-TTL window — verified red->green against the shared-resolver refactor. Related #696 Co-Authored-By: Paperclip <noreply@paperclip.ing>
0d05a0d to
89cf261
Compare
|
Thanks @codex — both done. Rebased onto latest main (incl. #701/#700), head REBASE: rebased P2 r4 (ResolveDockerSource skipped the stat-probe override): fixed in Re-requesting review on |
🤖 Codex re-review (round 5) — REQUEST_CHANGES (1×P2)[P2] Gate
|
…Codex P2 r5)
The unhinted classifyDocker matched a BARE 'exec format error' as
MCPX_DOCKER_OCI_RUNTIME, but a non-docker wrong-architecture host stdio binary
emits the same string (ENOEXEC) — it would be mislabeled a Docker/OCI failure.
- classifyDocker OCI case now requires real OCI/runc context ('oci runtime' /
'runc'); bare 'exec format error' no longer matches there.
- classifyStdio routes a bare 'exec format error' (and typed syscall.ENOEXEC)
to a new STDIO code MCPX_STDIO_SPAWN_EXEC_FORMAT, guarded against docker OCI
context so a real containerized exec-format still falls through to OCI.
- The docker-isolated path keeps classifying bare 'exec format error' as OCI
via the hinted classifyDockerIsolatedSpawn.
- New catalog entry for MCPX_STDIO_SPAWN_EXEC_FORMAT.
Classifier table cases: bare exec-format WITH hint -> OCI; bare exec-format
WITHOUT hint -> STDIO (verified red MCPX_DOCKER_OCI_RUNTIME -> green
MCPX_STDIO_SPAWN_EXEC_FORMAT); 'oci runtime' context without hint -> OCI.
Related #696
Co-Authored-By: Paperclip <noreply@paperclip.ing>
|
Thanks @codex — fixed in P2 r5 (over-broad
Rather than route ENOEXEC to a semantically-wrong Tests ( Note: the prior run's |
🤖 Codex review — ACCEPT ✅
No P0–P2 after the final fix (the Reviewed via |
Summary
Makes the docker problems from #696 (CLI not on PATH) and per-server image/interpreter mismatch (e.g.
python:3.11lacksuvx) visible in the telemetry dashboard instead of guessing. Today telemetry can't distinguish "isolation on, 0 matching servers" from "isolation off", can't see the #696 failure mode at all, and misclassifies in-container exec ENOENT as a host STDIO error.Implements MCP-2745. Base
origin/mainalready contains #697/#698/#699 (the docker-spawn fixes), so no blocker remains.Changes
A. Feature-flag fields (schema bumped 4 → 5, additive / forward-compatible —
additionalProperties:true):feature_flags.docker_isolation_enabled(bool) — fromcfg.DockerIsolationinBuildFeatureFlagSnapshot(pure, config-only).feature_flags.docker_cli_source(enumpath|bundled|login_shell|absent) — the direct Docker Desktop not detected on macOS: servers fail withcommand not found: dockerdespitedocker_available: true#696 fleet signal. Newshellwrap.ResolveDockerSource()reports which resolution branch found docker (shares the docker-path cache);Runtime.GetDockerCLISource()mirrorsIsDockerAvailable(). Coarse enum only — never the path string.B. Diagnostic codes (DOCKER domain; flow into
error_code_counts_24hvia the existingRecordErrorCodeMCPX-prefix-gated/PII-safe pipeline):MCPX_DOCKER_CLI_NOT_FOUND— isolation requested butdockerbinary unresolved (Docker Desktop not detected on macOS: servers fail withcommand not found: dockerdespitedocker_available: true#696).MCPX_DOCKER_EXEC_NOT_FOUND— in-container interpreter missing (image lacksuvx/node/…); previously misclassified asMCPX_STDIO_SPAWN_ENOENT.MCPX_DOCKER_OCI_RUNTIME— OCI runtime /exec format error.ClassifierHintsgainsDockerIsolated; the supervisor threads it from a side-effect-free mirror ofIsolationManager.ShouldIsolate, so containerized ENOENT routes to DOCKER codes. Catalog entries added for all three (severity + CTA).Privacy
All new fields are coarse enums / booleans. New canary
TestPayloadV5_DockerCLISourceIsEnumOnlyasserts only the 4 enum values ever reach the wire (no path/URL substrings) andScanForPIIstays green for every value.Tests (TDD, all
-racegreen)shellwrap:ResolveDockerSource→ path/bundled/login_shell/absent.telemetry:DockerIsolationEnabledfrom cfg (false when nil);DockerCLISourceforwarded from runtime stats; v5 PII canary; schema-version tripwires bumped to 5.diagnostics: classifier table — CLI-not-found →MCPX_DOCKER_CLI_NOT_FOUND; containerized ENOENT (hint) →MCPX_DOCKER_EXEC_NOT_FOUND; non-containerized ENOENT still →MCPX_STDIO_SPAWN_ENOENT; OCI →MCPX_DOCKER_OCI_RUNTIME. Catalog min-coverage covers the 3 new codes.golangci-lint v2(CI config) clean on touched packages.telemetry.mcpproxy.app, separate service)The ingestion/dashboard is not in this repo; the emitted payload is forward-compatible. Action item for whoever owns ingestion: add panels/queries for
feature_flags.docker_isolation_enabled, the 4-wayfeature_flags.docker_cli_sourcebreakdown, and the newMCPX_DOCKER_CLI_NOT_FOUND/MCPX_DOCKER_EXEC_NOT_FOUND/MCPX_DOCKER_OCI_RUNTIMErows in theerror_code_counts_24hpanel. Otherwise the emitted signal is stranded.Docs
docs/features/telemetry.md(schema → v5, new field rows, new diagnostic codes) + parity notes onspecs/042-telemetry-tier2/contracts/heartbeat-v2.schema.jsonandspecs/044-retention-telemetry-v3/contracts/heartbeat-v3.json.Related #696