Skip to content

feat(telemetry): docker isolation flag, CLI source + docker failure codes (schema v5)#702

Merged
Dumbris merged 6 commits into
mainfrom
feat/docker-telemetry-mcp-2745
Jun 17, 2026
Merged

feat(telemetry): docker isolation flag, CLI source + docker failure codes (schema v5)#702
Dumbris merged 6 commits into
mainfrom
feat/docker-telemetry-mcp-2745

Conversation

@Dumbris

@Dumbris Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member

Summary

Makes the docker problems from #696 (CLI not on PATH) and per-server image/interpreter mismatch (e.g. python:3.11 lacks uvx) visible in the telemetry dashboard instead of guessing. Today telemetry can't distinguish "isolation on, 0 matching servers" from "isolation off", can't see the #696 failure mode at all, and misclassifies in-container exec ENOENT as a host STDIO error.

Implements MCP-2745. Base origin/main already contains #697/#698/#699 (the docker-spawn fixes), so no blocker remains.

Changes

A. Feature-flag fields (schema bumped 4 → 5, additive / forward-compatible — additionalProperties:true):

B. Diagnostic codes (DOCKER domain; flow into error_code_counts_24h via the existing RecordErrorCode MCPX-prefix-gated/PII-safe pipeline):

  • MCPX_DOCKER_CLI_NOT_FOUND — isolation requested but docker binary unresolved (Docker Desktop not detected on macOS: servers fail with command not found: docker despite docker_available: true #696).
  • MCPX_DOCKER_EXEC_NOT_FOUND — in-container interpreter missing (image lacks uvx/node/…); previously misclassified as MCPX_STDIO_SPAWN_ENOENT.
  • MCPX_DOCKER_OCI_RUNTIME — OCI runtime / exec format error.
  • ClassifierHints gains DockerIsolated; the supervisor threads it from a side-effect-free mirror of IsolationManager.ShouldIsolate, so containerized ENOENT routes to DOCKER codes. Catalog entries added for all three (severity + CTA).

Privacy

All new fields are coarse enums / booleans. New canary TestPayloadV5_DockerCLISourceIsEnumOnly asserts only the 4 enum values ever reach the wire (no path/URL substrings) and ScanForPII stays green for every value.

Tests (TDD, all -race green)

  • shellwrap: ResolveDockerSource → path/bundled/login_shell/absent.
  • telemetry: DockerIsolationEnabled from cfg (false when nil); DockerCLISource forwarded from runtime stats; v5 PII canary; schema-version tripwires bumped to 5.
  • diagnostics: classifier table — CLI-not-found → MCPX_DOCKER_CLI_NOT_FOUND; containerized ENOENT (hint) → MCPX_DOCKER_EXEC_NOT_FOUND; non-containerized ENOENT still → MCPX_STDIO_SPAWN_ENOENT; OCI → MCPX_DOCKER_OCI_RUNTIME. Catalog min-coverage covers the 3 new codes.
  • golangci-lint v2 (CI config) clean on touched packages.

⚠️ Dashboard hand-off (EXTERNAL — telemetry.mcpproxy.app, separate service)

The ingestion/dashboard is not in this repo; the emitted payload is forward-compatible. Action item for whoever owns ingestion: add panels/queries for feature_flags.docker_isolation_enabled, the 4-way feature_flags.docker_cli_source breakdown, and the new MCPX_DOCKER_CLI_NOT_FOUND / MCPX_DOCKER_EXEC_NOT_FOUND / MCPX_DOCKER_OCI_RUNTIME rows in the error_code_counts_24h panel. Otherwise the emitted signal is stranded.

Docs

docs/features/telemetry.md (schema → v5, new field rows, new diagnostic codes) + parity notes on specs/042-telemetry-tier2/contracts/heartbeat-v2.schema.json and specs/044-retention-telemetry-v3/contracts/heartbeat-v3.json.

Related #696

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 17, 2026

Copy link
Copy Markdown

Deploying mcpproxy-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 60dfe13
Status: ✅  Deploy successful!
Preview URL: https://49845b15.mcpproxy-docs.pages.dev
Branch Preview URL: https://feat-docker-telemetry-mcp-27.mcpproxy-docs.pages.dev

View logs

@Dumbris Dumbris force-pushed the feat/docker-telemetry-mcp-2745 branch from 971cc72 to 5aa843d Compare June 17, 2026 08:56
@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 Codex review — REQUEST_CHANGES (1×P1, 1×P2)

[P1] Update all RuntimeStats fakes — internal/telemetry/telemetry.go:181

Adding GetDockerCLISource() to the RuntimeStats interface leaves fakeRuntimeStats{} in internal/httpapi/telemetry_payload_test.go incomplete → go test ./internal/httpapi fails to compile. Update that fake (and any other implementors) to satisfy the new method.

[P2] Preserve source when upgrading a negative cache — internal/shellwrap/shellwrap.go:278-280

When ResolveDockerPath upgrades a live negative cache via the well-known-path probe it sets dockerPath/dockerPathErr but never dockerPathSource; the cached-success branch then returns absent even though Docker resolved from a bundled/well-known path. With the #699 cache fix now on main, schema-v5 telemetry will report the wrong docker_cli_source until the cache resets. Set dockerPathSource in that upgrade path.

Also: this PR will need a rebase onto main after #701 (login-shell hydration) lands — both edit internal/shellwrap/shellwrap.go and will conflict.

Reviewed via codex exec review --base origin/main.

@codecov-commenter

codecov-commenter commented Jun 17, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 70.79646% with 33 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
internal/runtime/supervisor/supervisor.go 0.00% 19 Missing ⚠️
internal/diagnostics/classifier.go 72.00% 6 Missing and 1 partial ⚠️
internal/shellwrap/shellwrap.go 80.76% 3 Missing and 2 partials ⚠️
internal/runtime/runtime.go 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown

📦 Build Artifacts

Workflow Run: View Run
Branch: feat/docker-telemetry-mcp-2745

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (28 MB)
  • archive-windows-arm64 (25 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 27685601658 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Thanks @codex — addressed. Head is now 804e09f9702eb53da44fcdc96287c904ae33ea0d.

P1 (build break — fakeRuntimeStats incomplete): already fixed before this review landed. Commit 971cc724 (now 5aa843d1) added GetDockerCLISource() to internal/httpapi/telemetry_payload_test.go. I audited every RuntimeStats implementor (the 5 that define GetDockerIsolatedServerCount) — all now implement the new method, and full-repo golangci-lint v2 + go test ./internal/httpapi are green. The CI Lint + Unit Tests + E2E checks passed on 5aa843d1.

P2 (stat-probe cache upgrade drops source) — real bug, fixed in 804e09f9. ResolveDockerPath's MCP-2744 override upgraded dockerPath/dockerPathErr but not dockerPathSource, so a stale absent leaked into docker_cli_source after a negative→success upgrade. Now sets dockerPathSource = DockerSourceBundled in that branch (the probe IS the bundled path). Added TestResolveDockerSource_StatProbeUpgradeReportsBundled — verified it fails (absent) without the one-liner and passes (bundled) with it.

Sequencing (#701 conflict): #701 (MCP-2751 login-shell env hydration) is still open, not merged, and I don't own it — so I'm not blocking #702 on it (the issue's only stated dependency was #697, already in main). The overlap is a textual conflict in shellwrap.go, not a logical one. I'll rebase #702 onto main and re-test the moment either PR lands; whoever merges second rebases. Flagging here so the merge order is explicit.

Re-requesting review on 804e09f9702eb53da44fcdc96287c904ae33ea0d.

@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 Codex re-review (round 3) — REQUEST_CHANGES (1×P2)

[P2] Recognize zsh's docker-not-found wording — internal/diagnostics/classifier.go:196-200

On macOS the login-shell fallback runs under zsh, which reports a missing docker binary as zsh:1: command not found: docker. That string doesn't match the Docker-CLI patterns and falls through to the generic command not found branch, recording MCPX_DOCKER_EXEC_NOT_FOUND instead of MCPX_DOCKER_CLI_NOT_FOUND — misclassifying the exact #696 scenario this telemetry is meant to surface. Match the command not found: docker / zsh:...: command not found: docker shape and route it to MCPX_DOCKER_CLI_NOT_FOUND (add a classifier test case for the zsh wording).

Reviewed via codex exec review --base origin/main on 804e09f9.

@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Thanks @codex — good catch, fixed in 0d05a0d8261d3f0e7414100eb8fa56dbd0313610.

P2 r3 (zsh wording misclassified): You're right — zsh's reversed shape zsh:1: command not found: docker (name after the colon) didn't match the CLI-not-found case, so it fell through to the generic command not found branch and recorded MCPX_DOCKER_EXEC_NOT_FOUND — conflating #696 (host CLI missing) with in-container exec ENOENT.

Fix: added command not found: docker to the CLI-not-found case in both classifyDockerIsolatedSpawn (hinted isolation path) and the unhinted classifyDocker fallback, so it now beats the generic EXEC case. Added classifier table case cli_not_found_zsh_reversed — verified it routes to EXEC_NOT_FOUND without the fix and CLI_NOT_FOUND with it.

Note: the no-hint zsh case still resolves to MCPX_STDIO_SPAWN_ENOENT (classifyStdio's generic command not found catches it before classifyDocker for stdio transport) — that's the pre-existing host-stdio behavior and matches the design; the real #696 path always carries the DockerIsolated hint from the supervisor, which now classifies correctly.

Re-requesting review on 0d05a0d8261d3f0e7414100eb8fa56dbd0313610. Diagnostics tests + full-repo lint green.

@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 Codex re-review (round 4) — REQUEST_CHANGES (1×P2) + rebase needed

[P2] Re-probe bundled Docker before returning absentinternal/shellwrap/shellwrap.go:287-288

ResolveDockerSource returns absent without the probeWellKnownDocker override that ResolveDockerPath uses. If a prior lookup cached a negative within the negative-TTL and Docker appears at a well-known path, telemetry's docker_cli_source reports the wrong value until another path lookup runs. Apply the same well-known-path override in ResolveDockerSource before returning absent.

Rebase required: #701 (login-shell hydration) just merged and refactored internal/shellwrap/shellwrap.go. Rebase this PR onto main and re-test before re-review.

Reviewed via codex exec review --base origin/main on 0d05a0d8.

Dumbris and others added 5 commits June 17, 2026 14:24
…ure codes (schema v5)

Make the #696 (docker CLI not on PATH) and image/interpreter-mismatch failure
modes visible in the telemetry dashboard instead of guessing.

Feature flags (schema bumped 4 -> 5, additive / forward-compatible):
- docker_isolation_enabled (bool) from cfg.DockerIsolation in
  BuildFeatureFlagSnapshot — distinguishes "isolation on, 0 matching servers"
  from "isolation off".
- docker_cli_source (enum path|bundled|login_shell|absent) — the direct #696
  fleet signal. New shellwrap.ResolveDockerSource reports which resolution
  branch found docker (shares the docker-path cache); Runtime.GetDockerCLISource
  mirrors IsDockerAvailable. Coarse enum only — never the path string.

Diagnostics (flow into error_code_counts_24h, MCPX-prefix gated/PII-safe):
- MCPX_DOCKER_CLI_NOT_FOUND — isolation requested but docker binary unresolved.
- MCPX_DOCKER_EXEC_NOT_FOUND — in-container interpreter missing (e.g. uvx
  absent in python:3.11); previously misclassified as MCPX_STDIO_SPAWN_ENOENT.
- MCPX_DOCKER_OCI_RUNTIME — OCI runtime / exec-format failures.
ClassifierHints gains DockerIsolated; the supervisor threads it from a
side-effect-free mirror of IsolationManager.ShouldIsolate so containerized
ENOENT routes to DOCKER codes. Catalog entries added for all three.

Privacy: all new fields are coarse enums/bools; docker_cli_source canary test
asserts only the 4 enum values ever reach the wire and ScanForPII stays green.

Docs/contracts updated (docs/features/telemetry.md + v2/v3 contract notes).

DASHBOARD HAND-OFF (telemetry.mcpproxy.app, separate service): add panels for
feature_flags.docker_isolation_enabled, the 4-way docker_cli_source breakdown,
and the 3 new MCPX_DOCKER_* rows in the error_code_counts_24h panel.

Related #696

Co-Authored-By: Paperclip <noreply@paperclip.ing>
The RuntimeStats interface gained GetDockerCLISource() (MCP-2745); the httpapi
telemetry payload test's fake stub must implement it or the package fails to
typecheck under the full-repo CI lint.

Related #696

Co-Authored-By: Paperclip <noreply@paperclip.ing>
ResolveDockerPath's MCP-2744 stat-probe override (upgrading a live cached
negative to a permanent success when the well-known-path probe finds docker)
updated dockerPath/dockerPathErr but not dockerPathSource. A stale 'absent'
from the prior failed resolution then leaked into docker_cli_source telemetry
(schema v5) even though docker WAS resolved via the bundled path. Set
dockerPathSource = DockerSourceBundled in that branch.

Regression test asserts a cached negative upgraded via the stat probe reports
'bundled', not 'absent' (fails without the one-line fix).

Related #696

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…-found (Codex P2 r3)

zsh's login-shell wording is reversed — 'zsh:1: command not found: docker'
(name AFTER the colon), not bash/sh's 'docker: command not found'. The CLI-not-
found case only matched the bash shape, so the common macOS #696 scenario fell
through to the generic 'command not found' branch and recorded
MCPX_DOCKER_EXEC_NOT_FOUND, conflating host-CLI-missing (#696) with in-container
exec ENOENT. Match 'command not found: docker' in both the isolation-spawn
helper and the unhinted classifyDocker fallback.

Added a classifier table case for the zsh wording (verified red→green: routed to
EXEC_NOT_FOUND without the fix, CLI_NOT_FOUND with it).

Related #696

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…e path (Codex P2 r4)

ResolveDockerSource honored a live cached negative by returning 'absent'
WITHOUT running the well-known-path stat probe that ResolveDockerPath uses
during the negative-TTL window — so docker_cli_source reported 'absent' while
ResolveDockerPath returned the bundled path (schema-v5 telemetry mismatch).

Root-cause the recurring divergence (this is the 2nd source-tracking drift
between the two functions) by extracting a single shared resolveDockerPathLocked
helper that is the only writer of the cache + dockerPathSource. ResolveDockerPath
and ResolveDockerSource now drive the identical cache path, including the
MCP-2744 override, so they cannot disagree on the same cache state.

Regression test asserts ResolveDockerSource reports 'bundled' (not 'absent')
when a well-known docker appears during the negative-TTL window — verified
red->green against the shared-resolver refactor.

Related #696

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@Dumbris Dumbris force-pushed the feat/docker-telemetry-mcp-2745 branch from 0d05a0d to 89cf261 Compare June 17, 2026 11:27
@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Thanks @codex — both done. Rebased onto latest main (incl. #701/#700), head 89cf2619fd467a59310b45f37266222a398c0c71.

REBASE: rebased feat/docker-telemetry-mcp-2745 onto origin/main after #701 (login-shell hydration) landed. The shellwrap.go overlap resolved cleanly — #701's captureLoginShellEnv/hydrate code and my ResolveDockerSource/dockerPathSource additions live in different regions. Verified the hydrate code is present and go build ./... + full-repo lint + all package tests are green on the rebased tree.

P2 r4 (ResolveDockerSource skipped the stat-probe override): fixed in 89cf2619fd467a59310b45f37266222a398c0c71 — and root-caused the recurring drift. ResolveDockerSource returned absent off a live cached negative without running probeWellKnownDocker, while ResolveDockerPath applied the MCP-2744 override → mismatched docker_cli_source. Rather than mirror the override a 2nd time, I extracted a single shared resolveDockerPathLocked helper — the only writer of the cache + dockerPathSource. Both ResolveDockerPath and ResolveDockerSource now drive the identical cache path, so they can't diverge on the same state again. New test TestResolveDockerSource_NegativeTTLProbeOverride (verified red absent → green bundled).

Re-requesting review on 89cf2619fd467a59310b45f37266222a398c0c71. CI re-running.

@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 Codex re-review (round 5) — REQUEST_CHANGES (1×P2)

[P2] Gate exec format error matching to Docker context — internal/diagnostics/classifier.go:134

classifyDocker runs for every otherwise-unclassified error, and the new exec format errorMCPX_DOCKER_OCI_RUNTIME alternative has no Docker context/hint. A non-Docker stdio server whose host binary is built for the wrong architecture also surfaces exec format error, so it'd now be mislabeled MCPX_DOCKER_OCI_RUNTIME in UI diagnostics + telemetry. Require Docker/OCI context (e.g. runc/oci runtime in the message) or the docker-isolation hint — the same hint already gating MCPX_DOCKER_EXEC_NOT_FOUND — before returning a Docker code for a bare exec format error.

Reviewed via codex exec review --base origin/main on 89cf2619.

…Codex P2 r5)

The unhinted classifyDocker matched a BARE 'exec format error' as
MCPX_DOCKER_OCI_RUNTIME, but a non-docker wrong-architecture host stdio binary
emits the same string (ENOEXEC) — it would be mislabeled a Docker/OCI failure.

- classifyDocker OCI case now requires real OCI/runc context ('oci runtime' /
  'runc'); bare 'exec format error' no longer matches there.
- classifyStdio routes a bare 'exec format error' (and typed syscall.ENOEXEC)
  to a new STDIO code MCPX_STDIO_SPAWN_EXEC_FORMAT, guarded against docker OCI
  context so a real containerized exec-format still falls through to OCI.
- The docker-isolated path keeps classifying bare 'exec format error' as OCI
  via the hinted classifyDockerIsolatedSpawn.
- New catalog entry for MCPX_STDIO_SPAWN_EXEC_FORMAT.

Classifier table cases: bare exec-format WITH hint -> OCI; bare exec-format
WITHOUT hint -> STDIO (verified red MCPX_DOCKER_OCI_RUNTIME -> green
MCPX_STDIO_SPAWN_EXEC_FORMAT); 'oci runtime' context without hint -> OCI.

Related #696

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Thanks @codex — fixed in 60dfe13e895d86da163624827d069609a0b00de1.

P2 r5 (over-broad exec format error → OCI): you're right, a non-docker wrong-arch host stdio binary emits the same ENOEXEC string and was being mislabeled MCPX_DOCKER_OCI_RUNTIME. Fixes:

  • classifyDocker (unhinted) OCI case now requires real OCI/runc context (oci runtime / runc); bare exec format error no longer matches there.
  • classifyStdio routes a bare exec format error (and typed syscall.ENOEXEC) to a new STDIO code MCPX_STDIO_SPAWN_EXEC_FORMAT, guarded with !oci runtime && !runc so a real containerized exec-format still falls through to classifyDocker → OCI.
  • The docker-isolated path keeps bare exec format error → OCI via the hinted classifyDockerIsolatedSpawn (gated on the same DockerIsolated hint that gates MCPX_DOCKER_EXEC_NOT_FOUND).
  • Added a catalog entry for the new code.

Rather than route ENOEXEC to a semantically-wrong SPAWN_ENOENT, I gave it a precise dedicated STDIO code so the user message is accurate ("wrong CPU architecture / not an executable").

Tests (TestClassify_Docker_IsolationSpawn table): bare exec-format WITH hint → OCI; bare exec-format WITHOUT hint → MCPX_STDIO_SPAWN_EXEC_FORMAT (verified red MCPX_DOCKER_OCI_RUNTIME → green); oci runtime context without hint → OCI.

Note: the prior run's Validate PR title red was a transient infra flake (the job returned an HTML error page, not a title violation) — this push re-triggers it. Re-requesting review on 60dfe13e895d86da163624827d069609a0b00de1.

@Dumbris

Dumbris commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

🤖 Codex review — ACCEPT ✅

I found no discrete correctness issues in the changed code. The Docker telemetry/source tracking and diagnostic classification changes appear consistent with the existing runtime and tests.

No P0–P2 after the final fix (the exec format error → OCI match is now gated on Docker-isolation context, so a wrong-arch non-docker host binary stays STDIO-classified). Ships schema-v5 docker telemetry: docker_isolation_enabled, docker_cli_source (path/bundled/login_shell/absent), and MCPX_DOCKER_CLI_NOT_FOUND / MCPX_DOCKER_EXEC_NOT_FOUND / MCPX_DOCKER_OCI_RUNTIME diagnostic codes.

Reviewed via codex exec review --base origin/main on 60dfe13e.

@mcpproxy-gatekeeper mcpproxy-gatekeeper Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved via Claude Code review (Codex quota exhausted).

@Dumbris Dumbris merged commit e9993dc into main Jun 17, 2026
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants