chore: upgrade gpu operator v25.3.4 → v25.10.1#415

Open
svia3 wants to merge 1 commit into aws:main from svia3:gpu-operator-v25.10.1-upgrade
@svia3 svia3 commented Apr 30, 2026

Problem

The GPU Operator v25.3.4 container images (built Oct 2025) contain critical CVEs. This blocks OS patching and observability work.

Fix

Bump all GPU Operator component versions to v25.10.1. Pure version bump — no image name changes, no regional-values changes, no behavioral change.

Chart.yaml dependency:  v25.3.4             → v25.10.1
operator:               v25.3.4             → v25.10.1
toolkit:                v1.17.9-ubi8        → v1.18.1
devicePlugin:           v0.17.4             → v0.18.1
gfd:                    v0.17.4             → v0.18.1
migManager:             v0.12.3-ubuntu20.04 → v0.13.1
validator:              v25.3.4             → v25.10.1
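
For reviewers, the bumps listed above might look roughly like this in the two changed files. This is a sketch, not the actual diff: the dependency name and the top-level `gpu-operator` key are assumptions (the real key depends on how the subchart is named or aliased in this repo), and the per-component `version` keys follow the upstream NVIDIA GPU Operator chart's values layout.

```yaml
# Chart.yaml (excerpt). The dependency name is an assumption; repository is omitted.
dependencies:
  - name: gpu-operator
    version: v25.10.1              # was v25.3.4
---
# values.yaml (excerpt). The top-level key is assumed to match the dependency name.
gpu-operator:
  operator:
    version: v25.10.1              # was v25.3.4
  toolkit:
    version: v1.18.1               # was v1.17.9-ubi8
  devicePlugin:
    version: v0.18.1               # was v0.17.4
  gfd:
    version: v0.18.1               # was v0.17.4
  migManager:
    version: v0.13.1               # was v0.12.3-ubuntu20.04
  validator:
    version: v25.10.1              # was v25.3.4
```

Only `version` fields change here; image names and repositories are left as they are, consistent with the "pure version bump" scope.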

Why This Approach

Toolkit behavior is unchanged — toolkit.enabled is left absent (defaults to true). The GPU Operator's toolkit DaemonSet coexists safely with the AMI's pre-installed toolkit. Validated on a live HyperPod EKS cluster across 6 test cases including fresh install, parallel toolkit coexistence, upgrade path, HMA compatibility, and full parent chart install.

Disabling the toolkit (toolkit.enabled: false) was tested and works for first-time installs, but breaks existing clusters upgrading from v25.3.4 (containerd loses its nvidia runtime config when the toolkit DaemonSet is removed). That's a separate follow-up requiring a migration plan.
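
The two toolkit configurations discussed above can be contrasted as follows. This is an illustrative sketch: the top-level `gpu-operator` key is an assumption about how the subchart values are nested in this repo, while `toolkit.enabled` is the key named in this description.

```yaml
# Shipped in this PR: toolkit.enabled is not set, so the chart default (true) applies
# and the toolkit DaemonSet keeps running on upgraded clusters.
gpu-operator:
  toolkit:
    version: v1.18.1
---
# Tested but deferred: works for first-time installs, but on clusters upgrading from
# v25.3.4 the toolkit DaemonSet is removed and containerd loses its nvidia runtime config.
gpu-operator:
  toolkit:
    enabled: false
```

Leaving the key absent rather than writing `enabled: true` keeps the values file identical in shape to what existing clusters already run.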

Failure Cases

  1. Existing clusters on v25.3.4: Safe. Helm release pins version — customers only get v25.10.1 on explicit helm upgrade. Toolkit behavior unchanged.
  2. ECR images: Requires ECR replicator update to mirror v25.10.1 images to regional ECRs before this PR ships. Old v25.3.4 tags remain in ECR for rollback.
  3. Toolkit base image change: Toolkit moves from v1.17.9-ubi8 to v1.18.1 (default variant). HyperPod EKS uses Amazon Linux, not RHEL — default variant is the correct match.

Scope

Version bumps only. 2 files changed (Chart.yaml + values.yaml). No image name changes, no regional-values changes. Blocked on ECR replicator deploying v25.10.1 images first.

Checklist

  • 6 test cases on live HyperPod EKS cluster
  • Toolkit coexistence validated (AMI toolkit v1.19.0 alongside operator toolkit v1.18.1)
  • HMA compatibility confirmed (independent, uses kernel log XID matching)
  • Backward compatible (existing clusters unaffected)
  • ECR replicator deployed with v25.10.1 images (blocker)

Bump all GPU Operator component versions to resolve critical CVEs in
v25.3.4 images. Pure version bump — no image name changes, no
regional-values changes, no behavioral change.

- operator: v25.3.4 → v25.10.1
- toolkit: v1.17.9-ubi8 → v1.18.1
- devicePlugin: v0.17.4 → v0.18.1
- gfd: v0.17.4 → v0.18.1
- migManager: v0.12.3-ubuntu20.04 → v0.13.1
- validator: v25.3.4 → v25.10.1
- toolkit.enabled left absent (defaults true) — safe for upgrades
@svia3 svia3 requested a review from a team as a code owner April 30, 2026 23:23
@svia3 svia3 requested a deployment to manual-approval April 30, 2026 23:23 — with GitHub Actions Waiting