Run OAP init job in the main phase to fix helm --wait deadlock #190
Merged
The OAP init job was a `post-install,post-upgrade,post-rollback` hook. Under `helm upgrade --install --wait`, Helm waits for all release resources to become Ready before running post-* hooks, but the OAP Deployment runs in `-Dmode=no-init` and never becomes Ready until the init job creates the storage schema. The hook therefore never runs and the install deadlocks until it times out (hits new users on a fresh install/storage). Hooks cannot fix this with embedded storage subcharts: a pre-* hook init job cannot reach main-phase storage, and a post-* hook deadlocks under `--wait`. So the init job now runs as a normal main-phase resource alongside storage and the OAP Deployment, which blocks in no-init mode until the schema appears. To avoid `spec.template is immutable` failures on upgrade (a Job's pod template cannot be patched), the Job name carries an 8-char hash of the chart values, so a changed spec yields a new Job and Helm prunes the previous one. A new optional `oapInit.ttlSecondsAfterFinished` can auto-clean finished Jobs (off by default; left off for GitOps tools that would otherwise recreate the Job). The OAP Deployment startupProbe default failureThreshold is raised 9 -> 30 (90s -> 300s) so the pod waits for the init job during a cold start instead of being restarted. Docs (values.yaml, chart README, root README) updated accordingly.
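As a rough illustration of the approach described above, a main-phase init Job with a hash-suffixed name might be templated along these lines. This is a sketch, not the chart's actual template: the image value paths, the container name, and the exact hashing pipeline are assumptions, and `-Dmode=init` is assumed as the counterpart to the `-Dmode=no-init` mode mentioned above.

```yaml
# Hypothetical sketch of a main-phase init Job; not the chart's real template.
apiVersion: batch/v1
kind: Job
metadata:
  # 8-char hash of the chart values: a changed spec produces a new Job name,
  # so Helm creates a fresh Job and prunes the previous one instead of trying
  # to patch the immutable spec.template.
  name: {{ .Release.Name }}-oap-init-{{ .Values | toYaml | sha256sum | trunc 8 }}
  # No "helm.sh/hook" annotation: the Job is a normal release resource and is
  # applied in the same phase as storage and the OAP Deployment.
spec:
  {{- with .Values.oapInit.ttlSecondsAfterFinished }}
  # Rendered only when the optional value is set (empty by default).
  ttlSecondsAfterFinished: {{ . }}
  {{- end }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: oap-init   # assumed container name
          # Assumed value paths for the OAP image.
          image: "{{ .Values.oap.image.repository }}:{{ .Values.oap.image.tag }}"
          env:
            - name: JAVA_OPTS
              value: "-Dmode=init"   # assumed init-mode flag
```

Hashing the whole of `.Values` is the simplest choice for a sketch like this; it also means unrelated value changes produce a new Job name.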
kezhenxu94
approved these changes
Jun 10, 2026
Problem
The OAP init job is a `post-install,post-upgrade,post-rollback` Helm hook. Under the very common `helm upgrade --install --wait`, this deadlocks:
1. Helm waits for all release resources to become `Ready` before it runs `post-*` hooks.
2. The OAP `Deployment` runs in `-Dmode=no-init` and, when the storage schema is missing, blocks in OAP's `ModelInstaller` loop ("... is running in 'no-init' mode, waiting ... retry 3s later"); its `12800` port never opens, so the readiness probe never passes.
3. The schema is created only by the init job, a `post-*` hook that Helm won't run until the Deployment is `Ready`.

Result: the Deployment never becomes Ready → the hook never runs → the schema is never created → `helm` times out. This bites new users on a fresh install / fresh storage.

Hooks fundamentally cannot solve this when storage is an embedded subchart: a `pre-*` hook init job cannot reach main-phase storage (the storage service does not exist yet), and a `post-*` hook deadlocks under `--wait`. Schema init must run in the same phase as storage.

Fix
- The init job is now a normal main-phase resource, so `--wait` resolves instead of deadlocking. Correct ordering comes from OAP's runtime behavior, not Helm phase ordering, so the single-run (one pod) `Job` semantics are unchanged.
- The Job name carries a hash (`...-oap-init-<8-char hash of .Values>`). A Job's `spec.template` is immutable, so a stable name would make `helm upgrade` fail with `field is immutable` whenever the pod template changes. Hashing yields a fresh Job on any relevant change and Helm prunes the previous one, which also addresses the long-standing "Job already exists, cannot rerun" pain.
- `oapInit.ttlSecondsAfterFinished` (new, optional, default empty) auto-cleans finished Jobs via the K8s TTL-after-finished controller. Left off by default so GitOps tools (Argo CD/Flux) don't recreate the Job after deletion.
- The OAP Deployment startupProbe default `failureThreshold` is raised `9 → 30` (90s → 300s) so the pod waits for the init job during a cold start instead of being restarted.

Notes
- With `--wait`, Helm does not wait for the (now normal) Job; it waits on the Deployment, which self-converges once the Job creates the schema. Add `--wait-for-jobs` to have Helm surface init-job failures directly. Documented in the README.
- [...] `helm upgrade` (Helm recreates it).

Validation
- `helm lint` clean; `helm template` renders for elasticsearch (embedded ECK), external ES, postgresql, and banyandb.
- The init job renders as a `Job` (no hook annotations) with a hash name that is deterministic across runs and changes when values change.
- `ttlSecondsAfterFinished` omitted by default, renders when set.
- [...] `--wait`, so the change is transparent to e2e.

Docs kept in sync
`chart/skywalking/values.yaml`, `chart/skywalking/README.md`, and root `README.md`.

🤖 Generated with Claude Code
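
For reference, the two tunables described under Fix could be overridden roughly as follows. This is a minimal sketch: only `oapInit.ttlSecondsAfterFinished` is named in this description; the `oap.startupProbe` key path, the `periodSeconds` value (inferred from 90s / 9 failures), and the TTL value are assumptions.

```yaml
# Hypothetical values override; check the chart's values.yaml for real paths.
oapInit:
  # Empty (off) by default. When set, the K8s TTL-after-finished controller
  # deletes the finished init Job after this many seconds.
  ttlSecondsAfterFinished: 600   # example value, not a chart default

oap:
  startupProbe:              # assumed key path
    periodSeconds: 10        # inferred: 90s budget / 9 failures
    failureThreshold: 30     # new default: 30 x 10s = 300s startup budget
```

The startup budget is `failureThreshold × periodSeconds`, so the old default allowed 9 × 10s = 90s and the new one allows 30 × 10s = 300s before the pod is restarted.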