Run OAP init job in the main phase to fix helm --wait deadlock #190

Merged: wu-sheng merged 1 commit into master from fix-oap-init-job-wait-deadlock on Jun 10, 2026
@wu-sheng

Member

Problem

The OAP init job is a post-install,post-upgrade,post-rollback Helm hook. Under the very common helm upgrade --install --wait, this deadlocks:

  • Helm waits for all release resources to become Ready before it runs post-* hooks.
  • The OAP Deployment runs in -Dmode=no-init and, when the storage schema is missing, blocks in OAP's ModelInstaller loop ("... is running in 'no-init' mode, waiting ... retry 3s later"). Port 12800 never opens, so the readiness probe never passes.
  • The schema is created by the init job… which is a post-* hook that Helm won't run until the Deployment is Ready.

Result: the Deployment never becomes Ready → the hook never runs → the schema is never created → Helm times out. This bites new users on a fresh install or fresh storage.

Hooks fundamentally cannot solve this when storage is an embedded subchart: a pre-* hook init job cannot reach main-phase storage (the storage service does not exist yet), and a post-* hook deadlocks under --wait. Schema init must run in the same phase as storage.
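
For illustration, the hook-based setup described above would look roughly like the following. This is a minimal sketch, not the chart's actual template: the Job name, container name, and image are hypothetical placeholders, and only the hook annotation values are taken from the description.

```yaml
# Sketch of the OLD, hook-based init Job (hypothetical names, not the real template).
apiVersion: batch/v1
kind: Job
metadata:
  name: example-oap-init                   # hypothetical name
  annotations:
    # With --wait, Helm runs post-* hooks only after the main-phase resources
    # are Ready. The OAP Deployment cannot become Ready without the schema this
    # Job creates, so the Job never starts.
    "helm.sh/hook": post-install,post-upgrade,post-rollback
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: oap-init                   # hypothetical container name
          image: example.invalid/oap:tag   # placeholder image
```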

Fix

  • Remove the hook annotations so the init job runs as a normal main-phase resource, alongside storage and the OAP Deployment. In no-init mode, OAP blocks itself until the job creates the schema, so --wait resolves instead of deadlocking. Correct ordering comes from OAP's runtime behavior, not Helm phase ordering, so the single-run (one pod) Job semantics are unchanged.
  • Hash-suffixed Job name (...-oap-init-<8-char hash of .Values>). A Job's spec.template is immutable, so a stable name would make helm upgrade fail with a "field is immutable" error whenever the pod template changes. Hashing yields a fresh Job on any relevant change and Helm prunes the previous one, which also addresses the long-standing "Job already exists, cannot rerun" pain.
  • oapInit.ttlSecondsAfterFinished (new, optional, default empty) to auto-clean finished Jobs via the K8s TTL-after-finished controller. Left off by default so GitOps tools (Argo CD/Flux) don't recreate the Job after deletion.
  • OAP startupProbe default failureThreshold raised 9 → 30 (90s → 300s) so the pod waits for the init job during a cold start instead of being restarted.
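
The first three points above can be sketched as a Helm template. This is an assumption-laden illustration rather than the template merged in this PR: the `{{ .Release.Name }}` prefix stands in for the real name prefix (elided as `...` above), the container name and image are placeholders, and `-Dmode=init` is assumed to be how the init pod is started. The `toYaml`, `sha256sum`, and `trunc` functions are standard Helm/Sprig template functions.

```yaml
# Sketch only: NOT the chart's actual template.
apiVersion: batch/v1
kind: Job
metadata:
  # No helm.sh/hook annotation, so this is an ordinary main-phase resource.
  # The suffix is an 8-character hash of .Values: any values change produces a
  # new Job name, so Helm creates a new Job instead of patching the immutable
  # spec.template of the old one, and prunes the old Job.
  name: {{ .Release.Name }}-oap-init-{{ .Values | toYaml | sha256sum | trunc 8 }}
spec:
  # Rendered only when oapInit.ttlSecondsAfterFinished is set (default: empty).
  {{- with .Values.oapInit.ttlSecondsAfterFinished }}
  ttlSecondsAfterFinished: {{ . }}
  {{- end }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: oap-init                   # hypothetical container name
          image: example.invalid/oap:tag   # placeholder image
          env:
            - name: JAVA_OPTS
              value: "-Dmode=init"         # assumed init-mode flag
```

Because the hash input is the rendered values, the name is stable across repeated renders with the same values and changes only when a value changes, which matches the validation notes below.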

Notes

  • Under bare --wait, Helm does not wait for the (now normal) Job. It waits on the Deployment, which converges on its own once the Job creates the schema. Add --wait-for-jobs to have Helm surface init-job failures directly. This is documented in the README.
  • The README "Rerun OAP init job" section is simplified: upgrades that change a value re-run init automatically; to force a rerun, delete the Job and run helm upgrade (Helm recreates it).

Validation

  • helm lint clean; helm template renders for elasticsearch (embedded ECK), external ES, postgresql, and banyandb.
  • Init Job renders as a plain Job (no hook annotations) with a hash name that is deterministic across runs and changes when values change.
  • ttlSecondsAfterFinished omitted by default, renders when set.
  • No e2e test or CI workflow depends on the hook or on --wait, so the change is transparent to e2e.
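
As a rough usage sketch, a values override that exercises the two tunables from the Fix section might look like this. The file name, the `600` TTL value, and the `oap.startupProbe` key path are assumptions for illustration; only `oapInit.ttlSecondsAfterFinished` and the 9 → 30 threshold change are stated in this PR. The `periodSeconds: 10` value is inferred from the stated 90s → 300s arithmetic (30 × 10s = 300s).

```yaml
# my-values.yaml (hypothetical override file)
oapInit:
  # Empty by default, so the field is omitted from the rendered Job.
  # Setting it lets the TTL-after-finished controller delete the finished Job.
  ttlSecondsAfterFinished: 600   # example value, not a chart default

oap:                             # assumed key path for the OAP Deployment
  startupProbe:
    failureThreshold: 30         # new default per this PR (was 9)
    periodSeconds: 10            # inferred: 30 failures x 10s = 300s
```

Leaving `ttlSecondsAfterFinished` unset is the safer choice under Argo CD or Flux, since those tools may recreate a Job that the TTL controller has deleted.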

Docs kept in sync

chart/skywalking/values.yaml, chart/skywalking/README.md, and root README.md.

🤖 Generated with Claude Code

The OAP init job was a `post-install,post-upgrade,post-rollback` hook. Under
`helm upgrade --install --wait`, Helm waits for all release resources to become
Ready before running post-* hooks, but the OAP Deployment runs in `-Dmode=no-init`
and never becomes Ready until the init job creates the storage schema. The hook
therefore never runs and the install hangs until Helm times out. This hits new
users on a fresh install or fresh storage.

Hooks cannot fix this with embedded storage subcharts: a pre-* hook init job
cannot reach main-phase storage, and a post-* hook deadlocks under `--wait`.
So the init job now runs as a normal main-phase resource alongside storage and
the OAP Deployment, which blocks in no-init mode until the schema appears.

To avoid `spec.template is immutable` failures on upgrade (a Job's pod template
cannot be patched), the Job name carries an 8-char hash of the chart values, so
a changed spec yields a new Job and Helm prunes the previous one. A new optional
`oapInit.ttlSecondsAfterFinished` can auto-clean finished Jobs (off by default;
left off for GitOps tools that would otherwise recreate the Job).

The OAP Deployment startupProbe default failureThreshold is raised 9 -> 30
(90s -> 300s) so the pod waits for the init job during a cold start instead of
being restarted.

Docs (values.yaml, chart README, root README) updated accordingly.
@wu-sheng wu-sheng requested review from hanahmily and kezhenxu94 June 10, 2026 01:51
@wu-sheng wu-sheng added this to the 4.10.0 milestone Jun 10, 2026
@wu-sheng wu-sheng added the bug Something isn't working label Jun 10, 2026
@wu-sheng wu-sheng merged commit 3aee62a into master Jun 10, 2026
9 checks passed
@wu-sheng wu-sheng deleted the fix-oap-init-job-wait-deadlock branch June 10, 2026 02:02