RHINENG-27056: move db-migration to a separate job#2236
Conversation
Reviewer's GuideMoves database migrations from the manager pod’s init container into a dedicated ClowdApp Job with retry logic, and converts the manager init container into a lightweight DB-schema readiness check so that new pods block until migrations complete successfully. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- Consider making
MIGRATION_MAX_RETRIESconfigurable via a ClowdApp parameter (similar toMIGRATION_TIMEOUT) instead of hardcoding'3'in the Job spec so the retry behavior can be tuned per environment without code changes. - It may be worth double-checking that
MIGRATION_TIMEOUTis aligned withMIGRATION_MAX_RETRIESand the 5s sleep inentrypoint.shso the Job’sactiveDeadlineSecondscannot expire mid-retry loop in typical failure scenarios.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider making `MIGRATION_MAX_RETRIES` configurable via a ClowdApp parameter (similar to `MIGRATION_TIMEOUT`) instead of hardcoding `'3'` in the Job spec so the retry behavior can be tuned per environment without code changes.
- It may be worth double-checking that `MIGRATION_TIMEOUT` is aligned with `MIGRATION_MAX_RETRIES` and the 5s sleep in `entrypoint.sh` so the Job’s `activeDeadlineSeconds` cannot expire mid-retry loop in typical failure scenarios.
## Individual Comments
### Comment 1
<location path="deploy/clowdapp.yaml" line_range="313" />
<code_context>
+ command:
+ - ./database_admin/entrypoint.sh
+ env:
+ - {name: MIGRATION_MAX_RETRIES, value: '3'}
+ - {name: LOG_LEVEL, value: '${LOG_LEVEL_DATABASE_ADMIN}'}
+ - {name: DB_DEBUG, value: '${DB_DEBUG_DATABASE_ADMIN}'}
</code_context>
<issue_to_address>
**suggestion:** Consider making MIGRATION_MAX_RETRIES configurable via a template parameter instead of hardcoding '3'.
Align this with how `MIGRATION_TIMEOUT` and other resource values are configured, so the retry policy can be tuned per environment without editing the manifest directly.
Suggested implementation:
```
podSpec:
image: ${IMAGE}:${IMAGE_TAG}
command:
- ./database_admin/entrypoint.sh
env:
- name: MIGRATION_MAX_RETRIES
value: ${{MIGRATION_MAX_RETRIES}}
```
To fully implement the suggestion, you will also need to:
1. Declare a `MIGRATION_MAX_RETRIES` template parameter in this ClowdApp (or the surrounding template/Helm chart), following the same pattern used for `MIGRATION_TIMEOUT` (e.g., add it to the `parameters`/`envTemplate` section with a default value of `3`).
2. Ensure any environment-specific overlays (e.g., dev/stage/prod) can override `MIGRATION_MAX_RETRIES` so it can be tuned per environment without editing this manifest directly.
</issue_to_address>
### Comment 2
<location path="database_admin/entrypoint.sh" line_range="3" />
<code_context>
#!/bin/bash
-set -e -o pipefail # stop on error
+set -o pipefail
MIGRATION_FILES=file://./database_admin/migrations
</code_context>
<issue_to_address>
**issue (bug_risk):** Dropping `set -e` may mask failures in future commands before the migrate loop.
The retry loop correctly handles `migrate` failures, but without `set -e` any other command before or between attempts can fail silently and the script will keep running. To preserve fail-fast behavior while still allowing retries, consider restoring `set -e` and isolating the retry logic around `migrate` (for example, by wrapping `migrate` in a function whose exit code you control).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| command: | ||
| - ./database_admin/entrypoint.sh | ||
| env: | ||
| - {name: MIGRATION_MAX_RETRIES, value: '3'} |
There was a problem hiding this comment.
suggestion: Consider making MIGRATION_MAX_RETRIES configurable via a template parameter instead of hardcoding '3'.
Align this with how MIGRATION_TIMEOUT and other resource values are configured, so the retry policy can be tuned per environment without editing the manifest directly.
Suggested implementation:
podSpec:
image: ${IMAGE}:${IMAGE_TAG}
command:
- ./database_admin/entrypoint.sh
env:
- name: MIGRATION_MAX_RETRIES
value: ${{MIGRATION_MAX_RETRIES}}
To fully implement the suggestion, you will also need to:
- Declare a
MIGRATION_MAX_RETRIEStemplate parameter in this ClowdApp (or the surrounding template/Helm chart), following the same pattern used forMIGRATION_TIMEOUT(e.g., add it to theparameters/envTemplatesection with a default value of3). - Ensure any environment-specific overlays (e.g., dev/stage/prod) can override
MIGRATION_MAX_RETRIESso it can be tuned per environment without editing this manifest directly.
| @@ -1,8 +1,20 @@ | |||
| #!/bin/bash | |||
|
|
|||
| set -e -o pipefail # stop on error | |||
There was a problem hiding this comment.
issue (bug_risk): Dropping set -e may mask failures in future commands before the migrate loop.
The retry loop correctly handles migrate failures, but without set -e any other command before or between attempts can fail silently and the script will keep running. To preserve fail-fast behavior while still allowing retries, consider restoring set -e and isolating the retry logic around migrate (for example, by wrapping migrate in a function whose exit code you control).
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #2236 +/- ##
==========================================
- Coverage 59.06% 59.04% -0.03%
==========================================
Files 138 138
Lines 8848 8848
==========================================
- Hits 5226 5224 -2
- Misses 3076 3078 +2
Partials 546 546
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Context
Manager had db-migration as an init container. With multiple replicas and rolling updates, several new pods could run migration at once and compete for pg_advisory_lock(123). ClowdApp cannot set maxSurge on manager (public web service enabled); a dedicated Job avoids per-replica migration. Also, it is harder to troubleshoot when you have 2 pods trying to do migration and you don't know which one actually does it.
This PR:
db-migrationinit withcheck-for-db(same pattern as listener/evaluator)db-migrationJobentrypoint.shto retry migrate on failureSummary of the new flow:
New deploy → migration Job and new pods start in parallel → new pods block in init until DB schema is upgraded → success = rollout completes, failure = new pods fail init and old pods keep serving.
Summary by Sourcery
Move database migration from the manager init container to a dedicated ClowdApp Job and make migrations more resilient to failures.
New Features:
Enhancements:
Deployment: