RHINENG-27056: move db-migration to a separate job #2236

Open
TenSt wants to merge 1 commit into RedHatInsights:master from TenSt:stepan/RHINENG-27056-move-db-migration-to-a-job

@TenSt TenSt commented Jun 18, 2026

Collaborator

Context

Manager previously ran db-migration as an init container. With multiple replicas and rolling updates, several new pods could start the migration at once and compete for pg_advisory_lock(123). Because manager has the public web service enabled, ClowdApp cannot set maxSurge on it, so the rollout surge cannot be capped; a dedicated Job avoids per-replica migration instead. Troubleshooting was also harder: with two pods attempting the migration, it was unclear which one actually performed it.

This PR:

  • Moves DB migration from manager init container to a single ClowdApp Job to avoid concurrent migrators during rollout
  • Replaces manager db-migration init with check-for-db (same pattern as listener/evaluator)
  • Adds db-migration Job
  • Updates entrypoint.sh to retry migrate on failure

Summary of the new flow:

New deploy → migration Job and new pods start in parallel → new pods block in init until DB schema is upgraded → success = rollout completes, failure = new pods fail init and old pods keep serving.
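
The init-container half of this flow amounts to a polling loop. The sketch below only illustrates that idea and is not the contents of check-upgraded.sh: the names `wait_for_schema`, `schema_version`, `FAKE_SCHEMA_VERSION`, `MAX_CHECKS` and `CHECK_INTERVAL` are all hypothetical, and a real check would query the database rather than read a variable.

```shell
#!/bin/bash
# Hypothetical sketch of a "wait for upgraded schema" init step.
# schema_version is a stand-in for a real database query.
schema_version() {
    echo "${FAKE_SCHEMA_VERSION:-0}"
}

# Poll until the reported version equals the expected one, or give up.
wait_for_schema() {
    local expected=$1
    local max_checks=${MAX_CHECKS:-60}
    local interval=${CHECK_INTERVAL:-5}
    local i
    for ((i = 1; i <= max_checks; i++)); do
        if [ "$(schema_version)" = "$expected" ]; then
            echo "schema is at version $expected"
            return 0
        fi
        echo "check $i/$max_checks: schema not upgraded yet"
        if ((i < max_checks)); then
            sleep "$interval"
        fi
    done
    return 1   # a non-zero exit fails the init container
}
```

If the function returns non-zero, the init container fails and the new pod never becomes ready, which matches the "old pods keep serving" outcome described above.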

Summary by Sourcery

Move database migration from the manager init container to a dedicated ClowdApp Job and make migrations more resilient to failures.

New Features:

  • Introduce a dedicated db-migration Job to run schema migrations independently of manager pods.

Enhancements:

  • Replace the manager db-migration init container with a lightweight check-for-db init container that only waits for the upgraded schema.
  • Add configurable migration timeout and retry logic for the migration entrypoint to improve robustness during rollouts.

Deployment:

  • Configure a MIGRATION_TIMEOUT parameter to control the db-migration Job execution time.

@TenSt TenSt requested a review from a team as a code owner June 18, 2026 13:31

sourcery-ai Bot commented Jun 18, 2026

Reviewer's Guide

Moves database migrations from the manager pod’s init container into a dedicated ClowdApp Job with retry logic, and converts the manager init container into a lightweight DB-schema readiness check so that new pods block until migrations complete successfully.

File-Level Changes

Change: Replace manager init-container migration with a DB readiness check script so manager pods only start once the schema is upgraded.
Details:
  • Rename the manager init container from db-migration to check-for-db and point its command to ./database_admin/check-upgraded.sh instead of entrypoint.sh
  • Remove database admin–specific environment variables from the manager init container, leaving only POD_CONFIG
Files: deploy/clowdapp.yaml

Change: Introduce a dedicated one-shot db-migration Job to run schema migrations centrally during deployments.
Details:
  • Add a ClowdApp Job named db-migration with completions=1 and parallelism=1 using the same database_admin image and entrypoint as before
  • Configure the migration Job with MIGRATION_MAX_RETRIES, database-related environment variables, and resource requests/limits appropriate for database admin work
  • Add a MIGRATION_TIMEOUT ClowdApp parameter and wire it to the Job’s activeDeadlineSeconds
Files: deploy/clowdapp.yaml

Change: Add retry behavior to the migration entrypoint to make migrations more robust and controllable from the Job.
Details:
  • Stop using set -e so the script can implement custom retry logic while still failing the Job on repeated errors
  • Introduce MIGRATION_MAX_RETRIES (default 1) and loop over migration attempts with a short sleep between failures
  • Exit successfully on the first successful migration run and exit with failure after the last failed attempt
Files: database_admin/entrypoint.sh
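
The retry behavior described for entrypoint.sh can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual script: `run_migration`, `MIGRATE_CMD` and `RETRY_SLEEP` are hypothetical names, and only `MIGRATION_MAX_RETRIES` with its default of 1 comes from the description above.

```shell
#!/bin/bash
set -o pipefail

# Placeholder for the real migrate invocation; it fails unless MIGRATE_CMD is set.
run_migration() {
    ${MIGRATE_CMD:-false}
}

migrate_with_retries() {
    local max_retries=${MIGRATION_MAX_RETRIES:-1}  # default of 1, per the description
    local pause=${RETRY_SLEEP:-5}                  # hypothetical variable name
    local attempt
    for ((attempt = 1; attempt <= max_retries; attempt++)); do
        echo "migration attempt $attempt/$max_retries"
        if run_migration; then
            return 0        # first success ends the loop
        fi
        if ((attempt < max_retries)); then
            sleep "$pause"  # short pause between failed attempts
        fi
    done
    return 1                # every attempt failed
}
```

A script built this way would call `migrate_with_retries` as its last command, so the Job's exit status mirrors the outcome of the final attempt.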

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high-level feedback:

  • Consider making MIGRATION_MAX_RETRIES configurable via a ClowdApp parameter (similar to MIGRATION_TIMEOUT) instead of hardcoding '3' in the Job spec so the retry behavior can be tuned per environment without code changes.
  • It may be worth double-checking that MIGRATION_TIMEOUT is aligned with MIGRATION_MAX_RETRIES and the 5s sleep in entrypoint.sh so the Job’s activeDeadlineSeconds cannot expire mid-retry loop in typical failure scenarios.
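
One way to sanity-check the second point is to compute the worst-case duration of the retry loop and compare it with the deadline. The helper names below are hypothetical, the per-attempt duration is an assumed input, and the formula assumes the script sleeps between attempts but not after the last one.

```shell
#!/bin/bash
# Worst case: every attempt runs for the full per-attempt duration and fails,
# with one sleep between consecutive attempts.
worst_case_seconds() {
    local retries=$1 per_attempt=$2 sleep_s=$3
    echo $(( retries * per_attempt + (retries - 1) * sleep_s ))
}

# Succeeds when the worst case fits inside the deadline (activeDeadlineSeconds).
fits_deadline() {
    local deadline=$1 retries=$2 per_attempt=$3 sleep_s=$4
    [ "$(worst_case_seconds "$retries" "$per_attempt" "$sleep_s")" -le "$deadline" ]
}
```

With illustrative numbers, 3 attempts of up to 120 seconds each plus two 5-second sleeps give 3 × 120 + 2 × 5 = 370 seconds, so a deadline of 360 would cut off the last attempt while 600 would not.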

Comment thread deploy/clowdapp.yaml
command:
- ./database_admin/entrypoint.sh
env:
- {name: MIGRATION_MAX_RETRIES, value: '3'}

suggestion: Consider making MIGRATION_MAX_RETRIES configurable via a template parameter instead of hardcoding '3'.

Align this with how MIGRATION_TIMEOUT and other resource values are configured, so the retry policy can be tuned per environment without editing the manifest directly.

Suggested implementation:

      podSpec:
        image: ${IMAGE}:${IMAGE_TAG}
        command:
          - ./database_admin/entrypoint.sh
        env:
        - name: MIGRATION_MAX_RETRIES
          value: '${MIGRATION_MAX_RETRIES}'  # quoted ${...}, since env values must be strings

To fully implement the suggestion, you will also need to:

  1. Declare a MIGRATION_MAX_RETRIES template parameter in this ClowdApp (or the surrounding template/Helm chart), following the same pattern used for MIGRATION_TIMEOUT (e.g., add it to the parameters/envTemplate section with a default value of 3).
  2. Ensure any environment-specific overlays (e.g., dev/stage/prod) can override MIGRATION_MAX_RETRIES so it can be tuned per environment without editing this manifest directly.

Comment thread database_admin/entrypoint.sh
@@ -1,8 +1,20 @@
#!/bin/bash

set -e -o pipefail # stop on error

issue (bug_risk): Dropping set -e may mask failures in future commands before the migrate loop.

The retry loop correctly handles migrate failures, but without set -e any other command before or between attempts can fail silently and the script will keep running. To preserve fail-fast behavior while still allowing retries, consider restoring set -e and isolating the retry logic around migrate (for example, by wrapping migrate in a function whose exit code you control).
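
A minimal sketch of that suggestion, assuming a generic wrapper is acceptable: `set -e` stays enabled, and the command being retried runs as an `if` condition, which errexit does not abort. The `retry` function and the `migrate_once` name in the usage comment are hypothetical.

```shell
#!/bin/bash
set -e -o pipefail   # fail-fast stays on for everything outside the retry

# retry MAX DELAY CMD [ARGS...]: run CMD up to MAX times, pausing DELAY seconds
# between failures. CMD runs as an `if` condition, which errexit does not abort.
retry() {
    local max=$1 delay=$2 n
    shift 2
    for ((n = 1; n <= max; n++)); do
        if "$@"; then
            return 0
        fi
        echo "attempt $n/$max failed" >&2
        if ((n < max)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical usage; migrate_once stands in for the real migrate call:
#   retry "${MIGRATION_MAX_RETRIES:-1}" 5 migrate_once
```

One caveat with this pattern: bash also suspends errexit inside any function invoked from an `if` condition, so a multi-step migrate function would need its own explicit error checks to fail fast internally.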

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.04%. Comparing base (704b877) to head (9c55f7c).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2236      +/-   ##
==========================================
- Coverage   59.06%   59.04%   -0.03%     
==========================================
  Files         138      138              
  Lines        8848     8848              
==========================================
- Hits         5226     5224       -2     
- Misses       3076     3078       +2     
  Partials      546      546              
Flag Coverage Δ
unittests 59.04% <ø> (-0.03%) ⬇️
