ci: split chart-testing CI into per-chart matrix jobs [RHIDP-14729] #437
Open
rm3l wants to merge 11 commits into
When a chart test fails in CI, the single-job approach makes it hard to tell which chart broke — you have to scroll through lengthy logs to find the culprit. A failure in one chart also blocks visibility into the results of the others, and re-running re-tests everything. By giving each chart its own matrix job, failures are immediately attributable from the job name in the GitHub Actions UI, unrelated charts keep running, and only the broken chart needs to be re-run. Assisted-by: Claude
…hart-split-chart-testing-ci-into-per-chart-matrix-jobs-for-better-failure-debuggability
Helm operations (test, uninstall) on the main branch consistently hit the 500s timeout ceiling while completing in seconds on release branches. Helm produces zero output during these waits, making it impossible to determine what it's blocking on. Adding --debug to helm-extra-args will surface Helm-level details (hooks, resource waits, etc.) to help diagnose the root cause. Assisted-by: Claude
Helm waits silently during install/test/uninstall operations, producing no output for up to 500s. This makes it impossible to see what the cluster is doing during those stalls. Add a background loop that prints pod status and recent events every 30s while ct install runs, giving visibility into what Kubernetes resources are stuck or pending. Assisted-by: Claude
Background processes die when their parent step exits, so the monitoring loop from a separate step only ran once. Move it into the ct install step so it stays alive for the duration of the test run. Assisted-by: Claude
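A minimal sketch of the in-step heartbeat described in the two commits above. The PR's actual script is not shown here, so the function name `run_with_heartbeat`, its arguments, and the exact `kubectl` queries are illustrative assumptions. The loop runs in the background of the same shell as the wrapped command and is stopped once that command finishes.

```shell
# Hypothetical helper (not taken from the PR): run a command while a
# background loop prints cluster state every <interval> seconds.
run_with_heartbeat() {
  local interval="$1"
  shift

  (
    while true; do
      echo "--- heartbeat $(date -u +%H:%M:%S) ---"
      # Probe failures are ignored so the heartbeat never breaks the run.
      kubectl get pods --all-namespaces || true
      kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 20 || true
      sleep "$interval"
    done
  ) &
  local monitor_pid=$!

  # Run the wrapped command and remember its exit status.
  local status=0
  "$@" || status=$?

  # Stop the heartbeat, then report the wrapped command's status.
  kill "$monitor_pid" 2>/dev/null || true
  wait "$monitor_pid" 2>/dev/null || true
  return "$status"
}
```

Under these assumptions, the test step would call something like `run_with_heartbeat 30 ct install --config ct-install.yaml`, so the loop lives exactly as long as the test run.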
The status job only checked for "failure", letting "cancelled" and other non-success states pass as green. Check for success/skipped instead, so any unexpected result correctly fails the job. Assisted-by: Claude
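The allow-list check described above could look roughly like the following. This is a sketch rather than the PR's actual step; the function name `all_results_ok` is hypothetical. It treats only `success` and `skipped` as acceptable, so `failure`, `cancelled`, or any other value fails the check.

```shell
# Hypothetical helper (not taken from the PR): succeed only if every
# job result passed as an argument is "success" or "skipped".
all_results_ok() {
  local result
  for result in "$@"; do
    case "$result" in
      success|skipped) ;;  # acceptable outcomes
      *)
        echo "unexpected job result: $result" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

In a workflow, the arguments would typically be the `needs.<job_id>.result` values of the upstream jobs.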
Helm v4.2.1 causes helm install to hang indefinitely, ignoring the --timeout flag entirely. No chart pods are ever created and helm produces no output until the runner kills it after 2+ hours. This reverts the Helm version change from commit 26c43ea. Assisted-by: Claude
Make the background cluster monitoring conditional on the TEST_MONITORING_HEARTBEAT_ENABLED repo variable (default: false) to avoid noisy logs in normal runs. Add a 2-hour timeout to test jobs to prevent runaway runs like the Helm v4 hang that ran until the runner killed it. Assisted-by: Claude
Changes to the test-charts action or test workflow were not triggering any chart tests because discover-charts only looked at charts/ changes. This is how the Helm v4 regression went undetected. Now detect changes to .github/actions/test-charts/, .github/workflows/test.yaml, ct.yaml, and ct-install.yaml and test all charts when they are modified. Assisted-by: Claude
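A sketch of the detection rule described above, assuming the CI files sit at the paths named in the commit message, relative to the repository root. The function name `charts_to_test` and the `all` sentinel are hypothetical; the PR's discover-charts logic may be structured differently.

```shell
# Hypothetical helper (not taken from the PR): read changed file paths on
# stdin. Print "all" if any CI-related file changed; otherwise print the
# unique chart directory names found under charts/.
charts_to_test() {
  local files
  files="$(cat)"

  # CI inputs whose modification should trigger tests for every chart.
  if printf '%s\n' "$files" | grep -Eq '^(\.github/actions/test-charts/|\.github/workflows/test\.yaml$|ct\.yaml$|ct-install\.yaml$)'; then
    echo "all"
    return 0
  fi

  # Otherwise, test only the charts that were touched.
  printf '%s\n' "$files" | sed -n 's|^charts/\([^/]*\)/.*|\1|p' | sort -u
}
```

It could be fed from `git diff --name-only origin/main...HEAD`, with the output then turned into the job matrix.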
Description of the change
Currently, both the PR (test.yaml) and nightly (nightly.yaml) workflows run ct install against all applicable charts in a single job. When a failure occurs, it is hard to tell which chart broke without scrolling through lengthy logs. Example: https://github.com/redhat-developer/rhdh-chart/actions/runs/27310055619/job/80677740243
This PR converts both workflows into dynamic per-chart matrix jobs so that failures are immediately attributable to a specific chart.
This will help with the new standalone chart that will be added soon.
It also reverts the Helm v4.2.1 bump (from #430) back to v3.21.1, because Helm v4 causes helm install to hang indefinitely, ignoring --timeout. The regression from the Renovate PR went undetected because CI file changes did not trigger chart tests; this is now fixed. The Helm v4 hang itself still needs to be troubleshot separately.
Which issue(s) does this PR fix or relate to
How to test changes / Special notes to the reviewer
Example workflows:
Checklist