ci: split chart-testing CI into per-chart matrix jobs [RHIDP-14729] #437

Open
rm3l wants to merge 11 commits into redhat-developer:main from rm3l:RHIDP-14729--rhdh-chart-split-chart-testing-ci-into-per-chart-matrix-jobs-for-better-failure-debuggability

rm3l (Member) commented Jun 16, 2026
Description of the change

Currently, both the PR (test.yaml) and nightly (nightly.yaml) workflows run ct install against all applicable charts in a single job. When a failure occurs:

  • It is difficult to determine which chart caused the failure without scrolling through lengthy logs.
  • A failure in one chart blocks visibility into the results of other charts.
  • Re-running the workflow re-tests all charts, not just the one that failed.

Example: https://github.com/redhat-developer/rhdh-chart/actions/runs/27310055619/job/80677740243

This PR converts both workflows into dynamic per-chart matrix jobs so that failures are immediately attributable to a specific chart.
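The PR text does not show how the dynamic matrix is built, so the following is only a minimal sketch of one way to do it. The helper names `list_charts` and `charts_to_matrix` are hypothetical, and the sketch assumes chart directory names contain no quotes or backslashes.

```shell
# Hypothetical helper: print the name of every directory under the given
# root (default: charts) that contains a Chart.yaml.
list_charts() {
  local root="${1:-charts}" dir
  for dir in "$root"/*/; do
    if [ -f "${dir}Chart.yaml" ]; then
      basename "$dir"
    fi
  done
}

# Hypothetical helper: turn chart names (one per argument) into a JSON array
# that a workflow could feed to a matrix via fromJSON().
# Assumes names need no JSON escaping.
charts_to_matrix() {
  local json="[" sep="" name
  for name in "$@"; do
    json="${json}${sep}\"${name}\""
    sep=","
  done
  printf '%s]\n' "$json"
}
```

In a discovery job, the result could be written to the job outputs with something like `echo "charts=$(charts_to_matrix $(list_charts charts))" >> "$GITHUB_OUTPUT"`; the output name `charts` is an assumption.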

This will help with the new standalone chart that will be added soon.

It also reverts the Helm v4.2.1 bump (from #430) back to v3.21.1, because Helm v4 causes helm install to hang indefinitely, ignoring --timeout. The regression from that Renovate PR went undetected because changes to CI files did not trigger chart tests; this PR closes that gap as well. The Helm v4 hang itself still needs to be troubleshot separately.

Which issue(s) does this PR fix or relate to

How to test changes / Special notes to the reviewer

Example workflows:

Checklist

  • N/A — this PR only changes CI workflows and config, no chart code was modified.

rm3l added 11 commits June 11, 2026 18:16
When a chart test fails in CI, the single-job approach makes it hard
to tell which chart broke — you have to scroll through lengthy logs
to find the culprit. A failure in one chart also blocks visibility
into the results of the others, and re-running re-tests everything.

By giving each chart its own matrix job, failures are immediately
attributable from the job name in the GitHub Actions UI, unrelated
charts keep running, and only the broken chart needs to be re-run.

Assisted-by: Claude

…hart-split-chart-testing-ci-into-per-chart-matrix-jobs-for-better-failure-debuggability

Helm operations (test, uninstall) on the main branch consistently hit
the 500s timeout ceiling while completing in seconds on release branches.
Helm produces zero output during these waits, making it impossible to
determine what it's blocking on.

Adding --debug to helm-extra-args will surface Helm-level details
(hooks, resource waits, etc.) to help diagnose the root cause.

Assisted-by: Claude

Helm waits silently during install/test/uninstall operations, producing
no output for up to 500s. This makes it impossible to see what the
cluster is doing during those stalls.

Add a background loop that prints pod status and recent events every
30s while ct install runs, giving visibility into what Kubernetes
resources are stuck or pending.

Assisted-by: Claude

Background processes die when their parent step exits, so the
monitoring loop from a separate step only ran once. Move it into the
ct install step so it stays alive for the duration of the test run.

Assisted-by: Claude
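
The commit above keeps the monitoring loop alive by running it in the same step as ct install. A rough sketch of that pattern follows; it is not the PR's actual script. `run_with_heartbeat` and `cluster_snapshot` are hypothetical names, and the exact kubectl commands are an assumption, since the commit message only mentions pod status and recent events.

```shell
# Hypothetical snapshot of cluster state (the real commands are not shown in the PR).
cluster_snapshot() {
  kubectl get pods --all-namespaces
  kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 20
}

# run_with_heartbeat INTERVAL_SECONDS HEARTBEAT_CMD MAIN_CMD [ARGS...]
# Runs HEARTBEAT_CMD every INTERVAL_SECONDS in a background subshell of the
# same shell that runs MAIN_CMD, then stops the loop and returns MAIN_CMD's
# exit status.
run_with_heartbeat() {
  local interval="$1" heartbeat="$2" hb_pid status=0
  shift 2
  (
    sleep_pid=""
    # On TERM, also stop the pending sleep so nothing lingers after the step.
    trap '[ -n "$sleep_pid" ] && kill "$sleep_pid" 2>/dev/null; exit 0' TERM
    while true; do
      "$heartbeat" || true        # a failing snapshot must not end the loop
      sleep "$interval" &
      sleep_pid=$!
      wait "$sleep_pid" || true
    done
  ) &
  hb_pid=$!
  "$@" || status=$?               # capture the status even under `set -e`
  kill "$hb_pid" 2>/dev/null || true
  wait "$hb_pid" 2>/dev/null || true
  return "$status"
}
```

A step could then call, for example, `run_with_heartbeat 30 cluster_snapshot ct install --config ct-install.yaml`; the arguments passed to ct here are illustrative.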
The status job only checked for "failure", letting "cancelled" and
other non-success states pass as green. Check for success/skipped
instead, so any unexpected result correctly fails the job.

Assisted-by: Claude
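
The allow-list check described in the commit above can be sketched as follows. The function name `all_results_ok` is hypothetical, and the way job results reach the script (for example from `needs.<job>.result`) is left to the workflow.

```shell
# Return 0 only if every given job result is "success" or "skipped".
# Anything else ("failure", "cancelled", an empty string, ...) fails the check.
all_results_ok() {
  local result
  for result in "$@"; do
    case "$result" in
      success|skipped) ;;
      *)
        echo "unexpected job result: '$result'" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

An allow-list fails closed, unlike a check that only looks for "failure": any result outside the two accepted values turns the status job red.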
Helm v4.2.1 causes helm install to hang indefinitely, ignoring the
--timeout flag entirely. No chart pods are ever created and helm
produces no output until the runner kills it after 2+ hours.

This reverts the Helm version change from commit 26c43ea.

Assisted-by: Claude

Make the background cluster monitoring conditional on the
TEST_MONITORING_HEARTBEAT_ENABLED repo variable (default: false)
to avoid noisy logs in normal runs.

Add a 2-hour timeout to test jobs to prevent runaway runs like
the Helm v4 hang that ran until the runner killed it.

Assisted-by: Claude

Changes to the test-charts action or test workflow were not triggering
any chart tests because discover-charts only looked at charts/ changes.
This is how the Helm v4 regression went undetected.

Now detect changes to .github/actions/test-charts/,
.github/workflows/test.yaml, ct.yaml, and ct-install.yaml and test
all charts when they are modified.

Assisted-by: Claude
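
One possible shape for the detection rule in the commit above is sketched below, under stated assumptions: the function name `charts_to_test` is hypothetical, the changed paths arrive on stdin (for instance from `git diff --name-only`), and the full chart list is passed as arguments. The four CI paths are the ones named in the commit message.

```shell
# charts_to_test ALL_CHARTS... < changed-paths
# Prints the charts to test, one per line:
#   - every chart in ALL_CHARTS if any CI-related path changed,
#   - otherwise only the charts that have changes under charts/<name>/.
charts_to_test() {
  local path chart run_all=0 selected=""
  while IFS= read -r path; do
    case "$path" in
      .github/actions/test-charts/*|.github/workflows/test.yaml|ct.yaml|ct-install.yaml)
        run_all=1
        ;;
      charts/*/*)
        chart="${path#charts/}"   # drop the leading "charts/"
        chart="${chart%%/*}"      # keep only the chart directory name
        selected="${selected}${chart}
"
        ;;
    esac
  done
  if [ "$run_all" -eq 1 ]; then
    if [ "$#" -gt 0 ]; then
      printf '%s\n' "$@"
    fi
    return 0
  fi
  printf '%s' "$selected" | sort -u
}
```

For example, `git diff --name-only origin/main...HEAD | charts_to_test chart-a chart-b` prints both charts when ct.yaml changed, and only chart-a when nothing outside charts/chart-a/ changed. The chart names and the diff range are placeholders.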
rm3l changed the title from "RHIDP-14729: split chart-testing CI into per-chart matrix jobs" to "ci: split chart-testing CI into per-chart matrix jobs [RHIDP-14729]" on Jun 16, 2026
rm3l marked this pull request as ready for review June 16, 2026 15:49
rm3l requested a review from a team as a code owner June 16, 2026 15:49