Skip to content

[WIP] Replace failed-tests matviews with direct partitioned query#3647

Draft
mstaeble wants to merge 1 commit into
openshift:mainfrom
mstaeble:remove-failed-tests-matview
Draft

[WIP] Replace failed-tests matviews with direct partitioned query#3647
mstaeble wants to merge 1 commit into
openshift:mainfrom
mstaeble:remove-failed-tests-matview

Conversation

@mstaeble

@mstaeble mstaeble commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • The prow_job_failed_tests_by_day and prow_job_failed_tests_by_hour materialized views scanned all ~3,500 partitions of prow_job_run_tests with no time or release filter, taking ~3 minutes per refresh.
  • A direct query with partition pruning keys (prow_job_run_release + prow_job_run_timestamp BETWEEN start AND end) eliminates the matview refresh entirely and queries the partitioned table directly.
  • Removes the matview definitions and replaces the matview read in PrintJobAnalysisJSONFromDB with a direct query using the release and time range already available from the caller.
  • Adds error checking on the query result.
  • Tested against staging database via the local UI — job analysis page renders test failure counts correctly.

Benchmarks (staging, release 4.22 with ~1,000 job IDs, enable_partitionwise_join = on)

Approach Cold Warm
Matview refresh (no filters) ~3 minutes ~3 minutes
Matview read 53ms 46ms
Direct query (release + 14-day range) 257ms 249ms

The direct query is slower than reading the pre-aggregated matview (249ms vs 46ms), but eliminates the 3-minute background refresh cycle. End-to-end page load times observed: ~2.5s cold after server restart, <1.5s warm.

Test plan

  • Full project compiles with go build ./...
  • go vet clean
  • Manually tested job analysis page against staging database
  • CI passes

🤖 Generated with Claude Code

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Removes prow_job_failed_tests_by_day_matview and prow_job_failed_tests_by_hour_matview from PostgresMatViews registry and their SQL definition. PrintJobAnalysisJSONFromDB is updated to query prow_job_run_tests directly with date_trunc-based period grouping. Benchmark cases for the removed matviews are also removed.

Changes

Failed-Tests Matview Removal and Live Query Migration

Layer / File(s) Summary
Remove matview definitions from database registry
pkg/db/views.go
Removes prow_job_failed_tests_by_day_matview and prow_job_failed_tests_by_hour_matview entries from PostgresMatViews and deletes the prowJobFailedTestsMatView SQL constant.
Replace matview query with live join and validation
pkg/api/job_analysis.go
PrintJobAnalysisJSONFromDB now queries prow_job_run_tests joined to tests directly, computing period via date_trunc on prow_job_run_timestamp, filtering by failure status and release, constraining by timestamp window, and adding explicit Scan error handling.
Remove matview benchmark cases
pkg/flags/postgres_benchmarking_test.go
Removes MatviewFailedTestsByDay and MatviewFailedTestsByHour benchmark entries from getMatviewBenchmarkCases as those matviews no longer exist.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

ready-for-human-review

Suggested reviewers

  • deepsm007
  • xueqzhan
🚥 Pre-merge checks | ✅ 18 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Go Error Handling ⚠️ Warning Line 69 in pkg/api/job_analysis.go contains sumResults.Scan(&sums) which ignores the returned error without checking or handling it, violating the custom check requirement to never ignore returne... Check the error from sumResults.Scan(&sums) and return it if non-nil, similar to the pattern used in lines 112-122 for the test failures query: if err := sumResults.Scan(&sums).Error; err != nil { return result, err }
Test Coverage For New Features ⚠️ Warning PR modifies PrintJobAnalysisJSONFromDB (an existing function handling DB queries) to replace matview reads with direct partitioned queries, including error handling addition. No unit tests were add... Add unit test for PrintJobAnalysisJSONFromDB covering both the successful query path and error scenarios, or alternatively document why existing integration tests are sufficient.
✅ Passed checks (18 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Sql Injection Prevention ✅ Passed All SQL queries in the modified code use GORM's parameterized query API with proper placeholders (?). No direct string concatenation or unsafe SQL construction found. All user inputs (release, star...
Excessive Css In React Should Use Styles ✅ Passed PR contains only Go backend code changes (database queries, materialized views, benchmarks). No React components or inline CSS are modified, making the check inapplicable.
Single Responsibility And Clear Naming ✅ Passed PR improves SRP by removing unnecessary materialized views, consolidating test failure logic, and adding explicit error checking. Naming remains consistent with codebase conventions; no new naming...
Feature Documentation ✅ Passed Changes are database infrastructure optimizations (materialized view removal) in job_analysis.go and views.go. Feature documentation (docs/features/job-analysis-symptoms.md) covers artifact pattern...
Stable And Deterministic Test Names ✅ Passed No Ginkgo tests are used in this codebase or this PR. All benchmark test names are static and deterministic with no dynamic values, timestamps, or generated identifiers.
Test Structure And Quality ✅ Passed The custom check for Ginkgo test code is not applicable to this PR. The repository uses standard Go testing (testing package), not Ginkgo. No Ginkgo tests exist in the codebase or in the PR changes.
Microshift Test Compatibility ✅ Passed This PR does not add any new Ginkgo e2e tests. The changes are database query optimizations and benchmark test removals, not e2e test additions. The custom check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not add any Ginkgo e2e tests. It only modifies database queries and schema (job_analysis.go, views.go) and Go standard benchmark tests (postgres_benchmarking_test.go). SNO compatibility che...
Topology-Aware Scheduling Compatibility ✅ Passed PR contains only database/Go library changes (query optimization, materialized view removal, benchmark updates) with no Kubernetes manifests, operator code, controllers, or scheduling constraints t...
Ote Binary Stdout Contract ✅ Passed All modified files maintain OTE Binary Stdout Contract: library packages contain no process-level stdout writes, and test file writes are confined to test function bodies (allowed per spec).
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add any new Ginkgo e2e tests. It modifies API logic, removes DB view definitions, and removes benchmark cases from standard Go tests. The custom check only applies when new Ginkgo...
No-Weak-Crypto ✅ Passed PR modifies database optimization only (replaces materialized views with direct queries). No crypto imports, weak algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementation...
Container-Privileges ✅ Passed PR modifies only Go source files (pkg/api/job_analysis.go, pkg/db/views.go, pkg/flags/postgres_benchmarking_test.go) for database query optimization. No container or Kubernetes manifest files were...
No-Sensitive-Data-In-Logs ✅ Passed No logging statements exposing sensitive data (passwords, tokens, API keys, PII, session IDs, or hostnames) were added. Error handling returns errors without logging them.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: replacing failed-tests materialized views with direct partitioned queries, which is the core objective of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mstaeble
Once this PR has been reviewed and has the lgtm label, please assign stbenjam for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mstaeble mstaeble force-pushed the remove-failed-tests-matview branch 2 times, most recently from d553fc3 to 20e1d35 Compare June 17, 2026 21:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/api/job_analysis.go (1)

112-114: 💤 Low value

Consider moving period validation earlier to ensure both queries use the same value.

This validation only affects the test-failure query (lines 116-127). The earlier job-runs aggregation (lines 51-67) passes period directly to GORM. If an invalid period reaches this function, the two queries could behave inconsistently. Moving the validation before line 51 would ensure uniform behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/job_analysis.go` around lines 112 - 114, The period validation logic
that checks if period is not equal to PeriodDay and PeriodHour (and defaults to
PeriodDay if invalid) is currently positioned after the job-runs aggregation
query. Move this validation block to the beginning of the function, before the
aggregation query that uses the period parameter, so that both the early
aggregation query and the later test-failure query operate with the same
validated period value and ensure consistent behavior throughout the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/job_analysis.go`:
- Line 124: The GROUP BY clause at line 124 includes prow_job_id, which causes
the query to return separate counts for each job. However, when aggregating
results later (line 132), the code uses assignment rather than addition, causing
per-job counts to be overwritten and discarded. To count total test failures
across all selected jobs per period, remove pjrt.prow_job_id from the Group()
method call in the query construction so the results are grouped only by test
name and date truncation without the job ID dimension.

---

Nitpick comments:
In `@pkg/api/job_analysis.go`:
- Around line 112-114: The period validation logic that checks if period is not
equal to PeriodDay and PeriodHour (and defaults to PeriodDay if invalid) is
currently positioned after the job-runs aggregation query. Move this validation
block to the beginning of the function, before the aggregation query that uses
the period parameter, so that both the early aggregation query and the later
test-failure query operate with the same validated period value and ensure
consistent behavior throughout the function.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: cf4f597c-12b0-43d8-bba6-3422519a9f0c

📥 Commits

Reviewing files that changed from the base of the PR and between 0ecf781 and d553fc3.

📒 Files selected for processing (3)
  • pkg/api/job_analysis.go
  • pkg/db/views.go
  • pkg/flags/postgres_benchmarking_test.go
💤 Files with no reviewable changes (2)
  • pkg/db/views.go
  • pkg/flags/postgres_benchmarking_test.go

Comment thread pkg/api/job_analysis.go Outdated
@mstaeble mstaeble force-pushed the remove-failed-tests-matview branch from 20e1d35 to 7f1bea2 Compare June 17, 2026 21:56
The prow_job_failed_tests_by_day and prow_job_failed_tests_by_hour
materialized views scanned all ~3,500 partitions of prow_job_run_tests
with no time or release filter, taking ~3 minutes per refresh.

A direct query with partition pruning keys (prow_job_run_release and
prow_job_run_timestamp BETWEEN start AND end) runs in ~13ms — faster
than reading the matview itself (46ms) — making the matviews
unnecessary overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mstaeble mstaeble force-pushed the remove-failed-tests-matview branch from 7f1bea2 to 935b968 Compare June 18, 2026 13:48
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 18, 2026
@mstaeble mstaeble changed the title Replace failed-tests matviews with direct partitioned query [WIP] Replace failed-tests matviews with direct partitioned query Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant