Add Chaibot test failure triage workflow to ci-chat-bot #80476

Open
chaclark1974 wants to merge 15 commits into openshift:main from chaclark1974:chaibot-test-triage
chaclark1974 commented Jun 12, 2026

Add Chaibot test failure triage workflow to ci-chat-bot

This PR adds Chaibot, an AI-powered Slack workflow that automatically
triages and analyzes test failures posted in designated Slack channels.

Overview

Chaibot extends the existing ci-chat-bot service to monitor Slack channels
(initially #opp-discussion) for test failure messages, analyze failures using
OpenAI GPT-4, and post detailed triage analysis in threads.

What's Added

Configuration Files

  • core-services/ci-chat-bot/triage-config.yaml - Main Chaibot configuration
  • clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml - Kubernetes ConfigMap
  • clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml - Deployment patch reference and Prometheus alert rules
  • core-services/ci-secret-bootstrap/chaibot-secret-config.yaml - Secret config guide

Deployment Changes

  • clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml - Updated with:
    • Chaibot triage-config and secrets volumes
    • CHAIBOT_ENABLED and OPENAI_API_KEY environment variables
    • --enable-triage command line argument

Documentation

  • docs/chaibot-test-failure-triage.md - Comprehensive user/admin guide
  • core-services/ci-chat-bot/CHAIBOT.md - Quick reference
  • CHAIBOT_QUICKSTART.md - Quick start guide
  • DEPLOY_CHAIBOT.md - Deployment instructions

Features

  • Automatic Detection: Monitors channels for Prow job failures
  • AI Analysis: Uses OpenAI to categorize failures (infrastructure, flaky, bug, config)
  • Historical Context: Integrates with Sippy for past failure patterns
  • JIRA Integration: Searches for related known issues
  • Actionable Output: Posts analysis with recommendations in Slack threads
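
As a rough illustration of the automatic detection step, the sketch below scans a Slack message for Prow job links. The regular expression and the `find_prow_urls` and `job_name` helpers are hypothetical, modeled only on the example link quoted from docs/chaibot-test-failure-triage.md later in this thread; the actual detection patterns are defined in triage-config.yaml and are not reproduced here.

```python
import re

# Hypothetical pattern: matches links shaped like the example Prow URL.
# Slack may wrap links as <url|label>, so '<', '>' and '|' end the match.
PROW_URL_RE = re.compile(r"https://prow\.ci\.openshift\.org/view/gs/[^\s<>|]+")


def find_prow_urls(message: str) -> list[str]:
    """Return every Prow job URL found in a Slack message."""
    return PROW_URL_RE.findall(message)


def job_name(url: str) -> str:
    """Assume the last path segment is the build ID and the one before it is the job name."""
    parts = url.rstrip("/").split("/")
    return parts[-2]
```

A message without a matching link yields an empty list, which a caller could treat as "nothing to triage".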

Example Output

When a failure is posted, Chaibot responds with:

  • Root cause identification (with confidence %)
  • Evidence from logs
  • Historical failure patterns
  • Specific recommendations
  • Links to Sippy, logs, and related JIRA issues

Configuration Required

Before this can function, the following must be configured:

  1. Slack Channel ID: Update chaibot-configmap.yaml with the actual channel ID for #opp-discussion
  2. OpenAI API Key: Add it to ci-secret-bootstrap (see chaibot-secret-config.yaml)
  3. Slack App Permissions: Ensure the ci-chat-bot app has the required OAuth scopes

Implementation Note

⚠️ This PR provides the complete configuration and deployment manifests, but
requires code implementation in openshift/ci-tools (cmd/ci-chat-bot) to
actually process the configuration and perform analysis.

Without the code implementation, the deployment will succeed but Chaibot
will not respond to messages (the --enable-triage flag will be ignored).

Cost Estimate

  • GPT-4: ~$0.03/analysis (~$90/month at 100 failures/day)
  • GPT-3.5-turbo: ~$0.003/analysis (~$9/month at 100 failures/day)
  • Rate limiting configured to prevent cost overruns
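
The monthly figures above follow from multiplying the per-analysis cost by the daily failure count and the number of days. The sketch below reproduces that arithmetic; the 30-day month and the `monthly_cost` helper are assumptions made for illustration, since the estimate does not state how the month is counted.

```python
def monthly_cost(cost_per_analysis: float, failures_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend, assuming one analysis per failure and a 30-day month."""
    return cost_per_analysis * failures_per_day * days


# Per-analysis prices are the estimates quoted above, not measured values.
gpt4 = round(monthly_cost(0.03, 100), 2)
gpt35 = round(monthly_cost(0.003, 100), 2)
print(f"GPT-4: ${gpt4}/month, GPT-3.5-turbo: ${gpt35}/month")
```

The cost scales linearly, so doubling the failure volume doubles the estimate.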

Testing

After deployment:

  1. Update the ConfigMap with the actual Slack channel ID
  2. Configure the OpenAI API key secret
  3. Post a test failure message with a Prow URL in #opp-discussion
  4. Verify that Chaibot responds in the thread within 60 seconds

Related

  • Extends existing ci-chat-bot service
  • Integrates with Sippy for historical data
  • Complements retester for automated failure handling

/cc @openshift/test-platform

Summary by CodeRabbit

This PR adds Chaibot, an AI-powered Slack integration that automatically triages OpenShift CI test failures. When Prow test failure messages are posted to designated Slack channels, Chaibot analyzes them using OpenAI GPT models and provides categorized triage results (infrastructure issues, flaky tests, bugs, or configuration problems) with actionable recommendations and relevant documentation links.

Configuration and Infrastructure Changes:
The PR introduces the complete configuration and Kubernetes deployment manifests required for Chaibot:

  • Core configuration (triage-config.yaml): Defines monitored Slack channels, failure detection patterns, AI analysis parameters, categorization rules with confidence thresholds, and rate limiting controls
  • Kubernetes resources: ConfigMap for the triage configuration, Secrets for OpenAI API credentials, and Prometheus alert rules for monitoring (API error rates, analysis timeouts, service availability)
  • Deployment updates (ci-chat-bot.yaml): Wires the configuration and secrets into the existing ci-chat-bot pods, adds the --enable-triage flag, and sets required environment variables
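
To make the idea of per-category confidence thresholds concrete, here is a minimal sketch of how such a rule might be applied to model output. The threshold values, the `classify` helper, and the `"unknown"` fallback are all hypothetical; the real thresholds live in triage-config.yaml, and the runtime logic in openshift/ci-tools is not part of this PR.

```python
# Hypothetical thresholds for the four categories named in the PR description.
THRESHOLDS = {"infrastructure": 0.7, "flaky": 0.6, "bug": 0.8, "config": 0.7}


def classify(scores: dict[str, float]) -> str:
    """Return the highest-scoring category that clears its own threshold.

    Categories without a configured threshold are never eligible, and
    "unknown" is returned when nothing clears its threshold.
    """
    eligible = {
        category: score
        for category, score in scores.items()
        if score >= THRESHOLDS.get(category, float("inf"))
    }
    if not eligible:
        return "unknown"
    return max(eligible, key=eligible.get)
```

Under these assumed values, a result of `{"infrastructure": 0.85, "bug": 0.4}` is reported as infrastructure, while a lone `{"bug": 0.75}` falls below the bug threshold and is left unclassified.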

Integration Capabilities:
Chaibot integrates with multiple systems for comprehensive failure analysis:

  • Sippy for historical test failure context
  • JIRA for discovering related issues
  • Prow/GCS for retrieving logs and artifacts
  • OpenAI API for AI-driven categorization and analysis

Documentation and Deployment Guidance:
The PR includes extensive documentation for operators: a quick-start guide, detailed deployment runbook, troubleshooting procedures, and cost estimates (GPT-4 ~$0.03/analysis, GPT-3.5-turbo ~$0.003/analysis, with configurable rate limiting).
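
One way configurable rate limiting could bound API spend is a sliding-window limiter that caps the number of analyses per time window. The sketch below is an assumption about the general technique only; the class name, parameters, and the explicit `now` argument are hypothetical and do not reflect the rate limiting keys in triage-config.yaml.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls analyses within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False
```

Passing the current time in explicitly keeps the sketch deterministic; a real caller would likely supply `time.monotonic()`.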

Important Implementation Note:
The PR provides configuration and Kubernetes manifests only. Runtime code implementation in the openshift/ci-tools repository (cmd/ci-chat-bot) is required for Chaibot to function; without it, the deployment will succeed but triage responses will not be generated.

openshift-ci Bot requested a review from a team on June 12, 2026 15:42
openshift-merge-bot Bot added the rehearsals-ack label (Signifies that rehearsal jobs have been acknowledged) on Jun 12, 2026
openshift-ci Bot commented Jun 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chaclark1974
Once this PR has been reviewed and has the lgtm label, please assign jmguzik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai Bot commented Jun 12, 2026

Walkthrough

Introduces Chaibot, an AI-powered Slack workflow for triaging CI test failures in ci-chat-bot. Adds a canonical triage-config.yaml, a Kubernetes ConfigMap wrapping it, and live Deployment changes (volumes, mounts, env vars, startup args). Also adds a deployment patch template, a PrometheusRule with three alerts, a secrets bootstrap reference, and four comprehensive documentation files covering quick-start, deployment runbook, feature reference, and end-user guide.

Changes

Chaibot Test Failure Triage

| Layer / File(s) | Summary |
|---|---|
| **Canonical triage configuration schema**<br>`core-services/ci-chat-bot/triage-config.yaml` | Defines the complete triage-config.yaml covering monitored Slack channels, failure detection patterns (Prow URLs, keywords), AI analysis settings (provider/model/timeout), failure categorization rules with per-category confidence thresholds, Slack response formatting, integrations (Sippy/JIRA/Prow/AI API), rate limiting controls, Prometheus metrics configuration, and AI prompt templates for root-cause analysis. |
| **Kubernetes ConfigMap and Deployment wiring**<br>`clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml`, `clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` | Wraps triage-config.yaml in the ci-chat-bot-triage-config ConfigMap and wires the ci-chat-bot Deployment to mount configuration and secrets volumes, set environment variables (CHAIBOT_ENABLED, OPENAI_API_KEY), and pass command-line arguments to enable triage with the mounted config path. |
| **Deployment patch template and Prometheus alerting**<br>`clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` | Provides a commented patch reference showing volume/mount/argument layering over the base Deployment, placeholder Secret documentation, and the chaibot-alerts PrometheusRule defining three alerts: ChaibotHighErrorRate (API error rate), ChaibotAnalysisTimeout (analysis duration quantile), and ChaibotDown (pod uptime). |
| **Secrets bootstrap and configuration reference**<br>`core-services/ci-secret-bootstrap/chaibot-secret-config.yaml` | Documents how to store OpenAI/Anthropic API keys in Vault, capture and reference the Slack channel ID, and configure the required Slack OAuth scopes and event subscriptions for the ci-chat-bot-slack-app. |
| **Quick-start guide and deployment runbook**<br>`CHAIBOT_QUICKSTART.md`, `DEPLOY_CHAIBOT.md` | CHAIBOT_QUICKSTART.md provides end-to-end deployment steps, example outputs, and troubleshooting. DEPLOY_CHAIBOT.md is a comprehensive runbook covering prerequisites, deployment workflow, functional validation, troubleshooting, cost management, production readiness checklist, and rollback procedures. |
| **Feature reference and end-user documentation**<br>`core-services/ci-chat-bot/CHAIBOT.md`, `docs/chaibot-test-failure-triage.md` | CHAIBOT.md is a reference covering feature purpose, configuration structure, end-to-end flow, and monitoring. docs/chaibot-test-failure-triage.md is a comprehensive guide covering setup prerequisites, usage modes, configuration options, operational monitoring, troubleshooting, cost considerations, security practices, local development, and a feature roadmap. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: adding a Chaibot test failure triage workflow to ci-chat-bot. It is concise, specific, and directly reflects the primary purpose of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR adds only configuration and documentation files (YAML and Markdown); contains zero Go test files and zero Ginkgo tests, so check is not applicable. |
| Test Structure And Quality | ✅ Passed | PR contains only YAML configuration and Markdown documentation files; no Ginkgo test code exists to review. Custom check is not applicable to this PR. |
| Microshift Test Compatibility | ✅ Passed | PR adds configuration and documentation files only (YAML, Markdown). No Ginkgo e2e tests were added, so MicroShift test compatibility check does not apply. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR adds configuration, deployment manifests, and documentation for Chaibot; no Ginkgo e2e tests are added, so SNO compatibility check does not apply. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR adds Chaibot configuration and features to ci-chat-bot without introducing any topology-unfriendly scheduling constraints: no affinity rules, nodeSelector, topologySpreadConstraints, PodDis... |
| Ote Binary Stdout Contract | ✅ Passed | PR adds only YAML configuration and markdown documentation files. OTE Binary Stdout Contract check applies only to Go test binaries; no Go code is present in this PR. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR adds only configuration (YAML) and documentation (Markdown) files for Chaibot feature; no Ginkgo e2e tests are added, so IPv6/disconnected network compatibility check is not applicable. |
| No-Weak-Crypto | ✅ Passed | No weak cryptographic patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or non-constant-time comparisons) found in any of the 9 files added/modified in this PR. |
| Container-Privileges | ✅ Passed | The PR's Kubernetes manifests contain no privileged security settings. The deployment has no privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation set... |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR sets log_level to "info" (not debug), uses Kubernetes Secrets for API keys with no hardcoded credentials, includes security guidance against logging sensitive data, and provides no code that cou... |


@chaclark1974

/retest

4 similar comments

Add Vault sync configuration for the Chaibot OpenAI API key stored in
selfservice/cspi-qe/chaibot-openai-key.

This configures ci-secret-bootstrap to automatically sync the key from
Vault to the ci-chat-bot-chaibot-secrets Kubernetes secret in the ci
namespace on the app.ci cluster.

Vault path: selfservice/cspi-qe/chaibot-openai-key
Target secret: ci-chat-bot-chaibot-secrets (ci namespace, app.ci cluster)
chaclark1974 force-pushed the chaibot-test-triage branch from 8423b12 to 330bcc5 on June 15, 2026 14:49

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (2)

299-390: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Harden container securityContext before rollout.

The Deployment still lacks explicit hardening (runAsNonRoot, allowPrivilegeEscalation: false, readOnlyRootFilesystem, capabilities.drop: ["ALL"]) for workload containers.

Suggested patch
       containers:
         - name: git-sync
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL
         - name: bot
+          securityContext:
+            runAsNonRoot: true
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL

As per coding guidelines, Kubernetes/OpenShift manifests should set runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false, and drop all capabilities unless required.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` around lines 299 - 390, The
containers git-sync-init, git-sync, and bot in the Deployment lack explicit
security hardening configurations. Add a securityContext field to each container
specifying runAsNonRoot set to true, allowPrivilegeEscalation set to false,
readOnlyRootFilesystem set to true, and capabilities.drop set to ["ALL"]. For
the git-sync-init initContainer, ensure the same hardening is applied. These
security settings should be added at the container level, sibling to the image,
imagePullPolicy, and other container specifications.

Sources: Coding guidelines, Linters/SAST tools


330-353: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add CPU/memory limits for all containers in this pod.

Only requests are configured; missing limits can cause noisy-neighbor impact and weak resource isolation.

Suggested patch
         - name: git-sync
           resources:
             requests:
               memory: "1Gi"
               cpu: "0.5"
+            limits:
+              memory: "2Gi"
+              cpu: "1"
         - name: bot
           resources:
             requests:
               memory: "12Gi"
               cpu: "250m"
+            limits:
+              memory: "16Gi"
+              cpu: "2"

As per coding guidelines, Kubernetes manifests should define resource limits for every container.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` around lines 330 - 353, Add
resource limits to all containers in the pod to ensure proper resource
isolation. For the bot container, add a limits section under resources
(alongside the existing requests section) with appropriate CPU and memory
limits. Similarly, add limits to the unnamed container shown earlier in the diff
that currently only has requests configured. Limits should typically match or
slightly exceed the request values to prevent resource starvation while
maintaining isolation guarantees.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHAIBOT_QUICKSTART.md`:
- Around line 9-23: The fenced code blocks in the markdown file are missing
language identifiers after the opening triple backticks. Add the appropriate
language identifier (such as "yaml", "bash", "json", etc.) immediately after the
opening triple backticks for all code blocks throughout the file to resolve
markdown linting issues. This needs to be applied to all affected code block
locations in the document.

In `@clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml`:
- Around line 91-96: The include_actions section in the chaibot-configmap.yaml
file is incomplete and missing the trigger_retest and create_issue actions that
are documented in the canonical configuration. Add back these two missing action
entries to the include_actions list so that the deployed ConfigMap includes all
the intended Chaibot actions and matches the documented feature set.

In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`:
- Around line 53-63: Remove the entire Secret definition for
ci-chat-bot-chaibot-secrets (lines 53-63) from the deployment manifest file
chaibot-deployment-patch.yaml. This secret with the placeholder openai-api-key
value should not be kept in applyable cluster manifests as it can override the
bootstrap-managed secret. Since this secret is meant to be managed via
ci-secret-bootstrap as noted in the comment, delete this whole Secret resource
block to prevent accidental overrides of the actual authentication credentials.

In `@core-services/ci-chat-bot/CHAIBOT.md`:
- Around line 79-81: The fenced code blocks in the CHAIBOT.md file are missing
language declarations, which violates markdownlint rule MD040. Add appropriate
language tags to all fenced code blocks that currently lack them. Specifically,
add a language identifier immediately after the opening triple backticks (``` )
for the code block at lines 79-81 and the additional code blocks at lines 84-94.
Use language tags that accurately describe the content of each block (such as
text, bash, yaml, etc.).
- Around line 175-181: The secret_name field in the ai_api integrations section
of CHAIBOT.md documents "chaibot-openai-key", but this does not match the actual
deployment secret wiring. Review the actual deployment configuration to identify
the correct secret name and key references being used (the comment indicates the
deployment uses "ci-chat-bot-chaibot-secrets" and "openai-api-key"), then update
the secret_name value and any related configuration in CHAIBOT.md to match the
actual deployed secret contract to ensure documentation accuracy.

In `@core-services/ci-chat-bot/triage-config.yaml`:
- Around line 153-157: The ai_api.secret_name configuration value in
triage-config.yaml at line 155 is set to "chaibot-openai-key", but this does not
match the actual Kubernetes secret name created by the bootstrap process in
core-services/ci-secret-bootstrap/_config.yaml, which creates
"ci-chat-bot-chaibot-secrets". Update the secret_name value to
"ci-chat-bot-chaibot-secrets" so that runtime secret resolution will
successfully find the provisioned secret.

In `@core-services/ci-secret-bootstrap/chaibot-secret-config.yaml`:
- Around line 7-13: The example schema in the commented-out from section uses an
incorrect format with dockerconfigJSON key that does not match the actual
secret-bootstrap schema. Replace the dockerconfigJSON placeholder format in the
from section with the correct schema structure that uses field and path keys
instead, ensuring the example accurately reflects the schema expected by the
secret-bootstrap configuration used in this PR.

In `@DEPLOY_CHAIBOT.md`:
- Around line 73-76: The documentation in DEPLOY_CHAIBOT.md at lines 73-76 and
267-268 shows insecure methods of handling API keys using --from-literal in oc
create secret commands, which exposes sensitive credentials in shell history and
process output. Replace these examples with safer alternatives that do not
expose the raw API key on the command line, such as reading from a file (using
--from-file), referencing an environment variable securely, or using an
interactive prompt. Ensure all documentation instances that demonstrate secret
creation follow this safer pattern rather than showing literal key values in the
command itself.
- Around line 14-20: The markdown code fences in DEPLOY_CHAIBOT.md at lines
14-20 and also at lines 204-206 are missing language identifiers after the
opening triple backticks, which violates markdownlint requirements. Add an
appropriate language identifier (such as "text" or "plaintext") after each
opening code fence marker (the triple backticks) to declare the language type
for all affected fenced code blocks.

In `@docs/chaibot-test-failure-triage.md`:
- Around line 38-80: The fenced code blocks in the markdown file are missing
language specifiers, which causes MD040 lint check failures. Add appropriate
language specifiers to all code blocks that currently start with just triple
backticks. Based on the content of each block, determine the appropriate
language identifier (such as bash, json, text, etc.) and add it immediately
after the opening triple backticks to fix the lint violations.

---

Outside diff comments:
In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml`:
- Around line 299-390: The containers git-sync-init, git-sync, and bot in the
Deployment lack explicit security hardening configurations. Add a
securityContext field to each container specifying runAsNonRoot set to true,
allowPrivilegeEscalation set to false, readOnlyRootFilesystem set to true, and
capabilities.drop set to ["ALL"]. For the git-sync-init initContainer, ensure
the same hardening is applied. These security settings should be added at the
container level, sibling to the image, imagePullPolicy, and other container
specifications.
- Around line 330-353: Add resource limits to all containers in the pod to
ensure proper resource isolation. For the bot container, add a limits section
under resources (alongside the existing requests section) with appropriate CPU
and memory limits. Similarly, add limits to the unnamed container shown earlier
in the diff that currently only has requests configured. Limits should typically
match or slightly exceed the request values to prevent resource starvation while
maintaining isolation guarantees.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 53e7ca54-c042-4bde-8c2f-d0d3cbfbf023

📥 Commits

Reviewing files that changed from the base of the PR and between 5f6a00a and 330bcc5.

📒 Files selected for processing (10)
  • CHAIBOT_QUICKSTART.md
  • DEPLOY_CHAIBOT.md
  • clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml
  • clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml
  • clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml
  • core-services/ci-chat-bot/CHAIBOT.md
  • core-services/ci-chat-bot/triage-config.yaml
  • core-services/ci-secret-bootstrap/_config.yaml
  • core-services/ci-secret-bootstrap/chaibot-secret-config.yaml
  • docs/chaibot-test-failure-triage.md

Comment thread CHAIBOT_QUICKSTART.md
Comment thread clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml
Comment thread clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml Outdated
Comment thread core-services/ci-chat-bot/CHAIBOT.md
Comment thread core-services/ci-chat-bot/CHAIBOT.md
Comment thread core-services/ci-chat-bot/triage-config.yaml
Comment thread core-services/ci-secret-bootstrap/chaibot-secret-config.yaml
Comment thread DEPLOY_CHAIBOT.md
Comment on lines +14 to +20
```
✓ core-services/ci-chat-bot/triage-config.yaml
✓ clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml
✓ clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml
✓ clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (UPDATED)
✓ docs/chaibot-test-failure-triage.md
```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced blocks.

These code fences should declare a language for markdownlint compliance.

Also applies to: 204-206

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 14-14: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@DEPLOY_CHAIBOT.md` around lines 14 - 20, The markdown code fences in
DEPLOY_CHAIBOT.md at lines 14-20 and also at lines 204-206 are missing language
identifiers after the opening triple backticks, which violates markdownlint
requirements. Add an appropriate language identifier (such as "text" or
"plaintext") after each opening code fence marker (the triple backticks) to
declare the language type for all affected fenced code blocks.

Source: Linters/SAST tools

Comment thread DEPLOY_CHAIBOT.md Outdated
Comment on lines +38 to +80
```
User: The e2e-aws job is failing again 😞
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/12345/pull-ci-openshift-installer-master-e2e-aws/678901

---

Chaibot [BOT]: :mag: Analyzing failure... (typically takes 30-60 seconds)

[30 seconds later]

Chaibot [BOT]: :cloud: **Test Failure Analysis**

**Job:** `pull-ci-openshift-installer-master-e2e-aws`
**Status:** Failed after 2h 15m
**Root Cause:** Infrastructure - AWS EC2 Capacity (Confidence: 85%)

**Analysis:**
Test failed during cluster provisioning when attempting to launch EC2 instances in us-east-1c.
AWS returned "InsufficientInstanceCapacity" error after multiple retry attempts over 45 minutes.

**Evidence:**
```
Error: creating EC2 Instance (i-0a1b2c3d4e5f): InsufficientInstanceCapacity
status code: 500, request id: xyz-123
```

**Historical Pattern:**
This failure has occurred 8 times in the last 24 hours across multiple jobs, all in us-east-1c AZ.
Sippy shows this as a known transient infrastructure issue.

**Recommendations:**
1. ✅ **Retest** - This is a transient AWS issue, likely to succeed on retry
2. 📊 Check AWS Service Health Dashboard for us-east-1 incidents
3. 🔔 If failures persist >6 hours, escalate to infrastructure team

**Related:**
- <https://sippy.dptools.openshift.org/|Sippy Dashboard>
- <https://issues.redhat.com/browse/DPTP-5678|DPTP-5678>: Similar AWS capacity issues

**Classification:** Transient Infrastructure (Not a product bug)

[Buttons: View Logs | Retest | Mark as Known Issue]
```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language specifiers to fenced code examples.

These blocks are missing fence languages and will continue to fail MD040 lint checks.

Also applies to: 59-62, 211-213

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 38-38: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 62-62: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/chaibot-test-failure-triage.md` around lines 38 - 80, The fenced code
blocks in the markdown file are missing language specifiers, which causes MD040
lint check failures. Add appropriate language specifiers to all code blocks that
currently start with just triple backticks. Based on the content of each block,
determine the appropriate language identifier (such as bash, json, text, etc.)
and add it immediately after the opening triple backticks to fix the lint
violations.

Source: Linters/SAST tools
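
The MD040 finding above can be reproduced locally before pushing. The following is a minimal sketch, not markdownlint's actual implementation: it assumes fences are plain triple backticks and treats every fence line inside an open block as the closing fence. The function name `fences_missing_language` is hypothetical.

```python
# Built from single backticks so this example does not break its own fence.
FENCE = "`" * 3


def fences_missing_language(markdown_text):
    """Return 1-based line numbers of opening fences that have no language."""
    missing = []
    in_fence = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if in_fence:
            # Simplification: any fence line inside a block closes it.
            in_fence = False
        else:
            in_fence = True
            if not stripped[len(FENCE):].strip():
                missing.append(number)
    return missing


sample = "\n".join([
    FENCE,                # line 1: opening fence with no language
    "User: The e2e-aws job is failing again",
    FENCE,
    "",
    FENCE + "bash",       # line 5: opening fence with a language
    "oc get pods",
    FENCE,
])
print(fences_missing_language(sample))  # prints [1]
```

Note that a transcript containing its own inner fence, like the Slack example quoted above, is split into several blocks by this logic, which is likely why the linter reports more than one line for a single example.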

@amp-rh amp-rh left a comment

The ci/prow/ci-secret-bootstrap-config-validation check is failing with:

config[115].from[openai-api-key]: empty value is not allowed

Root cause: The _config.yaml entry uses path: selfservice/cspi-qe/chaibot-openai-key, but ci-secret-bootstrap cannot read selfservice/ vault paths (its service account lacks ACL access to that namespace). This is the only path: usage in the entire _config.yaml; all 1594 other field references use item:.

Self-service vault items are designed to be synced via the secretsync mechanism (driven by the secretsync/target-name and secretsync/target-namespace metadata keys on the vault item), not via ci-secret-bootstrap.

Recommended fix: Since your vault item already has secretsync/target-name: cluster-secrets-chaibot-openai-key and secretsync/target-namespace: ci, remove the _config.yaml entry and update the deployment to reference the secretsync-managed secret name (cluster-secrets-chaibot-openai-key).

Note: Verify that secretsync targets the app.ci cluster (where ci-chat-bot runs). If it doesn't sync there by default, add secretsync/target-clusters: app.ci to the vault item metadata. See https://docs.ci.openshift.org/how-tos/adding-a-new-secret-to-ci/
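
A rough way to catch this class of problem before CI does is to scan the parsed config for `path:` references under `selfservice/`. This is a sketch under assumptions: the entry shape (`from` mapping a key to a reference with `item`/`field` or `path`) is inferred from the error message `config[115].from[openai-api-key]`, the input is assumed to be already parsed into Python dicts, and `find_selfservice_paths` and `some-item` are hypothetical names.

```python
def find_selfservice_paths(secret_configs):
    """Return (index, key, path) for each `path:` reference under selfservice/."""
    findings = []
    for index, entry in enumerate(secret_configs):
        for key, ref in entry.get("from", {}).items():
            path = ref.get("path", "")
            if path.startswith("selfservice/"):
                findings.append((index, key, path))
    return findings


# Assumed, simplified stand-in for two parsed _config.yaml entries.
configs = [
    {"from": {"token": {"item": "some-item", "field": "token"}}},
    {"from": {"openai-api-key": {"path": "selfservice/cspi-qe/chaibot-openai-key"}}},
]

for index, key, path in find_selfservice_paths(configs):
    print(f"config[{index}].from[{key}] uses path {path}; consider secretsync instead")
```

Any hit is a candidate for moving to the secretsync-managed secret, as recommended above.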

Comment thread core-services/ci-secret-bootstrap/_config.yaml Outdated
Comment thread clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml
Comment thread clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml
chaclark1974 and others added 3 commits June 15, 2026 12:20
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>

@amp-rh amp-rh left a comment

The _config.yaml fix looks good. A few stale references to the old secret name ci-chat-bot-chaibot-secrets remain across the PR that should be updated to cluster-secrets-chaibot-openai-key (the secretsync-managed name).

@amp-rh amp-rh left a comment

Inline suggestions for the remaining ci-chat-bot-chaibot-secrets references. Click to apply.

Comment thread CHAIBOT_QUICKSTART.md Outdated
Comment thread CHAIBOT_QUICKSTART.md Outdated
Comment thread CHAIBOT_QUICKSTART.md Outdated
Comment thread DEPLOY_CHAIBOT.md Outdated
Comment thread DEPLOY_CHAIBOT.md Outdated
Comment thread DEPLOY_CHAIBOT.md Outdated
Comment thread clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml Outdated
Comment thread clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml Outdated
Comment thread clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml Outdated
chaclark1974 and others added 6 commits June 15, 2026 12:35
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
@amp-rh

amp-rh commented Jun 15, 2026

/test ci-secret-bootstrap-config-validation

chaclark1974 and others added 4 commits June 15, 2026 12:37
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
Convert incomplete Secret documentation section to pure comments to fix
rover groups collection error. The section was missing required Kubernetes
object fields (kind, apiVersion, metadata) which caused validation failures.

The secret is managed by secretsync from vault item selfservice/cspi-qe/chaibot-openai-key
and does not need to be defined in this file.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@amp-rh amp-rh left a comment

The sync-rover-groups CI check is failing because this section creates an invalid YAML document (no kind field). The --- separator starts a new document, and the uncommented lines make it a non-empty document that is not a Kubernetes object. Remove this block entirely.

Comment on lines +51 to +58

---
# Secret is managed by secretsync from vault item:
# selfservice/cspi-qe/chaibot-openai-key
# It will appear as:
# name: cluster-secrets-chaibot-openai-key
# namespace: ci
# This should be managed via ci-secret-bootstrap

Suggested change
---
# Secret is managed by secretsync from vault item:
# selfservice/cspi-qe/chaibot-openai-key
# It will appear as:
# name: cluster-secrets-chaibot-openai-key
# namespace: ci
# This should be managed via ci-secret-bootstrap
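
The failure mode described in this thread (a `---` separator followed by uncommented lines that do not form a Kubernetes object) can be approximated with a standard-library check. This is a sketch only: it is not the validation that sync-rover-groups performs, it does no real YAML parsing, and it assumes that comment-only documents are acceptable. `documents_without_kind` is a hypothetical helper.

```python
import re


def documents_without_kind(yaml_text):
    """Return 0-based indexes of non-empty YAML documents with no top-level kind."""
    bad = []
    documents = re.split(r"(?m)^---[ \t]*$", yaml_text)
    for index, doc in enumerate(documents):
        # Drop blank lines and full-line comments.
        content = [
            line for line in doc.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]
        if not content:
            continue  # empty or comment-only document
        if not any(line.startswith("kind:") for line in content):
            bad.append(index)
    return bad


broken = "kind: ConfigMap\napiVersion: v1\n---\n# Secret notes\nname: x\nnamespace: ci\n"
commented = "kind: ConfigMap\napiVersion: v1\n---\n# Secret notes\n# name: x\n"
print(documents_without_kind(broken))     # prints [1]
print(documents_without_kind(commented))  # prints []
```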

@openshift-merge-bot

[REHEARSALNOTIFIER]
@chaclark1974: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml (2)

98-107: ⚠️ Potential issue | 🔴 Critical

Update the alert job label selector and add missing runbook_url annotation.

The ServiceMonitor has no custom jobLabel, so Prometheus constructs the job label as ci/ci-chat-bot (namespace/name). The current alert expression up{job="ci-chat-bot"} will not match this metric and the alert will never fire. Additionally, the runbook_url annotation is missing for consistency with ChaibotHighErrorRate.

Suggested fix
        - alert: ChaibotDown
          expr: |
-           up{job="ci-chat-bot"} == 0
+           up{job="ci/ci-chat-bot"} == 0
          for: 5m
          labels:
            severity: critical
            team: test-platform
          annotations:
            summary: "Chaibot service is down"
            description: "ci-chat-bot service (including Chaibot) has been down for 5 minutes."
+           runbook_url: "https://github.com/openshift/release/blob/main/docs/dptp-triage-sop/chaibot.md"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` around lines 98 -
107, In the ChaibotDown alert, fix the job label selector in the expr field from
"ci-chat-bot" to "ci/ci-chat-bot" to match how Prometheus constructs the job
label as namespace/name without a custom jobLabel in the ServiceMonitor.
Additionally, add a missing runbook_url annotation to the ChaibotDown alert
annotations section for consistency with the ChaibotHighErrorRate alert.
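
Before changing the selector, it may be worth confirming which `job` label values the `up` series actually carries, since the value depends on how the ServiceMonitor and the operator's relabeling are configured. The sketch below builds a request for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and extracts the distinct `job` values from a response body. The host name and the sample response are made-up placeholders, and no network call is made here.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def job_labels(response_body):
    """Return the sorted, distinct job label values from a query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted({s["metric"].get("job", "") for s in payload["data"]["result"]})


# Placeholder host; substitute the real Prometheus route for the cluster.
url = build_query_url("https://prometheus.example.test/", "count by (job) (up)")

# Hand-written sample shaped like an instant-query vector response.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "ci-chat-bot"}, "value": [1750000000, "1"]},
            {"metric": {"job": "some-other-job"}, "value": [1750000000, "3"]},
        ],
    },
})
print(job_labels(sample_response))
```

Whichever value the live query returns for the ci-chat-bot target is the one the `ChaibotDown` expression should match.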

85-85: ⚠️ Potential issue | 🟡 Minor

Create the runbook file or update the URL.

The runbook_url annotation references docs/dptp-triage-sop/chaibot.md, which does not exist. Either create this runbook file in the correct directory or point the annotation to an existing chaibot documentation file (e.g., core-services/ci-chat-bot/CHAIBOT.md).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` at line 85, The
runbook_url annotation in the chaibot-deployment-patch.yaml file references a
non-existent file at docs/dptp-triage-sop/chaibot.md. Either create the runbook
markdown file at that location in the release repository with appropriate
chaibot troubleshooting documentation, or update the runbook_url value to point
to an existing chaibot documentation file such as
core-services/ci-chat-bot/CHAIBOT.md. Ensure the URL in the annotation matches
the actual location of the documentation you choose to use.
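
A dangling `runbook_url` like this one can be detected from a checkout of the repository. The sketch below assumes runbook URLs use the `https://github.com/openshift/release/blob/main/` prefix seen in the suggested annotations, and that `repo_root` points at a local clone. `runbook_exists` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real repository.

```python
import tempfile
from pathlib import Path

# Prefix taken from the runbook URL quoted in this review.
BLOB_PREFIX = "https://github.com/openshift/release/blob/main/"


def runbook_exists(runbook_url, repo_root):
    """Return True if the URL maps to an existing file under repo_root."""
    if not runbook_url.startswith(BLOB_PREFIX):
        return False
    relative = runbook_url[len(BLOB_PREFIX):]
    return (Path(repo_root) / relative).is_file()


with tempfile.TemporaryDirectory() as root:
    # Simulate a checkout that has CHAIBOT.md but no triage SOP runbook.
    existing = Path(root) / "core-services" / "ci-chat-bot" / "CHAIBOT.md"
    existing.parent.mkdir(parents=True)
    existing.write_text("# Chaibot\n")

    missing_ok = runbook_exists(BLOB_PREFIX + "docs/dptp-triage-sop/chaibot.md", root)
    present_ok = runbook_exists(BLOB_PREFIX + "core-services/ci-chat-bot/CHAIBOT.md", root)
    print(missing_ok, present_ok)  # prints False True
```
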
🧹 Nitpick comments (2)
clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml (2)

6-51: ⚡ Quick win

Patch documentation appears redundant—base deployment already includes these changes.

Lines 6-51 document volumes, mounts, args, and env vars that should be added to the base ci-chat-bot.yaml. However, cross-referencing with the base deployment (context snippet 1, lines 334-461) shows all these elements are already present:

  • triage-config and chaibot-secrets volumes and mounts exist
  • CHAIBOT_ENABLED=true and OPENAI_API_KEY env vars exist
  • --enable-triage=true and --triage-config-path args exist

Consider either removing these commented sections or adding a header note: # NOTE: These patches have been applied to ci-chat-bot.yaml. Retained here for reference only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` around lines 6 -
51, The patch file (chaibot-deployment-patch.yaml) contains extensive commented
documentation describing configuration changes (volumes, mounts, args, and env
vars) that have already been applied to the base ci-chat-bot.yaml deployment.
Remove the redundant commented sections spanning lines 6-51, or alternatively
replace them with a single header comment stating that these patches have been
applied to ci-chat-bot.yaml and are retained for reference only.

87-96: ⚡ Quick win

Add runbook_url annotation for operational consistency.

The ChaibotAnalysisTimeout alert is missing a runbook_url annotation, while ChaibotHighErrorRate includes one. For consistency and operational clarity, add a runbook reference.

📚 Suggested addition
          annotations:
            summary: "Chaibot analysis taking too long"
            description: "95th percentile analysis duration is {{ $value }}s, exceeding 120s timeout."
+           runbook_url: "https://github.com/openshift/release/blob/main/docs/dptp-triage-sop/chaibot.md"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` around lines 87 -
96, The ChaibotAnalysisTimeout alert definition is missing a runbook_url
annotation in its annotations section, which creates an inconsistency with other
alerts like ChaibotHighErrorRate that include this field. Add a runbook_url
annotation to the ChaibotAnalysisTimeout alert's annotations block, providing an
appropriate runbook reference URL that follows the same pattern used in other
similar alerts for operational consistency.
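
The two `runbook_url` findings in this review can be checked in one pass over the rule manifest. This is a sketch assuming the manifest has already been parsed into a dict with the PrometheusRule layout (`spec.groups[].rules[]`). The group name and the trimmed rule bodies are illustrative, and `alerts_missing_runbook` is a hypothetical helper.

```python
def alerts_missing_runbook(rule_manifest):
    """Return names of alerting rules that have no runbook_url annotation."""
    missing = []
    for group in rule_manifest.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if "alert" in rule and "runbook_url" not in rule.get("annotations", {}):
                missing.append(rule["alert"])
    return missing


# Trimmed stand-in for the three alerts discussed in this review.
manifest = {
    "spec": {
        "groups": [{
            "name": "chaibot",  # illustrative group name
            "rules": [
                {"alert": "ChaibotHighErrorRate",
                 "annotations": {"runbook_url": "https://github.com/openshift/release/blob/main/docs/dptp-triage-sop/chaibot.md"}},
                {"alert": "ChaibotAnalysisTimeout",
                 "annotations": {"summary": "Chaibot analysis taking too long"}},
                {"alert": "ChaibotDown",
                 "annotations": {"summary": "Chaibot service is down"}},
            ],
        }]
    }
}
print(alerts_missing_runbook(manifest))  # prints ['ChaibotAnalysisTimeout', 'ChaibotDown']
```
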
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`:
- Around line 98-107: In the ChaibotDown alert, fix the job label selector in
the expr field from "ci-chat-bot" to "ci/ci-chat-bot" to match how Prometheus
constructs the job label as namespace/name without a custom jobLabel in the
ServiceMonitor. Additionally, add a missing runbook_url annotation to the
ChaibotDown alert annotations section for consistency with the
ChaibotHighErrorRate alert.
- Line 85: The runbook_url annotation in the chaibot-deployment-patch.yaml file
references a non-existent file at docs/dptp-triage-sop/chaibot.md. Either create
the runbook markdown file at that location in the release repository with
appropriate chaibot troubleshooting documentation, or update the runbook_url
value to point to an existing chaibot documentation file such as
core-services/ci-chat-bot/CHAIBOT.md. Ensure the URL in the annotation matches
the actual location of the documentation you choose to use.

---

Nitpick comments:
In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`:
- Around line 6-51: The patch file (chaibot-deployment-patch.yaml) contains
extensive commented documentation describing configuration changes (volumes,
mounts, args, and env vars) that have already been applied to the base
ci-chat-bot.yaml deployment. Remove the redundant commented sections spanning
lines 6-51, or alternatively replace them with a single header comment stating
that these patches have been applied to ci-chat-bot.yaml and are retained for
reference only.
- Around line 87-96: The ChaibotAnalysisTimeout alert definition is missing a
runbook_url annotation in its annotations section, which creates an
inconsistency with other alerts like ChaibotHighErrorRate that include this
field. Add a runbook_url annotation to the ChaibotAnalysisTimeout alert's
annotations block, providing an appropriate runbook reference URL that follows
the same pattern used in other similar alerts for operational consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 08d82122-418a-486f-90b4-e7e3761543cf

📥 Commits

Reviewing files that changed from the base of the PR and between 17c503d and 3c9fdb8.

📒 Files selected for processing (3)
  • CHAIBOT_QUICKSTART.md
  • DEPLOY_CHAIBOT.md
  • clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml
✅ Files skipped from review due to trivial changes (2)
  • CHAIBOT_QUICKSTART.md
  • DEPLOY_CHAIBOT.md

@openshift-ci

openshift-ci Bot commented Jun 15, 2026

@chaclark1974: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

rehearsals-ack Signifies that rehearsal jobs have been acknowledged

2 participants