Add Chaibot test failure triage workflow to ci-chat-bot#80476
chaclark1974 wants to merge 15 commits into
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chaclark1974

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Walkthrough

Introduces Chaibot, an AI-powered Slack workflow for triaging CI test failures in

Changes: Chaibot Test Failure Triage

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 15 passed
|
/retest |
This PR adds Chaibot, an AI-powered Slack workflow that automatically triages and analyzes test failures posted in designated Slack channels.

## Overview

Chaibot extends the existing ci-chat-bot service to monitor Slack channels (initially #opp-discussion) for test failure messages, analyze failures using OpenAI GPT-4, and post detailed triage analysis in threads.

## What's Added

### Configuration Files

- `core-services/ci-chat-bot/triage-config.yaml` - Main Chaibot configuration
- `clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml` - Kubernetes ConfigMap
- `clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` - Prometheus alerts
- `core-services/ci-secret-bootstrap/chaibot-secret-config.yaml` - Secret config guide

### Deployment Changes

- `clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` - Updated with:
  - Chaibot triage-config and secrets volumes
  - CHAIBOT_ENABLED and OPENAI_API_KEY environment variables
  - --enable-triage command line argument

### Documentation

- `docs/chaibot-test-failure-triage.md` - Comprehensive user/admin guide
- `core-services/ci-chat-bot/CHAIBOT.md` - Quick reference
- `CHAIBOT_QUICKSTART.md` - Quick start guide
- `DEPLOY_CHAIBOT.md` - Deployment instructions

## Features

- **Automatic Detection**: Monitors channels for Prow job failures
- **AI Analysis**: Uses OpenAI to categorize failures (infrastructure, flaky, bug, config)
- **Historical Context**: Integrates with Sippy for past failure patterns
- **JIRA Integration**: Searches for related known issues
- **Actionable Output**: Posts analysis with recommendations in Slack threads

## Example Output

When a failure is posted, Chaibot responds with:

- Root cause identification (with confidence %)
- Evidence from logs
- Historical failure patterns
- Specific recommendations
- Links to Sippy, logs, and related JIRA issues

## Configuration Required

Before this can function, the following must be configured:

1. **Slack Channel ID**: Update `chaibot-configmap.yaml` with actual channel ID for #opp-discussion
2. **OpenAI API Key**: Add to ci-secret-bootstrap (see `chaibot-secret-config.yaml`)
3. **Slack App Permissions**: Ensure ci-chat-bot app has required OAuth scopes

## Implementation Note

⚠️ This PR provides the complete configuration and deployment manifests, but requires code implementation in openshift/ci-tools (cmd/ci-chat-bot) to actually process the configuration and perform analysis. Without the code implementation, the deployment will succeed but Chaibot will not respond to messages (the --enable-triage flag will be ignored).

## Cost Estimate

- GPT-4: ~$0.03/analysis (~$90/month at 100 failures/day)
- GPT-3.5-turbo: ~$0.003/analysis (~$9/month at 100 failures/day)
- Rate limiting configured to prevent cost overruns

## Testing

After deployment:

1. Update ConfigMap with actual Slack channel ID
2. Configure OpenAI API key secret
3. Post test failure message with Prow URL in #opp-discussion
4. Verify Chaibot responds in thread within 60 seconds

## Related

- Extends existing ci-chat-bot service
- Integrates with Sippy for historical data
- Complements retester for automated failure handling

/cc @openshift/test-platform
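The monthly figures in the cost estimate follow from simple multiplication. The following is a minimal Python sketch of that arithmetic, assuming a 30-day month and one analysis per failure (neither is stated in the PR); the function name `monthly_cost` is illustrative only and is not part of Chaibot.

```python
def monthly_cost(cost_per_analysis: float, failures_per_day: int, days: int = 30) -> float:
    """Return the estimated monthly spend in USD.

    Assumes one analysis per failure and a 30-day month by default;
    both are assumptions, not values taken from the Chaibot config.
    """
    return round(cost_per_analysis * failures_per_day * days, 2)


# Per-analysis prices as quoted in the PR description.
gpt4 = monthly_cost(0.03, 100)
gpt35 = monthly_cost(0.003, 100)
print(f"GPT-4: ${gpt4:.2f}/month, GPT-3.5-turbo: ${gpt35:.2f}/month")
```

At 100 failures per day this gives the roughly $90 and $9 monthly figures quoted in the description; a different daily volume scales both linearly.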
Add Vault sync configuration for the Chaibot OpenAI API key stored in selfservice/cspi-qe/chaibot-openai-key. This configures ci-secret-bootstrap to automatically sync the key from Vault to the ci-chat-bot-chaibot-secrets Kubernetes secret in the ci namespace on the app.ci cluster. Vault path: selfservice/cspi-qe/chaibot-openai-key Target secret: ci-chat-bot-chaibot-secrets (ci namespace, app.ci cluster)
8423b12 to 330bcc5
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (2)
299-390: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Harden container `securityContext` before rollout.

The Deployment still lacks explicit hardening (`runAsNonRoot`, `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem`, `capabilities.drop: ["ALL"]`) for workload containers.

Suggested patch

```diff
 containers:
 - name: git-sync
+  securityContext:
+    runAsNonRoot: true
+    allowPrivilegeEscalation: false
+    readOnlyRootFilesystem: true
+    capabilities:
+      drop:
+      - ALL
 - name: bot
+  securityContext:
+    runAsNonRoot: true
+    allowPrivilegeEscalation: false
+    readOnlyRootFilesystem: true
+    capabilities:
+      drop:
+      - ALL
```

As per coding guidelines, Kubernetes/OpenShift manifests should set `runAsNonRoot`, `readOnlyRootFilesystem`, `allowPrivilegeEscalation: false`, and drop all capabilities unless required.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` around lines 299 - 390, The containers git-sync-init, git-sync, and bot in the Deployment lack explicit security hardening configurations. Add a securityContext field to each container specifying runAsNonRoot set to true, allowPrivilegeEscalation set to false, readOnlyRootFilesystem set to true, and capabilities.drop set to ["ALL"]. For the git-sync-init initContainer, ensure the same hardening is applied. These security settings should be added at the container level, sibling to the image, imagePullPolicy, and other container specifications.

Sources: Coding guidelines, Linters/SAST tools
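The hardening requirement above can be checked mechanically. The following is a minimal Python sketch, assuming each container spec has already been parsed into a plain dictionary; the helper name `hardening_gaps` and the sample containers are hypothetical and do not come from the PR or from any linter.

```python
# Boolean securityContext fields named in the review comment, with the
# values the comment asks for.
REQUIRED = {
    "runAsNonRoot": True,
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
}


def hardening_gaps(container: dict) -> list[str]:
    """Return the names of hardening settings a container spec is missing."""
    ctx = container.get("securityContext") or {}
    # A field counts as a gap when it is absent or has the wrong value.
    gaps = [key for key, want in REQUIRED.items() if ctx.get(key) is not want]
    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        gaps.append("capabilities.drop")
    return gaps


# Hypothetical container specs for illustration only.
bare = {"name": "bot", "image": "example"}
hardened = {
    "name": "bot",
    "securityContext": {
        "runAsNonRoot": True,
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]},
    },
}
print(hardening_gaps(bare))
print(hardening_gaps(hardened))
```

The same function would apply to init containers, since the review asks for identical settings on `git-sync-init`.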
330-353: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add CPU/memory limits for all containers in this pod.

Only `requests` are configured; missing `limits` can cause noisy-neighbor impact and weak resource isolation.

Suggested patch

```diff
 - name: git-sync
   resources:
     requests:
       memory: "1Gi"
       cpu: "0.5"
+    limits:
+      memory: "2Gi"
+      cpu: "1"
 - name: bot
   resources:
     requests:
       memory: "12Gi"
       cpu: "250m"
+    limits:
+      memory: "16Gi"
+      cpu: "2"
```

As per coding guidelines, Kubernetes manifests should define resource limits for every container.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` around lines 330 - 353, Add resource limits to all containers in the pod to ensure proper resource isolation. For the bot container, add a limits section under resources (alongside the existing requests section) with appropriate CPU and memory limits. Similarly, add limits to the unnamed container shown earlier in the diff that currently only has requests configured. Limits should typically match or slightly exceed the request values to prevent resource starvation while maintaining isolation guarantees.

Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHAIBOT_QUICKSTART.md`:
- Around line 9-23: The fenced code blocks in the markdown file are missing
language identifiers after the opening triple backticks. Add the appropriate
language identifier (such as "yaml", "bash", "json", etc.) immediately after the
opening triple backticks for all code blocks throughout the file to resolve
markdown linting issues. This needs to be applied to all affected code block
locations in the document.
In `@clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml`:
- Around line 91-96: The include_actions section in the chaibot-configmap.yaml
file is incomplete and missing the trigger_retest and create_issue actions that
are documented in the canonical configuration. Add back these two missing action
entries to the include_actions list so that the deployed ConfigMap includes all
the intended Chaibot actions and matches the documented feature set.
In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`:
- Around line 53-63: Remove the entire Secret definition for
ci-chat-bot-chaibot-secrets (lines 53-63) from the deployment manifest file
chaibot-deployment-patch.yaml. This secret with the placeholder openai-api-key
value should not be kept in applyable cluster manifests as it can override the
bootstrap-managed secret. Since this secret is meant to be managed via
ci-secret-bootstrap as noted in the comment, delete this whole Secret resource
block to prevent accidental overrides of the actual authentication credentials.
In `@core-services/ci-chat-bot/CHAIBOT.md`:
- Around line 79-81: The fenced code blocks in the CHAIBOT.md file are missing
language declarations, which violates markdownlint rule MD040. Add appropriate
language tags to all fenced code blocks that currently lack them. Specifically,
add a language identifier immediately after the opening triple backticks (``` )
for the code block at lines 79-81 and the additional code blocks at lines 84-94.
Use language tags that accurately describe the content of each block (such as
text, bash, yaml, etc.).
- Around line 175-181: The secret_name field in the ai_api integrations section
of CHAIBOT.md documents "chaibot-openai-key", but this does not match the actual
deployment secret wiring. Review the actual deployment configuration to identify
the correct secret name and key references being used (the comment indicates the
deployment uses "ci-chat-bot-chaibot-secrets" and "openai-api-key"), then update
the secret_name value and any related configuration in CHAIBOT.md to match the
actual deployed secret contract to ensure documentation accuracy.
In `@core-services/ci-chat-bot/triage-config.yaml`:
- Around line 153-157: The ai_api.secret_name configuration value in
triage-config.yaml at line 155 is set to "chaibot-openai-key", but this does not
match the actual Kubernetes secret name created by the bootstrap process in
core-services/ci-secret-bootstrap/_config.yaml, which creates
"ci-chat-bot-chaibot-secrets". Update the secret_name value to
"ci-chat-bot-chaibot-secrets" so that runtime secret resolution will
successfully find the provisioned secret.
In `@core-services/ci-secret-bootstrap/chaibot-secret-config.yaml`:
- Around line 7-13: The example schema in the commented-out from section uses an
incorrect format with dockerconfigJSON key that does not match the actual
secret-bootstrap schema. Replace the dockerconfigJSON placeholder format in the
from section with the correct schema structure that uses field and path keys
instead, ensuring the example accurately reflects the schema expected by the
secret-bootstrap configuration used in this PR.
In `@DEPLOY_CHAIBOT.md`:
- Around line 73-76: The documentation in DEPLOY_CHAIBOT.md at lines 73-76 and
267-268 shows insecure methods of handling API keys using --from-literal in oc
create secret commands, which exposes sensitive credentials in shell history and
process output. Replace these examples with safer alternatives that do not
expose the raw API key on the command line, such as reading from a file (using
--from-file), referencing an environment variable securely, or using an
interactive prompt. Ensure all documentation instances that demonstrate secret
creation follow this safer pattern rather than showing literal key values in the
command itself.
- Around line 14-20: The markdown code fences in DEPLOY_CHAIBOT.md at lines
14-20 and also at lines 204-206 are missing language identifiers after the
opening triple backticks, which violates markdownlint requirements. Add an
appropriate language identifier (such as "text" or "plaintext") after each
opening code fence marker (the triple backticks) to declare the language type
for all affected fenced code blocks.
In `@docs/chaibot-test-failure-triage.md`:
- Around line 38-80: The fenced code blocks in the markdown file are missing
language specifiers, which causes MD040 lint check failures. Add appropriate
language specifiers to all code blocks that currently start with just triple
backticks. Based on the content of each block, determine the appropriate
language identifier (such as bash, json, text, etc.) and add it immediately
after the opening triple backticks to fix the lint violations.
---
Outside diff comments:
In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml`:
- Around line 299-390: The containers git-sync-init, git-sync, and bot in the
Deployment lack explicit security hardening configurations. Add a
securityContext field to each container specifying runAsNonRoot set to true,
allowPrivilegeEscalation set to false, readOnlyRootFilesystem set to true, and
capabilities.drop set to ["ALL"]. For the git-sync-init initContainer, ensure
the same hardening is applied. These security settings should be added at the
container level, sibling to the image, imagePullPolicy, and other container
specifications.
- Around line 330-353: Add resource limits to all containers in the pod to
ensure proper resource isolation. For the bot container, add a limits section
under resources (alongside the existing requests section) with appropriate CPU
and memory limits. Similarly, add limits to the unnamed container shown earlier
in the diff that currently only has requests configured. Limits should typically
match or slightly exceed the request values to prevent resource starvation while
maintaining isolation guarantees.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 53e7ca54-c042-4bde-8c2f-d0d3cbfbf023
📒 Files selected for processing (10)
- CHAIBOT_QUICKSTART.md
- DEPLOY_CHAIBOT.md
- clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml
- clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml
- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml
- core-services/ci-chat-bot/CHAIBOT.md
- core-services/ci-chat-bot/triage-config.yaml
- core-services/ci-secret-bootstrap/_config.yaml
- core-services/ci-secret-bootstrap/chaibot-secret-config.yaml
- docs/chaibot-test-failure-triage.md
```
✓ core-services/ci-chat-bot/triage-config.yaml
✓ clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml
✓ clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml
✓ clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (UPDATED)
✓ docs/chaibot-test-failure-triage.md
```
Add language identifiers to fenced blocks.
These code fences should declare a language for markdownlint compliance.
Also applies to: 204-206
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 14-14: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@DEPLOY_CHAIBOT.md` around lines 14 - 20, The markdown code fences in
DEPLOY_CHAIBOT.md at lines 14-20 and also at lines 204-206 are missing language
identifiers after the opening triple backticks, which violates markdownlint
requirements. Add an appropriate language identifier (such as "text" or
"plaintext") after each opening code fence marker (the triple backticks) to
declare the language type for all affected fenced code blocks.
Source: Linters/SAST tools
```
User: The e2e-aws job is failing again 😞
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/12345/pull-ci-openshift-installer-master-e2e-aws/678901

---

Chaibot [BOT]: :mag: Analyzing failure... (typically takes 30-60 seconds)

[30 seconds later]

Chaibot [BOT]: :cloud: **Test Failure Analysis**

**Job:** `pull-ci-openshift-installer-master-e2e-aws`
**Status:** Failed after 2h 15m
**Root Cause:** Infrastructure - AWS EC2 Capacity (Confidence: 85%)

**Analysis:**
Test failed during cluster provisioning when attempting to launch EC2 instances in us-east-1c.
AWS returned "InsufficientInstanceCapacity" error after multiple retry attempts over 45 minutes.

**Evidence:**
```
Error: creating EC2 Instance (i-0a1b2c3d4e5f): InsufficientInstanceCapacity
status code: 500, request id: xyz-123
```

**Historical Pattern:**
This failure has occurred 8 times in the last 24 hours across multiple jobs, all in us-east-1c AZ.
Sippy shows this as a known transient infrastructure issue.

**Recommendations:**
1. ✅ **Retest** - This is a transient AWS issue, likely to succeed on retry
2. 📊 Check AWS Service Health Dashboard for us-east-1 incidents
3. 🔔 If failures persist >6 hours, escalate to infrastructure team

**Related:**
- <https://sippy.dptools.openshift.org/|Sippy Dashboard>
- <https://issues.redhat.com/browse/DPTP-5678|DPTP-5678>: Similar AWS capacity issues

**Classification:** Transient Infrastructure (Not a product bug)

[Buttons: View Logs | Retest | Mark as Known Issue]
```
Add language specifiers to fenced code examples.
These blocks are missing fence languages and will continue to fail MD040 lint checks.
Also applies to: 59-62, 211-213
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 38-38: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 62-62: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/chaibot-test-failure-triage.md` around lines 38 - 80, The fenced code
blocks in the markdown file are missing language specifiers, which causes MD040
lint check failures. Add appropriate language specifiers to all code blocks that
currently start with just triple backticks. Based on the content of each block,
determine the appropriate language identifier (such as bash, json, text, etc.)
and add it immediately after the opening triple backticks to fix the lint
violations.
Source: Linters/SAST tools
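To locate the fences that the MD040 warnings refer to before editing, a small script can list opening fences with no language. The following is a simplified Python approximation of that rule, not markdownlint's actual implementation; the function name `fences_missing_language` is an assumption made for this sketch.

```python
import re

# A fence line: up to 3 spaces of indent, then 3+ backticks or tildes,
# then an optional info string (the language identifier).
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})\s*(.*)$")


def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences with no language."""
    missing = []
    open_fence = None  # marker of the block we are currently inside, if any
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            # Opening fence: remember it and check for an info string.
            open_fence = marker
            if not info:
                missing.append(number)
        elif marker[0] == open_fence[0] and len(marker) >= len(open_fence) and not info:
            # Closing fence: same character, at least as long, no info string.
            open_fence = None
    return missing


sample = "intro\n```\nplain\n```\n\n```yaml\nkey: value\n```\n"
print(fences_missing_language(sample))
```

Because a bare fence inside an already open block closes that block, a nested example is reported at more than one line, so the outer block needs a longer fence or the inner one needs indenting.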
amp-rh left a comment
The ci/prow/ci-secret-bootstrap-config-validation check is failing with:
config[115].from[openai-api-key]: empty value is not allowed
Root cause: The _config.yaml entry uses path: selfservice/cspi-qe/chaibot-openai-key, but ci-secret-bootstrap cannot read selfservice/ vault paths (its service account lacks ACL access to that namespace). This is the only path: usage in the entire _config.yaml; all 1594 other field references use item:.
Self-service vault items are designed to be synced via the secretsync mechanism (driven by the secretsync/target-name and secretsync/target-namespace metadata keys on the vault item), not via ci-secret-bootstrap.
Recommended fix: Since your vault item already has secretsync/target-name: cluster-secrets-chaibot-openai-key and secretsync/target-namespace: ci, remove the _config.yaml entry and update the deployment to reference the secretsync-managed secret name (cluster-secrets-chaibot-openai-key).
Note: Verify that secretsync targets the app.ci cluster (where ci-chat-bot runs). If it doesn't sync there by default, add secretsync/target-clusters: app.ci to the vault item metadata. See https://docs.ci.openshift.org/how-tos/adding-a-new-secret-to-ci/
Co-authored-by: Anthony Pruitt <mpruitt@redhat.com>
amp-rh left a comment
The _config.yaml fix looks good. A few stale references to the old secret name ci-chat-bot-chaibot-secrets remain across the PR that should be updated to cluster-secrets-chaibot-openai-key (the secretsync-managed name).
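One way to find the remaining stale references is a plain substring scan over the changed files. The following is a minimal Python sketch, assuming the file contents are already loaded into a dictionary keyed by path; the helper `find_stale_references` and the sample paths and contents are hypothetical and are not taken from the PR.

```python
OLD_NAME = "ci-chat-bot-chaibot-secrets"          # name used before the fix
NEW_NAME = "cluster-secrets-chaibot-openai-key"   # secretsync-managed name


def find_stale_references(files: dict[str, str], old_name: str = OLD_NAME) -> list[tuple[str, int]]:
    """Return (path, 1-based line number) for every line mentioning old_name."""
    hits = []
    for path, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if old_name in line:
                hits.append((path, number))
    return hits


# Made-up file contents for illustration only.
files = {
    "example/deployment.yaml": f"volumes:\n- secret:\n    secretName: {OLD_NAME}\n",
    "example/README.md": f"The key is read from {NEW_NAME}.\n",
}
for path, number in find_stale_references(files):
    print(f"{path}:{number}: replace {OLD_NAME} with {NEW_NAME}")
```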
amp-rh left a comment
Inline suggestions for the remaining ci-chat-bot-chaibot-secrets references. Click to apply.
|
/test ci-secret-bootstrap-config-validation |
Convert incomplete Secret documentation section to pure comments to fix rover groups collection error. The section was missing required Kubernetes object fields (kind, apiVersion, metadata) which caused validation failures. The secret is managed by secretsync from vault item selfservice/cspi-qe/chaibot-openai-key and does not need to be defined in this file. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
amp-rh left a comment
The sync-rover-groups CI check is failing because this section creates an invalid YAML document (no Kind field). The --- separator starts a new document, and the uncommented lines make it a non-empty, non-Kubernetes object. Remove this block entirely.
```yaml
---
# Secret is managed by secretsync from vault item:
# selfservice/cspi-qe/chaibot-openai-key
# It will appear as:
# name: cluster-secrets-chaibot-openai-key
# namespace: ci
# This should be managed via ci-secret-bootstrap
```
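The failure mode described in this comment (a non-empty YAML document with no `kind` field) can be reproduced with a rough check. The following Python sketch uses only the standard library and a simplified notion of YAML documents; it is an approximation written for illustration, not the logic of the sync-rover-groups job, and the name `invalid_documents` and the sample manifest are assumptions.

```python
import re


def invalid_documents(manifest: str) -> list[int]:
    """Return 1-based indexes of non-empty documents lacking a top-level `kind:`."""
    invalid = []
    # Split on separator lines; a simplification of real YAML parsing.
    documents = re.split(r"(?m)^---[ \t]*$", manifest)
    for index, document in enumerate(documents, start=1):
        # Keep only lines with content: drop blanks and full-line comments.
        content = [
            line for line in document.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]
        if not content:
            continue  # empty or comment-only documents are ignored
        if not any(line.startswith("kind:") for line in content):
            invalid.append(index)
    return invalid


# Hypothetical manifest: a valid object, a comment-only document,
# and a document with uncommented keys but no kind.
manifest = (
    "kind: ConfigMap\napiVersion: v1\nmetadata:\n  name: example\n"
    "---\n# comment only\n"
    "---\nname: cluster-secrets-chaibot-openai-key\nnamespace: ci\n"
)
print(invalid_documents(manifest))
```

Under this simplified model a comment-only document is harmless, while a single uncommented key after the separator is enough to flag the document.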
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml (2)
98-107: ⚠️ Potential issue | 🔴 Critical

Update the alert `job` label selector and add missing `runbook_url` annotation.

The ServiceMonitor has no custom `jobLabel`, so Prometheus constructs the job label as `ci/ci-chat-bot` (namespace/name). The current alert expression `up{job="ci-chat-bot"}` will not match this metric and the alert will never fire. Additionally, the `runbook_url` annotation is missing for consistency with `ChaibotHighErrorRate`.

Suggested fix

```diff
 - alert: ChaibotDown
   expr: |
-    up{job="ci-chat-bot"} == 0
+    up{job="ci/ci-chat-bot"} == 0
   for: 5m
   labels:
     severity: critical
     team: test-platform
   annotations:
     summary: "Chaibot service is down"
     description: "ci-chat-bot service (including Chaibot) has been down for 5 minutes."
+    runbook_url: "https://github.com/openshift/release/blob/main/docs/dptp-triage-sop/chaibot.md"
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` around lines 98 - 107, In the ChaibotDown alert, fix the job label selector in the expr field from "ci-chat-bot" to "ci/ci-chat-bot" to match how Prometheus constructs the job label as namespace/name without a custom jobLabel in the ServiceMonitor. Additionally, add a missing runbook_url annotation to the ChaibotDown alert annotations section for consistency with the ChaibotHighErrorRate alert.
85-85: ⚠️ Potential issue | 🟡 Minor

Create the runbook file or update the URL.

The `runbook_url` annotation references `docs/dptp-triage-sop/chaibot.md`, which does not exist. Either create this runbook file in the correct directory or point the annotation to an existing chaibot documentation file (e.g., `core-services/ci-chat-bot/CHAIBOT.md`).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` at line 85, The runbook_url annotation in the chaibot-deployment-patch.yaml file references a non-existent file at docs/dptp-triage-sop/chaibot.md. Either create the runbook markdown file at that location in the release repository with appropriate chaibot troubleshooting documentation, or update the runbook_url value to point to an existing chaibot documentation file such as core-services/ci-chat-bot/CHAIBOT.md. Ensure the URL in the annotation matches the actual location of the documentation you choose to use.
🧹 Nitpick comments (2)
clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml (2)
6-51: ⚡ Quick win

Patch documentation appears redundant: base deployment already includes these changes.

Lines 6-51 document volumes, mounts, args, and env vars that should be added to the base `ci-chat-bot.yaml`. However, cross-referencing with the base deployment (context snippet 1, lines 334-461) shows all these elements are already present:

- `triage-config` and `chaibot-secrets` volumes and mounts exist
- `CHAIBOT_ENABLED=true` and `OPENAI_API_KEY` env vars exist
- `--enable-triage=true` and `--triage-config-path` args exist

Consider either removing these commented sections or adding a header note:

`# NOTE: These patches have been applied to ci-chat-bot.yaml. Retained here for reference only.`

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` around lines 6 - 51, The patch file (chaibot-deployment-patch.yaml) contains extensive commented documentation describing configuration changes (volumes, mounts, args, and env vars) that have already been applied to the base ci-chat-bot.yaml deployment. Remove the redundant commented sections spanning lines 6-51, or alternatively replace them with a single header comment stating that these patches have been applied to ci-chat-bot.yaml and are retained for reference only.
87-96: ⚡ Quick win

Add `runbook_url` annotation for operational consistency.

The `ChaibotAnalysisTimeout` alert is missing a `runbook_url` annotation, while `ChaibotHighErrorRate` includes one. For consistency and operational clarity, add a runbook reference.

📚 Suggested addition

```diff
   annotations:
     summary: "Chaibot analysis taking too long"
     description: "95th percentile analysis duration is {{ $value }}s, exceeding 120s timeout."
+    runbook_url: "https://github.com/openshift/release/blob/main/docs/dptp-triage-sop/chaibot.md"
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` around lines 87 - 96, The ChaibotAnalysisTimeout alert definition is missing a runbook_url annotation in its annotations section, which creates an inconsistency with other alerts like ChaibotHighErrorRate that include this field. Add a runbook_url annotation to the ChaibotAnalysisTimeout alert's annotations block, providing an appropriate runbook reference URL that follows the same pattern used in other similar alerts for operational consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`:
- Around line 98-107: In the ChaibotDown alert, fix the job label selector in
the expr field from "ci-chat-bot" to "ci/ci-chat-bot" to match how Prometheus
constructs the job label as namespace/name without a custom jobLabel in the
ServiceMonitor. Additionally, add a missing runbook_url annotation to the
ChaibotDown alert annotations section for consistency with the
ChaibotHighErrorRate alert.
- Line 85: The runbook_url annotation in the chaibot-deployment-patch.yaml file
references a non-existent file at docs/dptp-triage-sop/chaibot.md. Either create
the runbook markdown file at that location in the release repository with
appropriate chaibot troubleshooting documentation, or update the runbook_url
value to point to an existing chaibot documentation file such as
core-services/ci-chat-bot/CHAIBOT.md. Ensure the URL in the annotation matches
the actual location of the documentation you choose to use.
---
Nitpick comments:
In `@clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml`:
- Around line 6-51: The patch file (chaibot-deployment-patch.yaml) contains
extensive commented documentation describing configuration changes (volumes,
mounts, args, and env vars) that have already been applied to the base
ci-chat-bot.yaml deployment. Remove the redundant commented sections spanning
lines 6-51, or alternatively replace them with a single header comment stating
that these patches have been applied to ci-chat-bot.yaml and are retained for
reference only.
- Around line 87-96: The ChaibotAnalysisTimeout alert definition is missing a
runbook_url annotation in its annotations section, which creates an
inconsistency with other alerts like ChaibotHighErrorRate that include this
field. Add a runbook_url annotation to the ChaibotAnalysisTimeout alert's
annotations block, providing an appropriate runbook reference URL that follows
the same pattern used in other similar alerts for operational consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 08d82122-418a-486f-90b4-e7e3761543cf
📒 Files selected for processing (3)
- CHAIBOT_QUICKSTART.md
- DEPLOY_CHAIBOT.md
- clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml
✅ Files skipped from review due to trivial changes (2)
- CHAIBOT_QUICKSTART.md
- DEPLOY_CHAIBOT.md
@chaclark1974: all tests passed! Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary by CodeRabbit
This PR adds Chaibot, an AI-powered Slack integration that automatically triages OpenShift CI test failures. When Prow test failure messages are posted to designated Slack channels, Chaibot analyzes them using OpenAI GPT models and provides categorized triage results (infrastructure issues, flaky tests, bugs, or configuration problems) with actionable recommendations and relevant documentation links.
Configuration and Infrastructure Changes:
The PR introduces the complete configuration and Kubernetes deployment manifests required for Chaibot:
- `triage-config.yaml`: Defines monitored Slack channels, failure detection patterns, AI analysis parameters, categorization rules with confidence thresholds, and rate limiting controls
- `ci-chat-bot.yaml`: Wires the configuration and secrets into the existing ci-chat-bot pods, adds the `--enable-triage` flag, and sets required environment variables

Integration Capabilities:
Chaibot integrates with multiple systems for comprehensive failure analysis:
Documentation and Deployment Guidance:
The PR includes extensive documentation for operators: a quick-start guide, detailed deployment runbook, troubleshooting procedures, and cost estimates (GPT-4 ~$0.03/analysis, GPT-3.5-turbo ~$0.003/analysis, with configurable rate limiting).
Important Implementation Note:
The PR provides configuration and Kubernetes manifests only. Runtime code implementation in the `openshift/ci-tools` repository (`cmd/ci-chat-bot`) is required for Chaibot to function; without it, the deployment will succeed but triage responses will not be generated.