Add Chaibot test failure triage using Chai Bot (ship-help MCP) #80559
oramraz wants to merge 5 commits into
- Uses Chai Bot (ship-help MCP) for $0/month cost vs ~$90/month OpenAI
- Richer analysis: 9+ data sources vs 3
- Based on proven /analyze-failure skill
- Complete implementation code included
- All data stays on Red Hat infrastructure

Replaces OpenAI approach from openshift#80476 with internal Chai Bot service.

Files added:
- core-services/ci-chat-bot/triage-config.yaml - Main configuration
- clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml - Kubernetes ConfigMap
- core-services/ci-secret-bootstrap/chaibot-secret-config.yaml - Secret setup
- CHAIBOT_QUICKSTART.md - Deployment guide

Files modified:
- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml - Added volumes, mounts, env vars, args

Benefits:
- Annual savings: $1,080
- Better analysis: Slack history, GitHub, docs, Brew/Koji access
- Privacy: No external vendor
- Proven: Already working via /analyze-failure skill

Related: openshift#80476 (original infrastructure design by @chaclark1974)
@oramraz: GitHub didn't allow me to request PR reviews from the following users: openshift/crt. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Walkthrough
Adds infrastructure artifacts to enable Chaibot (ship-help MCP) for test-failure triage in ci-chat-bot: a Vault-to-Kubernetes secret bootstrap config, a full triage YAML config in both a canonical file and a Kubernetes ConfigMap, Deployment updates wiring volumes/env vars/CLI args, and a quickstart guide.
Changes
Chaibot triage configuration and deployment
Estimated code review effort: 2 (Simple), ~15 minutes
Pre-merge checks: 15 passed
Hi @oramraz. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. Regular contributors should join the org to skip this step.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: oramraz
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (1)
334-390: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
Harden the bot container security context before rollout.
Lines 334-390 still run without explicit container hardening (runAsNonRoot, allowPrivilegeEscalation: false, readOnlyRootFilesystem, drop ALL capabilities). This is a security posture gap on a pod now handling additional credentials/config.
As per coding guidelines, "If this is a Kubernetes/OpenShift manifest ... securityContext: runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false; Drop ALL capabilities, add only what is required."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` around lines 334 - 390, The bot container lacks explicit security hardening measures required by security guidelines. Add a securityContext block to the bot container specification that includes: runAsNonRoot set to true, allowPrivilegeEscalation set to false, readOnlyRootFilesystem set to true, and a capabilities section that drops ALL capabilities. This should be added at the same level as other container properties like imagePullPolicy, livenessProbe, and readinessProbe to enforce the security posture on this container handling sensitive credentials and configuration.
Sources: Coding guidelines, Linters/SAST tools
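As a rough illustration of the hardening finding above, the check below walks a container spec (as a plain dict) and reports which of the four settings the review names are missing. It is a sketch in Python rather than the bot's own Go; `hardening_gaps` and the sample container are hypothetical, and only the `securityContext` field names come from the Kubernetes container spec.

```python
def hardening_gaps(container):
    """Return the hardening settings that a container spec does not set."""
    ctx = container.get("securityContext") or {}
    gaps = []
    if ctx.get("runAsNonRoot") is not True:
        gaps.append("runAsNonRoot")
    if ctx.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation")
    if ctx.get("readOnlyRootFilesystem") is not True:
        gaps.append("readOnlyRootFilesystem")
    dropped = (ctx.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        gaps.append("capabilities.drop")
    return gaps


# Hypothetical container entries, mirroring the settings named in the review.
unhardened = {"name": "bot"}
hardened = {
    "name": "bot",
    "securityContext": {
        "runAsNonRoot": True,
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]},
    },
}

print(hardening_gaps(unhardened))  # lists all four settings
print(hardening_gaps(hardened))    # empty list
```

A check like this could run in CI against the rendered manifest, though whether the bot tolerates a read-only root filesystem would still need to be verified on a real rollout.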
clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml (1)
10-156: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Keep the ConfigMap triage payload in lockstep with the canonical triage config.
This embedded triage-config.yaml diverges from core-services/ci-chat-bot/triage-config.yaml (e.g., detection keywords/patterns, Jira projects, artifact patterns, and monitoring content). That means runtime behavior in app.ci can differ from the canonical source-of-truth.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml` around lines 10 - 156, The triage-config.yaml embedded in the chaibot-configmap.yaml ConfigMap has diverged from the canonical triage configuration and contains inconsistencies in detection keywords/patterns, Jira projects, artifact patterns, and monitoring settings. Synchronize the entire triage-config.yaml content within the ConfigMap to match the canonical version from core-services/ci-chat-bot/triage-config.yaml to ensure runtime behavior in app.ci is consistent with the authoritative source-of-truth and prevent drift between environments.
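The drift described above can be made concrete with a small recursive comparison. The sketch below is Python, not part of the PR; `diff_config` is a hypothetical helper, and the two dicts are hand-copied excerpts of the keyword and Jira project lists from the two files in this PR (a real check would load both YAML documents instead).

```python
def diff_config(canonical, deployed, path=""):
    """Return dotted paths at which two nested configs differ."""
    diffs = []
    if isinstance(canonical, dict) and isinstance(deployed, dict):
        for key in sorted(set(canonical) | set(deployed)):
            child = f"{path}.{key}" if path else key
            if key not in deployed:
                diffs.append(f"{child}: missing from deployed")
            elif key not in canonical:
                diffs.append(f"{child}: missing from canonical")
            else:
                diffs.extend(diff_config(canonical[key], deployed[key], child))
    elif canonical != deployed:
        diffs.append(f"{path}: {canonical!r} != {deployed!r}")
    return diffs


# Excerpt of core-services/ci-chat-bot/triage-config.yaml
canonical = {
    "failure_detection": {
        "failure_keywords": [
            "test failed", "job failed", "failure", "test timeout",
            "flaky test", "regression", "broken test",
        ],
    },
    "jira": {"search_projects": ["OCPBUGS", "DPTP", "ACM", "LPINTEROP"]},
}

# Excerpt of the payload embedded in chaibot-configmap.yaml
deployed = {
    "failure_detection": {
        "failure_keywords": [
            "test failed", "job failed", "failure", "flaky", "regression",
        ],
    },
    "jira": {"search_projects": ["OCPBUGS", "DPTP"]},
}

for line in diff_config(canonical, deployed):
    print(line)
```

Generating the ConfigMap from the canonical file would remove the need for such a check altogether; the comparison is only useful while both copies are maintained by hand.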
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHAIBOT_QUICKSTART.md`:
- Line 11: Add language identifiers to the fenced code blocks in
CHAIBOT_QUICKSTART.md that are missing them. Locate all ``` markers that open
code blocks without a language tag and add the appropriate identifier (text,
bash, or yaml) immediately after the opening backticks based on the content of
each code block. This applies to the fenced code blocks at lines 11, 96, and 104
to satisfy the MD040 markdown linting rule.
- Around line 224-227: The CHAIBOT_QUICKSTART.md document contains inconsistent
file references that create confusion about which main file readers should
examine. Line 227 references main-integration.go for implementation details,
while lines 24-25 list cmd/ci-chat-bot/main.go as the file structure. Update all
references to consistently point to the same file throughout the document to
eliminate ambiguity. Determine which file is the correct implementation file and
update all mentions of main.go and main-integration.go to use the correct
canonical file name and path consistently.
In `@core-services/ci-chat-bot/triage-config.yaml`:
- Around line 171-176: The secret_name value in the ai_api configuration block
is set to "chaibot-ship-help-token", but this does not match the actual deployed
secret name which is "cluster-secrets-chaibot-ship-help". Update the secret_name
field in the ai_api section to use the correct deployed secret name to ensure
secret resolution works properly across the configuration and deployment.
In `@core-services/ci-secret-bootstrap/chaibot-secret-config.yaml`:
- Around line 24-26: Remove the actual JWT token value from line 25 in the
chaibot-secret-config.yaml file. Replace the real bearer token example with a
generic placeholder (such as "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." or
similar) or remove the token example entirely and keep only the comment
explaining what an example token would look like. Additionally, treat the
exposed token as compromised and ensure it is rotated/revoked through your
credential management process.
---
Outside diff comments:
In `@clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml`:
- Around line 10-156: The triage-config.yaml embedded in the
chaibot-configmap.yaml ConfigMap has diverged from the canonical triage
configuration and contains inconsistencies in detection keywords/patterns, Jira
projects, artifact patterns, and monitoring settings. Synchronize the entire
triage-config.yaml content within the ConfigMap to match the canonical version
from core-services/ci-chat-bot/triage-config.yaml to ensure runtime behavior in
app.ci is consistent with the authoritative source-of-truth and prevent drift
between environments.
In `@clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml`:
- Around line 334-390: The bot container lacks explicit security hardening
measures required by security guidelines. Add a securityContext block to the bot
container specification that includes: runAsNonRoot set to true,
allowPrivilegeEscalation set to false, readOnlyRootFilesystem set to true, and a
capabilities section that drops ALL capabilities. This should be added at the
same level as other container properties like imagePullPolicy, livenessProbe,
and readinessProbe to enforce the security posture on this container handling
sensitive credentials and configuration.
📒 Files selected for processing (5)
- CHAIBOT_QUICKSTART.md
- clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml
- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml
- core-services/ci-chat-bot/triage-config.yaml
- core-services/ci-secret-bootstrap/chaibot-secret-config.yaml
Replace real JWT token with redacted example to prevent credential exposure. The example token contained valid user PII and authentication credentials. Fixes CodeRabbit security warning.
CodeRabbit found a cross-file contract mismatch where some files referenced 'chaibot-ship-help-token' while the actual secret created by ci-secret-bootstrap and referenced in the deployment is 'cluster-secrets-chaibot-ship-help'.

Updated to use consistent secret name across all files:
- core-services/ci-chat-bot/triage-config.yaml
- clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml

This matches:
- core-services/ci-secret-bootstrap/chaibot-secret-config.yaml (line 15)
- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (lines 294, 435)

Fixes CodeRabbit review comment.
Add language identifiers (text) to fenced code blocks on lines 11, 96, and 104 to satisfy MD040 linting rule. Fixes CodeRabbit markdown linting warning.
Line 227 referenced 'main-integration.go' but lines 24-25 correctly list 'cmd/ci-chat-bot/main.go'. Updated line 227 to use consistent file paths with full directory context:
- pkg/chaibot/analyzer.go
- cmd/ci-chat-bot/main.go

Fixes CodeRabbit consistency warning.
Lines 15-83 showed example code for modifying cmd/slack-bot/main.go, but this code is not implemented in this PR. The examples looked copy-pasteable but were actually aspirational guidance.

Additionally:
- TriageConfig, MonitoredChannel, AnalysisConfig structs (lines 89-104) were shown as examples but are not exported types
- monitorForFailures function (lines 106-116) was documented as "placeholder" yet showed full implementation (contradictory)

Fixed by:
1. Removed aspirational example code (lines 15-117)
2. Replaced with "Files in This PR" section listing actual implementation
3. Added "How It Works" explaining the event handler pattern (already implemented)
4. Clarified that configuration is in openshift/release#80559, NOT this PR
5. Added Usage, Architecture, and Cost Comparison sections
6. Made it clear: THIS PR IS COMPLETE IMPLEMENTATION

Now readers understand:
- What's in THIS PR (implementation)
- What's in release#80559 (configuration)
- How to use it after deployment

Fixes CodeRabbit documentation clarity issue.
    --enable-triage=true \\
    --triage-config-path=/etc/triage-config/triage-config.yaml
These changes will break the ClusterBot, as these are not valid options, and should not merge.
    # Chaibot Test Failure Triage Configuration
    # This config enables ci-chat-bot to monitor Slack channels for test failures
    # and provide automated triage analysis using Chai Bot (ship-help MCP)

    # Feature flag to enable/disable triage functionality
    enabled: true

    # Slack channels to monitor for test failures
    monitored_channels:
      - name: "opp-discussion"
        channel_id: "C04TMLC6DRV" # Actual channel ID for #opp-discussion
        auto_respond: true
        response_mode: "thread" # Options: thread, channel, dm

      # Additional channels can be added
      # - name: "forum-ocp-testplatform"
      #   channel_id: "CHANNEL_ID"
      #   auto_respond: false # Require @mention to trigger

    # Patterns to detect test failure messages
    failure_detection:
      # URL patterns that indicate Prow job failures
      prow_job_patterns:
        - "https://prow.ci.openshift.org/view/gs/"
        - "https://prow.ci.openshift.org/?pr="
        - "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/"

      # Keywords that indicate test failures
      failure_keywords:
        - "test failed"
        - "job failed"
        - "failure"
        - "test timeout"
        - "flaky test"
        - "regression"
        - "broken test"

      # Message must contain job URL OR (keyword + context)
      require_job_url: false

    # Analysis configuration
    analysis:
      # Maximum time to spend analyzing a single failure (seconds)
      timeout: 120

      # AI provider configuration - Using Chai Bot via ship-help MCP
      ai_provider: "ship-help-mcp"

      # Ship-help MCP endpoint
      mcp_endpoint: "https://ship-help-mcp-continuous-release-tooling--ship-help-bot.apps.gpc.ocp-hub.prod.psi.redhat.com/personas/ocp_ai_helpdesk/mcp"

      # Analysis prompt template (from proven /analyze-failure skill)
      prompt_template: |
        Analyze this failed Prow CI job: {job_url}

        Please provide a comprehensive failure analysis:

        1. **Which step(s) failed?**
        2. **Root cause:** Product bug, test issue, or infrastructure problem?
        3. **Related Jira tickets:** Duplicates, auto-filed tickets
        4. **Pass rate:** Last 14 days if available
        5. **Recommended fixes:** Prioritized options
        6. **Next steps:** Who to escalate to, what action to take

        Format with clear headings and Jira links: [TICKET-123](https://redhat.atlassian.net/browse/TICKET-123)

      # What to analyze
      analyze_components:
        - job_metadata # Job name, duration, timestamp
        - failure_logs # Pod logs, junit output
        - historical_data # Sippy integration for past failures
        - infrastructure # Cloud provider issues, cluster state
        - known_issues # JIRA search for similar failures

      # Categorization rules (used for emoji and formatting, Chai Bot does main analysis)
      failure_categories:
        infrastructure:
          patterns:
            - "InsufficientInstanceCapacity"
            - "RequestLimitExceeded"
            - "could not create instance"
            - "timeout waiting for"
            - "connection refused"
          confidence_threshold: 0.7

        flaky_test:
          patterns:
            - "race condition"
            - "intermittent"
            - "sometimes fails"
            - "timeout.*eventually"
          confidence_threshold: 0.6

        product_bug:
          patterns:
            - "panic:"
            - "nil pointer"
            - "assertion failed"
            - "unexpected error"
          confidence_threshold: 0.8

        configuration:
          patterns:
            - "missing environment"
            - "invalid configuration"
            - "could not find image"
            - "secret.*not found"
          confidence_threshold: 0.75

    # Response formatting
    response:
      # Template for Slack message response
      include_sections:
        - summary # Brief one-line summary
        - root_cause # Identified root cause with confidence
        - evidence # Key log excerpts and patterns
        - historical # Similar past failures from Sippy
        - recommendations # Suggested actions
        - related_issues # JIRA issues or documentation

      # Emoji indicators for quick visual parsing
      use_emojis: true
      emoji_map:
        infrastructure: ":cloud:"
        flaky_test: ":game_die:"
        product_bug: ":bug:"
        configuration: ":wrench:"
        unknown: ":question:"

      # Add interactive buttons
      include_actions:
        - label: "View Full Logs"
          action: "open_url"
        - label: "Mark Flaky"
          action: "mark_flaky"

    # Integration settings
    integrations:
      # Sippy integration for historical failure data
      # NOTE: Chai Bot already has Sippy access, this is for metadata only
      sippy:
        enabled: true
        base_url: "https://sippy.dptools.openshift.org"
        lookback_days: 7
        min_occurrences: 2 # Minimum failures to show pattern

      # JIRA integration for known issues
      # NOTE: Chai Bot already has Jira access, this is for metadata only
      jira:
        enabled: true
        endpoint: "https://redhat.atlassian.net"
        search_projects:
          - "OCPBUGS"
          - "DPTP"
          - "ACM"
          - "LPINTEROP"
        max_results: 5

      # Prow/GCS access for log fetching
      # NOTE: Chai Bot already has Prow access, this is for metadata only
      prow:
        enabled: true
        gcs_bucket: "gs://origin-ci-test"
        max_log_size_mb: 50
        fetch_artifacts:
          - "build-log.txt"
          - "junit*.xml"
          - "e2e-events*.json"

      # Ship-Help MCP API configuration
      ai_api:
        enabled: true
        provider: "ship-help-mcp"
        secret_name: "cluster-secrets-chaibot-ship-help" # Kubernetes secret
        secret_namespace: "ci"
        # No rate limiting needed - Chai Bot is a shared service

    # Rate limiting and abuse prevention
    rate_limiting:
      max_analyses_per_hour: 100
      max_analyses_per_user_per_hour: 10
      max_concurrent_analyses: 5
      cooldown_seconds: 30 # Min time between analyses for same job

    # Monitoring and observability
    monitoring:
      metrics_enabled: true
      metrics_port: 9090
      log_level: "info" # Options: debug, info, warn, error

      # Prometheus metrics to export
      metrics:
        - chaibot_messages_processed_total
        - chaibot_failures_detected_total
        - chaibot_analyses_completed_total
        - chaibot_analysis_duration_seconds
        - chaibot_mcp_errors_total # Changed from api_errors
        - chaibot_category_detections_total
This file has nothing to do with the ClusterBot
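For readers trying to picture what the `failure_categories` and `emoji_map` entries in the config above are meant to drive, here is one possible reading, sketched in Python. The config does not say how `confidence_threshold` is computed, so this sketch ignores it and simply picks the category with the most matching patterns; `categorize` is a hypothetical name, and case-insensitive regex matching is an assumption.

```python
import re

# Patterns and emoji copied from the triage config in this PR.
FAILURE_CATEGORIES = {
    "infrastructure": [
        "InsufficientInstanceCapacity", "RequestLimitExceeded",
        "could not create instance", "timeout waiting for",
        "connection refused",
    ],
    "flaky_test": [
        "race condition", "intermittent", "sometimes fails",
        "timeout.*eventually",
    ],
    "product_bug": [
        "panic:", "nil pointer", "assertion failed", "unexpected error",
    ],
    "configuration": [
        "missing environment", "invalid configuration",
        "could not find image", "secret.*not found",
    ],
}
EMOJI_MAP = {
    "infrastructure": ":cloud:",
    "flaky_test": ":game_die:",
    "product_bug": ":bug:",
    "configuration": ":wrench:",
    "unknown": ":question:",
}


def categorize(log_text):
    """Return (category, emoji) for the category with the most pattern hits."""
    best, best_hits = "unknown", 0
    for category, patterns in FAILURE_CATEGORIES.items():
        hits = sum(
            1 for p in patterns if re.search(p, log_text, re.IGNORECASE)
        )
        if hits > best_hits:  # ties keep the earlier category
            best, best_hits = category, hits
    return best, EMOJI_MAP[best]


print(categorize("panic: runtime error: nil pointer dereference"))
print(categorize('secret "pull-secret" not found'))
print(categorize("everything passed"))
```

Because several patterns are very general ("failure" style substrings such as "unexpected error"), a hit-count heuristic like this one would likely misclassify some logs; the config comment suggests the categories are only used for emoji and formatting.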
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ci-chat-bot-triage-config
      namespace: ci
      labels:
        app: ci-chat-bot
        component: chaibot
    data:
      triage-config.yaml: |
        # Chaibot Test Failure Triage Configuration
        # Uses Chai Bot (ship-help MCP) instead of OpenAI for analysis

        enabled: true

        monitored_channels:
          - name: "opp-discussion"
            channel_id: "C04TMLC6DRV"
            auto_respond: true
            response_mode: "thread"

        failure_detection:
          prow_job_patterns:
            - "https://prow.ci.openshift.org/view/gs/"
            - "https://prow.ci.openshift.org/?pr="
            - "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/"

          failure_keywords:
            - "test failed"
            - "job failed"
            - "failure"
            - "flaky"
            - "regression"

          require_job_url: false

        analysis:
          timeout: 120
          ai_provider: "ship-help-mcp"

          mcp_endpoint: "https://ship-help-mcp-continuous-release-tooling--ship-help-bot.apps.gpc.ocp-hub.prod.psi.redhat.com/personas/ocp_ai_helpdesk/mcp"

          prompt_template: |
            Analyze this failed Prow CI job: {job_url}

            Please provide a comprehensive failure analysis:

            1. **Which step(s) failed?**
            2. **Root cause:** Product bug, test issue, or infrastructure problem?
            3. **Related Jira tickets:** Duplicates, auto-filed tickets
            4. **Pass rate:** Last 14 days if available
            5. **Recommended fixes:** Prioritized options
            6. **Next steps:** Who to escalate to, what action to take

            Format with clear headings and Jira links: [TICKET-123](https://redhat.atlassian.net/browse/TICKET-123)

          analyze_components:
            - job_metadata
            - failure_logs
            - historical_data
            - infrastructure
            - known_issues

          failure_categories:
            infrastructure:
              patterns:
                - "InsufficientInstanceCapacity"
                - "RequestLimitExceeded"
                - "could not create instance"
                - "timeout waiting for"
              confidence_threshold: 0.7

            flaky_test:
              patterns:
                - "race condition"
                - "intermittent"
                - "timeout.*eventually"
              confidence_threshold: 0.6

            product_bug:
              patterns:
                - "panic:"
                - "nil pointer"
                - "assertion failed"
              confidence_threshold: 0.8

            configuration:
              patterns:
                - "missing environment"
                - "invalid configuration"
                - "secret.*not found"
              confidence_threshold: 0.75

        response:
          include_sections:
            - summary
            - root_cause
            - evidence
            - historical
            - recommendations
            - related_issues

          use_emojis: true
          emoji_map:
            infrastructure: ":cloud:"
            flaky_test: ":game_die:"
            product_bug: ":bug:"
            configuration: ":wrench:"
            unknown: ":question:"

          include_actions:
            - label: "View Full Logs"
              action: "open_url"
            - label: "Mark Flaky"
              action: "mark_flaky"

        integrations:
          sippy:
            enabled: true
            base_url: "https://sippy.dptools.openshift.org"
            lookback_days: 7
            min_occurrences: 2

          jira:
            enabled: true
            endpoint: "https://redhat.atlassian.net"
            search_projects:
              - "OCPBUGS"
              - "DPTP"
            max_results: 5

          prow:
            enabled: true
            gcs_bucket: "gs://origin-ci-test"
            max_log_size_mb: 50
            fetch_artifacts:
              - "build-log.txt"
              - "junit*.xml"

          ai_api:
            enabled: true
            provider: "ship-help-mcp"
            secret_name: "cluster-secrets-chaibot-ship-help"
            secret_namespace: "ci"

        rate_limiting:
          max_analyses_per_hour: 100
          max_analyses_per_user_per_hour: 10
          max_concurrent_analyses: 5
          cooldown_seconds: 30

        monitoring:
          metrics_enabled: true
          metrics_port: 9090
          log_level: "info"
This file has nothing to do with the ClusterBot
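The `rate_limiting` values in the ConfigMap above imply some bookkeeping that the PR page does not show. One way the hourly caps and the per-job cooldown could be enforced is sketched below in Python; `TriageLimiter` is a hypothetical class, the clock is passed in explicitly to keep the example deterministic, and `max_concurrent_analyses` is left out because it depends on how analyses are scheduled.

```python
from collections import defaultdict, deque


class TriageLimiter:
    """Sliding one-hour window limits plus a per-job cooldown."""

    def __init__(self, per_hour=100, per_user_per_hour=10, cooldown_seconds=30):
        self.per_hour = per_hour
        self.per_user_per_hour = per_user_per_hour
        self.cooldown_seconds = cooldown_seconds
        self._all = deque()                 # timestamps of all analyses
        self._by_user = defaultdict(deque)  # timestamps per user
        self._last_job = {}                 # job URL -> last analysis time

    @staticmethod
    def _prune(stamps, now):
        # Drop timestamps that fell out of the one-hour window.
        while stamps and now - stamps[0] >= 3600:
            stamps.popleft()

    def allow(self, user, job_url, now):
        """Record and permit an analysis at time `now` (seconds), or refuse."""
        self._prune(self._all, now)
        self._prune(self._by_user[user], now)
        last = self._last_job.get(job_url)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        if len(self._all) >= self.per_hour:
            return False
        if len(self._by_user[user]) >= self.per_user_per_hour:
            return False
        self._all.append(now)
        self._by_user[user].append(now)
        self._last_job[job_url] = now
        return True


limiter = TriageLimiter()  # defaults match the values in the config
print(limiter.allow("alice", "job-a", now=0))   # first request for job-a
print(limiter.allow("bob", "job-a", now=10))    # same job inside the cooldown
print(limiter.allow("bob", "job-a", now=45))    # cooldown has elapsed
```

State kept in process memory like this is lost on restart and is not shared between replicas, which may or may not matter for a single-instance bot.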
/hold |
Add Chaibot test failure triage using Chai Bot (ship-help MCP)
Summary
This PR adds Chaibot, an AI-powered Slack workflow that automatically triages test failures posted in designated Slack channels. Unlike the original proposal, this implementation uses the existing Chai Bot service (ship-help MCP) instead of OpenAI, providing richer analysis at zero ongoing cost.
What's Changed from Original PR #80476
Key Difference: Uses Chai Bot (ship-help MCP) instead of OpenAI GPT-4
Overview
Chaibot extends the existing ci-chat-bot service to monitor Slack channels (initially #opp-discussion) for test failure messages, analyze failures using Chai Bot's ship-help MCP service, and post detailed triage analysis in threads.
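To make the detection step concrete, here is a minimal sketch of how a Slack message might be screened, using the `prow_job_patterns`, `failure_keywords`, and `require_job_url` settings from triage-config.yaml. It is written in Python for brevity (the implementation files named in this PR are Go), `looks_like_failure` is a hypothetical name, and the config's "keyword + context" rule is simplified to a plain keyword match because the PR does not define what "context" means.

```python
# Values copied from core-services/ci-chat-bot/triage-config.yaml in this PR.
PROW_JOB_PATTERNS = [
    "https://prow.ci.openshift.org/view/gs/",
    "https://prow.ci.openshift.org/?pr=",
    "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/",
]
FAILURE_KEYWORDS = [
    "test failed", "job failed", "failure", "test timeout",
    "flaky test", "regression", "broken test",
]


def looks_like_failure(text, require_job_url=False):
    """Decide whether a Slack message should trigger triage."""
    has_url = any(pattern in text for pattern in PROW_JOB_PATTERNS)
    has_keyword = any(word in text.lower() for word in FAILURE_KEYWORDS)
    if require_job_url:
        # Assumption: a job URL alone is enough when URLs are required.
        return has_url
    return has_url or has_keyword


print(looks_like_failure("The e2e job failed again on main"))
print(looks_like_failure("The e2e job failed again on main", require_job_url=True))
print(looks_like_failure("Anyone up for lunch?"))
```

With `require_job_url: false`, a broad keyword such as "failure" or "regression" is enough to trigger a response, so a channel with `auto_respond: true` may see replies to messages that are not about CI jobs at all.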
What's Added
Configuration Files
- core-services/ci-chat-bot/triage-config.yaml - Main Chaibot configuration (modified for ship-help MCP)
- clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml - Kubernetes ConfigMap
- clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml - Prometheus alerts
- core-services/ci-secret-bootstrap/chaibot-secret-config.yaml - Secret config for ship-help token
Deployment Changes
- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml - Updated with:
Implementation Code
- pkg/chaibot/analyzer.go - ship-help MCP client for analysis
- cmd/ci-chat-bot/main.go - Integration with ci-chat-bot main loop
Documentation
- docs/chaibot-test-failure-triage.md - Comprehensive user/admin guide (updated for ship-help)
- core-services/ci-chat-bot/CHAIBOT.md - Quick reference
- CHAIBOT_QUICKSTART.md - Quick start guide
- DEPLOY_CHAIBOT.md - Deployment instructions
Features
Why Chai Bot Instead of OpenAI?
1. Proven in Production
The /analyze-failure skill (created by MPEX Integrity team) already uses ship-help MCP for test failure analysis with excellent results.
2. Cost Savings
3. Richer Analysis
Chai Bot has access to more data sources:
4. Privacy & Security
5. Better Integration
Example Output
When a failure is posted in #opp-discussion:
Configuration Required
Before deployment:
- chaibot-configmap.yaml with actual channel ID for #opp-discussion (C04TMLC6DRV)
- chaibot-secret-config.yaml)
Implementation Status
✅ Configuration files: Complete and ready to deploy
✅ Implementation code: Complete - based on proven /analyze-failure skill
✅ Documentation: Updated to reflect ship-help MCP usage
✅ Testing: Proven in production via /analyze-failure skill
Testing Plan
Rollout Plan
Cost Estimate
Related
/analyze-failure skill created by MPEX Integrity team
Migration Path from Original PR
If original PR #80476 is already deployed with OpenAI:
/cc @openshift/test-platform @openshift/crt
Questions? See docs/chaibot-test-failure-triage.md for full documentation.
Summary by CodeRabbit
Overview
This PR adds Chaibot, an AI-powered Slack workflow integrated into the existing ci-chat-bot service to automatically detect and triage test failures discussed in monitored Slack channels (starting with #opp-discussion). It uses the existing Chai Bot / ship-help MCP capability (not a new OpenAI-based integration) and posts structured, actionable analysis back into Slack threads.
What’s Changing (Practical Impact)
Implementation Details
- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml: CHAIBOT_ENABLED=true, --enable-triage=true and --triage-config-path=...
- core-services/ci-chat-bot/triage-config.yaml defines channel monitoring, failure detection rules, triage prompt/template, response structure, and integration wiring.
- clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml publishes that triage config into the ci namespace.
- core-services/ci-secret-bootstrap/chaibot-secret-config.yaml bootstraps a Vault-backed ship-help MCP token into the Kubernetes secret consumed by ci-chat-bot.
- CHAIBOT_QUICKSTART.md provides deployment/verification steps (ConfigMap/secret setup, rollout validation, Slack-thread example output, metrics, troubleshooting, and rollout guidance).
Why Chai Bot (ship-help MCP) Instead of OpenAI
Rollout Plan
#opp-discussion
Security Note