Add Bugsnag error grouping with stable normalized keys #234

Open
morgan-wowk wants to merge 1 commit into bugsnag/orchestrator-integration from bugsnag/error-grouping

@morgan-wowk commented May 9, 2026

Add Bugsnag error grouping with stable normalized keys

Introduces configurable error grouping so structurally identical exceptions collapse into a single group rather than creating a new entry per unique pod name, UUID, or memory address.

How it works

A new TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY env var controls the metadata key name written on each Bugsnag event. When it is unset, the feature is a complete no-op. When a deployment sets it, every notified exception gets a custom[<key>] tab containing a normalized string derived from the exception type and message.
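
As a rough illustration of the behaviour described above (not the code in this PR), the callback logic might look like the following. `FakeEvent`, `simple_key`, and `attach_grouping_key` are hypothetical names; `FakeEvent` is a stand-in modelled on a Bugsnag event's `add_tab` method so the sketch runs without the Bugsnag client installed, and `simple_key` is a placeholder for the real normalizer.

```python
import os
from typing import Callable, Mapping

ENV_VAR = "TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY"


class FakeEvent:
    """Hypothetical stand-in for a Bugsnag event; only what the sketch needs."""

    def __init__(self, exception: BaseException) -> None:
        self.exception = exception
        self.metadata: dict = {}

    def add_tab(self, name: str, values: dict) -> None:
        # Merge values into the named metadata tab, creating it if needed.
        self.metadata.setdefault(name, {}).update(values)


def simple_key(exception: BaseException) -> str:
    # Placeholder normalizer: "<ExceptionType>: <message>" with no stripping.
    return f"{type(exception).__name__}: {exception}"


def attach_grouping_key(
    event,
    environ: Mapping[str, str] = os.environ,
    normalize: Callable[[BaseException], str] = simple_key,
) -> None:
    key_name = environ.get(ENV_VAR)
    if not key_name:
        return  # Env var unset (or empty): leave the event untouched.
    # Write the normalized string under custom[<key>].
    event.add_tab("custom", {key_name: normalize(event.exception)})
```

With the variable unset the event's metadata stays empty; with it set to, say, `grouping_key`, the event gains a `custom` tab holding one entry under that name.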

System errors reported through record_system_error_exception are additionally prefixed with SYSTEM_ERROR: so they can be filtered or grouped separately from non-system errors.
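
A minimal sketch of how such a prefix could be applied, assuming the prefix and the normalized key are joined with a colon and a space as in the commit message ("SYSTEM_ERROR: "). The helper name `with_grouping_prefix` is hypothetical and not taken from the PR.

```python
from typing import Optional


def with_grouping_prefix(normalized: str, grouping_prefix: Optional[str] = None) -> str:
    """Prepend an optional prefix to an already-normalized grouping key."""
    if not grouping_prefix:
        return normalized  # No prefix supplied: key is returned unchanged.
    return f"{grouping_prefix}: {normalized}"
```

For example, passing `grouping_prefix="SYSTEM_ERROR"` turns `OrchestratorError: Unexpected running container status: {object}` into the same string with `SYSTEM_ERROR: ` in front.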

Error taxonomy

The following exception types, from one consumer use case, are normalized to stable grouping keys:

| Group | Normalized key |
| --- | --- |
| k8s pod not found | `kubernetes ApiException (404): NotFound: pods "{pod}" not found` |
| k8s container terminated | `kubernetes ApiException (400): BadRequest: container "main" in pod {pod} is terminated` |
| k8s pod initializing | `kubernetes ApiException (400): BadRequest: container "main" in pod {pod} is waiting to start: PodInitializing` |
| k8s container not available | `kubernetes ApiException (400): BadRequest: container "main" in pod {pod} is not available` |
| k8s webhook timeout | `kubernetes ApiException (500): InternalError: failed calling webhook "<>": context deadline exceeded` |
| UnicodeDecodeError | `UnicodeDecodeError: 'utf-8' codec can't decode byte at position {n}` |
| MaxRetryError | `MaxRetryError: k8s connection pool max retries exceeded (ReadTimeoutError)` |
| OrchestratorError | `OrchestratorError: Unexpected running container status: {object}` |
| Fallback | `ExceptionType: {message with addresses/UUIDs/IDs stripped}` |
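
To illustrate the `{pod}` placeholder used in the Kubernetes groups above, here is a small sketch that works on plain message strings. The helper name `replace_pod_names` and both regular expressions are assumptions made for illustration; the PR's handlers may extract the message from the ApiException differently.

```python
import re

# Quoted form, as in: pods "my-pod-abc123" not found
_QUOTED_POD = re.compile(r'pods "[^"]+"')
# Bare form, as in: container "main" in pod my-pod-abc123 is terminated
_BARE_POD = re.compile(r"in pod \S+")


def replace_pod_names(message: str) -> str:
    """Swap concrete pod names for the stable {pod} placeholder."""
    message = _QUOTED_POD.sub('pods "{pod}"', message)
    return _BARE_POD.sub("in pod {pod}", message)
```

Two messages that differ only in the pod name then map to the same string, which is what lets them share a group.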

Many exception types (e.g. AttributeError, sqlalchemy.exc.OperationalError) already produce stable messages and pass through the fallback unchanged.
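
The dispatch-then-fallback shape can be sketched as follows. This is an illustration under stated assumptions rather than the contents of error_normalization.py: the function name `sketch_normalize_error_message`, the placeholder tokens `{addr}`, `{uuid}`, and `{id}`, and the 16-character threshold for "long" IDs are all hypothetical choices.

```python
import re

_UUID = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
_HEX_ADDR = re.compile(r"\b0x[0-9a-fA-F]+\b")
# 16+ alphanumeric characters containing at least one digit and one letter.
_LONG_ID = re.compile(
    r"\b(?=[0-9a-zA-Z]*\d)(?=[0-9a-zA-Z]*[a-zA-Z])[0-9a-zA-Z]{16,}\b"
)


def _strip_instance_values(message: str) -> str:
    # Generic fallback: replace UUIDs, hex addresses, then long mixed IDs.
    message = _UUID.sub("{uuid}", message)
    message = _HEX_ADDR.sub("{addr}", message)
    return _LONG_ID.sub("{id}", message)


def sketch_normalize_error_message(*, exception: BaseException) -> str:
    type_name = type(exception).__name__
    # Type-specific handler: drop the offending byte and offset entirely.
    if isinstance(exception, UnicodeDecodeError):
        return (
            f"{type_name}: '{exception.encoding}' codec "
            "can't decode byte at position {n}"
        )
    # Everything else goes through the generic stripper.
    return f"{type_name}: {_strip_instance_values(str(exception))}"
```

A message with no instance-specific values, such as a typical AttributeError, comes out of the fallback unchanged apart from the type-name prefix.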

Changes

  • error_normalization.py (new): one public function, normalize_error_message(*, exception), which dispatches to type-specific handlers before falling back to a generic stripper that removes hex addresses, UUIDs, and long alphanumeric IDs
  • bugsnag_instrumentation.py: reads TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY; _before_notify attaches the normalized key when configured; supports an optional grouping_prefix passed through notify(**metadata)
  • orchestrator_sql.py: record_system_error_exception passes grouping_prefix="SYSTEM_ERROR" so system errors are visually distinct
  • test_error_normalization.py (new): 15 unit tests covering all error groups and the fallback path

OSS note

The grouping key name is not hardcoded — it is supplied entirely via TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY at deploy time, so no internal platform names appear in OSS code.


Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@morgan-wowk marked this pull request as ready for review May 9, 2026 01:00
@morgan-wowk requested a review from Ark-kun as a code owner May 9, 2026 01:00
@morgan-wowk force-pushed the bugsnag/error-grouping branch from 0405d8b to 754d49c May 9, 2026 02:06
Introduces error_normalization.py which strips instance-specific values
(pod names, IDs, memory addresses, byte offsets) from exceptions so
structurally identical errors collapse to one group in Bugsnag.

TANGLE_BUGSNAG_CUSTOM_GROUPING_KEY controls the metadata key name — no-op
when unset, allowing Shopify deployments to set it without touching OSS code.
System errors reported via record_system_error_exception are prefixed with
"SYSTEM_ERROR: " for easy filtering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@morgan-wowk force-pushed the bugsnag/error-grouping branch from 754d49c to 148b0ab May 9, 2026 02:27