Skip to content

[Cosmos] feat: Hedging Detection API (accessors on response/error types)#46901

Draft
NaluTripician wants to merge 4 commits into
Azure:mainfrom
NaluTripician:feature/cosmos-hedging-detection-api
Draft

[Cosmos] feat: Hedging Detection API (accessors on response/error types)#46901
NaluTripician wants to merge 4 commits into
Azure:mainfrom
NaluTripician:feature/cosmos-hedging-detection-api

Conversation

@NaluTripician
Copy link
Copy Markdown
Contributor

Cosmos: Hedging Detection API (Python, Option B)

Closes #46899

Summary

Adds the cross-SDK Hedging Detection API to azure-cosmos (Python),
shipping Option B from the public spec: three accessor methods —
is_hedging_started(), get_requested_regions(), get_responded_regions()
— added directly to the four wrapper types and the three exception types
that customers receive from data-plane operations, backed by a shared
private _HedgingDetectionState.

response = container.read_item(item="abc", partition_key="pk1",
                               availability_strategy={"threshold_ms": 100})
if response.is_hedging_started():
    print("dispatched to:", [r.region_name for r in response.get_requested_regions()])
    print("responded from:", response.get_responded_regions())

Public surface (new)

Symbol Module
RequestedRegion (frozen dataclass: region_name, reason) azure.cosmos
RequestedRegionReason (enum, with _missing_ → UNKNOWN) azure.cosmos
is_hedging_started() -> bool added to 7 types
get_requested_regions() -> tuple[RequestedRegion, ...] added to 7 types
get_responded_regions() -> tuple[str, ...] added to 7 types

7 carrier types: CosmosDict, CosmosList, CosmosItemPaged,
CosmosAsyncItemPaged, CosmosHttpResponseError, CosmosBatchOperationError,
CosmosClientTimeoutError.

Design highlights

  • Closure-argument state propagation (SE-002 / AC8). The hedging handler
    deep-copies request_params at the worker boundary (line 96 of
    _availability_strategy_handler.py). Storing state on request_params
    would silently swallow child appends. State is therefore threaded through
    the handler as an explicit hedging_state= closure argument, into the
    retry utility via kwargs["_hedging_state"], and back to the wrapper
    construction site by stashing on the response-headers dict under a
    private sentinel key that each wrapper pops in its __init__. Internal
    flag is also popped in _Request before reaching the HTTP pipeline.
  • HEDGING recorded "dispatched, not necessarily wire-issued" (AC10). The
    hedging-handler append happens AFTER the threshold sleep AND AFTER the
    cancellation check, so a hedge arm that loses the race before its delay
    elapses produces NO phantom entry.
  • Forward compat (SE-016).
    RequestedRegionReason._missing_(cls, raw) returns cls.UNKNOWN for any
    unrecognized payload — calling RequestedRegionReason("future_v42")
    never raises.
  • Thread / coroutine safety (SE-017). _HedgingDetectionState uses a
    threading.Lock to guard the compound (append + flag set) update;
    snapshots are returned as immutable tuples.
  • TRANSPORT_RETRY / CIRCUIT_BREAKER_PROBE are RESERVED in v1 (SE-005).
    Enum values are exported (forward compat) but a repo-wide grep test
    asserts they are never emitted by production code.

Acceptance-criteria matrix

AC What Test
AC1 Single-region success → is_hedging_started()==False test_record_initial_does_not_set_hedging_flag
AC2 Primary wins under threshold → no phantom HEDGING entry test_cancelled_before_delay_records_nothing (sync + async)
AC3 Hedge arm wins → is_hedging_started()==True, HEDGING entry present test_record_hedging_sets_flag (sync + async)
AC4 Same-region retry → OPERATION_RETRY, not HEDGING test_record_operation_retry_does_not_set_flag
AC5 Error path exposes accessors test_cosmos_http_response_error_forwards
AC6 RequestedRegion frozen / hashable / picklable test_diagnostics_types.py (9 tests)
AC7 All 7 types carry the 3 accessors wrapper-accessors suite (sync + async)
AC8 Closure-arg pattern survives copy.deepcopy(request_params) test_handler_dispatches_state_through_closure_not_params (sync + async)
AC9 Sync↔async parity guaranteed at test time TestSyncAsyncParity.{test_parity_class_coverage,test_parity_method_coverage}
AC10 Cancelled-before-delay hedge arm records nothing test_cancelled_before_delay_records_nothing (sync + async)
AC11 Duplicate responses preserved test_responded_regions_preserves_duplicates
AC12 TRANSPORT_RETRY / CIRCUIT_BREAKER_PROBE never emitted in v1 test_reserved_reasons_absent_from_production_code (repo-wide grep)
AC13 Live multi-region smoke tests/livecanary/test_hedging_detection_live[_async].py (opt-in via env vars)

Test execution

==== TOTAL: 53 passed, 0 failed ====
test_diagnostics_types.py        9 passed
test_hedging_detection.py        21 passed
test_hedging_detection_async.py  23 passed

Live canary suite is opt-in (skipped unless COSMOS_MULTI_REGION_ENDPOINT
and COSMOS_MULTI_REGION_KEY env vars are set).

Files changed

Production (12)

  • azure/cosmos/_diagnostics_types.py (NEW)
  • azure/cosmos/_diagnostics.py (NEW)
  • azure/cosmos/__init__.py
  • azure/cosmos/_cosmos_responses.py
  • azure/cosmos/exceptions.py
  • azure/cosmos/_availability_strategy_handler.py
  • azure/cosmos/aio/_asynchronous_availability_strategy_handler.py
  • azure/cosmos/_synchronized_request.py
  • azure/cosmos/aio/_asynchronous_request.py
  • azure/cosmos/_retry_utility.py
  • azure/cosmos/aio/_retry_utility_async.py
  • CHANGELOG.md

Tests / samples (7)

  • tests/test_diagnostics_types.py (NEW)
  • tests/test_hedging_detection.py (NEW)
  • tests/test_hedging_detection_async.py (NEW)
  • tests/livecanary/__init__.py (NEW)
  • tests/livecanary/test_hedging_detection_live.py (NEW)
  • tests/livecanary/test_hedging_detection_live_async.py (NEW)
  • samples/hedging_detection.py + samples/hedging_detection_async.py (NEW)

Why no .pyi?

azure-cosmos doesn't ship a __init__.pyi; only py.typed is present.
Inline annotations on the new public types provide equivalent type-checker
coverage. Spec S2 was a SHOULD, not a MUST.

Notes for reviewers

  • This is the Python half of a cross-SDK API (see also pending .NET/Java
    PRs).
  • All new public symbols are exported through azure/cosmos/__init__.py.
  • The implementation is deliberately decoupled from logging — the public
    spec §7 forbids log-string scraping as a customer contract.

NaluTripician and others added 2 commits May 14, 2026 13:42
…atch wiring

Implements Phase 2 Priorities 1–3 of the cross-SDK hedging detection API
work item (azure-sdk-for-python#46899). Adds three accessor methods —
`is_hedging_started()`, `get_requested_regions()`, `get_responded_regions()` —
to the five wrapper / exception types (`CosmosDict`, `CosmosList`,
`CosmosItemPaged`, `CosmosAsyncItemPaged`, `CosmosHttpResponseError`,
`CosmosBatchOperationError`, `CosmosClientTimeoutError`), backed by a
shared private `_HedgingDetectionState`.

New public types:
* `azure.cosmos.RequestedRegion` (frozen slots dataclass)
* `azure.cosmos.RequestedRegionReason` (non-exhaustive Enum with
  `_missing_` → `UNKNOWN` for forward compatibility — SE-016)

Dispatch-site instrumentation:
* INITIAL recorded inside the hedging handler arm body for index 0
  (sync + async); INITIAL recorded at SynchronizedRequest entry for the
  non-hedged path.
* HEDGING recorded inside the hedge-arm body AFTER the threshold delay
  and AFTER the cancellation check — pre-delay cancellation produces no
  phantom entry (AC10 / spec §12 ''no phantom entries'').
* OPERATION_RETRY / REGION_FAILOVER recorded in
  `_retry_utility(_async).Execute` when the retry policy returns True;
  attached to exceptions before re-raise so error-path consumers see the
  full timeline.

Closure-argument pattern (SE-002): the state flows through
`execute_with_hedging` as an explicit `hedging_state` parameter (NOT on
`request_params` — the deepcopy at line 96 of the handler would silently
swallow child appends). Verified by AC8 regression test (to follow).

Sync↔async parity (SE-004 / M11): every sync code site has a paired async
modification.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds:
* CHANGELOG entry under 4.16.0b3 (Features Added).
* samples/hedging_detection.py + _async.py - customer-facing usage of the
  three new accessors on a CosmosDict response and a CosmosHttpResponseError.
* tests/test_diagnostics_types.py - 9 unit tests covering AC6
  (RequestedRegion frozen/equality/hash/pickle + RequestedRegionReason
  _missing_ -> UNKNOWN forward-compat per SE-016).
* tests/test_hedging_detection.py - 21 unit tests covering AC1, AC3, AC4,
  AC5, AC8 (closure-arg deepcopy regression), AC10 (no phantom entries on
  cancelled hedge arm), AC11 (duplicate responses allowed), AC12 (reserved
  reasons never emitted in production code - repo-wide grep), plus headers
  invariant + threading.Lock concurrent-append stress test.
* tests/test_hedging_detection_async.py - 23 async-twin tests including the
  sync<->async parity checker (AC9) that fails CI if any sync test gains a
  method without an _async counterpart.
* tests/livecanary/ - opt-in (env-var gated) multi-region smoke tests
  covering AC13a (hedged read records HEDGING entry) and AC13b
  (all-regions-fail surfaces dispatch metadata on CosmosHttpResponseError).

All 53 sync+async unit tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician NaluTripician changed the title [Cosmos] feat: Hedging Detection API (Option B - accessors on response/error types) [Cosmos] feat: Hedging Detection API (accessors on response/error types) May 14, 2026
NaluTripician and others added 2 commits May 14, 2026 15:49
The parity-coverage test used cryptic short names `scls` and `acls`
for `sync_cls` and `async_cls`. `scls` is flagged as a spelling
error by cspell on the build-analyze CI job. Rename to the spelled-out
form which is both more readable and cspell-clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address the 20 pylint findings flagged by the Build_Analyze CI job on the
hedging-detection PR:

* Remove no-op `try: ... except Exception: raise` blocks in
  `CrossRegionHedgingHandler.execute_single_request_with_delay` (sync +
  async). The bare reraise added nothing and triggered W0706
  try-except-raise. The post-call `_record_response` logic still only
  runs on the success path, since an in-flight exception now propagates
  directly through the function.
* Remove the unused `RequestedRegionReason` import from
  `aio/_retry_utility_async.py`. The async retry utility delegates
  retry-diagnostics recording to `_record_retry_diagnostics_and_attach`
  in the sync module, which owns the import.
* Add full Sphinx-style docstrings (`:param`/`:type`/`:returns`/
  `:rtype`) to every new hedging-detection internal flagged for missing
  doc elements:
  `CosmosItemPaged._copy_headers_stripped` (+ its async twin so they
  stay symmetric), `_HedgingDetectionState._record_request`,
  `_HedgingDetectionState._record_response`,
  `_HedgingDetectionAccessorsMixin.is_hedging_started`,
  `_HedgingDetectionAccessorsMixin.get_requested_regions`,
  `_HedgingDetectionAccessorsMixin.get_responded_regions`,
  `_attach_state_to_headers`, `_pop_state_from_headers`,
  `_record_retry_diagnostics_and_attach`, and `_resolve_region_name`.

No public API surface change; only docstrings and the deletion of two
inert try/except blocks. All 44 hedging-detection unit tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[cosmos] Hedging Detection API — public accessors on response wrappers and exception types

1 participant