[Cosmos] feat: Hedging Detection API (accessors on response/error types)#46901
Draft
NaluTripician wants to merge 4 commits into
Draft
[Cosmos] feat: Hedging Detection API (accessors on response/error types)#46901NaluTripician wants to merge 4 commits into
NaluTripician wants to merge 4 commits into
Conversation
…atch wiring Implements Phase 2 Priorities 1–3 of the cross-SDK hedging detection API work item (azure-sdk-for-python#46899). Adds three accessor methods — `is_hedging_started()`, `get_requested_regions()`, `get_responded_regions()` — to the five wrapper / exception types (`CosmosDict`, `CosmosList`, `CosmosItemPaged`, `CosmosAsyncItemPaged`, `CosmosHttpResponseError`, `CosmosBatchOperationError`, `CosmosClientTimeoutError`), backed by a shared private `_HedgingDetectionState`. New public types: * `azure.cosmos.RequestedRegion` (frozen slots dataclass) * `azure.cosmos.RequestedRegionReason` (non-exhaustive Enum with `_missing_` → `UNKNOWN` for forward compatibility — SE-016) Dispatch-site instrumentation: * INITIAL recorded inside the hedging handler arm body for index 0 (sync + async); INITIAL recorded at SynchronizedRequest entry for the non-hedged path. * HEDGING recorded inside the hedge-arm body AFTER the threshold delay and AFTER the cancellation check — pre-delay cancellation produces no phantom entry (AC10 / spec §12 ''no phantom entries''). * OPERATION_RETRY / REGION_FAILOVER recorded in `_retry_utility(_async).Execute` when the retry policy returns True; attached to exceptions before re-raise so error-path consumers see the full timeline. Closure-argument pattern (SE-002): the state flows through `execute_with_hedging` as an explicit `hedging_state` parameter (NOT on `request_params` — the deepcopy at line 96 of the handler would silently swallow child appends). Verified by AC8 regression test (to follow). Sync↔async parity (SE-004 / M11): every sync code site has a paired async modification. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds: * CHANGELOG entry under 4.16.0b3 (Features Added). * samples/hedging_detection.py + _async.py - customer-facing usage of the three new accessors on a CosmosDict response and a CosmosHttpResponseError. * tests/test_diagnostics_types.py - 9 unit tests covering AC6 (RequestedRegion frozen/equality/hash/pickle + RequestedRegionReason _missing_ -> UNKNOWN forward-compat per SE-016). * tests/test_hedging_detection.py - 21 unit tests covering AC1, AC3, AC4, AC5, AC8 (closure-arg deepcopy regression), AC10 (no phantom entries on cancelled hedge arm), AC11 (duplicate responses allowed), AC12 (reserved reasons never emitted in production code - repo-wide grep), plus headers invariant + threading.Lock concurrent-append stress test. * tests/test_hedging_detection_async.py - 23 async-twin tests including the sync<->async parity checker (AC9) that fails CI if any sync test gains a method without an _async counterpart. * tests/livecanary/ - opt-in (env-var gated) multi-region smoke tests covering AC13a (hedged read records HEDGING entry) and AC13b (all-regions-fail surfaces dispatch metadata on CosmosHttpResponseError). All 53 sync+async unit tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The parity-coverage test used cryptic short names `scls` and `acls` for `sync_cls` and `async_cls`. `scls` is flagged as a spelling error by cspell on the build-analyze CI job. Rename to the spelled-out form which is both more readable and cspell-clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address the 20 pylint findings flagged by the Build_Analyze CI job on the hedging-detection PR: * Remove no-op `try: ... except Exception: raise` blocks in `CrossRegionHedgingHandler.execute_single_request_with_delay` (sync + async). The bare reraise added nothing and triggered W0706 try-except-raise. The post-call `_record_response` logic still only runs on the success path, since an in-flight exception now propagates directly through the function. * Remove the unused `RequestedRegionReason` import from `aio/_retry_utility_async.py`. The async retry utility delegates retry-diagnostics recording to `_record_retry_diagnostics_and_attach` in the sync module, which owns the import. * Add full Sphinx-style docstrings (`:param`/`:type`/`:returns`/ `:rtype`) to every new hedging-detection internal flagged for missing doc elements: `CosmosItemPaged._copy_headers_stripped` (+ its async twin so they stay symmetric), `_HedgingDetectionState._record_request`, `_HedgingDetectionState._record_response`, `_HedgingDetectionAccessorsMixin.is_hedging_started`, `_HedgingDetectionAccessorsMixin.get_requested_regions`, `_HedgingDetectionAccessorsMixin.get_responded_regions`, `_attach_state_to_headers`, `_pop_state_from_headers`, `_record_retry_diagnostics_and_attach`, and `_resolve_region_name`. No public API surface change; only docstrings and the deletion of two inert try/except blocks. All 44 hedging-detection unit tests still pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cosmos: Hedging Detection API (Python, Option B)
Closes #46899
Summary
Adds the cross-SDK Hedging Detection API to
azure-cosmos(Python),shipping Option B from the public spec: three accessor methods —
is_hedging_started(),get_requested_regions(),get_responded_regions()— added directly to the four wrapper types and the three exception types
that customers receive from data-plane operations, backed by a shared
private
_HedgingDetectionState.Public surface (new)
RequestedRegion(frozen dataclass:region_name,reason)azure.cosmosRequestedRegionReason(enum, with_missing_ → UNKNOWN)azure.cosmosis_hedging_started() -> boolget_requested_regions() -> tuple[RequestedRegion, ...]get_responded_regions() -> tuple[str, ...]7 carrier types:
CosmosDict,CosmosList,CosmosItemPaged,CosmosAsyncItemPaged,CosmosHttpResponseError,CosmosBatchOperationError,CosmosClientTimeoutError.Design highlights
deep-copies
request_paramsat the worker boundary (line 96 of_availability_strategy_handler.py). Storing state onrequest_paramswould silently swallow child appends. State is therefore threaded through
the handler as an explicit
hedging_state=closure argument, into theretry utility via
kwargs["_hedging_state"], and back to the wrapperconstruction site by stashing on the response-headers dict under a
private sentinel key that each wrapper pops in its
__init__. Internalflag is also popped in
_Requestbefore reaching the HTTP pipeline.hedging-handler append happens AFTER the threshold sleep AND AFTER the
cancellation check, so a hedge arm that loses the race before its delay
elapses produces NO phantom entry.
RequestedRegionReason._missing_(cls, raw)returnscls.UNKNOWNfor anyunrecognized payload — calling
RequestedRegionReason("future_v42")never raises.
_HedgingDetectionStateuses athreading.Lockto guard the compound (append + flag set) update;snapshots are returned as immutable tuples.
Enum values are exported (forward compat) but a repo-wide grep test
asserts they are never emitted by production code.
Acceptance-criteria matrix
is_hedging_started()==Falsetest_record_initial_does_not_set_hedging_flagtest_cancelled_before_delay_records_nothing(sync + async)is_hedging_started()==True, HEDGING entry presenttest_record_hedging_sets_flag(sync + async)test_record_operation_retry_does_not_set_flagtest_cosmos_http_response_error_forwardsRequestedRegionfrozen / hashable / picklabletest_diagnostics_types.py(9 tests)copy.deepcopy(request_params)test_handler_dispatches_state_through_closure_not_params(sync + async)TestSyncAsyncParity.{test_parity_class_coverage,test_parity_method_coverage}test_cancelled_before_delay_records_nothing(sync + async)test_responded_regions_preserves_duplicatesTRANSPORT_RETRY/CIRCUIT_BREAKER_PROBEnever emitted in v1test_reserved_reasons_absent_from_production_code(repo-wide grep)tests/livecanary/test_hedging_detection_live[_async].py(opt-in via env vars)Test execution
Live canary suite is opt-in (skipped unless
COSMOS_MULTI_REGION_ENDPOINTand
COSMOS_MULTI_REGION_KEYenv vars are set).Files changed
Production (12)
azure/cosmos/_diagnostics_types.py(NEW)azure/cosmos/_diagnostics.py(NEW)azure/cosmos/__init__.pyazure/cosmos/_cosmos_responses.pyazure/cosmos/exceptions.pyazure/cosmos/_availability_strategy_handler.pyazure/cosmos/aio/_asynchronous_availability_strategy_handler.pyazure/cosmos/_synchronized_request.pyazure/cosmos/aio/_asynchronous_request.pyazure/cosmos/_retry_utility.pyazure/cosmos/aio/_retry_utility_async.pyCHANGELOG.mdTests / samples (7)
tests/test_diagnostics_types.py(NEW)tests/test_hedging_detection.py(NEW)tests/test_hedging_detection_async.py(NEW)tests/livecanary/__init__.py(NEW)tests/livecanary/test_hedging_detection_live.py(NEW)tests/livecanary/test_hedging_detection_live_async.py(NEW)samples/hedging_detection.py+samples/hedging_detection_async.py(NEW)Why no
.pyi?azure-cosmosdoesn't ship a__init__.pyi; onlypy.typedis present.Inline annotations on the new public types provide equivalent type-checker
coverage. Spec S2 was a SHOULD, not a MUST.
Notes for reviewers
PRs).
azure/cosmos/__init__.py.spec §7 forbids log-string scraping as a customer contract.