Skip to content

fix: classify WAL restore errors with precise gRPC statuses#927

Open
leonardoce wants to merge 4 commits into
cloudnative-pg:mainfrom
leonardoce:transient-status
Open

fix: classify WAL restore errors with precise gRPC statuses#927
leonardoce wants to merge 4 commits into
cloudnative-pg:mainfrom
leonardoce:transient-status

Conversation

@leonardoce
Copy link
Copy Markdown
Contributor

Map each restorer sentinel to the gRPC status that best reflects whether the operator should retry:

  • ErrWALNotFound -> NotFound (terminal)
  • ErrInvalidWalName -> InvalidArgument (terminal)
  • ErrConnectivity -> Unavailable (retry)
  • ErrGeneric -> Unavailable (retry)
  • anything else -> Internal (terminal)

Previously every non-NotFound failure was returned verbatim, leaving the operator unable to tell transient blips apart from terminal conditions like a malformed WAL name.

@dosubot dosubot Bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels May 27, 2026
@leonardoce leonardoce requested a review from a team as a code owner May 27, 2026 09:31
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels May 27, 2026
@leonardoce
Copy link
Copy Markdown
Contributor Author

This PR needs cloudnative-pg/barman-cloud#253

leonardoce and others added 3 commits May 28, 2026 15:54
Map each restorer sentinel to the gRPC status that best reflects whether
the operator should retry:

  - ErrWALNotFound      -> NotFound        (terminal)
  - ErrInvalidWalName   -> InvalidArgument (terminal)
  - ErrConnectivity     -> Unavailable     (retry)
  - ErrGeneric          -> Unavailable     (retry)
  - anything else       -> Internal        (terminal)

Previously every non-NotFound failure was returned verbatim, leaving the
operator unable to tell transient blips apart from terminal conditions
like a malformed WAL name.

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
The per-case and helper comments asserted what the operator will do
with each gRPC code ("the operator stops retrying", "the operator
will retry the request", etc.). On the current cloudnative-pg main,
the operator only differentiates ErrWALNotFound; the precise-code
distinctions become meaningful only after the operator-side retry
work lands.

Rewrite the comments to describe what the code emits and why (per
gRPC convention), so they stay accurate regardless of the operator
version on the other end. Also note that barman returns ErrGeneric
(exit 4) for some retryable conditions too — not only ErrConnectivity
(exit 2) — which justifies why both map to codes.Unavailable.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Extract the barman-restorer error -> gRPC code switch from
restoreFromBarmanObjectStore into a pure helper,
classifyWALRestoreError, so it can be exercised in isolation
without the surrounding k8s client / configuration scaffolding.

Add ginkgo specs that check:

  - each barman sentinel maps to the expected gRPC status code
    (ErrConnectivity and ErrGeneric both -> Unavailable, since
    barman uses exit 4 for some retryable conditions too),
  - an unclassified error falls through to codes.Internal,
  - classification still works through multiple fmt.Errorf wraps,
  - the switch matches by errors.Is identity rather than message
    substring (so a NotFound whose message happens to mention
    "connectivity" still maps to NotFound).

internal/cnpgi/common had no tests before; this introduces the
suite scaffolding alongside the new specs.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
@armru armru force-pushed the transient-status branch from 40f7ce2 to 7c63be1 Compare May 28, 2026 14:43
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 28, 2026
…s addition

The replace directive in go.mod pulls in a barman-cloud version that
exposes the new restoreAdditionalCommandArgs field on
BarmanObjectStoreConfiguration. The plugin's CRD embeds that struct,
so controller-gen + kustomize must be rerun to surface the field in
the published CRD and bundled manifest. CI's uncommitted-changes
guard caught the gap.

Regenerated with controller-gen v0.21.0 (matching the daggerverse
module version) and kustomize v5.5.0; only the new field is added,
no annotation drift.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
@gbartolini
Copy link
Copy Markdown
Contributor

I have created this ticket in Barman Cloud: EnterpriseDB/barman#1193

@gbartolini
Copy link
Copy Markdown
Contributor

I have created this ticket in Barman Cloud: EnterpriseDB/barman#1193

I have proposed this PR.

@gbartolini
Copy link
Copy Markdown
Contributor

This PR needs cloudnative-pg/barman-cloud#253

Merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants