docs: add OpenSpec change for downscale improvements#253

Open
dobrerazvan wants to merge 4 commits into master from docs/downscale-improvements-spec
@dobrerazvan dobrerazvan commented May 13, 2026

Adds the design artifacts for the Kafka cluster downscale improvement initiative, covering two related fixes:

  1. Batched broker removal - submit all brokers removed in a single manifest apply as one Cruise Control (CC) remove_broker operation instead of N separate operations, eliminating redundant partition movements.

  2. Draining broker listener retention - keep brokers that were removed from the spec in the envoy/istio/contour config until CruiseControl finishes draining them (GracefulDownscaleSucceeded), preserving client connectivity.

Artifacts: proposal, design, capability specs, and implementation tasks.

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

dobrerazvan and others added 3 commits June 22, 2026 14:19
Adds the design artifacts for the kafka cluster downscale improvement
initiative, covering two related fixes:

1. Batched broker removal - submit all brokers removed in a single
   manifest apply as one CC remove_broker operation instead of N
   separate operations, eliminating redundant partition movements.

2. Draining broker listener retention - keep brokers removed from
   spec in envoy/istio/contour config until CruiseControl completes
   draining (GracefulDownscaleSucceeded), preserving client connectivity.

Artifacts: proposal, design, capability specs, and implementation tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…Control completes

Brokers removed from the KafkaCluster spec were immediately excluded
from envoy, istio, and contour config, even while CruiseControl was
still draining them. Clients lost connectivity to brokers that still
held partition leaders.

Root cause: ShouldIncludeBroker() returned false when brokerConfig==nil
(broker not in spec). Add a fallback path: when brokerConfig is nil,
check the broker's CruiseControlState in status. If the state is an
active downscale (IsDownscale && !IsSucceeded) and the broker was
previously bound to the requested ingressConfig, keep it in the
external listener resources.

Brokers stuck in CompletedWithError or Paused are also retained,
allowing manual investigation while keeping client connectivity.
ShouldIncludeBroker is the single gatekeeper for all external listener
reconcilers (envoy, istio, contour), so no other files need changes.
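
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the `BrokerConfig` and `BrokerStatus` types, the `BoundIngress` field, the lowercase `shouldIncludeBroker` signature, and the prefix/suffix-based `IsDownscale`/`IsSucceeded` helpers are all simplified assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// CCState is a simplified, hypothetical stand-in for the broker's
// CruiseControlState recorded in the KafkaCluster status.
type CCState string

const (
	DownscaleRunning   CCState = "GracefulDownscaleRunning"
	DownscaleSucceeded CCState = "GracefulDownscaleSucceeded"
	DownscalePaused    CCState = "GracefulDownscalePaused"
)

// IsDownscale and IsSucceeded are assumed helpers; here they only
// inspect the state name.
func (s CCState) IsDownscale() bool { return strings.HasPrefix(string(s), "GracefulDownscale") }
func (s CCState) IsSucceeded() bool { return strings.HasSuffix(string(s), "Succeeded") }

// BrokerConfig is a hypothetical view of a broker entry in the spec.
type BrokerConfig struct {
	IngressConfigs []string
}

// BrokerStatus is a hypothetical view of a broker entry in the status.
type BrokerStatus struct {
	State        CCState
	BoundIngress string // ingress config the broker was previously bound to
}

// shouldIncludeBroker decides whether a broker stays in the external
// listener resources for the given ingress config.
func shouldIncludeBroker(cfg *BrokerConfig, status *BrokerStatus, ingress string) bool {
	if cfg != nil {
		// Broker is still in the spec: the usual binding check applies.
		for _, name := range cfg.IngressConfigs {
			if name == ingress {
				return true
			}
		}
		return false
	}
	// Fallback: broker is gone from the spec. Keep it only while its
	// downscale has not succeeded and it was bound to this ingress.
	if status == nil {
		return false
	}
	draining := status.State.IsDownscale() && !status.State.IsSucceeded()
	return draining && status.BoundIngress == ingress
}

func main() {
	draining := &BrokerStatus{State: DownscaleRunning, BoundIngress: "public"}
	done := &BrokerStatus{State: DownscaleSucceeded, BoundIngress: "public"}
	fmt.Println(shouldIncludeBroker(nil, draining, "public")) // true
	fmt.Println(shouldIncludeBroker(nil, done, "public"))     // false
}
```

Under these assumptions a paused broker is retained as well, because its state is a downscale state that has not succeeded.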

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When removing multiple brokers from a KafkaCluster spec, the operator
previously created one CruiseControlOperation per broker. Because CC
can only run one operation at a time, each removal could shuffle data
onto brokers that were themselves about to be decommissioned, causing
far more partition movements than necessary.

Collect all broker IDs in GracefulDownscaleRequired state and submit
them as a single remove_broker CC operation, mirroring the existing
addBrokers pattern. Because the KafkaCluster reconciler sets all
removed brokers to Required via a single atomic status patch, all
brokers from a single manifest apply are guaranteed to land in the
same batch.

Includes unit test (createCCOperation multi-broker params), integration
test (exactly one CCOperation created for two simultaneous removals),
e2e test scaffold, and 5-broker sample manifest.
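
As a rough illustration of the batching idea, the sketch below collects every broker in GracefulDownscaleRequired state and builds the parameters for one remove_broker request. The function names and the map-based status are hypothetical and do not reflect the operator's real types. The comma-separated `brokerid` value follows Cruise Control's remove_broker endpoint as I understand it, so treat that detail as an assumption too.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

const downscaleRequired = "GracefulDownscaleRequired"

// collectBrokersToRemove returns the sorted IDs of all brokers whose
// state is GracefulDownscaleRequired. The map is a hypothetical,
// flattened view of the per-broker status.
func collectBrokersToRemove(states map[int32]string) []int32 {
	ids := make([]int32, 0, len(states))
	for id, state := range states {
		if state == downscaleRequired {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

// removeBrokerParams builds the parameters for a single remove_broker
// operation covering every collected broker.
func removeBrokerParams(ids []int32) map[string]string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(int(id))
	}
	return map[string]string{"brokerid": strings.Join(parts, ",")}
}

func main() {
	states := map[int32]string{
		0: "",
		1: "GracefulUpscaleSucceeded",
		3: downscaleRequired,
		4: downscaleRequired,
	}
	params := removeBrokerParams(collectBrokersToRemove(states))
	fmt.Println(params["brokerid"]) // prints "3,4"
}
```

Sorting the IDs is only there to keep the request deterministic, which makes the multi-broker parameters easier to assert on in a unit test.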

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>