feat: support graceful scale-down for AlluxioRuntime using AdvancedStatefulSet by jakharmonika364 · Pull Request #5805 · fluid-cloudnative/fluid

jakharmonika364 · 2026-04-23T19:38:45Z

Ⅰ. Describe what this PR does

This PR implements a graceful decommissioning workflow for Alluxio workers during scale-in. It adds a new AdvancedStatefulSet feature gate that allows Fluid to leverage OpenKruise capabilities for finer pod lifecycle management. When enabled, workers are decommissioned and their cached data is migrated before the pods are terminated, ensuring cluster stability and data availability during scaling operations.

Ⅱ. Does this pull request fix one issue?

fixes #4193

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

pkg/ddc/alluxio/operations/decommission_test.go: Unit tests for Alluxio decommissioning commands and active worker count parsing.
Integration verified via SyncReplicas logic to ensure scale-down waits for successful decommissioning.

Ⅳ. Describe how to verify it

Enable AdvancedStatefulSet=true in the feature gates.
Deploy an AlluxioRuntime and scale down the replicas.
Verify that the targeted worker pods are decommissioned from the Alluxio master before the StatefulSet actually deletes the pods.

Ⅴ. Special notes for reviews

The feature is currently in Alpha and disabled by default. It provides the necessary infrastructure to support selective pod deletion in later phases.

fluid-e2e-bot · 2026-04-23T19:38:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign trafalgarzzz for approval by writing /assign @trafalgarzzz in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fluid-e2e-bot · 2026-04-23T19:39:04Z

Hi @jakharmonika364. Thanks for your PR.

I'm waiting for a fluid-cloudnative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gemini-code-assist

Code Review

This pull request introduces graceful worker scale-down for Alluxio by decommissioning workers before they are terminated by the StatefulSet controller, ensuring cached blocks can be migrated. This functionality is gated by a new AdvancedStatefulSet feature. The changes include new decommissioning operations in AlluxioFileUtils, integration into the SyncReplicas reconciliation loop, and comprehensive unit tests. Review feedback suggests improving context propagation by replacing context.TODO(), optimizing efficiency by passing existing runtime objects to avoid redundant API lookups, and refining error handling during the draining phase to prevent log noise.

codecov · 2026-04-23T19:45:02Z

Codecov Report

❌ Patch coverage is 81.73077% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.91%. Comparing base (82e490e) to head (215e960).
⚠️ Report is 108 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/ddc/alluxio/replicas.go	75.36%	13 Missing and 4 partials ⚠️
pkg/features/features.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5805      +/-   ##
==========================================
+ Coverage   58.17%   64.91%   +6.73%     
==========================================
  Files         478      482       +4     
  Lines       32485    33616    +1131     
==========================================
+ Hits        18899    21821    +2922     
+ Misses      12042    10066    -1976     
- Partials     1544     1729     +185

sonarqubecloud · 2026-04-26T20:54:59Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

cheyang

Review: Graceful Scale-Down for AlluxioRuntime via AdvancedStatefulSet

Overall this is a well-structured feature. The architecture (feature gate + decommission operations + reconciler integration) follows Kubernetes and Fluid patterns correctly. The decommission-before-terminate approach is the right way to prevent data loss during scale-in. A few items remain before this is merge-ready.

Architecture & Design

The layering is sound: pkg/features for the gate, pkg/ddc/alluxio/operations for the CLI wrapper, and the reconciler hook in SyncReplicas. Keeping the gate at Alpha/disabled-by-default is appropriate for a first iteration.

One design concern: the PR title references "AdvancedStatefulSet" (OpenKruise) for selective pod deletion, but the current implementation only does highest-ordinal-first decommission — it does not yet interact with OpenKruise APIs. The feature gate name suggests broader scope than what is delivered. Consider documenting (or renaming) this clearly so users understand the gate currently enables graceful scale-down without the selective-pod-deletion piece.
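To make the highest-ordinal-first behavior concrete, here is a minimal Go sketch of which worker pods a plain StatefulSet scale-down would remove, and in what order. It relies only on standard StatefulSet naming (`<name>-<ordinal>`, with ordinals assumed to start at 0). The `podsToDrain` helper and the `demo-worker` name are hypothetical and are not taken from this PR.

```go
package main

import "fmt"

// podsToDrain returns the pod names that a StatefulSet scale-down from
// current to desired replicas would remove, highest ordinal first.
// Hypothetical helper for illustration; assumes ordinals start at 0.
func podsToDrain(stsName string, current, desired int32) []string {
	if desired < 0 {
		desired = 0
	}
	var names []string
	for ordinal := current - 1; ordinal >= desired; ordinal-- {
		names = append(names, fmt.Sprintf("%s-%d", stsName, ordinal))
	}
	return names
}

func main() {
	// Scaling a hypothetical "demo-worker" StatefulSet from 5 to 3 replicas.
	fmt.Println(podsToDrain("demo-worker", 5, 3)) // prints [demo-worker-4 demo-worker-3]
}
```

Selective pod deletion would presumably replace this fixed ordering with an explicit list of pods to remove, which is the part the review notes is not yet delivered.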

Correctness & Safety

context.TODO() in drainScalingDownWorkers (pkg/ddc/alluxio/replicas.go:142)
The reconciler already carries a cruntime.ReconcileRequestContext. The context.TODO() in e.Client.Get(...) should be replaced with a proper context propagated from SyncReplicas. If the reconciler is cancelled (e.g., leader election loss), this call will not observe the cancellation and can block until the Kubernetes client timeout expires.
Redundant getRuntime() call in getWorkerRPCPort() (pkg/ddc/alluxio/replicas.go:174)
SyncReplicas already fetches the runtime object. Passing it as a parameter (or caching it on the engine) would avoid an extra API round-trip on every reconcile during scale-in.
Requeue semantics for "not yet drained" (pkg/ddc/alluxio/replicas.go:107–109)
Returning fmt.Errorf(...) from inside retry.RetryOnConflict is technically correct (non-conflict errors surface immediately), but the reconciler will log this at error-level via LoggingErrorExceptConflict. Since "workers not yet drained" is a normal transient state, consider returning a dedicated sentinel (e.g., wrapping with a condition type) so the outer handler can log at Info level and requeue with a reasonable interval instead of exponential backoff.
alluxio fsadmin decommission --addresses flag verification
Please confirm this CLI form is supported by the Alluxio version used in Fluid's CI (the Alluxio docs historically have varied between decommissionWorker and decommission; this could silently fail against older images).
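The sentinel suggestion for the "not yet drained" state can be sketched with the standard `errors` package. This is only an illustration under stated assumptions: `errWorkersNotDrained`, `checkDrained`, `classify`, and the 10-second requeue interval are hypothetical names and values, not Fluid's actual API or the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errWorkersNotDrained is a hypothetical sentinel marking the normal,
// transient "still draining" state during scale-in.
var errWorkersNotDrained = errors.New("alluxio workers not yet drained")

// checkDrained wraps the sentinel with detail when more workers are
// active than the desired replica count.
func checkDrained(active, desired int) error {
	if active > desired {
		return fmt.Errorf("%w: %d active, want at most %d", errWorkersNotDrained, active, desired)
	}
	return nil
}

// classify shows how an outer handler could pick a log level and a
// requeue delay. The 10s interval is an arbitrary example value.
func classify(err error) (level string, requeueAfter time.Duration) {
	switch {
	case err == nil:
		return "none", 0
	case errors.Is(err, errWorkersNotDrained):
		return "info", 10 * time.Second
	default:
		return "error", 0
	}
}

func main() {
	level, after := classify(checkDrained(3, 2))
	fmt.Println(level, after) // prints: info 10s
}
```

Because the sentinel is wrapped with `%w`, `errors.Is` still matches it after the detail message is added, so the caller does not need to compare strings.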

Test Coverage (Primary Gap)

decommission_test.go thoroughly covers the operations layer — good.
Missing: unit tests for drainScalingDownWorkers. This is the most critical function — it orchestrates decommission, polls active workers, and gates the scale-down. It needs tests covering at least:
- Pod already deleted (NotFound path)
- Pod has no IP yet
- Decommission call fails
- Active count still above threshold (requeue path)
- Happy path (drained successfully)
Missing: unit test for getWorkerRPCPort covering the custom-port and default-port branches.

Without these, the core reconciler logic has no direct regression coverage.
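The five cases listed above map naturally onto a table-driven check. The sketch below models the drain step with injected functions so each branch can be exercised without a cluster. Everything in it is an assumption made for illustration: the `worker` type, the `drain` function, its string outcomes, and in particular the choice to skip pods that are missing or have no IP are not taken from the PR's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// worker is a simplified, hypothetical view of a worker pod.
type worker struct {
	exists bool   // false models a NotFound lookup
	ip     string // empty models a pod with no IP assigned yet
}

// drain models one pass of a drain step and reports "skipped", "failed",
// "requeue", or "drained". The branch behavior is assumed, not Fluid's.
func drain(w worker, decommission func(ip string) error, active func() int, desired int) string {
	if !w.exists || w.ip == "" {
		return "skipped" // nothing to decommission
	}
	if err := decommission(w.ip); err != nil {
		return "failed"
	}
	if active() > desired {
		return "requeue" // workers still registered with the master
	}
	return "drained"
}

func main() {
	ok := func(string) error { return nil }
	bad := func(string) error { return errors.New("decommission rejected") }

	cases := []struct {
		name         string
		w            worker
		decommission func(string) error
		active       int
		want         string
	}{
		{"pod already deleted", worker{exists: false}, ok, 2, "skipped"},
		{"pod has no IP yet", worker{exists: true}, ok, 2, "skipped"},
		{"decommission call fails", worker{true, "10.0.0.3"}, bad, 3, "failed"},
		{"active count above threshold", worker{true, "10.0.0.3"}, ok, 3, "requeue"},
		{"happy path", worker{true, "10.0.0.3"}, ok, 2, "drained"},
	}
	for _, c := range cases {
		got := drain(c.w, c.decommission, func() int { return c.active }, 2)
		fmt.Printf("%-30s got=%-8s pass=%v\n", c.name, got, got == c.want)
	}
}
```

In a real test file the same table would typically live in a `_test.go` file and use `t.Run` per case, with a fake client standing in for the injected functions.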

Minor Items

File	Note
`pkg/ddc/alluxio/operations/decommission.go`	Copyright header says 2024; the current year is 2026.
`test/gha-e2e/curvine/read_job.yaml`	The retry-loop improvement is unrelated to the feature; consider splitting into a separate PR for cleaner history, or at minimum noting it in the PR description.
`pkg/features/features.go:44`	The `init()` that registers the gate will run on import. Make sure the `cmd/` entrypoint imports `pkg/features` so the gate is actually wired up for the controller binary. If it's only transitively imported via `pkg/ddc/alluxio/replicas.go`, that's fine, but worth verifying.

Verdict

Needs work. The design is correct and safe, but unit tests for drainScalingDownWorkers are required before merge, and the context.TODO() issue should be resolved. Once those are addressed this should be ready.

cheyang · 2026-06-14T08:37:12Z

The graceful scale-down approach using Alluxio's decommission mechanism is architecturally sound. However, two concerns remain from prior review:

Decommission command syntax: Please verify alluxio fsadmin decommission --addresses is supported in the Alluxio version used by this project. The CLI may vary across versions.
Worker count parsing: Parsing fsadmin report capacity text output is fragile. Consider using the REST API for reliability.

The parseActiveWorkerCount tests and the overall structure are well done. Once the command format is confirmed, this is close to ready.

Verdict: needs-work — command format needs verification

sonarqubecloud · 2026-06-17T15:10:27Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
4.5% Duplication on New Code

jakharmonika364 · 2026-06-17T15:14:33Z

Pushed fixes for all of this:

Verified the CLI command: it's actually fsadmin decommissionWorker, not decommission. The earlier code would've silently no-op'd against real Alluxio masters. Fixed, and noted that it needs Alluxio 2.9+.
Passed ctx through instead of context.TODO(), and pass runtime into getWorkerRPCPort instead of refetching it.
Added a sentinel for the "not drained yet" case so it logs at Info instead of Error - it's normal during scale-in, not a failure.
Added unit tests for drainScalingDownWorkers (no-IP, NotFound, decommission failure, still-draining, happy path) and getWorkerRPCPort.
Fixed copyright year, dropped the unrelated curvine retry-loop change (wasn't part of this PR).
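For readers following the context change, the effect of passing the caller's context through instead of `context.TODO()` can be shown with the standard library alone. In this sketch `slowLookup` is a hypothetical stand-in for a blocking API call such as a client `Get`, and `drainWithContext` and `cancelledCallError` are illustrative names, not functions from this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowLookup is a hypothetical stand-in for a blocking API call.
// It returns early with ctx.Err() if the context is cancelled.
func slowLookup(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// drainWithContext forwards the caller's context rather than creating
// context.TODO(), so cancellation reaches the blocking call.
func drainWithContext(ctx context.Context) error {
	return slowLookup(ctx, 2*time.Second)
}

// cancelledCallError runs the call with an already-cancelled context and
// returns the resulting error.
func cancelledCallError() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the reconciler being stopped, e.g. on leadership loss
	return drainWithContext(ctx)
}

func main() {
	fmt.Println(cancelledCallError()) // prints: context canceled
}
```

With `context.TODO()` in place of `ctx`, the same call would wait out the full two seconds because nothing could cancel it.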

@cheyang, could you please review it?

fluid-e2e-bot Bot added the needs-ok-to-test label Apr 23, 2026

gemini-code-assist Bot reviewed Apr 23, 2026

Comment thread pkg/ddc/alluxio/replicas.go Outdated

Comment thread pkg/ddc/alluxio/replicas.go Outdated

Comment thread pkg/ddc/alluxio/replicas.go Outdated

jakharmonika364 force-pushed the feat-support-advanced-statefulset-4193 branch 2 times, most recently from b77d0d4 to c62bca3 Compare April 26, 2026 20:54

cheyang requested changes Jun 12, 2026

feat: support graceful scale-down for AlluxioRuntime using AdvancedStatefulSet (fluid-cloudnative#4193)
Signed-off-by: Monika Jakhar <jakharmonika364@gmail.com>
215e960

jakharmonika364 force-pushed the feat-support-advanced-statefulset-4193 branch from c62bca3 to 215e960 Compare June 17, 2026 15:09

jakharmonika364 wants to merge 1 commit into fluid-cloudnative:master from jakharmonika364:feat-support-advanced-statefulset-4193
