HDDS-15076. Fix incorrect pending deletion size for EC blocks in DN #10105
priyeshkaratha wants to merge 3 commits into apache:master
Conversation
sreejasahithi
left a comment
Thanks @priyeshkaratha for the patch
@priyeshkaratha , it should be fixed in the datanode. The current changes in SCM will make the SCM report incorrect.
I have updated the PR to do the complete logic inside the DN. To calculate the closest approximate value, the DN needs the number of required nodes, which has to be passed from the SCM side. Also added an integration test to validate the same.
Pull request overview
This PR fixes inflated “pending deletion size” accounting for erasure-coded (EC) blocks by introducing a per-replica/per-shard size signal into the delete-blocks flow so DataNodes attribute only their shard’s share rather than the full logical block size.
Changes:
- Extend deleted-block metadata/protos to carry per-replica size (and aggregate it into SCM delete transactions).
- Update OM to populate per-replica size for blocks pending deletion via QuotaUtil.getSizePerReplica(...).
- Add/adjust unit + integration tests for quota sizing and Recon storage distribution / pending deletion endpoints (RATIS + EC).
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/service/TestKeyDeletingService.java | Update DeletedBlock constructor usage to include per-replica size. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmMetadataManagerImpl.java | Populate DeletedBlock with per-replica size when building delete block groups. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java | Populate DeletedBlock with per-replica size for pending key deletions. |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/shell/TestDeletedBlocksTxnShell.java | Update test data generation for new DeletedBlock signature. |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/upgrade/TestScmDataDistributionFinalization.java | Update test deleted-block generation for per-replica size. |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestStorageContainerManager.java | Update SCM integration tests for new DeletedBlock signature. |
| hadoop-ozone/integration-test-recon/src/test/java/org/apache/hadoop/ozone/recon/TestStorageDistributionEndpointRatis.java | New Recon integration test for RATIS replication behavior and DN failure reporting. |
| hadoop-ozone/integration-test-recon/src/test/java/org/apache/hadoop/ozone/recon/TestStorageDistributionEndpointEC.java | New Recon integration test for EC replication behavior. |
| hadoop-ozone/integration-test-recon/src/test/java/org/apache/hadoop/ozone/recon/AbstractTestStorageDistributionEndpoint.java | Refactor shared Recon integration test infrastructure into an abstract base class. |
| hadoop-ozone/common/src/test/java/org/apache/hadoop/ozone/om/helpers/TestQuotaUtil.java | Add unit tests for new per-replica sizing helper across RATIS + EC. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/QuotaUtil.java | Add getSizePerReplica(...) helper for per-DN physical sizing. |
| hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/block/TestDeletedBlockLog.java | Update SCM deleted-block-log tests for per-replica totals in transactions. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java | Populate totalSizePerReplica in SCM delete transactions when available. |
| hadoop-hdds/interface-server/src/main/proto/ScmServerProtocol.proto | Extend KeyBlocks proto with sizePerReplica. |
| hadoop-hdds/interface-server/src/main/proto/ScmServerDatanodeHeartbeatProtocol.proto | Extend DeletedBlocksTransaction proto with totalSizePerReplica. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/ozone/common/DeletedBlock.java | Extend DeletedBlock model to carry per-replica size. |
| hadoop-hdds/framework/src/main/java/org/apache/hadoop/ozone/common/BlockGroup.java | Serialize/deserialize per-replica size in KeyBlocks proto. |
| hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java | Use per-replica transaction totals when updating DN pending deletion bytes. |
Comments suppressed due to low confidence (1)
hadoop-ozone/integration-test-recon/src/test/java/org/apache/hadoop/ozone/recon/AbstractTestStorageDistributionEndpoint.java:200
- The inline comment on `setDataSize(100)` is now incorrect (it still says the replicated size is `10 * 3 = 30`). Please update or remove it so the comment matches the new 100-byte test payload and the updated assertions (e.g., 100 * 3 = 300 for RATIS THREE).
```diff
  if (VersionedDatanodeFeatures.isFinalized(
      HDDSLayoutFeature.STORAGE_SPACE_DISTRIBUTION) && delTX.hasTotalBlockSize()) {
    long pendingBytes = containerData.getBlockPendingDeletionBytes();
-   pendingBytes += delTX.getTotalBlockSize();
+   pendingBytes += delTX.getTotalSizePerReplica();
    metadataTable
```
```diff
  containerData.incrPendingDeletionBlocks(newDeletionBlocks,
-     delTX.hasTotalBlockSize() ? delTX.getTotalBlockSize() : 0);
+     delTX.hasTotalBlockSize() ? delTX.getTotalSizePerReplica() : 0);
  containerData.updateDeleteTransactionId(delTX.getTxID());
```
```java
 * @return bytes stored on a single DN for this block
 */
public static long getSizePerReplica(long dataSize, ReplicationConfig repConfig) {
  int requiredNodes = repConfig.getRequiredNodes();
```
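The snippet above is truncated at the first line of the method body. As a rough, self-contained sketch of what such a helper could compute (class name and the simplified data/parity parameters are assumptions — the real helper takes Ozone's `ReplicationConfig`):

```java
// Illustrative sketch only, not the actual Ozone implementation.
public final class SizePerReplicaSketch {

  // Ratis: every replica stores the full logical size.
  public static long ratisSizePerReplica(long dataSize) {
    return dataSize;
  }

  // EC(k, m): replicated size ~= dataSize * (k + m) / k, spread over
  // k + m required nodes, so each node holds roughly dataSize / k bytes.
  public static long ecSizePerReplica(long dataSize, int data, int parity) {
    int requiredNodes = data + parity;
    long totalReplicatedSize = dataSize * requiredNodes / data;
    return totalReplicatedSize / requiredNodes;
  }

  public static void main(String[] args) {
    System.out.println(ratisSizePerReplica(300));    // 300
    System.out.println(ecSizePerReplica(300, 3, 2)); // 100 (500 replicated bytes over 5 nodes)
  }
}
```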
Let's check REPLICATION_TYPE first. For Ratis, there is no need to go through the calculation.
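The suggested short-circuit can be sketched as an early return keyed on the replication type (the enum here is a simplified stand-in for `HddsProtos.ReplicationType`; names are illustrative):

```java
public final class ReplicaSizeDemo {
  // Simplified stand-in for the real replication-type enum.
  public enum ReplicationType { RATIS, EC }

  public static long sizePerReplica(long dataSize, ReplicationType type,
      int data, int parity) {
    if (type != ReplicationType.EC) {
      // Ratis/standalone: each replica is a full copy, no shard math needed.
      return dataSize;
    }
    int requiredNodes = data + parity;
    return (dataSize * requiredNodes / data) / requiredNodes;
  }

  public static void main(String[] args) {
    System.out.println(sizePerReplica(3000, ReplicationType.RATIS, 0, 0)); // 3000
    System.out.println(sizePerReplica(3000, ReplicationType.EC, 6, 3));    // 500
  }
}
```

Checking the type first also avoids dividing by a meaningless data-shard count for non-EC configs.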
What changes were proposed in this pull request?
Problem
For EC blocks, each DataNode holds only one shard, but each DN was crediting itself with the full logical block size as its pending-deletion contribution. This inflated cluster-wide pending deletion metrics by roughly a factor of k, the number of data shards (e.g. 3× for EC(3,2)).
Why this approximation is acceptable
Dividing the total replicated size evenly across all nodes is not byte-perfect: because of the last partial stripe, some data shards are smaller than others.
However the cluster-wide sum across all DNs still equals totalReplicatedSize, so aggregate metrics are correct.
The per-DN error is at most ±1 ecChunkSize per block, which is negligible at container scale.
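The bound above can be checked with a toy calculation (not Ozone code; the shard sizes are an illustrative EC(3,2) block with ecChunkSize = 100 and dataSize = 250, i.e. one partial stripe):

```java
public final class EvenSplitErrorDemo {
  // Split the actual replicated bytes evenly across all nodes;
  // returns {bytesCreditedPerDn, maxAbsolutePerDnError}.
  public static long[] evenSplit(long[] actualShardSizes) {
    long total = 0;
    for (long s : actualShardSizes) {
      total += s;
    }
    long perDn = total / actualShardSizes.length;
    long maxErr = 0;
    for (long s : actualShardSizes) {
      maxErr = Math.max(maxErr, Math.abs(s - perDn));
    }
    return new long[] {perDn, maxErr};
  }

  public static void main(String[] args) {
    // Data shards: 100, 100, 50 (partial last shard); parity shards: 100, 100.
    long[] r = evenSplit(new long[] {100, 100, 50, 100, 100});
    System.out.println(r[0]); // 90 bytes credited per DN; 5 * 90 = 450 = actual total
    System.out.println(r[1]); // worst per-DN error is 40, within ecChunkSize = 100
  }
}
```

The aggregate across all five DNs matches the true replicated size, while each individual DN is off by at most one chunk.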
What is the link to the Apache JIRA
HDDS-15076
How was this patch tested?
Tested using existing test cases plus additional unit test cases.