fix: miscalculation of num_steps when using num_epoch and lmdb#5488
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adjust LMDB batch counting to reflect auto-probability expansion and align distributed/training step calculations, with added regression tests.
Changes:
- Update
total_batch/indexsemantics to track sampler-expanded batch counts. - Use DataLoader length for LR step calculations when training on
LmdbDataset. - Add tests covering
total_batchalignment and distributed sampler length with auto-prob expansion.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| source/tests/pt/test_lmdb_dataloader.py | Adds tests validating expanded batch counts and distributed sampler __len__() behavior. |
| deepmd/pt/utils/lmdb_dataset.py | Changes index/total_batch to be derived from the batch sampler length (including expansion). |
| deepmd/pt/train/training.py | Uses len(training_dataloader) to compute batch counts for LR scheduling with LMDB datasets. |
| deepmd/dpmodel/utils/lmdb_data.py | Updates total_batch calculation and adjusts distributed sampler __len__() to include expansion via SameNlocBatchSampler. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Too many files changed? Review this PR in Change Stack to see how the pieces fit before you dive in. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughBatch-count computation moved to sampler/dataloader-derived totals: ChangesBatch Count Estimation Alignment
Sequence Diagram(s)sequenceDiagram
participant LmdbDataReader
participant SameNlocBatchSampler
participant DistributedSameNlocBatchSampler
participant Trainer
LmdbDataReader->>SameNlocBatchSampler: instantiate (shuffle=False, block_targets) to measure expanded batches
SameNlocBatchSampler-->>LmdbDataReader: return expanded batch count
LmdbDataReader->>DistributedSameNlocBatchSampler: provide total_batch or allow sampler-based estimation
DistributedSameNlocBatchSampler->>DistributedSameNlocBatchSampler: cache _total_batches and compute ceil(_total_batches/world_size)
DistributedSameNlocBatchSampler-->>Trainer: len() returns per-rank batch count
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 1311-1320: The current __len__() returns ceil(total /
self._world_size) for every rank which mismatches the strided partitioning in
__iter__(); change __len__() to compute a rank-aware batch count by getting
total = len(SameNlocBatchSampler(self._reader, shuffle=False,
block_targets=self._block_targets)), then compute base = total //
self._world_size and remainder = total % self._world_size and return base + (1
if self._rank < remainder else 0) so that __len__() matches the actual number of
batches produced by the __iter__() strided partitioning (alternatively, adjust
_partition_batches() to pad/repeat batches so every rank emits
ceil(total/world_size) and keep current __len__()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8ecc555b-6099-4600-a3c5-6d331cce1d8d
📒 Files selected for processing (4)
deepmd/dpmodel/utils/lmdb_data.pydeepmd/pt/train/training.pydeepmd/pt/utils/lmdb_dataset.pysource/tests/pt/test_lmdb_dataloader.py
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5488 +/- ##
==========================================
+ Coverage 81.36% 81.40% +0.03%
==========================================
Files 868 868
Lines 96567 96781 +214
Branches 4233 4240 +7
==========================================
+ Hits 78570 78781 +211
- Misses 16697 16698 +1
- Partials 1300 1302 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
njzjz-bot
left a comment
There was a problem hiding this comment.
LGTM. The latest revision fixes the distributed sampler length mismatch by making __len__() rank-aware while still caching the global sampler length, and the added tests cover both the rank-specific length/iterator consistency and cached-total behavior. The LmdbDataset.total_batch/index behavior for auto-prob expanded datasets is also now explicit in tests.
CI is still finishing some Python shards, but the already completed checks I saw are green, including CodeQL/pre-commit/builds.
— OpenClaw 2026.5.28 (model: custom-chat-jinzhezeng-group/gpt-5.5)
Summary by CodeRabbit
Bug Fixes
Tests