fix(pt_expt): report average step time in the normal training log#5500
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR extends training message logging with step time reporting. ChangesStep time reporting in training messages
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The pt_expt trainer emitted a separate, always-on `step=N step_time=...s` debug line (leftover from deepmodeling#5397) in addition to the normal progress message, where step_time was a single instantaneous sample of the last step. Replace it by folding an average-per-step time into the standard progress message: `format_training_message` gains an optional `step_time` argument (rendered as `step time = X.XXXX s`, default-off so pt/pd/jax are unaffected), and the trainer now reports interval_wall_time / steps_since_last_log. The per-step time.time() bracketing (t_start/t_end) is dropped from the hot loop. Tests covering the with/without step_time and eta branches are added to source/tests/pt_expt/test_training.py.
3e14909 to
189aff8
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5500 +/- ##
==========================================
- Coverage 81.42% 81.42% -0.01%
==========================================
Files 871 871
Lines 96951 96951
Branches 4240 4245 +5
==========================================
- Hits 78941 78939 -2
- Misses 16707 16710 +3
+ Partials 1303 1302 -1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Use an explicit per-step rate unit (avg = 0.1841 s/step) instead of the redundant 'step time = 0.1841 s', keeping the interval 'total wall time' unchanged.
Summary
The pt_expt trainer emitted a separate, always-on debug line on top of the normal progress message:
This
step=N step_time=...sline (leftover from #5397) is pt_expt-only (absent from pt/pd/jax) andstep_timethere was a single instantaneous sample of the last step, not an average.This PR removes the standalone line and folds an average per-step time into the normal message:
Changes
deepmd/loggers/training.py:format_training_messagegains an optionalstep_timeargument, rendered asstep time = X.XXXX sbetween the wall time and eta. Default-off, sopt/pd/jaxoutput is unchanged.deepmd/pt_expt/train/training.py: drop the standalonestep=... step_time=...log line and the per-steptime.time()bracketing (t_start/t_end); reportstep_time = interval_wall_time / steps_since_last_log(steps counted exactly once across intervals vialast_log_step).source/tests/pt_expt/test_training.py: addTestFormatTrainingMessageStepTimecovering the with/withoutstep_timeandetabranches.Known limitations
format_training_message) across all branches; they do not drive the trainer loop, so the average computation itself is verified by reasoning + existing training smoke tests rather than an end-to-end assertion.step timeis inflated vs steady state (expected).Summary by CodeRabbit