[Feat] Add producer task trace and viewer #1891
@claude review
return routed_experts

@ray.method(concurrency_group=ROLLOUT_CONCURRENCY_GROUP_GENERATE)
@trace_function("xtuner.rollout_worker.generate", trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank})
Claude: Nit: This line exceeds the 119-character limit configured in pyproject.toml. Consider splitting:
@trace_function(
    "xtuner.rollout_worker.generate",
    trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank},
)
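For readers unfamiliar with the pattern, here is a minimal sketch of how a decorator with a `trace_kwargs_getter` argument could behave. It is not the PR's implementation: the `TRACE_EVENTS` list, the event fields, and `DemoWorker` are hypothetical, and the sketch wraps a synchronous method only, whereas the real worker method may be async.

```python
import functools
import time
from typing import Any, Callable, Optional

# Hypothetical in-memory sink; the PR's real store and JSONL writer are not shown here.
TRACE_EVENTS: list[dict[str, Any]] = []


def trace_function(
    span: str,
    trace_kwargs_getter: Optional[Callable[..., dict[str, Any]]] = None,
) -> Callable:
    """Record one timed event per call (illustrative only, not xtuner's code)."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # The getter receives the same arguments as the wrapped call,
            # so for a method its first argument is `self`.
            extra = trace_kwargs_getter(*args, **kwargs) if trace_kwargs_getter else {}
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Always record the span, whether the call succeeded or raised.
                TRACE_EVENTS.append(
                    {
                        "span": span,
                        "outcome": outcome,
                        "duration_s": time.perf_counter() - start,
                        **extra,
                    }
                )

        return wrapper

    return decorator


class DemoWorker:
    """Hypothetical stand-in for a rollout worker."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    @trace_function(
        "demo.generate",
        trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank},
    )
    def generate(self, prompt: str) -> str:
        return prompt.upper()
```

Calling `DemoWorker(rank=3).generate("hi")` appends one event carrying `worker_rank` to `TRACE_EVENTS`. Written in the multi-line form, the decorator call also stays within the 119-character limit.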
<div class="trace-label">
  <div class="trace-id mono" title="${esc(row.trace_id)}">${esc(row.trace_id)}</div>
  <div class="subtle">duration ${fmtSeconds(row.duration_s)}</div>
</div>
<div class="trace-lane">
  ${row.spans.map((span) => `
    <div
      class="span-block ${esc(span.outcome)}"
      title="${esc(span.display_stage)} ${fmtSeconds(span.duration_s)} ${esc(span.span)}"
      style="left:${span.left_pct}%;width:${span.width_pct}%;top:${span.top_px}px;background:${span.color}">
      ${esc(span.display_stage)}
    </div>
  `).join("")}
</div>
</div>
`).join("");
}

renderMetrics();
Claude: Nit: _split_span_stage and _percentile are duplicated from producer_trace_analysis.py. Consider importing them instead to keep a single source of truth.
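As a rough illustration of the single-source idea, the sketch below defines a percentile helper once so that both the offline analysis and the viewer could call the same function. The body of the PR's `_percentile` is not visible in this excerpt, so the linear-interpolation behaviour and the empty-input default here are assumptions, and the public name `percentile` is hypothetical.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile for pct in [0, 100].

    Hypothetical stand-in: the PR's `_percentile` may use a different
    method or a different default for empty input.
    """
    if not values:
        return 0.0
    ordered = sorted(values)
    # Fractional index into the sorted list.
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# In the viewer module, this helper would be imported from
# producer_trace_analysis rather than redefined (the exact package path
# is not shown in this excerpt).
durations_s = [1.0, 2.0, 3.0, 4.0]
p50 = percentile(durations_s, 50)
```

Keeping one definition means a later change to the interpolation rule cannot make the online viewer and the offline report disagree on the same data.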
Claude:
## Summary
Adds producer task-level tracing with online/offline viewers: a solid observability feature for the RL pipeline. The core trace infrastructure (JSONL writer, in-memory store, decorator API) is well-designed and non-blocking.
ProduceBatchResult impact: not affected. Tracing is observability-only and does not modify batch accounting, timing, or reward semantics.
RoutedExperts impact: low risk.
## Issues
### Critical
### Warning
### Nit
## Verdict
REQUEST_CHANGES
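Feature overview
Adds producer task-level tracing to pinpoint which execution stage each task is in during rollout. It supports real-time monitoring of all tasks and offline analysis of hotspots in agent_loop.generate_group.
Usage
After the Trainer starts, it prints the actual viewer address:
Producer Trace Viewer: http://127.0.0.1:
View the full history:
python -m xtuner.tools.producer_trace_hotspots /path/to/work_dir/producer_trace --scope all
[Figure: online monitoring view]
[Figure: offline hotspot view]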