[Feat] Add producer task trace and viewer#1891

Open
YanhuiDua wants to merge 3 commits into
InternLM:mainfrom
YanhuiDua:dev-task-trace
Conversation

@YanhuiDua YanhuiDua commented Jun 8, 2026

Collaborator

Feature overview

Adds task-level tracing for the Producer, used to pinpoint which execution stage each task is in during rollout. It supports real-time monitoring of every task's behavior as well as offline analysis of hotspots in agent_loop.generate_group.

  • Adds trace_function / trace_span, which record stages such as producer, agent loop, rollout, judger, and backend request.
  • When trace is enabled, RLTrainer automatically starts the online Producer Trace Viewer.
  • Both the online and offline viewers show only the latest produce_batch by default; pass --scope all to view the full history.

Usage

  • Online viewer usage:
  from pathlib import Path

  from xtuner.v1.rl.trace import TraceConfig

  trace_config = TraceConfig(
      enabled=True,
      output_dir=Path(work_dir) / "producer_trace",
      viewer_enabled=True,
      viewer_host="0.0.0.0",
      viewer_port=0,  # 0 means an available port is chosen automatically
      viewer_refresh_interval_s=1.0,
  )

After the Trainer starts, it prints the actual address:
Producer Trace Viewer: http://127.0.0.1:

  • Offline viewer usage:
  # Run the conversion script
  python -m xtuner.tools.producer_trace_hotspots /path/to/work_dir/producer_trace
  # The generated HTML is written to
  /path/to/work_dir/producer_trace/producer_trace_hotspots.html

View the full history:

python -m xtuner.tools.producer_trace_hotspots /path/to/work_dir/producer_trace --scope all
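
For readers unfamiliar with the viewer_port=0 convention in the config above: binding a TCP socket to port 0 asks the operating system to assign any free port, which is then read back from the socket. The sketch below illustrates that general mechanism with the standard library only. The helper name pick_free_port is hypothetical, and this is not taken from the PR's viewer code.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Bind to port 0 so the OS assigns an unused TCP port, then report it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (host, port) for AF_INET sockets.
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(f"the OS assigned port {pick_free_port()}")
```

A server that wants to avoid a race would normally keep the socket it bound to port 0 rather than closing it and binding again, since another process could take the port in between.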

Online monitoring view

[image]

Offline hotspot view

[image]

@YanhuiDua

Collaborator Author

@claude review

Comment thread xtuner/v1/rl/rollout/utils.py Outdated
Comment thread xtuner/v1/rl/agent_loop_manager/sampler.py
Comment thread xtuner/v1/rl/agent_loop_manager/producer.py
return routed_experts

@ray.method(concurrency_group=ROLLOUT_CONCURRENCY_GROUP_GENERATE)
@trace_function("xtuner.rollout_worker.generate", trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank})

Contributor


Claude: Nit: This line exceeds the 119-character limit configured in pyproject.toml. Consider splitting:

Suggested change
- @trace_function("xtuner.rollout_worker.generate", trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank})
+ @trace_function(
+     "xtuner.rollout_worker.generate",
+     trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank},
+ )
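
To make the decorator shape in this thread concrete, here is a minimal, synchronous stand-in for a tracing decorator that accepts a span name and a trace_kwargs_getter. It is only a sketch under assumptions: toy_trace_function, RECORDS, and Worker are hypothetical names, the record fields are illustrative, and the PR's real trace_function (which wraps Ray actor methods and writes to its own store) may differ substantially.

```python
import functools
import time

RECORDS = []  # hypothetical in-memory sink standing in for a trace store


def toy_trace_function(span, trace_kwargs_getter=None):
    """Toy decorator: time the call and record span name, outcome, and extra fields."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # The getter sees the same arguments as the wrapped call, including self.
            extra = trace_kwargs_getter(*args, **kwargs) if trace_kwargs_getter else {}
            start = time.perf_counter()
            outcome = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                duration_s = time.perf_counter() - start
                RECORDS.append({"span": span, "outcome": outcome, "duration_s": duration_s, **extra})

        return wrapper

    return decorator


class Worker:
    def __init__(self, rank):
        self.rank = rank

    @toy_trace_function(
        "xtuner.rollout_worker.generate",
        trace_kwargs_getter=lambda self, *args, **kwargs: {"worker_rank": self.rank},
    )
    def generate(self, prompt):
        return prompt.upper()
```

Calling Worker(3).generate("hi") appends one record carrying worker_rank=3, which shows why the getter takes self as its first parameter.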

Comment thread xtuner/v1/rl/trace.py
Comment on lines +627 to +645
    <div class="trace-label">
      <div class="trace-id mono" title="${esc(row.trace_id)}">${esc(row.trace_id)}</div>
      <div class="subtle">duration ${fmtSeconds(row.duration_s)}</div>
    </div>
    <div class="trace-lane">
      ${row.spans.map((span) => `
        <div
          class="span-block ${esc(span.outcome)}"
          title="${esc(span.display_stage)} ${fmtSeconds(span.duration_s)} ${esc(span.span)}"
          style="left:${span.left_pct}%;width:${span.width_pct}%;top:${span.top_px}px;background:${span.color}">
          ${esc(span.display_stage)}
        </div>
      `).join("")}
    </div>
  </div>
`).join("");
}

renderMetrics();

Contributor


Claude: Nit: _split_span_stage and _percentile are duplicated from producer_trace_analysis.py. Consider importing them instead to keep a single source of truth.

claude Bot commented Jun 9, 2026

Contributor

Claude: Summary

Adds producer task-level tracing with online/offline viewers — solid observability feature for the RL pipeline. The core trace infrastructure (JSONL writer, in-memory store, decorator API) is well-designed and non-blocking.

ProduceBatchResult impact: not affected — tracing is observability-only and does not modify batch accounting, timing, or reward semantics.

RoutedExperts impact: low risk — _trace_* keys are added to extra_fields which travel through the pipeline, but they use a distinct prefix and don't affect routed-experts ownership or cleanup paths.

Issues

Critical

  • rollout/utils.py:278: Copy-paste bug — preprocess method is decorated with span name "xtuner.partial_rollout_handler.postprocess".
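
One way to catch this class of copy-paste mistake early is to validate the span name against the decorated function at decoration time. The sketch below assumes a convention in which the last dotted component of the span name equals the function name; that convention and the checked_span helper are assumptions for illustration, not part of the PR.

```python
def checked_span(span):
    """Hypothetical decorator: reject span names whose last component differs from the function name."""

    def decorator(fn):
        expected = span.rsplit(".", 1)[-1]
        if expected != fn.__name__:
            raise ValueError(f"span {span!r} does not match function {fn.__name__!r}")
        fn.span_name = span  # attach for inspection; a real tracer would wrap the call here
        return fn

    return decorator


@checked_span("xtuner.partial_rollout_handler.postprocess")
def postprocess(state):
    return state


def decorate_preprocess_with_wrong_span():
    """Reproduce the reported mismatch: a preprocess function tagged with the postprocess span."""

    @checked_span("xtuner.partial_rollout_handler.postprocess")
    def preprocess(state):
        return state

    return preprocess
```

With this check, the mismatch surfaces as a ValueError at import time instead of as a mislabeled span in the viewer.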

Warning

  • sampler.py:133-142: Setting state.task_name = task_name is a behavioral change beyond tracing — downstream code that relied on task_name=None for fresh states will behave differently.
  • producer.py:377-383: Injecting _trace_* keys into RolloutState.extra_fields couples trace internals to the data model. These keys travel through replay buffer, workers, and potentially checkpoints.
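
If the trace keys do stay in extra_fields, one possible mitigation is to separate them by prefix before the state is persisted or handed to code that should not see trace internals. The sketch below assumes extra_fields is a plain dict and uses the `_trace_` prefix mentioned in the review; split_trace_fields and the sample keys are hypothetical.

```python
TRACE_PREFIX = "_trace_"


def split_trace_fields(extra_fields):
    """Return (payload, trace): payload omits trace keys, trace holds only them."""
    payload = {k: v for k, v in extra_fields.items() if not k.startswith(TRACE_PREFIX)}
    trace = {k: v for k, v in extra_fields.items() if k.startswith(TRACE_PREFIX)}
    return payload, trace


# Hypothetical example values; the real keys written by the PR are not shown in this thread.
fields = {"_trace_id": "abc123", "_trace_parent": "root", "reward_hint": 0.5}
payload, trace = split_trace_fields(fields)
```

Because the function builds new dicts, the original extra_fields mapping is left untouched, so callers can choose where the stripped view is used.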

Nit

  • Several @trace_function(...) one-liners exceed the 119-char limit (worker.py:684, producer.py).
  • Duplicate _split_span_stage / _percentile helper functions in both producer_trace_analysis.py and producer_trace_hotspots.py.
  • Dead code in trace.py:382-385: path.exists() is always True after open("a").
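
The dead-code nit rests on a standard-library fact: opening a file in append mode creates it if it is missing, so an existence check right after the open cannot fail in the ordinary single-process case. The sketch below demonstrates this with a temporary directory; append_line and the sample JSON line are hypothetical and are not copied from trace.py.

```python
import tempfile
from pathlib import Path


def append_line(path, line):
    """Append one line, then report whether the file exists afterwards."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return path.exists()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "trace.jsonl"
    existed_before = target.exists()
    exists_after = append_line(target, '{"span": "demo"}')
    content = target.read_text(encoding="utf-8")
```

The only way the check could return False is if something else removed the file between the write and the check, which a post-open guard does not meaningfully protect against.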

Verdict

REQUEST_CHANGES
