feat: add Time Between Turns (TBT) to audit/logs command and glossary#26321
feat: add Time Between Turns (TBT) to audit/logs command and glossary#26321
Conversation
- Add AvgTimeBetweenTurns and MaxTimeBetweenTurns to LogMetrics - Parse TBT from user.message timestamps in events.jsonl (Copilot engine) - Add TBT fields + cache warning to SessionAnalysis in audit command - Render TBT with Anthropic 5-min cache-expiry warning in audit report - Add AvgTimeBetweenTurns to WorkflowRun and RunData for logs command - Propagate TBT through logs orchestrator pipeline - Add TBT glossary entry to docs - Add unit tests for TBT computation in parseEventsJSONLFile Agent-Logs-Url: https://github.com/github/gh-aw/sessions/ac5d0f93-faf6-480f-878a-17d194e35075 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds Time Between Turns (TBT) as a first-class performance/cost metric across audit and logs reporting, with Copilot events.jsonl providing precise per-turn timing.
Changes:
- Extend
workflow.LogMetrics(and downstream run models) with Avg/Max TBT durations. - Parse Copilot
events.jsonltimestamps to compute precise Avg/Max TBT; render TBT + cache-expiry warnings ingh aw audit. - Expose Avg TBT via
gh aw logs --jsonand document TBT in the glossary.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/metrics.go | Adds Avg/Max TBT fields to LogMetrics. |
| pkg/cli/copilot_events_jsonl.go | Collects per-turn timestamps from events.jsonl and computes Avg/Max TBT. |
| pkg/cli/copilot_events_jsonl_test.go | Adds test coverage for TBT computation scenarios. |
| pkg/cli/audit_expanded.go | Adds TBT + cache-warning fields to SessionAnalysis and populates them (precise + fallback estimate). |
| pkg/cli/audit_report_render_overview.go | Renders Avg/Max TBT and cache warning in the audit overview. |
| pkg/cli/logs_models.go | Adds Avg TBT to WorkflowRun for propagation through logs pipeline. |
| pkg/cli/logs_orchestrator.go | Propagates Avg TBT from extracted metrics into WorkflowRun. |
| pkg/cli/logs_report.go | Emits avg_time_between_turns in logs JSON output (precise + fallback estimate). |
| docs/src/content/docs/reference/glossary.md | Documents TBT, TTL implications, and mitigation tips. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comments suppressed due to low confidence (2)
pkg/cli/logs_report.go:239
avg_time_between_turnsis exposed in--jsonoutput, but the estimated-path appends " (estimated)" to the duration string. This makes the field harder to parse programmatically and inconsistent with other duration fields. Consider keepingavg_time_between_turnsstrictly as a duration string and adding a separate boolean/source field to indicate whether it was estimated.
// Compute average TBT from metrics when available; fall back to wall-time / turns.
if run.AvgTimeBetweenTurns > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns)
} else if run.Turns > 1 && run.Duration > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns)) + " (estimated)"
}
pkg/cli/audit_expanded.go:238
AvgTimeBetweenTurnsis serialized to JSON for audit output, but the estimated path appends " (estimated)" directly into the value. This makes the JSON field non-uniform and harder to parse. Consider keeping the value as a plain duration string and adding a separate flag/source field (e.g.,avg_time_between_turns_estimatedortbt_source).
} else if metrics.Turns > 1 && run.Duration > 0 {
// Fallback: average cycle time as a proxy for TBT.
avgTBT := run.Duration / time.Duration(metrics.Turns)
session.AvgTimeBetweenTurns = timeutil.FormatDuration(avgTBT) + " (estimated)"
if avgTBT > anthropicCacheTTL {
- Files reviewed: 9/9 changed files
- Comments generated: 5
pkg/cli/logs_report.go
Outdated
| // Compute average TBT from metrics when available; fall back to wall-time / turns. | ||
| if run.AvgTimeBetweenTurns > 0 { | ||
| runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns) | ||
| } else if run.Turns > 1 && run.Duration > 0 { | ||
| runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns)) + " (estimated)" |
There was a problem hiding this comment.
Fallback estimate for AvgTimeBetweenTurns divides wall time by run.Turns, but “time between turns” has turns-1 intervals. This will systematically under-estimate TBT (e.g., 2 turns should have 1 interval). Consider using run.Duration / time.Duration(run.Turns-1) when Turns > 1 so the estimate matches the definition used by the timestamp-based computation.
This issue also appears on line 234 of the same file.
| // Compute average TBT from metrics when available; fall back to wall-time / turns. | |
| if run.AvgTimeBetweenTurns > 0 { | |
| runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns) | |
| } else if run.Turns > 1 && run.Duration > 0 { | |
| runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns)) + " (estimated)" | |
| // Compute average TBT from metrics when available; fall back to wall-time / (turns - 1). | |
| if run.AvgTimeBetweenTurns > 0 { | |
| runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns) | |
| } else if run.Turns > 1 && run.Duration > 0 { | |
| runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns-1)) + " (estimated)" |
pkg/cli/audit_expanded.go
Outdated
| } | ||
| } else if metrics.Turns > 1 && run.Duration > 0 { | ||
| // Fallback: average cycle time as a proxy for TBT. | ||
| avgTBT := run.Duration / time.Duration(metrics.Turns) |
There was a problem hiding this comment.
The wall-time fallback estimate for TBT uses run.Duration / time.Duration(metrics.Turns), but TBT is defined over turns-1 intervals between consecutive turns. For example, 2 turns should result in 1 interval, not 2. Using Turns-1 here will align the estimate with the timestamp-based computation.
This issue also appears on line 234 of the same file.
| avgTBT := run.Duration / time.Duration(metrics.Turns) | |
| avgTBT := run.Duration / time.Duration(metrics.Turns-1) |
pkg/cli/copilot_events_jsonl.go
Outdated
| for i := 1; i < len(turnTimestamps); i++ { | ||
| tbt := turnTimestamps[i].Sub(turnTimestamps[i-1]) | ||
| if tbt > 0 { | ||
| totalTBT += tbt | ||
| if tbt > maxTBT { | ||
| maxTBT = tbt | ||
| } | ||
| } | ||
| } | ||
| intervals := len(turnTimestamps) - 1 | ||
| if totalTBT > 0 && intervals > 0 { | ||
| metrics.AvgTimeBetweenTurns = totalTBT / time.Duration(intervals) | ||
| metrics.MaxTimeBetweenTurns = maxTBT | ||
| copilotEventsJSONLLog.Printf("TBT computed: avg=%s max=%s intervals=%d", | ||
| metrics.AvgTimeBetweenTurns, metrics.MaxTimeBetweenTurns, intervals) |
There was a problem hiding this comment.
TBT averaging currently divides by len(turnTimestamps)-1 even though zero/negative intervals are skipped (if tbt > 0 { ... }). If any adjacent timestamps are identical or out-of-order, totalTBT excludes those gaps but the divisor still counts them, which underestimates the average. Consider counting only the intervals you include (e.g., validIntervals++ when tbt > 0) and dividing by that count; also only set Avg/Max when validIntervals > 0.
| for i := 1; i < len(turnTimestamps); i++ { | |
| tbt := turnTimestamps[i].Sub(turnTimestamps[i-1]) | |
| if tbt > 0 { | |
| totalTBT += tbt | |
| if tbt > maxTBT { | |
| maxTBT = tbt | |
| } | |
| } | |
| } | |
| intervals := len(turnTimestamps) - 1 | |
| if totalTBT > 0 && intervals > 0 { | |
| metrics.AvgTimeBetweenTurns = totalTBT / time.Duration(intervals) | |
| metrics.MaxTimeBetweenTurns = maxTBT | |
| copilotEventsJSONLLog.Printf("TBT computed: avg=%s max=%s intervals=%d", | |
| metrics.AvgTimeBetweenTurns, metrics.MaxTimeBetweenTurns, intervals) | |
| validIntervals := 0 | |
| for i := 1; i < len(turnTimestamps); i++ { | |
| tbt := turnTimestamps[i].Sub(turnTimestamps[i-1]) | |
| if tbt > 0 { | |
| totalTBT += tbt | |
| validIntervals++ | |
| if tbt > maxTBT { | |
| maxTBT = tbt | |
| } | |
| } | |
| } | |
| if validIntervals > 0 { | |
| metrics.AvgTimeBetweenTurns = totalTBT / time.Duration(validIntervals) | |
| metrics.MaxTimeBetweenTurns = maxTBT | |
| copilotEventsJSONLLog.Printf("TBT computed: avg=%s max=%s intervals=%d", | |
| metrics.AvgTimeBetweenTurns, metrics.MaxTimeBetweenTurns, validIntervals) |
pkg/cli/copilot_events_jsonl_test.go
Outdated
| t.Run("no TBT when timestamps missing", func(t *testing.T) { | ||
| // Use realFormatEventsLine which always uses the same timestamp — no intervals. | ||
| content := realFormatEventsLine("user.message", `{"content":"turn 1"}`) + "\n" + | ||
| realFormatEventsLine("user.message", `{"content":"turn 2"}`) + "\n" + | ||
| realFormatEventsLine("session.shutdown", `{"shutdownType":"routine","totalPremiumRequests":2,"modelMetrics":{"m":{"usage":{"inputTokens":50,"outputTokens":5}}}}`) + "\n" | ||
|
|
There was a problem hiding this comment.
This subtest is titled "no TBT when timestamps missing", but the scenario is actually "timestamps present but identical" (via realFormatEventsLine). Renaming the subtest (or adjusting the comment) would make the intent clearer and avoid confusion when reading test failures.
| - **Anthropic** reduced their cache TTL from 1 hour to **5 minutes**. If the TBT for any turn exceeds 5 minutes, the cached prompt context expires and the full prompt must be re-processed, significantly increasing token costs. | ||
| - **OpenAI** has a similar server-side prompt cache with variable TTL. | ||
|
|
||
| `gh aw audit` reports both average and maximum TBT in the Session Analysis section. A cache warning is emitted when the maximum observed TBT exceeds the Anthropic 5-minute threshold. For Copilot engine runs, TBT is computed from precise per-turn timestamps in the `events.jsonl` session log; for other engines it is estimated from total wall time divided by turn count. |
There was a problem hiding this comment.
The glossary says the cache warning is emitted when the maximum observed TBT exceeds the 5-minute threshold, but the implementation also emits a warning for the wall-time estimated average TBT when per-turn timestamps/max aren’t available. Consider updating this description to reflect that the warning can be based on max (when available) or estimated avg (fallback).
| `gh aw audit` reports both average and maximum TBT in the Session Analysis section. A cache warning is emitted when the maximum observed TBT exceeds the Anthropic 5-minute threshold. For Copilot engine runs, TBT is computed from precise per-turn timestamps in the `events.jsonl` session log; for other engines it is estimated from total wall time divided by turn count. | |
| `gh aw audit` reports both average and maximum TBT in the Session Analysis section. A cache warning is emitted when the TBT used for cache analysis exceeds the Anthropic 5-minute threshold: the maximum observed TBT for Copilot engine runs, where precise per-turn timestamps are available in the `events.jsonl` session log, or the estimated average TBT for other engines, where TBT is derived from total wall time divided by turn count. |
|
Commit pushed:
|
🏗️ Design Decision Gate — ADR RequiredThis PR makes significant changes to core business logic (207 new lines in AI has analyzed the PR diff and generated a draft ADR to help you get started: 📄 Draft ADR: What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision. Why ADRs Matter
ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you. 📋 Michael Nygard ADR Format ReferenceAn ADR must contain these four sections to be considered complete:
All ADRs are stored in
|
🧪 Test Quality Sentinel ReportTest Quality Score: 95/100✅ Excellent test quality
Test Classification Details
Flagged Tests — Requires ReviewNo tests require changes. One minor observation: i️ Sub-test: "no TBT when timestamps missing" (
|
There was a problem hiding this comment.
✅ Test Quality Sentinel: 95/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). The new TestParseEventsJSONLFile_TBT function covers the TBT behavioral contract with a happy-path case and two edge cases (single turn, identical timestamps), all verified end-to-end without mocks. Two minor observations (missing MaxTimeBetweenTurns assertion in one sub-test; setup calls without message arguments) do not affect correctness.
|
@copilot review comments |
- Use validIntervals count (not len-1) when averaging TBT to avoid underestimating when identical/out-of-order timestamps are skipped - Fix wall-time fallback to divide by Turns-1 intervals (not Turns) in both audit_expanded.go and logs_report.go - Rename misleading test subtest to clarify it tests identical timestamps - Update glossary to accurately describe both warning code paths (max TBT for Copilot, estimated avg TBT for other engines) Agent-Logs-Url: https://github.com/github/gh-aw/sessions/16bad17c-1b59-442d-acc5-1be6db1fc7e2 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed all 5 review findings in commit
|
Summary
Adds Time Between Turns (TBT) — an important agentic performance metric that determines whether LLM prompt caching is effective. Anthropic recently reduced their prompt cache TTL from 1 hour to 5 minutes. When the time between consecutive LLM API calls (TBT) exceeds this TTL, the prompt cache expires and every turn incurs full re-processing costs.
Changes
Core metric (
pkg/workflow/metrics.go)AvgTimeBetweenTurnsandMaxTimeBetweenTurns(time.Duration) fields toLogMetricsCopilot engine parsing (
pkg/cli/copilot_events_jsonl.go)user.messageevents inevents.jsonlAudit command (
pkg/cli/audit_expanded.go)AvgTimeBetweenTurns,MaxTimeBetweenTurns, andCacheWarningfields toSessionAnalysisbuildSessionAnalysispopulates TBT fromLogMetricswhen available; falls back toDuration / TurnsestimateAudit rendering (
pkg/cli/audit_report_render_overview.go)renderSessionAnalysisdisplays Avg/Max TBT and a cache-expiry warning in the Session Analysis sectionLogs command (
pkg/cli/logs_models.go,logs_orchestrator.go,logs_report.go)AvgTimeBetweenTurns time.DurationtoWorkflowRunRunData.AvgTimeBetweenTurnsexposed in JSON output (gh aw logs --json)Documentation (
docs/src/content/docs/reference/glossary.md)Tests (
pkg/cli/copilot_events_jsonl_test.go)TestParseEventsJSONLFile_TBT: three sub-cases verifying correct avg/max computation, single-turn no-op, and identical-timestamp no-op