
feat: add Time Between Turns (TBT) to audit/logs command and glossary #26321

Merged
pelikhan merged 3 commits into main from copilot/add-tbt-information-to-audit-logs
Apr 14, 2026
Conversation

Contributor

Copilot AI commented Apr 14, 2026

Summary

Adds Time Between Turns (TBT) — an important agentic performance metric that determines whether LLM prompt caching is effective. Anthropic recently reduced their prompt cache TTL from 1 hour to 5 minutes. When the time between consecutive LLM API calls (TBT) exceeds this TTL, the prompt cache expires and every turn incurs full re-processing costs.

Changes

Core metric (pkg/workflow/metrics.go)

  • Added AvgTimeBetweenTurns and MaxTimeBetweenTurns (time.Duration) fields to LogMetrics

Copilot engine parsing (pkg/cli/copilot_events_jsonl.go)

  • Parse RFC3339 timestamps from user.message events in events.jsonl
  • Compute precise average and maximum TBT from inter-turn intervals
  • Fall back to zero (no TBT) when timestamps are identical or only one turn is recorded

Audit command (pkg/cli/audit_expanded.go)

  • Added AvgTimeBetweenTurns, MaxTimeBetweenTurns, and CacheWarning fields to SessionAnalysis
  • buildSessionAnalysis populates TBT from LogMetrics when available; falls back to Duration / Turns estimate
  • Emits a cache warning when max TBT exceeds the Anthropic 5-minute TTL (or avg TBT when max is unavailable)

Audit rendering (pkg/cli/audit_report_render_overview.go)

  • renderSessionAnalysis displays Avg/Max TBT and a cache-expiry warning in the Session Analysis section

Logs command (pkg/cli/logs_models.go, logs_orchestrator.go, logs_report.go)

  • Added AvgTimeBetweenTurns time.Duration to WorkflowRun
  • Propagated through the logs orchestrator pipeline
  • RunData.AvgTimeBetweenTurns exposed in JSON output (gh aw logs --json)

Documentation (docs/src/content/docs/reference/glossary.md)

  • Added Time Between Turns (TBT) glossary entry explaining the metric, its relationship to prompt cache TTLs, and tips for reducing TBT

Tests (pkg/cli/copilot_events_jsonl_test.go)

  • TestParseEventsJSONLFile_TBT: three sub-cases verifying correct avg/max computation, single-turn no-op, and identical-timestamp no-op

- Add AvgTimeBetweenTurns and MaxTimeBetweenTurns to LogMetrics
- Parse TBT from user.message timestamps in events.jsonl (Copilot engine)
- Add TBT fields + cache warning to SessionAnalysis in audit command
- Render TBT with Anthropic 5-min cache-expiry warning in audit report
- Add AvgTimeBetweenTurns to WorkflowRun and RunData for logs command
- Propagate TBT through logs orchestrator pipeline
- Add TBT glossary entry to docs
- Add unit tests for TBT computation in parseEventsJSONLFile

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/ac5d0f93-faf6-480f-878a-17d194e35075

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI requested a review from pelikhan April 14, 2026 22:52
@pelikhan pelikhan marked this pull request as ready for review April 14, 2026 22:54
Copilot AI review requested due to automatic review settings April 14, 2026 22:54
Contributor

Copilot AI left a comment


Pull request overview

Adds Time Between Turns (TBT) as a first-class performance/cost metric across audit and logs reporting, with Copilot events.jsonl providing precise per-turn timing.

Changes:

  • Extend workflow.LogMetrics (and downstream run models) with Avg/Max TBT durations.
  • Parse Copilot events.jsonl timestamps to compute precise Avg/Max TBT; render TBT + cache-expiry warnings in gh aw audit.
  • Expose Avg TBT via gh aw logs --json and document TBT in the glossary.
Summary per file:

  • pkg/workflow/metrics.go: Adds Avg/Max TBT fields to LogMetrics.
  • pkg/cli/copilot_events_jsonl.go: Collects per-turn timestamps from events.jsonl and computes Avg/Max TBT.
  • pkg/cli/copilot_events_jsonl_test.go: Adds test coverage for TBT computation scenarios.
  • pkg/cli/audit_expanded.go: Adds TBT + cache-warning fields to SessionAnalysis and populates them (precise + fallback estimate).
  • pkg/cli/audit_report_render_overview.go: Renders Avg/Max TBT and cache warning in the audit overview.
  • pkg/cli/logs_models.go: Adds Avg TBT to WorkflowRun for propagation through the logs pipeline.
  • pkg/cli/logs_orchestrator.go: Propagates Avg TBT from extracted metrics into WorkflowRun.
  • pkg/cli/logs_report.go: Emits avg_time_between_turns in logs JSON output (precise + fallback estimate).
  • docs/src/content/docs/reference/glossary.md: Documents TBT, TTL implications, and mitigation tips.

Copilot's findings


Comments suppressed due to low confidence (2)

pkg/cli/logs_report.go:239

  • avg_time_between_turns is exposed in --json output, but the estimated-path appends " (estimated)" to the duration string. This makes the field harder to parse programmatically and inconsistent with other duration fields. Consider keeping avg_time_between_turns strictly as a duration string and adding a separate boolean/source field to indicate whether it was estimated.
		// Compute average TBT from metrics when available; fall back to wall-time / turns.
		if run.AvgTimeBetweenTurns > 0 {
			runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns)
		} else if run.Turns > 1 && run.Duration > 0 {
			runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns)) + " (estimated)"
		}

pkg/cli/audit_expanded.go:238

  • AvgTimeBetweenTurns is serialized to JSON for audit output, but the estimated path appends " (estimated)" directly into the value. This makes the JSON field non-uniform and harder to parse. Consider keeping the value as a plain duration string and adding a separate flag/source field (e.g., avg_time_between_turns_estimated or tbt_source).
	} else if metrics.Turns > 1 && run.Duration > 0 {
		// Fallback: average cycle time as a proxy for TBT.
		avgTBT := run.Duration / time.Duration(metrics.Turns)
		session.AvgTimeBetweenTurns = timeutil.FormatDuration(avgTBT) + " (estimated)"
		if avgTBT > anthropicCacheTTL {
  • Files reviewed: 9/9 changed files
  • Comments generated: 5

Comment on lines +234 to +238
// Compute average TBT from metrics when available; fall back to wall-time / turns.
if run.AvgTimeBetweenTurns > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns)
} else if run.Turns > 1 && run.Duration > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns)) + " (estimated)"

Copilot AI Apr 14, 2026


Fallback estimate for AvgTimeBetweenTurns divides wall time by run.Turns, but “time between turns” has turns-1 intervals. This will systematically under-estimate TBT (e.g., 2 turns should have 1 interval). Consider using run.Duration / time.Duration(run.Turns-1) when Turns > 1 so the estimate matches the definition used by the timestamp-based computation.

This issue also appears on line 234 of the same file.

Suggested change (before):

// Compute average TBT from metrics when available; fall back to wall-time / turns.
if run.AvgTimeBetweenTurns > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns)
} else if run.Turns > 1 && run.Duration > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns)) + " (estimated)"

Suggested change (after):

// Compute average TBT from metrics when available; fall back to wall-time / (turns - 1).
if run.AvgTimeBetweenTurns > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.AvgTimeBetweenTurns)
} else if run.Turns > 1 && run.Duration > 0 {
runData.AvgTimeBetweenTurns = timeutil.FormatDuration(run.Duration/time.Duration(run.Turns-1)) + " (estimated)"

}
} else if metrics.Turns > 1 && run.Duration > 0 {
// Fallback: average cycle time as a proxy for TBT.
avgTBT := run.Duration / time.Duration(metrics.Turns)

Copilot AI Apr 14, 2026


The wall-time fallback estimate for TBT uses run.Duration / time.Duration(metrics.Turns), but TBT is defined over turns-1 intervals between consecutive turns. For example, 2 turns should result in 1 interval, not 2. Using Turns-1 here will align the estimate with the timestamp-based computation.

This issue also appears on line 234 of the same file.

Suggested change (before):

avgTBT := run.Duration / time.Duration(metrics.Turns)

Suggested change (after):

avgTBT := run.Duration / time.Duration(metrics.Turns-1)

Comment on lines +288 to +302
for i := 1; i < len(turnTimestamps); i++ {
tbt := turnTimestamps[i].Sub(turnTimestamps[i-1])
if tbt > 0 {
totalTBT += tbt
if tbt > maxTBT {
maxTBT = tbt
}
}
}
intervals := len(turnTimestamps) - 1
if totalTBT > 0 && intervals > 0 {
metrics.AvgTimeBetweenTurns = totalTBT / time.Duration(intervals)
metrics.MaxTimeBetweenTurns = maxTBT
copilotEventsJSONLLog.Printf("TBT computed: avg=%s max=%s intervals=%d",
metrics.AvgTimeBetweenTurns, metrics.MaxTimeBetweenTurns, intervals)

Copilot AI Apr 14, 2026


TBT averaging currently divides by len(turnTimestamps)-1 even though zero/negative intervals are skipped (if tbt > 0 { ... }). If any adjacent timestamps are identical or out-of-order, totalTBT excludes those gaps but the divisor still counts them, which underestimates the average. Consider counting only the intervals you include (e.g., validIntervals++ when tbt > 0) and dividing by that count; also only set Avg/Max when validIntervals > 0.

Suggested change (before):

for i := 1; i < len(turnTimestamps); i++ {
tbt := turnTimestamps[i].Sub(turnTimestamps[i-1])
if tbt > 0 {
totalTBT += tbt
if tbt > maxTBT {
maxTBT = tbt
}
}
}
intervals := len(turnTimestamps) - 1
if totalTBT > 0 && intervals > 0 {
metrics.AvgTimeBetweenTurns = totalTBT / time.Duration(intervals)
metrics.MaxTimeBetweenTurns = maxTBT
copilotEventsJSONLLog.Printf("TBT computed: avg=%s max=%s intervals=%d",
metrics.AvgTimeBetweenTurns, metrics.MaxTimeBetweenTurns, intervals)

Suggested change (after):

validIntervals := 0
for i := 1; i < len(turnTimestamps); i++ {
tbt := turnTimestamps[i].Sub(turnTimestamps[i-1])
if tbt > 0 {
totalTBT += tbt
validIntervals++
if tbt > maxTBT {
maxTBT = tbt
}
}
}
if validIntervals > 0 {
metrics.AvgTimeBetweenTurns = totalTBT / time.Duration(validIntervals)
metrics.MaxTimeBetweenTurns = maxTBT
copilotEventsJSONLLog.Printf("TBT computed: avg=%s max=%s intervals=%d",
metrics.AvgTimeBetweenTurns, metrics.MaxTimeBetweenTurns, validIntervals)

Comment on lines +340 to +345
t.Run("no TBT when timestamps missing", func(t *testing.T) {
// Use realFormatEventsLine which always uses the same timestamp — no intervals.
content := realFormatEventsLine("user.message", `{"content":"turn 1"}`) + "\n" +
realFormatEventsLine("user.message", `{"content":"turn 2"}`) + "\n" +
realFormatEventsLine("session.shutdown", `{"shutdownType":"routine","totalPremiumRequests":2,"modelMetrics":{"m":{"usage":{"inputTokens":50,"outputTokens":5}}}}`) + "\n"


Copilot AI Apr 14, 2026


This subtest is titled "no TBT when timestamps missing", but the scenario is actually "timestamps present but identical" (via realFormatEventsLine). Renaming the subtest (or adjusting the comment) would make the intent clearer and avoid confusion when reading test failures.

- **Anthropic** reduced their cache TTL from 1 hour to **5 minutes**. If the TBT for any turn exceeds 5 minutes, the cached prompt context expires and the full prompt must be re-processed, significantly increasing token costs.
- **OpenAI** has a similar server-side prompt cache with variable TTL.

`gh aw audit` reports both average and maximum TBT in the Session Analysis section. A cache warning is emitted when the maximum observed TBT exceeds the Anthropic 5-minute threshold. For Copilot engine runs, TBT is computed from precise per-turn timestamps in the `events.jsonl` session log; for other engines it is estimated from total wall time divided by turn count.

Copilot AI Apr 14, 2026


The glossary says the cache warning is emitted when the maximum observed TBT exceeds the 5-minute threshold, but the implementation also emits a warning for the wall-time estimated average TBT when per-turn timestamps/max aren’t available. Consider updating this description to reflect that the warning can be based on max (when available) or estimated avg (fallback).

Suggested change (before):

`gh aw audit` reports both average and maximum TBT in the Session Analysis section. A cache warning is emitted when the maximum observed TBT exceeds the Anthropic 5-minute threshold. For Copilot engine runs, TBT is computed from precise per-turn timestamps in the `events.jsonl` session log; for other engines it is estimated from total wall time divided by turn count.

Suggested change (after):

`gh aw audit` reports both average and maximum TBT in the Session Analysis section. A cache warning is emitted when the TBT used for cache analysis exceeds the Anthropic 5-minute threshold: the maximum observed TBT for Copilot engine runs, where precise per-turn timestamps are available in the `events.jsonl` session log, or the estimated average TBT for other engines, where TBT is derived from total wall time divided by turn count.

@github-actions
Contributor

Commit pushed: cb1e5a7

🏗️ ADR gate enforced by Design Decision Gate 🏗️

@github-actions
Contributor

🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic (207 new lines in pkg/) but does not have a linked Architecture Decision Record (ADR).

AI has analyzed the PR diff and generated a draft ADR to help you get started:

📄 Draft ADR: docs/adr/26321-track-time-between-turns-tbt-metric.md

What to do next

  1. Review the draft ADR committed to your branch — it was generated from the PR diff
  2. Complete the missing sections — add context the AI couldn't infer, refine the decision rationale, and list real alternatives you considered
  3. Commit the finalized ADR to docs/adr/ on your branch
  4. Reference the ADR in this PR body by adding a line such as:

    ADR: ADR-26321: Track Time Between Turns (TBT) as an Agentic Performance Metric with Prompt Cache TTL Warning

Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter

"AI made me procrastinate on key design decisions. Because refactoring was cheap, I could always say 'I'll deal with this later.' Deferring decisions corroded my ability to think clearly."

ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.


📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete:

  • Context — What is the problem? What forces are at play?
  • Decision — What did you decide? Why?
  • Alternatives Considered — What else could have been done?
  • Consequences — What are the trade-offs (positive and negative)?

All ADRs are stored in docs/adr/ as Markdown files numbered by PR number (e.g., 26321-track-time-between-turns-tbt-metric.md for PR #26321).

🔒 This PR cannot merge until an ADR is linked in the PR body.


@github-actions
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 95/100

Excellent test quality

  • New/modified tests analyzed: 1 function (TestParseEventsJSONLFile_TBT) with 3 sub-tests
  • ✅ Design tests (behavioral contracts): 3 (100%)
  • ⚠️ Implementation tests (low value): 0 (0%)
  • Tests with error/edge cases: 3 (100%)
  • Duplicate test clusters: 0
  • Test inflation detected: No (64 test lines / 36 prod lines = 1.78:1 ✅)
  • 🚨 Coding-guideline violations: 0 (no mock libraries; build tag present; no missing assertion messages on behavioral assertions)

Test Classification Details

  • TestParseEventsJSONLFile_TBT / computes avg and max TBT from multi-turn timestamps (pkg/cli/copilot_events_jsonl_test.go): ✅ Design; no issues detected
  • TestParseEventsJSONLFile_TBT / no TBT when only one turn (pkg/cli/copilot_events_jsonl_test.go): ✅ Design; no issues detected
  • TestParseEventsJSONLFile_TBT / no TBT when timestamps missing (pkg/cli/copilot_events_jsonl_test.go): ✅ Design; minor: MaxTimeBetweenTurns not asserted in this sub-test

Flagged Tests — Requires Review

No tests require changes. One minor observation:

ℹ️ Sub-test: "no TBT when timestamps missing" (copilot_events_jsonl_test.go)

Classification: Design test ✅
Observation: This sub-test asserts AvgTimeBetweenTurns is zero when all timestamps are identical, but does not assert MaxTimeBetweenTurns. Since the happy-path sub-test already covers both fields, this is a very minor gap — adding assert.Zero(t, metrics.MaxTimeBetweenTurns, "max TBT should be zero when timestamps identical") would make the contract fully symmetric, but it is not required.

ℹ️ Setup calls without assertion messages

Several require.NoError(t, os.WriteFile(...)) setup calls in the new sub-tests omit a message argument (lines ~314, ~330, ~348). The guideline asks for descriptive messages on every assertion; infrastructure setup assertions (file creation) are a low priority to annotate, but it is worth adding messages for consistency — e.g. require.NoError(t, os.WriteFile(eventsPath, []byte(content), 0644), "should write test fixture").


Analysis

The new test function TestParseEventsJSONLFile_TBT uses the preferred table-driven t.Run() pattern and exercises parseEventsJSONLFile() end-to-end through real temp files (no mocks). All three sub-tests verify observable outputs that directly reflect the feature's behavioral contract:

  1. Happy path with exact math — 3 turns at known timestamps; asserts both AvgTimeBetweenTurns (45 s) and MaxTimeBetweenTurns (60 s) with specific expected values. Deletion would allow a regression in the TBT formula to go undetected.
  2. Boundary: single turn — Verifies the degenerate case where there are no intervals to compute, asserting both fields are zero. Catches any off-by-one in the "need ≥ 2 turns" guard.
  3. Edge case: identical timestamps — Verifies that zero-length intervals do not produce spurious non-zero TBT values, catching any division-by-zero or accumulation bug.

All assertions include descriptive string messages on the behavioral assertions (only the os.WriteFile setup calls lack messages — see observation above).


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 3 sub-tests — unit (//go:build !integration) ✅
  • 🟨 JavaScript (*.test.cjs, *.test.js): 0 tests changed

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). No coding-guideline violations detected. The new tests provide solid behavioral coverage for the Time Between Turns feature.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel

Contributor

@github-actions github-actions bot left a comment


✅ Test Quality Sentinel: 95/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). The new TestParseEventsJSONLFile_TBT function covers the TBT behavioral contract with a happy-path case and two edge cases (single turn, identical timestamps), all verified end-to-end without mocks. Two minor observations (missing MaxTimeBetweenTurns assertion in one sub-test; setup calls without message arguments) do not affect correctness.

@pelikhan
Collaborator

@copilot review comments

- Use validIntervals count (not len-1) when averaging TBT to avoid
  underestimating when identical/out-of-order timestamps are skipped
- Fix wall-time fallback to divide by Turns-1 intervals (not Turns)
  in both audit_expanded.go and logs_report.go
- Rename misleading test subtest to clarify it tests identical timestamps
- Update glossary to accurately describe both warning code paths
  (max TBT for Copilot, estimated avg TBT for other engines)

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/16bad17c-1b59-442d-acc5-1be6db1fc7e2

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 14, 2026

@copilot review comments

Addressed all 5 review findings in commit 0914a48:

  1. copilot_events_jsonl.go — TBT averaging now divides by validIntervals (count of positive gaps only), not len-1, so identical/out-of-order timestamps don't underestimate the average.
  2. audit_expanded.go — Wall-time fallback uses Duration / (Turns-1) (correct interval count).
  3. logs_report.go — Same Turns-1 fix for the fallback estimate there.
  4. copilot_events_jsonl_test.go — Renamed "no TBT when timestamps missing" → "no TBT when all timestamps are identical" to reflect what the test actually exercises.
  5. Glossary — Updated the TBT description to clarify the warning is based on max TBT (Copilot, precise timestamps) or estimated avg TBT (other engines, wall-time fallback).

@pelikhan pelikhan merged commit 7f524d3 into main Apr 14, 2026
@pelikhan pelikhan deleted the copilot/add-tbt-information-to-audit-logs branch April 14, 2026 23:16