feat(k8s): support promql historical data queries#82
feat(k8s): support promql historical data queries#82
Conversation
There was a problem hiding this comment.
Pull request overview
Adds first-class historical querying support to the Sysdig Monitor k8s_list_* MCP tools by introducing shared start/end RFC3339 parameters, validating/resolving the time window, and wrapping PromQL appropriately for “windowed” queries while preserving legacy instant-snapshot behavior when omitted.
Changes:
- Introduce shared time-window parsing/validation utilities (
TimeWindow,ParseTimeWindow,WithTimeWindowParams) and apply them acrossk8s_list_*Monitor tools. - Update affected tools to pass
GetQueryV1Params.Time(eval atend) and set a 60s timeout for windowed queries; deprecateintervalfor HTTP/network error tools with precedence rules + warnings. - Expand test coverage to include windowed query construction using an injected clock; document the behavior in tool docs and top-level README.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| internal/infra/mcp/tools/utils.go | Adds shared time-window types/helpers and request arg presence detection. |
| internal/infra/mcp/tools/tools_suite_test.go | Adds shared test helpers for expected windowed query params and limits. |
| internal/infra/mcp/tools/tool_k8s_list_workloads.go | Adds start/end support, eval time + timeout, and windowed PromQL wrapping logic. |
| internal/infra/mcp/tools/tool_k8s_list_workloads_test.go | Adds mock clock + windowed query expectation cases. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_memory_quota.go | Adds start/end support; wraps usage/limit in avg_over_time when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_memory_quota_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_cpu_quota.go | Adds start/end support; wraps usage/quota in avg_over_time when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_underutilized_pods_cpu_quota_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_top_unavailable_pods.go | Adds start/end support and Sysdig-canonical windowed unavailable semantics (min_over_time >= 1). |
| internal/infra/mcp/tools/tool_k8s_list_top_unavailable_pods_test.go | Adds mock clock + windowed semantics tests. |
| internal/infra/mcp/tools/tool_k8s_list_top_restarted_pods.go | Adds start/end support; uses increase() when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_restarted_pods_test.go | Adds mock clock + windowed increase() expectations. |
| internal/infra/mcp/tools/tool_k8s_list_top_network_errors_in_pods.go | Adds start/end support; deprecates interval with precedence and windowed sum_over_time/N behavior. |
| internal/infra/mcp/tools/tool_k8s_list_top_network_errors_in_pods_test.go | Adds mock clock + windowed query expectations and clarifies legacy cases. |
| internal/infra/mcp/tools/tool_k8s_list_top_http_errors_in_pods.go | Adds start/end support; deprecates interval with precedence and windowed sum_over_time/N behavior. |
| internal/infra/mcp/tools/tool_k8s_list_top_http_errors_in_pods_test.go | Adds mock clock + windowed query expectations (including precedence over interval). |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_workload.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_workload_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_container.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_memory_consumed_container_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_workload.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_workload_test.go | Adds mock clock + windowed query expectations + invalid window tests. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_container.go | Adds start/end support; uses avg_over_time for windowed. |
| internal/infra/mcp/tools/tool_k8s_list_top_cpu_consumed_container_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_pod_containers.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_pod_containers_test.go | Adds mock clock + windowed inventory semantics test. |
| internal/infra/mcp/tools/tool_k8s_list_nodes.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_nodes_test.go | Adds mock clock + windowed inventory semantics test. |
| internal/infra/mcp/tools/tool_k8s_list_cronjobs.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_cronjobs_test.go | Adds mock clock + windowed inventory semantics test. |
| internal/infra/mcp/tools/tool_k8s_list_count_pods_per_cluster.go | Adds start/end support; uses avg_over_time for windowed pod counts. |
| internal/infra/mcp/tools/tool_k8s_list_count_pods_per_cluster_test.go | Adds mock clock + windowed query expectation. |
| internal/infra/mcp/tools/tool_k8s_list_clusters.go | Adds start/end support; uses max_over_time(...) > 0 inventory semantics when windowed. |
| internal/infra/mcp/tools/tool_k8s_list_clusters_test.go | Adds mock clock + windowed inventory semantics tests. |
| internal/infra/mcp/tools/README.md | Documents start/end semantics and the per-tool PromQL wrapping table + deprecation behavior. |
| cmd/server/main.go | Injects a system clock into all window-aware tools at registration time. |
| README.md | Adds a note pointing to the tool docs for windowed aggregation semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // RangeSelector returns the PromQL range-selector literal for this window, e.g. "[3600s]". | ||
| // The duration is rounded down to whole seconds so the selector is stable and debuggable. | ||
| func (w TimeWindow) RangeSelector() string { | ||
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | ||
| } |
There was a problem hiding this comment.
TimeWindow.RangeSelector() can emit a "[0s]" range when the resolved window is <1s. This can happen in the supported "start without end" case because end defaults to clk.Now() (sub-second precision) while start is second precision; if start is within the current second, End.Sub(Start) truncates to 0 seconds. PromQL range selectors do not accept 0s, and for HTTP/network error tools the derived windowSeconds can become 0, causing a division-by-zero in the query. Consider normalizing end to whole seconds (or rounding up) and/or rejecting windows shorter than 1s in ParseTimeWindow before returning the TimeWindow.
tembleking
left a comment
There was a problem hiding this comment.
Nit: pre-existing typo in utils.go:16 — "exampes" should be "examples". Not from this PR but you're touching the file, easy drive-by fix.
…erplate, and add missing ParseTimeWindow unit tests
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 38 out of 38 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
internal/infra/mcp/tools/utils.go:16
Exampleswrites toschema["exampes"], which looks like a typo and will prevent MCP JSON schema examples from being emitted/recognized. Rename the key to"examples"so tooling can surface these examples correctly.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // newWindowedQueryParams constructs the GetQueryV1Params value that a windowed tool | ||
| // invocation is expected to produce: Query string, Time = end.Unix() via FromQueryTime1, | ||
| // and a 60s Timeout. | ||
| func newWindowedQueryParams(query string, end time.Time) sysdig.GetQueryV1Params { | ||
| var qt sysdig.Time | ||
| Expect(qt.FromQueryTime1(end.Unix())).To(Succeed()) | ||
| timeout := sysdig.Timeout("60s") | ||
| return sysdig.GetQueryV1Params{ | ||
| Query: query, | ||
| Time: &qt, | ||
| Timeout: &timeout, | ||
| } |
There was a problem hiding this comment.
newWindowedQueryParams calls Expect(...) during helper execution. Since this helper is used while building DescribeTable Entry(...) values, it can run during package init / spec construction (before RegisterFailHandler / a running spec exists), which can panic or behave unpredictably. Prefer returning an error, or handle the FromQueryTime1 error without Gomega (e.g., panic with a clear message) and keep assertions inside It/table bodies.
| // RangeSelector returns the PromQL range-selector literal for this window, e.g. "[3600s]". | ||
| func (w TimeWindow) RangeSelector() string { | ||
| return fmt.Sprintf("[%ds]", int64(w.End.Sub(w.Start).Seconds())) | ||
| } | ||
|
|
||
| // WindowSeconds returns the duration of the window in whole seconds. | ||
| func (w TimeWindow) WindowSeconds() int64 { | ||
| return int64(w.End.Sub(w.Start).Seconds()) | ||
| } |
There was a problem hiding this comment.
RangeSelector / WindowSeconds compute whole seconds via Duration.Seconds() (float64) and then cast to int64, which truncates and can yield 0 for sub-second windows. That can produce invalid PromQL like [0s] and even / 0 in the HTTP/network error tools. Compute whole seconds using integer duration arithmetic (e.g., w.End.Sub(w.Start)/time.Second) and ensure the result is >= 1 for non-zero windows.
|
All done, not taking care of the Copilot comments as per this PR. |
Enable users the ability to query historical data (if provided a timeline) from metric tools that use Sysdig Monitor with no data restrictions.
Adds optional
start/endhistorical query parameters to all Sysdig Monitork8s_list_* tools, enabling LLMs to query Kubernetes metrics over a past time window instead of only the current snapshot.When provided, the underlying PromQL is wrapped in the appropriate aggregation for each tool:
avg_over_timeincreasemin_over_time >= 1sum_over_time / N (rate per second)max_over_time > 0When omitted, tools behave as before (instant snapshot)