Commit decddb3

feat(eval): add gooddata-eval model-evaluation CLI (Phase 1 + Phase 2 + Langfuse)
New public package `gooddata-eval` with a `gd-eval` CLI that evaluates the GoodData AI agent against a dataset of natural-language questions.

Phase 1 (visualization evaluation):
- Layered core + thin argparse CLI; SSE agentic chat client (httpx); workspace LLM provider/model resolution and activation via GoodData SDK; local-folder and Langfuse dataset sources; visualization evaluator with strict checks (metrics/dimensions/filters/type, cross-ref, pass@K); console + JSON reports.
- Streaming per-item progress with latency (total, avg) and quality score.
- Provider flag accepts name or id; auto-switches workspace to the provider that offers the requested model.
- SSE fallback: captures visualization from create_adhoc_visualization tool call args when the data source is inaccessible.

Phase 2 (remaining agentic test kinds):
- metric_skill, alert_skill, search_tool: scored via tool call arguments.
- general_question + guardrail: LLM-as-judge via openai [llm-judge] extra, lazily imported so CLI starts without openai installed.
- Shared helpers: _deep_subset, LLMJudge, _text_utils.

Langfuse integration:
- Dataset source uses REST API via httpx (no Langfuse SDK, which is broken on Python 3.14). Requires LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST env vars; no extra package needed.
- Scoring sink (--langfuse, requires --langfuse-dataset): posts trace + 4 scores + dataset-run-item per evaluated item, creating the named experiment run automatically in Langfuse.
- Scores: pass_at_k, quality_score, value_score, latency_s.

102 tests, ruff + ty clean. CLI starts without openai installed.

JIRA: GDAI-1766
Risk: low (new isolated package; no changes to existing packages).
1 parent a21f854 commit decddb3

59 files changed

Lines changed: 4245 additions & 85 deletions

packages/gooddata-eval/README.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
# gooddata-eval

CLI to evaluate the GoodData AI agent against a dataset of natural-language
questions on a chosen workspace and LLM model.

## Install

    uv add gooddata-eval

Or install `gd-eval` as a standalone tool:

    uv tool install gooddata-eval

## Quick start

```bash
export GOODDATA_TOKEN='your-api-token'

gd-eval run \
  --host https://your.gooddata.cloud \
  --workspace demo \
  --dataset ./my-dataset \
  --model gpt-5.2 \
  --runs 2 \
  --json results.json
```

## All flags

### Connection

| Flag | Env var | Description |
|---|---|---|
| `--host HOST` | | GoodData host URL (e.g. `https://your.gooddata.cloud`). |
| `--token TOKEN` | `GOODDATA_TOKEN` | API token. Pass via flag or env var. |
| `--profile NAME` | | Profile name in `~/.gooddata/profiles.yaml` (same file as the `gdc` CLI). Provides host + token when both flags are omitted. |
| `--workspace ID` | | **Required.** Workspace id to evaluate against. |

### Dataset source (pick one)

| Flag | Description |
|---|---|
| `--dataset PATH` | Path to a flat folder of JSON files, one question per file. |
| `--langfuse-dataset NAME` | Pull dataset items by name from Langfuse. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` env vars. |
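
Because `--langfuse-dataset` depends on three environment variables, a pre-flight check can catch a missing credential before a run starts. The following is a minimal sketch, not part of `gd-eval`; the function name `missing_langfuse_vars` is hypothetical, and only the three variable names come from the table above.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Variable names taken from the flag table; everything else is illustrative.
REQUIRED_LANGFUSE_VARS = (
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def missing_langfuse_vars(env: Mapping[str, str] | None = None) -> list[str]:
    """Return the required Langfuse variables that are unset or empty."""
    source = os.environ if env is None else env
    return [name for name in REQUIRED_LANGFUSE_VARS if not source.get(name)]
```

Calling `missing_langfuse_vars()` with no argument inspects the current process environment; passing a mapping makes the check easy to test.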

### Model selection

| Flag | Description |
|---|---|
| `--model ID` | LLM model id to evaluate (e.g. `gpt-5.2`). Defaults to the workspace's currently active model. If the model is offered by a different provider than the active one, the workspace's active provider is switched automatically. |
| `--provider NAME_OR_ID` | LLM provider name or id. Use when `--model` is offered by multiple providers and you need to pick one. Accepts either the human-readable provider name or its UUID id. |

### Evaluation

| Flag | Default | Description |
|---|---|---|
| `--runs K` | `2` | Number of independent conversation runs per item (pass@K). An item passes if any run passes. |

### Output

| Flag | Description |
|---|---|
| `--json PATH` | Write a machine-readable JSON report (keyed by item id, with per-item scores) to this path. Console summary is always printed. |
| `--quiet` | Suppress per-item progress output. Only the final table and summary are printed. |

### Langfuse sink

| Flag | Description |
|---|---|
| `--langfuse` | Log evaluation results to Langfuse after each item. Requires `--langfuse-dataset` (so item ids can be linked to Langfuse dataset items). Creates a named experiment run (`gd-eval-{timestamp}-{model}`) in the Langfuse dataset. Requires `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST`. |

## Dataset format

A dataset is a folder of `.json` files, one per question. Each file must
contain a common envelope:

```json
{
  "id": "stable-unique-id",
  "dataset_name": "my_dataset",
  "test_kind": "visualization",
  "question": "Show revenue by quarter",
  "expected_output": { }
}
```

Supported `test_kind` values: `visualization`, `metric_skill`, `alert_skill`,
`search_tool`, `general_question`, `guardrail`.

See the full dataset specification for `expected_output` shapes per test kind.
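
The envelope above can be checked and written to a dataset folder with a few lines of Python. This is a sketch under stated assumptions: `validate_item` and `write_item` are hypothetical helpers, naming each file `<id>.json` is an illustrative choice (the format only requires one question per `.json` file), and the contents of `expected_output` are not validated here.

```python
from __future__ import annotations

import json
from pathlib import Path

# Keys and test kinds as listed in the "Dataset format" section.
ENVELOPE_KEYS = ("id", "dataset_name", "test_kind", "question", "expected_output")
SUPPORTED_TEST_KINDS = {
    "visualization",
    "metric_skill",
    "alert_skill",
    "search_tool",
    "general_question",
    "guardrail",
}


def validate_item(item: dict) -> list[str]:
    """Return a list of envelope problems; an empty list means the item looks valid."""
    problems = [f"missing key: {key}" for key in ENVELOPE_KEYS if key not in item]
    kind = item.get("test_kind")
    if kind is not None and kind not in SUPPORTED_TEST_KINDS:
        problems.append(f"unsupported test_kind: {kind}")
    return problems


def write_item(folder: Path, item: dict) -> Path:
    """Write one item as <id>.json (assumed naming) into a flat dataset folder."""
    problems = validate_item(item)
    if problems:
        raise ValueError("; ".join(problems))
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{item['id']}.json"
    path.write_text(json.dumps(item, indent=2), encoding="utf-8")
    return path
```

For example, `write_item(Path("./my-dataset"), item)` would produce a folder that can be passed to `--dataset`.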

## Supported test kinds

| test_kind | What the agent must produce | Extra required |
|---|---|---|
| `visualization` | Correct AAC visualization (metrics, dimensions, filters, type) | |
| `metric_skill` | `create_metric` tool call with correct MAQL and format | |
| `alert_skill` | `create_metric_alert` tool call with correct operator, threshold, trigger, filters, metric, recipients | |
| `search_tool` | `search_objects` tool call (correct function called = pass; correct arguments = quality score) | |
| `general_question` | Text answer judged by LLM | `[llm-judge]` |
| `guardrail` | Refusal/redirect (visualization response auto-fails) | `[llm-judge]` |
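
For the tool-call test kinds, one plausible way to compare expected arguments against what the agent actually sent is a recursive subset match. The sketch below is an assumption about how such a check could work, not the package's own code: the function `deep_subset` and its list semantics (each expected element must match some actual element, in any order) are illustrative choices, and the argument values in the usage note are made up.

```python
from __future__ import annotations

from typing import Any


def deep_subset(expected: Any, actual: Any) -> bool:
    """Return True if everything in `expected` also appears in `actual`, recursively."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        # Extra keys in `actual` are allowed; every expected key must match.
        return all(
            key in actual and deep_subset(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        if not isinstance(actual, list):
            return False
        # Assumed semantics: order-insensitive, extra actual elements are allowed.
        return all(any(deep_subset(e, a) for a in actual) for e in expected)
    # Scalars must be equal.
    return expected == actual
```

With this definition, `deep_subset({"operator": "GREATER_THAN"}, {"operator": "GREATER_THAN", "threshold": 100})` holds, while an expected `threshold` of `50` against an actual `100` does not.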

## Optional extras

### `[llm-judge]`: LLM-as-judge evaluators

`general_question` and `guardrail` items are scored by an LLM judge (GPT-4o)
that compares the agent's text response against your expected-output description.
This requires the OpenAI Python package and an API key:

```bash
uv add 'gooddata-eval[llm-judge]' # project dependency
# or, for the standalone gd-eval tool:
uv tool install 'gooddata-eval[llm-judge]'
```

Set your OpenAI key before running:

```bash
export OPENAI_API_KEY='sk-...'
```

Without `[llm-judge]`, items with `test_kind: general_question` or `guardrail`
are reported as **skipped**.


## Exit codes

| Code | Meaning |
|---|---|
| `0` | Run completed. Evaluation failures do **not** cause a non-zero exit; check the report. |
| `2` | Operational error: bad connection, missing model, unreadable dataset, missing credentials. |
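
Since exit code `0` only means the run completed, a CI wrapper has to read the JSON report to decide whether items failed. The following is a rough sketch of such a wrapper. The flags are the ones documented above; the helper names are hypothetical, and the report shape used by `failed_items` (item id mapped to a dict containing `pass_at_k`) is an assumption based on the `--json` description, not a documented schema.

```python
from __future__ import annotations

import subprocess


def build_command(host: str, workspace: str, dataset: str, *,
                  model: str | None = None, runs: int = 2,
                  json_path: str | None = None) -> list[str]:
    """Assemble a `gd-eval run` invocation from the documented flags."""
    cmd = ["gd-eval", "run", "--host", host, "--workspace", workspace,
           "--dataset", dataset, "--runs", str(runs)]
    if model:
        cmd += ["--model", model]
    if json_path:
        cmd += ["--json", json_path]
    return cmd


def describe_exit(code: int) -> str:
    """Map the documented exit codes to a short label."""
    if code == 0:
        return "completed"  # evaluation failures still exit 0
    if code == 2:
        return "operational error"
    return f"unexpected exit code {code}"


def failed_items(report: dict) -> list[str]:
    """List item ids whose pass_at_k is 0 or missing.

    ASSUMPTION: the report maps item id -> {"pass_at_k": 0 or 1, ...}.
    """
    return sorted(item_id for item_id, scores in report.items()
                  if not scores.get("pass_at_k"))


def run_eval(cmd: list[str]) -> str:
    """Run gd-eval (expects GOODDATA_TOKEN in the environment) and label the exit code."""
    completed = subprocess.run(cmd, check=False)
    return describe_exit(completed.returncode)
```

A pipeline could then fail the build when `failed_items(...)` is non-empty, even though `gd-eval` itself exited with `0`.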

## Scores (in JSON report and Langfuse)

| Score | Description |
|---|---|
| `pass_at_k` | 1 if any of the K runs passed strict checks, else 0. |
| `quality_score` | Fraction of strict check flags that are `True` (0.0–1.0). Shown in CLI as a percentage. |
| `value_score` | Weighted blend: 0.6 × quality + 0.2 × speed (where speed = max(0, 1 − latency/60s)). |
| `latency_s` | Average per-run latency in seconds. |
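
The score definitions in the table translate directly into arithmetic. The sketch below follows the formulas exactly as written there, including the 0.6 and 0.2 weights; the function names are hypothetical, and how flags are combined across the K runs is not specified in this README, so `quality_score` takes the flags of a single run. Returning 0.0 for an empty flag set is also an assumption.

```python
from __future__ import annotations


def pass_at_k(run_passed: list[bool]) -> int:
    """1 if any of the K runs passed strict checks, else 0."""
    return 1 if any(run_passed) else 0


def quality_score(check_flags: dict[str, bool]) -> float:
    """Fraction of strict check flags that are True (0.0 for no flags, by assumption)."""
    if not check_flags:
        return 0.0
    return sum(check_flags.values()) / len(check_flags)


def average_latency(run_latencies_s: list[float]) -> float:
    """Average per-run latency in seconds."""
    return sum(run_latencies_s) / len(run_latencies_s)


def value_score(quality: float, latency_s: float) -> float:
    """0.6 x quality + 0.2 x speed, with speed = max(0, 1 - latency/60s)."""
    speed = max(0.0, 1.0 - latency_s / 60.0)
    return 0.6 * quality + 0.2 * speed
```

For instance, three of four checks passing gives a quality of 0.75; with a 30-second average latency the speed term is 0.5, so the value score is about 0.55.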
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# (C) 2026 GoodData Corporation
[project]
name = "gooddata-eval"
version = "1.67.0"
description = "Evaluate the GoodData AI agent against your own questions and models."
readme = "README.md"
license = "MIT"
authors = [
    {name = "GoodData", email = "support@gooddata.com"}
]
keywords = ["gooddata", "ai", "evaluation", "llm", "analytics", "cli"]
requires-python = ">=3.10"
dependencies = [
    "gooddata-sdk~=1.67.0",
    "httpx>=0.27,<1.0",
    "orjson>=3.9.15,<4.0.0",
    "pydantic>=2.6,<3.0",
    "rich>=13.0,<15.0",
]
classifiers = [
    "Development Status :: 4 - Beta",
    "Environment :: Console",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Topic :: Scientific/Engineering",
    "Topic :: Software Development",
    "Typing :: Typed",
]

[project.optional-dependencies]
llm-judge = ["openai>=1.40,<2.0"]

[project.scripts]
gd-eval = "gooddata_eval.cli.main:main"

[project.urls]
Source = "https://github.com/gooddata/gooddata-python-sdk"

[dependency-groups]
test = [
    "pytest~=8.3.4",
    "pytest-cov~=6.0.0",
    "pytest-mock>=3.14.0",
]

[tool.hatch.build.targets.wheel]
packages = ["src/gooddata_eval"]

[tool.coverage.run]
source = ["gooddata_eval"]

[tool.coverage.paths]
source = [
    "src/gooddata_eval",
    "**/site-packages/gooddata_eval",
]

[tool.ty.analysis]
allowed-unresolved-imports = ["openai.**", "gooddata_api_client.**"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# (C) 2026 GoodData Corporation
"""gooddata-eval: evaluate the GoodData AI agent against your own datasets."""

from gooddata_eval._version import __version__

__all__ = ["__version__"]
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
# (C) 2026 GoodData Corporation
from importlib import metadata

try:
    __version__ = metadata.version("gooddata-eval")
except metadata.PackageNotFoundError:
    __version__ = "unknown-version"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
# (C) 2026 GoodData Corporation
