12 changes: 11 additions & 1 deletion README.md
@@ -185,7 +185,17 @@ playwright install chromium
Export credentials for the configured backend (for example, `OPENAI_API_KEY`
with `model_openai.yaml` or `ANTHROPIC_API_KEY` with `model_claude.yaml`). The
`image_qa` and `self_reflection` tools use the same configured model by default,
-so an Anthropic run does not require an OpenAI key. Then:
+so an Anthropic run does not require an OpenAI key.
+
+The browser backend is selected by `environment.browser_mode` (default `local`):
+
+| `browser_mode` | What the agent's scripts do | Required env |
+|----------------|-----------------------------|--------------|
+| `local` | Launch a local Playwright Chromium | — |
+| `browserbase` | Create a [Browserbase](https://browserbase.com) cloud session over CDP | `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID` |
+| `steel` | Create a [Steel](https://steel.dev) cloud session over CDP | `STEEL_API_KEY` |
+
+Then:

```bash
python -m webwright.run.cli \
41 changes: 35 additions & 6 deletions src/webwright/config/base.yaml
@@ -18,6 +18,7 @@
# - OPENAI_API_KEY (only when the configured agent or tool model_class is openai)
# - ANTHROPIC_API_KEY (only when stacking model_claude.yaml)
# - BROWSERBASE_API_KEY + BROWSERBASE_PROJECT_ID (only when browser_mode=browserbase)
+# - STEEL_API_KEY (only when browser_mode=steel)

model:
# model_class / model_name / endpoint come from the model modifier yaml.
@@ -57,11 +58,12 @@ environment:
command_timeout_seconds: 240
shell: /bin/bash
# Path to a shell file that exports credentials (BROWSERBASE_API_KEY,
-# BROWSERBASE_PROJECT_ID, ANTHROPIC_API_KEY, OPENAI_API_KEY, ...). Leave
-# empty to read these from the parent process environment instead.
+# BROWSERBASE_PROJECT_ID, STEEL_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY,
+# ...). Leave empty to read these from the parent process environment instead.
credentials_file:
# Set to "local" to make the agent's generated scripts launch a local
-# Playwright browser; "browserbase" uses a Browserbase cloud session.
+# Playwright browser; "browserbase" uses a Browserbase cloud session and
+# "steel" uses a Steel cloud session.
browser_mode: local
task_metadata_filename: task.json
final_script_name: final_script.py
@@ -115,9 +117,36 @@ agent:

## Browser Mode

-The harness exposes `BROWSER_MODE` to your scripts (value: `browserbase` or `local`).
+The harness exposes `BROWSER_MODE` to your scripts (value: `browserbase`, `steel`, or `local`).
- When `BROWSER_MODE=browserbase` (default): create a Browserbase cloud session via the
-`BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` env vars and connect over CDP.
+`BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` env vars and connect over CDP, e.g.:
+```python
+async with httpx.AsyncClient(timeout=30) as client:
+resp = await client.post(
+"https://api.browserbase.com/v1/sessions",
+headers={"x-bb-api-key": os.environ["BROWSERBASE_API_KEY"]},
+json={"projectId": os.environ["BROWSERBASE_PROJECT_ID"]},
+)
+resp.raise_for_status()
+session = resp.json()
+browser = await playwright.chromium.connect_over_cdp(session["connectUrl"])
+```
+- When `BROWSER_MODE=steel`: create a Steel cloud session via the `STEEL_API_KEY`
+env var (no project id) and connect over CDP to a constructed websocket URL, e.g.:
+```python
+key = os.environ["STEEL_API_KEY"]
+async with httpx.AsyncClient(timeout=30) as client:
+resp = await client.post(
+"https://api.steel.dev/v1/sessions",
+headers={"Steel-Api-Key": key},
+json={}, # optional: {"solveCaptcha": True, "useProxy": True}
+)
+resp.raise_for_status()
+session = resp.json()
+browser = await playwright.chromium.connect_over_cdp(
+f"wss://connect.steel.dev?apiKey={key}&sessionId={session['id']}"
+)
+```
- When `BROWSER_MODE=local`: launch a local Playwright Chromium browser
(`playwright.chromium.launch(...)`) instead. No external credentials required.

@@ -337,7 +366,7 @@ agent:
- The required final artifact is `{{ final_script_path }}`.
- Create `final_runs/run_<id>/` folders for every clean execution of the final script. Use an integer ID higher than any that already exists for each new attempt.
- Store each run's `final_script.py`, `final_script_log.txt`, and final verification screenshots **only** inside that run folder.
-- The browser mode is `{{ browser_mode }}`. Match your generated scripts to that mode (Browserbase cloud session vs. local Playwright launch).
+- The browser mode is `{{ browser_mode }}`. Match your generated scripts to that mode (Browserbase or Steel cloud session vs. local Playwright launch).

## Web Task Rules

24 changes: 17 additions & 7 deletions src/webwright/config/crafted_cli.yaml
@@ -33,7 +33,7 @@ agent:
- Put exactly one shell command in the `bash_command` string. Never emit raw Python or shell outside that field. Use heredocs (`python - <<'PY' ... PY`) to run Python inline when needed.
- Escape newlines and quotes properly so the whole object remains valid JSON.
- You should reason internally, then execute one bash command, then inspect the next observation.
-- There is NO persistent browser state. Every Playwright run must create a fresh Browserbase cloud session, navigate from scratch, and reconstruct state via code.
+- There is NO persistent browser state. Every Playwright run must create a fresh cloud browser session matching `BROWSER_MODE` (`browserbase` or `steel`), navigate from scratch, and reconstruct state via code.
- Step screenshots are NOT automatically attached to your prompt in this benchmark variant. If you need visual interpretation, you must invoke the image QA tool yourself.
- Set `"done": true` only when the task goal is complete and `final_script.py` is the final artifact.
- NEVER set `"done": true` in the same response as a non-empty `bash_command`. Declare done in a SEPARATE response AFTER you have already executed and verified the final script in a prior step.
@@ -69,7 +69,7 @@ agent:
`help=` (copied from the docstring), and a sensible default equal to the
concrete task value so that running `python final_script.py` with no
arguments reproduces the original task.
-5. The CLI must still perform the full end-to-end run (Browserbase session,
+5. The CLI must still perform the full end-to-end run (cloud browser session,
screenshots, `final_script_log.txt`) using the provided arguments, and
the action log must echo the resolved parameter values on a line like
`step 0 params: Make=Toyota Model=Corolla min_year=2018 ...` so the judge
@@ -94,8 +94,18 @@ agent:
SCREENSHOTS = WORKSPACE / "screenshots"
SCREENSHOTS.mkdir(parents=True, exist_ok=True)

-async def create_browserbase_session():
+async def create_cloud_browser_cdp_url():
+mode = os.environ.get("BROWSER_MODE", "browserbase")
async with httpx.AsyncClient(timeout=30) as client:
+if mode == "steel":
+key = os.environ["STEEL_API_KEY"]
+response = await client.post(
+"https://api.steel.dev/v1/sessions",
+headers={"Steel-Api-Key": key},
+json={},
+)
+response.raise_for_status()
+return f"wss://connect.steel.dev?apiKey={key}&sessionId={response.json()['id']}"
response = await client.post(
"https://api.browserbase.com/v1/sessions",
headers={
@@ -110,12 +120,12 @@ agent:
},
)
response.raise_for_status()
-return response.json()
+return response.json()["connectUrl"]

async def main():
-session = await create_browserbase_session()
+cdp_url = await create_cloud_browser_cdp_url()
async with async_playwright() as playwright:
-browser = await playwright.chromium.connect_over_cdp(session["connectUrl"])
+browser = await playwright.chromium.connect_over_cdp(cdp_url)
context = browser.contexts[0] if browser.contexts else await browser.new_context()
page = context.pages[0] if context.pages else await context.new_page()
await page.set_viewport_size({"width": 1280, "height": 1800})  # async API: must be awaited; 1280x1800 gives better desktop rendering and more visible content in screenshots
@@ -320,7 +330,7 @@ agent:
- The required final artifact is `{{ final_script_path }}`.
- Create `final_runs/run_<id>/` folders for every clean execution of the final script. Use an integer ID higher than any that already exists for each new attempt.
- Store each run's `final_script.py`, `final_script_log.txt`, and final verification screenshots **only** inside that run folder.
-- Always use Browserbase cloud sessions.
+- Always use a cloud browser session matching `BROWSER_MODE` (`browserbase` or `steel`), never a local launch.

## Web Task Rules

14 changes: 9 additions & 5 deletions src/webwright/config/task_showcase.yaml
@@ -53,11 +53,12 @@ environment:
command_timeout_seconds: 240
shell: /bin/bash
# Path to a shell file that exports credentials (BROWSERBASE_API_KEY,
-# BROWSERBASE_PROJECT_ID, ANTHROPIC_API_KEY, OPENAI_API_KEY, ...). Leave
-# empty to read these from the parent process environment instead.
+# BROWSERBASE_PROJECT_ID, STEEL_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY,
+# ...). Leave empty to read these from the parent process environment instead.
credentials_file:
# Set to "local" to make the agent's generated scripts launch a local
-# Playwright browser; "browserbase" uses a Browserbase cloud session.
+# Playwright browser; "browserbase" uses a Browserbase cloud session and
+# "steel" uses a Steel cloud session.
browser_mode: local
task_metadata_filename: task.json
final_script_name: final_script.py
@@ -111,9 +112,12 @@ agent:

## Browser Mode

-The harness exposes `BROWSER_MODE` to your scripts (value: `browserbase` or `local`).
+The harness exposes `BROWSER_MODE` to your scripts (value: `browserbase`, `steel`, or `local`).
- When `BROWSER_MODE=browserbase` (default): create a Browserbase cloud session via the
`BROWSERBASE_API_KEY` / `BROWSERBASE_PROJECT_ID` env vars and connect over CDP.
+- When `BROWSER_MODE=steel`: create a Steel cloud session via POST
+https://api.steel.dev/v1/sessions (header `Steel-Api-Key: $STEEL_API_KEY`), then
+connect over CDP to `wss://connect.steel.dev?apiKey=$STEEL_API_KEY&sessionId=<id>`.
- When `BROWSER_MODE=local`: launch a local Playwright Chromium browser
(`playwright.chromium.launch(...)`) instead. No external credentials required.

@@ -435,7 +439,7 @@ agent:
- `{{ workspace_dir }}/task_showcase/tasks/<short_id>/task.json`
- `{{ workspace_dir }}/task_showcase/tasks/<short_id>/report.json`
- Use `{{ task_id }}` as the preferred `<short_id>` when it is present and already URL-safe; otherwise derive a lowercase slug from the task title.
-- The browser mode is `{{ browser_mode }}`. Match generated scripts to that mode (Browserbase cloud session vs. local Playwright launch).
+- The browser mode is `{{ browser_mode }}`. Match generated scripts to that mode (Browserbase or Steel cloud session vs. local Playwright launch).

## Web Task Rules

10 changes: 6 additions & 4 deletions src/webwright/environments/local_workspace.py
@@ -17,12 +17,14 @@ class LocalWorkspaceEnvironmentConfig(BaseModel):
"""Shell-based workspace environment.

The agent drives a real browser through bash commands it generates inside this
-workspace. Two browser modes are exposed to those generated scripts via
+workspace. Three browser modes are exposed to those generated scripts via
environment variables:

* ``browser_mode = "browserbase"`` (default): the agent's scripts should
create a Browserbase cloud session. ``BROWSERBASE_API_KEY`` and
``BROWSERBASE_PROJECT_ID`` are forwarded if present.
+* ``browser_mode = "steel"``: the agent's scripts should create a Steel cloud
+session. ``STEEL_API_KEY`` is forwarded if present.
* ``browser_mode = "local"``: the agent's scripts should launch a local
Playwright browser (``playwright.chromium.launch(...)``).

@@ -36,7 +38,7 @@ class LocalWorkspaceEnvironmentConfig(BaseModel):
shell: str = "/bin/bash"
env: dict[str, str] = Field(default_factory=dict)
credentials_file: Path | None = None
-browser_mode: str = "browserbase" # "browserbase" or "local"
+browser_mode: str = "browserbase" # "browserbase", "steel", or "local"
task_metadata_filename: str = "task.json"
final_script_name: str = "final_script.py"
output_truncation_chars: int = 12000
@@ -165,9 +167,9 @@ def prepare(self, **kwargs) -> None:
self._task_metadata_path().write_text(json.dumps(kwargs, indent=2), encoding="utf-8")

def _browser_env(self) -> dict[str, str]:
-"""Forward Browserbase / browser-mode hints to the subprocess."""
+"""Forward cloud-browser / browser-mode hints to the subprocess."""
env: dict[str, str] = {"BROWSER_MODE": str(self.config.browser_mode or "browserbase")}
-for var in ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"):
+for var in ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID", "STEEL_API_KEY"):
value = self._credential_env.get(var) or os.environ.get(var)
if value:
env[var] = value
43 changes: 43 additions & 0 deletions tests/unit/test_browser_env.py
@@ -0,0 +1,43 @@
+# ABOUTME: Tests that LocalWorkspaceEnvironment._browser_env forwards the right
+# ABOUTME: BROWSER_MODE and cloud-browser credentials (Browserbase, Steel) to subprocesses.
+
+from webwright.environments.local_workspace import LocalWorkspaceEnvironment
+
+
+def test_browser_env_forwards_steel_mode_and_key(monkeypatch) -> None:
+monkeypatch.setenv("STEEL_API_KEY", "sk-steel-test")
+env = LocalWorkspaceEnvironment(browser_mode="steel")
+
+result = env._browser_env()
+
+assert result["BROWSER_MODE"] == "steel"
+assert result["STEEL_API_KEY"] == "sk-steel-test"
+
+
+def test_browser_env_omits_steel_key_when_unset(monkeypatch) -> None:
+monkeypatch.delenv("STEEL_API_KEY", raising=False)
+env = LocalWorkspaceEnvironment(browser_mode="steel")
+
+result = env._browser_env()
+
+assert "STEEL_API_KEY" not in result
+
+
+def test_browser_env_forwards_browserbase_credentials(monkeypatch) -> None:
+monkeypatch.setenv("BROWSERBASE_API_KEY", "bb-key")
+monkeypatch.setenv("BROWSERBASE_PROJECT_ID", "bb-project")
+env = LocalWorkspaceEnvironment(browser_mode="browserbase")
+
+result = env._browser_env()
+
+assert result["BROWSER_MODE"] == "browserbase"
+assert result["BROWSERBASE_API_KEY"] == "bb-key"
+assert result["BROWSERBASE_PROJECT_ID"] == "bb-project"
+
+
+def test_browser_env_defaults_mode_to_browserbase(monkeypatch) -> None:
+env = LocalWorkspaceEnvironment(browser_mode="")
+
+result = env._browser_env()
+
+assert result["BROWSER_MODE"] == "browserbase"
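
Since `_browser_env` forwards credentials only "if present", a run with `browser_mode=steel` and no `STEEL_API_KEY` would presumably fail only once a generated script tries to open a session. A minimal sketch of an up-front check, assuming the mode-to-variable mapping from the README table in this PR, might look like the following. The names `REQUIRED_ENV` and `missing_browser_credentials` are hypothetical and are not part of webwright.

```python
from collections.abc import Mapping

# Assumption: mirrors the "Required env" column of the README table in this PR.
# This mapping is illustrative only; webwright does not define it.
REQUIRED_ENV: dict[str, tuple[str, ...]] = {
    "local": (),
    "browserbase": ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"),
    "steel": ("STEEL_API_KEY",),
}


def missing_browser_credentials(mode: str, env: Mapping[str, str]) -> list[str]:
    """Return the variables required by `mode` that are unset or empty in `env`."""
    if mode not in REQUIRED_ENV:
        raise ValueError(f"unknown browser_mode: {mode!r}")
    return [name for name in REQUIRED_ENV[mode] if not env.get(name)]
```

Called as `missing_browser_credentials("steel", os.environ)`, it returns the names still to be exported, so a caller could report them before any subprocess is started.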