Skip to content

experimental/ssh: reject unsupported serverless env v5 base environments#5825

Open
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-reject-env-v5
Open

experimental/ssh: reject unsupported serverless env v5 base environments#5825
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-reject-env-v5

Conversation

@TanishqDatabricks

Copy link
Copy Markdown
Collaborator

Changes

databricks ssh connect --base-environment <X> submits a serverless job and reaches the SSH server through the driver proxy. When the base environment uses serverless environment_version 5 or newer, the connection never succeeds: the server task runs and binds its port, but the driver proxy does not route to that port on the v5 runtime, so the client polls /metadata and gets 503 remote connection failure for the entire startup budget, then fails with an opaque timed out ... status code 503. A v4 base environment connects normally on the same workspace and binary — the only material difference is the environment version.

Serverless env v5 is not supported by the SSH tunnel (confirmed with the team; the routing gap is on the platform side). This makes the CLI fail fast with an actionable error instead of hanging until the timeout.

  • Added validateBaseEnvironmentVersion, called early in client.Run — before any workspace side effects (secret scope, key upload, notebook import) — so the rejection is immediate and leaves nothing behind.
  • The version is read from ListWorkspaceBaseEnvironmentsAll (Spec.EnvironmentVersion), which covers the display-name and workspace-base-environments/... resource-ID input forms.
  • The env.yaml path form (leading /) carries its version only inside the file, not the environments API, so it is not checkable here without net-new WSFS-download + YAML-parse code for the least-used form. Instead, the command emits a warning by default when a path form is used, so the user knows the version wasn't verified and that v5 will fail.
  • The check is fail-open: any inability to determine the version (list-call error, no matching entry, nil spec, empty/unparseable version) proceeds rather than blocking a valid connect on a transient error.

Introduces maxSupportedEnvironmentVersion = 4 alongside the existing minEnvironmentVersion = 4.

Why

Without this, a user pointing --base-environment at a v5 environment waits out the full CPU/GPU startup timeout (10–45 min) for an opaque 503, with no indication that the configuration is unsupported. This turns that into an instant, actionable error.

Tests

  • TestValidateBaseEnvironmentVersion (table-driven, mock workspace client): v5 rejected for display-name and resource-ID forms; v4 allowed; path form skipped (no API call); and the four fail-open cases (list error, no match, nil spec, unparseable version).
  • go test ./experimental/ssh/... passes; gofmt and go vet clean.
  • TestAccept/ssh acceptance suite passes.

Manual validation (e2-dogfood)

Scenario Result
--base-environment RemoteDevelopmentTest (v5, display name) ✅ immediate error, no job submitted
--base-environment workspace-base-environments/remotedevelopmenttest-... (v5, resource ID) ✅ immediate error
--base-environment ssh-tunnel-test-env (v4, display name) ✅ connects (control)
--base-environment /Workspace/.../env.yaml (path form) ✅ prints the WARNING, then proceeds

This pull request and its description were written by Isaac.

ssh connect --base-environment against a serverless environment on
environment_version 5+ never connects: the driver proxy does not route to the
server's port on that runtime, so the health check returns 503 for the whole
startup budget and the command fails with an opaque timeout. A v4 base
environment connects normally on the same workspace.

Reject a known-v5 base environment up front, before any workspace side effects,
with an actionable error. The version is read from the base-environment listing,
which covers the display-name and resource-ID input forms. The env.yaml path
form carries its version only inside the file and is not checkable here, so we
emit a warning by default that its version is unverified. The check is
fail-open: any inability to determine the version (list error, no match, nil
spec, unparseable version) proceeds rather than blocking a valid connect.

Co-authored-by: Isaac
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

Based on git history, these people are best suited to review:

  • @anton-107 -- recent work in experimental/ssh/internal/client/, ./

Eligible reviewers: @andrewnester, @denik, @janniklasrose, @pietern, @renaudhartert-db, @shreyas-goenka, @simonfaltum

Suggestions based on git history. See OWNERS for ownership rules.

@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Integration test report

Commit: 65b2901

Run: 28660473383

Env 🟨​KNOWN 🔄​flaky 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
💚​ aws linux 7 14 230 1045 5:13
💚​ aws windows 7 14 232 1043 4:09
💚​ aws-ucws linux 7 14 314 963 6:48
💚​ aws-ucws windows 7 14 316 961 6:48
🟨​ azure linux 3 1 15 230 1044 8:02
🔄​ azure windows 1 3 15 232 1042 7:12
💚​ azure-ucws linux 4 15 316 960 7:39
🔄​ azure-ucws windows 2 2 15 318 958 7:45
💚​ gcp linux 4 15 229 1046 5:43
💚​ gcp windows 4 15 231 1044 4:35
21 interesting tests: 14 SKIP, 4 RECOVERED, 3 KNOWN
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
💚​ TestAccept 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestFetchRepositoryInfoAPI_FromRepo 💚​R 💚​R 💚​R 💚​R 🟨​K 🔄​f 💚​R 💚​R 💚​R 💚​R
🟨​ TestFetchRepositoryInfoAPI_FromRepo/root 💚​R 💚​R 💚​R 💚​R 🟨​K 💚​R 💚​R 🔄​f 💚​R 💚​R
🟨​ TestFetchRepositoryInfoAPI_FromRepo/subdir 💚​R 💚​R 💚​R 💚​R 🟨​K 💚​R 💚​R 🔄​f 💚​R 💚​R
Top 5 slowest tests (at least 2 minutes):
duration env testname
3:04 azure windows TestAccept
2:59 azure-ucws windows TestAccept
2:57 gcp windows TestAccept
2:57 aws-ucws windows TestAccept
2:51 aws windows TestAccept

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants