Pass --force-refresh to CLI auth token command #752
mihaimitrea-db wants to merge 2 commits into main from stack/cli-force-refresh
Conversation
simonfaltum
left a comment
Swarm Review Summary (Isaac + Cursor, 2 rounds)
Verdict: Not ready yet. 4 Major, 11 Nit, 4 Suggestion. No Critical.
The design mirrors the Go/Python SDK implementations well, `DatabricksCliVersion` is clean, and the `buildCliCommand` test matrix is solid. The concerns are concentrated in `probeCliVersion` + `parseCliVersion` + `getCliVersion`:
The four Majors
- Parse failures are cached as successful `UNKNOWN` (line 263). Violates the "don't cache failures" contract: a transiently corrupt CLI response pins the SDK to the conservative fallback for the rest of the JVM lifetime.
- `asInt()` silently coerces non-integer fields to 0 (line 329). `{"Major":"v0","Minor":296,"Patch":0}` parses as `v0.296.0` and enables `--force-refresh`. Strictly worse failure mode than `UNKNOWN`.
- Subprocess stream draining: latent pipe-buffer deadlock (line 286). `waitFor` before draining stdout/stderr is the textbook Java `ProcessBuilder` gotcha. Works today because output is ~100 bytes; breaks the first time a CLI release adds a stderr banner.
- No tests for `getCliVersion` caching or `probeCliVersion` subprocess paths. The central correctness claim of this infrastructure, "cache successes, retry failures", is completely untested. Subprocess argument construction, OS wrapping, exit handling, timeout, and parse-failure handling also have zero coverage.
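To illustrate the third Major, the textbook fix is to drain stdout and stderr concurrently with the wait instead of after it. The sketch below shows one way to do that; the names (`CliProbeSketch`, `run`, `drain`) are illustrative and not the SDK's actual probe code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch: drain child output on background threads *while* waiting, so a
// chatty child can never fill the OS pipe buffer and deadlock the probe.
public class CliProbeSketch {

  // Start a daemon thread that copies a stream into a sink until EOF.
  static Thread drain(InputStream in, StringBuilder sink) {
    Thread t = new Thread(() -> {
      byte[] buf = new byte[4096];
      int n;
      try {
        while ((n = in.read(buf)) != -1) {
          sink.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
      } catch (IOException ignored) {
        // Stream closes when the child exits or is destroyed; nothing to do.
      }
    });
    t.setDaemon(true);
    t.start();
    return t;
  }

  static String run(List<String> cmd, long timeoutSec) throws Exception {
    Process p = new ProcessBuilder(cmd).start();
    p.getOutputStream().close(); // child can never block reading our stdin
    StringBuilder out = new StringBuilder();
    StringBuilder err = new StringBuilder();
    Thread tOut = drain(p.getInputStream(), out);
    Thread tErr = drain(p.getErrorStream(), err);
    if (!p.waitFor(timeoutSec, TimeUnit.SECONDS)) {
      p.destroyForcibly();
      // Best-effort partial stderr for diagnostics; the drainer may still run.
      throw new IOException("probe timed out after " + timeoutSec + "s; stderr so far: " + err);
    }
    tOut.join(1000);
    tErr.join(1000);
    if (p.exitValue() != 0) {
      throw new IOException("probe exited with " + p.exitValue() + ": " + err);
    }
    return out.toString();
  }
}
```

Because both pipes are emptied continuously, this also folds in the "capture stderr on timeout" suggestion below at no extra cost.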
What went well
- `DatabricksCliVersion` is clean, well-commented, with correct `compareTo`/`equals`/`hashCode`, and handles sentinels in `toString`.
- Cache-by-cliPath (not by `DatabricksConfig`) is the right key.
- "Don't cache failures" is the right policy and is explicitly documented; the bug is that the implementation doesn't match the policy for the parse-failure path.
- Explicit dev-build `(0, 0, 0)` sentinel with `isDefaultDevBuild()` is thoughtful.
- Removing `fallbackCmd` from `CliTokenSource` is a real simplification.
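The comparison semantics praised above can be illustrated with a minimal stand-in. This is a sketch, not the SDK's `DatabricksCliVersion`; the class and field names are illustrative:

```java
// Minimal (major, minor, patch) version with lexicographic compareTo and an
// atLeast() convenience, plus the (0,0,0) dev-build sentinel described above.
public class CliVer implements Comparable<CliVer> {
  final int major, minor, patch;

  CliVer(int major, int minor, int patch) {
    this.major = major;
    this.minor = minor;
    this.patch = patch;
  }

  // (0,0,0) is the default dev-build sentinel from the review.
  boolean isDefaultDevBuild() {
    return major == 0 && minor == 0 && patch == 0;
  }

  @Override
  public int compareTo(CliVer o) {
    if (major != o.major) return Integer.compare(major, o.major);
    if (minor != o.minor) return Integer.compare(minor, o.minor);
    return Integer.compare(patch, o.patch);
  }

  boolean atLeast(CliVer o) {
    return compareTo(o) >= 0;
  }

  @Override
  public String toString() {
    return "v" + major + "." + minor + "." + patch;
  }
}
```

With this shape, the `--force-refresh` gate is simply `detected.atLeast(new CliVer(0, 296, 0))`.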
Reviewer alignment
Isaac and Cursor independently converged on: the subprocess deadlock, the test-coverage gap, log noise (3 WARNs per failed probe), package-private constants, the `VERSION_CACHE` reset hook, and concurrent-probe dedup. That's a strong signal on the areas above.
Isaac found the `--host` null regression in `buildCoreCliCommand`; Cursor verified it is gated by `configure()`'s early return in the production path, so it's downgraded to Nit as a local-invariant concern (still worth a guard or a javadoc note).
Cursor found the cached-`UNKNOWN` bug and the `asInt()` coercion issue, the two most consequential correctness findings, during cross-review.
Full synthesis: inline comments below.
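The `asInt()` coercion finding above points toward strict field validation: reject anything that is not a bare non-negative integer instead of coercing it to 0. A minimal sketch (a hypothetical helper, not the SDK's parser):

```java
import java.util.regex.Pattern;

// Sketch: a strict integer parse for version fields. A value like "v0"
// throws instead of silently becoming 0, so {"Major":"v0",...} can no
// longer masquerade as v0.296.0. Illustrative only.
public class StrictVersionField {
  private static final Pattern BARE_INT = Pattern.compile("\\d+");

  // Returns the parsed int, or throws if the field is not a bare integer.
  static int strictInt(String raw) {
    if (raw == null || !BARE_INT.matcher(raw).matches()) {
      throw new IllegalArgumentException("not an integer version field: " + raw);
    }
    return Integer.parseInt(raw);
  }
}
```

Paired with the "don't cache failures" policy, a thrown exception here would propagate as a parse failure and correctly skip the cache.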
```java
ProcessBuilder pb = new ProcessBuilder(cmd);
pb.environment().putAll(env.getEnv());
Process process = pb.start();
```
[Nit] Child stdin is never closed, interactive prompts would hang until timeout.
ProcessBuilder leaves the child's stdin as an open pipe to the parent. If the CLI ever reads from stdin (a future interactive prompt, TTY-triggered behavior, some flag we didn't pass), the child blocks on `read()` and we eat the full 5-second timeout for every `configure()` call. Today `databricks version` doesn't read stdin, so this doesn't fire in practice, but it's a one-line fix to harden against it.
Suggestion:
```java
Process process = pb.start();
process.getOutputStream().close();
```

Or, before `start()`: `pb.redirectInput(ProcessBuilder.Redirect.from(new File(isWindows ? "NUL" : "/dev/null")))`.
Found by: Cursor.
```java
LOG.warn(
    "Databricks CLI {} does not support --force-refresh (requires >= {}). "
        + "The CLI's token cache may provide stale tokens.",
    version,
    CLI_VERSION_FOR_FORCE_REFRESH);
```
[Nit] Warning message has no actionable fix for the user.
"Databricks CLI v0.xxx does not support --force-refresh (requires >= v0.296.0). The CLI's token cache may provide stale tokens." - a user who sees this has no idea that the fix is to upgrade the CLI.
Suggestion: append an upgrade hint, e.g. "Upgrade via `brew upgrade databricks` or https://docs.databricks.com/dev-tools/cli/install.html to silence this warning." Match the phrasing used in the Go SDK's equivalent message for cross-SDK consistency.
Found by: Cursor.
```java
process.destroyForcibly();
throw new IOException(
    "timed out after "
        + VERSION_PROBE_TIMEOUT_SECONDS
        + "s waiting for `databricks version`");
```
[Suggestion] Capture and surface stderr on the timeout path for diagnostics.
On timeout, the `IOException` message is "timed out after 5s waiting for `databricks version`" with no clue why the CLI hung. Even partial stderr ("Initializing telemetry...", "Locking config dir...", "Checking for updates...") would help users diagnose AV scanners, lock contention, proxy issues, or first-run setup.
Suggestion: best-effort `readStream(process.getErrorStream())` after `destroyForcibly().waitFor(<short>)`, appending the result to the exception message. Wrap it in its own try/catch; the stream may already be closed. This folds naturally into the concurrent-drain fix for the Major finding above.
Found by: Isaac.
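The suggestion above might look like the following sketch. `bestEffortStderr` is a hypothetical helper, not SDK code, and the 500 ms grace period is an arbitrary choice for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

// Sketch: after a probe timeout, forcibly kill the child, give it a short
// grace period to die, then read whatever stderr it managed to produce so
// the timeout exception can carry a diagnostic hint.
public class TimeoutDiagnostics {

  static String bestEffortStderr(Process p) {
    try {
      p.destroyForcibly().waitFor(500, TimeUnit.MILLISECONDS);
      InputStream err = p.getErrorStream();
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      // Once the child is dead, the pipe's write end is closed, so this
      // loop drains the buffered output and then hits EOF.
      while ((n = err.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
      return bos.toString("UTF-8").trim();
    } catch (Exception e) {
      return ""; // stream may already be closed; diagnostics are best-effort
    }
  }
}
```

The caller would append the returned string to the timeout `IOException` message when it is non-empty.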
```java
try {
  DatabricksCliVersion version = probeCliVersion(cliPath, env);
  VERSION_CACHE.put(cliPath, version);
  return version;
```
[Suggestion] Log a DEBUG line on successful version detection.
When a user files a bug saying "force-refresh isn't being passed", the first diagnostic question is "what version did the probe detect?" Currently there's no way to answer without running under a debugger. One DEBUG line gives us free diagnosability.
Suggestion: `LOG.debug("Detected Databricks CLI {} at {}", version, cliPath);` after the successful `probeCliVersion` call.
Found by: Cursor.
```java
List<String> buildCliCommand(
    String cliPath, DatabricksConfig config, DatabricksCliVersion version) {
  List<String> cmd = buildCoreCliCommand(cliPath, config, version);
  if (version.atLeast(CLI_VERSION_FOR_FORCE_REFRESH)) {
    cmd.add("--force-refresh");
  } else if (version.equals(DatabricksCliVersion.UNKNOWN) || version.isDefaultDevBuild()) {
    // Detection failed or no version metadata — we can't prove the CLI lacks --force-refresh,
    // just failed to confirm it. The version probe already logged the underlying cause.
    LOG.warn(
        "Could not confirm --force-refresh support for Databricks CLI {} (requires >= {}). "
            + "The CLI's token cache may provide stale tokens.",
        version,
        CLI_VERSION_FOR_FORCE_REFRESH);
  } else {
    LOG.warn(
        "Databricks CLI {} does not support --force-refresh (requires >= {}). "
            + "The CLI's token cache may provide stale tokens.",
        version,
        CLI_VERSION_FOR_FORCE_REFRESH);
  }
  return cmd;
}
```
[Suggestion] Defensive copy before mutating the list from buildCoreCliCommand.
`buildCliCommand` mutates the list returned by `buildCoreCliCommand` via `cmd.add("--force-refresh")`. This works today only because `buildCoreCliCommand` and `buildHostArgs` happen to return `ArrayList`. A future refactor returning `List.of(...)` or `Collections.unmodifiableList(...)` will silently break this with `UnsupportedOperationException` at a call site that looks innocent.
Suggestion: either `List<String> cmd = new ArrayList<>(buildCoreCliCommand(...));` at the top of this method, or document the mutability contract in javadoc on `buildCoreCliCommand`. The copy is more robust for a trivial cost.
Found by: Isaac.
`--profile` on `databricks auth token` is a global Cobra flag, so old CLIs (< v0.207.1) silently accept it and fail later with `cannot fetch credentials` instead of `unknown flag: --profile`. The previous error-based fallback never matched, leaving the `--host` fallback as dead code.

This commit replaces the runtime fallback chain with version-based capability detection:

* `CliVersion` carries a (major, minor, patch) triple plus an `UNKNOWN` sentinel and a default-dev-build (0,0,0) check.
* `DatabricksCliCredentialsProvider` runs `databricks version --output json` once per CLI path (cached on success only, with a 5s timeout) and gates `--profile` on >= v0.207.1; everything else falls back to `--host` with a precise warning.
* `CliTokenSource` is simplified to a single `cmd`; the `fallbackCmd` parameter and the runtime "unknown flag" retry loop are removed.

Mirrors the equivalent refactors in the Go and Python SDKs:

* databricks/databricks-sdk-go#1605
* databricks/databricks-sdk-py#1377

Co-authored-by: Isaac
The SDK manages its own token caching via `CachedTokenSource`. When the SDK shells out to `databricks auth token`, the CLI may return a token from *its* own cache that is about to expire (or has already expired from the SDK's perspective), producing unnecessary refresh failures and retry loops.

The CLI added `--force-refresh` in v0.296.0 (databricks/cli#4767) to let callers bypass its cache. With the version-detection infrastructure from the parent PR already in place, opting in is a one-constant, one-branch change:

* Introduce `CLI_VERSION_FOR_FORCE_REFRESH = v0.296.0`.
* Split `buildCliCommand` into the existing profile/host decision (now `buildCoreCliCommand`) and a thin wrapper that appends `--force-refresh` when supported and otherwise logs a precise warning. Future capability-gated flags slot into the same wrapper.

Mirrors:

* databricks/databricks-sdk-go#1628
* databricks/databricks-sdk-py#1378

Co-authored-by: Isaac
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
Inputs:

Checks will be approved automatically on success.
🥞 Stacked PR
Use this link to review incremental changes.
Summary
Append `--force-refresh` to the `databricks auth token` command when the installed CLI is >= v0.296.0, so the CLI bypasses its internal token cache and hands the SDK a freshly minted token every time.

Mirrors databricks/databricks-sdk-go#1628 and databricks/databricks-sdk-py#1378. Requires the version-detection infrastructure from the parent PR #751.
Why
The SDK already manages its own token caching via `CachedTokenSource`. When the SDK decides it needs a new token and shells out to `databricks auth token`, the CLI may return a token from its own cache that is about to expire (or that has already expired from the SDK's perspective). That produces unnecessary refresh failures and retry loops on top of a value that the SDK was confident was fresh.

The CLI added `--force-refresh` in databricks/cli#4767 (motivated by databricks/cli#4564) specifically to let callers bypass the CLI's cache. With the version-detection infrastructure from the parent PR already in place, opting in is a one-constant, one-branch change.

What changed
Interface changes
None. `CliTokenSource` is not part of the public API surface.

Behavioral changes

- `databricks auth token` invocations now end with `--force-refresh` whenever the detected CLI is >= v0.296.0. Callers on older CLIs see no change.
- On older CLIs a `WARNING` is logged: "Databricks CLI <ver> does not support --force-refresh (requires >= v0.296.0). The CLI's token cache may provide stale tokens."
- `AzureCliCredentialsProvider` is unaffected: it does not pass through `DatabricksCliCredentialsProvider` and does not opt into version-gated flag selection.

Internal changes
- New constant `DatabricksCliCredentialsProvider.CLI_VERSION_FOR_FORCE_REFRESH = new DatabricksCliVersion(0, 296, 0)`.
- `buildCliCommand` is split into two helpers, matching the shape the Go and Python SDKs settled on after the same PR there:
  - `buildCoreCliCommand(cliPath, config, version)`: holds the existing profile-vs-host decision (moved out of `buildCliCommand`).
  - `buildCliCommand(cliPath, config, version)`: now a thin wrapper that calls `buildCoreCliCommand`, appends `--force-refresh` when `version.atLeast(CLI_VERSION_FOR_FORCE_REFRESH)`, and otherwise logs the unsupported-version `WARNING`.

Future version-gated flags slot into the same pattern: add a `CLI_VERSION_FOR_<flag>` constant and an `if version.atLeast(...)` block in `buildCliCommand`.
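That extension pattern can be sketched as follows. The version is modeled as a bare int triple for brevity, and `--some-future-flag` with its gating version is a made-up placeholder, not a real CLI capability:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the version-gated flag pattern: each capability is one constant
// plus one atLeast() branch in a thin wrapper over the "core" command.
public class CapabilityGate {
  static final int[] FORCE_REFRESH_MIN = {0, 296, 0};
  static final int[] SOME_FUTURE_FLAG_MIN = {0, 300, 0}; // hypothetical

  // Lexicographic >= on (major, minor, patch) triples.
  static boolean atLeast(int[] v, int[] min) {
    for (int i = 0; i < 3; i++) {
      if (v[i] != min[i]) return v[i] > min[i];
    }
    return true;
  }

  // Thin wrapper: copies the core command defensively (per the review
  // suggestion) and appends each flag its gate allows.
  static List<String> buildCliCommand(List<String> core, int[] version) {
    List<String> cmd = new ArrayList<>(core);
    if (atLeast(version, FORCE_REFRESH_MIN)) cmd.add("--force-refresh");
    if (atLeast(version, SOME_FUTURE_FLAG_MIN)) cmd.add("--some-future-flag"); // hypothetical
    return cmd;
  }
}
```

Adding a new gated flag touches only the constant table and one branch; the profile-vs-host logic in the core command stays untouched.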
Additional `testBuildCliCommand` parameterized cases in `DatabricksCliCredentialsProviderTest` cover the full matrix:

- `--host` + v0.296.0 → appends `--force-refresh`.
- `--profile` + v0.296.0 → `--profile` + `--force-refresh`.
- `--profile` + v0.207.1 → `--profile` only (too old for `--force-refresh`).
- `--host` + v0.295.0 → `--host` only (too old for `--force-refresh`).
- `--host` only, no `--force-refresh`.

All parent-PR tests continue to pass unchanged.