Skip to content

Add azure.ai.training extension#8130

Open
saanikaguptamicrosoft wants to merge 56 commits into
Azure:mainfrom
saanikaguptamicrosoft:foundry-training-dev
Open

Add azure.ai.training extension#8130
saanikaguptamicrosoft wants to merge 56 commits into
Azure:mainfrom
saanikaguptamicrosoft:foundry-training-dev

Conversation

achauhan-scc and others added 28 commits March 18, 2026 12:31
* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands
…to show

- Make job name optional in YAML; auto-generate {adj}_{noun}_{suffix} (matching AML SDK)
- Fix buildProjectEndpoint to use services.ai.azure.com (not cognitiveservices.azure.com)
- Rename 'job get' to 'job show' to match models/finetune extensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* integrating with API
…rt (Azure#7203)

- Add --skip-token flag for pagination with next-page UX message
- Add --tag and --properties flags for server-side filtering
- Add --include-archived flag for listViewType control
- Add SystemData (createdBy, createdAt) to job list output
- Update doDataPlane() to support variadic query params

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… code, and input resolution (Azure#7205)

- Rename job create command to job submit for consistency with finetune extension
- Add resolver interfaces: ComputeResolver, CodeResolver, InputResolver
- Add JobResolver orchestrator that resolves all references in JobDefinition
- Wire resolver into submit flow before buildJobResource()
- Stub implementations guide users to provide full ARM IDs / remote URIs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Custom training (Azure#7180)

* adding design detaiils for command job CLI

* adding more details

* adding dedup details

* adding api details

* adding execution plan

* adding draft version of custom training commands

* integrating with API

* adding -e -s override

* fixing asset resolution

* custom training: enhance job show, fix asset resolution, add full resource config support

- Enhanced job show with rich output: run history, metrics, artifacts, timing, compute info
- Added client APIs for run history, metrics, and artifacts endpoints
- Fixed dataset version field: json:dataType -> json:type
- Fixed input/output mode mapping: ro_mount -> ReadOnlyMount, rw_mount -> ReadWriteMount
- Added full resource config support: instanceType, shmSize, dockerArgs, properties
- Added ResourceDefinition YAML struct with AISuperComputer properties pass-through
- Backward compatible: flat instance_count still works

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* custom training: add spinner progress to job show command

Show animated spinner with progress text while fetching job details.
Updates text as each parallel fetch (run history, metrics, artifacts)
completes, showing remaining items until all data is loaded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 11, 2026 11:35
Copy link
Copy Markdown
Member

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addresses my prior feedback - TenantId fix, go version bump, SDK v2, and semver pin all look correct. One remaining item: the PR body is still empty. For a 14K-line new extension, even a brief summary (what training jobs are, how this relates to the finetune extension, key design decisions) would help reviewers and show up in the git log.

@wbreza
Copy link
Copy Markdown
Contributor

wbreza commented May 18, 2026

Test comment

Copy link
Copy Markdown
Contributor

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review — supplementing prior reviews

Thanks for the substantial work here, @saanikaguptamicrosoft. Below are findings that aren't already covered by Copilot's or @jongio's earlier comments. Verified that the UserTenantId fix, go 1.26.1 bump, semver pin, and CODEOWNERS scope are all in place. ✅

🟧 High

1. --debug flag prints HTTP request URLs & bodies to stdout — risk of leaking SAS URIs / sensitive payloads
pkg/client/client.go:91–103. SetDebugBody() is called from production commands (job_submit.go:93, job_stream.go, job_download.go, job_connect_ssh.go) when --debug is passed. Logs include URLs (which for dataset blob/SAS operations may include SAS tokens) and request bodies that flow through GetDatasetCredentials / GetModelCredentials. Repo guidance forbids logging credentials/tokens.
Fix: Redact sasUri, Authorization, sig, signature, and known credential fields before printing; or move this behind a separate developer-only env var and document that it may expose tokens.

🟨 Medium

2. SSH proxy stdin goroutine intentionally untracked
internal/cmd/job_ssh_proxy.go:167–182. The upstream stdin → websocket goroutine is not added to the sync.WaitGroup and blocks on os.Stdin.Read(). The comment acknowledges the Windows limitation and relies on process exit. Works for the short-lived ProxyCommand model, but wg.Wait() at line 211 cannot guarantee clean shutdown on graceful cancellation.
Fix (optional): Close os.Stdin from the cancellation watcher to force Read() to return, or document the lifecycle expectation more visibly.

3. azcopy runner silently drops malformed JSON and parse errors
internal/azcopy/runner.go:167 (JSON unmarshal) and :175–177 (ParseInt). On azcopy output format drift, progress reporting silently degrades to "0 bytes" instead of surfacing the problem.
Fix: Log a one-time warning on first unparseable line; preserve raw text for diagnostics.

4. Nil-guard / pattern duplication on job.Properties.Services access
internal/service/stream_service.go:100, internal/cmd/job_download.go:350 (extractTrackingEndpoint), internal/cmd/job_show_services.go:45. Three inline reimplementations of "extract service endpoint" without consistent nil guards. Sparse API responses can omit Properties or Services.
Fix: Centralize as internal/utils/ExtractServiceEndpoint(job, name) with proper nil checks; replace the three call sites.

5. Critical untested code paths
Production files with zero test coverage: internal/service/upload_service.go (dedup + zombie recovery + sentinel tagging), internal/service/hash.go (deterministic directory hashing — drives dedup correctness), pkg/client/download.go (artifact/output downloads). The team's existing test files (resolver_test.go, job_validator_test.go, job_show_services_test.go) demonstrate solid patterns with fakes — this is a coverage gap on the most failure-prone code.
Fix (recommended before merge): Unit tests for hash.go (determinism + symlink + cross-platform path normalization) and upload_service.go (hash collision fallback, missing SAS URI, azcopy failure).

6. PR description is still empty
Reiterating @jongio's request — for a 14K-line new first-party extension, the empty description hurts git log archaeology, future reviewers, and release notes. A brief summary of what training jobs are, how this relates to azure.ai.finetune, key design decisions, and a pointer to the included Design/ docs would help significantly.

7. PR description references job get, but the actual command is job show
internal/cmd/job.go:32–42 registers newJobShowCommand() (defined in job_get.go). There is no newJobGetCommand. Commit history shows a rename from job getjob show. The PR description and screenshots still reference job get.
Fix: Update PR description and any docs (including content in Design/command-job-guide.md) to match the shipped command names.

🟦 Low

  1. extension.yaml is missing an example for the metadata capability (other first-party extensions list examples for every declared capability).
  2. proxyEndpointPattern (job_ssh_proxy.go) allows %; if any downstream consumer URL-decodes the endpoint, this could enable host/scheme bypass. Quick audit of buildTunnelWSURL recommended.
  3. internal/azcopy/runner.go sourceArg lacks defense-in-depth shell-metachar validation (#nosec G204 only covers azcopyPath).
  4. AdditionallyAllowedTenants: []string{"*"} in init.go:86 and job_ssh_proxy.go:265 is intentional for multi-tenant/guest support but should carry an inline security comment explaining the wide scope.
  5. Missing godoc on exported symbols in pkg/client/ and pkg/models/ per repo standards.
  6. ~140 KB of Design/ markdown shipped in the extension source tree (design.md 75 KB, execution-plan.md 40 KB, command-job-guide.md 25 KB). Will drift from code without a sync mechanism. Consider moving to repo-level docs/ or wiki, or trimming to a README link.
  7. Client init boilerplate duplicated across ~15 command files. Extract newDefaultJobClient(ctx) helper to centralize credential/endpoint/client construction.

✅ Verified clean / resolved

  • Subscription.UserTenantId correctly used at init.go:105.
  • CODEOWNERS scope creep: only the training-extension line added — no co-owner additions to unrelated paths.
  • go.mod is go 1.26.1 with semver-pinned azd dependency v1.25.1, aligned with extension.yaml's requiredAzdVersion.
  • safeJoin in download.go correctly prevents zip-slip.
  • resolver_test.go, job_validator_test.go, job_show_services_test.go, init_template_test.go are solid examples.
  • All 9 required CI checks passing.

Posted by Copilot CLI on behalf of the reviewer.

@saanikaguptamicrosoft
Copy link
Copy Markdown
Collaborator Author

saanikaguptamicrosoft commented May 19, 2026

@wbreza

  1. Traced every c.debugBody print site — only control-plane URLs (*.api.azureml.ms, auth via Authorization header) and request bodies (job definitions) are logged; SAS URIs come back in response bodies which are never logged, and GetBlobContent's SAS-bearing request URL has no debug print at all, so no token is exposed today.
  2. az AML CLI takes the identical approach: stdin reader runs untracked, blocked in a blocking read, reaped by process exit when the short-lived ProxyCommand subprocess terminates. There's no portable way to interrupt os.Stdin.Read() on Windows, and az ml deliberately doesn't try. The existing comment block at lines 127–133 documents this; keeping as-is for consistency with the canonical AML SSH proxy behavior.
  3. Addressed
  4. Addressed
  5. Added unit tests for hash.go (8 cases: determinism, order/content/path sensitivity, symlink skip, missing dir) and upload_service.go (12 cases including dedup hit, zombie recovery, hash collision, missing SAS URI, and azcopy failure propagation) — extracted small unexported interfaces on UploadService so the flow is testable without an HTTP server. Deferred pkg/client/download.go (HTTP plumbing — best covered via integration) to a follow-up.
  6. PR description is added
  7. Good catch — renamed job_get.gojob_show.go. PR already mentions job show.
  8. Fixed 8 and 13. Other low priority skipped.
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add extension for job training

6 participants