
[WIP] RFC-0054: HUD Integration for Out-of-Tree CI Results #96

Draft
jewelkm89 wants to merge 3 commits into pytorch:master from jewelkm89:oot-hud-integration-rfc

Conversation

@jewelkm89 commented:

Summary

This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).

Data Flow

```mermaid
flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end

    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end

    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end

    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload validation\n• Payload caps (2MB)"]
    end

    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(completed only)")]
    end

    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end

    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-Hud-Internal-Bot\n{trusted, untrusted}"| API
    API -->|"PutItem"| DDB
    DDB --> STR --> REP -->|"completed only"| CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE
```

Key points:

  • Artifact URLs are included in the completed callback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouse
  • HUD pages read artifact_url from ClickHouse and render it as an external link — no direct connection between HUD and downstream storage
  • Only completed records are replicated to ClickHouse; in_progress stays in DynamoDB for mutable state tracking

What this RFC covers

  • Write path: Downstream CI → Result Handler → HUD API → DynamoDB → ClickHouse (completed records only)
    • in_progress callbacks → DynamoDB only (mutable state tracking)
    • completed callbacks → DynamoDB → replicated to ClickHouse (dashboard queries)
    • Artifact URLs flow through the callback payload, not sent directly to HUD
  • Read path: Three new HUD views:
    • /oot — Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
    • /oot/[org]/[repo] — Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
    • /pr/[number] — Collapsible "Out-of-Tree Backends" section in existing PR pages
  • Storage schemas: DynamoDB table and ClickHouse table designs (see the record-shape sketch after this list)
  • DB protection: Rate limiting (per-repo at relay), payload caps (2MB at HUD API)
  • Security: OIDC authentication, trusted/untrusted payload split, error handling strategy (delivered/hud_rejected/hud_unavailable/skipped), signed callback token proposal, state machine for status transitions
  • Sample payloads: In-progress, success, and failure callback examples with full field definitions
  • Implementation plan: 6-phase rollout with task-level breakdown:
    1. Storage Layer — DynamoDB + ClickHouse + replicator mapping
    2. HUD API Endpoint — types, validation, write logic
    3. Relay Integration — handler → HUD forwarding, rate limiting, reusable GHA action
    4. HUD Frontend Pages — 3 views + saved ClickHouse queries
    5. End-to-End Validation — real downstream repo testing
    6. Security Hardening — callback token, state machine (future)
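
To make the storage design concrete, here is a sketch of the flattened record shape implied by the schemas and the discussion in this thread — field names are illustrative assumptions, not the final DynamoDB/ClickHouse column list:

```ts
// Sketch of one OOT job record as stored in torchci-oot-workflow-job and
// replicated to default.oot_workflow_job. Field names are assumptions
// assembled from this thread (including review fixes below), not the
// reference implementation's schema.
interface OotWorkflowJobRecord {
  dynamoKey: string;        // {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
  repo: string;             // verified repo from the relay's trusted block
  workflow_name: string;    // github.workflow
  job_name: string;         // github.job
  run_attempt: number;      // github.run_attempt
  status: "in_progress" | "completed";
  conclusion?: "success" | "failure" | "cancelled";
  started_at?: string;      // ISO timestamp reported by the downstream action
  completed_at?: string;    // set on the completed callback
  queue_time?: number;      // relay-provided ci_metrics, in_progress only
  execution_time?: number;  // relay-provided ci_metrics, completed only
  artifact_url?: string;    // rendered as an external link on HUD pages
  head_sha?: string;
  pr_number?: number;
}
```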

Reference implementation

A reference implementation is available at subinz1/test-infra#1, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.

Status

This is a WIP draft. Feedback welcome.

Defines the HUD-side ingestion and display layer for OOT CI results,
building on RFC-0050 (Cross-Repository CI Relay). Covers write path,
storage schemas, DB protection, security, and three frontend views.
Reference implementation: subinz1/test-infra#1

meta-cla Bot commented Apr 29, 2026

Hi @jewelkm89!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla meta-cla Bot added the cla signed label Apr 29, 2026
@ZainRizvi (Contributor) left a comment:

A few quick high-level comments based on the PR description (I really appreciate the diagram, btw!):

  • I see some payload validation happening in HUD. Why not have that in the Relay Server instead? In HUD, the auth check should likely be limited to authenticating the relay server's header value, plus maybe a payload cap to ensure we don't overburden the database. Everything else will probably find a better home in the Relay Server, where the logic can be centralized. The payload cap is likely worth adding to the relay server as well (HUD's check is more of a sanity check, with HUD acting as the db guardian).

  • Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in_progress and completed, and that would offer more value as well. Current practice for in-tree CI is to update the job status in ClickHouse (in_progress → completed), and we should do the same for OOT. That's why we're storing the data in DynamoDB; otherwise we could have stored it in S3.

Security note: we should give the relay tool its own unique header to send to HUD, and only that header should be allowed to make these relay API calls. X-Hud-Internal-Bot is a more widely used key and so far has only been used for read-only operations. Let's not give it read-write access, since the key is at higher risk of leaking.

@ZainRizvi (Contributor) left a comment:

Nice work — the architecture diagrams are really clear and the RFC builds on RFC-0050 well. A few things to sort out, mostly around aligning with the L2 PR's actual wire format and a couple of bugs in the extraction logic.

```ts
const pr = cb.payload?.pull_request;

const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
const now = new Date().toISOString();
```
@ZainRizvi (Contributor) commented:

There are a couple of bugs here:

  • started_at is set to now() on every callback, including completed. That's the time HUD received the POST, not when the job started. On a completed callback, started_at ≈ completed_at, so the ClickHouse duration_seconds alias will always be ≈ 0. Neither started_at nor completed_at is available in the github context, so the action should capture its own wall-clock time and include it in the payload — in_progress sends started_at, completed sends completed_at. Then HUD stores what the action reports instead of using now().

  • queue_time is set from trusted.ci_metrics?.queue_time, which is null on completed callbacks (the relay only sends it on in_progress). This needs to skip setting queue_time when it's null rather than writing null over a valid value. Same applies to execution_time on in_progress callbacks.

Also, the DynamoDB write needs to use UpdateItem with SET expressions instead of PutItem. PutItem does a full item replace, so even with the above fixes, the completed callback would clobber queue_time and started_at from the in_progress record. UpdateItem lets each callback write only its own fields.

Reply:
All three fixed:

  • started_at/completed_at now use downstream-reported timestamps (wf.started_at, wf.completed_at) instead of new Date().toISOString(). Note: these fields need to be added to the L2 PR's callback action (action.yml), which currently doesn't send wall-clock timestamps.
  • queue_time/execution_time are only set when the relay provides a non-null value. in_progress sets queue_time; completed sets execution_time. Neither overwrites the other.
  • Changed from PutItem to UpdateItem with SET expressions throughout — Hop 3, data flow table, code samples, and implementation plan.
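
For illustration, a minimal sketch of the null-safe UpdateItem write described above, using the AWS SDK v3 document client — the table name comes from this RFC, but the key attribute and helper shape are assumptions, not the actual torchci code:

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Build a SET expression from only the fields this callback actually provides,
// so a completed callback cannot clobber queue_time/started_at written earlier
// by the in_progress callback (and vice versa).
async function upsertOotJob(dynamoKey: string, fields: Record<string, unknown>) {
  const names: Record<string, string> = {};
  const values: Record<string, unknown> = {};
  const sets: string[] = [];
  for (const [k, v] of Object.entries(fields)) {
    if (v === null || v === undefined) continue; // skip absent metrics/timestamps
    names[`#${k}`] = k;
    values[`:${k}`] = v;
    sets.push(`#${k} = :${k}`);
  }
  if (sets.length === 0) return; // nothing to write
  await ddb.send(
    new UpdateCommand({
      TableName: "torchci-oot-workflow-job",
      Key: { dynamoKey }, // assumed partition key attribute
      UpdateExpression: `SET ${sets.join(", ")}`,
      ExpressionAttributeNames: names,
      ExpressionAttributeValues: values,
    })
  );
}
```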

```ts
const wf = cb.workflow;
const pr = cb.payload?.pull_request;

const dynamoKey = `${trusted.verified_repo}/${cb.delivery_id}/${wf.name}`;
```
@ZainRizvi (Contributor) commented:

This key needs to include the job name and attempt number. The callback action runs inside a specific job, and if a downstream repo uses it in multiple jobs within the same workflow (build, test-cpu, test-gpu), all those jobs share the same github.workflow name and would write to the same dynamoKey — last writer wins, the rest are silently lost. And without the attempt number, job retries would overwrite the original run.

Let's add github.job and github.run_attempt to the key: {repo}/{delivery_id}/{workflow_name}/{job_name}/{attempt}. That gives each job its own row, preserves retry history, and makes the /oot/[org]/[repo] matrix view ("columns = OOT jobs") actually useful.
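
For illustration, a sketch of the expanded key construction — `trusted` and `cb` are the relay payload objects from the extraction code above, and the job_name/run_attempt fields are hypothetical until the L2 action actually sends them:

```ts
// One row per job per attempt: retries get their own key instead of
// overwriting the original run, and parallel jobs within one workflow
// no longer collide on the same key.
const dynamoKey = [
  trusted.verified_repo,    // e.g. "vendor/backend-repo"
  cb.delivery_id,
  cb.workflow.name,         // github.workflow
  cb.workflow.job_name,     // github.job — to be added to the L2 action
  cb.workflow.run_attempt,  // github.run_attempt — to be added as well
].join("/");
```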

Reply:
Thank you for the thorough review @ZainRizvi. Changed from {repo}/{delivery_id}/{workflow_name} to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}. Added job_name and run_attempt to both DynamoDB and ClickHouse schemas. The L2 callback action needs to be updated to include github.job and github.run_attempt in the payload.


```md
### Sample Payloads

#### In-Progress Callback
```
@ZainRizvi (Contributor) commented:

These samples don't match the actual wire format. They show a flat structure with top-level status, head_sha, pr_number, etc. But the L2 PR's action.yml sends a nested structure — delivery_id and payload from the dispatch envelope, with a workflow sibling containing status/conclusion/name/url. Your own TypeScript code (RelayPayload, extractDynamoRecord) parses the nested format correctly, so the code is right — but a downstream developer implementing from these samples would produce payloads the system can't parse. These need to match the actual {delivery_id, payload, workflow} structure from the action.

Reply:

Replaced all flat-structure samples with the actual nested wire format, shown at two stages:

Stage 1 (downstream → relay): {event_type, delivery_id, payload, workflow} — matching the callback action's output from action.yml
Stage 2 (relay → HUD): {trusted: {verified_repo, ci_metrics}, untrusted: {callback_payload: ...}} — matching forward_to_hud in hud.py
Also added a note about fields (job_name, run_attempt, started_at, completed_at) that need to be added to the L2 action.
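
For reference, a sketch of the two envelopes as TypeScript types, reconstructed from the descriptions in this thread rather than copied from the L2 PR — treat the exact field lists as assumptions:

```ts
// Stage 1: downstream callback action → relay (action.yml output).
interface CallbackEnvelope {
  event_type: string;
  delivery_id: string; // from the original dispatch envelope
  payload: {
    pull_request?: { number: number; head: { sha: string } };
  };
  workflow: {
    name: string; // github.workflow
    status: "in_progress" | "completed";
    conclusion?: string;
    url: string;
    job_name?: string;     // github.job — to be added to the action
    run_attempt?: number;  // github.run_attempt — to be added
    started_at?: string;   // sent on in_progress — to be added
    completed_at?: string; // sent on completed — to be added
  };
}

// Stage 2: relay → HUD (forward_to_hud in hud.py).
interface RelayPayload {
  trusted: {
    verified_repo: string;
    ci_metrics?: { queue_time?: number; execution_time?: number };
  };
  untrusted: {
    callback_payload: CallbackEnvelope;
  };
}
```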

This hop requires **no new code**. The existing infrastructure handles it:

1. **DynamoDB Streams** (enabled on the table with `NEW_AND_OLD_IMAGES`) captures every write
2. **`clickhouse-replicator-dynamo`** receives stream events and inserts `completed` records into ClickHouse (`in_progress` records are filtered out — they serve only as mutable state in DynamoDB)
@ZainRizvi (Contributor) commented:

in_progress records need to be replicated to ClickHouse too, not just completed. Current practice for in-tree CI is we update job status in ClickHouse (in_progress → completed) and OOT should follow the same pattern. SharedReplacingMergeTree already supports upserts. Without this, the /pr/N page wouldn't be able to show running jobs.

Reply:

Fixed — both in_progress and completed records now replicate to ClickHouse. The replicator forwards all DynamoDB stream events without filtering. SharedReplacingMergeTree handles the upsert — completed replaces in_progress for the same dynamoKey after merges.
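
With both statuses replicating, read-side queries have to pick the latest row per key. A sketch of what a HUD query could look like with @clickhouse/client, using FINAL to force ReplacingMergeTree deduplication at query time — column names are assumptions from this thread:

```ts
import { createClient } from "@clickhouse/client";

const clickhouse = createClient({ url: process.env.CLICKHOUSE_URL });

// Latest status per job for a PR: FINAL collapses the in_progress row once
// the completed row with the same dynamoKey has been replicated.
async function ootJobsForPr(repo: string, prNumber: number) {
  const rs = await clickhouse.query({
    query: `
      SELECT job_name, run_attempt, status, conclusion, artifact_url
      FROM default.oot_workflow_job FINAL
      WHERE repo = {repo:String} AND pr_number = {pr:UInt32}
      ORDER BY job_name, run_attempt
    `,
    query_params: { repo, pr: prNumber },
    format: "JSONEachRow",
  });
  return rs.json();
}
```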

| 2 | Schema validation + payload caps (2MB max body) | `400 Bad Request` |
| 3 | Write to DynamoDB | `502 Bad Gateway` |

The HUD API endpoint (`torchci/pages/api/oot/results.ts`) follows the existing `webhookToDynamo` pattern:
@ZainRizvi (Contributor) commented:

Schema validation and field extraction should live in the Relay Server, not HUD. The relay already does OIDC, allowlist checks, and rate limiting — adding schema validation there keeps HUD simple (authenticate the relay, write what it's told) and avoids duplicating validation if other consumers of relay data appear later. HUD should still enforce a payload size cap as a safety net.

Reply:

Done — schema validation has moved to the relay (Hop 1, step 5). The HUD API endpoint is simplified to: validate the X-OOT-Relay-Token header, enforce the 2MB payload cap, extract/flatten the record, and write via UpdateItem. HUD trusts that the relay has already validated the structure.
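
A sketch of the slimmed-down endpoint as a Next.js API route — the env var name, status codes, and helpers (extractDynamoRecord from this thread, upsertOotJob from the earlier sketch) are assumptions, not the reference implementation:

```ts
import type { NextApiRequest, NextApiResponse } from "next";

// Hypothetical helpers — see the extraction discussion and UpdateItem sketch above.
declare function extractDynamoRecord(body: unknown): {
  key: string;
  fields: Record<string, unknown>;
};
declare function upsertOotJob(key: string, fields: Record<string, unknown>): Promise<void>;

const MAX_BODY_BYTES = 2 * 1024 * 1024; // 2MB cap — HUD as the db guardian

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // Dedicated write-path credential, isolated from X-Hud-Internal-Bot.
  if (req.headers["x-oot-relay-token"] !== process.env.OOT_RELAY_TOKEN) {
    return res.status(403).json({ error: "unauthorized" });
  }
  // Sanity-check cap; a real route could also set bodyParser's sizeLimit.
  if (Buffer.byteLength(JSON.stringify(req.body ?? {})) > MAX_BODY_BYTES) {
    return res.status(400).json({ error: "payload too large" });
  }
  try {
    // The relay has already schema-validated; HUD just flattens and writes.
    const { key, fields } = extractDynamoRecord(req.body);
    await upsertOotJob(key, fields);
    return res.status(200).json({ status: "delivered" });
  } catch {
    return res.status(502).json({ error: "dynamodb write failed" });
  }
}
```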

| Hop | From → To | Auth Mechanism | Credential Scope |
|-----|-----------|----------------|------------------|
| 1 | Downstream → Result Handler | OIDC token (GHA-issued, 5 min TTL) | No secrets needed by downstream |
| 2 | Result Handler → HUD API | `X-Hud-Internal-Bot` header | Same pattern as DrCI / trymerge |
@ZainRizvi (Contributor) commented:

Let's use a dedicated auth header for the relay instead of X-Hud-Internal-Bot. That header is shared across multiple systems (DrCI, trymerge) and has only been used for read-only operations. Since this is a write path, giving the relay its own key limits blast radius if any single one leaks.

Reply:

Definitely @ZainRizvi — we have changed to X-OOT-Relay-Token, a dedicated header scoped only to the OOT relay write path. Added an explanation in the RFC that this isolates the write path, so key rotation or revocation doesn't affect other internal HUD consumers (DrCI, trymerge).

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: moved schema validation from HUD to relay (Hop 1)
- Replication: both in_progress and completed records go to ClickHouse
- Timestamps: use downstream-reported started_at/completed_at
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Schema: added job_name, run_attempt columns
- Rate limit: 10 → 20 req/min (matches L2 PR default)
- Error handling: updated to match actual retry + raise behavior
- Sample payloads: rewritten to match actual nested {trusted, untrusted} wire format

subinz1 commented May 1, 2026

> I see some payload validation happening in HUD. Why not have that in the Relay Server instead?

Hello @ZainRizvi — agreed; moved schema validation to the relay (Hop 1, step 5). The relay now validates delivery_id, workflow.status, and conclusion presence before forwarding. HUD is simplified to: auth (X-OOT-Relay-Token) + payload cap (2MB) + UpdateItem write. This aligns with the L2 PR's result_handler.py, which already validates these fields at the relay level.


subinz1 commented May 1, 2026

> Why does the Storage box show that only completed jobs will be replicated to ClickHouse? I think the default behavior would be to replicate both in_progress and completed, and that would offer more value as well. Current practice for in-tree CI is to update the job status in ClickHouse (in_progress → completed), and we should do the same for OOT. That's why we're storing the data in DynamoDB; otherwise we could have stored it in S3.

Fixed — removed all "completed only" language. Both in_progress and completed records now replicate to ClickHouse. SharedReplacingMergeTree handles the deduplication — when the completed record arrives for the same dynamoKey, it replaces the in_progress row after merges. This matches in-tree CI behavior and allows the /pr/N page to show running jobs.

> Security note: we should give the relay tool its own unique header to send to HUD, and only that header should be allowed to make these relay API calls. X-Hud-Internal-Bot is a more widely used key and so far has only been used for read-only operations. Let's not give it read-write access, since the key is at higher risk of leaking.

Changed everywhere — diagrams, data flow table, authentication flow table, code samples, and implementation plan. The relay now uses a dedicated X-OOT-Relay-Token header instead of the shared X-Hud-Internal-Bot. This isolates the OOT write path so rotating/revoking the token doesn't affect DrCI/trymerge.
