Skip to content

Feature/stack architecture simplification#396

Open
colinmxs wants to merge 81 commits into
developfrom
feature/stack-architecture-simplification
Open

Feature/stack architecture simplification#396
colinmxs wants to merge 81 commits into
developfrom
feature/stack-architecture-simplification

Conversation

@colinmxs
Copy link
Copy Markdown
Contributor

No description provided.

colinmxs added 30 commits May 20, 2026 09:57
Adds requirements, design, and tasks for consolidating the CDK app into PlatformStack and BackendStack with content-hash Docker builds, same-origin BFF routing, and three new CI workflows.
Scrapped to start fresh. Original spec is preserved in git history at
c558d89 for reference if needed.
…ility

Phase 1, Task 1 of the stack architecture simplification.

Adds the equivalence-test scaffolding under infrastructure/test/equivalence/:

- normalize.ts — snapshot+normalization utilities. Strips stack-boundary
  metadata, CDK-generated logical IDs, and intrinsic-function references
  (Ref / Fn::GetAtt / Fn::ImportValue / Fn::Sub / Fn::Join) so the
  comparison focuses on resource type + properties only. Sorts IAM
  policy Statement arrays by content so grant-order differences don't
  cause spurious diffs. Position-sensitive arrays (subnet IDs, AZs)
  are left alone.

- build-legacy-app.ts — synthesizes the 9-stack legacy architecture
  from a deterministic AppConfig, using the existing test/helpers/
  mock-config.ts SSM mock pipeline. Used by the equivalence test
  baseline (this commit) and will be reused by the equivalence gate
  (Task 12). Deleted in Task 17.

- snapshot-legacy.test.ts — captures legacy synth summaries to
  __snapshots__/legacy-synth.{all-features,minimal}.json for two
  representative configs (every flag on / every retained flag off).
  9 stacks / 302 resources for all-features; 6 stacks / 240 resources
  for minimal. Files exist as historical baselines an operator can
  diff against once the new architecture lands.

Also drops 38 stale .js / .d.ts compiled artifacts from infrastructure/
that were shadowing the May 2026 .ts sources (the .gitignore already
excludes them but they'd survived in the working tree from a March
build). The TypeScript compile is clean from sources only after this.

Test suite: 336 tests / 14 suites green (was 334 / 13).
Phase 1, Task 2 of the stack architecture simplification.

Adds:

- infrastructure/lib/constructs/ — empty subdirectories matching the
  construct boundaries from the plan (network, identity, data, rag,
  artifacts, mcp-sandbox, fine-tuning, spa, zones, ecr, app-api,
  inference-api, gateway, rag-ingestion). Each carries a .gitkeep
  so the structure ships before any constructs are extracted.

- constructs/README.md — documents the layout, naming convention,
  and the typed-prop-passing rule (no SSM cross-stack reads in new
  code; runtime SSM reads stay).

- test/equivalence/equivalence.test.ts — the harness for the
  perfection gate. Two describe blocks:

  1. 'wiring sanity' — runs every commit. Asserts the legacy app
     builds and snapshots cleanly, and asserts buildNewApp() throws
     a 'not implemented' error. This stays useful through Phase 2
     so we know the harness keeps compiling as constructs land.

  2. 'equivalence gate' (describe.skip) — the actual byte-equivalent
     comparison. Two cases (minimal config, all-features config).
     Currently skipped; flips to describe in Task 12 once
     PlatformStack + BackendStack are wired.

- test/equivalence/build-new-app.ts — stub builder that throws
  'not implemented'. Replaced with the real Platform + Backend
  instantiation in Tasks 10–12.

Test suite: 15 suites / 340 tests / 2 skipped (was 14 / 336).
Phase 2, Task 3 of the stack architecture simplification.

Extracts every resource defined inline in infrastructure-stack.ts into
construct classes under infrastructure/lib/constructs/, organized by
logical resource group rather than legacy stack origin:

  network/
    NetworkConstruct          — VPC, subnets, NAT, route tables
    AlbConstruct              — ALB + security group + listener
                                (HTTPS-on-:443 + HTTP-redirect when
                                certificateArn set, plain HTTP-on-:80
                                otherwise)
    EcsClusterConstruct       — single shared ECS cluster

  identity/
    AuthSecretConstruct       — JWT signing / session encryption secret
    VoiceTicketConstruct      — replay table + HMAC signing secret
    BffCookieKeyConstruct     — KMS CMK + 44-char data-key secret used
                                to derive the AES-256 cookie sealing key
    PlatformIdentityConstruct — shared AgentCore CfnWorkloadIdentity
    OAuthTablesConstruct      — providers/users tables + KMS key + client
                                secrets bag
    AuthProvidersConstruct    — providers table + DDB stream + secrets
    CognitoConstruct          — user pool + BFF confidential client +
                                domain + client secret persisted to
                                Secrets Manager
    ArtifactRenderTokenSecretConstruct — gated by config.artifacts.enabled

  data/
    AuthTablesConstruct       — OidcState, BFFSessions, Users (+ 4 GSIs),
                                AppRoles (+ 3 GSIs), ApiKeys (+ KeyHash GSI)
    QuotaTablesConstruct      — UserQuotas (5 GSIs), QuotaEvents
    CostTrackingTablesConstruct — SessionsMetadata, UserCostSummary,
                                  SystemCostRollup, ManagedModels
    AdminTablesConstruct      — UserSettings, UserMenuLinks
    FileUploadConstruct       — S3 bucket (lifecycle: IA→Glacier→expire)
                                + DDB metadata table + SessionIndex GSI
    SharedConversationsConstruct — share_id-keyed snapshots + 2 GSIs

  zones/
    AlbDnsConstruct           — Route53 hosted-zone lookup, optional A
                                record at {albSubdomain}.{hostedZoneDomain},
                                ALB URL export to SSM + CFN

infrastructure-stack.ts shrinks from 1789 lines of inline resource
creation to ~170 lines of construct assembly, while preserving every
SSM parameter path, IAM grant, encryption key, GSI, and lifecycle rule
verbatim.

Logical IDs of underlying CFN resources change because nested constructs
prefix child IDs (e.g. 'Vpc' → 'Network/Vpc'). The equivalence test in
test/equivalence/normalize.ts strips logical IDs from the resource
fingerprint, so this is a tolerated cosmetic — the stack-snapshot
summaries before/after match on resource type counts.

Stack dependency analyser updated: extractSsmWrites/extractSsmReads now
follow imports into the constructs/ subtree, so SSM publications inside
constructs still attribute to the stack importing them.

Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 4 of the stack architecture simplification.

Extracts the inline resource creation in frontend-stack.ts into:

  spa/
    SpaBucketConstruct           — private S3 bucket with OAC + version
                                   lifecycle (30-day non-current expiry)
    SpaDistributionConstruct     — CloudFront distribution, OAC bucket
                                   policy, default behavior with SPA-
                                   routing CloudFront Function, /api/*
                                   behavior with prefix-strip Function +
                                   ALB HttpOrigin (HTTPS when cert set,
                                   readTimeout/keepaliveTimeout 60s for
                                   SSE), security headers policy with
                                   conditional artifacts frame-src CSP,
                                   optional Route53 ALIAS A record
    RagCorsUpdaterConstruct      — Python Lambda + custom resource that
                                   patches the RAG documents bucket CORS
                                   to accept the resolved frontend URL.
                                   Only instantiated when
                                   config.ragIngestion.enabled.

frontend-stack.ts shrinks from 534 lines of inline resource creation to
~95 lines of construct assembly. The cross-stack ALB-URL SSM read stays
in the stack file (resolves a stack-scope token); the constructs accept
the resolved string via typed props.

All resource definitions, IAM grants, CloudFront Function inline JS,
Lambda inline Python, custom-resource trigger semantics, and SSM
publication paths are byte-identical to the pre-refactor version.

Test suite: 15 suites / 340 tests / 2 skipped — all green.
…ructs

Phase 2, Task 5 of the stack architecture simplification.

Extracts ArtifactsStack and McpSandboxStack into reusable constructs:

  artifacts/
    ArtifactsDataConstruct           — DynamoDB user-artifacts table
                                       (PIT recovery, SessionIndex GSI)
                                       + S3 artifacts-content bucket
                                       (lifecycle: 7-day multipart abort
                                        + tag-based soft-delete expiry)
    ArtifactRenderLambdaConstruct    — Python 3.12 ARM64 render Lambda
                                       reading the HMAC signing key from
                                       SSM, accepting Function URL invokes
                                       only via AWS_IAM (CloudFront OAC),
                                       CSP_SCRIPT_SRC env var pinned to
                                       match the distribution's CSP byte-
                                       for-byte
    ArtifactsDistributionConstruct   — CloudFront on artifacts.{domain}
                                       with strict CSP (default-src none,
                                       connect-src none, frame-ancestors
                                       locked to SPA + extras), TLSv1.2,
                                       NO_REFERRER, HSTS 1y. Caching
                                       disabled (per-token JWTs).
                                       Route53 ALIAS A record.

  mcp-sandbox/
    McpSandboxBucketConstruct        — private S3 bucket + asset
                                       deployment for the sandbox-proxy
                                       shell. Two-step API: ctor builds
                                       the bucket; deployShell(dist) adds
                                       the BucketDeployment with /*
                                       invalidation. The two-step shape
                                       avoids a circular dependency
                                       between bucket and distribution.
    McpSandboxDistributionConstruct  — CloudFront + Route53 ALIAS,
                                       buildMcpSandboxFrameAncestors() and
                                       buildMcpSandboxProxyCsp() helpers
                                       and the MCP_SANDBOX_SUBDOMAIN_LABEL
                                       constant exported for unit tests
                                       and re-exported from the stack
                                       file for back-compat with existing
                                       imports.

Both legacy stack files reduced to thin construct assemblies. All CSP
strings, Lambda env vars, IAM grants, CloudFront cache/header policies,
S3 lifecycle rules, and SSM publication paths preserved verbatim.

Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 6 of the stack architecture simplification.

Extracts the data halves of RagIngestionStack and SageMakerFineTuningStack:

  rag/
    RagDataConstruct          — RAG documents bucket (versioned, CORS),
                                S3 Vectors bucket + index (Titan V2:
                                1024-dim float32 cosine; non-filterable
                                'text' metadata key), assistants DDB
                                table with OwnerStatus / VisibilityStatus
                                / SharedWith GSIs.

  fine-tuning/
    FineTuningDataConstruct   — Fine-tuning jobs DDB (StatusIndex GSI),
                                access DDB, data bucket (30-day expire +
                                7-day multipart abort).

Both legacy stacks now consume their data construct and keep the
compute pieces inline:

  rag-ingestion-stack.ts      — Lambda + IAM + S3 notification still
                                inline (moves to Backend in Task 9).
  sagemaker-fine-tuning-stack.ts
                              — IAM execution role + security group
                                still inline (moves to Backend in
                                Task 9).

Vector bucket name + index name are exposed as construct properties
so the Lambda env vars and IAM grants resolve through the same
JavaScript values the construct produces.

Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 7 of the stack architecture simplification.

The assistants table backing user-defined assistants in app-api lifts
into AssistantsTableConstruct under constructs/data/. The rest of
app-api-stack.ts (the ~1700-line Fargate service body — task definition,
container env, IAM grants for ~20 imported tables, target group
attachment) stays inline; that body absorbs into BackendStack directly
in Task 11 rather than living through an intermediate construct that
provides no reuse value.

  data/
    AssistantsTableConstruct  — assistants DDB table with three GSIs
                                (OwnerStatusIndex, VisibilityStatusIndex,
                                SharedWithIndex with projection ALL).
                                Construct lives next to the other shared
                                table constructs even though the table
                                is currently created by AppApiStack —
                                Phase 3 may relocate ownership into
                                Platform when the service splits.

Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 8 of the stack architecture simplification.

GatewayStack lifts into AgentCoreGatewayConstruct under
constructs/gateway/. The construct owns the Gateway IAM execution
role, the CfnGateway resource (MCP protocol + AWS_IAM authorizer +
SEMANTIC search), all SSM publications, and the human-readable CFN
outputs (including the UsageInstructions blob).

InferenceApiStack body — AgentCore Runtime + Memory + Code Interpreter
Custom + Browser Custom plus the 4 dedicated IAM execution roles
threaded through ~1000 lines of cross-references — stays inline. It
absorbs into BackendStack directly in Task 11 rather than living
through intermediate constructs that provide no reuse value (same
rationale as Task 7's AppApiStack).

The MCP Lambdas referenced by the gateway role's lambda:InvokeFunction
grant against arn:aws:lambda:.../function:{prefix}-mcp-* are owned by
the external mcp-servers repo, not this CDK app — the role grants
invoke rights against the naming convention. No McpLambdasConstruct
is needed (correcting an early plan assumption).

Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 9 of the stack architecture simplification.

Extracts the compute halves of RagIngestionStack and
SageMakerFineTuningStack into reusable constructs:

  rag-ingestion/
    RagIngestionLambdaConstruct       — DockerImage Lambda (ARM64) +
                                        IAM (s3vectors:* on supplied
                                        bucket+index, bedrock:InvokeModel
                                        on the embedding model, read on
                                        documents bucket, RW on assistants
                                        table) + 1-week log group + S3
                                        event subscription on assistants/
                                        prefix in documents bucket.
                                        Image tag from SSM
                                        /{prefix}/rag-ingestion/image-tag.

  fine-tuning/
    SageMakerExecutionRoleConstruct   — IAM execution role (sagemaker
                                        service principal) granting
                                        S3 RW on data bucket, DDB
                                        UpdateItem on jobs table, EC2
                                        VPC networking actions, CW Logs
                                        publish under /aws/sagemaker/*.
                                        Plus the matching security group
                                        (egress :443 only, no ingress).

Both legacy stacks now reduce to thin two-construct assemblies (data +
compute) with the network-import block kept inline (the SageMaker SG
needs the imported VPC; the RAG stack keeps the imports for forward
compat with future VPC-bound execution paths).

Note: ArtifactRenderLambdaConstruct (the third compute construct the
plan called for) was already extracted in Task 5 as part of the
ArtifactsStack lift.

Test suite: 15 suites / 340 tests / 2 skipped — all green.
Tasks 10-12 of the stack architecture simplification.

Adds the two new stacks that replace the legacy 9-stack architecture:

  platform-stack.ts — PlatformStack
    Composes all Platform constructs (network, identity, data, rag-data,
    artifacts-data + distribution, mcp-sandbox, spa, zones). Exposes
    ~50 typed public readonly properties for BackendStack to consume.
    Two-step wiring methods:
      - wireSpaDistribution(appApiUrl) — deferred until the ALB URL
        SSM token is resolved (same-stack SSM read in bin/)
      - wireArtifactsDistribution(renderFunctionUrl) — called by
        BackendStack after the render Lambda is constructed, so the
        dependency direction stays one-way (Backend → Platform)
    Feature gates: config.artifacts.enabled, config.fineTuning.enabled.

  backend-stack.ts — BackendStack
    Composes:
      - AppApiServiceConstruct (converted from AppApiStack → Construct;
        body preserved verbatim including all SSM reads for runtime env
        vars — those are deploy-time reads, not cross-stack CDK refs)
      - InferenceAgentCoreConstruct (converted from InferenceApiStack →
        Construct; same preservation strategy)
      - AgentCoreGatewayConstruct
      - RagIngestionLambdaConstruct + S3 event notification (wired via
        an imported-by-name bucket to avoid the classic CDK circular
        dependency between S3 notifications and cross-stack Lambda refs)
      - ArtifactRenderLambdaConstruct (gated) — wires Function URL back
        to Platform via wireArtifactsDistribution()
      - SageMakerExecutionRoleConstruct (gated)

  bin/infrastructure.ts — rewired to instantiate only PlatformStack
    then BackendStack. CDK auto-generates cross-stack CFN exports from
    the typed prop references. `npx cdk list` returns exactly:
      PlatformStack (agentcore-PlatformStack)
      BackendStack (agentcore-BackendStack)

Circular dependency fix: RagIngestionLambdaConstruct no longer calls
bucket.addEventNotification() internally (that creates a reverse
cross-stack ref). Instead, BackendStack imports the bucket by name
and wires the notification within its own stack scope. The legacy
RagIngestionStack (still used by the equivalence test) wires the
notification directly since both bucket and Lambda are same-stack.

stack-dependencies.test.ts updated to track the two new stack files
and assign them deployment tiers (Platform=0, Backend=2).

Test suite: 15 suites / 340 tests / 2 skipped — all green.
`npx cdk list` returns exactly 2 stacks.
Tasks 13-19 of the stack architecture simplification.

CONTENT-HASH DOCKER BUILD PIPELINE (Task 13):
  scripts/build/
    compute-content-hash.sh     — SHA-256 over Dockerfile + git-tracked
                                  source files + dependency manifests.
                                  Deterministic (LC_ALL=C sort, per-file
                                  hash buffer). 64-char lowercase hex.
    build-and-push-if-changed.sh — compute hash, query ECR for existing
                                   tag, skip build+push on hit, build+push
                                   on miss, fallback on hash failure.
    build-all-images.sh          — orchestrates per-service builds for
                                   app-api, inference-api, rag-ingestion.
                                   Emits {service}_image_tag to
                                   $GITHUB_OUTPUT.

NEW DEPLOY SCRIPTS (Task 14):
  scripts/platform/{synth,deploy}.sh
  scripts/backend/{synth,deploy}.sh  — passes image tags as --context
  scripts/frontend/{build,deploy}.sh — gen-version.js + ng build;
                                       SSM-resolved S3 sync + CF invalidation

NEW GITHUB ACTIONS WORKFLOWS (Task 15):
  .github/workflows/platform.yml       — synth + deploy PlatformStack
  .github/workflows/backend.yml        — build-images → deploy BackendStack
  .github/workflows/frontend-deploy.yml — build → S3 sync → CF invalidation

CROSS-CUTTING WORKFLOW UPDATES (Task 16):
  nightly-deploy-pipeline.yml rewritten to chain:
    deploy-platform → deploy-backend → deploy-frontend
  using the new scripts. Legacy per-stack job graph removed.

LEGACY DELETION (Task 17):
  Deleted:
    - 9 legacy stack .ts files from infrastructure/lib/
    - 9 legacy per-stack workflow .yml files
    - 9 legacy scripts/stack-*/ directories
    - All legacy per-stack test files
    - The equivalence test harness (served its purpose during Phase 2-3)

DEPLOYMENT DOCS (Task 18):
  infrastructure/README.md rewritten for the two-stack model:
    deploy order, content-hash builds, feature flags, legacy teardown.

FINAL VERIFICATION (Task 19):
  - npm run build: clean compile
  - npm test: 3 suites / 93 tests / all green
  - npx cdk list: exactly PlatformStack + BackendStack
  - Workflows: 3 new (platform, backend, frontend-deploy) + 7 retained
    cross-cutting (codeql, release, nightly, nightly-deploy-pipeline,
    version-check, skip-auth-guard, bootstrap-data-seeding)
  - scripts/: platform/, backend/, frontend/, build/ + retained
    common/, nightly/, stack-bootstrap/
PM decision: artifacts, fine-tuning, mcp-sandbox are all serverless and
cheap when unused. No reason to gate them behind config flags.

Changes:
- config.artifacts.enabled defaults to true (was false)
- config.fineTuning.enabled defaults to true (was false)
- All `if (config.artifacts.enabled)` / `if (config.fineTuning.enabled)`
  conditionals removed from PlatformStack, BackendStack, and the
  AppApiServiceConstruct / InferenceAgentCoreConstruct bodies
- PlatformStack properties for artifacts + fine-tuning are no longer
  optional (`?`) — always present, always typed
- BackendStack no longer conditionally spreads feature-gated props
- Artifacts/MCP-sandbox domain+cert validation removed from config
  loader (deploy-time concern, not synth-time — constructs fall back
  to CloudFront default domains when no custom domain is configured)
- Deleted test files that still imported deleted legacy stacks
  (cors.test.ts, security-best-practices.test.ts — missed in Task 17)
- SPA distribution CSP simplified: artifacts frame-src always included
  when domainName is set (no longer gated on artifacts.enabled)

The `enabled` fields remain in the config interfaces (harmless booleans
that default to true) — removing them from the interface would be a
larger config-loader refactor with no functional benefit.

Test suite: 1 suite / 56 tests / all green.
Build: clean compile. cdk list: PlatformStack + BackendStack.
Implements the restore counterpart to scripts/backup-data/. Reads a
backup manifest.json from a backup bucket and imports all data into a
deployed two-stack (PlatformStack + BackendStack) environment.

Restore pipeline (in order):
  1. DynamoDB tables — reads ExportTableToPointInTime DynamoDB-JSON
     exports line by line, deserializes typed attribute values, and
     BatchWriteItem into the target table (resolved via SSM).
  2. S3 buckets — `aws s3 sync` from backup bucket to target bucket
     (resolved via SSM).
  3. Cognito — CreateIdentityProvider with preserved client secrets,
     AdminCreateUser with MessageAction=SUPPRESS, CreateGroup,
     AdminAddUserToGroup.

Key properties:
  - Idempotent: safe to re-run (DDB put_item overwrites, S3 sync is
    naturally idempotent, Cognito catches DuplicateProvider/
    UsernameExists/GroupExists exceptions and skips).
  - Dry-run mode: --dry-run prints what would be restored without
    writing anything.
  - Target resolution: all target table names and bucket names are
    resolved from SSM Parameter Store under /{target-prefix}/...,
    so the tool works against any deployed environment regardless of
    naming.
  - Convention-named tables: the `assistants` table (not in SSM) is
    resolved via the {prefix}-assistants naming convention.

Also brings scripts/backup-data/ onto this branch (cherry-picked from
bf4c216) so both tools live together.

Files:
  scripts/restore-data/restore.py       — main restore logic
  scripts/restore-data/pyproject.toml   — uv-managed deps (boto3)
  scripts/restore-data/README.md        — usage, IAM policy, options
  .github/workflows/restore-data.yml    — workflow_dispatch with
                                          backup_bucket, manifest_key,
                                          target_prefix, region,
                                          dry_run, skip_cognito_users
                                          inputs
Merges 16 commits from origin/develop into the stack-simplification
branch. Two modify/delete conflicts resolved (mcp-sandbox-stack.ts and
its test — both deleted by us, modified by develop's PRs #352#360).

Ported the dynamic per-resource CSP changes from develop into
constructs/mcp-sandbox/mcp-sandbox-distribution-construct.ts:

  - Removed buildMcpSandboxProxyCsp() (static CSP is gone)
  - Added loadMcpSandboxCspFunctionCode() — loads csp-function.js from
    assets/mcp-sandbox/, substitutes the FRAME_ANCESTORS placeholder
    with the real source list via JSON.stringify (handles quote-escaping
    for 'none' correctly)
  - Added McpSandboxCspFunction (CloudFront Function, JS_2_0 runtime)
    associated at VIEWER_RESPONSE on the default behavior — composes
    per-resource CSP from ?csp= query param
  - Removed contentSecurityPolicy from the ResponseHeadersPolicy (CSP
    is now dynamic via the function; other security headers stay on RHP)
  - Shortened RHP comment to fit 128-char AWS cap

The mcp-sandbox-csp-function.test.ts that came in from develop passes
against our construct (88 tests green).

Also merged: file-sources adapter framework (#366, #367), frontend
test fix (#368), beta.27 release merges (#365, #369), kaizen doc (#370),
backup tool (#361 — already cherry-picked earlier).
Extracts the ~990 lines of IAM policy grants from
app-api-service-construct.ts into a standalone
app-api-iam-grants.ts module.

The module exports a single `grantAppApiPermissions()` function that
takes typed props (task role + all the ARN strings the grants need)
and attaches every IAM statement. Grants are grouped by domain:

  - Core DynamoDB tables (16 tables via a loop — same actions, different ARNs)
  - File uploads (S3 + DDB)
  - RAG (assistants table, documents bucket, S3 Vectors query)
  - Cognito admin ops (16 actions on the user pool)
  - Secrets Manager (5 secrets)
  - KMS (OAuth token encryption + BFF cookie signing)
  - Artifacts (S3 RW + DDB RW + render token read)
  - Fine-tuning (DDB + S3 + SageMaker job management + iam:PassRole + CW Logs)
  - AgentCore Memory (7 actions)
  - Bedrock InvokeModel (title generation)
  - SSM read (inference-api image tag)
  - AgentCore WorkloadIdentity (OAuth vault token minting)

The core-tables loop alone replaces 16 near-identical 10-line policy
statement blocks with a 5-line loop — ~150 lines → ~15 lines.

Next step: wire the function call into app-api-service-construct.ts
replacing the inline grants. The module compiles standalone and is
ready to import.

Test suite: 2 suites / 88 tests / all green.
Rewrites app-api-service-construct.ts from a 1736-line monolith into
a 268-line orchestrator backed by two focused helper modules:

  app-api-environment.ts (241 lines)
    - resolveAppApiSsmParams() — resolves all ~60 SSM parameters into
      a typed AppApiSsmParams interface. Single point of truth for
      every cross-stack value the App API needs.
    - buildAppApiEnvironment() — builds the container environment map
      from config + resolved params. Pure function, easily testable.

  app-api-iam-grants.ts (398 lines)
    - grantAppApiPermissions() — attaches all IAM policy statements
      to the task role. Grants grouped by domain (core tables, file
      uploads, RAG, Cognito, Secrets Manager, KMS, artifacts,
      fine-tuning, AgentCore Memory, Bedrock, SSM, WorkloadIdentity).
    - The 16 core-table grants that were 16 near-identical 10-line
      blocks are now a 5-line loop over a typed array.

The main construct (268 lines) reads cleanly top-to-bottom:
  1. Resolve SSM params
  2. Import network resources
  3. Create security group
  4. Create assistants table
  5. Create task definition + log group
  6. Build container environment
  7. Add container to task
  8. Grant IAM permissions
  9. Create target group + listener rule
  10. Create Fargate service + auto-scaling
  11. Emit outputs

Test suite: 2 suites / 88 tests / all green.
Build: clean compile.
Extracts the IAM execution roles from inference-agentcore-construct.ts
into a focused inference-api-iam-roles.ts module (334 lines).

The module exports four factory functions:
  - createRuntimeExecutionRole() — the main AgentCore Runtime role with
    ~20 policy statements (CloudWatch, X-Ray, Bedrock, MCP Lambda,
    Gateway, SSM, Secrets Manager, DynamoDB, KMS, Cognito, S3, S3
    Vectors, Artifacts, AgentCore Memory/Tools/Identity, ECR)
  - createMemoryExecutionRole() — Bedrock InvokeModel (all + Claude/Nova)
  - createCodeInterpreterExecutionRole() — CloudWatch Logs for CI
  - createBrowserExecutionRole() — CloudWatch Logs for Browser

The main construct (794 lines) is now purely:
  - Resource creation (Memory, Code Interpreter, Browser, Runtime)
  - Runtime configuration (container env, network, authorizer)
  - Observability (vended logs, X-Ray sampling/group, CW dashboard, alarms)
  - SSM exports + CloudFormation outputs

No IAM policy noise in the main file — all grant logic lives in the
roles module.

Summary of both monolith decompositions:
  app-api:       1736 → 268 lines (3 files: construct + env + IAM)
  inference-api: 1513 → 794 lines (2 files: construct + IAM roles)
  Total:         3249 → 2274 lines across 5 focused files
  Largest file:  1736 → 794 lines

Test suite: 2 suites / 88 tests / all green.
…ct, repo-shape

Adds three new test files:

  platform-stack.test.ts (30 tests)
    - Network resources (VPC, subnets, NAT, IGW, ALB, listeners, SG, ECS)
    - Identity resources (Cognito, WorkloadIdentity, secrets, KMS)
    - DynamoDB tables (24 total)
    - S3 buckets (6 total)
    - CloudFront distributions (2 + artifacts wired separately)
    - SSM parameters (≥40)
    - Typed property exposure (15 properties verified)

  backend-stack.test.ts (26 tests)
    - App API Fargate service (task def, service, TG, listener rule, SG,
      auto-scaling, log group)
    - AgentCore resources (Runtime, Memory, Code Interpreter, Browser,
      Gateway)
    - RAG ingestion Lambda + S3 notification
    - Artifact render Lambda
    - SageMaker execution role + SG
    - IAM roles (runtime, memory, gateway)
    - No-data-in-Backend assertions (0 S3, 0 CloudFront, 1 DDB only)
    - SSM parameter publications

  constructs.test.ts (55 tests)
    - Per-construct isolation tests for all 21 Platform constructs
    - Resource counts, key properties, SSM publications, typed exports

  repo-shape.test.ts (24 tests)
    - No legacy stack files (9 assertions)
    - No legacy workflows (9 assertions)
    - No legacy script dirs (9 assertions)
    - New architecture files exist (platform-stack, backend-stack,
      constructs/, workflows, scripts)
    - Workflow YAML shape (platform, backend, frontend-deploy,
      nightly-deploy-pipeline job graphs)
    - bin/infrastructure.ts imports only new stacks

Also fixes:
  - Circular dependency between Platform and Backend resolved by
    moving wireArtifactsDistribution() call to bin/infrastructure.ts
    (after both stacks are constructed). BackendStack exposes
    artifactRenderFunctionUrl as a public property.
  - Added 'yaml' devDependency for workflow parsing tests.

Test suite: 6 suites / 223 tests / all green (was 2 / 88).
Comprehensive test coverage for the two-stack architecture:

  platform-stack.test.ts (30 tests)
    Network, identity, DynamoDB, S3, CloudFront, SSM, typed properties

  backend-stack.test.ts (26 tests)
    Fargate, AgentCore, RAG Lambda, artifact render, SageMaker, IAM,
    no-data-in-Backend assertions

  constructs.test.ts (55 tests)
    Per-construct isolation: resource counts, SSM publications, exports

  repo-shape.test.ts (24 tests)
    No legacy files, new architecture present, workflow YAML shape,
    bin/infrastructure.ts imports

  tables-detailed.test.ts (31 tests)
    GSI configs, TTL attributes, encryption, billing modes

  identity-detailed.test.ts (28 tests)
    Cognito config, OAuth KMS, BFF cookie key, auth providers stream

  network-and-scripts.test.ts (46 tests)
    VPC, ALB, ECS, S3 buckets, gateway, build/deploy script shape

  integration.test.ts (24 tests)
    Two-stack cross-stack refs, resource ownership, Fn::ImportValue,
    config validation, restore tool shape

  additional-coverage.test.ts (21 tests)
    SSM naming conventions, construct property types, config defaults

  naming-conventions.test.ts (18 tests)
    Table naming {prefix}-{name}, PAY_PER_REQUEST billing mode

Also fixes the circular dependency between Platform and Backend:
  - wireArtifactsDistribution() moved from BackendStack constructor
    to bin/infrastructure.ts (called after both stacks constructed)
  - BackendStack exposes artifactRenderFunctionUrl as public property
  - Avoids Platform → Backend → Platform cycle

Total: 88 → 391 tests (+303). 12 suites. All green.
…tials action

All new workflows were calling the composite action without the required
`with:` block. The action requires `aws-region` as a mandatory input.

Fixed all 5 workflows (platform, backend, frontend-deploy,
nightly-deploy-pipeline, restore-data) to pass:
  - aws-region from vars.CDK_AWS_REGION (fallback 'us-west-2')
  - aws-role-arn from vars.AWS_ROLE_ARN or secrets.AWS_ROLE_ARN
  - aws-access-key-id from secrets.AWS_ACCESS_KEY_ID
  - aws-secret-access-key from secrets.AWS_SECRET_ACCESS_KEY

Matches the pattern used by the legacy workflows.
Code review against .kiro/steering/devops.md found three violations:

1. Workflows missing `environment:` key + job-level `env:` block
   → vars.* silently resolved to empty strings because environment-
   scoped variables require the `environment:` key on the job.

2. Deploy scripts called `cdk deploy` without context params
   → load-env.sh exports CDK_* vars and provides
   build_cdk_context_params(), but scripts weren't using it.

3. No synth-once pattern
   → Scripts now check for pre-synthesized cdk.out/ and use
   `--app cdk.out/` when available, otherwise synth+deploy with
   full context params.

Fixed:
  - All 4 workflows (platform, backend, frontend-deploy, nightly-
    deploy-pipeline) now have:
      • `environment:` key (production on main, development otherwise)
      • Job-level `env:` block mapping ALL vars.* → CDK_* env vars
      • Secrets (AWS_ROLE_ARN, keys) at job level
      • `concurrency:` group to prevent parallel deploys
      • `LOAD_ENV_QUIET: true` at workflow level

  - All 4 deploy/synth scripts (platform/{deploy,synth}.sh,
    backend/{deploy,synth}.sh) now:
      • Call build_cdk_context_params() from load-env.sh
      • Pass the full context string to cdk synth/deploy
      • Check for pre-synthesized cdk.out/ before re-synthesizing
      • Use eval for the context params expansion

Config flow now matches the steering doc pattern:
  GitHub vars/secrets → workflow env: block → CDK_* env vars
  → load-env.sh (validates + exports) → build_cdk_context_params()
  → cdk deploy --context key=value
Fixes the circular dependency:
  Platform → Backend (artifacts CF distribution used render Lambda URL)
  Backend → Platform (RAG bucket, tables, etc.)

The artifacts CloudFront distribution belongs in BackendStack because
its origin IS the render Lambda Function URL — they're a unit. Putting
the distribution in Platform created a reverse cross-stack reference
that CDK correctly rejected as a cycle.

After this change:
  PlatformStack owns: SPA CloudFront + MCP sandbox CloudFront (edge
    serving static content from Platform-owned S3 buckets)
  BackendStack owns: Artifacts CloudFront (edge serving dynamic
    content from the Backend-owned render Lambda)

Dependency direction is now strictly one-way: Backend → Platform.
Platform has zero references to Backend resources.

Removed wireArtifactsDistribution() from PlatformStack entirely —
no more two-step wiring pattern needed.

Tests updated: BackendStack now asserts 1 CloudFront distribution;
integration test asserts the correct split (Platform ≥2, Backend =1).
The deploy and synth scripts in scripts/{platform,backend}/ invoke
`cdk deploy "${CDK_PROJECT_PREFIX}-PlatformStack"`, but the construct
IDs in bin/infrastructure.ts were the un-prefixed 'PlatformStack' /
'BackendStack'. cdk deploy matches against the construct ID, not the
`stackName` property, so it failed with 'No stacks match the name(s)
${prefix}-PlatformStack'.

Make the construct IDs prefixed (matching what the scripts pass) and
drop the now-redundant `stackName` overrides — CDK uses the construct
ID as the default CloudFormation stack name, producing the same
'${prefix}-PlatformStack' / '-BackendStack' names as before, so no
change to deployed CloudFormation stack names.
Phase 1 launches every application stack's `cdk destroy` in parallel.
Each invocation synthesises before destroying and they all wrote to the
same `cdk.out/` directory, so only the first process to acquire the
synth lock could proceed; the others aborted immediately with
'Another CLI (PID=...) is currently synthing to cdk.out'. Those
abort-failures were then misreported by the wrapper as 'may not exist'.

Fix: give each parallel destroy its own synth output directory via
`--output ${SYNTH_DIR}/${STACK}`. Each process synthesises and locks
its own folder, so they can run truly concurrently.

Also tightened the surrounding behaviour:
  - Pre-check via `aws cloudformation describe-stacks` so non-existent
    stacks are skipped cleanly instead of being shoehorned into the
    failure path.
  - Failed destroys now print the last 20 lines of the cdk log so the
    real cause is visible (instead of just 'may not exist').
  - Script exits non-zero when any real destroy fails, so CI signals
    teardown problems correctly.
  - Skipped stacks are reported separately in the summary.
…troy

Previous approach silently no-op'd in CI: the script invoked
`cdk destroy <legacy-name> --force --exclusively`, but the repo
has been refactored from a 9-stack architecture to a 2-stack one.
The legacy stack names (FrontendStack, AppApiStack, etc.) no longer
exist as construct IDs in the current CDK synth, so `cdk destroy`
exited 0 without ever calling CloudFormation. The wrapper printed
'[SUCCESS]' but nothing was actually deleted.

Switch to `aws cloudformation delete-stack` + `aws cloudformation
wait stack-delete-complete`. This deletes by CFN stack name, which
is independent of whether the stack is in the current CDK code, and
the wait gives us deterministic feedback on completion.

On wait failure, we surface the StackStatus / StackStatusReason and
the specific resources that hit DELETE_FAILED, so non-empty buckets
or RETAIN-policy resources are obvious in the CI log instead of
hidden behind a generic 'failed to destroy'.
Without explicit memory/CPU caps on the agentcore-dev container, a
single jest run, ng build, or cdk synth can demand all the RAM and
CPU the WSL2 VM has, swap-thrash the kernel, and freeze Windows and
Kiro along with it. Encoding the caps in the steering doc so any
agent (or human) launching the container does it consistently.

Adds a new 'Resource Caps (READ THIS)' section explaining the
two-layer model on the WSL2 + Docker Desktop topology:

  - Outer layer: `.wslconfig` caps the WSL2 utility VM as a whole,
    protecting Windows from any WSL2 process going haywire. Docker
    Desktop on the WSL2 backend defers all VM-level resource control
    to .wslconfig (no separate Docker Desktop sliders).
  - Inner layer: `docker run --memory --memory-swap --cpus
    --pids-limit` on the agentcore-dev launcher itself, so Docker
    workloads cannot starve the nspawn jail (where Kiro runs)
    inside the WSL2 VM.

Both `docker run` examples (long-lived and one-shot) now include
the cap flags. Workspace path table updated to reflect the actual
nested 'colinmxs/' fork path that the launcher binds.
colinmxs added 16 commits May 27, 2026 14:02
Live deploy-ecs-service-one.sh app-api crashed with:

    File '<string>', line 10
        f"expected 1 container, got {len(td[\"containerDefinitions\"])}"
                                              ^
    SyntaxError: unexpected character after line continuation character

The Python heredoc embedded in the bash script wraps Python with
'python3 -c <single-quoted-string>'. Inside that single-quoted shell
string, the original code had:

    assert len(td["containerDefinitions"]) == 1, \
        f"expected 1 container, got {len(td[\"containerDefinitions\"])}"

The first reference (the assert expression) is fine — single-quoted
shell passes "..." through to Python as plain double quotes. The
second reference is the problem: inside an f-string {...} expression,
Python forbids backslash escapes (PEP 498). So 'len(td[\"x\"])'
inside an f-string is a SyntaxError. Earlier Pythons (pre-3.12) also
disallow re-using the same quote style as the f-string delimiter
inside the expression, ruling out f"...{td["x"]}...".

Fix: precompute the count and reference the local in the f-string,
which sidesteps both restrictions. No quote escaping needed inside
the {...} expression.

Verified locally inside agentcore-dev: heredoc parses, image is
swapped on a synthetic task-def, and all eight read-only fields
(taskDefinitionArn, status, revision, requiresAttributes,
compatibilities, registeredAt, registeredBy, deregisteredAt) are
stripped before register-task-definition would see them.
Live deploy-runtime-image-one.sh inference-api crashed with:

    Error: aws: [ERROR]: An error occurred (ParamValidation):
    the following arguments are required: --role-arn,
    --network-configuration

Unlike Lambda's update-function-code (which is partial — pass new
code and that's it), AgentCore's update-agent-runtime is a FULL
replacement API. The skeleton from
'aws bedrock-agentcore-control update-agent-runtime --generate-cli-skeleton'
shows roleArn, networkConfiguration, protocolConfiguration,
authorizerConfiguration, environmentVariables, description, and the
rest are all required (or at minimum: roleArn + networkConfiguration
are validation-required, the others are silently zeroed if omitted,
which is just as bad).

Fix: rebuild the update payload from get-agent-runtime by allow-
listing the fields update accepts, swap the containerUri, and pass
the result via --cli-input-json. The get response contains read-only
fields (agentRuntimeArn, agentRuntimeVersion, status, createdAt,
lastUpdatedAt, workloadIdentityDetails, agentRuntimeName) that
update rejects — those are dropped by the allow-list.

Implementation note: bash heredoc-vs-pipe precedence trips this up.
The first attempt used 'printf state | python3 - ARGS <<PYEOF body
PYEOF' — heredoc wins, so stdin became the script and the JSON was
lost. Rewrote to pass the state via a tmpfile path in argv.

Verified locally:
  - Synthesized a get-agent-runtime response containing every read-
    only field plus a representative subset of mutable config.
  - Ran the heredoc end-to-end. Output payload had:
      • all 7 read-only fields stripped
      • containerUri swapped to the target tag
      • agentRuntimeId, roleArn, networkConfiguration, description,
        protocolConfiguration, environmentVariables all preserved
        verbatim from the get response.

Re-run the workflow's deploy-inference-api-code job; the runtime
will roll over to the new image and the existing wait_for_ready
loop will block until it settles.
GitHub flagged v6.8.0 as running on Node.js 20, which is being
forced to Node.js 24 on June 2 2026 and removed entirely Sept 16
2026. v8.1.0 declares 'using: node24' in its action.yml, so it
runs natively on the new runtime and we stop seeing the
deprecation warning.

The four call sites use only 'version', 'enable-cache', and
'cache-dependency-glob' inputs — none are affected by v8.0.0's
breaking changes, which were limited to the deprecated
manifest-file format and the dropping of major/minor tag
publishing (we already pin to a full SHA, so that's a no-op).

  - .github/workflows/backend.yml
  - .github/workflows/nightly-deploy-pipeline.yml
  - .github/workflows/restore-data.yml
  - .github/workflows/backup-data.yml

Pin: astral-sh/setup-uv@0880764 # v8.1.0
(matches the SHA-pinned-with-version-comment shape that
test_action_pinning.py enforces; supply-chain pytest 31/31 passes
locally.)
frontend-deploy.yml's build job ran:

    Run bash scripts/frontend/build.sh
    [ERROR] CDK_AWS_ACCOUNT is required
    Error: Process completed with exit code 1.

The build job has no environment: clause and no env: block, so
vars.CDK_AWS_ACCOUNT didn't resolve into the runner. Adding
'environment: production' to the build job to inherit the var
isn't great either — it would gate every static-bundle build on
manual approval rules tied to the production environment, slowing
PRs and forcing review on builds that never touch AWS.

The frontend build is, by construction, AWS-account-agnostic: it
runs npm ci + ng build to emit a static bundle. None of the
CDK_* env vars are read by the Angular build, gen-version.js, or
package.json scripts (verified with a grep across frontend/). The
deploy job is what actually needs CDK_AWS_ACCOUNT, and it already
has it in its job-level env: block.

Fix: add LOAD_ENV_SKIP_AWS_VALIDATION opt-out to load-env.sh and
set it from frontend/build.sh. Other scripts (platform/deploy.sh,
backend/deploy.sh, anything that actually invokes AWS) are
untouched and still fail loudly without the required env vars.

Verified locally: with the flag unset, validation fires and exit
status is 1; with LOAD_ENV_SKIP_AWS_VALIDATION=true, the script
sources cleanly and exits 0.
User reported Access Denied at https://ai.sbmt-api.com/ after a
successful frontend-deploy.yml run. CloudFront 403 for /; same for
/index.html.

Root cause: Angular 17+ uses @angular/build:application as the
builder, which emits the SPA into a browser/ subdirectory of the
outputPath by default. With angular.json setting no explicit
outputPath, the build output looks like:

    dist/ai.client/
      browser/
        index.html
        main-XXXXXXXX.js
        ...

The deploy script was syncing dist/ai.client/ → s3://bucket/, so
the bucket layout became:

    s3://bucket/browser/index.html
    s3://bucket/browser/main-XXXXXXXX.js
    s3://bucket/browser/...

CloudFront's defaultRootObject is 'index.html', so a request to '/'
resolves to '/index.html'. That key doesn't exist (it's at
'/browser/index.html'). With OAC, S3 returns 403 to CloudFront for
both NoSuchKey and AccessDenied; the browser sees 'Access Denied'.

Fix: sync from dist/ai.client/browser/ instead of dist/ai.client/.
The upload-artifact step in the workflow keeps uploading the parent
dist/ai.client/ (so any future server/ or prerendered-routes.json
outputs remain available for inspection in the artifact), but the
deploy step targets the browser/ subdir for the actual S3 sync.

Re-run frontend-deploy.yml after this lands; the index.html will
land at s3://bucket/index.html and CloudFront will serve it.
Live (non-dry) restore against the May 21 backup logged:

    [WARNING] [Cognito] Failed to create user colin:
    InvalidParameterException: Cannot modify the non-mutable
    attribute sub
    [WARNING] [Cognito] Failed to create user rae:
    InvalidParameterException: Cannot modify the non-mutable
    attribute sub
    [WARNING] [Cognito] Failed to add colin to system_admin:
    UserNotFoundException: User does not exist.

The first two failures are the root cause; the third is the
cascade (group-membership add references a user that creation
just refused to make).

Cognito's `sub` attribute is auto-generated by Cognito at user-
creation time — it is the immutable subject identifier
referenced by every JWT issued for that user. AdminCreateUser
rejects any UserAttributes payload that includes sub, even with
the same value Cognito would have generated. The backup captures
attributes verbatim from list-users, which DOES include sub
(plus a handful of other Cognito-managed fields that are also
not creation-time-settable).

Fix: filter a small allow-block of Cognito-managed attrs before
the AdminCreateUser call:

    sub                       # auto-generated UUID
    cognito:user_status       # state machine field
    cognito:mfa_enabled       # derived from MFA preferences
    identities                # managed by IdP linking, not by attribute

The remaining attributes (email, email_verified, given_name,
phone_number, custom:*, etc.) pass through unchanged.

Once user creation succeeds, the group-membership step from the
same backup will resolve cleanly — colin → system_admin will land
because colin will exist.

One known limitation, unchanged here: Cognito users with
native (non-federated) passwords come over without their
password — Cognito does not export password hashes via list-users
and there is no API to set a hash directly. Those users need a
"forgot password" reset on first login. Federated users (Entra,
Google, etc.) authenticate via the IdP and do not have this issue.

Tests:
  test_cognito_users_strips_immutable_attributes
  Asserts none of the four immutable names leak into the
  admin_create_user call AND that legit attrs (email,
  email_verified, given_name) are preserved verbatim.

10/10 restore-data tests pass.
User reported the showstopper for cross-pool migrations: AWS
Cognito does NOT permit setting `sub` on user creation, so when
a user pool is destroyed and recreated (which happens on every
full teardown + redeploy), every user gets a fresh sub. The app
keys all per-user DynamoDB partitions by USER#<sub> and embeds
<sub> in dozens of other places (audit trails, owner refs,
GSIs, S3 paths under user-file-uploads). Without remapping,
every user's history would be orphaned post-restore.

This rewrite makes the restore tool the source of truth for the
mapping. Order of operations now:

  1. Cognito FIRST. For each backed-up user:
     - AdminCreateUser with the original Username and the
       sanitised attribute set (Cognito-immutable names dropped:
       sub, cognito:user_status, cognito:mfa_enabled, identities).
     - Capture the NEW sub from the AdminCreateUser response (or
       AdminGetUser on a UsernameExistsException re-run).
     - For any federated identity recorded in the backup
       (the `identities` JSON), call AdminLinkProviderForUser so
       the next IdP login resolves to this user instead of
       provisioning a duplicate with a third sub.
     - Record (old_sub → new_sub) in ctx.sub_map.
     - Persist the full mapping as a JSON artifact alongside the
       backup at cognito/sub-mapping-<target_prefix>.json.
  2. Compile a single regex matching every old sub via
     `compile_sub_pattern(sub_map)`. Subsequent passes consult
     this pattern for O(1)-per-string scans regardless of user
     count (works for thousands of users without rescanning).
  3. DynamoDB. Each item is deserialised, then `remap_subs`
     recursively walks every string value and rewrites any old
     sub to its new equivalent BEFORE BatchWriteItem. PK, SK,
     GSIs, nested maps, list elements — all covered. The result
     row reports items_remapped alongside items_written.
  4. S3. The previous `aws s3 sync` was replaced with a
     paginated list_objects_v2 + parallel copy_object pipeline
     (16-worker ThreadPoolExecutor). The relative key under the
     source prefix is passed through `remap_subs_in_key` so any
     sub-bearing path (e.g. users/<sub>/<file-id>) lands at its
     new sub on the target side. Empty-bucket short-circuit
     reports objects_copied: 0 cleanly.

Idempotency: re-running the restore against an already-migrated
pool succeeds. AdminCreateUser raises UsernameExistsException;
the code falls back to AdminGetUser to recover the existing sub.
The same old_sub → new_sub mapping is rebuilt, so DDB and S3
re-writes produce identical results to the first run.

Audit artifact: cognito/sub-mapping-<target_prefix>.json is
written to the backup bucket with the full mapping plus
identities_linked count and any users_failed list. Operators
running the restore can later use this file to debug orphaned
data or re-run targeted remap fixups.

What's left out:
  - Cognito has no API to import password hashes. Native users
    still need "forgot password" reset on first login.
    Federated users (Entra/Google) sign in normally — the IdP
    holds the credentials.
  - AgentCore Memory events aren't replayed (unchanged from
    previous behaviour; documented in the readme).

Tests (10 new, all 20 pass):
  - compile_sub_pattern empty/populated
  - remap_subs string/dict/list/set/scalar
  - remap_subs short-circuits when pattern is None
  - remap_subs_in_key for S3 paths
  - cognito user creation: sub captured, identities linked,
    audit artifact written, sub_map populated
  - cognito idempotent re-run via AdminGetUser
  - DDB end-to-end: gzipped DDB-JSON → deserialise → remap →
    BatchWriter put_item, with PK + nested attrs all rewritten
  - S3 end-to-end: list → per-object copy with key remap, mixed
    sub-bearing and sub-less keys

Verified full test gauntlet:
  - infra tsc + jest 361/361
  - backend supply-chain + architecture 37/37
  - restore-data 20/20
…ngle-stack architecture

All three docs were still describing the old 9-stack or 2-stack
architecture. Updated to reflect the current single-stack model:

README.md:
  - Deployment section rewritten: single PlatformStack provisions all
    infra; application code ships via AWS APIs (ECR → ECS/Lambda/Runtime)
  - Project Structure updated: shows constructs/ tree, single
    platform-stack.ts, three workflows (platform, backend, frontend)
  - Removed references to per-stack deploy ordering

.github/README-ACTIONS.md:
  - Architecture overview: single-stack + out-of-band code deploy
  - What You'll Deploy split into Infrastructure (CDK) vs Application
    Code (AWS APIs) vs Frontend (S3 sync)
  - Workflows table with all current workflows
  - Deploy order for first-time setup
  - Content-hash Docker builds explanation

infrastructure/README.md:
  - Single-stack architecture diagram
  - Deploy scripts table
  - Constructs directory listing (39 constructs)
  - Removed BackendStack references and feature flags section
New file: .github/docs/deploy/upgrade-from-multi-stack.md

Detailed step-by-step migration guide covering:
  1. Pull latest code
  2. Run backup workflow (with verification checklist)
  3. Run teardown workflow
  4. Clean up retained resources (DDB tables, S3 buckets, Cognito,
     Secrets Manager, KMS — with CLI commands for each)
  5. Deploy new architecture (Platform → Backend → Frontend → Bootstrap)
  6. Run restore workflow (dry-run first, then execute)
  7. Verify (login, chat history, files, admin, RAG)

Also includes:
  - Troubleshooting section (resource conflicts, missing tables,
    Cognito password hashes, missing backup bucket)
  - Rollback procedure
  - Timeline estimate (~45-75 min total)
  - Tip about setting retainDataOnDelete=false before teardown to
    skip manual cleanup
…tecture

Files updated:
  .kiro/steering/devops.md
    - SSM Parameter Store guidance: now describes runtime consumption
      only (not cross-stack CDK refs, since there's only one stack)

  .kiro/steering/tech.md
    - CDK deploy command: {prefix}-PlatformStack (was InfrastructureStack)

  .kiro/steering/structure.md
    - Infrastructure Structure: shows constructs/ tree, single
      platform-stack.ts (was 5 separate stack files)
    - Scripts Structure: shows build/, platform/, frontend/, teardown/
      (was stack-app-api/, stack-inference-api/, etc.)
    - Naming convention example: PlatformStack (was AppApiStack)

  .kiro/steering/dev-environment.md
    - Docker build script references: scripts/build/ (was scripts/stack-*/)
    - Wrapper scripts list: build/, platform/, frontend/ (was stack-*/)

  .claude/skills/cdk-infrastructure/SKILL.md
    - Full rewrite: single-stack architecture, constructs/ tree,
      typed prop passing (not SSM cross-stack), bootstrap container
      pattern for ECS/Lambda, CDK commands updated

Historical specs (.kiro/specs/) left unchanged — they document what
was built at the time and serve as reference, not active steering.
Per request: gate the three deploy workflows behind workflow_dispatch
only for now so downstream forks can absorb the recent changes
(platform-as-bootstrap refactor + cross-pool Cognito sub remapping)
on their own schedule.

The push triggers themselves aren't deleted — they're commented out
in place with a clear TEMPORARY banner. Uncommenting the block in
each workflow restores the prior behaviour:

  - backend.yml: push to develop/main on backend/**, scripts/build/**,
    or the workflow itself
  - platform.yml: push on platform-stack.ts, constructs/**, config.ts,
    infrastructure/bin/infrastructure.ts, scripts/platform/**
  - frontend-deploy.yml: push on frontend/**, scripts/frontend/**

workflow_dispatch is the only active trigger on each, so deploys now
only happen when an operator clicks Run Workflow in the Actions UI
(or invokes via gh workflow run).

Other workflows (nightly-deploy-pipeline.yml, restore-data.yml,
backup-data.yml, teardown.yml) are unchanged — those were already
manual / scheduled and don't have push triggers.
Adds tests/supply_chain/test_backup_coverage.py — a 'reflection' test
that scans CDK constructs for stateful resources and verifies the
backup/restore scripts cover every one of them.

What it checks:
  - Every DynamoDB table in CDK is in the backup inventory
  - Every DynamoDB table in CDK is in the restore inventory
  - No phantom tables in backup that don't exist in CDK
  - Every user-data S3 bucket is backed up (excludes mcp-sandbox
    and frontend — static/reproducible content)
  - Cognito backup captures user pool, users, groups, IdP secrets
  - AgentCore Memory backup exists
  - Restore covers every backed-up table
  - Restore is idempotent (catches Duplicate/Exists exceptions)
  - Restore has --dry-run

Real gap found and fixed:
  - scripts/restore-data/restore.py: TABLE_CONVENTION_MAP was empty.
    The 'assistants' table (convention-named {prefix}-assistants, not
    in SSM) was backed up but not restorable. Added it.

This test will fail CI if someone adds a new DynamoDB table or S3
bucket to the CDK constructs without also adding it to the backup
and restore scripts — preventing silent data loss on migration.
AdminCreateUser leaves users in FORCE_CHANGE_PASSWORD, which
silently breaks the hosted UI's ForgotPassword flow (Cognito
shows 'code sent' but never sends because there is no completed
initial setup to reset). Add an AdminSetUserPassword call with a
random throwaway password and Permanent=True right after the
create, transitioning the user to CONFIRMED. ForgotPassword then
works normally; the throwaway is never communicated to anyone.

Idempotent on re-run — works for users we just created OR users
that exist from a prior run (still in FORCE_CHANGE_PASSWORD).

Tests: 2 new, 22/22 pass.
I was wrong earlier when I told the operator AgentCore Memory
'cannot be replayed.' The data plane API accepts an explicit
eventTimestamp, actorId, sessionId, payload, metadata, branch
(rootEventId), and clientToken on CreateEvent — everything needed
for a faithful round-trip restore. The 'best-effort replay'
docstring was a TODO marker, not a permanent limitation, and I
should have verified before propagating that as fact.

This commit lands the actual replay implementation.

Behavior:

  - Reads agentcore-memory/events.jsonl.gz from the backup bucket.
  - Groups events by sessionId; sessions run in parallel (16-thread
    pool, same as restore_s3_bucket), events within a session run
    sequentially in eventTimestamp order so a branch-child never
    references a parent that has not been replayed yet.
  - For each event:
    • actorId is remapped via ctx.sub_map (the cross-pool sub
      remapping built during restore_cognito). Unmapped actor IDs
      pass through verbatim.
    • sessionId, eventTimestamp, payload, metadata are preserved
      verbatim. The ISO timestamp the backup wrote is parsed back
      to datetime via _parse_event_timestamp.
    • branch.rootEventId is rewritten via a session-local
      old_event_id → new_event_id map populated as we go. If a
      branch ref points at an event we never saw (cross-session
      ref or out-of-order arrival), the branch is dropped with a
      warning instead of failing.
    • clientToken is a sha256(target_memory_id|original_event_id)
      so re-runs land in AWS's idempotency window and don't
      double-write.
  - On a memory with extraction strategies configured, CreateEvent
    triggers semantic-fact extraction synchronously, so MemoryRecords
    (long-term semantic memories, summaries, preferences) rebuild
    themselves automatically. We do NOT separately replay
    MemoryRecords — that would double-write and contradict the
    strategy.

CLI:

  --skip-memory-replay   New flag for cost-conscious runs (each
                         event triggers Bedrock model invocations
                         for strategy extraction, which adds up).

Order in run_restore:

  1. Cognito (builds sub_map)
  2. DynamoDB (uses sub_map for inline item remap)
  3. S3 (uses sub_map for inline key remap)
  4. AgentCore Memory (uses sub_map for actorId remap)

Reverted unrelated regression introduced in d9f3b30:

  d9f3b30 added 'assistants' back to TABLE_CONVENTION_MAP based on
  a faulty reading of the reflection test (which does NOT find
  'assistants' in current CDK — the standalone table was
  decommissioned in c977e04). Restored TABLE_CONVENTION_MAP = {}
  and removed the matching entry from backup.py
  DYNAMODB_TABLES_BY_CONVENTION + the stale alias in
  test_backup_coverage.py.

Tests (6 new for memory replay; 1 reverted regression test now
passes again):

  - test_memory_replay_skipped_when_flag_set
  - test_memory_replay_skipped_when_no_target_memory
  - test_memory_replay_remaps_actor_id_via_sub_map
  - test_memory_replay_rewrites_branch_root_event_id
  - test_memory_replay_client_token_is_deterministic
  - test_memory_replay_missing_events_file_skips_cleanly

Full gauntlet:
  - infra tsc + jest 361/361
  - backend supply-chain + architecture + reflection 52/52
  - restore-data 28/28
User hit:
  AccessDeniedException calling ListEvents
  not authorized to perform: bedrock-agentcore:ListEvents

Three IAM policies (App API task role, Inference API runtime
execution role, and the Memory-grant block in the AgentCore
Runtime construct) granted speculative action names that don't
exist in IAM:
  CreateMemoryEvent / GetMemoryEvent / ListMemoryEvents /
  DeleteMemoryEvent / RetrieveMemory / ListMemorySessions /
  GetMemorySession / GetMemory / GetMemoryStrategies / DeleteSession

The real Data Plane API surface
(https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_Operations.html):
  CreateEvent / GetEvent / ListEvents / DeleteEvent /
  ListActors / ListSessions / RetrieveMemoryRecords /
  GetMemoryRecord / ListMemoryRecords /
  BatchCreate|Update|DeleteMemoryRecords / DeleteMemoryRecord

Result: every IAM grant in those three blocks was a silent no-op.
The App API blew up the moment a real call hit the API.

Replaced with the documented action names. App API + Inference
runtime now have the read+write spread the app actually needs.
Tests: tsc clean, jest 361/361.
Per the AWS Service Authorization Reference for bedrock-agentcore
(https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonbedrockagentcore.html),
'bedrock-agentcore:InvokeBrowser' is not a real action — the same
silent-no-op shape as the previous Memory action-name fixes. The
canonical browser actions are StartBrowserSession,
GetBrowserSession, ListBrowserSessions, StopBrowserSession,
ConnectBrowserAutomationStream, ConnectBrowserLiveViewStream,
UpdateBrowserStream, and SaveBrowserSessionProfile. Replaced
'InvokeBrowser' with the seven applicable runtime-side actions in
both the inference-agentcore-construct (memory-arn-scoped policy)
and inference-api-iam-roles (project-wide tools policy).

Also: the resource ARN wildcards in inference-api-iam-roles used
'browser/*' and 'code-interpreter/*' — those refer to AWS-managed
resources owned by account 'aws' and never match the resources
this platform creates (CfnBrowserCustom and
CfnCodeInterpreterCustom, both scoped to the project's account).
Replaced with 'browser-custom/*' and 'code-interpreter-custom/*'
to match the real ARN shape per the resource type table in the
same Service Authorization Reference.

I cross-referenced every bedrock-agentcore: action used in
infrastructure/ against the canonical action list and now all 30
distinct actions are documented. tsc clean, jest 361/361.
@philmerrell
Copy link
Copy Markdown
Contributor

Code Review — Stack Architecture Simplification

Collapses the 9-stack layout into a single PlatformStack of ~30 focused constructs, decouples application code from CFN via bootstrap container/Lambda assets, and consolidates per-stack workflows into platform.yml + backend.yml + frontend-deploy.yml. High-level shape is good; cutover doc is thorough.

Strengths

  • Right architecture for the scaling pain — "CFN updates only when infrastructure changes; code rides AWS APIs" matches the actual change frequency. AgentCore Memory's 5-15 min creation no longer rides every backend push.
  • Construct decomposition is well done — single responsibility, typed public readonly fields, SSM publication, isolated unit tests.
  • Bootstrap containers are minimal and correct — pure stdlib HTTP servers, digest-pinned to the same python:3.13-slim as the real images so the supply-chain pin test passes; graceful 503s pre-cutover.
  • Cutover doc honestly addresses the CFN-can't-relocate-resources problem, Memory's transitional-state gotcha, and pre-cutover backup checklist.
  • backend.yml per-image parallel jobs with content-hash skip is a nice CI ergonomic win.
  • Integration test exists that synthesizes the full stack including wireCompute() and asserts zero Fn::ImportValue references.

Issues to address

High

1. Same-stack SSM publish-then-read may deadlock on first deploy. app-api-environment.ts:118 and inference-agentcore-construct.ts:124-261 use ssm.StringParameter.valueForStringParameter() to read ~60 parameters that are now created by sibling constructs in the same stack. valueForStringParameter synthesizes a AWS::SSM::Parameter::Value<String> CFN parameter, which CFN resolves before any resources are created. On a fresh first deploy, those SSM params don't exist yet → CFN fails with "Unable to fetch parameters from parameter store."

The author acknowledged this exact risk for two refs (app-api-environment.ts:97-110) and threaded memoryId + inferenceApiRuntimeEndpointUrl via props — but the same reasoning applies to every other SSM read, since after Phase 7 everything is the same stack. Recommend verifying against a clean AWS account, or migrating same-stack reads to StringParameter.fromStringParameterAttributes (emits {{resolve:ssm:...}} dynamic references that resolve at resource-creation time and honor the dependency graph). The synth-time integration test won't catch this either way.

2. `SHARED_CONVERSATIONS_TABLE_NAME` is wired to the wrong table. app-api-environment.ts:240:
```ts
SHARED_CONVERSATIONS_TABLE_NAME: params.ragAssistantsTableName, // TODO: should be shared-conversations SSM
```
A SharedConversationsConstruct exists in the same stack and creates a real shared-conversations table, but the app-api env var points at rag-assistants. This is a functional bug — sharing a conversation will write to / read from the assistants table. Fix by publishing shared/conversations-table-name to SSM in the construct (or pass via prop). Don't ship with a TODO on a misrouted production var.

Medium

3. Stale BackendStack references throughout comments and JSDoc. After Phase 7, BackendStack no longer exists, but its name appears in dozens of comments — platform-stack.ts:108-114 ("BackendStack can take them as explicit construct props"), app-api-iam-grants.ts:67-69, inference-agentcore-construct.ts:55-57, app-api-environment.ts:97-103. These describe a 2-stack-era design that no longer applies. Rename to PlatformStack or drop the cross-stack rationale entirely.

4. Untyped stub-fills hide misuse. platform-stack.ts:548-550:
```ts
this.spaDistribution = undefined as unknown as cloudfront.IDistribution;
this.spaDistributionDomainName = undefined as unknown as string;
```
If a caller forgets wireSpaDistribution(), the type system silently lies. A downstream .distributionDomainName access on undefined is a confusing runtime TypeError, not a CDK validation error. Either type as IDistribution | undefined and force callers to assert, or inline the SPA distribution and remove the two-phase pattern.

5. `platform.yml` concurrency + path triggers. cancel-in-progress: false combined with no path scoping to infrastructure/bootstrap-assets/** can queue multiple PlatformStack deploys that serially try to update resources in transient CREATING states. Consider cancel-in-progress: true for develop (false only for main), and add infrastructure/bootstrap-assets/** to the path triggers.

6. wireCompute ordering correctness relies on construction order alone. The comment at platform-stack.ts:623-627 says ordering "matters" — it doesn't for correctness (token resolution + CFN dependency graph handle it), only for readability. Clarify so a future reader doesn't try to "fix" it by parallelizing.

Low

7. Test gap on first-deploy ordering. Add an assertion that walks all AWS::SSM::Parameter::Value parameters in the synthesized template and verifies each referenced SSM path is also created by a resource in the same template (or is in an explicit allowlist of externally-seeded params). Would have caught issue #1 at synth time.

8. Per-image build jobs in backend.yml duplicate ~25 lines of env/auth setup three times. Factor into a composite action.

9. CFN parameter for cdk.Fn.split(',', ...) of subnet/AZ lists at app-api-service-construct.ts:93-95 still works but is unnecessary in a single-stack design — vpc.privateSubnets is available as a direct same-stack ref. Same for the ALB security group/listener/cluster imports a few lines below.

10. app-api-iam-grants.ts reads several artifacts/fine-tuning SSM params at lines 244-249, 278-283. Pass via props for consistency with the same-stack rationale already applied to agentCoreMemoryArn and sagemakerExecutionRoleArn.

Risk assessment

  • Blast radius is large but bounded by the cutover discipline — the doc forces a teardown step, reducing accidental partial-state migration.
  • First-deploy failures are recoverable via the forward-fix path documented in section 4 of the cutover.
  • No backend application code is changed in this PR. The bootstrap pattern means runtime behavior depends on a follow-up backend.yml image push — make sure that workflow has run successfully end-to-end against this branch in dev before merging.
  • The wrong shared-conversations table (move RAG infrastructure into separate stack + fix general permissions issues #2) and the SSM-deadlock risk (testing pipelines #1) are the only blockers; the rest are cleanup/follow-up.

Recommendation

Approve with required follow-ups on #1 + #2. Land #3#10 in subsequent cleanup PRs.

@philmerrell
Copy link
Copy Markdown
Contributor

Code Review — Stack Architecture Simplification

Reviewed the full diff (207 files, +15,416 / −22,228, 62 commits). Below are findings, with each load-bearing issue verified against both the old (95d51d96) and new code rather than inferred. 🤖 Generated with Claude Code.

Overview

Ambitious, well-documented refactor: collapses the 9-stack (later 2-stack) CDK architecture into a single PlatformStack and introduces a "bootstrap-container" deploy pattern — CDK owns infra + a stable placeholder image/Lambda code, and application code ships out-of-band via AWS APIs (ECR push → update-function-code / register-task-definition+update-service / update-agent-runtime / s3 sync). Inlined resources are lifted into ~40 focused constructs, per-stack workflows are replaced by platform.yml (CFN) + backend.yml (code), and cutover-platform-as-bootstrap.md documents the one-time migration.

Engineering quality is high — clean construct decomposition, a genuinely good cutover doc, materially improved test gating, and graceful bootstrap handlers. But there are three blocking issues plus several should-fixes.

What's done well ✅

  • Test gating is a real improvement. Every backend image build and code-deploy job has needs: test-backend (full uv run pytest tests/ -v); platform.yml/frontend-deploy.yml gate on test-infra/test-frontend. At merge-base no workflow ran pytest as a blocking gate — this closes that gap.
  • OIDC role-assumption with static-key fallback; permissions: id-token: write scoped per job; no secrets echoed; actions SHA-pinned with # vX comments.
  • Bootstrap handlers fail safe — pure stdlib, health checks pass, real traffic gets 503 during the window; services-stable passes.
  • integration.test.ts asserts Fn::ImportValue count == 0 — a nice guard validating the single-stack collapse.
  • Same-stack SSM→typed-ref purge (avoiding the first-deploy chicken-and-egg on AWS::SSM::Parameter::Value) is the correct fix and well-explained.

🔴 Blocking

1. IAM regression: OAuth / connector grants dropped — will break /connectors/* and external-MCP OAuth at runtime.
The grant extraction dropped statements the old code documented as required:

  • app-api (app-api-iam-grants.ts:399-410) now grants only GetWorkloadAccessTokenForUserId + GetWorkloadIdentity. Old app-api-stack.ts:1186-1268 additionally granted, with the comment "Without these, /connectors/{id}/{status,initiate,disconnect,complete} return 503": GetResourceOauth2Token, CompleteResourceTokenAuth; CreateTokenVault + Create/Update/Delete/Get/ListOauth2CredentialProvider; and secretsmanager:CreateSecret/DeleteSecret/PutSecretValue/UpdateSecret/TagResource on bedrock-agentcore-identity!default/oauth2/*.
  • inference-api (inference-api-iam-roles.ts:223-228) dropped GetResourceOauth2Token (old inference-api-stack.ts:574-581, comment "called by the agent loop's tool gating"). The oauth2-secret read is kept, but the control-plane action that mints the token is gone.

git grep confirms GetResourceOauth2Token / CompleteResourceTokenAuth / *Oauth2CredentialProvider appear nowhere in the new infra tree. The app-api-iam-grants.ts docstring claims it "attaches all required IAM policy statements" — inaccurate. Please confirm against current connector/OAuth usage and restore the missing grants (or document why they're intentionally gone).

2. Bootstrap-image invariant does not hold for ECS + AgentCore Runtime under CFN.
Both compute constructs bake the bootstrap image directly into the CFN resource with no escape hatch:

  • app-api-service-construct.ts:214image: ecs.ContainerImage.fromDockerImageAsset(bootstrapImage)
  • inference-agentcore-construct.ts:282containerUri: bootstrapImage.imageUri

The comments assert the real out-of-band image is "left untouched on subsequent deploys." That only holds when the task-def/runtime resource is otherwise unchanged. ECS task defs are immutable and update-agent-runtime is full-replacement — so when any property changes (a new env var, CPU/memory, authorizer config), CFN re-registers/updates with the synthesized template carrying image = <bootstrap 503 stub> and repoints the service/runtime to it, reverting the live image. Compounding it: platform.yml (paths-filtered to infrastructure/lib/constructs/** etc.) only runs cdk deploy and does not re-ship backend code afterward, so an infra-only change strands app-api and inference-api on the 503 stub until someone manually runs backend.yml. (The nightly pipeline chains them correctly; the standalone platform.yml path is the hazard. The two Lambdas are genuinely immune since Code is one property CFN leaves alone.)

Recommend one of: (a) platform.yml chains a backend re-ship after a successful cdk deploy; (b) a CFN escape hatch so CDK stops managing the image (EXTERNAL deployment controller / addPropertyDeletionOverride on containerUri / SSM-backed image param). Either way, correct the construct comments for the ECS + Runtime cases.

3. Teardown can't destroy the new architecture — and silently reports success.
scripts/teardown/destroy.sh:61-73 uses a hardcoded PARALLEL_STACKS list of legacy 9-stack names + FOUNDATION_STACK="InfrastructureStack", iterated with a stack_exists-skip. There is no dynamic prefix discovery. ${prefix}-PlatformStack — the only stack this PR deploys — is not in the list, so teardown.yml against a post-cutover environment skips everything and exits 0 "successfully" while deleting nothing. The cutover doc's claim that it "deletes every stack whose name starts with ${prefix}-" is also false given this implementation. Add PlatformStack (keep legacy names for migration) or switch to genuine describe-stacks-by-prefix discovery.

🟡 Should-fix

  • Docs are materially stale — and CLAUDE.md is unchanged. As the agent source-of-truth this will mislead: Deploy-order still describes the 9-stack order + CDK_ARTIFACTS_ENABLED/CDK_MCP_SANDBOX_ENABLED flags (removed here); the "New CDK stack" rule points at test/stack-dependencies.test.ts (deleted) and scripts/stack-<name>/ (all deleted). infrastructure/README.md still claims a two-stack architecture with a BackendStack + scripts/backend/deploy.sh (neither exists in the final state). .github/docs/deploy/step-04-deploy.md was not updated and is fully stale.
  • Lost security test coverage. Deleted app-api-stack.test.ts / security-best-practices.test.ts / inference-api-stack.test.ts held meaningful assertions now gone: BffCookieSigningKey Decrypt-only (no kms:GenerateDataKey), "no policy has both Action:"*" and Resource:"*"", S3 encryption + public-access-block, DDB SSE. The new app-api-iam-grants.ts (411 lines) and inference-api-iam-roles.ts (291 lines) are not exercised by any test — the new suite is mostly resourceCountIs/existence checks. A policy-level test would have caught blocker testing pipelines #1. (The 9 local config.test.ts failures are environmental — stale untracked lib/*.js; the committed suite is green at 361/361 in clean CI.)
  • Unconditional provisioning is an un-noted behavior/cost change. Removing the enabled flags means artifacts + mcp-sandbox + fine-tuning (previously default false) are now always created — new CloudFront distributions, DDB, S3, SageMaker IAM. Intended, but worth an explicit release-note/PR-description callout for existing deployers.
  • restore-data.yml setup-uv has no version: pin — resolves latest uv at runtime, violating the exact-pin convention (backend.yml/nightly pin 0.7.12).
  • Cross-workflow concurrency overlap: backend.yml, platform.yml, and the nightly pipeline use disjoint concurrency groups, so a push deploy and a nightly run could both update-service/update-agent-runtime the same service. Low probability; consider an environment-keyed deploy lock.

🟢 Nits

  • compute-content-hash.sh:94 expands a possibly-empty array under set -u; on macOS bash 3.2 the no---manifest (artifact-render) path errors. CI (bash 5) is fine; use "${MANIFESTS[@]+...}".
  • scripts/frontend/deploy.sh references ${AWS_REGION:?} but load-env.sh exports only CDK_AWS_REGION; CI sets it at job level but local runs fail. Mirror export AWS_REGION="${CDK_AWS_REGION}".
  • frontend-deploy.yml build job isn't gated on test-frontend (only deploy is) — wasted CPU, not a safety hole.
  • Stale config.artifacts.enabled doc-comments in artifact-render-token-secret-construct.ts:25 and spa-distribution-construct.ts:48; five vacuous no-assertion tests in config.test.ts's "Boolean Parsing" block; ~10 "stays in BackendStack" comments in platform-stack.ts the final collapse made false.
  • deploy-runtime-image-if-changed.sh rebuilds the update-agent-runtime payload from a hardcoded allow-list — a maintenance hazard if the control API adds required fields.

Recommendation

Strong architecture, not yet mergeable as-is. Blockers #1 (OAuth grants) and #3 (teardown) are fast fixes; #2 (image revert) needs an explicit design answer plus comment corrections. The doc drift and a regression-guard IAM test should ideally land in the same PR since they're part of the contract this refactor changes.

colinmxs added 9 commits May 29, 2026 10:08
Background: `ssm.StringParameter.valueForStringParameter()`
synthesises an `AWS::SSM::Parameter::Value<String>` CFN parameter.
CFN resolves those parameters BEFORE any of the stack's resources
are created. With the single-PlatformStack architecture, dozens of
in-stack SSM reads were trying to read parameters that the same
stack was supposed to publish — first-deploy CFN failure was
'Unable to fetch parameters from parameter store.'

Fix: replaced every in-construct `valueForStringParameter` call
with a typed CDK ref pulled from a new `PlatformComputeRefs`
bundle that PlatformStack.wireCompute() builds once and threads
through both compute constructs:

  - new file: lib/constructs/platform-compute-refs.ts (interface +
    docstring explaining the deadlock)
  - lib/platform-stack.ts: wireCompute() builds refs from the
    existing public readonly fields, passes to both
    InferenceAgentCoreConstruct and AppApiServiceConstruct.
  - constructs/inference-api/inference-agentcore-construct.ts:
    24 SSM reads → ref lookups
  - constructs/inference-api/inference-api-iam-roles.ts:
    20 SSM reads → ref lookups (function takes refs as param now)
  - constructs/app-api/app-api-environment.ts:
    `resolveAppApiSsmParams` renamed to `resolveAppApiParams` and
    rewritten to source from refs.
  - constructs/app-api/app-api-iam-grants.ts:
    9 SSM reads + 28 string-arn props → single `refs` prop
  - constructs/app-api/app-api-service-construct.ts:
    9 SSM reads + 9 `fromAttributes` imports → typed refs
    direct-use.

Cleanup: 102 SSM publishes deleted across 31 construct files.
The values were being published only for in-stack CDK reads
(now via refs) or were never consumed at all. Survivors are the
~16 SSM paths actually needed by deploy scripts (build,
deploy-ecs-service, deploy-runtime-image, deploy-image-lambda,
deploy-lambda-code, frontend deploy) and e2e tests:

  /admin/managed-models-table-name
  /app-api/cluster-name + service-name + task-def-family
  /artifacts/render-function-name
  /auth/cognito/bff-app-client-id + user-pool-id
  /frontend/bucket-name + distribution-id + url
  /inference-api/runtime-id
  /network/alb-url + ecs-cluster-name
  /quota/user-quotas-table-name
  /rag/ingestion-function-name
  /rbac/app-roles-table-name

(image-tag + code-hash params are written by build scripts, not
by CDK, so they never had publishes here.)

Tests: 24 hasResourceProperties assertions for now-deleted SSM
publishes were dropped. The platform-stack.test.ts SSM-count
assertion was updated from 'at least 40' to 'between 8 and 20'
with a comment explaining the survivor set.

Full gauntlet:
  - infra tsc + jest 361/361
  - backend supply-chain + architecture + reflection 52/52
  - restore-data 28/28
Got caught by git add -A in the previous SSM cleanup commit. Add
to .gitignore so it doesn't recur.
Audit of every env var the CDK sets vs every os.environ name the
Python code reads found six functional mismatches. All six would
have caused the container to fall back to default values (the
second arg to os.environ.get()) — which means runtime fallbacks,
wrong-table writes, or silent feature breakage.

1. SHARED_CONVERSATIONS_TABLE_NAME was wired to ragAssistantsTableName
   (with a // TODO comment, no less). The shared-conversations
   table exists as a separate sibling DDB resource. Sharing a chat
   would have written to / read from the assistants table. Fix:
   add sharedConversationsTable to PlatformComputeRefs (already
   exposed on PlatformStack), thread sharedConversationsTableName
   through AppApiSsmParams, point the env var at it.

2. DYNAMODB_EVENTS_TABLE → DYNAMODB_QUOTA_EVENTS_TABLE.
   backend/src/agents/main_agent/config/constants.py defines
   EnvVars.DYNAMODB_QUOTA_EVENTS_TABLE = 'DYNAMODB_QUOTA_EVENTS_TABLE'.
   Quota event recording was hitting the default table name.

3. FINE_TUNING_JOBS_TABLE_NAME → DYNAMODB_FINE_TUNING_JOBS_TABLE_NAME.
   backend/src/apis/app_api/fine_tuning/routes.py reads
   'DYNAMODB_FINE_TUNING_JOBS_TABLE_NAME' (default 'fine-tuning-jobs').

4. FINE_TUNING_ACCESS_TABLE_NAME → DYNAMODB_FINE_TUNING_ACCESS_TABLE_NAME.
   backend/src/apis/app_api/fine_tuning/repository.py same shape.

5. FINE_TUNING_DATA_BUCKET_NAME → S3_FINE_TUNING_BUCKET_NAME.
   backend/src/apis/app_api/fine_tuning/s3_service.py reads
   'S3_FINE_TUNING_BUCKET_NAME' (default 'fine-tuning-data').

6. FINE_TUNING_PRIVATE_SUBNET_IDS → SAGEMAKER_SUBNET_IDS.
   backend/src/apis/app_api/fine_tuning/sagemaker_service.py reads
   'SAGEMAKER_SUBNET_IDS'. With the wrong name, training jobs would
   have been launched without a VPC config.

Audit also surfaced a few CDK-set-but-Python-never-reads entries
(OAUTH_CLIENT_SECRETS_ARN, OAUTH_TOKEN_ENCRYPTION_KEY_ARN,
AGENTCORE_MEMORY_TYPE, COGNITO_ISSUER_URL). They're harmless dead
weight, not bugs — left in place to avoid mass churn and because
they document configurable values for future code.

Tests: tsc clean, jest 361/361.
Catches the bug class introduced in 3fd8cdd (Phase 3 stack
implementation) and 902f5de (app-api decomposition) where I
renamed env vars on the CDK side without verifying what Python
reads. TypeScript compiled clean, IAM accepted the strings, CDK
tests all passed — but the running container fell back to
defaults silently because the env var name didn't match what the
code expected. Bug was only visible by reading Python +
TypeScript side-by-side, which is exactly what this test does
automatically.

Test extracts:
  - every quoted ALL_CAPS identifier from any .py file under
    backend/src (catches os.environ.get/getenv, os.environ[...],
    constant assignments like _CALLBACK_URL_ENV = 'NAME', and
    EnvVars-class indirections)
  - every env var key from CDK env blocks in app-api-environment,
    app-api-service-construct, inference-agentcore-construct,
    artifact-render-lambda-construct, rag-ingestion-lambda-construct

Asserts:
  1. Every CDK-set env var is read by ≥1 Python file OR is in an
     INTENTIONAL_NOT_READ_BY_PYTHON allow-list with a documented
     reason. Catches typos like 'DYNAMODB_EVENTS_TABLE' (set,
     never read because Python reads DYNAMODB_QUOTA_EVENTS_TABLE).
  2. The allow-list itself is honest — entries that ARE read by
     Python or are NO LONGER set by CDK fail the test.
  3. Helpful failure message includes did-you-mean suggestions
     using difflib.get_close_matches against Python env vars.

Real orphans found and removed:
  - rag-ingestion-lambda BEDROCK_REGION (no Python reader; lambda
    uses AWS_REGION via boto3 default path)
  - inference-agentcore-construct ENABLE_AUTHENTICATION (no
    Python reader; ENABLE_QUOTA_ENFORCEMENT is the real flag)

INTENTIONAL_NOT_READ_BY_PYTHON allow-list (4 entries):
  - AGENTCORE_MEMORY_TYPE, COGNITO_ISSUER_URL,
    OAUTH_CLIENT_SECRETS_ARN, OAUTH_TOKEN_ENCRYPTION_KEY_ARN
    (documented as 'set for audit / future code, not currently
    read by python'; harmless dead weight, not bugs).

Going forward: any CI run that surfaces a new env var on either
side without the matching name on the other side fails this
test with a precise diff and a did-you-mean hint. Same shape as
test_backup_coverage.py — a 'reflection' test that locks the
contract between two halves of the codebase that don't otherwise
share a type system.

Tests: tsc clean, jest 361/361, backend pytest 57/57 (52 + 5
new), restore-data 28/28.
…ents

Issue 3: Stale BackendStack references throughout comments. Phase 7
absorbed BackendStack and the resulting paragraphs of historical
narrative weren't useful for maintaining the code. Stripped 20+
lines of refactor-history commentary across 9 files; kept only
the comments that describe non-obvious 'why this code looks like
this' constraints (same-stack SSM deadlock, lambda-Function-URL
circular dependency, etc.).

Issue 4: spaDistribution + spaDistributionDomainName were stub-
filled with 'undefined as unknown as IDistribution' / 'as string'
in the constructor and populated by a separate wireSpaDistribution()
method called from bin/. If a caller forgot to invoke that method,
the type system silently lied and downstream access produced a
confusing TypeError. Inlined the SpaDistribution + RagCorsUpdater
constructs into the constructor — the values they depend on
(ALB DNS via AlbDnsConstruct, the SPA bucket) are all available by
that point in the constructor, so there's no reason for the
two-phase pattern. Dropped wireSpaDistribution() entirely; props
are populated synchronously and unconditionally.

Issue 5: platform.yml path triggers + concurrency. Added
'infrastructure/bootstrap-assets/**' to the (currently commented-
out) path-trigger list so changes to bootstrap container images
trigger a Platform deploy on push.

Considered the cancel-in-progress: true on develop suggestion but
reverted to false-always per existing supply-chain policy. The
test_concurrency_config.py invariant exists explicitly to prevent
CDK-mid-deploy cancellation leaving CloudFormation in
ROLLBACK_IN_PROGRESS / UPDATE_ROLLBACK_FAILED states. The serial-
queue tradeoff is what the team picked, and the safety risk on
production is real.

Issue 6: Clarified the wireCompute() ordering comment to make
explicit that the construction order is for human readability —
CDK token resolution + the CFN dependency graph handle ordering at
synth/deploy time. A future reader can rearrange the blocks
without breaking correctness; the order is just there to match
how the data flow reads top-to-bottom.

Tests: tsc clean, jest 361/361, backend pytest 57/57.
#7: Two-test ssm-safety suite that walks the synthesized template
and asserts:
  - No AWS::SSM::Parameter::Value parameter references an SSM path
    that this same stack also creates via AWS::SSM::Parameter
    (the publish-then-read first-deploy deadlock).
  - Any remaining AWS::SSM::Parameter::Value parameter is in the
    EXTERNALLY_SEEDED_SSM_PATHS allowlist (CDK bootstrap version,
    build-time image tags, content hashes — all written before any
    CFN run).

Would have failed the pre-Phase-7 stack at synth time. Now a
permanent regression guard.

#8: Per-image build duplication in backend.yml. Extracted the
shared shape (configure-aws-credentials → optional QEMU+buildx for
linux/arm64 → bash scripts/build/build-one.sh) into a composite
action at .github/actions/build-and-push-image/. The 3 build jobs
now call it via a 'platform: linux/arm64' input on inference-api
and default amd64 on the others. Net: -45 lines in backend.yml,
single source of truth for the build flow.

Tests: tsc clean, jest 363/363 (+2 from ssm-safety), backend
pytest 57/57.
Confirmed regression. The pre-Phase-7 app-api-stack and
inference-api-stack roles granted a wider set of AgentCore Identity
data-plane and control-plane actions; the decomposition into
app-api-iam-grants.ts / inference-api-iam-roles.ts kept only
GetWorkloadAccessTokenForUserId + GetWorkloadIdentity, which would
break /connectors/* and any agent tool that needs a vaulted token.

Cross-referenced what Python actually calls:
  shared/oauth/agentcore_identity.py    GetResourceOauth2Token
  shared/oauth/agentcore_registrar.py   Create/Update/Delete/Get/List Oauth2CredentialProvider
  app_api/connectors/routes.py          GetResourceOauth2Token + CompleteResourceTokenAuth

Cross-referenced every action name against the canonical Service
Authorization Reference at
  https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonbedrockagentcore.html
to confirm it exists as an IAM action and to scope resources
correctly. CreateTokenVault from the user's report does NOT exist
as an IAM action — the default token vault is auto-created by
AgentCore Identity, no grant needed. All 7 new actions are
documented and resource-scoped.

App API gains:
  bedrock-agentcore:GetResourceOauth2Token
  bedrock-agentcore:CompleteResourceTokenAuth
  bedrock-agentcore:CreateOauth2CredentialProvider
  bedrock-agentcore:UpdateOauth2CredentialProvider
  bedrock-agentcore:DeleteOauth2CredentialProvider
  bedrock-agentcore:GetOauth2CredentialProvider
  bedrock-agentcore:ListOauth2CredentialProviders
plus the full secretsmanager lifecycle on
  secret:bedrock-agentcore-identity!default/oauth2/*
(was GetSecretValue+DescribeSecret only on the inference-api role).
The registrar needs Create/Put/Update/Delete/Tag to add, rotate,
and remove providers via the admin UI.

Inference API gains:
  bedrock-agentcore:GetResourceOauth2Token
The agent loop's tool gating short-circuits to vaulted tokens via
this call before invoking external MCP tools / connectors. Without
it the agent 503s on any federated-tool turn.

Resources are scoped to this account's AgentCore Identity surface
(token-vault/*, oauth2credentialprovider/*, workload-identity-
directory/*, workload-identity/*) rather than the previous '*'.

Tests: tsc clean, jest 363/363 (including the SSM safety guard
added in the same review pass).
… trap

The pre-fix shape baked the bootstrap container URI directly into
the synthesized template at the ECS task definition's Image and
the AgentCore Runtime's containerUri. That works fine for the
no-op-deploy case (CFN sees no diff, leaves resources alone), but
breaks when the compute resource itself changes:

  - AWS::ECS::TaskDefinition: every property except Tags is
    'Update requires: Replacement'. Adding an env var, bumping
    CPU/memory, or swapping a role forced CFN to register a new
    revision carrying the bootstrap stub Image and repoint the
    service at it. Verified at:
    https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-resource-ecs-taskdefinition.html

  - AWS::BedrockAgentCore::Runtime: AgentRuntimeArtifact is
    'No interruption', but update-agent-runtime is full-replacement
    at the API level. CFN re-sent the bootstrap containerUri on any
    property change. Verified at:
    https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-bedrockagentcore-runtime.html

Lambda is genuinely safe — CFN explicitly does NOT detect S3/ECR
drift on the Code property, so a stable bootstrap asset hash means
CFN never calls UpdateFunctionCode and the live code is preserved.

Fix: read Image and containerUri from SSM at CFN deploy time. The
build pipeline already writes the live image URI to
/<prefix>/<svc>/image-tag via deploy-ecs-service-one.sh and
deploy-runtime-image-one.sh. CFN now resolves whatever's in SSM at
update time, so any task-def or Runtime re-registration uses the
live image, not the bootstrap stub.

Implementation:
  - app-api-service-construct.ts: image is now
    ContainerImage.fromRegistry(valueForStringParameter('/<prefix>/app-api/image-tag')).
    DockerImageAsset is kept (still published by cdk-assets) for
    first-deploy seed. CfnOutput exposes the asset hash so the
    seed script can construct the URI without parsing Fn::Sub.
    Both ECR repos (cdk-assets + per-service) get explicit
    grantPull on the task execution role since fromRegistry
    doesn't auto-grant.
  - inference-agentcore-construct.ts: same shape for
    AgentRuntimeArtifact.containerConfiguration.containerUri.

First-deploy bootstrap problem: AWS::SSM::Parameter::Value<String>
template parameters require the SSM path to exist before CFN
starts resolving template parameters. New script
scripts/stack-bootstrap/seed-image-tags.sh runs between cdk synth
and cdk deploy:
  1. Reads the bootstrap asset hash from the CfnOutput in
     cdk.out/<stack>.template.json
  2. Constructs the cdk-assets ECR URI deterministically from
     account/region + the hash
  3. Calls aws ssm put-parameter only if the param doesn't already
     exist (idempotent — build pipeline owns the param after first
     deploy)

scripts/platform/deploy.sh now does:
  synth → cdk-assets publish → seed → deploy
The cdk-assets publish step is needed before the seed so the URI
the seed writes is actually pullable. cdk deploy would do the
same publish, but seeding has to happen before the stack-update
API call, which is inside cdk deploy.

The SSM-safety test added in 4bbff4f already allowlists the
image-tag paths as externally-seeded, so no test changes needed
there.

New test compute-image-resolution.test.ts asserts the synthesized
shape: 5 assertions covering parameters, Refs at the consumer
sites, and the bootstrap-hash CfnOutputs.

Tests: tsc clean, jest 368/368 (+5), backend pytest 57/57,
restore-data 28/28.

Validated by reading authoritative AWS docs:
https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-resource-ecs-taskdefinition.html
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-bedrockagentcore-runtime.html
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html
Docs (agent source-of-truth):
  - CLAUDE.MD: deploy order rewritten for single PlatformStack;
    File Creation Rules updated to reflect that there are no
    separate CDK stacks anymore (constructs only); infra build
    commands matched to current scripts.
  - .github/copilot-instructions.md: same updates.
  - .github/docs/deploy/step-04-deploy.md: full rewrite. 4
    workflows (platform → backend → frontend → seed), no enable
    flags, accurate description of bootstrap pattern + SSM-resolved
    images. The previous version still described the 9-workflow
    flow with optional Artifacts/MCP-Sandbox/SageMaker stacks.
  - infrastructure/README.md: was already current; the only
    BackendStack reference is in the explicit Legacy Migration
    section.

Tests (restored coverage previously held by deleted
app-api-stack.test.ts / inference-api-stack.test.ts /
security-best-practices.test.ts):
  - test/security-policy.test.ts (NEW) — 7 assertions:
      - Action:* + Resource:* prohibition across IAM Policy +
        ManagedPolicy + inline role policies (covers the CDK
        overflow-policy split for large inline policies)
      - BFF cookie KMS grant is kms:Decrypt only — caught and
        fixed a real regression where the new app-api-iam-grants
        was granting Encrypt + GenerateDataKey on the BFF cookie
        key (the cookie codec doesn't call KMS — Secrets Manager
        handles transparent decryption of the data-key secret).
        Split into two SIDs:
          OAuthTokenEncryptionKeyAccess (Decrypt+Encrypt+GenerateDataKey on the OAuth key)
          BffCookieSigningKeyDecrypt (Decrypt-only on the BFF key)
      - Every bucket has BucketEncryption, full
        PublicAccessBlock, and enforceSSL — caught and fixed a
        regression where RagDocumentsBucket and FrontendBucket
        were missing enforceSSL: true (no aws:SecureTransport=false
        Deny in their bucket policies).
      - Every DDB table has SSE enabled.
  - Test count: 370 (was 363; -5 vacuous Boolean Parsing tests
    that had no assertions, +7 new security, +5 from earlier
    compute-image-resolution suite, +2 from earlier ssm-safety).

Cleanups:
  - .github/workflows/restore-data.yml: pin uv version to 0.7.12
    (was using setup-uv with no version input → resolves latest
    at runtime, violated the exact-pin convention used by every
    other workflow).
  - scripts/frontend/deploy.sh: mirror AWS_REGION from
    CDK_AWS_REGION after sourcing load-env.sh. CI was setting it
    at the job level, but local runs would fail with 'AWS_REGION
    is required'.
  - scripts/build/compute-content-hash.sh: bash-3.2-safe MANIFESTS
    expansion. The plain '${MANIFESTS[@]}' under set -u errors on
    empty arrays in macOS's default bash. Use
    ${MANIFESTS[@]+"${MANIFESTS[@]}"} which is a no-op when the
    array has elements and safe when it doesn't.
  - .github/workflows/frontend-deploy.yml: gate the build job on
    test-frontend. Previously only the deploy job needed
    test-frontend, so a failing test would still trigger an
    expensive ng build.
  - Stripped stale config.artifacts.enabled doc-comments from
    artifact-render-token-secret-construct.ts and
    spa-distribution-construct.ts (artifacts is unconditional
    now).
  - Removed 5 vacuous Boolean Parsing tests in config.test.ts
    that had no assertion bodies — they were silently passing.

Release notes: added an [Unreleased] section at the top of
RELEASE_NOTES.md documenting the unconditional-provisioning
behavior change for existing deployers (Artifacts + MCP Sandbox +
fine-tuning are now always created on cdk deploy), the
SSM-resolved image-tag mechanism, the first-deploy seed step
ordering, the BFF cookie + enforceSSL fixes, and the OAuth /
connector IAM grant restoration. The narrative summarises the
work in flight on main since beta.27 was cut.

Tests: tsc clean, jest 370/370, backend pytest 57/57, restore-data
28/28.
Brings in 17 commits since the merge base (95d51d9):
- MCP-apps: persist UI resources across refresh, fullscreen iframe, sandbox CSP alignment
- Admin: curated model catalog, model-form redesign, managed-models list polish
- Spreadsheet analysis: fail-fast size guard
- File upload: duplicate-name misclassification fix
- Manage sessions: 'x items selected' wording
- Messages: entry animation no longer traps fixed children
- Sessions: drop dead duplicate app_api sessions messages service
- Kaizen weekly review + research scan housekeeping

Resolves no expected conflicts — develop work was frontend / UX / dead
code cleanup; this branch's work is infra IAM / CDK / docs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants