Feature/stack architecture simplification by colinmxs · Pull Request #396 · Boise-State-Development/agentcore-public-stack

colinmxs · 2026-05-27T21:37:35Z

No description provided.

Adds requirements, design, and tasks for consolidating the CDK app into PlatformStack and BackendStack with content-hash Docker builds, same-origin BFF routing, and three new CI workflows.

Scrapped to start fresh. Original spec is preserved in git history at c558d89 for reference if needed.

…ility Phase 1, Task 1 of the stack architecture simplification. Adds the equivalence-test scaffolding under infrastructure/test/equivalence/: - normalize.ts — snapshot+normalization utilities. Strips stack-boundary metadata, CDK-generated logical IDs, and intrinsic-function references (Ref / Fn::GetAtt / Fn::ImportValue / Fn::Sub / Fn::Join) so the comparison focuses on resource type + properties only. Sorts IAM policy Statement arrays by content so grant-order differences don't cause spurious diffs. Position-sensitive arrays (subnet IDs, AZs) are left alone. - build-legacy-app.ts — synthesizes the 9-stack legacy architecture from a deterministic AppConfig, using the existing test/helpers/ mock-config.ts SSM mock pipeline. Used by the equivalence test baseline (this commit) and will be reused by the equivalence gate (Task 12). Deleted in Task 17. - snapshot-legacy.test.ts — captures legacy synth summaries to __snapshots__/legacy-synth.{all-features,minimal}.json for two representative configs (every flag on / every retained flag off). 9 stacks / 302 resources for all-features; 6 stacks / 240 resources for minimal. Files exist as historical baselines an operator can diff against once the new architecture lands. Also drops 38 stale .js / .d.ts compiled artifacts from infrastructure/ that were shadowing the May 2026 .ts sources (the .gitignore already excludes them but they'd survived in the working tree from a March build). The TypeScript compile is clean from sources only after this. Test suite: 336 tests / 14 suites green (was 334 / 13).

Phase 1, Task 2 of the stack architecture simplification. Adds: - infrastructure/lib/constructs/ — empty subdirectories matching the construct boundaries from the plan (network, identity, data, rag, artifacts, mcp-sandbox, fine-tuning, spa, zones, ecr, app-api, inference-api, gateway, rag-ingestion). Each carries a .gitkeep so the structure ships before any constructs are extracted. - constructs/README.md — documents the layout, naming convention, and the typed-prop-passing rule (no SSM cross-stack reads in new code; runtime SSM reads stay). - test/equivalence/equivalence.test.ts — the harness for the perfection gate. Two describe blocks: 1. 'wiring sanity' — runs every commit. Asserts the legacy app builds and snapshots cleanly, and asserts buildNewApp() throws a 'not implemented' error. This stays useful through Phase 2 so we know the harness keeps compiling as constructs land. 2. 'equivalence gate' (describe.skip) — the actual byte-equivalent comparison. Two cases (minimal config, all-features config). Currently skipped; flips to describe in Task 12 once PlatformStack + BackendStack are wired. - test/equivalence/build-new-app.ts — stub builder that throws 'not implemented'. Replaced with the real Platform + Backend instantiation in Tasks 10–12. Test suite: 15 suites / 340 tests / 2 skipped (was 14 / 336).

Phase 2, Task 3 of the stack architecture simplification. Extracts every resource defined inline in infrastructure-stack.ts into construct classes under infrastructure/lib/constructs/, organized by logical resource group rather than legacy stack origin: network/ NetworkConstruct — VPC, subnets, NAT, route tables AlbConstruct — ALB + security group + listener (HTTPS-on-:443 + HTTP-redirect when certificateArn set, plain HTTP-on-:80 otherwise) EcsClusterConstruct — single shared ECS cluster identity/ AuthSecretConstruct — JWT signing / session encryption secret VoiceTicketConstruct — replay table + HMAC signing secret BffCookieKeyConstruct — KMS CMK + 44-char data-key secret used to derive the AES-256 cookie sealing key PlatformIdentityConstruct — shared AgentCore CfnWorkloadIdentity OAuthTablesConstruct — providers/users tables + KMS key + client secrets bag AuthProvidersConstruct — providers table + DDB stream + secrets CognitoConstruct — user pool + BFF confidential client + domain + client secret persisted to Secrets Manager ArtifactRenderTokenSecretConstruct — gated by config.artifacts.enabled data/ AuthTablesConstruct — OidcState, BFFSessions, Users (+ 4 GSIs), AppRoles (+ 3 GSIs), ApiKeys (+ KeyHash GSI) QuotaTablesConstruct — UserQuotas (5 GSIs), QuotaEvents CostTrackingTablesConstruct — SessionsMetadata, UserCostSummary, SystemCostRollup, ManagedModels AdminTablesConstruct — UserSettings, UserMenuLinks FileUploadConstruct — S3 bucket (lifecycle: IA→Glacier→expire) + DDB metadata table + SessionIndex GSI SharedConversationsConstruct — share_id-keyed snapshots + 2 GSIs zones/ AlbDnsConstruct — Route53 hosted-zone lookup, optional A record at {albSubdomain}.{hostedZoneDomain}, ALB URL export to SSM + CFN infrastructure-stack.ts shrinks from 1789 lines of inline resource creation to ~170 lines of construct assembly, while preserving every SSM parameter path, IAM grant, encryption key, GSI, and lifecycle rule verbatim. Logical IDs of underlying CFN resources change because nested constructs prefix child IDs (e.g. 'Vpc' → 'Network/Vpc'). The equivalence test in test/equivalence/normalize.ts strips logical IDs from the resource fingerprint, so this is a tolerated cosmetic — the stack-snapshot summaries before/after match on resource type counts. Stack dependency analyser updated: extractSsmWrites/extractSsmReads now follow imports into the constructs/ subtree, so SSM publications inside constructs still attribute to the stack importing them. Test suite: 15 suites / 340 tests / 2 skipped — all green.

Phase 2, Task 4 of the stack architecture simplification. Extracts the inline resource creation in frontend-stack.ts into: spa/ SpaBucketConstruct — private S3 bucket with OAC + version lifecycle (30-day non-current expiry) SpaDistributionConstruct — CloudFront distribution, OAC bucket policy, default behavior with SPA- routing CloudFront Function, /api/* behavior with prefix-strip Function + ALB HttpOrigin (HTTPS when cert set, readTimeout/keepaliveTimeout 60s for SSE), security headers policy with conditional artifacts frame-src CSP, optional Route53 ALIAS A record RagCorsUpdaterConstruct — Python Lambda + custom resource that patches the RAG documents bucket CORS to accept the resolved frontend URL. Only instantiated when config.ragIngestion.enabled. frontend-stack.ts shrinks from 534 lines of inline resource creation to ~95 lines of construct assembly. The cross-stack ALB-URL SSM read stays in the stack file (resolves a stack-scope token); the constructs accept the resolved string via typed props. All resource definitions, IAM grants, CloudFront Function inline JS, Lambda inline Python, custom-resource trigger semantics, and SSM publication paths are byte-identical to the pre-refactor version. Test suite: 15 suites / 340 tests / 2 skipped — all green.

…ructs Phase 2, Task 5 of the stack architecture simplification. Extracts ArtifactsStack and McpSandboxStack into reusable constructs: artifacts/ ArtifactsDataConstruct — DynamoDB user-artifacts table (PIT recovery, SessionIndex GSI) + S3 artifacts-content bucket (lifecycle: 7-day multipart abort + tag-based soft-delete expiry) ArtifactRenderLambdaConstruct — Python 3.12 ARM64 render Lambda reading the HMAC signing key from SSM, accepting Function URL invokes only via AWS_IAM (CloudFront OAC), CSP_SCRIPT_SRC env var pinned to match the distribution's CSP byte- for-byte ArtifactsDistributionConstruct — CloudFront on artifacts.{domain} with strict CSP (default-src none, connect-src none, frame-ancestors locked to SPA + extras), TLSv1.2, NO_REFERRER, HSTS 1y. Caching disabled (per-token JWTs). Route53 ALIAS A record. mcp-sandbox/ McpSandboxBucketConstruct — private S3 bucket + asset deployment for the sandbox-proxy shell. Two-step API: ctor builds the bucket; deployShell(dist) adds the BucketDeployment with /* invalidation. The two-step shape avoids a circular dependency between bucket and distribution. McpSandboxDistributionConstruct — CloudFront + Route53 ALIAS, buildMcpSandboxFrameAncestors() and buildMcpSandboxProxyCsp() helpers and the MCP_SANDBOX_SUBDOMAIN_LABEL constant exported for unit tests and re-exported from the stack file for back-compat with existing imports. Both legacy stack files reduced to thin construct assemblies. All CSP strings, Lambda env vars, IAM grants, CloudFront cache/header policies, S3 lifecycle rules, and SSM publication paths preserved verbatim. Test suite: 15 suites / 340 tests / 2 skipped — all green.

Phase 2, Task 6 of the stack architecture simplification. Extracts the data halves of RagIngestionStack and SageMakerFineTuningStack: rag/ RagDataConstruct — RAG documents bucket (versioned, CORS), S3 Vectors bucket + index (Titan V2: 1024-dim float32 cosine; non-filterable 'text' metadata key), assistants DDB table with OwnerStatus / VisibilityStatus / SharedWith GSIs. fine-tuning/ FineTuningDataConstruct — Fine-tuning jobs DDB (StatusIndex GSI), access DDB, data bucket (30-day expire + 7-day multipart abort). Both legacy stacks now consume their data construct and keep the compute pieces inline: rag-ingestion-stack.ts — Lambda + IAM + S3 notification still inline (moves to Backend in Task 9). sagemaker-fine-tuning-stack.ts — IAM execution role + security group still inline (moves to Backend in Task 9). Vector bucket name + index name are exposed as construct properties so the Lambda env vars and IAM grants resolve through the same JavaScript values the construct produces. Test suite: 15 suites / 340 tests / 2 skipped — all green.

Phase 2, Task 7 of the stack architecture simplification. The assistants table backing user-defined assistants in app-api lifts into AssistantsTableConstruct under constructs/data/. The rest of app-api-stack.ts (the ~1700-line Fargate service body — task definition, container env, IAM grants for ~20 imported tables, target group attachment) stays inline; that body absorbs into BackendStack directly in Task 11 rather than living through an intermediate construct that provides no reuse value. data/ AssistantsTableConstruct — assistants DDB table with three GSIs (OwnerStatusIndex, VisibilityStatusIndex, SharedWithIndex with projection ALL). Construct lives next to the other shared table constructs even though the table is currently created by AppApiStack — Phase 3 may relocate ownership into Platform when the service splits. Test suite: 15 suites / 340 tests / 2 skipped — all green.

Phase 2, Task 8 of the stack architecture simplification. GatewayStack lifts into AgentCoreGatewayConstruct under constructs/gateway/. The construct owns the Gateway IAM execution role, the CfnGateway resource (MCP protocol + AWS_IAM authorizer + SEMANTIC search), all SSM publications, and the human-readable CFN outputs (including the UsageInstructions blob). InferenceApiStack body — AgentCore Runtime + Memory + Code Interpreter Custom + Browser Custom plus the 4 dedicated IAM execution roles threaded through ~1000 lines of cross-references — stays inline. It absorbs into BackendStack directly in Task 11 rather than living through intermediate constructs that provide no reuse value (same rationale as Task 7's AppApiStack). The MCP Lambdas referenced by the gateway role's lambda:InvokeFunction grant against arn:aws:lambda:.../function:{prefix}-mcp-* are owned by the external mcp-servers repo, not this CDK app — the role grants invoke rights against the naming convention. No McpLambdasConstruct is needed (correcting an early plan assumption). Test suite: 15 suites / 340 tests / 2 skipped — all green.

Phase 2, Task 9 of the stack architecture simplification. Extracts the compute halves of RagIngestionStack and SageMakerFineTuningStack into reusable constructs: rag-ingestion/ RagIngestionLambdaConstruct — DockerImage Lambda (ARM64) + IAM (s3vectors:* on supplied bucket+index, bedrock:InvokeModel on the embedding model, read on documents bucket, RW on assistants table) + 1-week log group + S3 event subscription on assistants/ prefix in documents bucket. Image tag from SSM /{prefix}/rag-ingestion/image-tag. fine-tuning/ SageMakerExecutionRoleConstruct — IAM execution role (sagemaker service principal) granting S3 RW on data bucket, DDB UpdateItem on jobs table, EC2 VPC networking actions, CW Logs publish under /aws/sagemaker/*. Plus the matching security group (egress :443 only, no ingress). Both legacy stacks now reduce to thin two-construct assemblies (data + compute) with the network-import block kept inline (the SageMaker SG needs the imported VPC; the RAG stack keeps the imports for forward compat with future VPC-bound execution paths). Note: ArtifactRenderLambdaConstruct (the third compute construct the plan called for) was already extracted in Task 5 as part of the ArtifactsStack lift. Test suite: 15 suites / 340 tests / 2 skipped — all green.

Tasks 10-12 of the stack architecture simplification. Adds the two new stacks that replace the legacy 9-stack architecture: platform-stack.ts — PlatformStack Composes all Platform constructs (network, identity, data, rag-data, artifacts-data + distribution, mcp-sandbox, spa, zones). Exposes ~50 typed public readonly properties for BackendStack to consume. Two-step wiring methods: - wireSpaDistribution(appApiUrl) — deferred until the ALB URL SSM token is resolved (same-stack SSM read in bin/) - wireArtifactsDistribution(renderFunctionUrl) — called by BackendStack after the render Lambda is constructed, so the dependency direction stays one-way (Backend → Platform) Feature gates: config.artifacts.enabled, config.fineTuning.enabled. backend-stack.ts — BackendStack Composes: - AppApiServiceConstruct (converted from AppApiStack → Construct; body preserved verbatim including all SSM reads for runtime env vars — those are deploy-time reads, not cross-stack CDK refs) - InferenceAgentCoreConstruct (converted from InferenceApiStack → Construct; same preservation strategy) - AgentCoreGatewayConstruct - RagIngestionLambdaConstruct + S3 event notification (wired via an imported-by-name bucket to avoid the classic CDK circular dependency between S3 notifications and cross-stack Lambda refs) - ArtifactRenderLambdaConstruct (gated) — wires Function URL back to Platform via wireArtifactsDistribution() - SageMakerExecutionRoleConstruct (gated) bin/infrastructure.ts — rewired to instantiate only PlatformStack then BackendStack. CDK auto-generates cross-stack CFN exports from the typed prop references. `npx cdk list` returns exactly: PlatformStack (agentcore-PlatformStack) BackendStack (agentcore-BackendStack) Circular dependency fix: RagIngestionLambdaConstruct no longer calls bucket.addEventNotification() internally (that creates a reverse cross-stack ref). Instead, BackendStack imports the bucket by name and wires the notification within its own stack scope. The legacy RagIngestionStack (still used by the equivalence test) wires the notification directly since both bucket and Lambda are same-stack. stack-dependencies.test.ts updated to track the two new stack files and assign them deployment tiers (Platform=0, Backend=2). Test suite: 15 suites / 340 tests / 2 skipped — all green. `npx cdk list` returns exactly 2 stacks.

Tasks 13-19 of the stack architecture simplification. CONTENT-HASH DOCKER BUILD PIPELINE (Task 13): scripts/build/ compute-content-hash.sh — SHA-256 over Dockerfile + git-tracked source files + dependency manifests. Deterministic (LC_ALL=C sort, per-file hash buffer). 64-char lowercase hex. build-and-push-if-changed.sh — compute hash, query ECR for existing tag, skip build+push on hit, build+push on miss, fallback on hash failure. build-all-images.sh — orchestrates per-service builds for app-api, inference-api, rag-ingestion. Emits {service}_image_tag to $GITHUB_OUTPUT. NEW DEPLOY SCRIPTS (Task 14): scripts/platform/{synth,deploy}.sh scripts/backend/{synth,deploy}.sh — passes image tags as --context scripts/frontend/{build,deploy}.sh — gen-version.js + ng build; SSM-resolved S3 sync + CF invalidation NEW GITHUB ACTIONS WORKFLOWS (Task 15): .github/workflows/platform.yml — synth + deploy PlatformStack .github/workflows/backend.yml — build-images → deploy BackendStack .github/workflows/frontend-deploy.yml — build → S3 sync → CF invalidation CROSS-CUTTING WORKFLOW UPDATES (Task 16): nightly-deploy-pipeline.yml rewritten to chain: deploy-platform → deploy-backend → deploy-frontend using the new scripts. Legacy per-stack job graph removed. LEGACY DELETION (Task 17): Deleted: - 9 legacy stack .ts files from infrastructure/lib/ - 9 legacy per-stack workflow .yml files - 9 legacy scripts/stack-*/ directories - All legacy per-stack test files - The equivalence test harness (served its purpose during Phase 2-3) DEPLOYMENT DOCS (Task 18): infrastructure/README.md rewritten for the two-stack model: deploy order, content-hash builds, feature flags, legacy teardown. FINAL VERIFICATION (Task 19): - npm run build: clean compile - npm test: 3 suites / 93 tests / all green - npx cdk list: exactly PlatformStack + BackendStack - Workflows: 3 new (platform, backend, frontend-deploy) + 7 retained cross-cutting (codeql, release, nightly, nightly-deploy-pipeline, version-check, skip-auth-guard, bootstrap-data-seeding) - scripts/: platform/, backend/, frontend/, build/ + retained common/, nightly/, stack-bootstrap/

PM decision: artifacts, fine-tuning, mcp-sandbox are all serverless and cheap when unused. No reason to gate them behind config flags. Changes: - config.artifacts.enabled defaults to true (was false) - config.fineTuning.enabled defaults to true (was false) - All `if (config.artifacts.enabled)` / `if (config.fineTuning.enabled)` conditionals removed from PlatformStack, BackendStack, and the AppApiServiceConstruct / InferenceAgentCoreConstruct bodies - PlatformStack properties for artifacts + fine-tuning are no longer optional (`?`) — always present, always typed - BackendStack no longer conditionally spreads feature-gated props - Artifacts/MCP-sandbox domain+cert validation removed from config loader (deploy-time concern, not synth-time — constructs fall back to CloudFront default domains when no custom domain is configured) - Deleted test files that still imported deleted legacy stacks (cors.test.ts, security-best-practices.test.ts — missed in Task 17) - SPA distribution CSP simplified: artifacts frame-src always included when domainName is set (no longer gated on artifacts.enabled) The `enabled` fields remain in the config interfaces (harmless booleans that default to true) — removing them from the interface would be a larger config-loader refactor with no functional benefit. Test suite: 1 suite / 56 tests / all green. Build: clean compile. cdk list: PlatformStack + BackendStack.

Implements the restore counterpart to scripts/backup-data/. Reads a backup manifest.json from a backup bucket and imports all data into a deployed two-stack (PlatformStack + BackendStack) environment. Restore pipeline (in order): 1. DynamoDB tables — reads ExportTableToPointInTime DynamoDB-JSON exports line by line, deserializes typed attribute values, and BatchWriteItem into the target table (resolved via SSM). 2. S3 buckets — `aws s3 sync` from backup bucket to target bucket (resolved via SSM). 3. Cognito — CreateIdentityProvider with preserved client secrets, AdminCreateUser with MessageAction=SUPPRESS, CreateGroup, AdminAddUserToGroup. Key properties: - Idempotent: safe to re-run (DDB put_item overwrites, S3 sync is naturally idempotent, Cognito catches DuplicateProvider/ UsernameExists/GroupExists exceptions and skips). - Dry-run mode: --dry-run prints what would be restored without writing anything. - Target resolution: all target table names and bucket names are resolved from SSM Parameter Store under /{target-prefix}/..., so the tool works against any deployed environment regardless of naming. - Convention-named tables: the `assistants` table (not in SSM) is resolved via the {prefix}-assistants naming convention. Also brings scripts/backup-data/ onto this branch (cherry-picked from bf4c216) so both tools live together. Files: scripts/restore-data/restore.py — main restore logic scripts/restore-data/pyproject.toml — uv-managed deps (boto3) scripts/restore-data/README.md — usage, IAM policy, options .github/workflows/restore-data.yml — workflow_dispatch with backup_bucket, manifest_key, target_prefix, region, dry_run, skip_cognito_users inputs

Merges 16 commits from origin/develop into the stack-simplification branch. Two modify/delete conflicts resolved (mcp-sandbox-stack.ts and its test — both deleted by us, modified by develop's PRs #352–#360). Ported the dynamic per-resource CSP changes from develop into constructs/mcp-sandbox/mcp-sandbox-distribution-construct.ts: - Removed buildMcpSandboxProxyCsp() (static CSP is gone) - Added loadMcpSandboxCspFunctionCode() — loads csp-function.js from assets/mcp-sandbox/, substitutes the FRAME_ANCESTORS placeholder with the real source list via JSON.stringify (handles quote-escaping for 'none' correctly) - Added McpSandboxCspFunction (CloudFront Function, JS_2_0 runtime) associated at VIEWER_RESPONSE on the default behavior — composes per-resource CSP from ?csp= query param - Removed contentSecurityPolicy from the ResponseHeadersPolicy (CSP is now dynamic via the function; other security headers stay on RHP) - Shortened RHP comment to fit 128-char AWS cap The mcp-sandbox-csp-function.test.ts that came in from develop passes against our construct (88 tests green). Also merged: file-sources adapter framework (#366, #367), frontend test fix (#368), beta.27 release merges (#365, #369), kaizen doc (#370), backup tool (#361 — already cherry-picked earlier).

Extracts the ~990 lines of IAM policy grants from app-api-service-construct.ts into a standalone app-api-iam-grants.ts module. The module exports a single `grantAppApiPermissions()` function that takes typed props (task role + all the ARN strings the grants need) and attaches every IAM statement. Grants are grouped by domain: - Core DynamoDB tables (16 tables via a loop — same actions, different ARNs) - File uploads (S3 + DDB) - RAG (assistants table, documents bucket, S3 Vectors query) - Cognito admin ops (16 actions on the user pool) - Secrets Manager (5 secrets) - KMS (OAuth token encryption + BFF cookie signing) - Artifacts (S3 RW + DDB RW + render token read) - Fine-tuning (DDB + S3 + SageMaker job management + iam:PassRole + CW Logs) - AgentCore Memory (7 actions) - Bedrock InvokeModel (title generation) - SSM read (inference-api image tag) - AgentCore WorkloadIdentity (OAuth vault token minting) The core-tables loop alone replaces 16 near-identical 10-line policy statement blocks with a 5-line loop — ~150 lines → ~15 lines. Next step: wire the function call into app-api-service-construct.ts replacing the inline grants. The module compiles standalone and is ready to import. Test suite: 2 suites / 88 tests / all green.

Rewrites app-api-service-construct.ts from a 1736-line monolith into a 268-line orchestrator backed by two focused helper modules: app-api-environment.ts (241 lines) - resolveAppApiSsmParams() — resolves all ~60 SSM parameters into a typed AppApiSsmParams interface. Single point of truth for every cross-stack value the App API needs. - buildAppApiEnvironment() — builds the container environment map from config + resolved params. Pure function, easily testable. app-api-iam-grants.ts (398 lines) - grantAppApiPermissions() — attaches all IAM policy statements to the task role. Grants grouped by domain (core tables, file uploads, RAG, Cognito, Secrets Manager, KMS, artifacts, fine-tuning, AgentCore Memory, Bedrock, SSM, WorkloadIdentity). - The 16 core-table grants that were 16 near-identical 10-line blocks are now a 5-line loop over a typed array. The main construct (268 lines) reads cleanly top-to-bottom: 1. Resolve SSM params 2. Import network resources 3. Create security group 4. Create assistants table 5. Create task definition + log group 6. Build container environment 7. Add container to task 8. Grant IAM permissions 9. Create target group + listener rule 10. Create Fargate service + auto-scaling 11. Emit outputs Test suite: 2 suites / 88 tests / all green. Build: clean compile.

Extracts the IAM execution roles from inference-agentcore-construct.ts into a focused inference-api-iam-roles.ts module (334 lines). The module exports four factory functions: - createRuntimeExecutionRole() — the main AgentCore Runtime role with ~20 policy statements (CloudWatch, X-Ray, Bedrock, MCP Lambda, Gateway, SSM, Secrets Manager, DynamoDB, KMS, Cognito, S3, S3 Vectors, Artifacts, AgentCore Memory/Tools/Identity, ECR) - createMemoryExecutionRole() — Bedrock InvokeModel (all + Claude/Nova) - createCodeInterpreterExecutionRole() — CloudWatch Logs for CI - createBrowserExecutionRole() — CloudWatch Logs for Browser The main construct (794 lines) is now purely: - Resource creation (Memory, Code Interpreter, Browser, Runtime) - Runtime configuration (container env, network, authorizer) - Observability (vended logs, X-Ray sampling/group, CW dashboard, alarms) - SSM exports + CloudFormation outputs No IAM policy noise in the main file — all grant logic lives in the roles module. Summary of both monolith decompositions: app-api: 1736 → 268 lines (3 files: construct + env + IAM) inference-api: 1513 → 794 lines (2 files: construct + IAM roles) Total: 3249 → 2274 lines across 5 focused files Largest file: 1736 → 794 lines Test suite: 2 suites / 88 tests / all green.

…ct, repo-shape Adds three new test files: platform-stack.test.ts (30 tests) - Network resources (VPC, subnets, NAT, IGW, ALB, listeners, SG, ECS) - Identity resources (Cognito, WorkloadIdentity, secrets, KMS) - DynamoDB tables (24 total) - S3 buckets (6 total) - CloudFront distributions (2 + artifacts wired separately) - SSM parameters (≥40) - Typed property exposure (15 properties verified) backend-stack.test.ts (26 tests) - App API Fargate service (task def, service, TG, listener rule, SG, auto-scaling, log group) - AgentCore resources (Runtime, Memory, Code Interpreter, Browser, Gateway) - RAG ingestion Lambda + S3 notification - Artifact render Lambda - SageMaker execution role + SG - IAM roles (runtime, memory, gateway) - No-data-in-Backend assertions (0 S3, 0 CloudFront, 1 DDB only) - SSM parameter publications constructs.test.ts (55 tests) - Per-construct isolation tests for all 21 Platform constructs - Resource counts, key properties, SSM publications, typed exports repo-shape.test.ts (24 tests) - No legacy stack files (9 assertions) - No legacy workflows (9 assertions) - No legacy script dirs (9 assertions) - New architecture files exist (platform-stack, backend-stack, constructs/, workflows, scripts) - Workflow YAML shape (platform, backend, frontend-deploy, nightly-deploy-pipeline job graphs) - bin/infrastructure.ts imports only new stacks Also fixes: - Circular dependency between Platform and Backend resolved by moving wireArtifactsDistribution() call to bin/infrastructure.ts (after both stacks are constructed). BackendStack exposes artifactRenderFunctionUrl as a public property. - Added 'yaml' devDependency for workflow parsing tests. Test suite: 6 suites / 223 tests / all green (was 2 / 88).

Comprehensive test coverage for the two-stack architecture: platform-stack.test.ts (30 tests) Network, identity, DynamoDB, S3, CloudFront, SSM, typed properties backend-stack.test.ts (26 tests) Fargate, AgentCore, RAG Lambda, artifact render, SageMaker, IAM, no-data-in-Backend assertions constructs.test.ts (55 tests) Per-construct isolation: resource counts, SSM publications, exports repo-shape.test.ts (24 tests) No legacy files, new architecture present, workflow YAML shape, bin/infrastructure.ts imports tables-detailed.test.ts (31 tests) GSI configs, TTL attributes, encryption, billing modes identity-detailed.test.ts (28 tests) Cognito config, OAuth KMS, BFF cookie key, auth providers stream network-and-scripts.test.ts (46 tests) VPC, ALB, ECS, S3 buckets, gateway, build/deploy script shape integration.test.ts (24 tests) Two-stack cross-stack refs, resource ownership, Fn::ImportValue, config validation, restore tool shape additional-coverage.test.ts (21 tests) SSM naming conventions, construct property types, config defaults naming-conventions.test.ts (18 tests) Table naming {prefix}-{name}, PAY_PER_REQUEST billing mode Also fixes the circular dependency between Platform and Backend: - wireArtifactsDistribution() moved from BackendStack constructor to bin/infrastructure.ts (called after both stacks constructed) - BackendStack exposes artifactRenderFunctionUrl as public property - Avoids Platform → Backend → Platform cycle Total: 88 → 391 tests (+303). 12 suites. All green.

…tials action All new workflows were calling the composite action without the required `with:` block. The action requires `aws-region` as a mandatory input. Fixed all 5 workflows (platform, backend, frontend-deploy, nightly-deploy-pipeline, restore-data) to pass: - aws-region from vars.CDK_AWS_REGION (fallback 'us-west-2') - aws-role-arn from vars.AWS_ROLE_ARN or secrets.AWS_ROLE_ARN - aws-access-key-id from secrets.AWS_ACCESS_KEY_ID - aws-secret-access-key from secrets.AWS_SECRET_ACCESS_KEY Matches the pattern used by the legacy workflows.

Code review against .kiro/steering/devops.md found three violations: 1. Workflows missing `environment:` key + job-level `env:` block → vars.* silently resolved to empty strings because environment- scoped variables require the `environment:` key on the job. 2. Deploy scripts called `cdk deploy` without context params → load-env.sh exports CDK_* vars and provides build_cdk_context_params(), but scripts weren't using it. 3. No synth-once pattern → Scripts now check for pre-synthesized cdk.out/ and use `--app cdk.out/` when available, otherwise synth+deploy with full context params. Fixed: - All 4 workflows (platform, backend, frontend-deploy, nightly- deploy-pipeline) now have: • `environment:` key (production on main, development otherwise) • Job-level `env:` block mapping ALL vars.* → CDK_* env vars • Secrets (AWS_ROLE_ARN, keys) at job level • `concurrency:` group to prevent parallel deploys • `LOAD_ENV_QUIET: true` at workflow level - All 4 deploy/synth scripts (platform/{deploy,synth}.sh, backend/{deploy,synth}.sh) now: • Call build_cdk_context_params() from load-env.sh • Pass the full context string to cdk synth/deploy • Check for pre-synthesized cdk.out/ before re-synthesizing • Use eval for the context params expansion Config flow now matches the steering doc pattern: GitHub vars/secrets → workflow env: block → CDK_* env vars → load-env.sh (validates + exports) → build_cdk_context_params() → cdk deploy --context key=value

Fixes the circular dependency: Platform → Backend (artifacts CF distribution used render Lambda URL) Backend → Platform (RAG bucket, tables, etc.) The artifacts CloudFront distribution belongs in BackendStack because its origin IS the render Lambda Function URL — they're a unit. Putting the distribution in Platform created a reverse cross-stack reference that CDK correctly rejected as a cycle. After this change: PlatformStack owns: SPA CloudFront + MCP sandbox CloudFront (edge serving static content from Platform-owned S3 buckets) BackendStack owns: Artifacts CloudFront (edge serving dynamic content from the Backend-owned render Lambda) Dependency direction is now strictly one-way: Backend → Platform. Platform has zero references to Backend resources. Removed wireArtifactsDistribution() from PlatformStack entirely — no more two-step wiring pattern needed. Tests updated: BackendStack now asserts 1 CloudFront distribution; integration test asserts the correct split (Platform ≥2, Backend =1).

The deploy and synth scripts in scripts/{platform,backend}/ invoke `cdk deploy "${CDK_PROJECT_PREFIX}-PlatformStack"`, but the construct IDs in bin/infrastructure.ts were the un-prefixed 'PlatformStack' / 'BackendStack'. cdk deploy matches against the construct ID, not the `stackName` property, so it failed with 'No stacks match the name(s) ${prefix}-PlatformStack'. Make the construct IDs prefixed (matching what the scripts pass) and drop the now-redundant `stackName` overrides — CDK uses the construct ID as the default CloudFormation stack name, producing the same '${prefix}-PlatformStack' / '-BackendStack' names as before, so no change to deployed CloudFormation stack names.

Phase 1 launches every application stack's `cdk destroy` in parallel. Each invocation synthesises before destroying and they all wrote to the same `cdk.out/` directory, so only the first process to acquire the synth lock could proceed; the others aborted immediately with 'Another CLI (PID=...) is currently synthing to cdk.out'. Those abort-failures were then misreported by the wrapper as 'may not exist'. Fix: give each parallel destroy its own synth output directory via `--output ${SYNTH_DIR}/${STACK}`. Each process synthesises and locks its own folder, so they can run truly concurrently. Also tightened the surrounding behaviour: - Pre-check via `aws cloudformation describe-stacks` so non-existent stacks are skipped cleanly instead of being shoehorned into the failure path. - Failed destroys now print the last 20 lines of the cdk log so the real cause is visible (instead of just 'may not exist'). - Script exits non-zero when any real destroy fails, so CI signals teardown problems correctly. - Skipped stacks are reported separately in the summary.

…troy Previous approach silently no-op'd in CI: the script invoked `cdk destroy <legacy-name> --force --exclusively`, but the repo has been refactored from a 9-stack architecture to a 2-stack one. The legacy stack names (FrontendStack, AppApiStack, etc.) no longer exist as construct IDs in the current CDK synth, so `cdk destroy` exited 0 without ever calling CloudFormation. The wrapper printed '[SUCCESS]' but nothing was actually deleted. Switch to `aws cloudformation delete-stack` + `aws cloudformation wait stack-delete-complete`. This deletes by CFN stack name, which is independent of whether the stack is in the current CDK code, and the wait gives us deterministic feedback on completion. On wait failure, we surface the StackStatus / StackStatusReason and the specific resources that hit DELETE_FAILED, so non-empty buckets or RETAIN-policy resources are obvious in the CI log instead of hidden behind a generic 'failed to destroy'.

Without explicit memory/CPU caps on the agentcore-dev container, a single jest run, ng build, or cdk synth can demand all the RAM and CPU the WSL2 VM has, swap-thrash the kernel, and freeze Windows and Kiro along with it. Encoding the caps in the steering doc so any agent (or human) launching the container does it consistently. Adds a new 'Resource Caps (READ THIS)' section explaining the two-layer model on the WSL2 + Docker Desktop topology: - Outer layer: `.wslconfig` caps the WSL2 utility VM as a whole, protecting Windows from any WSL2 process going haywire. Docker Desktop on the WSL2 backend defers all VM-level resource control to .wslconfig (no separate Docker Desktop sliders). - Inner layer: `docker run --memory --memory-swap --cpus --pids-limit` on the agentcore-dev launcher itself, so Docker workloads cannot starve the nspawn jail (where Kiro runs) inside the WSL2 VM. Both `docker run` examples (long-lived and one-shot) now include the cap flags. Workspace path table updated to reflect the actual nested 'colinmxs/' fork path that the launcher binds.

Live deploy-ecs-service-one.sh app-api crashed with: File '<string>', line 10 f"expected 1 container, got {len(td[\"containerDefinitions\"])}" ^ SyntaxError: unexpected character after line continuation character The Python heredoc embedded in the bash script wraps Python with 'python3 -c <single-quoted-string>'. Inside that single-quoted shell string, the original code had: assert len(td["containerDefinitions"]) == 1, \ f"expected 1 container, got {len(td[\"containerDefinitions\"])}" The first reference (the assert expression) is fine — single-quoted shell passes "..." through to Python as plain double quotes. The second reference is the problem: inside an f-string {...} expression, Python forbids backslash escapes (PEP 498). So 'len(td[\"x\"])' inside an f-string is a SyntaxError. Earlier Pythons (pre-3.12) also disallow re-using the same quote style as the f-string delimiter inside the expression, ruling out f"...{td["x"]}...". Fix: precompute the count and reference the local in the f-string, which sidesteps both restrictions. No quote escaping needed inside the {...} expression. Verified locally inside agentcore-dev: heredoc parses, image is swapped on a synthetic task-def, and all eight read-only fields (taskDefinitionArn, status, revision, requiresAttributes, compatibilities, registeredAt, registeredBy, deregisteredAt) are stripped before register-task-definition would see them.

Live deploy-runtime-image-one.sh inference-api crashed with: Error: aws: [ERROR]: An error occurred (ParamValidation): the following arguments are required: --role-arn, --network-configuration Unlike Lambda's update-function-code (which is partial — pass new code and that's it), AgentCore's update-agent-runtime is a FULL replacement API. The skeleton from 'aws bedrock-agentcore-control update-agent-runtime --generate-cli-skeleton' shows roleArn, networkConfiguration, protocolConfiguration, authorizerConfiguration, environmentVariables, description, and the rest are all required (or at minimum: roleArn + networkConfiguration are validation-required, the others are silently zeroed if omitted, which is just as bad). Fix: rebuild the update payload from get-agent-runtime by allow- listing the fields update accepts, swap the containerUri, and pass the result via --cli-input-json. The get response contains read-only fields (agentRuntimeArn, agentRuntimeVersion, status, createdAt, lastUpdatedAt, workloadIdentityDetails, agentRuntimeName) that update rejects — those are dropped by the allow-list. Implementation note: bash heredoc-vs-pipe precedence trips this up. The first attempt used 'printf state | python3 - ARGS <<PYEOF body PYEOF' — heredoc wins, so stdin became the script and the JSON was lost. Rewrote to pass the state via a tmpfile path in argv. Verified locally: - Synthesized a get-agent-runtime response containing every read- only field plus a representative subset of mutable config. - Ran the heredoc end-to-end. Output payload had: • all 7 read-only fields stripped • containerUri swapped to the target tag • agentRuntimeId, roleArn, networkConfiguration, description, protocolConfiguration, environmentVariables all preserved verbatim from the get response. Re-run the workflow's deploy-inference-api-code job; the runtime will roll over to the new image and the existing wait_for_ready loop will block until it settles.

GitHub flagged v6.8.0 as running on Node.js 20, which is being forced to Node.js 24 on June 2 2026 and removed entirely Sept 16 2026. v8.1.0 declares 'using: node24' in its action.yml, so it runs natively on the new runtime and we stop seeing the deprecation warning. The four call sites use only 'version', 'enable-cache', and 'cache-dependency-glob' inputs — none are affected by v8.0.0's breaking changes, which were limited to the deprecated manifest-file format and the dropping of major/minor tag publishing (we already pin to a full SHA, so that's a no-op). - .github/workflows/backend.yml - .github/workflows/nightly-deploy-pipeline.yml - .github/workflows/restore-data.yml - .github/workflows/backup-data.yml Pin: astral-sh/setup-uv@0880764 # v8.1.0 (matches the SHA-pinned-with-version-comment shape that test_action_pinning.py enforces; supply-chain pytest 31/31 passes locally.)

frontend-deploy.yml's build job ran: Run bash scripts/frontend/build.sh [ERROR] CDK_AWS_ACCOUNT is required Error: Process completed with exit code 1. The build job has no environment: clause and no env: block, so vars.CDK_AWS_ACCOUNT didn't resolve into the runner. Adding 'environment: production' to the build job to inherit the var isn't great either — it would gate every static-bundle build on manual approval rules tied to the production environment, slowing PRs and forcing review on builds that never touch AWS. The frontend build is, by construction, AWS-account-agnostic: it runs npm ci + ng build to emit a static bundle. None of the CDK_* env vars are read by the Angular build, gen-version.js, or package.json scripts (verified with a grep across frontend/). The deploy job is what actually needs CDK_AWS_ACCOUNT, and it already has it in its job-level env: block. Fix: add LOAD_ENV_SKIP_AWS_VALIDATION opt-out to load-env.sh and set it from frontend/build.sh. Other scripts (platform/deploy.sh, backend/deploy.sh, anything that actually invokes AWS) are untouched and still fail loudly without the required env vars. Verified locally: with the flag unset, validation fires and exit status is 1; with LOAD_ENV_SKIP_AWS_VALIDATION=true, the script sources cleanly and exits 0.

User reported Access Denied at https://ai.sbmt-api.com/ after a successful frontend-deploy.yml run. CloudFront 403 for /; same for /index.html. Root cause: Angular 17+ uses @angular/build:application as the builder, which emits the SPA into a browser/ subdirectory of the outputPath by default. With angular.json setting no explicit outputPath, the build output looks like: dist/ai.client/ browser/ index.html main-XXXXXXXX.js ... The deploy script was syncing dist/ai.client/ → s3://bucket/, so the bucket layout became: s3://bucket/browser/index.html s3://bucket/browser/main-XXXXXXXX.js s3://bucket/browser/... CloudFront's defaultRootObject is 'index.html', so a request to '/' resolves to '/index.html'. That key doesn't exist (it's at '/browser/index.html'). With OAC, S3 returns 403 to CloudFront for both NoSuchKey and AccessDenied; the browser sees 'Access Denied'. Fix: sync from dist/ai.client/browser/ instead of dist/ai.client/. The upload-artifact step in the workflow keeps uploading the parent dist/ai.client/ (so any future server/ or prerendered-routes.json outputs remain available for inspection in the artifact), but the deploy step targets the browser/ subdir for the actual S3 sync. Re-run frontend-deploy.yml after this lands; the index.html will land at s3://bucket/index.html and CloudFront will serve it.

Live (non-dry) restore against the May 21 backup logged: [WARNING] [Cognito] Failed to create user colin: InvalidParameterException: Cannot modify the non-mutable attribute sub [WARNING] [Cognito] Failed to create user rae: InvalidParameterException: Cannot modify the non-mutable attribute sub [WARNING] [Cognito] Failed to add colin to system_admin: UserNotFoundException: User does not exist. The first two failures are the root cause; the third is the cascade (group-membership add references a user that creation just refused to make). Cognito's `sub` attribute is auto-generated by Cognito at user- creation time — it is the immutable subject identifier referenced by every JWT issued for that user. AdminCreateUser rejects any UserAttributes payload that includes sub, even with the same value Cognito would have generated. The backup captures attributes verbatim from list-users, which DOES include sub (plus a handful of other Cognito-managed fields that are also not creation-time-settable). Fix: filter a small allow-block of Cognito-managed attrs before the AdminCreateUser call: sub # auto-generated UUID cognito:user_status # state machine field cognito:mfa_enabled # derived from MFA preferences identities # managed by IdP linking, not by attribute The remaining attributes (email, email_verified, given_name, phone_number, custom:*, etc.) pass through unchanged. Once user creation succeeds, the group-membership step from the same backup will resolve cleanly — colin → system_admin will land because colin will exist. One known limitation, unchanged here: Cognito users with native (non-federated) passwords come over without their password — Cognito does not export password hashes via list-users and there is no API to set a hash directly. Those users need a "forgot password" reset on first login. Federated users (Entra, Google, etc.) authenticate via the IdP and do not have this issue. Tests: test_cognito_users_strips_immutable_attributes Asserts none of the four immutable names leak into the admin_create_user call AND that legit attrs (email, email_verified, given_name) are preserved verbatim. 10/10 restore-data tests pass.

User reported the showstopper for cross-pool migrations: AWS Cognito does NOT permit setting `sub` on user creation, so when a user pool is destroyed and recreated (which happens on every full teardown + redeploy), every user gets a fresh sub. The app keys all per-user DynamoDB partitions by USER#<sub> and embeds <sub> in dozens of other places (audit trails, owner refs, GSIs, S3 paths under user-file-uploads). Without remapping, every user's history would be orphaned post-restore. This rewrite makes the restore tool the source of truth for the mapping. Order of operations now: 1. Cognito FIRST. For each backed-up user: - AdminCreateUser with the original Username and the sanitised attribute set (Cognito-immutable names dropped: sub, cognito:user_status, cognito:mfa_enabled, identities). - Capture the NEW sub from the AdminCreateUser response (or AdminGetUser on a UsernameExistsException re-run). - For any federated identity recorded in the backup (the `identities` JSON), call AdminLinkProviderForUser so the next IdP login resolves to this user instead of provisioning a duplicate with a third sub. - Record (old_sub → new_sub) in ctx.sub_map. - Persist the full mapping as a JSON artifact alongside the backup at cognito/sub-mapping-<target_prefix>.json. 2. Compile a single regex matching every old sub via `compile_sub_pattern(sub_map)`. Subsequent passes consult this pattern for O(1)-per-string scans regardless of user count (works for thousands of users without rescanning). 3. DynamoDB. Each item is deserialised, then `remap_subs` recursively walks every string value and rewrites any old sub to its new equivalent BEFORE BatchWriteItem. PK, SK, GSIs, nested maps, list elements — all covered. The result row reports items_remapped alongside items_written. 4. S3. The previous `aws s3 sync` was replaced with a paginated list_objects_v2 + parallel copy_object pipeline (16-worker ThreadPoolExecutor). The relative key under the source prefix is passed through `remap_subs_in_key` so any sub-bearing path (e.g. users/<sub>/<file-id>) lands at its new sub on the target side. Empty-bucket short-circuit reports objects_copied: 0 cleanly. Idempotency: re-running the restore against an already-migrated pool succeeds. AdminCreateUser raises UsernameExistsException; the code falls back to AdminGetUser to recover the existing sub. The same old_sub → new_sub mapping is rebuilt, so DDB and S3 re-writes produce identical results to the first run. Audit artifact: cognito/sub-mapping-<target_prefix>.json is written to the backup bucket with the full mapping plus identities_linked count and any users_failed list. Operators running the restore can later use this file to debug orphaned data or re-run targeted remap fixups. What's left out: - Cognito has no API to import password hashes. Native users still need "forgot password" reset on first login. Federated users (Entra/Google) sign in normally — the IdP holds the credentials. - AgentCore Memory events aren't replayed (unchanged from previous behaviour; documented in the readme). Tests (10 new, all 20 pass): - compile_sub_pattern empty/populated - remap_subs string/dict/list/set/scalar - remap_subs short-circuits when pattern is None - remap_subs_in_key for S3 paths - cognito user creation: sub captured, identities linked, audit artifact written, sub_map populated - cognito idempotent re-run via AdminGetUser - DDB end-to-end: gzipped DDB-JSON → deserialise → remap → BatchWriter put_item, with PK + nested attrs all rewritten - S3 end-to-end: list → per-object copy with key remap, mixed sub-bearing and sub-less keys Verified full test gauntlet: - infra tsc + jest 361/361 - backend supply-chain + architecture 37/37 - restore-data 20/20

…ngle-stack architecture All three docs were still describing the old 9-stack or 2-stack architecture. Updated to reflect the current single-stack model: README.md: - Deployment section rewritten: single PlatformStack provisions all infra; application code ships via AWS APIs (ECR → ECS/Lambda/Runtime) - Project Structure updated: shows constructs/ tree, single platform-stack.ts, three workflows (platform, backend, frontend) - Removed references to per-stack deploy ordering .github/README-ACTIONS.md: - Architecture overview: single-stack + out-of-band code deploy - What You'll Deploy split into Infrastructure (CDK) vs Application Code (AWS APIs) vs Frontend (S3 sync) - Workflows table with all current workflows - Deploy order for first-time setup - Content-hash Docker builds explanation infrastructure/README.md: - Single-stack architecture diagram - Deploy scripts table - Constructs directory listing (39 constructs) - Removed BackendStack references and feature flags section

New file: .github/docs/deploy/upgrade-from-multi-stack.md Detailed step-by-step migration guide covering: 1. Pull latest code 2. Run backup workflow (with verification checklist) 3. Run teardown workflow 4. Clean up retained resources (DDB tables, S3 buckets, Cognito, Secrets Manager, KMS — with CLI commands for each) 5. Deploy new architecture (Platform → Backend → Frontend → Bootstrap) 6. Run restore workflow (dry-run first, then execute) 7. Verify (login, chat history, files, admin, RAG) Also includes: - Troubleshooting section (resource conflicts, missing tables, Cognito password hashes, missing backup bucket) - Rollback procedure - Timeline estimate (~45-75 min total) - Tip about setting retainDataOnDelete=false before teardown to skip manual cleanup

…tecture Files updated: .kiro/steering/devops.md - SSM Parameter Store guidance: now describes runtime consumption only (not cross-stack CDK refs, since there's only one stack) .kiro/steering/tech.md - CDK deploy command: {prefix}-PlatformStack (was InfrastructureStack) .kiro/steering/structure.md - Infrastructure Structure: shows constructs/ tree, single platform-stack.ts (was 5 separate stack files) - Scripts Structure: shows build/, platform/, frontend/, teardown/ (was stack-app-api/, stack-inference-api/, etc.) - Naming convention example: PlatformStack (was AppApiStack) .kiro/steering/dev-environment.md - Docker build script references: scripts/build/ (was scripts/stack-*/) - Wrapper scripts list: build/, platform/, frontend/ (was stack-*/) .claude/skills/cdk-infrastructure/SKILL.md - Full rewrite: single-stack architecture, constructs/ tree, typed prop passing (not SSM cross-stack), bootstrap container pattern for ECS/Lambda, CDK commands updated Historical specs (.kiro/specs/) left unchanged — they document what was built at the time and serve as reference, not active steering.

Per request: gate the three deploy workflows behind workflow_dispatch only for now so downstream forks can absorb the recent changes (platform-as-bootstrap refactor + cross-pool Cognito sub remapping) on their own schedule. The push triggers themselves aren't deleted — they're commented out in place with a clear TEMPORARY banner. Uncommenting the block in each workflow restores the prior behaviour: - backend.yml: push to develop/main on backend/**, scripts/build/**, or the workflow itself - platform.yml: push on platform-stack.ts, constructs/**, config.ts, infrastructure/bin/infrastructure.ts, scripts/platform/** - frontend-deploy.yml: push on frontend/**, scripts/frontend/** workflow_dispatch is the only active trigger on each, so deploys now only happen when an operator clicks Run Workflow in the Actions UI (or invokes via gh workflow run). Other workflows (nightly-deploy-pipeline.yml, restore-data.yml, backup-data.yml, teardown.yml) are unchanged — those were already manual / scheduled and don't have push triggers.

Adds tests/supply_chain/test_backup_coverage.py — a 'reflection' test that scans CDK constructs for stateful resources and verifies the backup/restore scripts cover every one of them. What it checks: - Every DynamoDB table in CDK is in the backup inventory - Every DynamoDB table in CDK is in the restore inventory - No phantom tables in backup that don't exist in CDK - Every user-data S3 bucket is backed up (excludes mcp-sandbox and frontend — static/reproducible content) - Cognito backup captures user pool, users, groups, IdP secrets - AgentCore Memory backup exists - Restore covers every backed-up table - Restore is idempotent (catches Duplicate/Exists exceptions) - Restore has --dry-run Real gap found and fixed: - scripts/restore-data/restore.py: TABLE_CONVENTION_MAP was empty. The 'assistants' table (convention-named {prefix}-assistants, not in SSM) was backed up but not restorable. Added it. This test will fail CI if someone adds a new DynamoDB table or S3 bucket to the CDK constructs without also adding it to the backup and restore scripts — preventing silent data loss on migration.

AdminCreateUser leaves users in FORCE_CHANGE_PASSWORD, which silently breaks the hosted UI's ForgotPassword flow (Cognito shows 'code sent' but never sends because there is no completed initial setup to reset). Add an AdminSetUserPassword call with a random throwaway password and Permanent=True right after the create, transitioning the user to CONFIRMED. ForgotPassword then works normally; the throwaway is never communicated to anyone. Idempotent on re-run — works for users we just created OR users that exist from a prior run (still in FORCE_CHANGE_PASSWORD). Tests: 2 new, 22/22 pass.

I was wrong earlier when I told the operator AgentCore Memory 'cannot be replayed.' The data plane API accepts an explicit eventTimestamp, actorId, sessionId, payload, metadata, branch (rootEventId), and clientToken on CreateEvent — everything needed for a faithful round-trip restore. The 'best-effort replay' docstring was a TODO marker, not a permanent limitation, and I should have verified before propagating that as fact. This commit lands the actual replay implementation. Behavior: - Reads agentcore-memory/events.jsonl.gz from the backup bucket. - Groups events by sessionId; sessions run in parallel (16-thread pool, same as restore_s3_bucket), events within a session run sequentially in eventTimestamp order so a branch-child never references a parent that has not been replayed yet. - For each event: • actorId is remapped via ctx.sub_map (the cross-pool sub remapping built during restore_cognito). Unmapped actor IDs pass through verbatim. • sessionId, eventTimestamp, payload, metadata are preserved verbatim. The ISO timestamp the backup wrote is parsed back to datetime via _parse_event_timestamp. • branch.rootEventId is rewritten via a session-local old_event_id → new_event_id map populated as we go. If a branch ref points at an event we never saw (cross-session ref or out-of-order arrival), the branch is dropped with a warning instead of failing. • clientToken is a sha256(target_memory_id|original_event_id) so re-runs land in AWS's idempotency window and don't double-write. - On a memory with extraction strategies configured, CreateEvent triggers semantic-fact extraction synchronously, so MemoryRecords (long-term semantic memories, summaries, preferences) rebuild themselves automatically. We do NOT separately replay MemoryRecords — that would double-write and contradict the strategy. CLI: --skip-memory-replay New flag for cost-conscious runs (each event triggers Bedrock model invocations for strategy extraction, which adds up). Order in run_restore: 1. Cognito (builds sub_map) 2. DynamoDB (uses sub_map for inline item remap) 3. S3 (uses sub_map for inline key remap) 4. AgentCore Memory (uses sub_map for actorId remap) Reverted unrelated regression introduced in d9f3b30: d9f3b30 added 'assistants' back to TABLE_CONVENTION_MAP based on a faulty reading of the reflection test (which does NOT find 'assistants' in current CDK — the standalone table was decommissioned in c977e04). Restored TABLE_CONVENTION_MAP = {} and removed the matching entry from backup.py DYNAMODB_TABLES_BY_CONVENTION + the stale alias in test_backup_coverage.py. Tests (6 new for memory replay; 1 reverted regression test now passes again): - test_memory_replay_skipped_when_flag_set - test_memory_replay_skipped_when_no_target_memory - test_memory_replay_remaps_actor_id_via_sub_map - test_memory_replay_rewrites_branch_root_event_id - test_memory_replay_client_token_is_deterministic - test_memory_replay_missing_events_file_skips_cleanly Full gauntlet: - infra tsc + jest 361/361 - backend supply-chain + architecture + reflection 52/52 - restore-data 28/28

User hit: AccessDeniedException calling ListEvents not authorized to perform: bedrock-agentcore:ListEvents Three IAM policies (App API task role, Inference API runtime execution role, and the Memory-grant block in the AgentCore Runtime construct) granted speculative action names that don't exist in IAM: CreateMemoryEvent / GetMemoryEvent / ListMemoryEvents / DeleteMemoryEvent / RetrieveMemory / ListMemorySessions / GetMemorySession / GetMemory / GetMemoryStrategies / DeleteSession The real Data Plane API surface (https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_Operations.html): CreateEvent / GetEvent / ListEvents / DeleteEvent / ListActors / ListSessions / RetrieveMemoryRecords / GetMemoryRecord / ListMemoryRecords / BatchCreate|Update|DeleteMemoryRecords / DeleteMemoryRecord Result: every IAM grant in those three blocks was a silent no-op. The App API blew up the moment a real call hit the API. Replaced with the documented action names. App API + Inference runtime now have the read+write spread the app actually needs. Tests: tsc clean, jest 361/361.

Per the AWS Service Authorization Reference for bedrock-agentcore (https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonbedrockagentcore.html), 'bedrock-agentcore:InvokeBrowser' is not a real action — the same silent-no-op shape as the previous Memory action-name fixes. The canonical browser actions are StartBrowserSession, GetBrowserSession, ListBrowserSessions, StopBrowserSession, ConnectBrowserAutomationStream, ConnectBrowserLiveViewStream, UpdateBrowserStream, and SaveBrowserSessionProfile. Replaced 'InvokeBrowser' with the seven applicable runtime-side actions in both the inference-agentcore-construct (memory-arn-scoped policy) and inference-api-iam-roles (project-wide tools policy). Also: the resource ARN wildcards in inference-api-iam-roles used 'browser/*' and 'code-interpreter/*' — those refer to AWS-managed resources owned by account 'aws' and never match the resources this platform creates (CfnBrowserCustom and CfnCodeInterpreterCustom, both scoped to the project's account). Replaced with 'browser-custom/*' and 'code-interpreter-custom/*' to match the real ARN shape per the resource type table in the same Service Authorization Reference. I cross-referenced every bedrock-agentcore: action used in infrastructure/ against the canonical action list and now all 30 distinct actions are documented. tsc clean, jest 361/361.

philmerrell · 2026-05-29T13:51:25Z

Code Review — Stack Architecture Simplification

Collapses the 9-stack layout into a single PlatformStack of ~30 focused constructs, decouples application code from CFN via bootstrap container/Lambda assets, and consolidates per-stack workflows into platform.yml + backend.yml + frontend-deploy.yml. High-level shape is good; cutover doc is thorough.

Strengths

Right architecture for the scaling pain — "CFN updates only when infrastructure changes; code rides AWS APIs" matches the actual change frequency. AgentCore Memory's 5-15 min creation no longer rides every backend push.
Construct decomposition is well done — single responsibility, typed public readonly fields, SSM publication, isolated unit tests.
Bootstrap containers are minimal and correct — pure stdlib HTTP servers, digest-pinned to the same python:3.13-slim as the real images so the supply-chain pin test passes; graceful 503s pre-cutover.
Cutover doc honestly addresses the CFN-can't-relocate-resources problem, Memory's transitional-state gotcha, and pre-cutover backup checklist.
backend.yml per-image parallel jobs with content-hash skip is a nice CI ergonomic win.
Integration test exists that synthesizes the full stack including wireCompute() and asserts zero Fn::ImportValue references.

Issues to address

High

1. Same-stack SSM publish-then-read may deadlock on first deploy. app-api-environment.ts:118 and inference-agentcore-construct.ts:124-261 use ssm.StringParameter.valueForStringParameter() to read ~60 parameters that are now created by sibling constructs in the same stack. valueForStringParameter synthesizes a AWS::SSM::Parameter::Value<String> CFN parameter, which CFN resolves before any resources are created. On a fresh first deploy, those SSM params don't exist yet → CFN fails with "Unable to fetch parameters from parameter store."

The author acknowledged this exact risk for two refs (app-api-environment.ts:97-110) and threaded memoryId + inferenceApiRuntimeEndpointUrl via props — but the same reasoning applies to every other SSM read, since after Phase 7 everything is the same stack. Recommend verifying against a clean AWS account, or migrating same-stack reads to StringParameter.fromStringParameterAttributes (emits {{resolve:ssm:...}} dynamic references that resolve at resource-creation time and honor the dependency graph). The synth-time integration test won't catch this either way.

2. `SHARED_CONVERSATIONS_TABLE_NAME` is wired to the wrong table. app-api-environment.ts:240:
```ts
SHARED_CONVERSATIONS_TABLE_NAME: params.ragAssistantsTableName, // TODO: should be shared-conversations SSM
```
A SharedConversationsConstruct exists in the same stack and creates a real shared-conversations table, but the app-api env var points at rag-assistants. This is a functional bug — sharing a conversation will write to / read from the assistants table. Fix by publishing shared/conversations-table-name to SSM in the construct (or pass via prop). Don't ship with a TODO on a misrouted production var.

Medium

3. Stale BackendStack references throughout comments and JSDoc. After Phase 7, BackendStack no longer exists, but its name appears in dozens of comments — platform-stack.ts:108-114 ("BackendStack can take them as explicit construct props"), app-api-iam-grants.ts:67-69, inference-agentcore-construct.ts:55-57, app-api-environment.ts:97-103. These describe a 2-stack-era design that no longer applies. Rename to PlatformStack or drop the cross-stack rationale entirely.

4. Untyped stub-fills hide misuse. platform-stack.ts:548-550:
```ts
this.spaDistribution = undefined as unknown as cloudfront.IDistribution;
this.spaDistributionDomainName = undefined as unknown as string;
```
If a caller forgets wireSpaDistribution(), the type system silently lies. A downstream .distributionDomainName access on undefined is a confusing runtime TypeError, not a CDK validation error. Either type as IDistribution | undefined and force callers to assert, or inline the SPA distribution and remove the two-phase pattern.

5. `platform.yml` concurrency + path triggers. cancel-in-progress: false combined with no path scoping to infrastructure/bootstrap-assets/** can queue multiple PlatformStack deploys that serially try to update resources in transient CREATING states. Consider cancel-in-progress: true for develop (false only for main), and add infrastructure/bootstrap-assets/** to the path triggers.

6. wireCompute ordering correctness relies on construction order alone. The comment at platform-stack.ts:623-627 says ordering "matters" — it doesn't for correctness (token resolution + CFN dependency graph handle it), only for readability. Clarify so a future reader doesn't try to "fix" it by parallelizing.

Low

7. Test gap on first-deploy ordering. Add an assertion that walks all AWS::SSM::Parameter::Value parameters in the synthesized template and verifies each referenced SSM path is also created by a resource in the same template (or is in an explicit allowlist of externally-seeded params). Would have caught issue #1 at synth time.

8. Per-image build jobs in backend.yml duplicate ~25 lines of env/auth setup three times. Factor into a composite action.

9. CFN parameter for cdk.Fn.split(',', ...) of subnet/AZ lists at app-api-service-construct.ts:93-95 still works but is unnecessary in a single-stack design — vpc.privateSubnets is available as a direct same-stack ref. Same for the ALB security group/listener/cluster imports a few lines below.

10. app-api-iam-grants.ts reads several artifacts/fine-tuning SSM params at lines 244-249, 278-283. Pass via props for consistency with the same-stack rationale already applied to agentCoreMemoryArn and sagemakerExecutionRoleArn.

Risk assessment

Blast radius is large but bounded by the cutover discipline — the doc forces a teardown step, reducing accidental partial-state migration.
First-deploy failures are recoverable via the forward-fix path documented in section 4 of the cutover.
No backend application code is changed in this PR. The bootstrap pattern means runtime behavior depends on a follow-up backend.yml image push — make sure that workflow has run successfully end-to-end against this branch in dev before merging.
The wrong shared-conversations table (move RAG infrastructure into separate stack + fix general permissions issues #2) and the SSM-deadlock risk (testing pipelines #1) are the only blockers; the rest are cleanup/follow-up.

Recommendation

Approve with required follow-ups on #1 + #2. Land #3–#10 in subsequent cleanup PRs.

philmerrell · 2026-05-29T14:43:58Z

Code Review — Stack Architecture Simplification

Reviewed the full diff (207 files, +15,416 / −22,228, 62 commits). Below are findings, with each load-bearing issue verified against both the old (95d51d96) and new code rather than inferred. 🤖 Generated with Claude Code.

Overview

Ambitious, well-documented refactor: collapses the 9-stack (later 2-stack) CDK architecture into a single PlatformStack and introduces a "bootstrap-container" deploy pattern — CDK owns infra + a stable placeholder image/Lambda code, and application code ships out-of-band via AWS APIs (ECR push → update-function-code / register-task-definition+update-service / update-agent-runtime / s3 sync). Inlined resources are lifted into ~40 focused constructs, per-stack workflows are replaced by platform.yml (CFN) + backend.yml (code), and cutover-platform-as-bootstrap.md documents the one-time migration.

Engineering quality is high — clean construct decomposition, a genuinely good cutover doc, materially improved test gating, and graceful bootstrap handlers. But there are three blocking issues plus several should-fixes.

What's done well ✅

Test gating is a real improvement. Every backend image build and code-deploy job has needs: test-backend (full uv run pytest tests/ -v); platform.yml/frontend-deploy.yml gate on test-infra/test-frontend. At merge-base no workflow ran pytest as a blocking gate — this closes that gap.
OIDC role-assumption with static-key fallback; permissions: id-token: write scoped per job; no secrets echoed; actions SHA-pinned with # vX comments.
Bootstrap handlers fail safe — pure stdlib, health checks pass, real traffic gets 503 during the window; services-stable passes.
integration.test.ts asserts Fn::ImportValue count == 0 — a nice guard validating the single-stack collapse.
Same-stack SSM→typed-ref purge (avoiding the first-deploy chicken-and-egg on AWS::SSM::Parameter::Value) is the correct fix and well-explained.

🔴 Blocking

1. IAM regression: OAuth / connector grants dropped — will break /connectors/* and external-MCP OAuth at runtime.
The grant extraction dropped statements the old code documented as required:

app-api (app-api-iam-grants.ts:399-410) now grants only GetWorkloadAccessTokenForUserId + GetWorkloadIdentity. Old app-api-stack.ts:1186-1268 additionally granted, with the comment "Without these, /connectors/{id}/{status,initiate,disconnect,complete} return 503": GetResourceOauth2Token, CompleteResourceTokenAuth; CreateTokenVault + Create/Update/Delete/Get/ListOauth2CredentialProvider; and secretsmanager:CreateSecret/DeleteSecret/PutSecretValue/UpdateSecret/TagResource on bedrock-agentcore-identity!default/oauth2/*.
inference-api (inference-api-iam-roles.ts:223-228) dropped GetResourceOauth2Token (old inference-api-stack.ts:574-581, comment "called by the agent loop's tool gating"). The oauth2-secret read is kept, but the control-plane action that mints the token is gone.

git grep confirms GetResourceOauth2Token / CompleteResourceTokenAuth / *Oauth2CredentialProvider appear nowhere in the new infra tree. The app-api-iam-grants.ts docstring claims it "attaches all required IAM policy statements" — inaccurate. Please confirm against current connector/OAuth usage and restore the missing grants (or document why they're intentionally gone).

2. Bootstrap-image invariant does not hold for ECS + AgentCore Runtime under CFN.
Both compute constructs bake the bootstrap image directly into the CFN resource with no escape hatch:

app-api-service-construct.ts:214 — image: ecs.ContainerImage.fromDockerImageAsset(bootstrapImage)
inference-agentcore-construct.ts:282 — containerUri: bootstrapImage.imageUri

The comments assert the real out-of-band image is "left untouched on subsequent deploys." That only holds when the task-def/runtime resource is otherwise unchanged. ECS task defs are immutable and update-agent-runtime is full-replacement — so when any property changes (a new env var, CPU/memory, authorizer config), CFN re-registers/updates with the synthesized template carrying image = <bootstrap 503 stub> and repoints the service/runtime to it, reverting the live image. Compounding it: platform.yml (paths-filtered to infrastructure/lib/constructs/** etc.) only runs cdk deploy and does not re-ship backend code afterward, so an infra-only change strands app-api and inference-api on the 503 stub until someone manually runs backend.yml. (The nightly pipeline chains them correctly; the standalone platform.yml path is the hazard. The two Lambdas are genuinely immune since Code is one property CFN leaves alone.)

Recommend one of: (a) platform.yml chains a backend re-ship after a successful cdk deploy; (b) a CFN escape hatch so CDK stops managing the image (EXTERNAL deployment controller / addPropertyDeletionOverride on containerUri / SSM-backed image param). Either way, correct the construct comments for the ECS + Runtime cases.

3. Teardown can't destroy the new architecture — and silently reports success.
scripts/teardown/destroy.sh:61-73 uses a hardcoded PARALLEL_STACKS list of legacy 9-stack names + FOUNDATION_STACK="InfrastructureStack", iterated with a stack_exists-skip. There is no dynamic prefix discovery. ${prefix}-PlatformStack — the only stack this PR deploys — is not in the list, so teardown.yml against a post-cutover environment skips everything and exits 0 "successfully" while deleting nothing. The cutover doc's claim that it "deletes every stack whose name starts with ${prefix}-" is also false given this implementation. Add PlatformStack (keep legacy names for migration) or switch to genuine describe-stacks-by-prefix discovery.

🟡 Should-fix

Docs are materially stale — and CLAUDE.md is unchanged. As the agent source-of-truth this will mislead: Deploy-order still describes the 9-stack order + CDK_ARTIFACTS_ENABLED/CDK_MCP_SANDBOX_ENABLED flags (removed here); the "New CDK stack" rule points at test/stack-dependencies.test.ts (deleted) and scripts/stack-<name>/ (all deleted). infrastructure/README.md still claims a two-stack architecture with a BackendStack + scripts/backend/deploy.sh (neither exists in the final state). .github/docs/deploy/step-04-deploy.md was not updated and is fully stale.
Lost security test coverage. Deleted app-api-stack.test.ts / security-best-practices.test.ts / inference-api-stack.test.ts held meaningful assertions now gone: BffCookieSigningKey Decrypt-only (no kms:GenerateDataKey), "no policy has both Action:"*" and Resource:"*"", S3 encryption + public-access-block, DDB SSE. The new app-api-iam-grants.ts (411 lines) and inference-api-iam-roles.ts (291 lines) are not exercised by any test — the new suite is mostly resourceCountIs/existence checks. A policy-level test would have caught blocker testing pipelines #1. (The 9 local config.test.ts failures are environmental — stale untracked lib/*.js; the committed suite is green at 361/361 in clean CI.)
Unconditional provisioning is an un-noted behavior/cost change. Removing the enabled flags means artifacts + mcp-sandbox + fine-tuning (previously default false) are now always created — new CloudFront distributions, DDB, S3, SageMaker IAM. Intended, but worth an explicit release-note/PR-description callout for existing deployers.
restore-data.yml setup-uv has no version: pin — resolves latest uv at runtime, violating the exact-pin convention (backend.yml/nightly pin 0.7.12).
Cross-workflow concurrency overlap: backend.yml, platform.yml, and the nightly pipeline use disjoint concurrency groups, so a push deploy and a nightly run could both update-service/update-agent-runtime the same service. Low probability; consider an environment-keyed deploy lock.

🟢 Nits

compute-content-hash.sh:94 expands a possibly-empty array under set -u; on macOS bash 3.2 the no---manifest (artifact-render) path errors. CI (bash 5) is fine; use "${MANIFESTS[@]+...}".
scripts/frontend/deploy.sh references ${AWS_REGION:?} but load-env.sh exports only CDK_AWS_REGION; CI sets it at job level but local runs fail. Mirror export AWS_REGION="${CDK_AWS_REGION}".
frontend-deploy.yml build job isn't gated on test-frontend (only deploy is) — wasted CPU, not a safety hole.
Stale config.artifacts.enabled doc-comments in artifact-render-token-secret-construct.ts:25 and spa-distribution-construct.ts:48; five vacuous no-assertion tests in config.test.ts's "Boolean Parsing" block; ~10 "stays in BackendStack" comments in platform-stack.ts the final collapse made false.
deploy-runtime-image-if-changed.sh rebuilds the update-agent-runtime payload from a hardcoded allow-list — a maintenance hazard if the control API adds required fields.

Recommendation

Strong architecture, not yet mergeable as-is. Blockers #1 (OAuth grants) and #3 (teardown) are fast fixes; #2 (image revert) needs an explicit design answer plus comment corrections. The doc drift and a regression-guard IAM test should ideally land in the same PR since they're part of the contract this refactor changes.

Background: `ssm.StringParameter.valueForStringParameter()` synthesises an `AWS::SSM::Parameter::Value<String>` CFN parameter. CFN resolves those parameters BEFORE any of the stack's resources are created. With the single-PlatformStack architecture, dozens of in-stack SSM reads were trying to read parameters that the same stack was supposed to publish — first-deploy CFN failure was 'Unable to fetch parameters from parameter store.' Fix: replaced every in-construct `valueForStringParameter` call with a typed CDK ref pulled from a new `PlatformComputeRefs` bundle that PlatformStack.wireCompute() builds once and threads through both compute constructs: - new file: lib/constructs/platform-compute-refs.ts (interface + docstring explaining the deadlock) - lib/platform-stack.ts: wireCompute() builds refs from the existing public readonly fields, passes to both InferenceAgentCoreConstruct and AppApiServiceConstruct. - constructs/inference-api/inference-agentcore-construct.ts: 24 SSM reads → ref lookups - constructs/inference-api/inference-api-iam-roles.ts: 20 SSM reads → ref lookups (function takes refs as param now) - constructs/app-api/app-api-environment.ts: `resolveAppApiSsmParams` renamed to `resolveAppApiParams` and rewritten to source from refs. - constructs/app-api/app-api-iam-grants.ts: 9 SSM reads + 28 string-arn props → single `refs` prop - constructs/app-api/app-api-service-construct.ts: 9 SSM reads + 9 `fromAttributes` imports → typed refs direct-use. Cleanup: 102 SSM publishes deleted across 31 construct files. The values were being published only for in-stack CDK reads (now via refs) or were never consumed at all. Survivors are the ~16 SSM paths actually needed by deploy scripts (build, deploy-ecs-service, deploy-runtime-image, deploy-image-lambda, deploy-lambda-code, frontend deploy) and e2e tests: /admin/managed-models-table-name /app-api/cluster-name + service-name + task-def-family /artifacts/render-function-name /auth/cognito/bff-app-client-id + user-pool-id /frontend/bucket-name + distribution-id + url /inference-api/runtime-id /network/alb-url + ecs-cluster-name /quota/user-quotas-table-name /rag/ingestion-function-name /rbac/app-roles-table-name (image-tag + code-hash params are written by build scripts, not by CDK, so they never had publishes here.) Tests: 24 hasResourceProperties assertions for now-deleted SSM publishes were dropped. The platform-stack.test.ts SSM-count assertion was updated from 'at least 40' to 'between 8 and 20' with a comment explaining the survivor set. Full gauntlet: - infra tsc + jest 361/361 - backend supply-chain + architecture + reflection 52/52 - restore-data 28/28

Got caught by git add -A in the previous SSM cleanup commit. Add to .gitignore so it doesn't recur.

Audit of every env var the CDK sets vs every os.environ name the Python code reads found six functional mismatches. All six would have caused the container to fall back to default values (the second arg to os.environ.get()) — which means runtime fallbacks, wrong-table writes, or silent feature breakage. 1. SHARED_CONVERSATIONS_TABLE_NAME was wired to ragAssistantsTableName (with a // TODO comment, no less). The shared-conversations table exists as a separate sibling DDB resource. Sharing a chat would have written to / read from the assistants table. Fix: add sharedConversationsTable to PlatformComputeRefs (already exposed on PlatformStack), thread sharedConversationsTableName through AppApiSsmParams, point the env var at it. 2. DYNAMODB_EVENTS_TABLE → DYNAMODB_QUOTA_EVENTS_TABLE. backend/src/agents/main_agent/config/constants.py defines EnvVars.DYNAMODB_QUOTA_EVENTS_TABLE = 'DYNAMODB_QUOTA_EVENTS_TABLE'. Quota event recording was hitting the default table name. 3. FINE_TUNING_JOBS_TABLE_NAME → DYNAMODB_FINE_TUNING_JOBS_TABLE_NAME. backend/src/apis/app_api/fine_tuning/routes.py reads 'DYNAMODB_FINE_TUNING_JOBS_TABLE_NAME' (default 'fine-tuning-jobs'). 4. FINE_TUNING_ACCESS_TABLE_NAME → DYNAMODB_FINE_TUNING_ACCESS_TABLE_NAME. backend/src/apis/app_api/fine_tuning/repository.py same shape. 5. FINE_TUNING_DATA_BUCKET_NAME → S3_FINE_TUNING_BUCKET_NAME. backend/src/apis/app_api/fine_tuning/s3_service.py reads 'S3_FINE_TUNING_BUCKET_NAME' (default 'fine-tuning-data'). 6. FINE_TUNING_PRIVATE_SUBNET_IDS → SAGEMAKER_SUBNET_IDS. backend/src/apis/app_api/fine_tuning/sagemaker_service.py reads 'SAGEMAKER_SUBNET_IDS'. With the wrong name, training jobs would have been launched without a VPC config. Audit also surfaced a few CDK-set-but-Python-never-reads entries (OAUTH_CLIENT_SECRETS_ARN, OAUTH_TOKEN_ENCRYPTION_KEY_ARN, AGENTCORE_MEMORY_TYPE, COGNITO_ISSUER_URL). They're harmless dead weight, not bugs — left in place to avoid mass churn and because they document configurable values for future code. Tests: tsc clean, jest 361/361.

Catches the bug class introduced in 3fd8cdd (Phase 3 stack implementation) and 902f5de (app-api decomposition) where I renamed env vars on the CDK side without verifying what Python reads. TypeScript compiled clean, IAM accepted the strings, CDK tests all passed — but the running container fell back to defaults silently because the env var name didn't match what the code expected. Bug was only visible by reading Python + TypeScript side-by-side, which is exactly what this test does automatically. Test extracts: - every quoted ALL_CAPS identifier from any .py file under backend/src (catches os.environ.get/getenv, os.environ[...], constant assignments like _CALLBACK_URL_ENV = 'NAME', and EnvVars-class indirections) - every env var key from CDK env blocks in app-api-environment, app-api-service-construct, inference-agentcore-construct, artifact-render-lambda-construct, rag-ingestion-lambda-construct Asserts: 1. Every CDK-set env var is read by ≥1 Python file OR is in an INTENTIONAL_NOT_READ_BY_PYTHON allow-list with a documented reason. Catches typos like 'DYNAMODB_EVENTS_TABLE' (set, never read because Python reads DYNAMODB_QUOTA_EVENTS_TABLE). 2. The allow-list itself is honest — entries that ARE read by Python or are NO LONGER set by CDK fail the test. 3. Helpful failure message includes did-you-mean suggestions using difflib.get_close_matches against Python env vars. Real orphans found and removed: - rag-ingestion-lambda BEDROCK_REGION (no Python reader; lambda uses AWS_REGION via boto3 default path) - inference-agentcore-construct ENABLE_AUTHENTICATION (no Python reader; ENABLE_QUOTA_ENFORCEMENT is the real flag) INTENTIONAL_NOT_READ_BY_PYTHON allow-list (4 entries): - AGENTCORE_MEMORY_TYPE, COGNITO_ISSUER_URL, OAUTH_CLIENT_SECRETS_ARN, OAUTH_TOKEN_ENCRYPTION_KEY_ARN (documented as 'set for audit / future code, not currently read by python'; harmless dead weight, not bugs). Going forward: any CI run that surfaces a new env var on either side without the matching name on the other side fails this test with a precise diff and a did-you-mean hint. Same shape as test_backup_coverage.py — a 'reflection' test that locks the contract between two halves of the codebase that don't otherwise share a type system. Tests: tsc clean, jest 361/361, backend pytest 57/57 (52 + 5 new), restore-data 28/28.

…ents Issue 3: Stale BackendStack references throughout comments. Phase 7 absorbed BackendStack and the resulting paragraphs of historical narrative weren't useful for maintaining the code. Stripped 20+ lines of refactor-history commentary across 9 files; kept only the comments that describe non-obvious 'why this code looks like this' constraints (same-stack SSM deadlock, lambda-Function-URL circular dependency, etc.). Issue 4: spaDistribution + spaDistributionDomainName were stub- filled with 'undefined as unknown as IDistribution' / 'as string' in the constructor and populated by a separate wireSpaDistribution() method called from bin/. If a caller forgot to invoke that method, the type system silently lied and downstream access produced a confusing TypeError. Inlined the SpaDistribution + RagCorsUpdater constructs into the constructor — the values they depend on (ALB DNS via AlbDnsConstruct, the SPA bucket) are all available by that point in the constructor, so there's no reason for the two-phase pattern. Dropped wireSpaDistribution() entirely; props are populated synchronously and unconditionally. Issue 5: platform.yml path triggers + concurrency. Added 'infrastructure/bootstrap-assets/**' to the (currently commented- out) path-trigger list so changes to bootstrap container images trigger a Platform deploy on push. Considered the cancel-in-progress: true on develop suggestion but reverted to false-always per existing supply-chain policy. The test_concurrency_config.py invariant exists explicitly to prevent CDK-mid-deploy cancellation leaving CloudFormation in ROLLBACK_IN_PROGRESS / UPDATE_ROLLBACK_FAILED states. The serial- queue tradeoff is what the team picked, and the safety risk on production is real. Issue 6: Clarified the wireCompute() ordering comment to make explicit that the construction order is for human readability — CDK token resolution + the CFN dependency graph handle ordering at synth/deploy time. A future reader can rearrange the blocks without breaking correctness; the order is just there to match how the data flow reads top-to-bottom. Tests: tsc clean, jest 361/361, backend pytest 57/57.

#7: Two-test ssm-safety suite that walks the synthesized template and asserts: - No AWS::SSM::Parameter::Value parameter references an SSM path that this same stack also creates via AWS::SSM::Parameter (the publish-then-read first-deploy deadlock). - Any remaining AWS::SSM::Parameter::Value parameter is in the EXTERNALLY_SEEDED_SSM_PATHS allowlist (CDK bootstrap version, build-time image tags, content hashes — all written before any CFN run). Would have failed the pre-Phase-7 stack at synth time. Now a permanent regression guard. #8: Per-image build duplication in backend.yml. Extracted the shared shape (configure-aws-credentials → optional QEMU+buildx for linux/arm64 → bash scripts/build/build-one.sh) into a composite action at .github/actions/build-and-push-image/. The 3 build jobs now call it via a 'platform: linux/arm64' input on inference-api and default amd64 on the others. Net: -45 lines in backend.yml, single source of truth for the build flow. Tests: tsc clean, jest 363/363 (+2 from ssm-safety), backend pytest 57/57.

Confirmed regression. The pre-Phase-7 app-api-stack and inference-api-stack roles granted a wider set of AgentCore Identity data-plane and control-plane actions; the decomposition into app-api-iam-grants.ts / inference-api-iam-roles.ts kept only GetWorkloadAccessTokenForUserId + GetWorkloadIdentity, which would break /connectors/* and any agent tool that needs a vaulted token. Cross-referenced what Python actually calls: shared/oauth/agentcore_identity.py GetResourceOauth2Token shared/oauth/agentcore_registrar.py Create/Update/Delete/Get/List Oauth2CredentialProvider app_api/connectors/routes.py GetResourceOauth2Token + CompleteResourceTokenAuth Cross-referenced every action name against the canonical Service Authorization Reference at https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonbedrockagentcore.html to confirm it exists as an IAM action and to scope resources correctly. CreateTokenVault from the user's report does NOT exist as an IAM action — the default token vault is auto-created by AgentCore Identity, no grant needed. All 7 new actions are documented and resource-scoped. App API gains: bedrock-agentcore:GetResourceOauth2Token bedrock-agentcore:CompleteResourceTokenAuth bedrock-agentcore:CreateOauth2CredentialProvider bedrock-agentcore:UpdateOauth2CredentialProvider bedrock-agentcore:DeleteOauth2CredentialProvider bedrock-agentcore:GetOauth2CredentialProvider bedrock-agentcore:ListOauth2CredentialProviders plus the full secretsmanager lifecycle on secret:bedrock-agentcore-identity!default/oauth2/* (was GetSecretValue+DescribeSecret only on the inference-api role). The registrar needs Create/Put/Update/Delete/Tag to add, rotate, and remove providers via the admin UI. Inference API gains: bedrock-agentcore:GetResourceOauth2Token The agent loop's tool gating short-circuits to vaulted tokens via this call before invoking external MCP tools / connectors. Without it the agent 503s on any federated-tool turn. Resources are scoped to this account's AgentCore Identity surface (token-vault/*, oauth2credentialprovider/*, workload-identity- directory/*, workload-identity/*) rather than the previous '*'. Tests: tsc clean, jest 363/363 (including the SSM safety guard added in the same review pass).

… trap The pre-fix shape baked the bootstrap container URI directly into the synthesized template at the ECS task definition's Image and the AgentCore Runtime's containerUri. That works fine for the no-op-deploy case (CFN sees no diff, leaves resources alone), but breaks when the compute resource itself changes: - AWS::ECS::TaskDefinition: every property except Tags is 'Update requires: Replacement'. Adding an env var, bumping CPU/memory, or swapping a role forced CFN to register a new revision carrying the bootstrap stub Image and repoint the service at it. Verified at: https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-resource-ecs-taskdefinition.html - AWS::BedrockAgentCore::Runtime: AgentRuntimeArtifact is 'No interruption', but update-agent-runtime is full-replacement at the API level. CFN re-sent the bootstrap containerUri on any property change. Verified at: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-bedrockagentcore-runtime.html Lambda is genuinely safe — CFN explicitly does NOT detect S3/ECR drift on the Code property, so a stable bootstrap asset hash means CFN never calls UpdateFunctionCode and the live code is preserved. Fix: read Image and containerUri from SSM at CFN deploy time. The build pipeline already writes the live image URI to /<prefix>/<svc>/image-tag via deploy-ecs-service-one.sh and deploy-runtime-image-one.sh. CFN now resolves whatever's in SSM at update time, so any task-def or Runtime re-registration uses the live image, not the bootstrap stub. Implementation: - app-api-service-construct.ts: image is now ContainerImage.fromRegistry(valueForStringParameter('/<prefix>/app-api/image-tag')). DockerImageAsset is kept (still published by cdk-assets) for first-deploy seed. CfnOutput exposes the asset hash so the seed script can construct the URI without parsing Fn::Sub. Both ECR repos (cdk-assets + per-service) get explicit grantPull on the task execution role since fromRegistry doesn't auto-grant. - inference-agentcore-construct.ts: same shape for AgentRuntimeArtifact.containerConfiguration.containerUri. First-deploy bootstrap problem: AWS::SSM::Parameter::Value<String> template parameters require the SSM path to exist before CFN starts resolving template parameters. New script scripts/stack-bootstrap/seed-image-tags.sh runs between cdk synth and cdk deploy: 1. Reads the bootstrap asset hash from the CfnOutput in cdk.out/<stack>.template.json 2. Constructs the cdk-assets ECR URI deterministically from account/region + the hash 3. Calls aws ssm put-parameter only if the param doesn't already exist (idempotent — build pipeline owns the param after first deploy) scripts/platform/deploy.sh now does: synth → cdk-assets publish → seed → deploy The cdk-assets publish step is needed before the seed so the URI the seed writes is actually pullable. cdk deploy would do the same publish, but seeding has to happen before the stack-update API call, which is inside cdk deploy. The SSM-safety test added in 4bbff4f already allowlists the image-tag paths as externally-seeded, so no test changes needed there. New test compute-image-resolution.test.ts asserts the synthesized shape: 5 assertions covering parameters, Refs at the consumer sites, and the bootstrap-hash CfnOutputs. Tests: tsc clean, jest 368/368 (+5), backend pytest 57/57, restore-data 28/28. Validated by reading authoritative AWS docs: https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-resource-ecs-taskdefinition.html https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-bedrockagentcore-runtime.html https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html

Docs (agent source-of-truth): - CLAUDE.MD: deploy order rewritten for single PlatformStack; File Creation Rules updated to reflect that there are no separate CDK stacks anymore (constructs only); infra build commands matched to current scripts. - .github/copilot-instructions.md: same updates. - .github/docs/deploy/step-04-deploy.md: full rewrite. 4 workflows (platform → backend → frontend → seed), no enable flags, accurate description of bootstrap pattern + SSM-resolved images. The previous version still described the 9-workflow flow with optional Artifacts/MCP-Sandbox/SageMaker stacks. - infrastructure/README.md: was already current; the only BackendStack reference is in the explicit Legacy Migration section. Tests (restored coverage previously held by deleted app-api-stack.test.ts / inference-api-stack.test.ts / security-best-practices.test.ts): - test/security-policy.test.ts (NEW) — 7 assertions: - Action:* + Resource:* prohibition across IAM Policy + ManagedPolicy + inline role policies (covers the CDK overflow-policy split for large inline policies) - BFF cookie KMS grant is kms:Decrypt only — caught and fixed a real regression where the new app-api-iam-grants was granting Encrypt + GenerateDataKey on the BFF cookie key (the cookie codec doesn't call KMS — Secrets Manager handles transparent decryption of the data-key secret). Split into two SIDs: OAuthTokenEncryptionKeyAccess (Decrypt+Encrypt+GenerateDataKey on the OAuth key) BffCookieSigningKeyDecrypt (Decrypt-only on the BFF key) - Every bucket has BucketEncryption, full PublicAccessBlock, and enforceSSL — caught and fixed a regression where RagDocumentsBucket and FrontendBucket were missing enforceSSL: true (no aws:SecureTransport=false Deny in their bucket policies). - Every DDB table has SSE enabled. - Test count: 370 (was 363; -5 vacuous Boolean Parsing tests that had no assertions, +7 new security, +5 from earlier compute-image-resolution suite, +2 from earlier ssm-safety). Cleanups: - .github/workflows/restore-data.yml: pin uv version to 0.7.12 (was using setup-uv with no version input → resolves latest at runtime, violated the exact-pin convention used by every other workflow). - scripts/frontend/deploy.sh: mirror AWS_REGION from CDK_AWS_REGION after sourcing load-env.sh. CI was setting it at the job level, but local runs would fail with 'AWS_REGION is required'. - scripts/build/compute-content-hash.sh: bash-3.2-safe MANIFESTS expansion. The plain '${MANIFESTS[@]}' under set -u errors on empty arrays in macOS's default bash. Use ${MANIFESTS[@]+"${MANIFESTS[@]}"} which is a no-op when the array has elements and safe when it doesn't. - .github/workflows/frontend-deploy.yml: gate the build job on test-frontend. Previously only the deploy job needed test-frontend, so a failing test would still trigger an expensive ng build. - Stripped stale config.artifacts.enabled doc-comments from artifact-render-token-secret-construct.ts and spa-distribution-construct.ts (artifacts is unconditional now). - Removed 5 vacuous Boolean Parsing tests in config.test.ts that had no assertion bodies — they were silently passing. Release notes: added an [Unreleased] section at the top of RELEASE_NOTES.md documenting the unconditional-provisioning behavior change for existing deployers (Artifacts + MCP Sandbox + fine-tuning are now always created on cdk deploy), the SSM-resolved image-tag mechanism, the first-deploy seed step ordering, the BFF cookie + enforceSSL fixes, and the OAuth / connector IAM grant restoration. The narrative summarises the work in flight on main since beta.27 was cut. Tests: tsc clean, jest 370/370, backend pytest 57/57, restore-data 28/28.

Brings in 17 commits since the merge base (95d51d9): - MCP-apps: persist UI resources across refresh, fullscreen iframe, sandbox CSP alignment - Admin: curated model catalog, model-form redesign, managed-models list polish - Spreadsheet analysis: fail-fast size guard - File upload: duplicate-name misclassification fix - Manage sessions: 'x items selected' wording - Messages: entry animation no longer traps fixed children - Sessions: drop dead duplicate app_api sessions messages service - Kaizen weekly review + research scan housekeeping Resolves no expected conflicts — develop work was frontend / UX / dead code cleanup; this branch's work is infra IAM / CDK / docs.

colinmxs added 30 commits May 20, 2026 09:57

docs(spec): add stack-architecture-simplification spec

c558d89

Adds requirements, design, and tasks for consolidating the CDK app into PlatformStack and BackendStack with content-hash Docker builds, same-origin BFF routing, and three new CI workflows.

chore(specs): remove stack-architecture-simplification spec

f485c26

Scrapped to start fresh. Original spec is preserved in git history at c558d89 for reference if needed.

Merge upstream develop into main

32efbdf

Merge upstream develop into main

870c05c

colinmxs added 16 commits May 27, 2026 14:02

colinmxs added 9 commits May 29, 2026 10:08

fix: stop tracking manifest.json scratch file

d479081

Got caught by git add -A in the previous SSM cleanup commit. Add to .gitignore so it doesn't recur.

philmerrell mentioned this pull request May 29, 2026

chore(kaizen): weekly research scan 2026-05-29 #407

Merged

philmerrell mentioned this pull request Jun 2, 2026

Admin-managed AgentCore Gateway target registration (protocol=mcp) #419

Open

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/stack architecture simplification#396

Feature/stack architecture simplification#396
colinmxs wants to merge 81 commits into
developfrom
feature/stack-architecture-simplification

colinmxs commented May 27, 2026

Uh oh!

philmerrell commented May 29, 2026

Uh oh!

philmerrell commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants