Feature/stack architecture simplification#396
Conversation
Adds requirements, design, and tasks for consolidating the CDK app into PlatformStack and BackendStack with content-hash Docker builds, same-origin BFF routing, and three new CI workflows.
Scrapped to start fresh. Original spec is preserved in git history at c558d89 for reference if needed.
…ility
Phase 1, Task 1 of the stack architecture simplification.
Adds the equivalence-test scaffolding under infrastructure/test/equivalence/:
- normalize.ts — snapshot+normalization utilities. Strips stack-boundary
metadata, CDK-generated logical IDs, and intrinsic-function references
(Ref / Fn::GetAtt / Fn::ImportValue / Fn::Sub / Fn::Join) so the
comparison focuses on resource type + properties only. Sorts IAM
policy Statement arrays by content so grant-order differences don't
cause spurious diffs. Position-sensitive arrays (subnet IDs, AZs)
are left alone.
- build-legacy-app.ts — synthesizes the 9-stack legacy architecture
from a deterministic AppConfig, using the existing test/helpers/
mock-config.ts SSM mock pipeline. Used by the equivalence test
baseline (this commit) and will be reused by the equivalence gate
(Task 12). Deleted in Task 17.
- snapshot-legacy.test.ts — captures legacy synth summaries to
__snapshots__/legacy-synth.{all-features,minimal}.json for two
representative configs (every flag on / every retained flag off).
9 stacks / 302 resources for all-features; 6 stacks / 240 resources
for minimal. Files exist as historical baselines an operator can
diff against once the new architecture lands.
Also drops 38 stale .js / .d.ts compiled artifacts from infrastructure/
that were shadowing the May 2026 .ts sources (the .gitignore already
excludes them but they'd survived in the working tree from a March
build). The TypeScript compile is clean from sources only after this.
Test suite: 336 tests / 14 suites green (was 334 / 13).
Phase 1, Task 2 of the stack architecture simplification.
Adds:
- infrastructure/lib/constructs/ — empty subdirectories matching the
construct boundaries from the plan (network, identity, data, rag,
artifacts, mcp-sandbox, fine-tuning, spa, zones, ecr, app-api,
inference-api, gateway, rag-ingestion). Each carries a .gitkeep
so the structure ships before any constructs are extracted.
- constructs/README.md — documents the layout, naming convention,
and the typed-prop-passing rule (no SSM cross-stack reads in new
code; runtime SSM reads stay).
- test/equivalence/equivalence.test.ts — the harness for the
perfection gate. Two describe blocks:
1. 'wiring sanity' — runs every commit. Asserts the legacy app
builds and snapshots cleanly, and asserts buildNewApp() throws
a 'not implemented' error. This stays useful through Phase 2
so we know the harness keeps compiling as constructs land.
2. 'equivalence gate' (describe.skip) — the actual byte-equivalent
comparison. Two cases (minimal config, all-features config).
Currently skipped; flips to describe in Task 12 once
PlatformStack + BackendStack are wired.
- test/equivalence/build-new-app.ts — stub builder that throws
'not implemented'. Replaced with the real Platform + Backend
instantiation in Tasks 10–12.
Test suite: 15 suites / 340 tests / 2 skipped (was 14 / 336).
Phase 2, Task 3 of the stack architecture simplification.
Extracts every resource defined inline in infrastructure-stack.ts into
construct classes under infrastructure/lib/constructs/, organized by
logical resource group rather than legacy stack origin:
network/
NetworkConstruct — VPC, subnets, NAT, route tables
AlbConstruct — ALB + security group + listener
(HTTPS-on-:443 + HTTP-redirect when
certificateArn set, plain HTTP-on-:80
otherwise)
EcsClusterConstruct — single shared ECS cluster
identity/
AuthSecretConstruct — JWT signing / session encryption secret
VoiceTicketConstruct — replay table + HMAC signing secret
BffCookieKeyConstruct — KMS CMK + 44-char data-key secret used
to derive the AES-256 cookie sealing key
PlatformIdentityConstruct — shared AgentCore CfnWorkloadIdentity
OAuthTablesConstruct — providers/users tables + KMS key + client
secrets bag
AuthProvidersConstruct — providers table + DDB stream + secrets
CognitoConstruct — user pool + BFF confidential client +
domain + client secret persisted to
Secrets Manager
ArtifactRenderTokenSecretConstruct — gated by config.artifacts.enabled
data/
AuthTablesConstruct — OidcState, BFFSessions, Users (+ 4 GSIs),
AppRoles (+ 3 GSIs), ApiKeys (+ KeyHash GSI)
QuotaTablesConstruct — UserQuotas (5 GSIs), QuotaEvents
CostTrackingTablesConstruct — SessionsMetadata, UserCostSummary,
SystemCostRollup, ManagedModels
AdminTablesConstruct — UserSettings, UserMenuLinks
FileUploadConstruct — S3 bucket (lifecycle: IA→Glacier→expire)
+ DDB metadata table + SessionIndex GSI
SharedConversationsConstruct — share_id-keyed snapshots + 2 GSIs
zones/
AlbDnsConstruct — Route53 hosted-zone lookup, optional A
record at {albSubdomain}.{hostedZoneDomain},
ALB URL export to SSM + CFN
infrastructure-stack.ts shrinks from 1789 lines of inline resource
creation to ~170 lines of construct assembly, while preserving every
SSM parameter path, IAM grant, encryption key, GSI, and lifecycle rule
verbatim.
Logical IDs of underlying CFN resources change because nested constructs
prefix child IDs (e.g. 'Vpc' → 'Network/Vpc'). The equivalence test in
test/equivalence/normalize.ts strips logical IDs from the resource
fingerprint, so this is a tolerated cosmetic — the stack-snapshot
summaries before/after match on resource type counts.
Stack dependency analyser updated: extractSsmWrites/extractSsmReads now
follow imports into the constructs/ subtree, so SSM publications inside
constructs still attribute to the stack importing them.
Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 4 of the stack architecture simplification.
Extracts the inline resource creation in frontend-stack.ts into:
spa/
SpaBucketConstruct — private S3 bucket with OAC + version
lifecycle (30-day non-current expiry)
SpaDistributionConstruct — CloudFront distribution, OAC bucket
policy, default behavior with SPA-
routing CloudFront Function, /api/*
behavior with prefix-strip Function +
ALB HttpOrigin (HTTPS when cert set,
readTimeout/keepaliveTimeout 60s for
SSE), security headers policy with
conditional artifacts frame-src CSP,
optional Route53 ALIAS A record
RagCorsUpdaterConstruct — Python Lambda + custom resource that
patches the RAG documents bucket CORS
to accept the resolved frontend URL.
Only instantiated when
config.ragIngestion.enabled.
frontend-stack.ts shrinks from 534 lines of inline resource creation to
~95 lines of construct assembly. The cross-stack ALB-URL SSM read stays
in the stack file (resolves a stack-scope token); the constructs accept
the resolved string via typed props.
All resource definitions, IAM grants, CloudFront Function inline JS,
Lambda inline Python, custom-resource trigger semantics, and SSM
publication paths are byte-identical to the pre-refactor version.
Test suite: 15 suites / 340 tests / 2 skipped — all green.
…ructs
Phase 2, Task 5 of the stack architecture simplification.
Extracts ArtifactsStack and McpSandboxStack into reusable constructs:
artifacts/
ArtifactsDataConstruct — DynamoDB user-artifacts table
(PIT recovery, SessionIndex GSI)
+ S3 artifacts-content bucket
(lifecycle: 7-day multipart abort
+ tag-based soft-delete expiry)
ArtifactRenderLambdaConstruct — Python 3.12 ARM64 render Lambda
reading the HMAC signing key from
SSM, accepting Function URL invokes
only via AWS_IAM (CloudFront OAC),
CSP_SCRIPT_SRC env var pinned to
match the distribution's CSP byte-
for-byte
ArtifactsDistributionConstruct — CloudFront on artifacts.{domain}
with strict CSP (default-src none,
connect-src none, frame-ancestors
locked to SPA + extras), TLSv1.2,
NO_REFERRER, HSTS 1y. Caching
disabled (per-token JWTs).
Route53 ALIAS A record.
mcp-sandbox/
McpSandboxBucketConstruct — private S3 bucket + asset
deployment for the sandbox-proxy
shell. Two-step API: ctor builds
the bucket; deployShell(dist) adds
the BucketDeployment with /*
invalidation. The two-step shape
avoids a circular dependency
between bucket and distribution.
McpSandboxDistributionConstruct — CloudFront + Route53 ALIAS,
buildMcpSandboxFrameAncestors() and
buildMcpSandboxProxyCsp() helpers
and the MCP_SANDBOX_SUBDOMAIN_LABEL
constant exported for unit tests
and re-exported from the stack
file for back-compat with existing
imports.
Both legacy stack files reduced to thin construct assemblies. All CSP
strings, Lambda env vars, IAM grants, CloudFront cache/header policies,
S3 lifecycle rules, and SSM publication paths preserved verbatim.
Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 6 of the stack architecture simplification.
Extracts the data halves of RagIngestionStack and SageMakerFineTuningStack:
rag/
RagDataConstruct — RAG documents bucket (versioned, CORS),
S3 Vectors bucket + index (Titan V2:
1024-dim float32 cosine; non-filterable
'text' metadata key), assistants DDB
table with OwnerStatus / VisibilityStatus
/ SharedWith GSIs.
fine-tuning/
FineTuningDataConstruct — Fine-tuning jobs DDB (StatusIndex GSI),
access DDB, data bucket (30-day expire +
7-day multipart abort).
Both legacy stacks now consume their data construct and keep the
compute pieces inline:
rag-ingestion-stack.ts — Lambda + IAM + S3 notification still
inline (moves to Backend in Task 9).
sagemaker-fine-tuning-stack.ts
— IAM execution role + security group
still inline (moves to Backend in
Task 9).
Vector bucket name + index name are exposed as construct properties
so the Lambda env vars and IAM grants resolve through the same
JavaScript values the construct produces.
Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 7 of the stack architecture simplification.
The assistants table backing user-defined assistants in app-api lifts
into AssistantsTableConstruct under constructs/data/. The rest of
app-api-stack.ts (the ~1700-line Fargate service body — task definition,
container env, IAM grants for ~20 imported tables, target group
attachment) stays inline; that body absorbs into BackendStack directly
in Task 11 rather than living through an intermediate construct that
provides no reuse value.
data/
AssistantsTableConstruct — assistants DDB table with three GSIs
(OwnerStatusIndex, VisibilityStatusIndex,
SharedWithIndex with projection ALL).
Construct lives next to the other shared
table constructs even though the table
is currently created by AppApiStack —
Phase 3 may relocate ownership into
Platform when the service splits.
Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 8 of the stack architecture simplification.
GatewayStack lifts into AgentCoreGatewayConstruct under
constructs/gateway/. The construct owns the Gateway IAM execution
role, the CfnGateway resource (MCP protocol + AWS_IAM authorizer +
SEMANTIC search), all SSM publications, and the human-readable CFN
outputs (including the UsageInstructions blob).
InferenceApiStack body — AgentCore Runtime + Memory + Code Interpreter
Custom + Browser Custom plus the 4 dedicated IAM execution roles
threaded through ~1000 lines of cross-references — stays inline. It
absorbs into BackendStack directly in Task 11 rather than living
through intermediate constructs that provide no reuse value (same
rationale as Task 7's AppApiStack).
The MCP Lambdas referenced by the gateway role's lambda:InvokeFunction
grant against arn:aws:lambda:.../function:{prefix}-mcp-* are owned by
the external mcp-servers repo, not this CDK app — the role grants
invoke rights against the naming convention. No McpLambdasConstruct
is needed (correcting an early plan assumption).
Test suite: 15 suites / 340 tests / 2 skipped — all green.
Phase 2, Task 9 of the stack architecture simplification.
Extracts the compute halves of RagIngestionStack and
SageMakerFineTuningStack into reusable constructs:
rag-ingestion/
RagIngestionLambdaConstruct — DockerImage Lambda (ARM64) +
IAM (s3vectors:* on supplied
bucket+index, bedrock:InvokeModel
on the embedding model, read on
documents bucket, RW on assistants
table) + 1-week log group + S3
event subscription on assistants/
prefix in documents bucket.
Image tag from SSM
/{prefix}/rag-ingestion/image-tag.
fine-tuning/
SageMakerExecutionRoleConstruct — IAM execution role (sagemaker
service principal) granting
S3 RW on data bucket, DDB
UpdateItem on jobs table, EC2
VPC networking actions, CW Logs
publish under /aws/sagemaker/*.
Plus the matching security group
(egress :443 only, no ingress).
Both legacy stacks now reduce to thin two-construct assemblies (data +
compute) with the network-import block kept inline (the SageMaker SG
needs the imported VPC; the RAG stack keeps the imports for forward
compat with future VPC-bound execution paths).
Note: ArtifactRenderLambdaConstruct (the third compute construct the
plan called for) was already extracted in Task 5 as part of the
ArtifactsStack lift.
Test suite: 15 suites / 340 tests / 2 skipped — all green.
Tasks 10-12 of the stack architecture simplification.
Adds the two new stacks that replace the legacy 9-stack architecture:
platform-stack.ts — PlatformStack
Composes all Platform constructs (network, identity, data, rag-data,
artifacts-data + distribution, mcp-sandbox, spa, zones). Exposes
~50 typed public readonly properties for BackendStack to consume.
Two-step wiring methods:
- wireSpaDistribution(appApiUrl) — deferred until the ALB URL
SSM token is resolved (same-stack SSM read in bin/)
- wireArtifactsDistribution(renderFunctionUrl) — called by
BackendStack after the render Lambda is constructed, so the
dependency direction stays one-way (Backend → Platform)
Feature gates: config.artifacts.enabled, config.fineTuning.enabled.
backend-stack.ts — BackendStack
Composes:
- AppApiServiceConstruct (converted from AppApiStack → Construct;
body preserved verbatim including all SSM reads for runtime env
vars — those are deploy-time reads, not cross-stack CDK refs)
- InferenceAgentCoreConstruct (converted from InferenceApiStack →
Construct; same preservation strategy)
- AgentCoreGatewayConstruct
- RagIngestionLambdaConstruct + S3 event notification (wired via
an imported-by-name bucket to avoid the classic CDK circular
dependency between S3 notifications and cross-stack Lambda refs)
- ArtifactRenderLambdaConstruct (gated) — wires Function URL back
to Platform via wireArtifactsDistribution()
- SageMakerExecutionRoleConstruct (gated)
bin/infrastructure.ts — rewired to instantiate only PlatformStack
then BackendStack. CDK auto-generates cross-stack CFN exports from
the typed prop references. `npx cdk list` returns exactly:
PlatformStack (agentcore-PlatformStack)
BackendStack (agentcore-BackendStack)
Circular dependency fix: RagIngestionLambdaConstruct no longer calls
bucket.addEventNotification() internally (that creates a reverse
cross-stack ref). Instead, BackendStack imports the bucket by name
and wires the notification within its own stack scope. The legacy
RagIngestionStack (still used by the equivalence test) wires the
notification directly since both bucket and Lambda are same-stack.
stack-dependencies.test.ts updated to track the two new stack files
and assign them deployment tiers (Platform=0, Backend=2).
Test suite: 15 suites / 340 tests / 2 skipped — all green.
`npx cdk list` returns exactly 2 stacks.
Tasks 13-19 of the stack architecture simplification.
CONTENT-HASH DOCKER BUILD PIPELINE (Task 13):
scripts/build/
compute-content-hash.sh — SHA-256 over Dockerfile + git-tracked
source files + dependency manifests.
Deterministic (LC_ALL=C sort, per-file
hash buffer). 64-char lowercase hex.
build-and-push-if-changed.sh — compute hash, query ECR for existing
tag, skip build+push on hit, build+push
on miss, fallback on hash failure.
build-all-images.sh — orchestrates per-service builds for
app-api, inference-api, rag-ingestion.
Emits {service}_image_tag to
$GITHUB_OUTPUT.
NEW DEPLOY SCRIPTS (Task 14):
scripts/platform/{synth,deploy}.sh
scripts/backend/{synth,deploy}.sh — passes image tags as --context
scripts/frontend/{build,deploy}.sh — gen-version.js + ng build;
SSM-resolved S3 sync + CF invalidation
NEW GITHUB ACTIONS WORKFLOWS (Task 15):
.github/workflows/platform.yml — synth + deploy PlatformStack
.github/workflows/backend.yml — build-images → deploy BackendStack
.github/workflows/frontend-deploy.yml — build → S3 sync → CF invalidation
CROSS-CUTTING WORKFLOW UPDATES (Task 16):
nightly-deploy-pipeline.yml rewritten to chain:
deploy-platform → deploy-backend → deploy-frontend
using the new scripts. Legacy per-stack job graph removed.
LEGACY DELETION (Task 17):
Deleted:
- 9 legacy stack .ts files from infrastructure/lib/
- 9 legacy per-stack workflow .yml files
- 9 legacy scripts/stack-*/ directories
- All legacy per-stack test files
- The equivalence test harness (served its purpose during Phase 2-3)
DEPLOYMENT DOCS (Task 18):
infrastructure/README.md rewritten for the two-stack model:
deploy order, content-hash builds, feature flags, legacy teardown.
FINAL VERIFICATION (Task 19):
- npm run build: clean compile
- npm test: 3 suites / 93 tests / all green
- npx cdk list: exactly PlatformStack + BackendStack
- Workflows: 3 new (platform, backend, frontend-deploy) + 7 retained
cross-cutting (codeql, release, nightly, nightly-deploy-pipeline,
version-check, skip-auth-guard, bootstrap-data-seeding)
- scripts/: platform/, backend/, frontend/, build/ + retained
common/, nightly/, stack-bootstrap/
PM decision: artifacts, fine-tuning, mcp-sandbox are all serverless and cheap when unused. No reason to gate them behind config flags. Changes: - config.artifacts.enabled defaults to true (was false) - config.fineTuning.enabled defaults to true (was false) - All `if (config.artifacts.enabled)` / `if (config.fineTuning.enabled)` conditionals removed from PlatformStack, BackendStack, and the AppApiServiceConstruct / InferenceAgentCoreConstruct bodies - PlatformStack properties for artifacts + fine-tuning are no longer optional (`?`) — always present, always typed - BackendStack no longer conditionally spreads feature-gated props - Artifacts/MCP-sandbox domain+cert validation removed from config loader (deploy-time concern, not synth-time — constructs fall back to CloudFront default domains when no custom domain is configured) - Deleted test files that still imported deleted legacy stacks (cors.test.ts, security-best-practices.test.ts — missed in Task 17) - SPA distribution CSP simplified: artifacts frame-src always included when domainName is set (no longer gated on artifacts.enabled) The `enabled` fields remain in the config interfaces (harmless booleans that default to true) — removing them from the interface would be a larger config-loader refactor with no functional benefit. Test suite: 1 suite / 56 tests / all green. Build: clean compile. cdk list: PlatformStack + BackendStack.
Implements the restore counterpart to scripts/backup-data/. Reads a
backup manifest.json from a backup bucket and imports all data into a
deployed two-stack (PlatformStack + BackendStack) environment.
Restore pipeline (in order):
1. DynamoDB tables — reads ExportTableToPointInTime DynamoDB-JSON
exports line by line, deserializes typed attribute values, and
BatchWriteItem into the target table (resolved via SSM).
2. S3 buckets — `aws s3 sync` from backup bucket to target bucket
(resolved via SSM).
3. Cognito — CreateIdentityProvider with preserved client secrets,
AdminCreateUser with MessageAction=SUPPRESS, CreateGroup,
AdminAddUserToGroup.
Key properties:
- Idempotent: safe to re-run (DDB put_item overwrites, S3 sync is
naturally idempotent, Cognito catches DuplicateProvider/
UsernameExists/GroupExists exceptions and skips).
- Dry-run mode: --dry-run prints what would be restored without
writing anything.
- Target resolution: all target table names and bucket names are
resolved from SSM Parameter Store under /{target-prefix}/...,
so the tool works against any deployed environment regardless of
naming.
- Convention-named tables: the `assistants` table (not in SSM) is
resolved via the {prefix}-assistants naming convention.
Also brings scripts/backup-data/ onto this branch (cherry-picked from
bf4c216) so both tools live together.
Files:
scripts/restore-data/restore.py — main restore logic
scripts/restore-data/pyproject.toml — uv-managed deps (boto3)
scripts/restore-data/README.md — usage, IAM policy, options
.github/workflows/restore-data.yml — workflow_dispatch with
backup_bucket, manifest_key,
target_prefix, region,
dry_run, skip_cognito_users
inputs
Merges 16 commits from origin/develop into the stack-simplification branch. Two modify/delete conflicts resolved (mcp-sandbox-stack.ts and its test — both deleted by us, modified by develop's PRs #352–#360). Ported the dynamic per-resource CSP changes from develop into constructs/mcp-sandbox/mcp-sandbox-distribution-construct.ts: - Removed buildMcpSandboxProxyCsp() (static CSP is gone) - Added loadMcpSandboxCspFunctionCode() — loads csp-function.js from assets/mcp-sandbox/, substitutes the FRAME_ANCESTORS placeholder with the real source list via JSON.stringify (handles quote-escaping for 'none' correctly) - Added McpSandboxCspFunction (CloudFront Function, JS_2_0 runtime) associated at VIEWER_RESPONSE on the default behavior — composes per-resource CSP from ?csp= query param - Removed contentSecurityPolicy from the ResponseHeadersPolicy (CSP is now dynamic via the function; other security headers stay on RHP) - Shortened RHP comment to fit 128-char AWS cap The mcp-sandbox-csp-function.test.ts that came in from develop passes against our construct (88 tests green). Also merged: file-sources adapter framework (#366, #367), frontend test fix (#368), beta.27 release merges (#365, #369), kaizen doc (#370), backup tool (#361 — already cherry-picked earlier).
Extracts the ~990 lines of IAM policy grants from app-api-service-construct.ts into a standalone app-api-iam-grants.ts module. The module exports a single `grantAppApiPermissions()` function that takes typed props (task role + all the ARN strings the grants need) and attaches every IAM statement. Grants are grouped by domain: - Core DynamoDB tables (16 tables via a loop — same actions, different ARNs) - File uploads (S3 + DDB) - RAG (assistants table, documents bucket, S3 Vectors query) - Cognito admin ops (16 actions on the user pool) - Secrets Manager (5 secrets) - KMS (OAuth token encryption + BFF cookie signing) - Artifacts (S3 RW + DDB RW + render token read) - Fine-tuning (DDB + S3 + SageMaker job management + iam:PassRole + CW Logs) - AgentCore Memory (7 actions) - Bedrock InvokeModel (title generation) - SSM read (inference-api image tag) - AgentCore WorkloadIdentity (OAuth vault token minting) The core-tables loop alone replaces 16 near-identical 10-line policy statement blocks with a 5-line loop — ~150 lines → ~15 lines. Next step: wire the function call into app-api-service-construct.ts replacing the inline grants. The module compiles standalone and is ready to import. Test suite: 2 suites / 88 tests / all green.
Rewrites app-api-service-construct.ts from a 1736-line monolith into
a 268-line orchestrator backed by two focused helper modules:
app-api-environment.ts (241 lines)
- resolveAppApiSsmParams() — resolves all ~60 SSM parameters into
a typed AppApiSsmParams interface. Single point of truth for
every cross-stack value the App API needs.
- buildAppApiEnvironment() — builds the container environment map
from config + resolved params. Pure function, easily testable.
app-api-iam-grants.ts (398 lines)
- grantAppApiPermissions() — attaches all IAM policy statements
to the task role. Grants grouped by domain (core tables, file
uploads, RAG, Cognito, Secrets Manager, KMS, artifacts,
fine-tuning, AgentCore Memory, Bedrock, SSM, WorkloadIdentity).
- The 16 core-table grants that were 16 near-identical 10-line
blocks are now a 5-line loop over a typed array.
The main construct (268 lines) reads cleanly top-to-bottom:
1. Resolve SSM params
2. Import network resources
3. Create security group
4. Create assistants table
5. Create task definition + log group
6. Build container environment
7. Add container to task
8. Grant IAM permissions
9. Create target group + listener rule
10. Create Fargate service + auto-scaling
11. Emit outputs
Test suite: 2 suites / 88 tests / all green.
Build: clean compile.
Extracts the IAM execution roles from inference-agentcore-construct.ts
into a focused inference-api-iam-roles.ts module (334 lines).
The module exports four factory functions:
- createRuntimeExecutionRole() — the main AgentCore Runtime role with
~20 policy statements (CloudWatch, X-Ray, Bedrock, MCP Lambda,
Gateway, SSM, Secrets Manager, DynamoDB, KMS, Cognito, S3, S3
Vectors, Artifacts, AgentCore Memory/Tools/Identity, ECR)
- createMemoryExecutionRole() — Bedrock InvokeModel (all + Claude/Nova)
- createCodeInterpreterExecutionRole() — CloudWatch Logs for CI
- createBrowserExecutionRole() — CloudWatch Logs for Browser
The main construct (794 lines) is now purely:
- Resource creation (Memory, Code Interpreter, Browser, Runtime)
- Runtime configuration (container env, network, authorizer)
- Observability (vended logs, X-Ray sampling/group, CW dashboard, alarms)
- SSM exports + CloudFormation outputs
No IAM policy noise in the main file — all grant logic lives in the
roles module.
Summary of both monolith decompositions:
app-api: 1736 → 268 lines (3 files: construct + env + IAM)
inference-api: 1513 → 794 lines (2 files: construct + IAM roles)
Total: 3249 → 2274 lines across 5 focused files
Largest file: 1736 → 794 lines
Test suite: 2 suites / 88 tests / all green.
…ct, repo-shape
Adds three new test files:
platform-stack.test.ts (30 tests)
- Network resources (VPC, subnets, NAT, IGW, ALB, listeners, SG, ECS)
- Identity resources (Cognito, WorkloadIdentity, secrets, KMS)
- DynamoDB tables (24 total)
- S3 buckets (6 total)
- CloudFront distributions (2 + artifacts wired separately)
- SSM parameters (≥40)
- Typed property exposure (15 properties verified)
backend-stack.test.ts (26 tests)
- App API Fargate service (task def, service, TG, listener rule, SG,
auto-scaling, log group)
- AgentCore resources (Runtime, Memory, Code Interpreter, Browser,
Gateway)
- RAG ingestion Lambda + S3 notification
- Artifact render Lambda
- SageMaker execution role + SG
- IAM roles (runtime, memory, gateway)
- No-data-in-Backend assertions (0 S3, 0 CloudFront, 1 DDB only)
- SSM parameter publications
constructs.test.ts (55 tests)
- Per-construct isolation tests for all 21 Platform constructs
- Resource counts, key properties, SSM publications, typed exports
repo-shape.test.ts (24 tests)
- No legacy stack files (9 assertions)
- No legacy workflows (9 assertions)
- No legacy script dirs (9 assertions)
- New architecture files exist (platform-stack, backend-stack,
constructs/, workflows, scripts)
- Workflow YAML shape (platform, backend, frontend-deploy,
nightly-deploy-pipeline job graphs)
- bin/infrastructure.ts imports only new stacks
Also fixes:
- Circular dependency between Platform and Backend resolved by
moving wireArtifactsDistribution() call to bin/infrastructure.ts
(after both stacks are constructed). BackendStack exposes
artifactRenderFunctionUrl as a public property.
- Added 'yaml' devDependency for workflow parsing tests.
Test suite: 6 suites / 223 tests / all green (was 2 / 88).
Comprehensive test coverage for the two-stack architecture:
platform-stack.test.ts (30 tests)
Network, identity, DynamoDB, S3, CloudFront, SSM, typed properties
backend-stack.test.ts (26 tests)
Fargate, AgentCore, RAG Lambda, artifact render, SageMaker, IAM,
no-data-in-Backend assertions
constructs.test.ts (55 tests)
Per-construct isolation: resource counts, SSM publications, exports
repo-shape.test.ts (24 tests)
No legacy files, new architecture present, workflow YAML shape,
bin/infrastructure.ts imports
tables-detailed.test.ts (31 tests)
GSI configs, TTL attributes, encryption, billing modes
identity-detailed.test.ts (28 tests)
Cognito config, OAuth KMS, BFF cookie key, auth providers stream
network-and-scripts.test.ts (46 tests)
VPC, ALB, ECS, S3 buckets, gateway, build/deploy script shape
integration.test.ts (24 tests)
Two-stack cross-stack refs, resource ownership, Fn::ImportValue,
config validation, restore tool shape
additional-coverage.test.ts (21 tests)
SSM naming conventions, construct property types, config defaults
naming-conventions.test.ts (18 tests)
Table naming {prefix}-{name}, PAY_PER_REQUEST billing mode
Also fixes the circular dependency between Platform and Backend:
- wireArtifactsDistribution() moved from BackendStack constructor
to bin/infrastructure.ts (called after both stacks constructed)
- BackendStack exposes artifactRenderFunctionUrl as public property
- Avoids Platform → Backend → Platform cycle
Total: 88 → 391 tests (+303). 12 suites. All green.
…tials action All new workflows were calling the composite action without the required `with:` block. The action requires `aws-region` as a mandatory input. Fixed all 5 workflows (platform, backend, frontend-deploy, nightly-deploy-pipeline, restore-data) to pass: - aws-region from vars.CDK_AWS_REGION (fallback 'us-west-2') - aws-role-arn from vars.AWS_ROLE_ARN or secrets.AWS_ROLE_ARN - aws-access-key-id from secrets.AWS_ACCESS_KEY_ID - aws-secret-access-key from secrets.AWS_SECRET_ACCESS_KEY Matches the pattern used by the legacy workflows.
Code review against .kiro/steering/devops.md found three violations:
1. Workflows missing `environment:` key + job-level `env:` block
→ vars.* silently resolved to empty strings because environment-
scoped variables require the `environment:` key on the job.
2. Deploy scripts called `cdk deploy` without context params
→ load-env.sh exports CDK_* vars and provides
build_cdk_context_params(), but scripts weren't using it.
3. No synth-once pattern
→ Scripts now check for pre-synthesized cdk.out/ and use
`--app cdk.out/` when available, otherwise synth+deploy with
full context params.
Fixed:
- All 4 workflows (platform, backend, frontend-deploy, nightly-
deploy-pipeline) now have:
• `environment:` key (production on main, development otherwise)
• Job-level `env:` block mapping ALL vars.* → CDK_* env vars
• Secrets (AWS_ROLE_ARN, keys) at job level
• `concurrency:` group to prevent parallel deploys
• `LOAD_ENV_QUIET: true` at workflow level
- All 4 deploy/synth scripts (platform/{deploy,synth}.sh,
backend/{deploy,synth}.sh) now:
• Call build_cdk_context_params() from load-env.sh
• Pass the full context string to cdk synth/deploy
• Check for pre-synthesized cdk.out/ before re-synthesizing
• Use eval for the context params expansion
Config flow now matches the steering doc pattern:
GitHub vars/secrets → workflow env: block → CDK_* env vars
→ load-env.sh (validates + exports) → build_cdk_context_params()
→ cdk deploy --context key=value
Fixes the circular dependency:
Platform → Backend (artifacts CF distribution used render Lambda URL)
Backend → Platform (RAG bucket, tables, etc.)
The artifacts CloudFront distribution belongs in BackendStack because
its origin IS the render Lambda Function URL — they're a unit. Putting
the distribution in Platform created a reverse cross-stack reference
that CDK correctly rejected as a cycle.
After this change:
PlatformStack owns: SPA CloudFront + MCP sandbox CloudFront (edge
serving static content from Platform-owned S3 buckets)
BackendStack owns: Artifacts CloudFront (edge serving dynamic
content from the Backend-owned render Lambda)
Dependency direction is now strictly one-way: Backend → Platform.
Platform has zero references to Backend resources.
Removed wireArtifactsDistribution() from PlatformStack entirely —
no more two-step wiring pattern needed.
Tests updated: BackendStack now asserts 1 CloudFront distribution;
integration test asserts the correct split (Platform ≥2, Backend =1).
The deploy and synth scripts in scripts/{platform,backend}/ invoke
`cdk deploy "${CDK_PROJECT_PREFIX}-PlatformStack"`, but the construct
IDs in bin/infrastructure.ts were the un-prefixed 'PlatformStack' /
'BackendStack'. cdk deploy matches against the construct ID, not the
`stackName` property, so it failed with 'No stacks match the name(s)
${prefix}-PlatformStack'.
Make the construct IDs prefixed (matching what the scripts pass) and
drop the now-redundant `stackName` overrides — CDK uses the construct
ID as the default CloudFormation stack name, producing the same
'${prefix}-PlatformStack' / '-BackendStack' names as before, so no
change to deployed CloudFormation stack names.
Phase 1 launches every application stack's `cdk destroy` in parallel.
Each invocation synthesises before destroying and they all wrote to the
same `cdk.out/` directory, so only the first process to acquire the
synth lock could proceed; the others aborted immediately with
'Another CLI (PID=...) is currently synthing to cdk.out'. Those
abort-failures were then misreported by the wrapper as 'may not exist'.
Fix: give each parallel destroy its own synth output directory via
`--output ${SYNTH_DIR}/${STACK}`. Each process synthesises and locks
its own folder, so they can run truly concurrently.
Also tightened the surrounding behaviour:
- Pre-check via `aws cloudformation describe-stacks` so non-existent
stacks are skipped cleanly instead of being shoehorned into the
failure path.
- Failed destroys now print the last 20 lines of the cdk log so the
real cause is visible (instead of just 'may not exist').
- Script exits non-zero when any real destroy fails, so CI signals
teardown problems correctly.
- Skipped stacks are reported separately in the summary.
…troy Previous approach silently no-op'd in CI: the script invoked `cdk destroy <legacy-name> --force --exclusively`, but the repo has been refactored from a 9-stack architecture to a 2-stack one. The legacy stack names (FrontendStack, AppApiStack, etc.) no longer exist as construct IDs in the current CDK synth, so `cdk destroy` exited 0 without ever calling CloudFormation. The wrapper printed '[SUCCESS]' but nothing was actually deleted. Switch to `aws cloudformation delete-stack` + `aws cloudformation wait stack-delete-complete`. This deletes by CFN stack name, which is independent of whether the stack is in the current CDK code, and the wait gives us deterministic feedback on completion. On wait failure, we surface the StackStatus / StackStatusReason and the specific resources that hit DELETE_FAILED, so non-empty buckets or RETAIN-policy resources are obvious in the CI log instead of hidden behind a generic 'failed to destroy'.
Without explicit memory/CPU caps on the agentcore-dev container, a
single jest run, ng build, or cdk synth can demand all the RAM and
CPU the WSL2 VM has, swap-thrash the kernel, and freeze Windows and
Kiro along with it. Encoding the caps in the steering doc so any
agent (or human) launching the container does it consistently.
Adds a new 'Resource Caps (READ THIS)' section explaining the
two-layer model on the WSL2 + Docker Desktop topology:
- Outer layer: `.wslconfig` caps the WSL2 utility VM as a whole,
protecting Windows from any WSL2 process going haywire. Docker
Desktop on the WSL2 backend defers all VM-level resource control
to .wslconfig (no separate Docker Desktop sliders).
- Inner layer: `docker run --memory --memory-swap --cpus
--pids-limit` on the agentcore-dev launcher itself, so Docker
workloads cannot starve the nspawn jail (where Kiro runs)
inside the WSL2 VM.
Both `docker run` examples (long-lived and one-shot) now include
the cap flags. Workspace path table updated to reflect the actual
nested 'colinmxs/' fork path that the launcher binds.
Live deploy-ecs-service-one.sh app-api crashed with:
File '<string>', line 10
f"expected 1 container, got {len(td[\"containerDefinitions\"])}"
^
SyntaxError: unexpected character after line continuation character
The Python heredoc embedded in the bash script wraps Python with
'python3 -c <single-quoted-string>'. Inside that single-quoted shell
string, the original code had:
assert len(td["containerDefinitions"]) == 1, \
f"expected 1 container, got {len(td[\"containerDefinitions\"])}"
The first reference (the assert expression) is fine — single-quoted
shell passes "..." through to Python as plain double quotes. The
second reference is the problem: inside an f-string {...} expression,
Python forbids backslash escapes (PEP 498). So 'len(td[\"x\"])'
inside an f-string is a SyntaxError. Earlier Pythons (pre-3.12) also
disallow re-using the same quote style as the f-string delimiter
inside the expression, ruling out f"...{td["x"]}...".
Fix: precompute the count and reference the local in the f-string,
which sidesteps both restrictions. No quote escaping needed inside
the {...} expression.
Verified locally inside agentcore-dev: heredoc parses, image is
swapped on a synthetic task-def, and all eight read-only fields
(taskDefinitionArn, status, revision, requiresAttributes,
compatibilities, registeredAt, registeredBy, deregisteredAt) are
stripped before register-task-definition would see them.
Live deploy-runtime-image-one.sh inference-api crashed with:
Error: aws: [ERROR]: An error occurred (ParamValidation):
the following arguments are required: --role-arn,
--network-configuration
Unlike Lambda's update-function-code (which is partial — pass new
code and that's it), AgentCore's update-agent-runtime is a FULL
replacement API. The skeleton from
'aws bedrock-agentcore-control update-agent-runtime --generate-cli-skeleton'
shows roleArn, networkConfiguration, protocolConfiguration,
authorizerConfiguration, environmentVariables, description, and the
rest are all required (or at minimum: roleArn + networkConfiguration
are validation-required, the others are silently zeroed if omitted,
which is just as bad).
Fix: rebuild the update payload from get-agent-runtime by allow-
listing the fields update accepts, swap the containerUri, and pass
the result via --cli-input-json. The get response contains read-only
fields (agentRuntimeArn, agentRuntimeVersion, status, createdAt,
lastUpdatedAt, workloadIdentityDetails, agentRuntimeName) that
update rejects — those are dropped by the allow-list.
Implementation note: bash heredoc-vs-pipe precedence trips this up.
The first attempt used 'printf state | python3 - ARGS <<PYEOF body
PYEOF' — heredoc wins, so stdin became the script and the JSON was
lost. Rewrote to pass the state via a tmpfile path in argv.
Verified locally:
- Synthesized a get-agent-runtime response containing every read-
only field plus a representative subset of mutable config.
- Ran the heredoc end-to-end. Output payload had:
• all 7 read-only fields stripped
• containerUri swapped to the target tag
• agentRuntimeId, roleArn, networkConfiguration, description,
protocolConfiguration, environmentVariables all preserved
verbatim from the get response.
Re-run the workflow's deploy-inference-api-code job; the runtime
will roll over to the new image and the existing wait_for_ready
loop will block until it settles.
GitHub flagged v6.8.0 as running on Node.js 20, which is being forced to Node.js 24 on June 2 2026 and removed entirely Sept 16 2026. v8.1.0 declares 'using: node24' in its action.yml, so it runs natively on the new runtime and we stop seeing the deprecation warning. The four call sites use only 'version', 'enable-cache', and 'cache-dependency-glob' inputs — none are affected by v8.0.0's breaking changes, which were limited to the deprecated manifest-file format and the dropping of major/minor tag publishing (we already pin to a full SHA, so that's a no-op). - .github/workflows/backend.yml - .github/workflows/nightly-deploy-pipeline.yml - .github/workflows/restore-data.yml - .github/workflows/backup-data.yml Pin: astral-sh/setup-uv@0880764 # v8.1.0 (matches the SHA-pinned-with-version-comment shape that test_action_pinning.py enforces; supply-chain pytest 31/31 passes locally.)
frontend-deploy.yml's build job ran:
Run bash scripts/frontend/build.sh
[ERROR] CDK_AWS_ACCOUNT is required
Error: Process completed with exit code 1.
The build job has no environment: clause and no env: block, so
vars.CDK_AWS_ACCOUNT didn't resolve into the runner. Adding
'environment: production' to the build job to inherit the var
isn't great either — it would gate every static-bundle build on
manual approval rules tied to the production environment, slowing
PRs and forcing review on builds that never touch AWS.
The frontend build is, by construction, AWS-account-agnostic: it
runs npm ci + ng build to emit a static bundle. None of the
CDK_* env vars are read by the Angular build, gen-version.js, or
package.json scripts (verified with a grep across frontend/). The
deploy job is what actually needs CDK_AWS_ACCOUNT, and it already
has it in its job-level env: block.
Fix: add LOAD_ENV_SKIP_AWS_VALIDATION opt-out to load-env.sh and
set it from frontend/build.sh. Other scripts (platform/deploy.sh,
backend/deploy.sh, anything that actually invokes AWS) are
untouched and still fail loudly without the required env vars.
Verified locally: with the flag unset, validation fires and exit
status is 1; with LOAD_ENV_SKIP_AWS_VALIDATION=true, the script
sources cleanly and exits 0.
User reported Access Denied at https://ai.sbmt-api.com/ after a successful frontend-deploy.yml run. CloudFront 403 for /; same for /index.html. Root cause: Angular 17+ uses @angular/build:application as the builder, which emits the SPA into a browser/ subdirectory of the outputPath by default. With angular.json setting no explicit outputPath, the build output looks like: dist/ai.client/ browser/ index.html main-XXXXXXXX.js ... The deploy script was syncing dist/ai.client/ → s3://bucket/, so the bucket layout became: s3://bucket/browser/index.html s3://bucket/browser/main-XXXXXXXX.js s3://bucket/browser/... CloudFront's defaultRootObject is 'index.html', so a request to '/' resolves to '/index.html'. That key doesn't exist (it's at '/browser/index.html'). With OAC, S3 returns 403 to CloudFront for both NoSuchKey and AccessDenied; the browser sees 'Access Denied'. Fix: sync from dist/ai.client/browser/ instead of dist/ai.client/. The upload-artifact step in the workflow keeps uploading the parent dist/ai.client/ (so any future server/ or prerendered-routes.json outputs remain available for inspection in the artifact), but the deploy step targets the browser/ subdir for the actual S3 sync. Re-run frontend-deploy.yml after this lands; the index.html will land at s3://bucket/index.html and CloudFront will serve it.
Live (non-dry) restore against the May 21 backup logged:
[WARNING] [Cognito] Failed to create user colin:
InvalidParameterException: Cannot modify the non-mutable
attribute sub
[WARNING] [Cognito] Failed to create user rae:
InvalidParameterException: Cannot modify the non-mutable
attribute sub
[WARNING] [Cognito] Failed to add colin to system_admin:
UserNotFoundException: User does not exist.
The first two failures are the root cause; the third is the
cascade (group-membership add references a user that creation
just refused to make).
Cognito's `sub` attribute is auto-generated by Cognito at user-
creation time — it is the immutable subject identifier
referenced by every JWT issued for that user. AdminCreateUser
rejects any UserAttributes payload that includes sub, even with
the same value Cognito would have generated. The backup captures
attributes verbatim from list-users, which DOES include sub
(plus a handful of other Cognito-managed fields that are also
not creation-time-settable).
Fix: filter a small allow-block of Cognito-managed attrs before
the AdminCreateUser call:
sub # auto-generated UUID
cognito:user_status # state machine field
cognito:mfa_enabled # derived from MFA preferences
identities # managed by IdP linking, not by attribute
The remaining attributes (email, email_verified, given_name,
phone_number, custom:*, etc.) pass through unchanged.
Once user creation succeeds, the group-membership step from the
same backup will resolve cleanly — colin → system_admin will land
because colin will exist.
One known limitation, unchanged here: Cognito users with
native (non-federated) passwords come over without their
password — Cognito does not export password hashes via list-users
and there is no API to set a hash directly. Those users need a
"forgot password" reset on first login. Federated users (Entra,
Google, etc.) authenticate via the IdP and do not have this issue.
Tests:
test_cognito_users_strips_immutable_attributes
Asserts none of the four immutable names leak into the
admin_create_user call AND that legit attrs (email,
email_verified, given_name) are preserved verbatim.
10/10 restore-data tests pass.
User reported the showstopper for cross-pool migrations: AWS
Cognito does NOT permit setting `sub` on user creation, so when
a user pool is destroyed and recreated (which happens on every
full teardown + redeploy), every user gets a fresh sub. The app
keys all per-user DynamoDB partitions by USER#<sub> and embeds
<sub> in dozens of other places (audit trails, owner refs,
GSIs, S3 paths under user-file-uploads). Without remapping,
every user's history would be orphaned post-restore.
This rewrite makes the restore tool the source of truth for the
mapping. Order of operations now:
1. Cognito FIRST. For each backed-up user:
- AdminCreateUser with the original Username and the
sanitised attribute set (Cognito-immutable names dropped:
sub, cognito:user_status, cognito:mfa_enabled, identities).
- Capture the NEW sub from the AdminCreateUser response (or
AdminGetUser on a UsernameExistsException re-run).
- For any federated identity recorded in the backup
(the `identities` JSON), call AdminLinkProviderForUser so
the next IdP login resolves to this user instead of
provisioning a duplicate with a third sub.
- Record (old_sub → new_sub) in ctx.sub_map.
- Persist the full mapping as a JSON artifact alongside the
backup at cognito/sub-mapping-<target_prefix>.json.
2. Compile a single regex matching every old sub via
`compile_sub_pattern(sub_map)`. Subsequent passes consult
this pattern for O(1)-per-string scans regardless of user
count (works for thousands of users without rescanning).
3. DynamoDB. Each item is deserialised, then `remap_subs`
recursively walks every string value and rewrites any old
sub to its new equivalent BEFORE BatchWriteItem. PK, SK,
GSIs, nested maps, list elements — all covered. The result
row reports items_remapped alongside items_written.
4. S3. The previous `aws s3 sync` was replaced with a
paginated list_objects_v2 + parallel copy_object pipeline
(16-worker ThreadPoolExecutor). The relative key under the
source prefix is passed through `remap_subs_in_key` so any
sub-bearing path (e.g. users/<sub>/<file-id>) lands at its
new sub on the target side. Empty-bucket short-circuit
reports objects_copied: 0 cleanly.
Idempotency: re-running the restore against an already-migrated
pool succeeds. AdminCreateUser raises UsernameExistsException;
the code falls back to AdminGetUser to recover the existing sub.
The same old_sub → new_sub mapping is rebuilt, so DDB and S3
re-writes produce identical results to the first run.
Audit artifact: cognito/sub-mapping-<target_prefix>.json is
written to the backup bucket with the full mapping plus
identities_linked count and any users_failed list. Operators
running the restore can later use this file to debug orphaned
data or re-run targeted remap fixups.
What's left out:
- Cognito has no API to import password hashes. Native users
still need "forgot password" reset on first login.
Federated users (Entra/Google) sign in normally — the IdP
holds the credentials.
- AgentCore Memory events aren't replayed (unchanged from
previous behaviour; documented in the readme).
Tests (10 new, all 20 pass):
- compile_sub_pattern empty/populated
- remap_subs string/dict/list/set/scalar
- remap_subs short-circuits when pattern is None
- remap_subs_in_key for S3 paths
- cognito user creation: sub captured, identities linked,
audit artifact written, sub_map populated
- cognito idempotent re-run via AdminGetUser
- DDB end-to-end: gzipped DDB-JSON → deserialise → remap →
BatchWriter put_item, with PK + nested attrs all rewritten
- S3 end-to-end: list → per-object copy with key remap, mixed
sub-bearing and sub-less keys
Verified full test gauntlet:
- infra tsc + jest 361/361
- backend supply-chain + architecture 37/37
- restore-data 20/20
…ngle-stack architecture
All three docs were still describing the old 9-stack or 2-stack
architecture. Updated to reflect the current single-stack model:
README.md:
- Deployment section rewritten: single PlatformStack provisions all
infra; application code ships via AWS APIs (ECR → ECS/Lambda/Runtime)
- Project Structure updated: shows constructs/ tree, single
platform-stack.ts, three workflows (platform, backend, frontend)
- Removed references to per-stack deploy ordering
.github/README-ACTIONS.md:
- Architecture overview: single-stack + out-of-band code deploy
- What You'll Deploy split into Infrastructure (CDK) vs Application
Code (AWS APIs) vs Frontend (S3 sync)
- Workflows table with all current workflows
- Deploy order for first-time setup
- Content-hash Docker builds explanation
infrastructure/README.md:
- Single-stack architecture diagram
- Deploy scripts table
- Constructs directory listing (39 constructs)
- Removed BackendStack references and feature flags section
New file: .github/docs/deploy/upgrade-from-multi-stack.md
Detailed step-by-step migration guide covering:
1. Pull latest code
2. Run backup workflow (with verification checklist)
3. Run teardown workflow
4. Clean up retained resources (DDB tables, S3 buckets, Cognito,
Secrets Manager, KMS — with CLI commands for each)
5. Deploy new architecture (Platform → Backend → Frontend → Bootstrap)
6. Run restore workflow (dry-run first, then execute)
7. Verify (login, chat history, files, admin, RAG)
Also includes:
- Troubleshooting section (resource conflicts, missing tables,
Cognito password hashes, missing backup bucket)
- Rollback procedure
- Timeline estimate (~45-75 min total)
- Tip about setting retainDataOnDelete=false before teardown to
skip manual cleanup
…tecture
Files updated:
.kiro/steering/devops.md
- SSM Parameter Store guidance: now describes runtime consumption
only (not cross-stack CDK refs, since there's only one stack)
.kiro/steering/tech.md
- CDK deploy command: {prefix}-PlatformStack (was InfrastructureStack)
.kiro/steering/structure.md
- Infrastructure Structure: shows constructs/ tree, single
platform-stack.ts (was 5 separate stack files)
- Scripts Structure: shows build/, platform/, frontend/, teardown/
(was stack-app-api/, stack-inference-api/, etc.)
- Naming convention example: PlatformStack (was AppApiStack)
.kiro/steering/dev-environment.md
- Docker build script references: scripts/build/ (was scripts/stack-*/)
- Wrapper scripts list: build/, platform/, frontend/ (was stack-*/)
.claude/skills/cdk-infrastructure/SKILL.md
- Full rewrite: single-stack architecture, constructs/ tree,
typed prop passing (not SSM cross-stack), bootstrap container
pattern for ECS/Lambda, CDK commands updated
Historical specs (.kiro/specs/) left unchanged — they document what
was built at the time and serve as reference, not active steering.
Per request: gate the three deploy workflows behind workflow_dispatch
only for now so downstream forks can absorb the recent changes
(platform-as-bootstrap refactor + cross-pool Cognito sub remapping)
on their own schedule.
The push triggers themselves aren't deleted — they're commented out
in place with a clear TEMPORARY banner. Uncommenting the block in
each workflow restores the prior behaviour:
- backend.yml: push to develop/main on backend/**, scripts/build/**,
or the workflow itself
- platform.yml: push on platform-stack.ts, constructs/**, config.ts,
infrastructure/bin/infrastructure.ts, scripts/platform/**
- frontend-deploy.yml: push on frontend/**, scripts/frontend/**
workflow_dispatch is the only active trigger on each, so deploys now
only happen when an operator clicks Run Workflow in the Actions UI
(or invokes via gh workflow run).
Other workflows (nightly-deploy-pipeline.yml, restore-data.yml,
backup-data.yml, teardown.yml) are unchanged — those were already
manual / scheduled and don't have push triggers.
Adds tests/supply_chain/test_backup_coverage.py — a 'reflection' test
that scans CDK constructs for stateful resources and verifies the
backup/restore scripts cover every one of them.
What it checks:
- Every DynamoDB table in CDK is in the backup inventory
- Every DynamoDB table in CDK is in the restore inventory
- No phantom tables in backup that don't exist in CDK
- Every user-data S3 bucket is backed up (excludes mcp-sandbox
and frontend — static/reproducible content)
- Cognito backup captures user pool, users, groups, IdP secrets
- AgentCore Memory backup exists
- Restore covers every backed-up table
- Restore is idempotent (catches Duplicate/Exists exceptions)
- Restore has --dry-run
Real gap found and fixed:
- scripts/restore-data/restore.py: TABLE_CONVENTION_MAP was empty.
The 'assistants' table (convention-named {prefix}-assistants, not
in SSM) was backed up but not restorable. Added it.
This test will fail CI if someone adds a new DynamoDB table or S3
bucket to the CDK constructs without also adding it to the backup
and restore scripts — preventing silent data loss on migration.
AdminCreateUser leaves users in FORCE_CHANGE_PASSWORD, which silently breaks the hosted UI's ForgotPassword flow (Cognito shows 'code sent' but never sends because there is no completed initial setup to reset). Add an AdminSetUserPassword call with a random throwaway password and Permanent=True right after the create, transitioning the user to CONFIRMED. ForgotPassword then works normally; the throwaway is never communicated to anyone. Idempotent on re-run — works for users we just created OR users that exist from a prior run (still in FORCE_CHANGE_PASSWORD). Tests: 2 new, 22/22 pass.
I was wrong earlier when I told the operator AgentCore Memory
'cannot be replayed.' The data plane API accepts an explicit
eventTimestamp, actorId, sessionId, payload, metadata, branch
(rootEventId), and clientToken on CreateEvent — everything needed
for a faithful round-trip restore. The 'best-effort replay'
docstring was a TODO marker, not a permanent limitation, and I
should have verified before propagating that as fact.
This commit lands the actual replay implementation.
Behavior:
- Reads agentcore-memory/events.jsonl.gz from the backup bucket.
- Groups events by sessionId; sessions run in parallel (16-thread
pool, same as restore_s3_bucket), events within a session run
sequentially in eventTimestamp order so a branch-child never
references a parent that has not been replayed yet.
- For each event:
• actorId is remapped via ctx.sub_map (the cross-pool sub
remapping built during restore_cognito). Unmapped actor IDs
pass through verbatim.
• sessionId, eventTimestamp, payload, metadata are preserved
verbatim. The ISO timestamp the backup wrote is parsed back
to datetime via _parse_event_timestamp.
• branch.rootEventId is rewritten via a session-local
old_event_id → new_event_id map populated as we go. If a
branch ref points at an event we never saw (cross-session
ref or out-of-order arrival), the branch is dropped with a
warning instead of failing.
• clientToken is a sha256(target_memory_id|original_event_id)
so re-runs land in AWS's idempotency window and don't
double-write.
- On a memory with extraction strategies configured, CreateEvent
triggers semantic-fact extraction synchronously, so MemoryRecords
(long-term semantic memories, summaries, preferences) rebuild
themselves automatically. We do NOT separately replay
MemoryRecords — that would double-write and contradict the
strategy.
CLI:
--skip-memory-replay New flag for cost-conscious runs (each
event triggers Bedrock model invocations
for strategy extraction, which adds up).
Order in run_restore:
1. Cognito (builds sub_map)
2. DynamoDB (uses sub_map for inline item remap)
3. S3 (uses sub_map for inline key remap)
4. AgentCore Memory (uses sub_map for actorId remap)
Reverted unrelated regression introduced in d9f3b30:
d9f3b30 added 'assistants' back to TABLE_CONVENTION_MAP based on
a faulty reading of the reflection test (which does NOT find
'assistants' in current CDK — the standalone table was
decommissioned in c977e04). Restored TABLE_CONVENTION_MAP = {}
and removed the matching entry from backup.py
DYNAMODB_TABLES_BY_CONVENTION + the stale alias in
test_backup_coverage.py.
Tests (6 new for memory replay; 1 reverted regression test now
passes again):
- test_memory_replay_skipped_when_flag_set
- test_memory_replay_skipped_when_no_target_memory
- test_memory_replay_remaps_actor_id_via_sub_map
- test_memory_replay_rewrites_branch_root_event_id
- test_memory_replay_client_token_is_deterministic
- test_memory_replay_missing_events_file_skips_cleanly
Full gauntlet:
- infra tsc + jest 361/361
- backend supply-chain + architecture + reflection 52/52
- restore-data 28/28
User hit: AccessDeniedException calling ListEvents not authorized to perform: bedrock-agentcore:ListEvents Three IAM policies (App API task role, Inference API runtime execution role, and the Memory-grant block in the AgentCore Runtime construct) granted speculative action names that don't exist in IAM: CreateMemoryEvent / GetMemoryEvent / ListMemoryEvents / DeleteMemoryEvent / RetrieveMemory / ListMemorySessions / GetMemorySession / GetMemory / GetMemoryStrategies / DeleteSession The real Data Plane API surface (https://docs.aws.amazon.com/bedrock-agentcore/latest/APIReference/API_Operations.html): CreateEvent / GetEvent / ListEvents / DeleteEvent / ListActors / ListSessions / RetrieveMemoryRecords / GetMemoryRecord / ListMemoryRecords / BatchCreate|Update|DeleteMemoryRecords / DeleteMemoryRecord Result: every IAM grant in those three blocks was a silent no-op. The App API blew up the moment a real call hit the API. Replaced with the documented action names. App API + Inference runtime now have the read+write spread the app actually needs. Tests: tsc clean, jest 361/361.
Per the AWS Service Authorization Reference for bedrock-agentcore (https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonbedrockagentcore.html), 'bedrock-agentcore:InvokeBrowser' is not a real action — the same silent-no-op shape as the previous Memory action-name fixes. The canonical browser actions are StartBrowserSession, GetBrowserSession, ListBrowserSessions, StopBrowserSession, ConnectBrowserAutomationStream, ConnectBrowserLiveViewStream, UpdateBrowserStream, and SaveBrowserSessionProfile. Replaced 'InvokeBrowser' with the seven applicable runtime-side actions in both the inference-agentcore-construct (memory-arn-scoped policy) and inference-api-iam-roles (project-wide tools policy). Also: the resource ARN wildcards in inference-api-iam-roles used 'browser/*' and 'code-interpreter/*' — those refer to AWS-managed resources owned by account 'aws' and never match the resources this platform creates (CfnBrowserCustom and CfnCodeInterpreterCustom, both scoped to the project's account). Replaced with 'browser-custom/*' and 'code-interpreter-custom/*' to match the real ARN shape per the resource type table in the same Service Authorization Reference. I cross-referenced every bedrock-agentcore: action used in infrastructure/ against the canonical action list and now all 30 distinct actions are documented. tsc clean, jest 361/361.
Code Review — Stack Architecture SimplificationCollapses the 9-stack layout into a single Strengths
Issues to addressHigh1. Same-stack SSM publish-then-read may deadlock on first deploy. The author acknowledged this exact risk for two refs ( 2. `SHARED_CONVERSATIONS_TABLE_NAME` is wired to the wrong table. Medium3. Stale 4. Untyped stub-fills hide misuse. 5. `platform.yml` concurrency + path triggers. 6. Low7. Test gap on first-deploy ordering. Add an assertion that walks all 8. Per-image build jobs in 9. CFN parameter for 10. Risk assessment
RecommendationApprove with required follow-ups on #1 + #2. Land #3–#10 in subsequent cleanup PRs. |
Code Review — Stack Architecture SimplificationReviewed the full diff (207 files, +15,416 / −22,228, 62 commits). Below are findings, with each load-bearing issue verified against both the old ( OverviewAmbitious, well-documented refactor: collapses the 9-stack (later 2-stack) CDK architecture into a single Engineering quality is high — clean construct decomposition, a genuinely good cutover doc, materially improved test gating, and graceful bootstrap handlers. But there are three blocking issues plus several should-fixes. What's done well ✅
🔴 Blocking1. IAM regression: OAuth / connector grants dropped — will break
2. Bootstrap-image invariant does not hold for ECS + AgentCore Runtime under CFN.
The comments assert the real out-of-band image is "left untouched on subsequent deploys." That only holds when the task-def/runtime resource is otherwise unchanged. ECS task defs are immutable and Recommend one of: (a) 3. Teardown can't destroy the new architecture — and silently reports success. 🟡 Should-fix
🟢 Nits
RecommendationStrong architecture, not yet mergeable as-is. Blockers #1 (OAuth grants) and #3 (teardown) are fast fixes; #2 (image revert) needs an explicit design answer plus comment corrections. The doc drift and a regression-guard IAM test should ideally land in the same PR since they're part of the contract this refactor changes. |
Background: `ssm.StringParameter.valueForStringParameter()`
synthesises an `AWS::SSM::Parameter::Value<String>` CFN parameter.
CFN resolves those parameters BEFORE any of the stack's resources
are created. With the single-PlatformStack architecture, dozens of
in-stack SSM reads were trying to read parameters that the same
stack was supposed to publish — first-deploy CFN failure was
'Unable to fetch parameters from parameter store.'
Fix: replaced every in-construct `valueForStringParameter` call
with a typed CDK ref pulled from a new `PlatformComputeRefs`
bundle that PlatformStack.wireCompute() builds once and threads
through both compute constructs:
- new file: lib/constructs/platform-compute-refs.ts (interface +
docstring explaining the deadlock)
- lib/platform-stack.ts: wireCompute() builds refs from the
existing public readonly fields, passes to both
InferenceAgentCoreConstruct and AppApiServiceConstruct.
- constructs/inference-api/inference-agentcore-construct.ts:
24 SSM reads → ref lookups
- constructs/inference-api/inference-api-iam-roles.ts:
20 SSM reads → ref lookups (function takes refs as param now)
- constructs/app-api/app-api-environment.ts:
`resolveAppApiSsmParams` renamed to `resolveAppApiParams` and
rewritten to source from refs.
- constructs/app-api/app-api-iam-grants.ts:
9 SSM reads + 28 string-arn props → single `refs` prop
- constructs/app-api/app-api-service-construct.ts:
9 SSM reads + 9 `fromAttributes` imports → typed refs
direct-use.
Cleanup: 102 SSM publishes deleted across 31 construct files.
The values were being published only for in-stack CDK reads
(now via refs) or were never consumed at all. Survivors are the
~16 SSM paths actually needed by deploy scripts (build,
deploy-ecs-service, deploy-runtime-image, deploy-image-lambda,
deploy-lambda-code, frontend deploy) and e2e tests:
/admin/managed-models-table-name
/app-api/cluster-name + service-name + task-def-family
/artifacts/render-function-name
/auth/cognito/bff-app-client-id + user-pool-id
/frontend/bucket-name + distribution-id + url
/inference-api/runtime-id
/network/alb-url + ecs-cluster-name
/quota/user-quotas-table-name
/rag/ingestion-function-name
/rbac/app-roles-table-name
(image-tag + code-hash params are written by build scripts, not
by CDK, so they never had publishes here.)
Tests: 24 hasResourceProperties assertions for now-deleted SSM
publishes were dropped. The platform-stack.test.ts SSM-count
assertion was updated from 'at least 40' to 'between 8 and 20'
with a comment explaining the survivor set.
Full gauntlet:
- infra tsc + jest 361/361
- backend supply-chain + architecture + reflection 52/52
- restore-data 28/28
Got caught by git add -A in the previous SSM cleanup commit. Add to .gitignore so it doesn't recur.
Audit of every env var the CDK sets vs every os.environ name the Python code reads found six functional mismatches. All six would have caused the container to fall back to default values (the second arg to os.environ.get()) — which means runtime fallbacks, wrong-table writes, or silent feature breakage. 1. SHARED_CONVERSATIONS_TABLE_NAME was wired to ragAssistantsTableName (with a // TODO comment, no less). The shared-conversations table exists as a separate sibling DDB resource. Sharing a chat would have written to / read from the assistants table. Fix: add sharedConversationsTable to PlatformComputeRefs (already exposed on PlatformStack), thread sharedConversationsTableName through AppApiSsmParams, point the env var at it. 2. DYNAMODB_EVENTS_TABLE → DYNAMODB_QUOTA_EVENTS_TABLE. backend/src/agents/main_agent/config/constants.py defines EnvVars.DYNAMODB_QUOTA_EVENTS_TABLE = 'DYNAMODB_QUOTA_EVENTS_TABLE'. Quota event recording was hitting the default table name. 3. FINE_TUNING_JOBS_TABLE_NAME → DYNAMODB_FINE_TUNING_JOBS_TABLE_NAME. backend/src/apis/app_api/fine_tuning/routes.py reads 'DYNAMODB_FINE_TUNING_JOBS_TABLE_NAME' (default 'fine-tuning-jobs'). 4. FINE_TUNING_ACCESS_TABLE_NAME → DYNAMODB_FINE_TUNING_ACCESS_TABLE_NAME. backend/src/apis/app_api/fine_tuning/repository.py same shape. 5. FINE_TUNING_DATA_BUCKET_NAME → S3_FINE_TUNING_BUCKET_NAME. backend/src/apis/app_api/fine_tuning/s3_service.py reads 'S3_FINE_TUNING_BUCKET_NAME' (default 'fine-tuning-data'). 6. FINE_TUNING_PRIVATE_SUBNET_IDS → SAGEMAKER_SUBNET_IDS. backend/src/apis/app_api/fine_tuning/sagemaker_service.py reads 'SAGEMAKER_SUBNET_IDS'. With the wrong name, training jobs would have been launched without a VPC config. Audit also surfaced a few CDK-set-but-Python-never-reads entries (OAUTH_CLIENT_SECRETS_ARN, OAUTH_TOKEN_ENCRYPTION_KEY_ARN, AGENTCORE_MEMORY_TYPE, COGNITO_ISSUER_URL). They're harmless dead weight, not bugs — left in place to avoid mass churn and because they document configurable values for future code. Tests: tsc clean, jest 361/361.
Catches the bug class introduced in 3fd8cdd (Phase 3 stack implementation) and 902f5de (app-api decomposition) where I renamed env vars on the CDK side without verifying what Python reads. TypeScript compiled clean, IAM accepted the strings, CDK tests all passed — but the running container fell back to defaults silently because the env var name didn't match what the code expected. Bug was only visible by reading Python + TypeScript side-by-side, which is exactly what this test does automatically. Test extracts: - every quoted ALL_CAPS identifier from any .py file under backend/src (catches os.environ.get/getenv, os.environ[...], constant assignments like _CALLBACK_URL_ENV = 'NAME', and EnvVars-class indirections) - every env var key from CDK env blocks in app-api-environment, app-api-service-construct, inference-agentcore-construct, artifact-render-lambda-construct, rag-ingestion-lambda-construct Asserts: 1. Every CDK-set env var is read by ≥1 Python file OR is in an INTENTIONAL_NOT_READ_BY_PYTHON allow-list with a documented reason. Catches typos like 'DYNAMODB_EVENTS_TABLE' (set, never read because Python reads DYNAMODB_QUOTA_EVENTS_TABLE). 2. The allow-list itself is honest — entries that ARE read by Python or are NO LONGER set by CDK fail the test. 3. Helpful failure message includes did-you-mean suggestions using difflib.get_close_matches against Python env vars. Real orphans found and removed: - rag-ingestion-lambda BEDROCK_REGION (no Python reader; lambda uses AWS_REGION via boto3 default path) - inference-agentcore-construct ENABLE_AUTHENTICATION (no Python reader; ENABLE_QUOTA_ENFORCEMENT is the real flag) INTENTIONAL_NOT_READ_BY_PYTHON allow-list (4 entries): - AGENTCORE_MEMORY_TYPE, COGNITO_ISSUER_URL, OAUTH_CLIENT_SECRETS_ARN, OAUTH_TOKEN_ENCRYPTION_KEY_ARN (documented as 'set for audit / future code, not currently read by python'; harmless dead weight, not bugs). Going forward: any CI run that surfaces a new env var on either side without the matching name on the other side fails this test with a precise diff and a did-you-mean hint. Same shape as test_backup_coverage.py — a 'reflection' test that locks the contract between two halves of the codebase that don't otherwise share a type system. Tests: tsc clean, jest 361/361, backend pytest 57/57 (52 + 5 new), restore-data 28/28.
…ents Issue 3: Stale BackendStack references throughout comments. Phase 7 absorbed BackendStack and the resulting paragraphs of historical narrative weren't useful for maintaining the code. Stripped 20+ lines of refactor-history commentary across 9 files; kept only the comments that describe non-obvious 'why this code looks like this' constraints (same-stack SSM deadlock, lambda-Function-URL circular dependency, etc.). Issue 4: spaDistribution + spaDistributionDomainName were stub- filled with 'undefined as unknown as IDistribution' / 'as string' in the constructor and populated by a separate wireSpaDistribution() method called from bin/. If a caller forgot to invoke that method, the type system silently lied and downstream access produced a confusing TypeError. Inlined the SpaDistribution + RagCorsUpdater constructs into the constructor — the values they depend on (ALB DNS via AlbDnsConstruct, the SPA bucket) are all available by that point in the constructor, so there's no reason for the two-phase pattern. Dropped wireSpaDistribution() entirely; props are populated synchronously and unconditionally. Issue 5: platform.yml path triggers + concurrency. Added 'infrastructure/bootstrap-assets/**' to the (currently commented- out) path-trigger list so changes to bootstrap container images trigger a Platform deploy on push. Considered the cancel-in-progress: true on develop suggestion but reverted to false-always per existing supply-chain policy. The test_concurrency_config.py invariant exists explicitly to prevent CDK-mid-deploy cancellation leaving CloudFormation in ROLLBACK_IN_PROGRESS / UPDATE_ROLLBACK_FAILED states. The serial- queue tradeoff is what the team picked, and the safety risk on production is real. Issue 6: Clarified the wireCompute() ordering comment to make explicit that the construction order is for human readability — CDK token resolution + the CFN dependency graph handle ordering at synth/deploy time. A future reader can rearrange the blocks without breaking correctness; the order is just there to match how the data flow reads top-to-bottom. Tests: tsc clean, jest 361/361, backend pytest 57/57.
#7: Two-test ssm-safety suite that walks the synthesized template and asserts: - No AWS::SSM::Parameter::Value parameter references an SSM path that this same stack also creates via AWS::SSM::Parameter (the publish-then-read first-deploy deadlock). - Any remaining AWS::SSM::Parameter::Value parameter is in the EXTERNALLY_SEEDED_SSM_PATHS allowlist (CDK bootstrap version, build-time image tags, content hashes — all written before any CFN run). Would have failed the pre-Phase-7 stack at synth time. Now a permanent regression guard. #8: Per-image build duplication in backend.yml. Extracted the shared shape (configure-aws-credentials → optional QEMU+buildx for linux/arm64 → bash scripts/build/build-one.sh) into a composite action at .github/actions/build-and-push-image/. The 3 build jobs now call it via a 'platform: linux/arm64' input on inference-api and default amd64 on the others. Net: -45 lines in backend.yml, single source of truth for the build flow. Tests: tsc clean, jest 363/363 (+2 from ssm-safety), backend pytest 57/57.
Confirmed regression. The pre-Phase-7 app-api-stack and inference-api-stack roles granted a wider set of AgentCore Identity data-plane and control-plane actions; the decomposition into app-api-iam-grants.ts / inference-api-iam-roles.ts kept only GetWorkloadAccessTokenForUserId + GetWorkloadIdentity, which would break /connectors/* and any agent tool that needs a vaulted token. Cross-referenced what Python actually calls: shared/oauth/agentcore_identity.py GetResourceOauth2Token shared/oauth/agentcore_registrar.py Create/Update/Delete/Get/List Oauth2CredentialProvider app_api/connectors/routes.py GetResourceOauth2Token + CompleteResourceTokenAuth Cross-referenced every action name against the canonical Service Authorization Reference at https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonbedrockagentcore.html to confirm it exists as an IAM action and to scope resources correctly. CreateTokenVault from the user's report does NOT exist as an IAM action — the default token vault is auto-created by AgentCore Identity, no grant needed. All 7 new actions are documented and resource-scoped. App API gains: bedrock-agentcore:GetResourceOauth2Token bedrock-agentcore:CompleteResourceTokenAuth bedrock-agentcore:CreateOauth2CredentialProvider bedrock-agentcore:UpdateOauth2CredentialProvider bedrock-agentcore:DeleteOauth2CredentialProvider bedrock-agentcore:GetOauth2CredentialProvider bedrock-agentcore:ListOauth2CredentialProviders plus the full secretsmanager lifecycle on secret:bedrock-agentcore-identity!default/oauth2/* (was GetSecretValue+DescribeSecret only on the inference-api role). The registrar needs Create/Put/Update/Delete/Tag to add, rotate, and remove providers via the admin UI. Inference API gains: bedrock-agentcore:GetResourceOauth2Token The agent loop's tool gating short-circuits to vaulted tokens via this call before invoking external MCP tools / connectors. Without it the agent 503s on any federated-tool turn. Resources are scoped to this account's AgentCore Identity surface (token-vault/*, oauth2credentialprovider/*, workload-identity- directory/*, workload-identity/*) rather than the previous '*'. Tests: tsc clean, jest 363/363 (including the SSM safety guard added in the same review pass).
… trap
The pre-fix shape baked the bootstrap container URI directly into
the synthesized template at the ECS task definition's Image and
the AgentCore Runtime's containerUri. That works fine for the
no-op-deploy case (CFN sees no diff, leaves resources alone), but
breaks when the compute resource itself changes:
- AWS::ECS::TaskDefinition: every property except Tags is
'Update requires: Replacement'. Adding an env var, bumping
CPU/memory, or swapping a role forced CFN to register a new
revision carrying the bootstrap stub Image and repoint the
service at it. Verified at:
https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-resource-ecs-taskdefinition.html
- AWS::BedrockAgentCore::Runtime: AgentRuntimeArtifact is
'No interruption', but update-agent-runtime is full-replacement
at the API level. CFN re-sent the bootstrap containerUri on any
property change. Verified at:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-bedrockagentcore-runtime.html
Lambda is genuinely safe — CFN explicitly does NOT detect S3/ECR
drift on the Code property, so a stable bootstrap asset hash means
CFN never calls UpdateFunctionCode and the live code is preserved.
Fix: read Image and containerUri from SSM at CFN deploy time. The
build pipeline already writes the live image URI to
/<prefix>/<svc>/image-tag via deploy-ecs-service-one.sh and
deploy-runtime-image-one.sh. CFN now resolves whatever's in SSM at
update time, so any task-def or Runtime re-registration uses the
live image, not the bootstrap stub.
Implementation:
- app-api-service-construct.ts: image is now
ContainerImage.fromRegistry(valueForStringParameter('/<prefix>/app-api/image-tag')).
DockerImageAsset is kept (still published by cdk-assets) for
first-deploy seed. CfnOutput exposes the asset hash so the
seed script can construct the URI without parsing Fn::Sub.
Both ECR repos (cdk-assets + per-service) get explicit
grantPull on the task execution role since fromRegistry
doesn't auto-grant.
- inference-agentcore-construct.ts: same shape for
AgentRuntimeArtifact.containerConfiguration.containerUri.
First-deploy bootstrap problem: AWS::SSM::Parameter::Value<String>
template parameters require the SSM path to exist before CFN
starts resolving template parameters. New script
scripts/stack-bootstrap/seed-image-tags.sh runs between cdk synth
and cdk deploy:
1. Reads the bootstrap asset hash from the CfnOutput in
cdk.out/<stack>.template.json
2. Constructs the cdk-assets ECR URI deterministically from
account/region + the hash
3. Calls aws ssm put-parameter only if the param doesn't already
exist (idempotent — build pipeline owns the param after first
deploy)
scripts/platform/deploy.sh now does:
synth → cdk-assets publish → seed → deploy
The cdk-assets publish step is needed before the seed so the URI
the seed writes is actually pullable. cdk deploy would do the
same publish, but seeding has to happen before the stack-update
API call, which is inside cdk deploy.
The SSM-safety test added in 4bbff4f already allowlists the
image-tag paths as externally-seeded, so no test changes needed
there.
New test compute-image-resolution.test.ts asserts the synthesized
shape: 5 assertions covering parameters, Refs at the consumer
sites, and the bootstrap-hash CfnOutputs.
Tests: tsc clean, jest 368/368 (+5), backend pytest 57/57,
restore-data 28/28.
Validated by reading authoritative AWS docs:
https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/aws-resource-ecs-taskdefinition.html
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-bedrockagentcore-runtime.html
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html
Docs (agent source-of-truth):
- CLAUDE.MD: deploy order rewritten for single PlatformStack;
File Creation Rules updated to reflect that there are no
separate CDK stacks anymore (constructs only); infra build
commands matched to current scripts.
- .github/copilot-instructions.md: same updates.
- .github/docs/deploy/step-04-deploy.md: full rewrite. 4
workflows (platform → backend → frontend → seed), no enable
flags, accurate description of bootstrap pattern + SSM-resolved
images. The previous version still described the 9-workflow
flow with optional Artifacts/MCP-Sandbox/SageMaker stacks.
- infrastructure/README.md: was already current; the only
BackendStack reference is in the explicit Legacy Migration
section.
Tests (restored coverage previously held by deleted
app-api-stack.test.ts / inference-api-stack.test.ts /
security-best-practices.test.ts):
- test/security-policy.test.ts (NEW) — 7 assertions:
- Action:* + Resource:* prohibition across IAM Policy +
ManagedPolicy + inline role policies (covers the CDK
overflow-policy split for large inline policies)
- BFF cookie KMS grant is kms:Decrypt only — caught and
fixed a real regression where the new app-api-iam-grants
was granting Encrypt + GenerateDataKey on the BFF cookie
key (the cookie codec doesn't call KMS — Secrets Manager
handles transparent decryption of the data-key secret).
Split into two SIDs:
OAuthTokenEncryptionKeyAccess (Decrypt+Encrypt+GenerateDataKey on the OAuth key)
BffCookieSigningKeyDecrypt (Decrypt-only on the BFF key)
- Every bucket has BucketEncryption, full
PublicAccessBlock, and enforceSSL — caught and fixed a
regression where RagDocumentsBucket and FrontendBucket
were missing enforceSSL: true (no aws:SecureTransport=false
Deny in their bucket policies).
- Every DDB table has SSE enabled.
- Test count: 370 (was 363; -5 vacuous Boolean Parsing tests
that had no assertions, +7 new security, +5 from earlier
compute-image-resolution suite, +2 from earlier ssm-safety).
Cleanups:
- .github/workflows/restore-data.yml: pin uv version to 0.7.12
(was using setup-uv with no version input → resolves latest
at runtime, violated the exact-pin convention used by every
other workflow).
- scripts/frontend/deploy.sh: mirror AWS_REGION from
CDK_AWS_REGION after sourcing load-env.sh. CI was setting it
at the job level, but local runs would fail with 'AWS_REGION
is required'.
- scripts/build/compute-content-hash.sh: bash-3.2-safe MANIFESTS
expansion. The plain '${MANIFESTS[@]}' under set -u errors on
empty arrays in macOS's default bash. Use
${MANIFESTS[@]+"${MANIFESTS[@]}"} which is a no-op when the
array has elements and safe when it doesn't.
- .github/workflows/frontend-deploy.yml: gate the build job on
test-frontend. Previously only the deploy job needed
test-frontend, so a failing test would still trigger an
expensive ng build.
- Stripped stale config.artifacts.enabled doc-comments from
artifact-render-token-secret-construct.ts and
spa-distribution-construct.ts (artifacts is unconditional
now).
- Removed 5 vacuous Boolean Parsing tests in config.test.ts
that had no assertion bodies — they were silently passing.
Release notes: added an [Unreleased] section at the top of
RELEASE_NOTES.md documenting the unconditional-provisioning
behavior change for existing deployers (Artifacts + MCP Sandbox +
fine-tuning are now always created on cdk deploy), the
SSM-resolved image-tag mechanism, the first-deploy seed step
ordering, the BFF cookie + enforceSSL fixes, and the OAuth /
connector IAM grant restoration. The narrative summarises the
work in flight on main since beta.27 was cut.
Tests: tsc clean, jest 370/370, backend pytest 57/57, restore-data
28/28.
Brings in 17 commits since the merge base (95d51d9): - MCP-apps: persist UI resources across refresh, fullscreen iframe, sandbox CSP alignment - Admin: curated model catalog, model-form redesign, managed-models list polish - Spreadsheet analysis: fail-fast size guard - File upload: duplicate-name misclassification fix - Manage sessions: 'x items selected' wording - Messages: entry animation no longer traps fixed children - Sessions: drop dead duplicate app_api sessions messages service - Kaizen weekly review + research scan housekeeping Resolves no expected conflicts — develop work was frontend / UX / dead code cleanup; this branch's work is infra IAM / CDK / docs.
No description provided.