feat(kiloclaw) Admin instance actions scheduler#3000
Conversation
| } | ||
|
|
||
| try { | ||
| const outcome = await dispatchByActionType(target, ctx, db); |
There was a problem hiding this comment.
WARNING: Dispatch runs before the target is claimed
Because the external side effect starts before the pending target is atomically claimed, overlapping alarm/waitUntil passes can both process the same due row. One pass can kick off restartMachine, while another records failed or skipped first, leaving the database outcome inconsistent with the real restart/version change. Claim the target before dispatching, or otherwise serialize the apply path per target so only the claimed pass can perform the side effect.
There was a problem hiding this comment.
Fixed in db68801. Apply path now claims the target with an atomic pending to running CAS via the new claimScheduledActionTarget helper before invoking the dispatcher. If two concurrent waitUntil passes find the same due row, only one wins the claim and proceeds to dispatch; the other gets 0 rows back and skips. Added running to the target status enum (TS only annotation, no migration since the column is text) and relaxed the CAS in recordScheduledActionTargetOutcome to accept either pending or running so both code paths still work.
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Resolved Previous Findings (click to expand)
Files Reviewed (20 files)
Reviewed by gpt-5.5-20260423 · 3,686,939 tokens |
Summary
Adds a scheduling primitive for kiloclaw admin actions covering two action types:
scheduled_restart(the worker DO redeploys on its current image at the chosen time) andversion_change(the worker DO redeploys on a chosen image tag, with optional pin override). Three new tables back the feature:kiloclaw_scheduled_actions(parent),kiloclaw_scheduled_action_stages, andkiloclaw_scheduled_action_targets. The DOalarm()handler gains a wedge that queries Postgres for due targets and dispatches peraction_type. Admin UI surfaces are: a new Scheduler tab on the kiloclaw dashboard, an upcoming action indicator on the per instance page, a Schedule for later toggle on the bulk Change Version dialog, and the same toggle on the per instance Change Version dialog. The Upgrade to Latest button keeps its immediate semantics (renamed "Upgrade to Latest Now") and now opens a confirmation dialog. Notifications are intentionally not in this PR; the admin tooling says so where it matters.Architectural notes worth flagging:
tracked_image_tagsync wedge from PR Kiloclaw tracked image tag #2942:runScheduledActionApplyruns insidealarm()viawaitUntil, so reconciliation is unaffected when Postgres is unreachable.scheduleAlarm()math. The existing reconcile cadence (5 minutes for running instances, longer for hibernated) is the only timing source. The scheduled time is a "no earlier than" bound, not an exact fire time. Drift is up to one alarm interval and is surfaced to admins in the UI.notifyScheduledActionPending()callsscheduleAlarm()to refresh the alarm to a fresh near future timestamp. Exposed via worker routePOST /api/platform/scheduled-action/wake. The webscheduleActionmutation fires this best effort after the inserts, batched at 20 concurrent so a 500 instance schedule does not open 500 outbound sockets at once.scheduleActionrun inside onedb.transaction(..., { isolationLevel: 'serializable' })so two admins racing the same instance cannot both succeed. Outcome recording (recordScheduledActionTargetOutcome) wraps the target write and both counter writes in a single transaction so a partial failure rolls back cleanly. The DO apply path claims the target with apendingtorunningCAS before invoking the side effect so two concurrent waitUntil passes cannot both firerestartMachine.CONFLICTwhen any selected instance already has a pending action under a parent inscheduledorrunningstatus. Bulk schedules silently filter destroyed instances (mirrors thebulkChangeVersionApply Now path) and reject only if every selected instance is destroyed.cancelScheduledActionTargetdrops one instance from a bulk schedule via atomic CAS on the target row and runs the same promotion sweep as the alarm path so an action whose targets are all individually cancelled does not stayscheduledforever.kiloclaw_scheduled_action_targets.user_idreferenceskilocode_users.id. The targets table is retained operational state;softDeleteUseris updated and a coverage test confirms it succeeds when the user owns scheduled action targets.Verification
docker compose rm -s -f -v postgres,docker volume rm dev_postgres_data,pnpm drizzle migrate) applied every migration cleanly with the regenerated 0109 migration after rebase.Visual Changes
/admin/kiloclaw?tab=schedulerwith a sortable Recent actions table, separate Applied / Skipped / Failed columns, a Run at column, and a per row Detail dialog showing every target with source and target tags.Reviewer Notes
scheduleAlarmto pickmin(reconcileNextAt, scheduledActionNextAt). Deferred until drift becomes operational pain.bulkChangeVersionpath (PR feat(kiloclaw) Admin tools for bulk version deployments #2975) has the same property; a follow up PR will add a notifications framework that makes both paths polite.