Commit Graph

495 Commits

Author SHA1 Message Date
saravanakumardb1
6bddc88f0f feat(fleet): reconcile PR state against GitHub (detect externally-merged PR)
A PR merged in the GitHub UI was invisible to the platform — prState only
flipped to `merged` when the platform merged it (ghMergePr on ship) or a runner
reported it, so the job details page kept showing the PR as open. This adds a
simple, pull-based reconcile (no inbound webhook / public ingress needed).

- coordinator.reconcileJobPrState(jobId, productId, fetcher?): finds the latest
  run carrying a prUrl and, when `gh pr view --json state` reports MERGED, flips
  the run's prState to `merged` and appends a `pr_merged` event (data.via:
  'reconcile'). The GitHub lookup is injectable for tests; pure `mapGhPrState`
  maps MERGED/OPEN/other. Best-effort: any gh failure is a no-op.
- POST /fleet/jobs/:id/pr/reconcile route; echoes the outcome to the tracker
  Item when a merge is detected.
- tracker-web: reconcilePrState() client + a "Refresh PR status" button on the
  job details PR section (shown until the PR is merged) that reconciles then
  refreshes the view.

Tests: +5 (mapGhPrState, reconcile merged/open/no_pr/not_found, route wiring);
full suite 1861 green; lint + tsc clean (service + tracker-web). Deployed: rebuilt
the docker platform-service; POST .../pr/reconcile returns 401 (wired), not 404.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 23:58:13 -07:00
saravanakumardb1
c63736459b feat(fleet): anti-flap hysteresis + autoscale Prometheus series & dashboard (ops #5)
Make the capacity autoscaling signal safe to act on automatically and observable
in Grafana.

Anti-flap hysteresis:
- New pure applyHysteresis: suppresses a direction reversal (scale_in after
  scale_out, or vice versa) within a cooldown window so a consumer cannot thrash
  capacity. A critical scale-out (queued work, zero usable capacity) always
  bypasses the cooldown. Cooldown anchor only advances on an emitted action, so a
  suppressed signal keeps counting down from the real last action.
- Process-wide per-product cooldown state (mirrors reaper/breaker in-mem state)
  with a test seam; cooldown tunable via FLEET_AUTOSCALE_COOLDOWN_SEC (default 300).
- GET /fleet/autoscale[/all] now serve the debounced (stateful) recommendation.

Observability:
- Prometheus exposition emits the RAW recommendation per product
  (fleet_autoscale_recommended_seats/delta/pressure + one-hot fleet_autoscale_action
  {action}). RAW (not stateful) so a scrape never mutates the cooldown anchors.
- Grafana "Fleet Overview" gains two panels: products recommending scale-out
  (stat) + recommended seat delta vs backlog (timeseries).

Docs: FLEET_AUTOSCALE_COOLDOWN_SEC in .env.example.

Tests: +10 (hysteresis/stateful/cooldown + prom autoscale series); full suite 1856
green; lint + tsc clean. Verified live: a throwaway Prometheus scraped the running
service and the dashboard PromQL returned real scale-out/scale-in recommendations
across products.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 23:02:08 -07:00
saravanakumardb1
321cfe7546 feat(fleet): correlation-id tracing + capacity autoscaling signal (ops #4/#5)
Thread a trace-context correlation id across the coordinator<->runner boundary
so a logical work-unit (job -> claim -> run -> ship) is stitchable end to end,
and add an advisory capacity autoscaling signal an external scaler can consume.

Tracing (#4):
- Mint/propagate a correlationId at submit from the inbound
  x-correlation-id/traceparent/x-request-id (else generate ftr_<uuid>); persist
  it on the job, inherit onto the run + lease at claim, and stamp every
  lifecycle event (submitted/assigned/transition/lease_renewed/lease_released/
  retry_scheduled/dead_letter). Children of a composite job share the parent id.
- Echo it back on the x-correlation-id response header (submit/claim/renew/
  release/patch) so a factory can carry it forward, and bind it to req.log.
- New pure trace.ts (header resolution incl. W3C traceparent trace-id).

Autoscaling signal (#5):
- New pure autoscaler.ts turns a product FleetMetrics + saturation alerts
  (no_live_capacity/saturated/queue_starvation) into an auditable scale
  recommendation (action/recommendedSeats/delta/urgency/signals).
  budget_exhausted suppresses scale-out; idle slack reclaims down to a floor.
  Thresholds tunable via FLEET_AUTOSCALE_* env.
- GET /fleet/autoscale (per-product) + GET /fleet/autoscale/all (global, admin
  or scrape token). Documented the env vars in .env.example.

Tests: +29 (trace 10, tracing 7, autoscaler 12); full suite 1846 green; lint + tsc clean.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 22:43:56 -07:00
saravanakumardb1
93d1caf4a2 feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4)
Exports fleet observability to Prometheus/Grafana (previously JSON-only).

- GET /api/fleet/metrics/prom: global, product-labelled Prometheus exposition
  (queue depth, blocked/active, per-stage histogram, factory health/seats/
  utilization, active alerts, budget spent/ceiling/projected) plus process-wide
  reaper/GC counters and engine circuit-breaker state. Pure renderer
  (renderFleetMetricsProm) is unit-tested; route auth accepts a FLEET_METRICS_TOKEN
  bearer (scrape path) or an admin JWT — never world-readable by default.
- Infra: add a prometheus container to docker-compose + a platform-service-fleet
  scrape job; pin the Prometheus Grafana datasource uid; add a provisioned
  "Fleet Overview" dashboard (breakers, dead-letter, stale factories, alerts,
  queue depth, utilization, budget burn, reaper rate) with a product template var.
- Document FLEET_METRICS_TOKEN + the fleet feature flags in .env.example.

No default behavior change: the endpoint is additive and the new container is
opt-in via the compose stack.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 22:24:03 -07:00
saravanakumardb1
42c63dcc6e feat(platform): product ownership + owner-scoped "my projects" + tenant guard
Foundation for a generic, multi-tenant platform (any developer, not just the
built-in products).

- Products carry an optional ownerId (set on create + auto-register), so a
  product has a tenant. GET /products/mine returns the caller's owner-scoped
  list; admins/super_admins see all. productsForUser() is pure + unit-tested.
- requireProductAccess(): a flag-gated tenant authorization guard
  (FLEET_TENANT_ENFORCEMENT, default OFF). OFF = byte-for-byte current behavior;
  ON = a non-admin may only act on products they own (others -> 403; owner-less
  legacy products keep a grace allowance until migrated). Fleet routes now
  resolve productId through it in place of getRequestProductId.

ownerId is additive/optional; enforcement is off by default, so this is a
no-op for existing deployments until explicitly enabled.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 16:47:05 -07:00
saravanakumardb1
bcd806c6ff feat(fleet): complete budget enforcement — per-engine ceilings + overspend projection
Builds on the existing product-level hard claim gate + idempotent accrual.

- Per-engine sub-ceilings (engineCeilingsUsd) with per-engine accrual
  (spentByEngineUsd). An engine at its sub-ceiling is routed around at claim
  time via the same per-engine availability gate as the circuit breaker — it
  never pauses the whole product, so other engines keep flowing. Gated by
  FLEET_BUDGETS (defaults off).
- /fleet/metrics now surfaces a budget summary (ceiling/spend/status/projection
  + per-engine breakdown) and derives guardrail alerts: budget_overspend_projected
  (burn-rate extrapolation, guarded against early-window false alarms),
  budget_exhausted, and engine_budget_exhausted. Surfaced whenever a budget exists,
  independent of the enforcement flag, so operators see the burn in dry-run.

projectBudgetSpend is pure + unit-tested; per-engine spend follows the same
idempotent accrual path as the total, so spentUsd and spentByEngineUsd agree.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 13:23:44 -07:00
saravanakumardb1
bdbb0a8ce4 feat(fleet): cost/latency-aware engine routing + per-engine circuit breaker
Adds two additive, flag-gated routing refinements on top of the §7 scoring
core; both default OFF so the deterministic claim path is unchanged.

- FLEET_COST_ROUTING: soft engineQuality term (weight 0.4) biases routing
  toward the historically cheaper/faster engine, derived from per-engine
  insights.costUsd + run duration. No-history engines stay neutral, so the
  nudge can only demote demonstrably costly engines, never penalise new ones.
- FLEET_ENGINE_BREAKER: per-(factory, engine) circuit breaker. releaseLease
  always records outcomes (observable via /fleet/metrics engineBreakers);
  when enabled, an OPEN pair is routed around. Only ever restricts the
  candidate set — never forces a route.

The scheduler stays pure: history lookup + availability gate are injected
predicates. New engineQuality term contributes 0 unless a lookup is supplied,
preserving every existing score/breakdown.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 13:17:25 -07:00
saravanakumardb1
5413c0e789 feat(fleet): reaper/GC telemetry in /fleet/metrics + lease TTL backstop
Observability + defense-in-depth for the recovery/GC machinery:

- The reaper now accumulates process-wide telemetry (getReaperStats): cumulative
  expired/stale reclaims and per-container GC deletions, plus startedAt/last-run
  timestamps. GET /fleet/metrics returns it under a `reaper` field so operators
  can see recovery activity (dead_letter counts/alerts were already added).
- Cosmos TTL backstop on fleet_leases (2 days): a held lease is renewed
  continuously so it never expires while active; only finished leases age out,
  matching the ~24h app GC. Purely defense-in-depth behind the reaper, which still
  OWNS recovery (requeue + epoch bump + checkpoint). TTL is deliberately NOT set on
  fleet_events (ids are <jobId>:evt:<seq> with seq=count, so partial TTL deletion
  could collide ids); events/runs/jobs are pruned by the cascade GC instead.

Memory provider ignores defaultTtl, so tests/dev are unaffected.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 12:18:22 -07:00
saravanakumardb1
141435fe95 feat(fleet): bind advertised capabilities to the enrolled token scope
The claim path already constrained a factory to its enrolled scope, but the
heartbeat trusted self-reported capabilities — so (with enforcement on) a factory
could advertise e.g. engine:codex it was never granted, polluting the engine
picker (GET /fleet/factories) and routing/explain decisions even though a codex
job still couldn't be claimed by it.

Heartbeat now intersects the factory's self-reported capabilities with the token
scope when enforcement is ON: it may report FEWER (an engine temporarily
unavailable) but never MORE than enrolled. Enforcement OFF is unchanged
(self-reported caps pass through verbatim). Covered by new route tests.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 12:14:09 -07:00
saravanakumardb1
a6adaee835 feat(fleet): operator re-drive for dead-letter jobs + dead-letter alert/UI
Closes the loop on the retry automation — a job that exhausts its retries lands
in dead_letter with no way to recover it:

- New `redrive` operator action: requeues the job AND grants a fresh retry budget
  by anchoring a new `attemptsBase` to the current `attempts` (and clearing any
  retryNotBefore backoff). `attempts` stays monotonic so run ids never collide; a
  plain `requeue` leaves the budget exhausted and would instantly re-dead-letter.
  The retry policy now measures used budget as `attempts - attemptsBase`.
- fleetMetrics raises a `dead_letter` warning alert when any job is dead-lettered.
- tracker-web: a "Re-drive" button on dead_letter/failed jobs; the timeline already
  renders the retry_scheduled / dead_letter / pr_merged / pr_merge_failed /
  factory_stale events generically.

Backward compatible: attemptsBase defaults to 0 and old docs without it read as 0.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 12:11:46 -07:00
saravanakumardb1
9e0afc23d2 test(fleet): end-to-end lifecycle/chaos guardrails across coordinator features
Adds cross-feature integration tests that drive the coordinator the way a real
factory does, asserting the newly-wired paths COMPOSE (the per-feature unit
suites only cover them in isolation):

- retry: fail then auto-requeue then fail again then dead_letter, one claim per attempt;
- chaos: factory dies mid-build, stale reclaim fences the zombie, a late
  transition from the dead holder is rejected, survivor claims with a higher
  epoch, builds, and ships with the run result/engine recorded;
- no double-assignment under a true claim race;
- expiry reaper recovers a vanished factory's job (requeue + epoch bump + lease expired).

Test-only; deterministic and offline (memory provider, injected time, no network).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 12:05:07 -07:00
saravanakumardb1
6770bbeef2 feat(fleet): honor job.autoMerge on ship and surface PR-merge failures
job.autoMerge was persisted but ignored — PR merging fired only when the host
set FLEET_SHIP_MERGES_PR=1, and a failed merge was silent (PR left open, no
signal). Now:

- mergeRunPrOnShip merges when EITHER the job opted in (job.autoMerge) OR the
  global flag is set (new pure, unit-tested shouldMergePrOnShip gate). Existing
  global-flag behavior is preserved.
- Merge outcomes are surfaced as job events: pr_merged on success (inline or via
  background retry) and pr_merge_failed when the inline attempt + 4 background
  retries all fail, so a stuck PR shows on the timeline instead of vanishing.

Still fully best-effort and gated (no merge attempted unless opted in), so the
real-world side effect only happens when explicitly requested.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 11:47:33 -07:00
saravanakumardb1
493027fbad feat(fleet): factory-token expiry, prod-default enforcement, token GC
Hardens the factory credential lifecycle (§12):

- Token expiry: tokens now carry an absolute expiresAt (FLEET_TOKEN_TTL_DAYS,
  default 90; 0 disables). verifyToken rejects an expired token regardless of
  status, bounding the blast radius of a leak.
- Enforcement default: factoryTokenEnforcementEnabled now defaults ON in
  production and OFF in development/test (an explicit FLEET_REQUIRE_FACTORY_TOKEN
  still wins) — real deployments are secure by default while the local prototype
  and the test suite keep working without enrollment.
- Token GC: pruneInvalidatedTokens deletes revoked, expired, and rotating-past-
  grace tokens; wired into the hourly fleet GC sweep (SweepResult.tokensDeleted)
  so the credential store stays bounded.

Covered by new enrollment.test.ts cases (expiry, TTL=0, enforcement default
matrix, prune) and the reaper/sweep accounting.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 11:45:17 -07:00
saravanakumardb1
42d27d8a4f feat(fleet): enforce job.retry — auto-requeue, backoff, and dead-letter
job.retry (max / on / backoff) was persisted but never enforced: a failed
attempt just went to `failed` and required a manual operator requeue. Now, when
a factory releases a lease reporting a failure, the coordinator applies the policy:

- retryable result (matches a retry.on class) with attempts remaining ⇒ requeue
  (queued, or blocked if deps are now unmet) with a retry backoff;
- retryable but attempts exhausted ⇒ dead_letter;
- no policy or non-retryable result (capability_mismatch/no_engine) ⇒ failed,
  exactly as before (behavior-preserving).

Backoff is honored via a new job.retryNotBefore timestamp; the scheduler skips a
queued job until it elapses (new pure isAwaitingRetryBackoff gate in selectJob).
parseBackoffMs supports "<n>", "<n>s|m|h", "<n>ms", and "exp" (30s·2^(n-1), capped
1h). retry_scheduled / dead_letter audit events are emitted. decideFailureOutcome
and parseBackoffMs are pure and unit-tested (retry.test.ts), plus scheduler-gate
and end-to-end releaseLease coverage.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 11:40:33 -07:00
saravanakumardb1
68bfa3dbd8 feat(fleet): stale-factory lease reclaim + bounded GC sweep
Two recovery/cleanup gaps left the coordinator's containers growing without
bound and jobs stuck longer than necessary:

- reclaimStaleFactoryLeases: a crashed/partitioned factory stops heartbeating
  ~90s before its 900s lease TTL expires; the reaper now reclaims held leases of
  stale (or vanished) holders within one stale window, via the same fence +
  checkpoint-preserving path as the expiry reaper (refactored into reclaimLeaseJob).

- sweepFleetGarbage: deletes ephemeral coordination state on by default (finished
  expired/released leases past a 24h TTL; factory docs with no heartbeat for 7d —
  a live host just re-registers). Terminal-job retention (jobs + their runs/events/
  artifacts+blobs) is OPT-IN only via FLEET_GC_RETENTION_DAYS (default 0 = never
  delete history). Every delete is best-effort so one failure can't stall the sweep.

Both are wired into the existing reaper loop: recovery scans run every 30s, the
deletion sweep is throttled to hourly. New repo helpers (listHeldLeases,
listFinishedLeasesOlderThan, deleteLease, listAllFactories, deleteFactory,
listTerminalJobsOlderThan, deleteRun, deleteEvent) back the new coordinator
functions. Covered by cleanup.test.ts + expanded reaper.test.ts.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 11:34:14 -07:00
saravanakumardb1
0bf8be9be5 fix(fleet): schedule the lease reaper so dead-factory jobs are recovered
reapExpiredLeases implements the full section-25 recovery (fence the zombie
holder via a leaseEpoch bump, return the job to queued/blocked, preserve the
checkpoint) but nothing ever called it: no route, no cron, no timer. So when a
factory crashed, lost network, or shut down, its in-flight job stayed stuck in
an active stage forever and was never requeued — the recovery code was dormant.

Add a process-wide background reaper (leases are queried across all products)
that runs reapExpiredLeases every 30s, started at server boot and stopped on
graceful shutdown, mirroring the diagnostics trigger-job pattern. A failing pass
is logged and retried on the next tick rather than crashing the service.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 11:11:14 -07:00
saravanakumardb1
f0a30b8356 fix(fleet): enforce the job's concrete engine in routing
The engine picker only constrained the UI; the router never gated on the
chosen engine, so a job pinned to e.g. engine:codex could be claimed by a
factory that doesn't run codex (the runner's resolve_engine honors an explicit
engine with no availability check, so it would then fail at execution time).

Add a pure engineEligible(job, caps) hard gate to the section-7 scheduler filter
(and preemption): a concrete-engine job runs only on a factory advertising the
matching engine:<e> cap. Gated only against engine-aware factories (those that
advertise any engine:* token); engine-agnostic/legacy factories stay eligible,
mirroring the picker's "engine set unknown => offer all" fallback. explainJob now
surfaces the mismatch reason. No DB migration; behavior-preserving for legacy.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 10:59:57 -07:00
saravanakumardb1
d318e0fa2a feat(fleet): engine picker offers only engines the factory advertises
Add GET /fleet/factories (lists a product's factory docs with capabilities) —
also fixes the fleet map's empty factory cards (listFactories had no route and
silently returned []). The New-Job form now loads the selected factory's
engine:* capabilities and constrains the engine dropdown to those (e.g. hides
codex when the host doesn't have it), keeping the current pick valid; falls back
to all engines when capabilities are unknown.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 02:38:35 -07:00
saravanakumardb1
6c31577cf2 feat(fleet): per-job engine picker (devin/claude/codex/copilot, default devin)
Add an optional concrete `engine` to a job (overrides engineClass; resolved by
the runner's resolve_engine where an explicit engine wins). All additive +
optional, so existing engineless jobs keep falling back to the factory default.

- types: FLEET_ENGINES enum; engine on SubmitJob/FleetJobDoc/UpdateDraft.
- coordinator: store engine on create/supersede/updateDraft; run.engine at claim
  prefers job.engine, then engineClass, then 'unknown'.
- tracker-web: Engine dropdown on the New-Job form (default devin) + editable on
  draft/queued jobs; shown in the detail config grid.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 02:30:22 -07:00
saravanakumardb1
928edad0af feat(fleet): surface engine + agent session, editable config, timeline filter
Backend: insights now carry engine + sessionId/sessionUrl; releaseLease promotes
the reported engine onto the run (was created with the abstract engineClass,
usually 'unknown').

tracker-web job detail:
- Runs: show the concrete engine (insights.engine, falls back off 'unknown') and
  the agent session (Devin session id with a `devin --resume <id>` hint, or a
  link when a sessionUrl is present).
- PromptCard: edit repo/baseBranch/verify/autoMerge (not just the prompt) while
  draft/queued/blocked.
- Timeline: filter by event type (default collapses heartbeat runs).
- Show a "no PR — needs verify / not PR mode" hint when parked in review.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 02:19:38 -07:00
saravanakumardb1
e4c84acf29 fix(fleet): bump the M0 queue gate when a draft/queued job is edited
updateDraft changes caps/deps/priority (and body) without a stage change, so it
did not bump fleet_queue_state — gated factories (AQ_FLEET_GATE=1) would not
re-evaluate claimability until the safety interval. Bump the gate on edit so an
edit that makes a job claimable wakes factories promptly.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:58:45 -07:00
saravanakumardb1
b7df779e1d feat(fleet): draft jobs + editable prompt (save-as-draft, submit, lock on pickup)
Backend (platform-service):
- New `draft` stage (not claimable; scheduler only takes queued/blocked).
- submitJob accepts `draft: true` → parks a new/superseded job as a draft.
- updateDraft(): edit prompt/config in place while draft/queued/blocked;
  recomputes contentHash; rejected (conflict) once picked up (assigned+).
- submitDraft(): promote draft → queued (or blocked on unmet deps); idempotent.
- Routes: PATCH /fleet/jobs/:id/draft, POST /fleet/jobs/:id/submit.
- tracker-bridge: map draft → item status `open`. Tests + FLEET_STAGES updated.

Frontend (tracker-web):
- New-Job form: add "Save as draft" alongside "Submit".
- Job detail: edit the prompt + Save while draft/queued/blocked, "Submit" a
  draft, and lock it read-only once a factory picks it up.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:44:10 -07:00
saravanakumardb1
78c4e47460 docs(gigafactory): fix stale/incorrect fleet docs
- fleet module README: add fleet_queue_state container + GET /fleet/queue-state
  and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale
  threshold (AQ_FLEET_LEASE_RENEW_SEC).
- FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and
  /fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and
  /fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories);
  add enroll, metrics, and the M0 queue-state endpoint.
- ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and
  Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings.
- README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 00:03:05 -07:00
saravanakumardb1
ba7db0008d feat(fleet): M0 RU gate — cheap per-product queue version + skip-claim
Adds fleet_queue_state (monotonic version per product), bumped on job create +
every stage change in the repository layer (best-effort, never fails a job
write), and a GET /fleet/queue-state read endpoint. Lets a polling factory
detect "work changed" with a ~1 RU point read instead of a full listJobs scan
on every claim. Registers the container; tests cover the bump + endpoint.

See agent-queue docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md §8/§12 (M0).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 23:18:27 -07:00
Saravanakumar Dhandapani
65c7d09584 feat(auth): include displayName claim in platform access token
Adds an optional displayName claim to the platform access token so
downstream product backends can source the user's display name from the
JWT (single source of truth = platform auth), not from per-product DB
copies. verifyToken already exposes displayName; this populates it at all
token-minting sites (password login, register, refresh, SSO, OAuth,
magic-link, passkeys, QR, enterprise SAML/OIDC). Additive and backward
compatible.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-31 20:23:15 -07:00
saravanakumardb1
2bd97791c9 feat(fleet): resilient PR merge on ship (inline attempt + background retry)
The corporate proxy intermittently 407s GitHub's API, so a single gh pr merge can
fail transiently. Try once inline (fast path), then retry in the background with
backoff (3s/8s/20s/45s) without blocking the ship; mark prState=merged when one
lands. Best-effort throughout.
2026-05-31 16:13:52 -07:00
saravanakumardb1
740335a149 feat(fleet): Ship can also merge the linked PR (gh pr merge)
On ship (Ship button / operator action / autoship PATCH), when the run has an open
PR and FLEET_SHIP_MERGES_PR=1, the coordinator squash-merges it via gh (best-effort,
where gh is authed) and marks the run prState=merged. UI button reads 'Ship & merge
PR' when an open PR exists; Ship refreshes runs.
2026-05-31 16:08:28 -07:00
saravanakumardb1
696ee4189e feat(fleet): track + show PR status (open/merged) on the run
Runs now carry prState (open when the PR is opened, merged when auto-merge
succeeds), reported on lease release. Job-detail Runs table shows a status badge
next to the PR link.
2026-05-31 13:56:33 -07:00
saravanakumardb1
c239abeec9 feat(fleet): per-repo verify + auto-merge options for PR jobs
Add job-level verify (command run in the PR checkout before opening the PR) and
autoMerge (squash-merge the PR once opened). Surfaced in the New Job form as a
Verify-command field + Auto-merge checkbox (PR mode only); confirmation now shows
PR-mode/repo. More repos added to the dropdown.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 06:17:19 -07:00
saravanakumardb1
883cf329e5 feat(fleet): PR deliverables — jobs target a repo, factory opens a PR, link recorded
Make "shipped" produce a real artifact. A job can now carry an optional repo
(owner/name or clone URL) + baseBranch; the factory's PR mode runs the agent in an
isolated checkout, opens a PR, and records the link.

Backend:
- SubmitJobSchema + FleetJobDoc: optional repo/baseBranch (recorded on submit).
- FleetRunDoc: optional prUrl/branch.
- ReleaseLease report carries prUrl/branch -> stored on the run.
- +2 coordinator tests.

UI (tracker-web):
- New Job form gains optional Repo + Base branch fields (and fixes the priority
  options to the valid critical/high/medium/low; "normal" was rejected by the API).
- Job detail Runs table shows a PR ↗ link from run.prUrl.
- fleet-client: submitJob repo/baseBranch; FleetRun prUrl/branch; OperatorAction +ship.

Docs: FLEET_CONTROL_PLANE.md "PR deliverable (PR mode)" section.

Verified: tsc clean; fleet suite 182; tracker-web 230.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 05:27:11 -07:00
saravanakumardb1
bcdb5dcdf3 fix(fleet): run result mirrors a shipped job + document testing->shipped
A job can reach `shipped` via autoship PATCH, the `ship` operator action, or a
terminal lease release, but the run-level `result` was left at whatever the
factory last reported (e.g. `review`), so the dashboard showed a shipped job with
a non-terminal run result.

- Add markLatestRunShipped(): on any transition to `shipped`, set the latest run
  result to `shipped` (+ endedAt if unset). Idempotent, best-effort.
- Wire it into patchJobFenced (ungated; budget accrual stays flag-gated) and the
  `ship` operator action.
- Document the testing->shipped paths (factory autoship vs `ship` operator action)
  and the run-mirroring in docs/GIGAFACTORY/FLEET_CONTROL_PLANE.md.

Tests: +2 (patchJobFenced->shipped and operator ship both set run.result=shipped).
Fleet suite 180 pass; tsc clean.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:31:46 -07:00
saravanakumardb1
8fe26027e7 chore(deps): bump @types/node 22 -> 25 (dev types)
Verified: full workspace build (tsc) green across all packages/services/dashboards;
fleet+items tests pass. Compile-time only.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:02:56 -07:00
saravanakumardb1
3022e634b8 feat(billing): migrate to Stripe SDK v20 (API shape changes)
Bump stripe 17 -> 20 and adapt to the breaking API changes:
- PromotionCode: coupon moved under `promotion` ({ type: 'coupon', coupon }).
  mapPromo now reads p.promotion.coupon; create now passes
  promotion: { type: 'coupon', coupon: id } instead of a top-level coupon.
- Subscription.current_period_end removed (now per subscription item). Add
  getSubscriptionPeriodEnd() = max(items[].current_period_end) and use it in the
  customer.subscription.updated webhook handler.
- Update the promos route test fixture to the new promotion.coupon shape.

Verified: platform-service build (tsc) clean; promos (14) + stripe/subscriptions/
billing tests pass; full suite 1692/1692.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 04:00:13 -07:00
saravanakumardb1
0079274bff chore(deps): bump bcryptjs 2 -> 3 (+ @types/bcryptjs 3)
Verified: auth + platform-service build (tsc, default import still resolves);
packages/auth (31) + platform-service auth/api-key (65) pass (hash/compare work).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 03:46:19 -07:00
saravanakumardb1
66a4a74aa6 chore(deps): bump @fastify/cors 10 -> 11
Verified: fastify-core + platform-service build (tsc); createServiceApp boot smoke
handles a CORS preflight (OPTIONS -> 204 with access-control-allow-origin).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 03:44:54 -07:00
saravanakumardb1
808f615124 chore(deps): bump @azure/cosmos, jose, @typescript-eslint/parser
Applied fresh on current main (the matching dependabot branches were 350-430
commits behind and would have conflicted on the lockfile):
- @azure/cosmos 4.9.1 -> 4.9.3
- jose 6.1.3 -> 6.2.3 (mcp-server stays on the 5.x line: 5.9.6 -> 5.10.0)
- @typescript-eslint/parser 8.0 -> 8.60.0

Verified: full workspace build green; platform-service suite 1684 pass (only the
pre-existing single-fork migration-isolation flake, passes isolated); tracker-web
228 pass (only the pre-existing happy-dom product-context failures). No new
regressions. Major bumps (fastify/cors, happy-dom, lint-staged, stripe,
types/node) deferred for separate review.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 03:28:40 -07:00
saravanakumardb1
1503ef2e19 fix(fleet): Phase 3 hardening — budget authz, idempotent accrual, cycle detection, artifact
Re-applies 4 defects merged with Phase 3 (still present on main), ported from the
stale reference branch and adapted to current main.

FIX 1 (BLOCKER): all 5 budget routes (GET/burndown/PUT/pause/resume) read
productId from the URL with no caller check, so a caller for product A could
read or modify product B budget. Add requireOwnProduct() -> 403 on mismatch.

FIX 2: stop tracking services/platform-service/.data/platform-events.json (the
EVENT_BUS_FILE runtime log); gitignore .data/.

FIX 3: accrueSpend was definition-only (budgets never accrued) and not
idempotent. Add accruedRunIds to FleetBudgetDoc; accrueSpend(productId, costUsd,
runId) no-ops on a seen runId; wire it into patchJobFenced on stage shipped,
flag-gated + best-effort + idempotent via jobId:leaseEpoch. Accrue the run ACTUAL
insights.costUsd so spentUsd and costBurndown agree.

FIX 4: submitChildren cycle detection is now batch-aware — rejects duplicate
child keys and walks each child declared deps across BOTH the unpersisted batch
and existing jobs, rejecting child-self, child-parent and sibling cycles.

Tests: +2 budget-authz (403 on all 5 verbs), +3 accrual (idempotent runId, ship
accrues insights.costUsd once, flag-off no accrual), +5 cycle detection. Gates
green: tsc/build, fleet+items (204), full suite (only the unrelated single-fork
migration-isolation file flakes, passes isolated). Flags stay default-OFF.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 02:45:52 -07:00
saravanakumardb1
32e426d423 feat(fleet): run-insights reporting + ship action; complete the lifecycle
Make the Agent Gigafactory fully drivable end-to-end via the API:

- lease/release now accepts `insights` (model/tokens/cost) + `result`, recorded
  on the current run with endedAt — factories report cost/token metrics on
  completion (previously no API existed; runs stayed insights:{}).
- add `ship` operator action so a job in `testing` (where the review gate left
  no lease holder) can reach the terminal `shipped` stage. Idempotent.
- operatorAction now retries on optimistic-concurrency conflict with backoff
  (mirrors submitReview) so a ship right after approve survives real-Cosmos
  read-after-write lag instead of a spurious 409.

Tests: +2 coordinator (ship idempotent, release records insights) and +2 route
integration (gated submit->...->ship->metrics; release-with-insights). 170 pass.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-31 01:49:59 -07:00
saravanakumardb1
e0a904c7ea fix(cosmos): support composite indexes; add fleet_jobs (priorityOrder, createdAt)
Azure Cosmos cannot serve a multi-field ORDER BY without a matching composite
index (the local emulator is lenient, real Cosmos returns HTTP 400). The fleet
listJobs() query orders by (priorityOrder, createdAt), which broke
GET /api/fleet/metrics and /api/fleet/jobs on real Cosmos.

- ContainerConfig gains an optional `compositeIndexes` field
- container init applies it on create AND reconciles it onto existing
  containers (createIfNotExists never updates an existing index policy)
- fleet_jobs declares the (priorityOrder ASC, createdAt ASC) composite index

Verified live against Azure Cosmos: both endpoints now return 200.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 23:07:06 -07:00
Saravanakumar D
20bcbd9f19 docs(gigafactory): uppercase GIGAFACTORY folder + add index README
Rename docs/gigafactory/ to docs/GIGAFACTORY/ and update the cross-repo
source-of-truth references in the fleet README and types.ts comment. Add an
index README listing the platform docs and pointing to the canonical spec in
learning_ai_devops_tools. Docs/comment only; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:22:21 -07:00
Saravanakumar D
fc000c4c20 docs(gigafactory): consolidate gigafactory docs into docs/gigafactory/
Move ROADMAP_COMPLETION_AUDIT.md, TASKS_TO_COMPLETE.md,
gigafactory-phase3-progress.md and FLEET_CONTROL_PLANE.md under
docs/gigafactory/ so the scattered Gigafactory docs are easy to discover.
Update intra-doc and cross-repo source-of-truth references (fleet README
and types.ts comment) to the new agent-queue/docs/gigafactory/ path.
Pure docs/comment move; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 21:07:20 -07:00
Saravanakumar D
4ac499f301 feat: add multi-reviewer human gate (review-policy routing)
Implements the §14 Phase 3 review gate. requestReview() routes a building
job into the review stage (fencing any worker), carrying a normalized policy
(requiredApprovals + reviewer allowlist) and clearing prior decisions.
submitReview() records one decision per reviewer (last-write-wins, identity-
normalized), advances the job to testing once distinct approvals reach the
quorum, and treats any reject as a veto that returns the job to queued for
rework. Adds POST /fleet/jobs/:id/review/request and POST /fleet/jobs/:id/review,
a typed client, and a review-gate card on the job-detail page.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 19:04:24 -07:00
Saravanakumar D
c9c2c174db feat: add fleet metrics + alerting (GET /fleet/metrics)
Adds coordinator.fleetMetrics() computing queue depth, stage histogram,
oldest-queued age (starvation signal), factory health and seat utilization,
plus derived alerts (no_live_capacity, all_factories_down, queue_starvation,
saturated, stale_factories). Exposed via GET /fleet/metrics and surfaced as a
metrics+alerts panel on the fleet overview. Thresholds injectable for tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:51:59 -07:00
Saravanakumar D
ea42602407 feat: add resumable SSE live event stream for fleet jobs
Backend: GET /fleet/jobs/:id/events/stream emits a snapshot (seq > Last-Event-ID)
then long-polls the append-only event log, closing after a bounded window so
EventSource-style clients reconnect cleanly. Honors Last-Event-ID resume,
keepalive comments, and a terminal error frame.

Frontend: subscribeJobEvents uses fetch streaming (to send auth + product
headers) with parseSseFrames, Last-Event-ID resume, reconnect backoff, and a
fatal-on-error-frame fallback to polling. Job detail page subscribes live
(deduped by seq), falls back to 4s polling on failure, and shows a Live badge;
refresh() now merges events so a slow snapshot can't clobber streamed ones.

Tests: +3 route (snapshot, resume cursor, append-after-connect), +5 client
(parseSseFrames x2, subscribe deliver/error/resume/error-frame). fleet 150, web 222.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:38:50 -07:00
Saravanakumar D
89860e39f9 feat: add fleet cost burndown chart
- coordinator.costBurndown() aggregates completed run cost (insights.costUsd)
  by UTC day over a window, returning a gap-free cumulative series + ceiling
- repository.listRunsByProduct() cross-partition run query
- GET /fleet/budgets/:productId/burndown?days=N route
- fleet-client.getBudgetBurndown() + CostBurndown/BurndownPoint types
- BurndownChart on the budget page: cumulative daily bars with a dashed
  ceiling overlay; bars turn red past the ceiling; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 147, web 216)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:25:27 -07:00
Saravanakumar D
2d5f9be642 feat: surface scoring explainability in fleet control plane
Adds 'why does this job route here?' to the §7 scheduler:
- coordinator.explainJob() re-runs scoreCandidate against every live factory,
  returning per-factory weighted breakdown, eligibility + reasons, deps state,
  and the best eligible factory (read-only, side-effect free)
- GET /fleet/jobs/:id/explain route (404 when job missing)
- fleet-client.getJobExplain() + JobExplain/ScoreBreakdown types
- ExplainPanel on the job detail page: score table per factory with the six
  weighted terms, eligibility, and unmet-deps note; degrades gracefully
- Tests: +2 coordinator, +1 routes, +2 fleet-client (fleet 144 green,
  tracker-web 214 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:21:14 -07:00
Saravanakumar D
c8ab43d3ae fix: harden operator action lease release + idempotent terminal actions
- release lease with fenced epoch (leaseEpoch+1, clear holder) so a stale
  renewal cannot resurrect a held lease after operator displacement
- reject on dead_letter / cancel on failed are now idempotent no-ops
  (no epoch bump, no duplicate event)
- add coordinator test for terminal idempotency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:15:26 -07:00
Saravanakumar D
283383561c feat: complete operator job actions (requeue/reject/cancel)
Adds lease-free operator lifecycle control to the fleet control plane:
- coordinator.operatorAction() fences any current factory holder by bumping
  leaseEpoch (mirrors the reaper), preserves checkpoint, and routes the job:
  requeue -> queued (or blocked if deps unmet), reject -> dead_letter,
  cancel -> failed. Shipped jobs are terminal (invalid_state).
- POST /fleet/jobs/:id/actions/:action route (400 on unknown action)
- fleet-client.operatorAction() + Requeue/Cancel/Reject buttons on job detail
- Tests: +5 coordinator, +1 routes, +2 fleet-client (platform fleet 140 green,
  tracker-web 212 green)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 18:12:11 -07:00
Saravanakumar D
b061cc47f3 feat(tracker-web): fleet control plane UI — overview, jobs, budget, detail pages (Phase 3 Slice 4)
- Fleet overview page with factory cards + recent jobs polling
- Job table with stage filter tabs
- Job detail page with events timeline, runs, artifacts, DAG subtree, SHIP action
- Budget page with usage bar, pause/resume controls
- API proxy route forwarding /api/fleet/* to platform-service
- Typed fleet-client.ts with graceful 404 degradation
- 16 unit tests for fleet-client (198 total tracker-web tests green)
- Added Fleet nav item to dashboard layout
- Full monorepo build + test green

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:24 -07:00
Saravanakumar D
f4ea7b4a5b feat(fleet): per-product budgets with pause/resume (Phase 3 Slice 3)
- FleetBudgetDoc: ceilingUsd, window, spentUsd, status (active/paused)
- FLEET_BUDGETS flag (default OFF = no enforcement, unchanged behavior)
- Enforcement in claimNextJob: paused or ceiling-exceeded → null
- accrueSpend(): monotonic spend accumulation, auto-pause at ceiling
- Budget routes: GET/PUT /fleet/budgets/:productId, pause, resume
- UpsertBudgetSchema for route validation
- 7 new coordinator tests (ceiling, auto-pause, manual pause/resume,
  flag OFF bypass, monotonic accounting, cross-product isolation)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-30 09:49:23 -07:00