learning_ai_common_plat

Author	SHA1	Message	Date
Saravana Kumar	11f609eb3d	chore: ignore local graphify outputs	2026-06-06 02:57:05 +00:00
saravanakumardb1	1b6e644ea6	docs(fleet): add TODO to validate exactly-once claim under true contention Capture the plan to close the known gap: existing CAS/fencing tests run on MemoryDatastoreProvider (exact only for sequential calls), so single-winner behavior is not yet demonstrated under a true interleaved read->write window or against Cosmos _etag/If-Match. No production change expected; Cosmos provider already conditions writes with IfMatch. Documents Option B (adversarial interleaving test, no infra) and Option A (emulator-gated integration test), acceptance criteria, and downstream doc updates.	2026-06-03 09:55:34 -07:00
saravanakumardb1	4ac5a747d1	feat(scripts): fleet-logs.sh to tail/inspect a Devin fleet job's logs Convenience CLI over the agent-queue factory logs: resolves the agent-queue checkout (AQ override or sibling default), takes a full/partial job id (defaults to newest), and exposes ls/status/tail/steps/watch/full/path over the runner .log and the live Devin transcript (.devin-export.json steps[]). Referenced from the §8 Observe section of the fleet run runbook. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-02 01:09:19 -07:00
saravanakumardb1	e6611cae1a	docs(fleet): runbook to run a Devin fleet job end-to-end (local) Add docs/runbooks/FLEET_DEVIN_LOCAL_RUN.md: how developers and coding agents spin up platform-service + tracker-web + an agent-queue factory so a submitted job is claimed and run autonomously by the Devin CLI against a target repo (worked example: learning_ai_notes), pushing a branch and opening a real PR. Covers: architecture + lifecycle, prerequisites incl. fresh-machine setup (clone both repos, .env/Cosmos, pnpm -r build so host-run resolves @bytelyst/* from dist/), all-localhost (no Docker) path as primary + Docker as the Grafana/Prometheus option, local JWT minting, job submit, factory launch, observe, PR-state reconcile, safety/cost, teardown, troubleshooting, and a copy-paste quickstart. Calls out two gotchas learned in practice: set AQ_FLEET_LEASE_RENEW_SEC < 90 so the factory heartbeat beats the coordinator's 90s stale-factory reclaim window (else a busy single-slot factory's in-flight lease is reclaimed mid-run and the final report is fenced), and a WSL-on-Windows differences section (run inside WSL, repos off /mnt/c, LF endings, gh/devin/node in WSL, localhost forwarding). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-02 00:50:29 -07:00
saravanakumardb1	6bddc88f0f	feat(fleet): reconcile PR state against GitHub (detect externally-merged PR) A PR merged in the GitHub UI was invisible to the platform — prState only flipped to `merged` when the platform merged it (ghMergePr on ship) or a runner reported it, so the job details page kept showing the PR as open. This adds a simple, pull-based reconcile (no inbound webhook / public ingress needed). - coordinator.reconcileJobPrState(jobId, productId, fetcher?): finds the latest run carrying a prUrl and, when `gh pr view --json state` reports MERGED, flips the run's prState to `merged` and appends a `pr_merged` event (data.via: 'reconcile'). The GitHub lookup is injectable for tests; pure `mapGhPrState` maps MERGED/OPEN/other. Best-effort: any gh failure is a no-op. - POST /fleet/jobs/:id/pr/reconcile route; echoes the outcome to the tracker Item when a merge is detected. - tracker-web: reconcilePrState() client + a "Refresh PR status" button on the job details PR section (shown until the PR is merged) that reconciles then refreshes the view. Tests: +5 (mapGhPrState, reconcile merged/open/no_pr/not_found, route wiring); full suite 1861 green; lint + tsc clean (service + tracker-web). Deployed: rebuilt the docker platform-service; POST .../pr/reconcile returns 401 (wired), not 404. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 23:58:13 -07:00
saravanakumardb1	c63736459b	feat(fleet): anti-flap hysteresis + autoscale Prometheus series & dashboard (ops #5) Make the capacity autoscaling signal safe to act on automatically and observable in Grafana. Anti-flap hysteresis: - New pure applyHysteresis: suppresses a direction reversal (scale_in after scale_out, or vice versa) within a cooldown window so a consumer cannot thrash capacity. A critical scale-out (queued work, zero usable capacity) always bypasses the cooldown. Cooldown anchor only advances on an emitted action, so a suppressed signal keeps counting down from the real last action. - Process-wide per-product cooldown state (mirrors reaper/breaker in-mem state) with a test seam; cooldown tunable via FLEET_AUTOSCALE_COOLDOWN_SEC (default 300). - GET /fleet/autoscale[/all] now serve the debounced (stateful) recommendation. Observability: - Prometheus exposition emits the RAW recommendation per product (fleet_autoscale_recommended_seats/delta/pressure + one-hot fleet_autoscale_action {action}). RAW (not stateful) so a scrape never mutates the cooldown anchors. - Grafana "Fleet Overview" gains two panels: products recommending scale-out (stat) + recommended seat delta vs backlog (timeseries). Docs: FLEET_AUTOSCALE_COOLDOWN_SEC in .env.example. Tests: +10 (hysteresis/stateful/cooldown + prom autoscale series); full suite 1856 green; lint + tsc clean. Verified live: a throwaway Prometheus scraped the running service and the dashboard PromQL returned real scale-out/scale-in recommendations across products. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 23:02:08 -07:00
saravanakumardb1	321cfe7546	feat(fleet): correlation-id tracing + capacity autoscaling signal (ops #4/#5) Thread a trace-context correlation id across the coordinator<->runner boundary so a logical work-unit (job -> claim -> run -> ship) is stitchable end to end, and add an advisory capacity autoscaling signal an external scaler can consume. Tracing (#4): - Mint/propagate a correlationId at submit from the inbound x-correlation-id/traceparent/x-request-id (else generate ftr_<uuid>); persist it on the job, inherit onto the run + lease at claim, and stamp every lifecycle event (submitted/assigned/transition/lease_renewed/lease_released/retry_scheduled/dead_letter). Children of a composite job share the parent id. - Echo it back on the x-correlation-id response header (submit/claim/renew/release/patch) so a factory can carry it forward, and bind it to req.log. - New pure trace.ts (header resolution incl. W3C traceparent trace-id). Autoscaling signal (#5): - New pure autoscaler.ts turns a product FleetMetrics + saturation alerts (no_live_capacity/saturated/queue_starvation) into an auditable scale recommendation (action/recommendedSeats/delta/urgency/signals). budget_exhausted suppresses scale-out; idle slack reclaims down to a floor. Thresholds tunable via FLEET_AUTOSCALE_* env. - GET /fleet/autoscale (per-product) + GET /fleet/autoscale/all (global, admin or scrape token). Documented the env vars in .env.example. Tests: +29 (trace 10, tracing 7, autoscaler 12); full suite 1846 green; lint + tsc clean. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 22:43:56 -07:00
saravanakumardb1	93d1caf4a2	feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4) Exports fleet observability to Prometheus/Grafana (previously JSON-only). - GET /api/fleet/metrics/prom: global, product-labelled Prometheus exposition (queue depth, blocked/active, per-stage histogram, factory health/seats/utilization, active alerts, budget spent/ceiling/projected) plus process-wide reaper/GC counters and engine circuit-breaker state. Pure renderer (renderFleetMetricsProm) is unit-tested; route auth accepts a FLEET_METRICS_TOKEN bearer (scrape path) or an admin JWT — never world-readable by default. - Infra: add a prometheus container to docker-compose + a platform-service-fleet scrape job; pin the Prometheus Grafana datasource uid; add a provisioned "Fleet Overview" dashboard (breakers, dead-letter, stale factories, alerts, queue depth, utilization, budget burn, reaper rate) with a product template var. - Document FLEET_METRICS_TOKEN + the fleet feature flags in .env.example. No default behavior change: the endpoint is additive and the new container is opt-in via the compose stack. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 22:24:03 -07:00
saravanakumardb1	acf7c36cda	feat(tracker-web): dynamic, owner-scoped project switcher via /products/mine Wires the switcher to real per-user data instead of a static list: - New /api/products/[...path] proxy to platform-service (mirrors the fleet proxy), exposing GET /api/products/mine. - ProductProvider fetches the caller's owner-scoped projects on mount (when a tracker_token is present) and uses them as the switcher list. Best-effort: any failure / empty result / unauthenticated state keeps the configured fallback list, so dev and logged-out rendering still work. Combined with the earlier config-driven list + auto-hide, the switcher now reflects the authenticated user's projects on a generic platform. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 16:48:27 -07:00
saravanakumardb1	42c63dcc6e	feat(platform): product ownership + owner-scoped "my projects" + tenant guard Foundation for a generic, multi-tenant platform (any developer, not just the built-in products). - Products carry an optional ownerId (set on create + auto-register), so a product has a tenant. GET /products/mine returns the caller's owner-scoped list; admins/super_admins see all. productsForUser() is pure + unit-tested. - requireProductAccess(): a flag-gated tenant authorization guard (FLEET_TENANT_ENFORCEMENT, default OFF). OFF = byte-for-byte current behavior; ON = a non-admin may only act on products they own (others -> 403; owner-less legacy products keep a grace allowance until migrated). Fleet routes now resolve productId through it in place of getRequestProductId. ownerId is additive/optional; enforcement is off by default, so this is a no-op for existing deployments until explicitly enabled. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 16:47:05 -07:00
saravanakumardb1	2c4357b71b	feat(tracker-web): make product switcher generic — configurable list + auto-hide Steps toward a tenant-neutral platform (not hardcoded to the ByteLyst products): - The selectable product list is now configurable via NEXT_PUBLIC_PRODUCTS (JSON array of { id, name, icon? }), defaulting to the built-in set. A pure, defensive parser (parseProductsEnv) falls back to the default on any malformed value so a bad env can never blank the switcher. - The sidebar switcher auto-hides when there is <= 1 product, so a solo / freelance / single-tenant deployment shows no switcher clutter. - Dedupe: the server product-config now re-exports the single client-safe list instead of keeping a second hardcoded copy. NOTE: true per-user "only your projects" scoping + server-side tenant authorization still requires an ownership/membership model that does not exist yet (ProductDoc has no owner/members; products are a global registry). That is a deliberate, separate effort needing a product decision and is not included here. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 16:27:14 -07:00
saravanakumardb1	705d8e8eaa	feat(tracker-web): consolidated fleet overview — breaker panel, budget guardrail, dead-letter triage Surfaces the §2/§3 signal on the fleet overview page (read-only over existing endpoints): - Budget guardrail card: spend vs ceiling bar, projected end-of-window burn (highlighted when over ceiling), and per-engine sub-ceiling breakdown — from the new /fleet/metrics budget summary. - Engine circuit-breaker panel: lists only tripped/probing (factory, engine) pairs from metrics.engineBreakers. - Dead-letter triage table with a Re-drive button wired to the redrive operator action; filtered client-side so it is correct regardless of server filtering. All panels render only when their data is present, so the page is unchanged for fleets without budgets/breakers/dead-letters. Adds a happy-dom page test. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 13:29:52 -07:00
saravanakumardb1	bcd806c6ff	feat(fleet): complete budget enforcement — per-engine ceilings + overspend projection Builds on the existing product-level hard claim gate + idempotent accrual. - Per-engine sub-ceilings (engineCeilingsUsd) with per-engine accrual (spentByEngineUsd). An engine at its sub-ceiling is routed around at claim time via the same per-engine availability gate as the circuit breaker — it never pauses the whole product, so other engines keep flowing. Gated by FLEET_BUDGETS (defaults off). - /fleet/metrics now surfaces a budget summary (ceiling/spend/status/projection + per-engine breakdown) and derives guardrail alerts: budget_overspend_projected (burn-rate extrapolation, guarded against early-window false alarms), budget_exhausted, and engine_budget_exhausted. Surfaced whenever a budget exists, independent of the enforcement flag, so operators see the burn in dry-run. projectBudgetSpend is pure + unit-tested; per-engine spend follows the same idempotent accrual path as the total, so spentUsd and spentByEngineUsd agree. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 13:23:44 -07:00
saravanakumardb1	bdbb0a8ce4	feat(fleet): cost/latency-aware engine routing + per-engine circuit breaker Adds two additive, flag-gated routing refinements on top of the §7 scoring core; both default OFF so the deterministic claim path is unchanged. - FLEET_COST_ROUTING: soft engineQuality term (weight 0.4) biases routing toward the historically cheaper/faster engine, derived from per-engine insights.costUsd + run duration. No-history engines stay neutral, so the nudge can only demote demonstrably costly engines, never penalise new ones. - FLEET_ENGINE_BREAKER: per-(factory, engine) circuit breaker. releaseLease always records outcomes (observable via /fleet/metrics engineBreakers); when enabled, an OPEN pair is routed around. Only ever restricts the candidate set — never forces a route. The scheduler stays pure: history lookup + availability gate are injected predicates. New engineQuality term contributes 0 unless a lookup is supplied, preserving every existing score/breakdown. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 13:17:25 -07:00
saravanakumardb1	5413c0e789	feat(fleet): reaper/GC telemetry in /fleet/metrics + lease TTL backstop Observability + defense-in-depth for the recovery/GC machinery: - The reaper now accumulates process-wide telemetry (getReaperStats): cumulative expired/stale reclaims and per-container GC deletions, plus startedAt/last-run timestamps. GET /fleet/metrics returns it under a `reaper` field so operators can see recovery activity (dead_letter counts/alerts were already added). - Cosmos TTL backstop on fleet_leases (2 days): a held lease is renewed continuously so it never expires while active; only finished leases age out, matching the ~24h app GC. Purely defense-in-depth behind the reaper, which still OWNS recovery (requeue + epoch bump + checkpoint). TTL is deliberately NOT set on fleet_events (ids are <jobId>:evt:<seq> with seq=count, so partial TTL deletion could collide ids); events/runs/jobs are pruned by the cascade GC instead. Memory provider ignores defaultTtl, so tests/dev are unaffected. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 12:18:22 -07:00
saravanakumardb1	141435fe95	feat(fleet): bind advertised capabilities to the enrolled token scope The claim path already constrained a factory to its enrolled scope, but the heartbeat trusted self-reported capabilities — so (with enforcement on) a factory could advertise e.g. engine:codex it was never granted, polluting the engine picker (GET /fleet/factories) and routing/explain decisions even though a codex job still couldn't be claimed by it. Heartbeat now intersects the factory's self-reported capabilities with the token scope when enforcement is ON: it may report FEWER (an engine temporarily unavailable) but never MORE than enrolled. Enforcement OFF is unchanged (self-reported caps pass through verbatim). Covered by new route tests. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 12:14:09 -07:00
saravanakumardb1	a6adaee835	feat(fleet): operator re-drive for dead-letter jobs + dead-letter alert/UI Closes the loop on the retry automation — a job that exhausts its retries lands in dead_letter with no way to recover it: - New `redrive` operator action: requeues the job AND grants a fresh retry budget by anchoring a new `attemptsBase` to the current `attempts` (and clearing any retryNotBefore backoff). `attempts` stays monotonic so run ids never collide; a plain `requeue` leaves the budget exhausted and would instantly re-dead-letter. The retry policy now measures used budget as `attempts - attemptsBase`. - fleetMetrics raises a `dead_letter` warning alert when any job is dead-lettered. - tracker-web: a "Re-drive" button on dead_letter/failed jobs; the timeline already renders the retry_scheduled / dead_letter / pr_merged / pr_merge_failed / factory_stale events generically. Backward compatible: attemptsBase defaults to 0 and old docs without it read as 0. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 12:11:46 -07:00
saravanakumardb1	9e0afc23d2	test(fleet): end-to-end lifecycle/chaos guardrails across coordinator features Adds cross-feature integration tests that drive the coordinator the way a real factory does, asserting the newly-wired paths COMPOSE (the per-feature unit suites only cover them in isolation): - retry: fail then auto-requeue then fail again then dead_letter, one claim per attempt; - chaos: factory dies mid-build, stale reclaim fences the zombie, a late transition from the dead holder is rejected, survivor claims with a higher epoch, builds, and ships with the run result/engine recorded; - no double-assignment under a true claim race; - expiry reaper recovers a vanished factory's job (requeue + epoch bump + lease expired). Test-only; deterministic and offline (memory provider, injected time, no network). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 12:05:07 -07:00
saravanakumardb1	6770bbeef2	feat(fleet): honor job.autoMerge on ship and surface PR-merge failures job.autoMerge was persisted but ignored — PR merging fired only when the host set FLEET_SHIP_MERGES_PR=1, and a failed merge was silent (PR left open, no signal). Now: - mergeRunPrOnShip merges when EITHER the job opted in (job.autoMerge) OR the global flag is set (new pure, unit-tested shouldMergePrOnShip gate). Existing global-flag behavior is preserved. - Merge outcomes are surfaced as job events: pr_merged on success (inline or via background retry) and pr_merge_failed when the inline attempt + 4 background retries all fail, so a stuck PR shows on the timeline instead of vanishing. Still fully best-effort and gated (no merge attempted unless opted in), so the real-world side effect only happens when explicitly requested. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 11:47:33 -07:00
saravanakumardb1	493027fbad	feat(fleet): factory-token expiry, prod-default enforcement, token GC Hardens the factory credential lifecycle (§12): - Token expiry: tokens now carry an absolute expiresAt (FLEET_TOKEN_TTL_DAYS, default 90; 0 disables). verifyToken rejects an expired token regardless of status, bounding the blast radius of a leak. - Enforcement default: factoryTokenEnforcementEnabled now defaults ON in production and OFF in development/test (an explicit FLEET_REQUIRE_FACTORY_TOKEN still wins) — real deployments are secure by default while the local prototype and the test suite keep working without enrollment. - Token GC: pruneInvalidatedTokens deletes revoked, expired, and rotating-past-grace tokens; wired into the hourly fleet GC sweep (SweepResult.tokensDeleted) so the credential store stays bounded. Covered by new enrollment.test.ts cases (expiry, TTL=0, enforcement default matrix, prune) and the reaper/sweep accounting. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 11:45:17 -07:00
saravanakumardb1	42d27d8a4f	feat(fleet): enforce job.retry — auto-requeue, backoff, and dead-letter job.retry (max / on / backoff) was persisted but never enforced: a failed attempt just went to `failed` and required a manual operator requeue. Now, when a factory releases a lease reporting a failure, the coordinator applies the policy: - retryable result (matches a retry.on class) with attempts remaining ⇒ requeue (queued, or blocked if deps are now unmet) with a retry backoff; - retryable but attempts exhausted ⇒ dead_letter; - no policy or non-retryable result (capability_mismatch/no_engine) ⇒ failed, exactly as before (behavior-preserving). Backoff is honored via a new job.retryNotBefore timestamp; the scheduler skips a queued job until it elapses (new pure isAwaitingRetryBackoff gate in selectJob). parseBackoffMs supports "<n>", "<n>s|m|h", "<n>ms", and "exp" (30s·2^(n-1), capped 1h). retry_scheduled / dead_letter audit events are emitted. decideFailureOutcome and parseBackoffMs are pure and unit-tested (retry.test.ts), plus scheduler-gate and end-to-end releaseLease coverage. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 11:40:33 -07:00
saravanakumardb1	68bfa3dbd8	feat(fleet): stale-factory lease reclaim + bounded GC sweep Two recovery/cleanup gaps left the coordinator's containers growing without bound and jobs stuck longer than necessary: - reclaimStaleFactoryLeases: a crashed/partitioned factory stops heartbeating ~90s before its 900s lease TTL expires; the reaper now reclaims held leases of stale (or vanished) holders within one stale window, via the same fence + checkpoint-preserving path as the expiry reaper (refactored into reclaimLeaseJob). - sweepFleetGarbage: deletes ephemeral coordination state on by default (finished expired/released leases past a 24h TTL; factory docs with no heartbeat for 7d — a live host just re-registers). Terminal-job retention (jobs + their runs/events/artifacts+blobs) is OPT-IN only via FLEET_GC_RETENTION_DAYS (default 0 = never delete history). Every delete is best-effort so one failure can't stall the sweep. Both are wired into the existing reaper loop: recovery scans run every 30s, the deletion sweep is throttled to hourly. New repo helpers (listHeldLeases, listFinishedLeasesOlderThan, deleteLease, listAllFactories, deleteFactory, listTerminalJobsOlderThan, deleteRun, deleteEvent) back the new coordinator functions. Covered by cleanup.test.ts + expanded reaper.test.ts. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 11:34:14 -07:00
saravanakumardb1	0bf8be9be5	fix(fleet): schedule the lease reaper so dead-factory jobs are recovered reapExpiredLeases implements the full section-25 recovery (fence the zombie holder via a leaseEpoch bump, return the job to queued/blocked, preserve the checkpoint) but nothing ever called it: no route, no cron, no timer. So when a factory crashed, lost network, or shut down, its in-flight job stayed stuck in an active stage forever and was never requeued — the recovery code was dormant. Add a process-wide background reaper (leases are queried across all products) that runs reapExpiredLeases every 30s, started at server boot and stopped on graceful shutdown, mirroring the diagnostics trigger-job pattern. A failing pass is logged and retried on the next tick rather than crashing the service. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 11:11:14 -07:00
saravanakumardb1	2ed19464c5	fix(tracker-web): exclude stale factories from the engine picker availableEnginesForProduct skipped only health:down factories, so an engine advertised solely by a host that had stopped heartbeating could still be offered in the picker. Also skip factories whose lastHeartbeatAt is older than 90s (mirrors the coordinator's DEFAULT_STALE_FACTORY_MS), and treat an unparseable timestamp as stale. Adds unit coverage for the engine-collection, down, stale, and graceful-degradation paths. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 11:02:56 -07:00
saravanakumardb1	f0a30b8356	fix(fleet): enforce the job's concrete engine in routing The engine picker only constrained the UI; the router never gated on the chosen engine, so a job pinned to e.g. engine:codex could be claimed by a factory that doesn't run codex (the runner's resolve_engine honors an explicit engine with no availability check, so it would then fail at execution time). Add a pure engineEligible(job, caps) hard gate to the section-7 scheduler filter (and preemption): a concrete-engine job runs only on a factory advertising the matching engine:<e> cap. Gated only against engine-aware factories (those that advertise any engine:* token); engine-agnostic/legacy factories stay eligible, mirroring the picker's "engine set unknown => offer all" fallback. explainJob now surfaces the mismatch reason. No DB migration; behavior-preserving for legacy. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 10:59:57 -07:00
saravanakumardb1	d318e0fa2a	feat(fleet): engine picker offers only engines the factory advertises Add GET /fleet/factories (lists a product's factory docs with capabilities) — also fixes the fleet map's empty factory cards (listFactories had no route and silently returned []). The New-Job form now loads the selected factory's engine:* capabilities and constrains the engine dropdown to those (e.g. hides codex when the host doesn't have it), keeping the current pick valid; falls back to all engines when capabilities are unknown. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 02:38:35 -07:00
saravanakumardb1	6c31577cf2	feat(fleet): per-job engine picker (devin/claude/codex/copilot, default devin) Add an optional concrete `engine` to a job (overrides engineClass; resolved by the runner's resolve_engine where an explicit engine wins). All additive + optional, so existing engineless jobs keep falling back to the factory default. - types: FLEET_ENGINES enum; engine on SubmitJob/FleetJobDoc/UpdateDraft. - coordinator: store engine on create/supersede/updateDraft; run.engine at claim prefers job.engine, then engineClass, then 'unknown'. - tracker-web: Engine dropdown on the New-Job form (default devin) + editable on draft/queued jobs; shown in the detail config grid. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 02:30:22 -07:00
saravanakumardb1	928edad0af	feat(fleet): surface engine + agent session, editable config, timeline filter Backend: insights now carry engine + sessionId/sessionUrl; releaseLease promotes the reported engine onto the run (was created with the abstract engineClass, usually 'unknown'). tracker-web job detail: - Runs: show the concrete engine (insights.engine, falls back off 'unknown') and the agent session (Devin session id with a `devin --resume <id>` hint, or a link when a sessionUrl is present). - PromptCard: edit repo/baseBranch/verify/autoMerge (not just the prompt) while draft/queued/blocked. - Timeline: filter by event type (default collapses heartbeat runs). - Show a "no PR — needs verify / not PR mode" hint when parked in review. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 02:19:38 -07:00
saravanakumardb1	5262583e8b	feat(tracker-web): collapse heartbeat noise + surface PR on job detail The job detail timeline was buried under hundreds of lease_renewed rows on long-running jobs. Collapse consecutive high-frequency events (lease_renewed) into one "type xN - over Nm" summary row; everything else renders verbatim. Add a prominent Pull Request banner (link + state) sourced from whichever run opened the PR, instead of only the per-attempt Runs column. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 02:06:33 -07:00
saravanakumardb1	5ad521ad4c	feat(tracker-web): macOS LaunchAgent keep-alive for the fleet web tracker Adds dashboards/tracker-web/launchd/ (boot script + install.sh + README) so tracker-web (:3003) auto-starts on login and restarts on crash/reboot, instead of dying silently between sessions. Mirrors agent-queue/launchd: boot script repairs PATH, loads JWT_SECRET from platform-service/.env (+ ~/.tracker-web.env overrides), points at the local platform-service, and execs `pnpm dev`. plist uses unconditional KeepAlive (restart on any exit, incl. a clean SIGTERM) + a 10s throttle; install.sh frees :3003 first to avoid a clash with deploy-gigafactory. Verified: killing the process respawns it and :3003 returns. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 01:46:53 -07:00
saravanakumardb1	e4c84acf29	fix(fleet): bump the M0 queue gate when a draft/queued job is edited updateDraft changes caps/deps/priority (and body) without a stage change, so it did not bump fleet_queue_state — gated factories (AQ_FLEET_GATE=1) would not re-evaluate claimability until the safety interval. Bump the gate on edit so an edit that makes a job claimable wakes factories promptly. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 00:58:45 -07:00
saravanakumardb1	14e982d04f	fix(tracker-web): make New-Job form submissions routable to live factories The form defaulted capabilities to "build" — a token no agent-queue factory advertises (caps are os:* / engine:* / node:* / has:*), so every default UI submission was unroutable and stranded in queued (queue_starvation). Default capabilities to empty (any capable factory claims it), and replace the stale hardcoded mac-1/mac-2 factory dropdown with the 4 live factories (lysnrai / chronomind / mindlyst / nomgap, ids matching AQ_FACTORY_ID). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 00:54:02 -07:00
saravanakumardb1	b7df779e1d	feat(fleet): draft jobs + editable prompt (save-as-draft, submit, lock on pickup) Backend (platform-service): - New `draft` stage (not claimable; scheduler only takes queued/blocked). - submitJob accepts `draft: true` → parks a new/superseded job as a draft. - updateDraft(): edit prompt/config in place while draft/queued/blocked; recomputes contentHash; rejected (conflict) once picked up (assigned+). - submitDraft(): promote draft → queued (or blocked on unmet deps); idempotent. - Routes: PATCH /fleet/jobs/:id/draft, POST /fleet/jobs/:id/submit. - tracker-bridge: map draft → item status `open`. Tests + FLEET_STAGES updated. Frontend (tracker-web): - New-Job form: add "Save as draft" alongside "Submit". - Job detail: edit the prompt + Save while draft/queued/blocked, "Submit" a draft, and lock it read-only once a factory picks it up. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 00:44:10 -07:00
saravanakumardb1	cb4f7a7606	feat(tracker-web): show job prompt + PR/target config on the detail page The fleet job detail page never rendered the prompt (bodyMd) or the repo/ verify/auto-merge/capabilities/deps config. Add a Prompt card (verbatim body, scrollable) + a target/config grid, with a read-only badge once the job leaves queued/draft (a factory may already be acting on it). Expose verify/autoMerge/ deps on the FleetJob client type. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 00:29:29 -07:00
saravanakumardb1	78c4e47460	docs(gigafactory): fix stale/incorrect fleet docs - fleet module README: add fleet_queue_state container + GET /fleet/queue-state and /fleet/metrics; note the heartbeat cadence must stay under the 90s stale threshold (AQ_FLEET_LEASE_RENEW_SEC). - FLEET_CONTROL_PLANE: correct wrong endpoint paths (/fleet/claim and /fleet/factories/heartbeat were documented as /fleet/jobs/:id/claim and /fleet/factories/:id/heartbeat; removed a non-existent GET /fleet/factories); add enroll, metrics, and the M0 queue-state endpoint. - ROADMAP_COMPLETION_AUDIT: dated status banner — roadmap §0 now reconciled and Phase-4 M0 shipped, superseding the older "stale §0 / not started" findings. - README: point to FLEET_DISPATCH_REDESIGN.md + the M0 gate. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 00:03:05 -07:00
saravanakumardb1	ba7db0008d	feat(fleet): M0 RU gate — cheap per-product queue version + skip-claim Adds fleet_queue_state (monotonic version per product), bumped on job create + every stage change in the repository layer (best-effort, never fails a job write), and a GET /fleet/queue-state read endpoint. Lets a polling factory detect "work changed" with a ~1 RU point read instead of a full listJobs scan on every claim. Registers the container; tests cover the bump + endpoint. See agent-queue docs/GIGAFACTORY/FLEET_DISPATCH_REDESIGN.md §8/§12 (M0). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 23:18:27 -07:00
Saravanakumar D	5bc72cf221	chore: enforce LF line endings via .gitattributes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-31 20:23:18 -07:00
Saravanakumar Dhandapani	65c7d09584	feat(auth): include displayName claim in platform access token Adds an optional displayName claim to the platform access token so downstream product backends can source the user's display name from the JWT (single source of truth = platform auth), not from per-product DB copies. verifyToken already exposes displayName; this populates it at all token-minting sites (password login, register, refresh, SSO, OAuth, magic-link, passkeys, QR, enterprise SAML/OIDC). Additive and backward compatible. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-31 20:23:15 -07:00
saravanakumardb1	6709862c1a	feat(tracker-web): redesign fleet jobs list — stage chips, filters, hide-shipped - Clickable stage chips with live counts (color-coded) replace the plain dropdown; click to filter, click again to clear. - Search box (key / repo / id) + 'Hide shipped' checkbox (persisted) + 'N of M shown'. - Redesigned table: color-coded stage badges, priority emphasis, a Repo column (PR target), newest-first sort, bordered/zebra layout. - All filtering is client-side over a 100-job window (instant + accurate counts); list/links/New-Job form unchanged. FleetJob gains repo/baseBranch. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 16:58:12 -07:00
saravanakumardb1	2bd97791c9	feat(fleet): resilient PR merge on ship (inline attempt + background retry) The corporate proxy intermittently 407s GitHub's API, so a single gh pr merge can fail transiently. Try once inline (fast path), then retry in the background with backoff (3s/8s/20s/45s) without blocking the ship; mark prState=merged when one lands. Best-effort throughout.	2026-05-31 16:13:52 -07:00
saravanakumardb1	740335a149	feat(fleet): Ship can also merge the linked PR (gh pr merge) On ship (Ship button / operator action / autoship PATCH), when the run has an open PR and FLEET_SHIP_MERGES_PR=1, the coordinator squash-merges it via gh (best-effort, where gh is authed) and marks the run prState=merged. UI button reads 'Ship & merge PR' when an open PR exists; Ship refreshes runs.	2026-05-31 16:08:28 -07:00
saravanakumardb1	37d049eb69	fix(tracker-web): telemetry ingest auth (X-Install-Token) + show cost as approx - telemetry proxy attaches an X-Install-Token (derived from the payload, with a fallback) so the backend ingest auth gate stops returning 401 on browser beacons. - job-detail Cost shows ~$x.xx approx when the figure is estimated (token-based).	2026-05-31 15:58:26 -07:00
saravanakumardb1	696ee4189e	feat(fleet): track + show PR status (open/merged) on the run Runs now carry prState (open when the PR is opened, merged when auto-merge succeeds), reported on lease release. Job-detail Runs table shows a status badge next to the PR link.	2026-05-31 13:56:33 -07:00
saravanakumardb1	e9c1714c13	feat(tracker-web): factory dropdown routes job to the selected factory's product New Job form: pick a Factory (2 hardcoded for this machine) -> the job is submitted to that factory's product (submitJob productId override) and the dashboard view switches to it so the job is visible. Confirmation shows factory + product. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 06:28:28 -07:00
saravanakumardb1	c239abeec9	feat(fleet): per-repo verify + auto-merge options for PR jobs Add job-level verify (command run in the PR checkout before opening the PR) and autoMerge (squash-merge the PR once opened). Surfaced in the New Job form as a Verify-command field + Auto-merge checkbox (PR mode only); confirmation now shows PR-mode/repo. More repos added to the dropdown. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 06:17:19 -07:00
saravanakumardb1	2adddce754	feat(tracker-web): hardcoded repo dropdown for PR-mode jobs (base=main) MVP: the New Job form picks a PR target from a fixed dropdown of local repos; base branch is fixed to main. Empty selection = no PR (plain job). Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 05:36:50 -07:00
saravanakumardb1	d350159025	Merge feat/fleet-pr-deliverable: PR deliverables for fleet jobs	2026-05-31 05:27:16 -07:00
saravanakumardb1	883cf329e5	feat(fleet): PR deliverables — jobs target a repo, factory opens a PR, link recorded Make "shipped" produce a real artifact. A job can now carry an optional repo (owner/name or clone URL) + baseBranch; the factory's PR mode runs the agent in an isolated checkout, opens a PR, and records the link. Backend: - SubmitJobSchema + FleetJobDoc: optional repo/baseBranch (recorded on submit). - FleetRunDoc: optional prUrl/branch. - ReleaseLease report carries prUrl/branch -> stored on the run. - +2 coordinator tests. UI (tracker-web): - New Job form gains optional Repo + Base branch fields (and fixes the priority options to the valid critical/high/medium/low; "normal" was rejected by the API). - Job detail Runs table shows a PR ↗ link from run.prUrl. - fleet-client: submitJob repo/baseBranch; FleetRun prUrl/branch; OperatorAction +ship. Docs: FLEET_CONTROL_PLANE.md "PR deliverable (PR mode)" section. Verified: tsc clean; fleet suite 182; tracker-web 230. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 05:27:11 -07:00
saravanakumardb1	8b9ca4fee2	Merge feat/fleet-ui-submit-job: submit fleet jobs from the dashboard	2026-05-31 04:53:28 -07:00
saravanakumardb1	176b778a1f	feat(tracker-web): submit fleet jobs from the dashboard Add a collapsible 'New Job' form on the fleet jobs page (task body, priority, capabilities) wired to a new fleet-client submitJob() -> POST /fleet/jobs, with inline success/error and auto-refresh. Also add 'ship' to the OperatorAction type for parity with the coordinator. The existing job-detail 'Ship' button already drives the human-gate testing -> shipped transition. Verified: tsc clean; tracker-web suite 230/230. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 04:53:23 -07:00
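
Note on the M0 queue gate (commits ba7db0008d and e4c84acf29): the messages describe a monotonic per-product version in fleet_queue_state that a polling factory can point-read to detect "work changed" instead of running a full listJobs scan on every claim, with a safety interval as the fallback. The factory-side decision might look roughly like the sketch below. This is an illustration only, not the agent-queue implementation: `QueueGate`, `shouldClaim`, `QueueState`, and `safetyIntervalMs` are hypothetical names, and the actual response shape of GET /fleet/queue-state is not shown in this log.

```typescript
// Hypothetical shape of a queue-state read; the real payload is an assumption.
interface QueueState {
  productId: string;
  version: number; // monotonic, bumped on job create, stage change, or draft edit
}

// Decides whether a polling factory should attempt a (costly) claim.
class QueueGate {
  private lastSeenVersion: number | null = null;
  private lastClaimAtMs = 0;
  private readonly safetyIntervalMs: number;

  constructor(safetyIntervalMs: number) {
    this.safetyIntervalMs = safetyIntervalMs;
  }

  // Returns true when the version moved since the last claim attempt,
  // or when the safety interval has elapsed (guards against a missed bump,
  // since the bump is described as best-effort).
  shouldClaim(state: QueueState, nowMs: number): boolean {
    const changed =
      this.lastSeenVersion === null || state.version !== this.lastSeenVersion;
    const overdue = nowMs - this.lastClaimAtMs >= this.safetyIntervalMs;
    if (changed || overdue) {
      this.lastSeenVersion = state.version;
      this.lastClaimAtMs = nowMs;
      return true;
    }
    return false; // skip-claim: nothing changed, cheap read only
  }
}
```

Under these assumptions, a factory would call `shouldClaim` after each cheap queue-state read and only hit the claim endpoint when it returns true. The safety-interval fallback matters because the version bump never fails a job write, so a bump can be lost.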

1672 Commits