learning_ai_common_plat/services/monitoring
saravanakumardb1 c63736459b feat(fleet): anti-flap hysteresis + autoscale Prometheus series & dashboard (ops #5)
Make the capacity autoscaling signal safe to act on automatically and observable
in Grafana.

Anti-flap hysteresis:
- New pure function applyHysteresis: suppresses a direction reversal (scale_in
  after scale_out, or vice versa) within a cooldown window so a consumer cannot
  thrash capacity. A critical scale-out (queued work, zero usable capacity) always
  bypasses the cooldown. The cooldown anchor advances only on an emitted action,
  so a suppressed signal keeps counting down from the real last action.
- Process-wide per-product cooldown state (mirrors reaper/breaker in-mem state)
  with a test seam; cooldown tunable via FLEET_AUTOSCALE_COOLDOWN_SEC (default 300).
- GET /fleet/autoscale and GET /fleet/autoscale/all now serve the debounced
  (stateful) recommendation.
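
The hysteresis rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: only the name applyHysteresis comes from the message, while the types, field names, the "hold" action, and the choice to re-anchor on same-direction repeats are assumptions made for the example.

```typescript
// Hypothetical shapes; only the name applyHysteresis appears in the commit message.
type Action = "scale_out" | "scale_in" | "hold";

interface Recommendation {
  action: Action;
  critical: boolean; // assumed flag: queued work with zero usable capacity
}

interface CooldownState {
  lastAction: Action | null; // last emitted scale_out/scale_in, if any
  lastActionAtMs: number;    // cooldown anchor
}

interface HysteresisResult {
  recommendation: Recommendation;
  state: CooldownState;
  suppressed: boolean;
}

// Pure: returns the next state instead of mutating the one passed in.
function applyHysteresis(
  raw: Recommendation,
  state: CooldownState,
  nowMs: number,
  cooldownMs: number,
): HysteresisResult {
  // A hold is never a reversal and never moves the anchor.
  if (raw.action === "hold") {
    return { recommendation: raw, state, suppressed: false };
  }

  const isReversal = state.lastAction !== null && state.lastAction !== raw.action;
  const inCooldown = nowMs - state.lastActionAtMs < cooldownMs;
  const bypass = raw.action === "scale_out" && raw.critical;

  if (isReversal && inCooldown && !bypass) {
    // Suppressed: report hold and keep the anchor at the real last action.
    const held: Recommendation = { ...raw, action: "hold" };
    return { recommendation: held, state, suppressed: true };
  }

  // Emitted: the anchor advances to now (assumed to apply to repeats as well).
  return {
    recommendation: raw,
    state: { lastAction: raw.action, lastActionAtMs: nowMs },
    suppressed: false,
  };
}
```

With a 300 s cooldown, a scale_in arriving 60 s after an emitted scale_out would come back as hold with the anchor unchanged, while a critical scale_out arriving just after a scale_in would still be emitted.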

Observability:
- Prometheus exposition emits the RAW recommendation per product
  (fleet_autoscale_recommended_seats/delta/pressure + one-hot fleet_autoscale_action
  {action}). RAW (not stateful) so a scrape never mutates the cooldown anchors.
- Grafana "Fleet Overview" gains two panels: products recommending scale-out
  (stat) + recommended seat delta vs backlog (timeseries).
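
As an illustration of the one-hot series described above, here is a small sketch that renders Prometheus text exposition from raw recommendations. The metric names fleet_autoscale_recommended_seats and fleet_autoscale_action and the action label come from the message; the product label name, the "hold" action value, the field names, and the function renderAutoscaleMetrics are assumptions for this example and may differ from the actual exporter.

```typescript
// Hypothetical input shape; the real exporter's types are not shown in this commit message.
type Action = "scale_out" | "scale_in" | "hold";
const ACTIONS: readonly Action[] = ["scale_out", "scale_in", "hold"];

interface RawRecommendation {
  product: string;
  action: Action;
  recommendedSeats: number;
}

// Escape a label value per the Prometheus text format (backslash, quote, newline).
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders from RAW recommendations only, so a scrape reads no cooldown state
// and cannot mutate it.
function renderAutoscaleMetrics(recs: RawRecommendation[]): string {
  const lines: string[] = ["# TYPE fleet_autoscale_recommended_seats gauge"];
  for (const r of recs) {
    lines.push(
      `fleet_autoscale_recommended_seats{product="${escapeLabel(r.product)}"} ${r.recommendedSeats}`,
    );
  }

  // One-hot: one series per (product, action); exactly one is 1 per product.
  lines.push("# TYPE fleet_autoscale_action gauge");
  for (const r of recs) {
    for (const a of ACTIONS) {
      const value = r.action === a ? 1 : 0;
      lines.push(
        `fleet_autoscale_action{product="${escapeLabel(r.product)}",action="${a}"} ${value}`,
      );
    }
  }
  return lines.join("\n") + "\n";
}
```

Given series shaped like this, a query such as sum(fleet_autoscale_action{action="scale_out"}) would count the products currently recommending scale-out; whether the dashboard's stat panel uses that exact expression is not stated here.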

Docs: FLEET_AUTOSCALE_COOLDOWN_SEC in .env.example.

Tests: +10 (hysteresis/stateful/cooldown + prom autoscale series); full suite 1856
green; lint + tsc clean. Verified live: a throwaway Prometheus scraped the running
service and the dashboard PromQL returned real scale-out/scale-in recommendations
across products.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 23:02:08 -07:00
grafana feat(fleet): anti-flap hysteresis + autoscale Prometheus series & dashboard (ops #5) 2026-06-01 23:02:08 -07:00
loki fix(common): configure ESLint 9 and fix lint issues 2026-02-12 16:37:30 -08:00
prometheus feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4) 2026-06-01 22:24:03 -07:00
health-check.local.sh fix(monitoring): update health-check endpoints for consolidated services 2026-02-17 20:53:37 -08:00
health-check.ts chore(monitoring): document health-check output 2026-05-04 16:34:27 -07:00
package.json chore(deps): bump @types/node 22 -> 25 (dev types) 2026-05-31 04:02:56 -07:00
tsconfig.json feat(services): add monitoring (Loki + Grafana config, health-check) 2026-02-12 11:39:24 -08:00