learning_ai_common_plat/services/monitoring
saravanakumardb1 93d1caf4a2 feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4)
Exports fleet observability to Prometheus/Grafana (previously JSON-only).

- GET /api/fleet/metrics/prom: global, product-labelled Prometheus exposition
  (queue depth, blocked/active, per-stage histogram, factory health/seats/
  utilization, active alerts, budget spent/ceiling/projected) plus process-wide
  reaper/GC counters and engine circuit-breaker state. The pure renderer
  (renderFleetMetricsProm) is unit-tested. Route auth accepts either a
  FLEET_METRICS_TOKEN bearer token (scrape path) or an admin JWT, so the
  endpoint is never world-readable by default.
- Infra: add a prometheus container to docker-compose + a platform-service-fleet
  scrape job; pin the Prometheus Grafana datasource uid; add a provisioned
  "Fleet Overview" dashboard (breakers, dead-letter, stale factories, alerts,
  queue depth, utilization, budget burn, reaper rate) with a product template var.
- Document FLEET_METRICS_TOKEN + the fleet feature flags in .env.example.
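
As an illustration of the "pure renderer" idea in the first bullet, here is a minimal sketch of product-labelled Prometheus text exposition. It is not the actual renderFleetMetricsProm: the snapshot shape, the function names, and the metric names (fleet_queue_depth, fleet_active_alerts) are assumptions made up for this example, and only two of the listed gauges are shown.

```typescript
// Hypothetical input shape; the real renderer's input type is not shown in this commit.
interface ProductSnapshot {
  product: string;
  queueDepth: number;
  activeAlerts: number;
}

// The Prometheus text format requires backslash, double quote, and newline
// to be escaped inside label values.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Emit HELP/TYPE headers followed by one sample per product.
function renderGauge(
  name: string,
  help: string,
  snapshots: ProductSnapshot[],
  pick: (s: ProductSnapshot) => number,
): string[] {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} gauge`];
  for (const s of snapshots) {
    lines.push(`${name}{product="${escapeLabelValue(s.product)}"} ${pick(s)}`);
  }
  return lines;
}

// Pure function: snapshot in, exposition text out. No I/O, so it is easy to unit-test.
function renderFleetMetricsSketch(snapshots: ProductSnapshot[]): string {
  const lines = [
    ...renderGauge("fleet_queue_depth", "Items waiting in the queue.", snapshots, (s) => s.queueDepth),
    ...renderGauge("fleet_active_alerts", "Currently firing alerts.", snapshots, (s) => s.activeAlerts),
  ];
  // The exposition format expects a trailing newline after the last sample.
  return lines.join("\n") + "\n";
}
```

Keeping the renderer free of I/O means a route handler only has to collect a snapshot and return the string with a text/plain content type.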

No default behavior change: the endpoint is additive and the new container is
opt-in via the compose stack.
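
The "never world-readable by default" auth rule could be sketched roughly as follows. This is an assumption-laden illustration, not the route's real code: authorizeMetricsRequest and the verifyAdminJwt callback are hypothetical names, and the JWT check is left to the caller.

```typescript
import { timingSafeEqual } from "node:crypto";

// Extract the credential from an "Authorization: Bearer <value>" header.
function bearerValue(header: string | undefined): string | null {
  const prefix = "Bearer ";
  if (!header || !header.startsWith(prefix)) return null;
  const value = header.slice(prefix.length).trim();
  return value.length > 0 ? value : null;
}

// Hypothetical guard: allow the scrape token or an admin JWT, deny everything else.
function authorizeMetricsRequest(
  authHeader: string | undefined,
  scrapeToken: string | undefined, // e.g. process.env.FLEET_METRICS_TOKEN
  verifyAdminJwt: (token: string) => boolean, // assumed to be supplied by the app
): boolean {
  const presented = bearerValue(authHeader);
  if (presented === null) return false; // no credentials: deny

  if (scrapeToken) {
    const given = Buffer.from(presented);
    const want = Buffer.from(scrapeToken);
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    if (given.length === want.length && timingSafeEqual(given, want)) return true;
  }
  // Fall back to the admin JWT path; an unset scrape token never grants access.
  return verifyAdminJwt(presented);
}
```

With this shape, leaving FLEET_METRICS_TOKEN unset simply disables the scrape path, and only admin JWTs are accepted.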

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-06-01 22:24:03 -07:00
grafana feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4) 2026-06-01 22:24:03 -07:00
loki fix(common): configure ESLint 9 and fix lint issues 2026-02-12 16:37:30 -08:00
prometheus feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4) 2026-06-01 22:24:03 -07:00
health-check.local.sh fix(monitoring): update health-check endpoints for consolidated services 2026-02-17 20:53:37 -08:00
health-check.ts chore(monitoring): document health-check output 2026-05-04 16:34:27 -07:00
package.json chore(deps): bump @types/node 22 -> 25 (dev types) 2026-05-31 04:02:56 -07:00
tsconfig.json feat(services): add monitoring (Loki + Grafana config, health-check) 2026-02-12 11:39:24 -08:00