learning_ai_common_plat

Author	SHA1	Message	Date
saravanakumardb1	c63736459b	feat(fleet): anti-flap hysteresis + autoscale Prometheus series & dashboard (ops #5) Make the capacity autoscaling signal safe to act on automatically and observable in Grafana. Anti-flap hysteresis: - New pure applyHysteresis: suppresses a direction reversal (scale_in after scale_out, or vice versa) within a cooldown window so a consumer cannot thrash capacity. A critical scale-out (queued work, zero usable capacity) always bypasses the cooldown. Cooldown anchor only advances on an emitted action, so a suppressed signal keeps counting down from the real last action. - Process-wide per-product cooldown state (mirrors reaper/breaker in-mem state) with a test seam; cooldown tunable via FLEET_AUTOSCALE_COOLDOWN_SEC (default 300). - GET /fleet/autoscale[/all] now serve the debounced (stateful) recommendation. Observability: - Prometheus exposition emits the RAW recommendation per product (fleet_autoscale_recommended_seats/delta/pressure + one-hot fleet_autoscale_action {action}). RAW (not stateful) so a scrape never mutates the cooldown anchors. - Grafana "Fleet Overview" gains two panels: products recommending scale-out (stat) + recommended seat delta vs backlog (timeseries). Docs: FLEET_AUTOSCALE_COOLDOWN_SEC in .env.example. Tests: +10 (hysteresis/stateful/cooldown + prom autoscale series); full suite 1856 green; lint + tsc clean. Verified live: a throwaway Prometheus scraped the running service and the dashboard PromQL returned real scale-out/scale-in recommendations across products. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 23:02:08 -07:00
saravanakumardb1	93d1caf4a2	feat(fleet): Prometheus metrics export + Grafana dashboard (ops #4) Exports fleet observability to Prometheus/Grafana (previously JSON-only). - GET /api/fleet/metrics/prom: global, product-labelled Prometheus exposition (queue depth, blocked/active, per-stage histogram, factory health/seats/ utilization, active alerts, budget spent/ceiling/projected) plus process-wide reaper/GC counters and engine circuit-breaker state. Pure renderer (renderFleetMetricsProm) is unit-tested; route auth accepts a FLEET_METRICS_TOKEN bearer (scrape path) or an admin JWT — never world-readable by default. - Infra: add a prometheus container to docker-compose + a platform-service-fleet scrape job; pin the Prometheus Grafana datasource uid; add a provisioned "Fleet Overview" dashboard (breakers, dead-letter, stale factories, alerts, queue depth, utilization, budget burn, reaper rate) with a product template var. - Document FLEET_METRICS_TOKEN + the fleet feature flags in .env.example. No default behavior change: the endpoint is additive and the new container is opt-in via the compose stack. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-06-01 22:24:03 -07:00
saravanakumardb1	8fe26027e7	chore(deps): bump @types/node 22 -> 25 (dev types) Verified: full workspace build (tsc) green across all packages/services/dashboards; fleet+items tests pass. Compile-time only. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-31 04:02:56 -07:00
Saravana Kumar	fe8338c2c5	feat(monitoring): add VM Overview Grafana dashboard 12-panel dashboard auto-provisioned via /var/lib/grafana/dashboards: - 4 stat tiles (disk %, RAM avail, swap used, CPU steal) with threshold colouring matching vm-health-check.sh - 4 time-series (disk %, RAM trend, steal, sda write GB/hr) — 7d default - 2 bargauge top-10 by RAM and CPU (cAdvisor container_memory_working_set, container_cpu_usage) - Load average (1/5/15) + network throughput (RX/TX, host interfaces) uid: vm-overview. Picked up on next Grafana boot. Closes Phase 5: "Add Grafana" item from VM observability roadmap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 21:26:35 +00:00
Saravana Achu Mac	e9a70edb8b	chore(monitoring): document health-check output What changed: - Documented the monitoring health-check script as a CLI/standalone output surface. - Kept console output unchanged because it is the command's user interface. Warning impact: - @lysnrai/monitoring scoped warnings: 5 -> 0. - Workspace warning total: 155 -> 150. Verification: - pnpm --filter @lysnrai/monitoring exec tsc --noEmit - pnpm --filter @lysnrai/monitoring exec eslint . --ext .ts,.tsx - pnpm lint	2026-05-04 16:34:27 -07:00
root	b8661392c6	feat(observability): add phase 2 monitoring and valkey services	2026-03-31 06:57:12 +00:00
saravanakumardb1	6f7299aa7a	fix(monitoring): update health-check endpoints for consolidated services - Remove defunct growth-service (4001), billing-service (4002), tracker-service (4004) - Add backend API (8000), extraction sidecar (4006), all 3 dashboards (3001-3003) - Reorder: backend → services → dashboards → infra	2026-02-17 20:53:37 -08:00
saravanakumardb1	fb3bc750eb	fix: update .env.example comments, Grafana dashboard, and debug-service.md for consolidated services	2026-02-14 22:01:55 -08:00
saravanakumardb1	81609e9358	fix: remove stale port references from monitoring, docs, and AI.dev skills	2026-02-14 21:48:21 -08:00
Saravana Achu Mac	16bc06d84a	Add local health-check script; mark health verification	2026-02-14 18:59:01 -08:00
Saravana Achu Mac	e9b33fb518	feat(monitoring): add @bytelyst/monitoring package	2026-02-14 15:57:41 -08:00
saravanakumardb1	b8c0a73e89	feat(extraction): Phase 5 observability + error handling (5.7-5.12) - 5.7: Enhanced structured logging with userId, productId, cacheHit, tokenCount - 5.8: Metrics module (counters + histograms) + /extract/metrics endpoint - 5.9: Grafana dashboard config for extraction-service (Loki queries) - 5.10: Error mapping — sidecar errors → proper HTTP status codes (408, 429, 502, 503) - 5.11: Circuit breaker for Python sidecar (5 failures → 30s OPEN) - 5.12: Graceful degradation — circuit open returns 503, cached results still served - 46 TS tests passing	2026-02-14 14:04:59 -08:00
saravanakumardb1	90b9cf93d8	fix(common): configure ESLint 9 and fix lint issues - Added @eslint/js dependency - Updated eslint.config.js for ESLint 9 compatibility - Added required globals (crypto, localStorage, React, etc.) - Fixed unused imports and variables - Disabled sort-imports temporarily - Formatted all files with Prettier	2026-02-12 16:37:30 -08:00
saravanakumardb1	c97e697097	feat(services): add monitoring (Loki + Grafana config, health-check) - Copied as-is from learning_voice_ai_agent/services/monitoring - Grafana dashboards + provisioning for Loki datasource - health-check.ts for service health polling - Updated pnpm-workspace.yaml to include services/*	2026-02-12 11:39:24 -08:00
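
Note on c63736459b (anti-flap hysteresis): the rule in that commit message can be sketched as a pure function. This is a minimal illustration, not the repository's code. Only the name applyHysteresis, the 300-second default, and the three stated rules (suppress reversals inside the cooldown, let a critical scale-out bypass it, advance the anchor only on an emitted action) come from the message. The types, field names, and return shape are hypothetical, and the sketch assumes a repeated same-direction action also advances the anchor.

```typescript
// Hypothetical types; the commit message does not show the real signatures.
type Action = "scale_out" | "scale_in" | "hold";

interface Recommendation {
  action: Action;
  critical: boolean; // assumed flag: queued work with zero usable capacity
}

interface CooldownState {
  lastAction: Action | null; // last emitted non-hold action
  lastActionAtMs: number;    // cooldown anchor
}

interface HysteresisResult {
  action: Action;
  suppressed: boolean;
  state: CooldownState;
}

function applyHysteresis(
  rec: Recommendation,
  state: CooldownState,
  nowMs: number,
  cooldownSec: number = 300, // default stated for FLEET_AUTOSCALE_COOLDOWN_SEC
): HysteresisResult {
  // A "hold" never touches the anchor.
  if (rec.action === "hold") {
    return { action: "hold", suppressed: false, state };
  }
  const isReversal = state.lastAction !== null && state.lastAction !== rec.action;
  const withinCooldown = nowMs - state.lastActionAtMs < cooldownSec * 1000;
  const bypass = rec.action === "scale_out" && rec.critical;

  if (isReversal && withinCooldown && !bypass) {
    // Suppressed: the state is returned unchanged, so the cooldown keeps
    // counting down from the real last emitted action.
    return { action: "hold", suppressed: true, state };
  }
  // Emitted: advance the anchor (assumption: same-direction repeats advance it too).
  return {
    action: rec.action,
    suppressed: false,
    state: { lastAction: rec.action, lastActionAtMs: nowMs },
  };
}
```

Keeping the function pure (state in, state out) matches the commit's description of a raw Prometheus path that must not mutate the cooldown anchors: a scrape can simply skip this step, while the stateful endpoints store the returned state per product.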

14 Commits
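
Note on b8c0a73e89 (item 5.11, circuit breaker): the "5 failures → 30s OPEN" rule can be sketched as below. This is an illustrative sketch under stated assumptions, not the extraction-service implementation. The class and method names are hypothetical, failures are assumed to be counted consecutively (a success resets the count), and the breaker is assumed to close outright once the 30 seconds elapse, since the commit message does not describe a half-open probe.

```typescript
// Hypothetical minimal breaker; only the threshold (5) and the open
// duration (30 s) are taken from the commit message.
class CircuitBreaker {
  private failures = 0;
  private openedAtMs: number | null = null;
  private readonly threshold: number;
  private readonly openMs: number;

  constructor(threshold: number = 5, openMs: number = 30000) {
    this.threshold = threshold;
    this.openMs = openMs;
  }

  // Returns true when a call to the sidecar may proceed.
  canRequest(nowMs: number): boolean {
    if (this.openedAtMs === null) return true;
    if (nowMs - this.openedAtMs >= this.openMs) {
      // Open window elapsed: close the breaker and start counting afresh.
      this.openedAtMs = null;
      this.failures = 0;
      return true;
    }
    return false; // still OPEN; the caller would answer 503 or serve a cached result
  }

  recordSuccess(): void {
    this.failures = 0; // assumption: only consecutive failures count
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openedAtMs = nowMs;
    }
  }
}
```

Passing the clock in as nowMs keeps the class deterministic under test; a caller in a service would typically pass Date.now().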