Admin Observability & Health Panel

This document describes the runtime observability system implemented for the trading bot administrators.

Overview

The Admin Error & Health panel provides real-time visibility into the bot's internal state and actionable issues. It is designed for operators to quickly identify why trading might be paused, failing, or behaving unexpectedly without having to dig through raw logs.

Architecture

Backend: ObservabilityService

  • In-Memory Buffer: Stores the last 50 operational events in a ring buffer.
  • Structured Events: Every event follows the OperationalEvent interface.
  • Filtering: Events are filtered by user role. Only administrators receive operational events via the API and Socket.IO.
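The buffering and filtering behavior above can be sketched as follows. This is a minimal illustration, not the actual ObservabilityService: the fields of OperationalEvent, the class name EventRingBuffer, and the helper eventsForRole are assumptions, since the document names the interface but does not list its members.

```typescript
// Assumed shape: the document names OperationalEvent but does not define its fields.
type Severity = "INFO" | "WARN" | "ERROR";

interface OperationalEvent {
  type: string;
  severity: Severity;
  message: string;
  timestamp: number; // epoch milliseconds
}

// Hypothetical fixed-capacity ring buffer: once full, the oldest event is overwritten.
class EventRingBuffer {
  private readonly slots: (OperationalEvent | undefined)[];
  private next = 0; // index of the slot that the next push will write
  private count = 0; // number of occupied slots, never above capacity

  constructor(private readonly capacity: number = 50) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.slots = new Array<OperationalEvent | undefined>(capacity);
  }

  push(ev: OperationalEvent): void {
    this.slots[this.next] = ev;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Returns a snapshot ordered from oldest to newest.
  toArray(): OperationalEvent[] {
    const out: OperationalEvent[] = [];
    const start = (this.next - this.count + this.capacity) % this.capacity;
    for (let i = 0; i < this.count; i++) {
      const ev = this.slots[(start + i) % this.capacity];
      if (ev !== undefined) out.push(ev);
    }
    return out;
  }
}

// Hypothetical role filter: non-admin callers receive no operational events.
function eventsForRole(buffer: EventRingBuffer, role: string): OperationalEvent[] {
  return role === "admin" ? buffer.toArray() : [];
}
```

A fixed array with a wrapping write index keeps each push constant-time and avoids shifting elements when the buffer is full.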

Frontend: AdminTab (System Health)

  • Status Badge: A global indicator of system health (Healthy, Degraded, Critical).
  • Event List: A chronologically ordered list of recent operational events with severity levels (INFO, WARN, ERROR).
  • Telemetry: Real-time display of execution loop durations, exchange latency, and lock contention counts.

Operational Event Types

| Type | Severity | Description |
|------|----------|-------------|
| INSUFFICIENT_BUYING_POWER | WARN | Attempted to open a position but broker reported insufficient capital. |
| ORDER_FAILURE | ERROR | Exchange rejected an order (e.g., price out of bounds, invalid qty). |
| EXCHANGE_STATE_MISMATCH | WARN | Discrepancy detected between internal database and exchange state. |
| RECONCILIATION_DEGRADED | ERROR | Reconciliation loop is failing repeatedly. |
| SYSTEM_ERROR | WARN/ERROR | General system issues, including exchange API timeouts or manual pauses. |
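One way to keep these type and severity pairings consistent in code is a typed lookup. The sketch below encodes only the rows listed above; the names OperationalEventType, ALLOWED_SEVERITIES, and isValidSeverity are hypothetical and may not match the real codebase.

```typescript
type Severity = "INFO" | "WARN" | "ERROR";

// The five event types listed in this section.
type OperationalEventType =
  | "INSUFFICIENT_BUYING_POWER"
  | "ORDER_FAILURE"
  | "EXCHANGE_STATE_MISMATCH"
  | "RECONCILIATION_DEGRADED"
  | "SYSTEM_ERROR";

// Severities each type may carry, taken from the listing above.
// Record<...> makes the compiler reject a missing or misspelled type.
const ALLOWED_SEVERITIES: Record<OperationalEventType, readonly Severity[]> = {
  INSUFFICIENT_BUYING_POWER: ["WARN"],
  ORDER_FAILURE: ["ERROR"],
  EXCHANGE_STATE_MISMATCH: ["WARN"],
  RECONCILIATION_DEGRADED: ["ERROR"],
  SYSTEM_ERROR: ["WARN", "ERROR"],
};

// Hypothetical guard: true when the severity is one the type is documented to use.
function isValidSeverity(type: OperationalEventType, severity: Severity): boolean {
  return ALLOWED_SEVERITIES[type].indexOf(severity) !== -1;
}
```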

Security & Performance

  • Sensitive Data: Events contain structured messages instead of raw stack traces or internal environment variables.
  • Cap: Both the backend buffer and the frontend display are capped at 50 events, which bounds memory use and keeps the event list fast to render.
  • RBAC: Operational events are only pushed to authenticated sockets belonging to users with the admin role.
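The sensitive-data and RBAC rules above could look roughly like this. It is a sketch under stated assumptions: ClientSocket is a stand-in interface rather than the Socket.IO socket type, the role field is assumed to be set during authentication, and the channel name operational_event and the function names are hypothetical.

```typescript
type Severity = "INFO" | "WARN" | "ERROR";

// Assumed wire format; the document does not list the OperationalEvent fields.
interface OperationalEvent {
  type: string;
  severity: Severity;
  message: string;
  timestamp: number; // epoch milliseconds
}

// Hypothetical internal record that may carry data never meant for clients.
interface InternalEvent extends OperationalEvent {
  stack?: string;
}

// Minimal stand-in for an authenticated socket (not the real Socket.IO type).
interface ClientSocket {
  role: string; // assumed to be populated by the authentication layer
  emit(channel: string, payload: OperationalEvent): void;
}

const OPERATIONAL_EVENT_CHANNEL = "operational_event"; // hypothetical channel name

// Copies only the allow-listed fields, so a stack trace can never leak by accident.
function toWirePayload(ev: InternalEvent): OperationalEvent {
  return {
    type: ev.type,
    severity: ev.severity,
    message: ev.message,
    timestamp: ev.timestamp,
  };
}

// Emits the event to admin sockets only and returns how many received it.
function pushToAdmins(sockets: readonly ClientSocket[], ev: InternalEvent): number {
  const payload = toWirePayload(ev);
  let delivered = 0;
  for (let i = 0; i < sockets.length; i++) {
    const socket = sockets[i];
    if (socket === undefined || socket.role !== "admin") continue;
    socket.emit(OPERATIONAL_EVENT_CHANNEL, payload);
    delivered++;
  }
  return delivered;
}
```

Building the payload from an explicit field list, rather than spreading the internal object, means any field added later stays private until someone deliberately exposes it.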

Usage

  1. Navigate to the Admin tab.
  2. Select System Health.
  3. Review the Operational Events list for recent issues.
  4. If a global red banner appears at the top of the dashboard, a critical operational event occurred within the last 10 minutes.
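The banner rule in step 4 can be sketched as a time-window check. The document does not define which events count as critical, so this example assumes ERROR severity; the names CRITICAL_WINDOW_MS and shouldShowCriticalBanner are also hypothetical.

```typescript
type Severity = "INFO" | "WARN" | "ERROR";

// Only the two fields this check needs; the full event shape is not specified here.
interface TimedEvent {
  severity: Severity;
  timestamp: number; // epoch milliseconds
}

const CRITICAL_WINDOW_MS = 10 * 60 * 1000; // the 10-minute window from step 4

// Assumption: "critical" means ERROR severity.
// Returns true when at least one such event falls inside the window ending at `now`.
function shouldShowCriticalBanner(
  recentEvents: readonly TimedEvent[],
  now: number = Date.now()
): boolean {
  return recentEvents.some((e) => {
    const age = now - e.timestamp;
    return e.severity === "ERROR" && age >= 0 && age <= CRITICAL_WINDOW_MS;
  });
}
```

Passing `now` as a parameter keeps the check deterministic, which makes it straightforward to test without mocking the clock.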