Admin Observability & Health Panel

This document describes the runtime observability system built for administrators of the trading bot.

Overview

The Admin Error & Health panel provides real-time visibility into the bot's internal state and actionable issues. It is designed for operators to quickly identify why trading might be paused, failing, or behaving unexpectedly without having to dig through raw logs.

Architecture

Backend: `ObservabilityService`

In-Memory Buffer: Stores the last 50 operational events in a ring buffer.
Structured Events: Every event follows the OperationalEvent interface.
Filtering: Events are filtered by user role. Only administrators receive operational events via the API and Socket.IO.
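
The buffer behavior described above can be sketched as follows. This is a minimal illustration, not the actual `ObservabilityService` code: the fields of `OperationalEvent` and the `EventRingBuffer` class name are assumptions made for the example, and only the capacity of 50 comes from this document.

```typescript
// Hypothetical event shape; the real OperationalEvent interface may differ.
type Severity = "INFO" | "WARN" | "ERROR";

interface OperationalEvent {
  type: string;
  severity: Severity;
  message: string;
  timestamp: number; // epoch milliseconds
}

// Fixed-capacity ring buffer: once full, each push overwrites the oldest entry.
class EventRingBuffer {
  private readonly capacity: number;
  private readonly slots: (OperationalEvent | undefined)[];
  private next = 0; // index where the next event will be written
  private count = 0; // number of events currently stored

  constructor(capacity: number = 50) {
    this.capacity = capacity;
    this.slots = new Array<OperationalEvent | undefined>(capacity);
  }

  push(event: OperationalEvent): void {
    this.slots[this.next] = event;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Returns the stored events ordered from oldest to newest.
  toArray(): OperationalEvent[] {
    const start = (this.next - this.count + this.capacity) % this.capacity;
    const out: OperationalEvent[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.slots[(start + i) % this.capacity] as OperationalEvent);
    }
    return out;
  }
}
```

A ring buffer keeps memory use constant regardless of how many events the bot emits, at the cost of silently dropping anything older than the last 50 entries.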

Frontend: `AdminTab` (System Health)

Status Badge: A global indicator of system health (Healthy, Degraded, Critical).
Event List: A chronologically ordered list of recent operational events with severity levels (INFO, WARN, ERROR).
Telemetry: Real-time display of execution loop durations, exchange latency, and lock contention counts.
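
One plausible way to derive the status badge from the event list is shown below. The mapping is an assumption for illustration (any ERROR yields Critical, any WARN yields Degraded, otherwise Healthy); this document does not specify the actual rule, and `deriveHealthStatus` is a hypothetical name.

```typescript
type Severity = "INFO" | "WARN" | "ERROR";
type HealthStatus = "Healthy" | "Degraded" | "Critical";

// Assumed rule: the worst severity present in the list decides the badge.
function deriveHealthStatus(severities: Severity[]): HealthStatus {
  if (severities.some((s) => s === "ERROR")) return "Critical";
  if (severities.some((s) => s === "WARN")) return "Degraded";
  return "Healthy";
}
```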

Operational Event Types

Type	Severity	Description
`INSUFFICIENT_BUYING_POWER`	WARN	Attempted to open a position but broker reported insufficient capital.
`ORDER_FAILURE`	ERROR	Exchange rejected an order (e.g., price out of bounds, invalid qty).
`EXCHANGE_STATE_MISMATCH`	WARN	Discrepancy detected between internal database and exchange state.
`RECONCILIATION_DEGRADED`	ERROR	Reconciliation loop is failing repeatedly.
`SYSTEM_ERROR`	WARN/ERROR	General system issues, including exchange API timeouts or manual pauses.
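
The table above could be encoded as a typed lookup so that an event cannot be emitted with a severity its type does not allow. The names `OperationalEventType`, `ALLOWED_SEVERITIES`, and `isAllowedSeverity` are hypothetical; only the type names and severities are taken from the table.

```typescript
type Severity = "INFO" | "WARN" | "ERROR";

type OperationalEventType =
  | "INSUFFICIENT_BUYING_POWER"
  | "ORDER_FAILURE"
  | "EXCHANGE_STATE_MISMATCH"
  | "RECONCILIATION_DEGRADED"
  | "SYSTEM_ERROR";

// Severities each event type may carry, mirroring the table.
const ALLOWED_SEVERITIES: Record<OperationalEventType, readonly Severity[]> = {
  INSUFFICIENT_BUYING_POWER: ["WARN"],
  ORDER_FAILURE: ["ERROR"],
  EXCHANGE_STATE_MISMATCH: ["WARN"],
  RECONCILIATION_DEGRADED: ["ERROR"],
  SYSTEM_ERROR: ["WARN", "ERROR"],
};

// True when the given severity is listed for the event type.
function isAllowedSeverity(type: OperationalEventType, severity: Severity): boolean {
  return ALLOWED_SEVERITIES[type].some((s) => s === severity);
}
```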

Security & Performance

Sensitive Data: Events contain structured messages instead of raw stack traces or internal environment variables.
Cap: Both the backend buffer and the frontend display are capped at 50 events, which bounds memory use and keeps the panel responsive.
RBAC: Operational events are only pushed to authenticated sockets belonging to users with the admin role.
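
The two rules above (no stack traces in events, admin-only delivery) might look roughly like this. It is a sketch under stated assumptions: `ConnectedClient`, `SafeEvent`, `adminRecipients`, and `toSafeEvent` are hypothetical names, the `"admin"` role string is assumed, and the actual Socket.IO wiring is omitted. Note that this only drops the stack trace; it does not scrub secrets that may appear in the error message itself.

```typescript
// Hypothetical view of a connected socket's user.
interface ConnectedClient {
  id: string;
  role: string;
  authenticated: boolean;
}

// Only authenticated clients with the admin role should receive events.
function adminRecipients(clients: ConnectedClient[]): ConnectedClient[] {
  return clients.filter((c) => c.authenticated && c.role === "admin");
}

interface SafeEvent {
  type: string;
  severity: "INFO" | "WARN" | "ERROR";
  message: string;
  timestamp: number; // epoch milliseconds
}

// Builds an event from an error, keeping only the first line of its message.
// The stack property is never copied into the result.
function toSafeEvent(type: string, err: unknown, now: number = Date.now()): SafeEvent {
  const raw = err instanceof Error ? err.message : String(err);
  const firstLine = raw.split("\n")[0] ?? "";
  return { type, severity: "ERROR", message: firstLine.slice(0, 200), timestamp: now };
}
```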

Usage

Navigate to the Admin tab.
Select System Health.
Review the Operational Events list for recent issues.
If a global red banner appears at the top of the dashboard, it indicates that a critical operational event occurred in the last 10 minutes.
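
The banner condition can be expressed as a simple time-window check. This sketch assumes that "critical" corresponds to severity ERROR and that timestamps are epoch milliseconds; neither is confirmed by this document, and `shouldShowCriticalBanner` is a hypothetical name.

```typescript
const BANNER_WINDOW_MS = 10 * 60 * 1000; // 10 minutes

interface TimedEvent {
  severity: "INFO" | "WARN" | "ERROR";
  timestamp: number; // epoch milliseconds
}

// True when at least one ERROR event falls within the last 10 minutes.
function shouldShowCriticalBanner(events: TimedEvent[], now: number = Date.now()): boolean {
  return events.some(
    (e) => e.severity === "ERROR" && now - e.timestamp <= BANNER_WINDOW_MS
  );
}
```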
