# ENTRY Lock Contention Spike Runbook
## Incident description
A profile repeatedly fails to acquire the row-based entry lock. ENTRY signals for that profile are blocked, and the contention indicates that horizontally scaled workers are competing for the same lock.
## Symptoms
- `lockContentionCount` increments in `/internal/health` and `/metrics`.
- Logs show `fn_try_acquire_entry_lock_row` returning `false` with an owner token that differs from the caller's.
- The trading loop logs `lock acquisition failed` warnings and may skip signals.
## Metrics to check
- `/internal/health` → `lockContentionCount`, `tradingLoopHealthy`, `reconciliationLoopHealthy`.
- `/metrics` → `entry_lock_contention_total`, `lock_acquisition_latency_seconds`, `entry_lock_holder_info` (if available).
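
As a rough illustration of reading the contention counter, the sketch below sums every sample of `entry_lock_contention_total` from a `/metrics` scrape. It assumes the endpoint serves the Prometheus text exposition format; the `profile_id` label and the sample values are hypothetical and not taken from the service.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` across label sets in a /metrics scrape."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{labels} value [timestamp]
            base = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[-1]
        else:
            # Unlabelled sample: name value [timestamp]
            base, _, rest = line.partition(" ")
        fields = rest.split()
        if base == name and fields:
            total += float(fields[0])
    return total


# Hypothetical scrape; the real label names may differ.
SAMPLE = """\
# TYPE entry_lock_contention_total counter
entry_lock_contention_total{profile_id="p1"} 4
entry_lock_contention_total{profile_id="p2"} 3
lock_acquisition_latency_seconds_sum 1.5
"""

print(read_counter(SAMPLE, "entry_lock_contention_total"))
```

Comparing two scrapes taken one TTL apart shows whether contention is still rising or has levelled off.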
## Immediate mitigation
1. Identify the `profile_id` and symbol from the logs, then confirm whether another worker legitimately holds the lock.
2. Check whether the current lock owner is still alive; use Supabase to inspect the TTL on the `entry_locks` row.
3. If the owner appears stuck, wait for the TTL to expire (default 30s) before retrying.
4. Do not force a lock release unless the owner is confirmed dead; manually deleting the row risks concurrent exchange submission.
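
The wait-for-expiry step can be sketched as a small check on the lock row's expiry time. This is a minimal sketch that assumes the `entry_locks` row exposes an expiry timestamp; the `expires_at` name and the helper functions are hypothetical, since the actual schema is not described here.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(expires_at: datetime, now: datetime) -> float:
    """Seconds left before the lock expires; zero or negative means it already has."""
    return (expires_at - now).total_seconds()


def safe_to_retry(expires_at: datetime, now: datetime) -> bool:
    """True once the TTL has elapsed, so a retry no longer races the owner's lease."""
    return seconds_until_expiry(expires_at, now) <= 0


# Hypothetical row: lock acquired at 12:00:00 UTC with the default 30s TTL.
acquired = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
expires = acquired + timedelta(seconds=30)

print(safe_to_retry(expires, acquired + timedelta(seconds=10)))
print(safe_to_retry(expires, acquired + timedelta(seconds=31)))
```

The check only says when a retry is reasonable. It does not establish that the owner is dead, so it is not grounds for deleting the row.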
## Expected self-recovery
- The TTL expires, the stale lock row is refreshed or removed, and the next signal acquires the lock.
- Metrics return to baseline if contention was transient.
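
One way to make "return to baseline" concrete is to treat the counter as recovered once it stops increasing for a few consecutive readings. The sketch below assumes that definition; the `quiet_samples` threshold and the sample values are hypothetical.

```python
from typing import Sequence


def has_recovered(samples: Sequence[float], quiet_samples: int = 3) -> bool:
    """True if the last `quiet_samples` intervals show no increase in the counter."""
    if len(samples) < quiet_samples + 1:
        return False  # not enough readings to judge
    tail = samples[-(quiet_samples + 1):]
    return all(later == earlier for earlier, later in zip(tail, tail[1:]))


# Readings of entry_lock_contention_total, oldest first (hypothetical values).
print(has_recovered([4, 7, 7, 7, 7]))
print(has_recovered([4, 7, 7, 8]))
```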
## When to escalate
- Contention persists beyond three TTL cycles (90s).
- Multiple profiles report contention simultaneously.
- Lock rows show expired timestamps but fail to refresh.
Escalate to Platform Ops and refer to `docs/runbooks/lock-timeout.md` (if it exists) for lock escalation.
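
The escalation criteria above can be expressed as a simple check. This is a sketch only: the function and parameter names are hypothetical, and the inputs are assumed to come from whatever observations the on-call engineer has gathered.

```python
from typing import List

TTL_SECONDS = 30     # default TTL stated in this runbook
MAX_TTL_CYCLES = 3   # contention beyond three cycles (90s) warrants escalation


def escalation_reasons(
    contention_seconds: float,
    affected_profiles: int,
    expired_unrefreshed_rows: int,
    ttl_seconds: int = TTL_SECONDS,
) -> List[str]:
    """Return the escalation criteria that are met; an empty list means keep waiting."""
    reasons = []
    if contention_seconds > MAX_TTL_CYCLES * ttl_seconds:
        reasons.append("contention persisted beyond three TTL cycles")
    if affected_profiles > 1:
        reasons.append("multiple profiles report contention")
    if expired_unrefreshed_rows > 0:
        reasons.append("expired lock rows are not refreshing")
    return reasons


print(escalation_reasons(45, 1, 0))
print(escalation_reasons(120, 2, 1))
```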
## What NOT to do
- Do not delete lock rows manually while other workers are active.
- Do not restart all workers; indiscriminate restarts magnify contention.
- Do not trigger new ENTRY signals for the affected profile until the lock clears.