# ENTRY Lock Contention Spike Runbook
## Incident description
A profile repeatedly fails to acquire the row-based entry lock. While the lock is unavailable, ENTRY signals for that profile are blocked; sustained failures indicate pressure on horizontal scaling.
## Symptoms
- `lockContentionCount` increments in `/internal/health` and `/metrics`.
- Logs show `fn_try_acquire_entry_lock_row` returning `false` with an owner token different from the caller's.
- The trading loop reports `lock acquisition failed` warnings and may skip signals.
## Metrics to check
- `/internal/health` → `lockContentionCount`, `tradingLoopHealthy`, `reconciliationLoopHealthy`.
- `/metrics` → `entry_lock_contention_total`, `lock_acquisition_latency_seconds`, `entry_lock_holder_info` (if available).
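The two endpoints above can be read from a script. The following is a minimal Python sketch, assuming `/internal/health` returns a JSON object with the listed fields as top-level keys and `/metrics` uses the Prometheus text exposition format. Both are assumptions, the function names are hypothetical, and the base URL in the usage comment is a placeholder.

```python
import json
import urllib.request

# Field names come from this runbook; treating them as top-level JSON keys is an assumption.
HEALTH_FIELDS = ("lockContentionCount", "tradingLoopHealthy", "reconciliationLoopHealthy")


def fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def summarize_health(body: str) -> dict:
    """Pick the lock-related fields out of a health JSON body (missing keys become None)."""
    data = json.loads(body)
    return {name: data.get(name) for name in HEALTH_FIELDS}


def sum_metric(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of one metric across label sets in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, _, rest = line.partition("{")
            _, _, rest = rest.rpartition("}")  # drop the label set
        else:
            name, _, rest = line.partition(" ")
        if name.strip() != metric_name:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])  # first field is the value; a timestamp may follow
    return total


# Usage (placeholder base URL, replace with the worker's address):
#   base = "http://localhost:8080"
#   print(summarize_health(fetch(base + "/internal/health")))
#   print(sum_metric(fetch(base + "/metrics"), "entry_lock_contention_total"))
```

Comparing two readings of `entry_lock_contention_total` taken a few seconds apart shows whether contention is still increasing or has stopped.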
## Immediate mitigation
1. Identify the `profile_id` and symbol from the logs; confirm whether another worker legitimately holds the lock.
2. Check whether the current lock owner is still alive or has crashed; use Supabase to inspect the TTL on the `entry_locks` row.
3. If the owner appears stuck, wait for the TTL to expire (default 30s) before retrying.
4. Avoid forcing a lock release unless the owner is confirmed dead; manual deletion risks concurrent submissions to the exchange.
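The steps above reduce to a decision about one lock row. The following is a minimal Python sketch of that decision, not the service's actual logic. The `LockRow` type, its field names, and the `owner_alive` input are hypothetical; the real `entry_locks` columns may differ. It never recommends a manual release, since waiting out the TTL is the safe default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass
class LockRow:
    """Hypothetical view of one entry_locks row; real column names may differ."""
    owner_token: str
    expires_at: datetime  # timezone-aware expiry time


def triage_lock(row: Optional[LockRow], owner_alive: Optional[bool], now: datetime) -> Tuple[str, float]:
    """Return (action, seconds_remaining) for a contended lock.

    owner_alive is True if the owning worker is confirmed healthy, False if it
    is confirmed dead, and None if its state is unknown.
    """
    if row is None or now >= row.expires_at:
        return ("retry", 0.0)  # nothing is holding the lock any more
    remaining = (row.expires_at - now).total_seconds()
    if owner_alive:
        return ("leave_alone", remaining)  # legitimate holder, let it finish
    # Owner is stuck, dead, or unknown: wait for the TTL instead of deleting the row.
    return ("wait_for_ttl", remaining)


# Example: a lock with 12 seconds left and an owner in an unknown state.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
row = LockRow(owner_token="worker-a", expires_at=now + timedelta(seconds=12))
print(triage_lock(row, owner_alive=None, now=now))  # ('wait_for_ttl', 12.0)
```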
## Expected self-recovery
- The TTL expires, the lock row is updated or deleted, and the next signal acquires the lock.
- Metrics return to baseline if contention was transient.
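To illustrate why contention clears on its own once the TTL passes, here is a toy in-memory model of a TTL-based lock in Python. It is a sketch of the general technique only and does not reproduce `fn_try_acquire_entry_lock_row`. The class name, the `(profile_id, symbol)` key, and the overwrite-on-expiry behavior are all assumptions.

```python
from typing import Dict, Tuple


class InMemoryEntryLocks:
    """Toy TTL lock table keyed by (profile_id, symbol); illustration only."""

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self.ttl_seconds = ttl_seconds
        # key -> (owner_token, expires_at as seconds on a caller-supplied clock)
        self._rows: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def try_acquire(self, profile_id: str, symbol: str, owner_token: str, now: float) -> bool:
        key = (profile_id, symbol)
        row = self._rows.get(key)
        if row is not None:
            holder, expires_at = row
            if holder != owner_token and now < expires_at:
                return False  # held by another owner and not yet expired: contention
        # Free, expired, or already ours: take (or refresh) the lock.
        self._rows[key] = (owner_token, now + self.ttl_seconds)
        return True


locks = InMemoryEntryLocks(ttl_seconds=30.0)
print(locks.try_acquire("profile-1", "BTCUSD", "worker-a", now=0.0))   # True
print(locks.try_acquire("profile-1", "BTCUSD", "worker-b", now=10.0))  # False
print(locks.try_acquire("profile-1", "BTCUSD", "worker-b", now=30.0))  # True
```

In this model a crashed `worker-a` blocks `worker-b` for at most one TTL, which is the behavior the self-recovery bullets describe.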
## When to escalate
- Contention persists beyond three TTL cycles (90s at the default 30s TTL).
- Multiple profiles report contention simultaneously.
- Lock rows show expired timestamps but are not refreshed.

Escalate to Platform Ops and refer to `docs/runbooks/lock-timeout.md` (if it exists) for the lock escalation procedure.
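The three escalation criteria can be written as a small check. This Python sketch assumes the on-call engineer supplies the three observations by hand or from a monitoring script; the function and parameter names are hypothetical.

```python
from typing import List


def escalation_reasons(
    contention_seconds: float,
    profiles_affected: int,
    expired_rows_not_refreshed: int,
    ttl_seconds: float = 30.0,  # default TTL from this runbook
    ttl_cycles: int = 3,        # escalate after three TTL cycles
) -> List[str]:
    """Return the escalation criteria that are met; an empty list means keep waiting."""
    reasons: List[str] = []
    limit = ttl_cycles * ttl_seconds
    if contention_seconds > limit:
        reasons.append(f"contention has lasted {contention_seconds:.0f}s, beyond {limit:.0f}s")
    if profiles_affected > 1:
        reasons.append(f"{profiles_affected} profiles report contention at the same time")
    if expired_rows_not_refreshed > 0:
        reasons.append(f"{expired_rows_not_refreshed} lock rows are expired but not refreshed")
    return reasons


# Example: 2 minutes of contention on a single profile, no stale rows.
print(escalation_reasons(120, 1, 0))
```

Any non-empty result is a reason to escalate; the messages can be pasted into the escalation ticket.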
## What NOT to do
- Do not delete lock rows manually while other workers are active.
- Do not restart all workers; indiscriminate restarts magnify contention.
- Do not trigger new ENTRY signals for the affected profile until the lock clears.