ENTRY Lock Contention Spike Runbook

Incident description

A profile repeatedly fails to acquire the row-based entry lock. This blocks ENTRY signals and indicates lock pressure between horizontally scaled workers.

Symptoms

  • lockContentionCount increments in /internal/health and /metrics.
  • Logs show fn_try_acquire_entry_lock_row returning false with owner tokens different from the caller.
  • Trading loop reports lock acquisition failed warnings and may skip signals.

Metrics to check

  • /internal/health → lockContentionCount, tradingLoopHealthy, reconciliationLoopHealthy.
  • /metrics → entry_lock_contention_total, lock_acquisition_latency_seconds, entry_lock_holder_info (if available).
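
To compare the counter against its baseline, the /metrics output can be summed per metric name. The following is a minimal sketch, assuming /metrics serves the standard Prometheus text exposition format; the profile_id label and the sample values are hypothetical and only illustrate the shape of the data.

```python
def parse_metric_total(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` in Prometheus text exposition output."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{labels} value [timestamp]
            metric_name = line.split("{", 1)[0]
            value_part = line.rsplit("}", 1)[1]
        else:
            # Unlabelled sample: name value [timestamp]
            metric_name, _, value_part = line.partition(" ")
        if metric_name == name:
            total += float(value_part.split()[0])
    return total


# Hypothetical scrape output; real label names may differ.
SAMPLE = """\
# HELP entry_lock_contention_total Failed entry lock acquisitions
# TYPE entry_lock_contention_total counter
entry_lock_contention_total{profile_id="p1"} 4
entry_lock_contention_total{profile_id="p2"} 1
lock_acquisition_latency_seconds_sum 0.75
"""

contention = parse_metric_total(SAMPLE, "entry_lock_contention_total")
```

Running the same function on two scrapes taken a TTL apart shows whether the counter is still climbing or has flattened out.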

Immediate mitigation

  1. Identify the profile_id and symbol from the logs; confirm whether another worker legitimately holds the lock.
  2. Check whether the current lock owner is still alive or has crashed; use Supabase to inspect the TTL on the entry_locks row.
  3. If the owner appears stuck, wait for TTL expiry (default 30s) before retrying.
  4. Avoid forcing a lock release unless the owner is confirmed dead; manual deletion risks concurrent exchange submission.
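
The wait in step 3 can be computed from the lock row rather than guessed. This is a small sketch, assuming the entry_locks row exposes an expiry timestamp; the name expires_at is hypothetical and should be replaced with whatever column the table actually uses.

```python
from datetime import datetime, timedelta, timezone


def wait_before_retry(expires_at: datetime, now: datetime) -> float:
    """Seconds left until the lock's TTL expires; 0.0 if it already has."""
    return max(0.0, (expires_at - now).total_seconds())


# Hypothetical example: the lock was taken 12s ago with the default 30s TTL.
now = datetime(2024, 1, 1, 12, 0, 12, tzinfo=timezone.utc)
acquired_at = now - timedelta(seconds=12)
expires_at = acquired_at + timedelta(seconds=30)

remaining = wait_before_retry(expires_at, now)
```

A result of 0.0 means the TTL has already passed, so a retry is reasonable. It does not by itself justify deleting the row.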

Expected self-recovery

  • The TTL expires, the lock row is updated or deleted, and the next signal acquires the lock.
  • Metrics return to baseline if contention was transient.
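
The self-recovery path can be illustrated with an in-memory model of a TTL lock. This is only a sketch of the general technique, not the implementation of fn_try_acquire_entry_lock_row; the class, its fields, and the worker names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TtlLockTable:
    """In-memory stand-in for a table of TTL-based lock rows."""
    ttl_seconds: float = 30.0
    # key -> (owner_token, expires_at), with times as plain seconds
    rows: dict = field(default_factory=dict)

    def try_acquire(self, key, owner: str, now: float) -> bool:
        row = self.rows.get(key)
        if row is not None:
            holder, expires_at = row
            # A different owner with an unexpired lock blocks the caller.
            if holder != owner and now < expires_at:
                return False
        # Free, expired, or re-entrant: (re)write the row with a fresh TTL.
        self.rows[key] = (owner, now + self.ttl_seconds)
        return True


table = TtlLockTable()
key = ("profile-1", "BTCUSD")  # hypothetical profile_id and symbol

first = table.try_acquire(key, "worker-a", now=0.0)      # lock is free
blocked = table.try_acquire(key, "worker-b", now=10.0)   # worker-a still holds it
recovered = table.try_acquire(key, "worker-b", now=31.0) # TTL has passed
```

In this model a crashed worker-a needs no cleanup: once its TTL lapses, the next caller overwrites the row.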

When to escalate

  • Contention persists beyond three TTL cycles (90s).
  • Multiple profiles report contention simultaneously.
  • Lock rows show expired timestamps but fail to refresh.

Escalate to Platform Ops and refer to docs/runbooks/lock-timeout.md (if it exists) for lock escalation.
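
The escalation criteria above can be written as a single check, for example in an alerting script. This is a sketch with hypothetical parameter names; the thresholds are the ones stated in this runbook (three TTL cycles, default TTL 30s).

```python
def should_escalate(
    contention_seconds: float,
    affected_profiles: int,
    expired_but_not_refreshed: bool,
    ttl_seconds: float = 30.0,
) -> bool:
    """Return True if any escalation criterion from the runbook is met."""
    persisted_too_long = contention_seconds > 3 * ttl_seconds  # beyond 90s by default
    widespread = affected_profiles > 1
    return persisted_too_long or widespread or expired_but_not_refreshed


# One profile, 60s of contention, lock rows refreshing normally: keep waiting.
keep_waiting = not should_escalate(60, 1, False)
```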

What NOT to do

  • Do not delete lock rows manually while other workers are active.
  • Do not restart all workers; indiscriminate restarts magnify contention.
  • Do not trigger new ENTRY signals for the affected profile until the lock clears.