docs(ops): add operator runbook

2026-05-05 13:53:59 -07:00 · 2026-05-05 13:53:59 -07:00 · 57a7e10bc9
commit 57a7e10bc9
parent 95f252a520
3 changed files with 288 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -139,3 +139,4 @@ Current baseline note: after common-platform workspace alignment, `pnpm install
 - [`docs/SEED_BOOTSTRAP_STRATEGY.md`](docs/SEED_BOOTSTRAP_STRATEGY.md) — Built-in prompt, intake rule, onboarding workspace, and feature-flag bootstrap strategy
 - [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md) — Encrypted-field, schema-change, and backfill migration plan
 - [`docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`](docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md) — Event taxonomy and diagnostic breadcrumb contract
 - [`docs/OPERATOR_RUNBOOK.md`](docs/OPERATOR_RUNBOOK.md) — Incident triage and recovery steps for dependencies, scheduler/webhooks, blob, LLM/extraction, reviews, and MCP
--- a/docs/OPERATOR_RUNBOOK.md
+++ b/docs/OPERATOR_RUNBOOK.md
@ -0,0 +1,286 @@
 # NoteLett Operator Runbook
 Date: May 5, 2026
 Product ID: `notelett`
 Primary service: NoteLett backend `4016`
 Shared services: platform-service `4003`, extraction-service `4005`, mcp-server `4007`
 ## Scope
 This runbook covers production incident triage, dependency outage behavior, stuck scheduler/webhook recovery, failed blob upload recovery, and failed LLM/extraction recovery. It assumes the operator has access to service logs, platform-service, Cosmos metrics, and a valid admin/owner token for diagnostics.
 Use this alongside:
 - `docs/PLATFORM_SMOKE_CHECKS.md`
 - `docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`
 - `docs/COSMOS_DATA_OPERATIONS.md`
 - `docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`
 - `docs/RELEASE_CHECKLIST.md`
 ## First Five Minutes
 1. Freeze unrelated deploys and record incident start time, environment, release commit, and primary symptom.
 2. Check health and dependency readiness:
 ```bash
 curl -sf "$NOTELETT_URL/health"
 curl -sf "$NOTELETT_API_URL/bootstrap"
 curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
 ```
 3. Check shared services:
 ```bash
 curl -sf "$PLATFORM_URL/health"
 curl -sf "$EXTRACTION_URL/health"
 curl -sf "$MCP_ORIGIN/health"
 ```
 4. Check current release flags and kill switch in platform-service. Prefer disabling risky workflows before rolling back images.
 5. Capture a short evidence bundle: readiness response, recent backend errors, platform-service errors, relevant telemetry event names, affected user/workspace ids, and any recent deploy or migration id.
 ## Dependency Outage Behavior
 | Dependency | User-visible behavior | Immediate operator action | Recovery check |
 | --- | --- | --- | --- |
 | Cosmos/datastore | Most authenticated reads/writes fail or hang | Verify Cosmos account/container health, RU throttling, keys, and backend `DB_PROVIDER`; rollback only if code deploy caused query surge | Authenticated workspace/note create/read flow passes |
 | platform-service | Auth, flags, telemetry, diagnostics, kill switch, and blob SAS may degrade | Keep backend alive if auth tokens still validate; disable risky features via cached/default flags where possible; restore platform-service first | `/api/diagnostics/readiness` reports platform-service healthy |
 | extraction-service | Task extraction and note summarization fail with user-facing extraction-down state | Disable extraction-heavy UI if needed; verify sidecar health and extraction queue/cache endpoints | Minimal extract request in `docs/PLATFORM_SMOKE_CHECKS.md` passes |
 | mcp-server | Agent tool registry/calls fail; product UI mostly continues | Keep web/mobile available; validate local tool registration count and MCP `/health` | `GET $MCP_URL/tools` shows NoteLett tools |
 | Blob service | Upload/download attachments fail; existing note text remains available | Verify platform blob routes, storage account/container, SAS permissions, and token scope | SAS request plus small upload/delete smoke passes |
 | LLM provider | Smart Actions, copilot, scheduled actions, URL extraction summaries, and Palace extraction degrade | Disable LLM-heavy flags; inspect provider credentials, model availability, rate limits, and retry logs | Mock or production provider prompt smoke passes |
 ## Feature Flag Triage
 Use platform-service flags and NoteLett defaults from `docs/SEED_BOOTSTRAP_STRATEGY.md`:
 - Disable `notelett_smart_actions_enabled` for broad prompt failures.
 - Disable `notelett_scheduled_actions_enabled` for runaway or stuck scheduler jobs.
 - Disable `notelett_webhooks_enabled` for repeated webhook failures or external callback incidents.
 - Disable `notelett_intake_enabled` if URL intake is causing queue/backpressure or LLM cost spikes.
 - Disable `notelett_collaborative_sharing_enabled` for cross-user access concerns.
 - Disable `notelett_auto_summarize_enabled`, `notelett_auto_embed_enabled`, and `notelett_auto_link_enabled` for background AI cost or mutation concerns.
 Record every flag change with timestamp, operator, reason, and expected rollback condition.
 ## Stuck Scheduler Recovery
 Symptoms:
 - Scheduled prompts are due but no result notes appear.
 - `scheduled_action_fired` drops to zero.
 - Backend logs show `Failed to run scheduled prompt`.
 - Users report weekly digest or scheduled actions missing.
 Diagnosis:
 ```bash
 curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" \
  -H "Authorization: Bearer $TOKEN"
 ```
 Check:
 - `notelett_scheduled_actions_enabled` is true.
 - Scheduler loop is running in the backend process.
 - `nextRunAt` is not in the past for many enabled schedules.
 - Built-in prompt templates were seeded with `pnpm run seed:bootstrap`.
 - LLM provider and datastore readiness are healthy.
 Recovery:
 1. If a schedule is malformed, patch it disabled through `PATCH /api/prompt-schedules/:id`.
 2. If many schedules are due and failing, disable `notelett_scheduled_actions_enabled`.
 3. Restart the backend only after confirming no deploy/migration is currently running.
 4. Re-enable schedules gradually after one manual smoke schedule succeeds.
 5. Record affected schedule ids and whether missed runs need manual replay. Do not bulk-create catch-up notes without product-owner approval.
 ## Webhook Recovery
 There are two webhook concepts:
 - Prompt webhooks in `note_prompt_webhooks`, triggered through `/api/prompt-webhooks/:id/trigger`.
 - Domain event dispatch targets held by `backend/src/lib/webhook-subscriber.ts` for product event delivery.
 Symptoms:
 - `webhook_triggered` events are absent or failing.
 - External integrations report duplicate or missing callbacks.
 - Backend logs show dispatch failures or repeated timeouts.
 Diagnosis:
 ```bash
 curl -sf "$NOTELETT_API_URL/prompt-webhooks" -H "Authorization: Bearer $TOKEN"
 curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" -H "Authorization: Bearer $TOKEN"
 ```
 Check:
 - `notelett_webhooks_enabled` is true only when webhook smoke is expected.
 - Target webhook is enabled and has correct `workspaceId`, `templateId`, `triggerEvent`, and `tagFilter`.
 - Built-in/custom template resolves.
 - External callback endpoint is reachable and not rejecting auth/signature.
 Recovery:
 1. Disable the failing webhook with `PATCH /api/prompt-webhooks/:id`.
 2. If failures are broad or external abuse is suspected, disable `notelett_webhooks_enabled`.
 3. Retry one manual trigger after the dependency is healthy.
 4. For duplicate deliveries, compare external request ids/correlation ids before replay.
 5. Document delivery window and any events intentionally not replayed.
 ## Failed Blob Upload Recovery
 Symptoms:
 - Web artifact upload fails.
 - Mobile image/attachment upload fails.
 - Download links return 403/404.
 Diagnosis:
 ```bash
 curl -sf "$PLATFORM_API_URL/blob/containers" -H "Authorization: Bearer $TOKEN"
 curl -sf "$PLATFORM_API_URL/blob/sas" \
  -H "Authorization: Bearer $TOKEN" \
  -H "content-type: application/json" \
  -d '{"container":"attachments","blobName":"notelett/smoke/operator.txt","permissions":"rw","expiresInMinutes":10}'
 ```
 Check:
 - platform-service is healthy and can reach the storage account.
 - The `attachments` container exists.
 - SAS permissions include the needed operation.
 - The client is using the shared `@bytelyst/blob-client` paths.
 - Blob paths do not include unsafe file names or cross-product prefixes.
 Recovery:
 1. If SAS issuance fails, restore platform-service/blob configuration first.
 2. If upload fails after SAS succeeds, check storage account/container permissions and CORS.
 3. Ask affected user to retry upload only after smoke succeeds.
 4. If a metadata row exists without blob content, either retry upload to the same path or delete the orphaned artifact metadata through the product API after owner approval.
 5. Never paste SAS URLs into tickets or logs; record only container, sanitized prefix, and error class.
 ## Failed LLM Or Extraction Recovery
 Symptoms:
 - Prompt runs return errors.
 - Intake jobs stay failed.
 - Copilot transforms fail.
 - Auto-summary or Palace extraction logs errors.
 Diagnosis:
 ```bash
 curl -sf "$EXTRACTION_URL/health"
 curl -sf "$EXTRACTION_URL/api/extract/models"
 curl -sf "$EXTRACTION_URL/api/extract/sidecar-health"
 curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
 ```
 Check backend env:
 - `LLM_PROVIDER`
 - `LLM_DEFAULT_MODEL`
 - `LLM_VISION_MODEL`
 - `LLM_EMBEDDING_MODEL`
 - provider API credentials
 - `EXTRACTION_SERVICE_URL`
 Recovery:
 1. Disable LLM-heavy feature flags for broad failures.
 2. If only extraction-service is down, keep basic note CRUD/search available and surface extraction-down UI state.
 3. If provider rate limits are hit, lower traffic with flags and wait out/reset provider quotas.
 4. For failed intake jobs, users can resubmit the URL after dependency recovery; do not rewrite failed job status manually unless there is a documented replay tool.
 5. For prompt result corruption, preserve the generated note/artifact as evidence, then archive/delete only with user or product-owner approval.
 ## Review Queue Recovery
 Symptoms:
 - Pending review queue is empty unexpectedly.
 - Approve/reject fails or gets partial batch results.
 - Agent action state appears stuck in `draft` or `proposed`.
 Diagnosis:
 ```bash
 curl -sf "$NOTELETT_API_URL/note-agent-actions/pending?limit=50" \
  -H "Authorization: Bearer $TOKEN"
 ```
 Check:
 - user id and product id scope
 - `workspaceId` query when patching a specific action
 - encrypted field readiness for action summary/review fields
 - batch-review response `updated`, `not_found`, and `error` counts
 Recovery:
 1. Retry single review before retrying a large batch.
 2. If partial batch failure occurred, use returned ids to retry only failed/not-found records after confirming workspace scope.
 3. Preserve audit trail; do not directly mutate Cosmos unless product-owner approves an incident backfill.
 4. If cross-user access is suspected, disable sharing/collaboration flags and escalate.
 ## MCP Action Recovery
 Symptoms:
 - Agent tools cannot list/create/update notes.
 - MCP write tool created duplicate or stuck audit actions.
 - Tool calls fail with auth/product scope errors.
 Diagnosis:
 ```bash
 curl -sf "$MCP_ORIGIN/health"
 curl -sf "$MCP_URL/tools" -H "Authorization: Bearer $TOKEN"
 curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
 ```
 Check:
 - NoteLett MCP tools are registered.
 - Product backend health/readiness is green.
 - Token role is sufficient for requested tool.
 - Write tools used idempotency key, dry-run, and correlation id when available.
 Recovery:
 1. Pause agent automation if duplicate writes or scope concerns appear.
 2. Use `note-agent-actions` records to identify applied/proposed actions.
 3. Reject or archive unwanted proposed actions through the product API.
 4. For already-applied note changes, use note version history or restore flow where available.
 5. Keep MCP disabled until a dry-run tool smoke succeeds.
 ## Communication And Closeout
 During incident:
 - Post status every 15-30 minutes with impact, current hypothesis, mitigation, and next check.
 - Name exact degraded workflows: web note CRUD, mobile capture, Smart Actions, intake, reviews, MCP, or sharing.
 - Avoid exposing secrets, share tokens, note text, prompt text, full URLs, or raw LLM output in status updates.
 Closeout requires:
 - all affected health/readiness/smoke checks pass
 - feature flags restored or explicitly left disabled with owner/date
 - release or rollback commit recorded
 - telemetry and diagnostics event names captured
 - data migration/backfill, if any, documented with counts and rollback decision
 - follow-up issue or roadmap item for every residual risk
 ## Verification
 For runbook-only changes:
 ```bash
 git diff --check
 rg -n "OPERATOR_RUNBOOK|Stuck Scheduler Recovery|Failed Blob Upload Recovery|Failed LLM Or Extraction Recovery|MCP Action Recovery" docs README.md
 ```
--- a/docs/RELEASE_CHECKLIST.md
+++ b/docs/RELEASE_CHECKLIST.md
@ -96,6 +96,7 @@ Do not place secrets in `NEXT_PUBLIC_*` or `EXPO_PUBLIC_*` variables.
 ## Pre-Deploy Checklist
 - Confirm release commit is pushed and CI is green.
 - Confirm `docs/OPERATOR_RUNBOOK.md` has the current incident owner, service URLs, and rollback target for this environment.
 - Confirm `pnpm run audit:release-guards` passes.
 - Confirm `pnpm run dependency:health` has no typecheck failures; review the non-blocking outdated report.
 - Confirm backend, web, and mobile tests from `docs/PRODUCTION_READINESS_HANDOFF_ROADMAP.md` P10 have passed or have explicit release-owner signoff.