From 57a7e10bc908d1d24ea3577732c85a62a83b0905 Mon Sep 17 00:00:00 2001 From: Saravana Achu Mac Date: Tue, 5 May 2026 13:53:59 -0700 Subject: [PATCH] docs(ops): add operator runbook --- README.md | 1 + docs/OPERATOR_RUNBOOK.md | 286 ++++++++++++++++++++++++++++++++++++++ docs/RELEASE_CHECKLIST.md | 1 + 3 files changed, 288 insertions(+) create mode 100644 docs/OPERATOR_RUNBOOK.md diff --git a/README.md b/README.md index a4513e3..aed43f5 100644 --- a/README.md +++ b/README.md @@ -139,3 +139,4 @@ Current baseline note: after common-platform workspace alignment, `pnpm install - [`docs/SEED_BOOTSTRAP_STRATEGY.md`](docs/SEED_BOOTSTRAP_STRATEGY.md) — Built-in prompt, intake rule, onboarding workspace, and feature-flag bootstrap strategy - [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md) — Encrypted-field, schema-change, and backfill migration plan - [`docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`](docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md) — Event taxonomy and diagnostic breadcrumb contract +- [`docs/OPERATOR_RUNBOOK.md`](docs/OPERATOR_RUNBOOK.md) — Incident triage and recovery steps for dependencies, scheduler/webhooks, blob, LLM/extraction, reviews, and MCP diff --git a/docs/OPERATOR_RUNBOOK.md b/docs/OPERATOR_RUNBOOK.md new file mode 100644 index 0000000..d69d66b --- /dev/null +++ b/docs/OPERATOR_RUNBOOK.md @@ -0,0 +1,286 @@ +# NoteLett Operator Runbook + +Date: May 5, 2026 +Product ID: `notelett` +Primary service: NoteLett backend `4016` +Shared services: platform-service `4003`, extraction-service `4005`, mcp-server `4007` + +## Scope + +This runbook covers production incident triage, dependency outage behavior, stuck scheduler/webhook recovery, failed blob upload recovery, and failed LLM/extraction recovery. It assumes the operator has access to service logs, platform-service, Cosmos metrics, and a valid admin/owner token for diagnostics. + +Use this alongside: + +- `docs/PLATFORM_SMOKE_CHECKS.md` +- `docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md` +- `docs/COSMOS_DATA_OPERATIONS.md` +- `docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md` +- `docs/RELEASE_CHECKLIST.md` + +## First Five Minutes + +1. Freeze unrelated deploys and record incident start time, environment, release commit, and primary symptom. +2. Check health and dependency readiness: + +```bash +curl -sf "$NOTELETT_URL/health" +curl -sf "$NOTELETT_API_URL/bootstrap" +curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN" +``` + +3. Check shared services: + +```bash +curl -sf "$PLATFORM_URL/health" +curl -sf "$EXTRACTION_URL/health" +curl -sf "$MCP_ORIGIN/health" +``` + +4. Check current release flags and kill switch in platform-service. Prefer disabling risky workflows before rolling back images. +5. Capture a short evidence bundle: readiness response, recent backend errors, platform-service errors, relevant telemetry event names, affected user/workspace ids, and any recent deploy or migration id. + +## Dependency Outage Behavior + +| Dependency | User-visible behavior | Immediate operator action | Recovery check | +| --- | --- | --- | --- | +| Cosmos/datastore | Most authenticated reads/writes fail or hang | Verify Cosmos account/container health, RU throttling, keys, and backend `DB_PROVIDER`; rollback only if code deploy caused query surge | Authenticated workspace/note create/read flow passes | +| platform-service | Auth, flags, telemetry, diagnostics, kill switch, and blob SAS may degrade | Keep backend alive if auth tokens still validate; disable risky features via cached/default flags where possible; restore platform-service first | `/api/diagnostics/readiness` reports platform-service healthy | +| extraction-service | Task extraction and note summarization fail with user-facing extraction-down state | Disable extraction-heavy UI if needed; verify sidecar health and extraction queue/cache endpoints | Minimal extract request in `docs/PLATFORM_SMOKE_CHECKS.md` passes | +| mcp-server | Agent tool registry/calls fail; product UI mostly continues | Keep web/mobile available; validate local tool registration count and MCP `/health` | `GET $MCP_URL/tools` shows NoteLett tools | +| Blob service | Upload/download attachments fail; existing note text remains available | Verify platform blob routes, storage account/container, SAS permissions, and token scope | SAS request plus small upload/delete smoke passes | +| LLM provider | Smart Actions, copilot, scheduled actions, URL extraction summaries, and Palace extraction degrade | Disable LLM-heavy flags; inspect provider credentials, model availability, rate limits, and retry logs | Mock or production provider prompt smoke passes | + +## Feature Flag Triage + +Use platform-service flags and NoteLett defaults from `docs/SEED_BOOTSTRAP_STRATEGY.md`: + +- Disable `notelett_smart_actions_enabled` for broad prompt failures. +- Disable `notelett_scheduled_actions_enabled` for runaway or stuck scheduler jobs. +- Disable `notelett_webhooks_enabled` for repeated webhook failures or external callback incidents. +- Disable `notelett_intake_enabled` if URL intake is causing queue/backpressure or LLM cost spikes. +- Disable `notelett_collaborative_sharing_enabled` for cross-user access concerns. +- Disable `notelett_auto_summarize_enabled`, `notelett_auto_embed_enabled`, and `notelett_auto_link_enabled` for background AI cost or mutation concerns. + +Record every flag change with timestamp, operator, reason, and expected rollback condition. + +## Stuck Scheduler Recovery + +Symptoms: + +- Scheduled prompts are due but no result notes appear. +- `scheduled_action_fired` drops to zero. +- Backend logs show `Failed to run scheduled prompt`. +- Users report weekly digest or scheduled actions missing. + +Diagnosis: + +```bash +curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" \ + -H "Authorization: Bearer $TOKEN" +``` + +Check: + +- `notelett_scheduled_actions_enabled` is true. +- Scheduler loop is running in the backend process. +- `nextRunAt` is not in the past for many enabled schedules. +- Built-in prompt templates were seeded with `pnpm run seed:bootstrap`. +- LLM provider and datastore readiness are healthy. + +Recovery: + +1. If a schedule is malformed, patch it disabled through `PATCH /api/prompt-schedules/:id`. +2. If many schedules are due and failing, disable `notelett_scheduled_actions_enabled`. +3. Restart the backend only after confirming no deploy/migration is currently running. +4. Re-enable schedules gradually after one manual smoke schedule succeeds. +5. Record affected schedule ids and whether missed runs need manual replay. Do not bulk-create catch-up notes without product-owner approval. + +## Webhook Recovery + +There are two webhook concepts: + +- Prompt webhooks in `note_prompt_webhooks`, triggered through `/api/prompt-webhooks/:id/trigger`. +- Domain event dispatch targets held by `backend/src/lib/webhook-subscriber.ts` for product event delivery. + +Symptoms: + +- `webhook_triggered` events are absent or failing. +- External integrations report duplicate or missing callbacks. +- Backend logs show dispatch failures or repeated timeouts. + +Diagnosis: + +```bash +curl -sf "$NOTELETT_API_URL/prompt-webhooks" -H "Authorization: Bearer $TOKEN" +curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" -H "Authorization: Bearer $TOKEN" +``` + +Check: + +- `notelett_webhooks_enabled` is true only when webhook smoke is expected. +- Target webhook is enabled and has correct `workspaceId`, `templateId`, `triggerEvent`, and `tagFilter`. +- Built-in/custom template resolves. +- External callback endpoint is reachable and not rejecting auth/signature. + +Recovery: + +1. Disable the failing webhook with `PATCH /api/prompt-webhooks/:id`. +2. If failures are broad or external abuse is suspected, disable `notelett_webhooks_enabled`. +3. Retry one manual trigger after the dependency is healthy. +4. For duplicate deliveries, compare external request ids/correlation ids before replay. +5. Document delivery window and any events intentionally not replayed. + +## Failed Blob Upload Recovery + +Symptoms: + +- Web artifact upload fails. +- Mobile image/attachment upload fails. +- Download links return 403/404. + +Diagnosis: + +```bash +curl -sf "$PLATFORM_API_URL/blob/containers" -H "Authorization: Bearer $TOKEN" +curl -sf "$PLATFORM_API_URL/blob/sas" \ + -H "Authorization: Bearer $TOKEN" \ + -H "content-type: application/json" \ + -d '{"container":"attachments","blobName":"notelett/smoke/operator.txt","permissions":"rw","expiresInMinutes":10}' +``` + +Check: + +- platform-service is healthy and can reach the storage account. +- The `attachments` container exists. +- SAS permissions include the needed operation. +- The client is using the shared `@bytelyst/blob-client` paths. +- Blob paths do not include unsafe file names or cross-product prefixes. + +Recovery: + +1. If SAS issuance fails, restore platform-service/blob configuration first. +2. If upload fails after SAS succeeds, check storage account/container permissions and CORS. +3. Ask affected user to retry upload only after smoke succeeds. +4. If a metadata row exists without blob content, either retry upload to the same path or delete the orphaned artifact metadata through the product API after owner approval. +5. Never paste SAS URLs into tickets or logs; record only container, sanitized prefix, and error class. + +## Failed LLM Or Extraction Recovery + +Symptoms: + +- Prompt runs return errors. +- Intake jobs stay failed. +- Copilot transforms fail. +- Auto-summary or Palace extraction logs errors. + +Diagnosis: + +```bash +curl -sf "$EXTRACTION_URL/health" +curl -sf "$EXTRACTION_URL/api/extract/models" +curl -sf "$EXTRACTION_URL/api/extract/sidecar-health" +curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN" +``` + +Check backend env: + +- `LLM_PROVIDER` +- `LLM_DEFAULT_MODEL` +- `LLM_VISION_MODEL` +- `LLM_EMBEDDING_MODEL` +- provider API credentials +- `EXTRACTION_SERVICE_URL` + +Recovery: + +1. Disable LLM-heavy feature flags for broad failures. +2. If only extraction-service is down, keep basic note CRUD/search available and surface extraction-down UI state. +3. If provider rate limits are hit, lower traffic with flags and wait out/reset provider quotas. +4. For failed intake jobs, users can resubmit the URL after dependency recovery; do not rewrite failed job status manually unless there is a documented replay tool. +5. For prompt result corruption, preserve the generated note/artifact as evidence, then archive/delete only with user or product-owner approval. + +## Review Queue Recovery + +Symptoms: + +- Pending review queue is empty unexpectedly. +- Approve/reject fails or gets partial batch results. +- Agent action state appears stuck in `draft` or `proposed`. + +Diagnosis: + +```bash +curl -sf "$NOTELETT_API_URL/note-agent-actions/pending?limit=50" \ + -H "Authorization: Bearer $TOKEN" +``` + +Check: + +- user id and product id scope +- `workspaceId` query when patching a specific action +- encrypted field readiness for action summary/review fields +- batch-review response `updated`, `not_found`, and `error` counts + +Recovery: + +1. Retry single review before retrying a large batch. +2. If partial batch failure occurred, use returned ids to retry only failed/not-found records after confirming workspace scope. +3. Preserve audit trail; do not directly mutate Cosmos unless product-owner approves an incident backfill. +4. If cross-user access is suspected, disable sharing/collaboration flags and escalate. + +## MCP Action Recovery + +Symptoms: + +- Agent tools cannot list/create/update notes. +- MCP write tool created duplicate or stuck audit actions. +- Tool calls fail with auth/product scope errors. + +Diagnosis: + +```bash +curl -sf "$MCP_ORIGIN/health" +curl -sf "$MCP_URL/tools" -H "Authorization: Bearer $TOKEN" +curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN" +``` + +Check: + +- NoteLett MCP tools are registered. +- Product backend health/readiness is green. +- Token role is sufficient for requested tool. +- Write tools used idempotency key, dry-run, and correlation id when available. + +Recovery: + +1. Pause agent automation if duplicate writes or scope concerns appear. +2. Use `note-agent-actions` records to identify applied/proposed actions. +3. Reject or archive unwanted proposed actions through the product API. +4. For already-applied note changes, use note version history or restore flow where available. +5. Keep MCP disabled until a dry-run tool smoke succeeds. + +## Communication And Closeout + +During incident: + +- Post status every 15-30 minutes with impact, current hypothesis, mitigation, and next check. +- Name exact degraded workflows: web note CRUD, mobile capture, Smart Actions, intake, reviews, MCP, or sharing. +- Avoid exposing secrets, share tokens, note text, prompt text, full URLs, or raw LLM output in status updates. + +Closeout requires: + +- all affected health/readiness/smoke checks pass +- feature flags restored or explicitly left disabled with owner/date +- release or rollback commit recorded +- telemetry and diagnostics event names captured +- data migration/backfill, if any, documented with counts and rollback decision +- follow-up issue or roadmap item for every residual risk + +## Verification + +For runbook-only changes: + +```bash +git diff --check +rg -n "OPERATOR_RUNBOOK|Stuck Scheduler Recovery|Failed Blob Upload Recovery|Failed LLM Or Extraction Recovery|MCP Action Recovery" docs README.md +``` diff --git a/docs/RELEASE_CHECKLIST.md b/docs/RELEASE_CHECKLIST.md index d5608cf..11dfdc7 100644 --- a/docs/RELEASE_CHECKLIST.md +++ b/docs/RELEASE_CHECKLIST.md @@ -96,6 +96,7 @@ Do not place secrets in `NEXT_PUBLIC_*` or `EXPO_PUBLIC_*` variables. ## Pre-Deploy Checklist - Confirm release commit is pushed and CI is green. +- Confirm `docs/OPERATOR_RUNBOOK.md` has the current incident owner, service URLs, and rollback target for this environment. - Confirm `pnpm run audit:release-guards` passes. - Confirm `pnpm run dependency:health` has no typecheck failures; review the non-blocking outdated report. - Confirm backend, web, and mobile tests from `docs/PRODUCTION_READINESS_HANDOFF_ROADMAP.md` P10 have passed or have explicit release-owner signoff.