From 57a7e10bc908d1d24ea3577732c85a62a83b0905 Mon Sep 17 00:00:00 2001
From: Saravana Achu Mac <saravanakumardb1@gmail.com>
Date: Tue, 5 May 2026 13:53:59 -0700
Subject: [PATCH] docs(ops): add operator runbook

---
 README.md                 |   1 +
 docs/OPERATOR_RUNBOOK.md  | 286 ++++++++++++++++++++++++++++++++++++++
 docs/RELEASE_CHECKLIST.md |   1 +
 3 files changed, 288 insertions(+)
 create mode 100644 docs/OPERATOR_RUNBOOK.md

diff --git a/README.md b/README.md
index a4513e3..aed43f5 100644
--- a/README.md
+++ b/README.md
@@ -139,3 +139,4 @@ Current baseline note: after common-platform workspace alignment, `pnpm install
 - [`docs/SEED_BOOTSTRAP_STRATEGY.md`](docs/SEED_BOOTSTRAP_STRATEGY.md) — Built-in prompt, intake rule, onboarding workspace, and feature-flag bootstrap strategy
 - [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md) — Encrypted-field, schema-change, and backfill migration plan
 - [`docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`](docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md) — Event taxonomy and diagnostic breadcrumb contract
+- [`docs/OPERATOR_RUNBOOK.md`](docs/OPERATOR_RUNBOOK.md) — Incident triage and recovery steps for dependencies, scheduler/webhooks, blob, LLM/extraction, reviews, and MCP
diff --git a/docs/OPERATOR_RUNBOOK.md b/docs/OPERATOR_RUNBOOK.md
new file mode 100644
index 0000000..d69d66b
--- /dev/null
+++ b/docs/OPERATOR_RUNBOOK.md
@@ -0,0 +1,286 @@
+# NoteLett Operator Runbook
+
+- Date: May 5, 2026
+- Product ID: `notelett`
+- Primary service: NoteLett backend `4016`
+- Shared services: platform-service `4003`, extraction-service `4005`, mcp-server `4007`
+
+## Scope
+
+This runbook covers production incident triage, dependency outage behavior, stuck scheduler and webhook recovery, failed blob upload recovery, failed LLM/extraction recovery, review queue recovery, and MCP action recovery. It assumes the operator has access to service logs, platform-service, Cosmos metrics, and a valid admin/owner token for diagnostics.
+
+Use this alongside:
+
+- `docs/PLATFORM_SMOKE_CHECKS.md`
+- `docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`
+- `docs/COSMOS_DATA_OPERATIONS.md`
+- `docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`
+- `docs/RELEASE_CHECKLIST.md`
+
+## First Five Minutes
+
+1. Freeze unrelated deploys and record incident start time, environment, release commit, and primary symptom.
+2. Check health and dependency readiness:
+
+```bash
+curl -sf "$NOTELETT_URL/health"
+curl -sf "$NOTELETT_API_URL/bootstrap"
+curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
+```
+
+3. Check shared services:
+
+```bash
+curl -sf "$PLATFORM_URL/health"
+curl -sf "$EXTRACTION_URL/health"
+curl -sf "$MCP_ORIGIN/health"
+```
+
+4. Check current release flags and kill switch in platform-service. Prefer disabling risky workflows before rolling back images.
+5. Capture a short evidence bundle: readiness response, recent backend errors, platform-service errors, relevant telemetry event names, affected user/workspace ids, and any recent deploy or migration id.
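+
+The checks in steps 2 and 3 can be folded into one sweep that records a status per service for the evidence bundle. The following is a minimal sketch, not part of the NoteLett tooling: `classify_status`, `probe`, and `health_sweep` are hypothetical helper names, the output format is an assumption, and it expects the same environment variables used in the commands above to be exported.
+
+```bash
+# Hypothetical helpers; names and output format are illustrative assumptions.
+
+# Map an HTTP status code to a coarse label for the evidence bundle.
+classify_status() {
+  case "$1" in
+    2??) echo "ok" ;;
+    000) echo "unreachable" ;;   # curl reports 000 when no HTTP response arrived
+    *)   echo "degraded" ;;
+  esac
+}
+
+# Print the HTTP status code for a URL; never aborts the sweep on failure.
+probe() {
+  curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1" || true
+}
+
+# Print "<name> <code> <label>" for each health endpoint.
+health_sweep() {
+  local name url code
+  while read -r name url; do
+    code="$(probe "$url")"
+    echo "$name $code $(classify_status "$code")"
+  done <<EOF
+notelett $NOTELETT_URL/health
+platform $PLATFORM_URL/health
+extraction $EXTRACTION_URL/health
+mcp $MCP_ORIGIN/health
+EOF
+}
+```
+
+Running `health_sweep` prints one line per service. Unlike `curl -sf`, it keeps going when one service is down, so the full picture lands in the incident record in a single pass.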
+
+## Dependency Outage Behavior
+
+| Dependency | User-visible behavior | Immediate operator action | Recovery check |
+| --- | --- | --- | --- |
+| Cosmos/datastore | Most authenticated reads/writes fail or hang | Verify Cosmos account/container health, RU throttling, keys, and backend `DB_PROVIDER`; roll back only if a code deploy caused the query surge | Authenticated workspace/note create/read flow passes |
+| platform-service | Auth, flags, telemetry, diagnostics, kill switch, and blob SAS may degrade | Keep backend alive if auth tokens still validate; disable risky features via cached/default flags where possible; restore platform-service first | `/api/diagnostics/readiness` reports platform-service healthy |
+| extraction-service | Task extraction and note summarization fail with user-facing extraction-down state | Disable extraction-heavy UI if needed; verify sidecar health and extraction queue/cache endpoints | Minimal extract request in `docs/PLATFORM_SMOKE_CHECKS.md` passes |
+| mcp-server | Agent tool registry/calls fail; product UI mostly continues | Keep web/mobile available; validate local tool registration count and MCP `/health` | `GET $MCP_URL/tools` shows NoteLett tools |
+| Blob service | Upload/download attachments fail; existing note text remains available | Verify platform blob routes, storage account/container, SAS permissions, and token scope | SAS request plus small upload/delete smoke passes |
+| LLM provider | Smart Actions, copilot, scheduled actions, URL extraction summaries, and Palace extraction degrade | Disable LLM-heavy flags; inspect provider credentials, model availability, rate limits, and retry logs | Mock or production provider prompt smoke passes |
+
+## Feature Flag Triage
+
+Use platform-service flags and NoteLett defaults from `docs/SEED_BOOTSTRAP_STRATEGY.md`:
+
+- Disable `notelett_smart_actions_enabled` for broad prompt failures.
+- Disable `notelett_scheduled_actions_enabled` for runaway or stuck scheduler jobs.
+- Disable `notelett_webhooks_enabled` for repeated webhook failures or external callback incidents.
+- Disable `notelett_intake_enabled` if URL intake is causing queue/backpressure or LLM cost spikes.
+- Disable `notelett_collaborative_sharing_enabled` for cross-user access concerns.
+- Disable `notelett_auto_summarize_enabled`, `notelett_auto_embed_enabled`, and `notelett_auto_link_enabled` for background AI cost or mutation concerns.
+
+Record every flag change with timestamp, operator, reason, and expected rollback condition.
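+
+One lightweight way to keep that record is an append-only log written at the moment of the change. This is a sketch under assumptions: `record_flag_change`, the `FLAG_CHANGE_LOG` and `OPERATOR` variables, and the tab-separated layout are all hypothetical, and the helper only records the change; it does not toggle anything in platform-service.
+
+```bash
+# Hypothetical helper; the log path and column layout are assumptions.
+# Usage: record_flag_change <flag> <new_value> <reason> <rollback_condition>
+record_flag_change() {
+  local flag="$1" value="$2" reason="$3" rollback="$4"
+  local log="${FLAG_CHANGE_LOG:-flag-changes.tsv}"
+  # Columns: UTC timestamp, operator, flag, new value, reason, rollback condition
+  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
+    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${OPERATOR:-${USER:-unknown}}" \
+    "$flag" "$value" "$reason" "$rollback" >> "$log"
+}
+```
+
+For example: `record_flag_change notelett_webhooks_enabled false "repeated callback timeouts" "external endpoint healthy again"`.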
+
+## Stuck Scheduler Recovery
+
+Symptoms:
+
+- Scheduled prompts are due but no result notes appear.
+- `scheduled_action_fired` drops to zero.
+- Backend logs show `Failed to run scheduled prompt`.
+- Users report weekly digest or scheduled actions missing.
+
+Diagnosis:
+
+```bash
+curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+Check:
+
+- `notelett_scheduled_actions_enabled` is true.
+- Scheduler loop is running in the backend process.
+- Few or no enabled schedules have a `nextRunAt` in the past; many overdue schedules indicate a stalled scheduler.
+- Built-in prompt templates were seeded with `pnpm run seed:bootstrap`.
+- LLM provider and datastore readiness are healthy.
+
+Recovery:
+
+1. If a schedule is malformed, patch it disabled through `PATCH /api/prompt-schedules/:id`.
+2. If many schedules are due and failing, disable `notelett_scheduled_actions_enabled`.
+3. Restart the backend only after confirming no deploy/migration is currently running.
+4. Re-enable schedules gradually after one manual smoke schedule succeeds.
+5. Record affected schedule ids and whether missed runs need manual replay. Do not bulk-create catch-up notes without product-owner approval.
+
+## Webhook Recovery
+
+There are two webhook concepts:
+
+- Prompt webhooks in `note_prompt_webhooks`, triggered through `/api/prompt-webhooks/:id/trigger`.
+- Domain event dispatch targets held by `backend/src/lib/webhook-subscriber.ts` for product event delivery.
+
+Symptoms:
+
+- `webhook_triggered` events are absent or failing.
+- External integrations report duplicate or missing callbacks.
+- Backend logs show dispatch failures or repeated timeouts.
+
+Diagnosis:
+
+```bash
+curl -sf "$NOTELETT_API_URL/prompt-webhooks" -H "Authorization: Bearer $TOKEN"
+curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" -H "Authorization: Bearer $TOKEN"
+```
+
+Check:
+
+- `notelett_webhooks_enabled` is true only when webhook smoke is expected.
+- Target webhook is enabled and has correct `workspaceId`, `templateId`, `triggerEvent`, and `tagFilter`.
+- Built-in/custom template resolves.
+- External callback endpoint is reachable and not rejecting auth/signature.
+
+Recovery:
+
+1. Disable the failing webhook with `PATCH /api/prompt-webhooks/:id`.
+2. If failures are broad or external abuse is suspected, disable `notelett_webhooks_enabled`.
+3. Retry one manual trigger after the dependency is healthy.
+4. For duplicate deliveries, compare external request ids/correlation ids before replay.
+5. Document delivery window and any events intentionally not replayed.
+
+## Failed Blob Upload Recovery
+
+Symptoms:
+
+- Web artifact upload fails.
+- Mobile image/attachment upload fails.
+- Download links return 403/404.
+
+Diagnosis:
+
+```bash
+curl -sf "$PLATFORM_API_URL/blob/containers" -H "Authorization: Bearer $TOKEN"
+curl -sf "$PLATFORM_API_URL/blob/sas" \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "content-type: application/json" \
+  -d '{"container":"attachments","blobName":"notelett/smoke/operator.txt","permissions":"rw","expiresInMinutes":10}'
+```
+
+Check:
+
+- platform-service is healthy and can reach the storage account.
+- The `attachments` container exists.
+- SAS permissions include the needed operation.
+- The client is using the shared `@bytelyst/blob-client` paths.
+- Blob paths do not include unsafe file names or cross-product prefixes.
+
+Recovery:
+
+1. If SAS issuance fails, restore platform-service/blob configuration first.
+2. If upload fails after SAS succeeds, check storage account/container permissions and CORS.
+3. Ask affected user to retry upload only after smoke succeeds.
+4. If a metadata row exists without blob content, either retry upload to the same path or delete the orphaned artifact metadata through the product API after owner approval.
+5. Never paste SAS URLs into tickets or logs; record only container, sanitized prefix, and error class.
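+
+Step 5 is easier to follow when redaction is mechanical. The sketch below assumes the SAS token travels in the URL query string, as it does for Azure Storage SAS URLs; `redact_sas_url` is a hypothetical helper name and the host in the example is a placeholder.
+
+```bash
+# Hypothetical helper: drop everything from the first "?" onward,
+# which removes the signature and other SAS parameters.
+redact_sas_url() {
+  printf '%s\n' "${1%%\?*}"
+}
+
+# Example (placeholder host):
+#   redact_sas_url 'https://storage.example.invalid/attachments/notelett/smoke/operator.txt?sig=abc'
+```
+
+The result still contains the full blob path, so trim it to the container and a sanitized prefix before recording it.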
+
+## Failed LLM Or Extraction Recovery
+
+Symptoms:
+
+- Prompt runs return errors.
+- Intake jobs stay failed.
+- Copilot transforms fail.
+- Auto-summary or Palace extraction logs errors.
+
+Diagnosis:
+
+```bash
+curl -sf "$EXTRACTION_URL/health"
+curl -sf "$EXTRACTION_URL/api/extract/models"
+curl -sf "$EXTRACTION_URL/api/extract/sidecar-health"
+curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
+```
+
+Check backend env:
+
+- `LLM_PROVIDER`
+- `LLM_DEFAULT_MODEL`
+- `LLM_VISION_MODEL`
+- `LLM_EMBEDDING_MODEL`
+- provider API credentials
+- `EXTRACTION_SERVICE_URL`
+
+Recovery:
+
+1. Disable LLM-heavy feature flags for broad failures.
+2. If only extraction-service is down, keep basic note CRUD/search available and surface extraction-down UI state.
+3. If provider rate limits are hit, reduce traffic with flags, then either wait for the provider quotas to recover or have them reset.
+4. For failed intake jobs, users can resubmit the URL after dependency recovery; do not rewrite failed job status manually unless there is a documented replay tool.
+5. For prompt result corruption, preserve the generated note/artifact as evidence, then archive/delete only with user or product-owner approval.
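+
+Several of the steps above depend on waiting until a dependency is healthy again. Polling with exponential backoff avoids adding load to a service that is already rate limited. The helper below is a generic sketch; `retry_with_backoff` is a hypothetical name and the attempt count and delays are placeholders to tune per incident.
+
+```bash
+# Hypothetical helper: retry a command with exponential backoff.
+# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
+retry_with_backoff() {
+  local max="$1" delay="$2" attempt=1
+  shift 2
+  until "$@"; do
+    if [ "$attempt" -ge "$max" ]; then
+      echo "gave up after $attempt attempts" >&2
+      return 1
+    fi
+    sleep "$delay"
+    delay=$((delay * 2))        # double the wait after each failure
+    attempt=$((attempt + 1))
+  done
+}
+```
+
+For example, `retry_with_backoff 5 30 curl -sf "$EXTRACTION_URL/health"` checks up to five times, waiting 30, 60, 120, and 240 seconds between attempts.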
+
+## Review Queue Recovery
+
+Symptoms:
+
+- Pending review queue is empty unexpectedly.
+- Approve/reject fails or gets partial batch results.
+- Agent action state appears stuck in `draft` or `proposed`.
+
+Diagnosis:
+
+```bash
+curl -sf "$NOTELETT_API_URL/note-agent-actions/pending?limit=50" \
+  -H "Authorization: Bearer $TOKEN"
+```
+
+Check:
+
+- user id and product id scope
+- `workspaceId` query when patching a specific action
+- encrypted field readiness for action summary/review fields
+- batch-review response `updated`, `not_found`, and `error` counts
+
+Recovery:
+
+1. Retry single review before retrying a large batch.
+2. If partial batch failure occurred, use returned ids to retry only failed/not-found records after confirming workspace scope.
+3. Preserve audit trail; do not directly mutate Cosmos unless product-owner approves an incident backfill.
+4. If cross-user access is suspected, disable sharing/collaboration flags and escalate.
+
+## MCP Action Recovery
+
+Symptoms:
+
+- Agent tools cannot list/create/update notes.
+- MCP write tool created duplicate or stuck audit actions.
+- Tool calls fail with auth/product scope errors.
+
+Diagnosis:
+
+```bash
+curl -sf "$MCP_ORIGIN/health"
+curl -sf "$MCP_URL/tools" -H "Authorization: Bearer $TOKEN"
+curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
+```
+
+Check:
+
+- NoteLett MCP tools are registered.
+- Product backend health/readiness is green.
+- Token role is sufficient for requested tool.
+- Write tools used idempotency key, dry-run, and correlation id when available.
+
+Recovery:
+
+1. Pause agent automation if duplicate writes or scope concerns appear.
+2. Use `note-agent-actions` records to identify applied/proposed actions.
+3. Reject or archive unwanted proposed actions through the product API.
+4. For already-applied note changes, use note version history or restore flow where available.
+5. Keep MCP disabled until a dry-run tool smoke succeeds.
+
+## Communication And Closeout
+
+During incident:
+
+- Post status every 15-30 minutes with impact, current hypothesis, mitigation, and next check.
+- Name exact degraded workflows: web note CRUD, mobile capture, Smart Actions, intake, reviews, MCP, or sharing.
+- Avoid exposing secrets, share tokens, note text, prompt text, full URLs, or raw LLM output in status updates.
+
+Closeout requires:
+
+- all affected health/readiness/smoke checks pass
+- feature flags restored or explicitly left disabled with owner/date
+- release or rollback commit recorded
+- telemetry and diagnostics event names captured
+- data migration/backfill, if any, documented with counts and rollback decision
+- follow-up issue or roadmap item for every residual risk
+
+## Verification
+
+For runbook-only changes:
+
+```bash
+git diff --check
+rg -n "OPERATOR_RUNBOOK|Stuck Scheduler Recovery|Failed Blob Upload Recovery|Failed LLM Or Extraction Recovery|MCP Action Recovery" docs README.md
+```
diff --git a/docs/RELEASE_CHECKLIST.md b/docs/RELEASE_CHECKLIST.md
index d5608cf..11dfdc7 100644
--- a/docs/RELEASE_CHECKLIST.md
+++ b/docs/RELEASE_CHECKLIST.md
@@ -96,6 +96,7 @@ Do not place secrets in `NEXT_PUBLIC_*` or `EXPO_PUBLIC_*` variables.
 ## Pre-Deploy Checklist
 
 - Confirm release commit is pushed and CI is green.
+- Confirm `docs/OPERATOR_RUNBOOK.md` has the current incident owner, service URLs, and rollback target for this environment.
 - Confirm `pnpm run audit:release-guards` passes.
 - Confirm `pnpm run dependency:health` has no typecheck failures; review the non-blocking outdated report.
 - Confirm backend, web, and mobile tests from `docs/PRODUCTION_READINESS_HANDOFF_ROADMAP.md` P10 have passed or have explicit release-owner signoff.