NoteLett Operator Runbook
Date: May 5, 2026
Product ID: notelett
Primary service: NoteLett backend 4016
Shared services: platform-service 4003, extraction-service 4005, mcp-server 4007
Scope
This runbook covers production incident triage, dependency outage behavior, stuck scheduler/webhook recovery, failed blob upload recovery, and failed LLM/extraction recovery. It assumes the operator has access to service logs, platform-service, Cosmos metrics, and a valid admin/owner token for diagnostics.
Use this alongside:
- docs/PLATFORM_SMOKE_CHECKS.md
- docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md
- docs/COSMOS_DATA_OPERATIONS.md
- docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md
- docs/RELEASE_CHECKLIST.md
First Five Minutes
- Freeze unrelated deploys and record incident start time, environment, release commit, and primary symptom.
- Check health and dependency readiness:
curl -sf "$NOTELETT_URL/health"
curl -sf "$NOTELETT_API_URL/bootstrap"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
- Check shared services:
curl -sf "$PLATFORM_URL/health"
curl -sf "$EXTRACTION_URL/health"
curl -sf "$MCP_ORIGIN/health"
- Check current release flags and kill switch in platform-service. Prefer disabling risky workflows before rolling back images.
- Capture a short evidence bundle: readiness response, recent backend errors, platform-service errors, relevant telemetry event names, affected user/workspace ids, and any recent deploy or migration id.
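The health and readiness checks above can be run as a single sweep. The following is a minimal sketch, not an existing NoteLett tool: `check_endpoint`, `first_five_sweep`, and the `CURL` override are hypothetical names introduced here, and the 10-second timeout is an arbitrary assumption. It uses only the URLs and variables already listed in this section.

```shell
#!/bin/sh
# Sketch only: check_endpoint, first_five_sweep, and CURL are hypothetical helpers.
# Set CURL=true to rehearse the sweep without touching the network.

check_endpoint() {
  # $1 = label, $2 = URL, remaining args = extra curl options (for example an auth header)
  ce_name=$1
  ce_url=$2
  shift 2
  if "${CURL:-curl}" -sf --max-time 10 -o /dev/null "$@" "$ce_url"; then
    printf 'OK   %s\n' "$ce_name"
  else
    printf 'FAIL %s\n' "$ce_name"
    return 1
  fi
}

first_five_sweep() {
  # Run every check even if an earlier one fails, then report an overall status.
  sweep_rc=0
  check_endpoint backend-health "$NOTELETT_URL/health" || sweep_rc=1
  check_endpoint backend-bootstrap "$NOTELETT_API_URL/bootstrap" || sweep_rc=1
  check_endpoint backend-readiness "$NOTELETT_URL/api/diagnostics/readiness" \
    -H "Authorization: Bearer $ADMIN_TOKEN" || sweep_rc=1
  check_endpoint platform-service "$PLATFORM_URL/health" || sweep_rc=1
  check_endpoint extraction-service "$EXTRACTION_URL/health" || sweep_rc=1
  check_endpoint mcp-server "$MCP_ORIGIN/health" || sweep_rc=1
  return "$sweep_rc"
}
```

Calling `first_five_sweep` prints one OK or FAIL line per dependency and returns non-zero if any check failed, so the output can be pasted directly into the evidence bundle.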
Dependency Outage Behavior
| Dependency | User-visible behavior | Immediate operator action | Recovery check |
|---|---|---|---|
| Cosmos/datastore | Most authenticated reads/writes fail or hang | Verify Cosmos account/container health, RU throttling, keys, and backend DB_PROVIDER; rollback only if code deploy caused query surge | Authenticated workspace/note create/read flow passes |
| platform-service | Auth, flags, telemetry, diagnostics, kill switch, and blob SAS may degrade | Keep backend alive if auth tokens still validate; disable risky features via cached/default flags where possible; restore platform-service first | /api/diagnostics/readiness reports platform-service healthy |
| extraction-service | Task extraction and note summarization fail with user-facing extraction-down state | Disable extraction-heavy UI if needed; verify sidecar health and extraction queue/cache endpoints | Minimal extract request in docs/PLATFORM_SMOKE_CHECKS.md passes |
| mcp-server | Agent tool registry/calls fail; product UI mostly continues | Keep web/mobile available; validate local tool registration count and MCP /health | GET $MCP_URL/tools shows NoteLett tools |
| Blob service | Upload/download attachments fail; existing note text remains available | Verify platform blob routes, storage account/container, SAS permissions, and token scope | SAS request plus small upload/delete smoke passes |
| LLM provider | Smart Actions, copilot, scheduled actions, URL extraction summaries, and Palace extraction degrade | Disable LLM-heavy flags; inspect provider credentials, model availability, rate limits, and retry logs | Mock or production provider prompt smoke passes |
Feature Flag Triage
Use platform-service flags and NoteLett defaults from docs/SEED_BOOTSTRAP_STRATEGY.md:
- Disable notelett_smart_actions_enabled for broad prompt failures.
- Disable notelett_scheduled_actions_enabled for runaway or stuck scheduler jobs.
- Disable notelett_webhooks_enabled for repeated webhook failures or external callback incidents.
- Disable notelett_intake_enabled if URL intake is causing queue/backpressure or LLM cost spikes.
- Disable notelett_collaborative_sharing_enabled for cross-user access concerns.
- Disable notelett_auto_summarize_enabled, notelett_auto_embed_enabled, and notelett_auto_link_enabled for background AI cost or mutation concerns.
Record every flag change with timestamp, operator, reason, and expected rollback condition.
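One way to keep that record consistent is to append a structured line per change. The sketch below is an illustration only: `record_flag_change` and the `FLAG_LOG` file are hypothetical, and the tab-separated layout is an assumption, not an established NoteLett convention.

```shell
# Hypothetical helper: append one tab-separated audit line per flag change.
# FLAG_LOG is an assumed local file; point it at wherever incident notes are kept.
record_flag_change() {
  # usage: record_flag_change FLAG NEW_STATE OPERATOR REASON ROLLBACK_CONDITION
  if [ "$#" -ne 5 ]; then
    echo "usage: record_flag_change FLAG NEW_STATE OPERATOR REASON ROLLBACK_CONDITION" >&2
    return 2
  fi
  # Columns: UTC timestamp, flag, new state, operator, reason, rollback condition.
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5" \
    >> "${FLAG_LOG:-./incident-flag-changes.tsv}"
}
```

For example, `record_flag_change notelett_webhooks_enabled false alice "repeated webhook failures" "webhook smoke passes"` appends a single line that can later be attached to the closeout notes.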
Stuck Scheduler Recovery
Symptoms:
- Scheduled prompts are due but no result notes appear.
- scheduled_action_fired drops to zero.
- Backend logs show Failed to run scheduled prompt.
- Users report weekly digest or scheduled actions missing.
Diagnosis:
curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" \
-H "Authorization: Bearer $TOKEN"
Check:
- notelett_scheduled_actions_enabled is true.
- Scheduler loop is running in the backend process.
- nextRunAt is not in the past for many enabled schedules.
- Built-in prompt templates were seeded with pnpm run seed:bootstrap.
- LLM provider and datastore readiness are healthy.
Recovery:
- If a schedule is malformed, disable it through PATCH /api/prompt-schedules/:id.
- If many schedules are due and failing, disable notelett_scheduled_actions_enabled.
- Restart the backend only after confirming no deploy/migration is currently running.
- Re-enable schedules gradually after one manual smoke schedule succeeds.
- Record affected schedule ids and whether missed runs need manual replay. Do not bulk-create catch-up notes without product-owner approval.
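Disabling a single malformed schedule could look like the sketch below. Two things here are assumptions rather than documented behavior: the request body field name `enabled` is a guess at the schedule schema, and `disable_schedule` with its `DRY_RUN` switch is a hypothetical helper. Confirm the real field name against the API before using it in production.

```shell
# Sketch only. ASSUMPTION: the PATCH body field is named "enabled"; this runbook does not confirm it.
# DRY_RUN=1 (hypothetical switch) prints the request instead of sending it.
disable_schedule() {
  # $1 = schedule id
  if [ -z "${1:-}" ]; then
    echo "usage: disable_schedule SCHEDULE_ID" >&2
    return 2
  fi
  ds_url="$NOTELETT_API_URL/prompt-schedules/$1"
  ds_body='{"enabled":false}'
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'PATCH %s %s\n' "$ds_url" "$ds_body"
    return 0
  fi
  curl -sf -X PATCH "$ds_url" \
    -H "Authorization: Bearer $TOKEN" \
    -H "content-type: application/json" \
    -d "$ds_body"
}
```

Running it first with `DRY_RUN=1` gives a line that can be copied into the incident record alongside the affected schedule id.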
Webhook Recovery
There are two webhook concepts:
- Prompt webhooks in note_prompt_webhooks, triggered through /api/prompt-webhooks/:id/trigger.
- Domain event dispatch targets held by backend/src/lib/webhook-subscriber.ts for product event delivery.
Symptoms:
- webhook_triggered events are absent or failing.
- External integrations report duplicate or missing callbacks.
- Backend logs show dispatch failures or repeated timeouts.
Diagnosis:
curl -sf "$NOTELETT_API_URL/prompt-webhooks" -H "Authorization: Bearer $TOKEN"
curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" -H "Authorization: Bearer $TOKEN"
Check:
- notelett_webhooks_enabled is true only when webhook smoke is expected.
- Target webhook is enabled and has correct workspaceId, templateId, triggerEvent, and tagFilter.
- Built-in/custom template resolves.
- External callback endpoint is reachable and not rejecting auth/signature.
Recovery:
- Disable the failing webhook with PATCH /api/prompt-webhooks/:id.
- If failures are broad or external abuse is suspected, disable notelett_webhooks_enabled.
- Retry one manual trigger after the dependency is healthy.
- For duplicate deliveries, compare external request ids/correlation ids before replay.
- Document delivery window and any events intentionally not replayed.
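For the duplicate-delivery comparison, a quick check is to list ids that occur more than once before deciding on any replay. This sketch assumes the external request or correlation ids have been exported to a text file with one id per line; that export format and the `find_duplicate_ids` name are assumptions made for illustration.

```shell
# Hypothetical helper: print every id that appears more than once in a file
# holding one request/correlation id per line (an assumed export format).
find_duplicate_ids() {
  # Skip blank lines, sort so equal ids are adjacent, then keep only repeated ids.
  grep -v '^[[:space:]]*$' "$1" | sort | uniq -d
}
```

An empty result suggests no id was delivered twice within the exported window; any id printed is a candidate to exclude from replay.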
Failed Blob Upload Recovery
Symptoms:
- Web artifact upload fails.
- Mobile image/attachment upload fails.
- Download links return 403/404.
Diagnosis:
curl -sf "$PLATFORM_API_URL/blob/containers" -H "Authorization: Bearer $TOKEN"
curl -sf "$PLATFORM_API_URL/blob/sas" \
-H "Authorization: Bearer $TOKEN" \
-H "content-type: application/json" \
-d '{"container":"attachments","blobName":"notelett/smoke/operator.txt","permissions":"rw","expiresInMinutes":10}'
Check:
- platform-service is healthy and can reach the storage account.
- The attachments container exists.
- SAS permissions include the needed operation.
- The client is using the shared @bytelyst/blob-client paths.
- Blob paths do not include unsafe file names or cross-product prefixes.
Recovery:
- If SAS issuance fails, restore platform-service/blob configuration first.
- If upload fails after SAS succeeds, check storage account/container permissions and CORS.
- Ask affected user to retry upload only after smoke succeeds.
- If a metadata row exists without blob content, either retry upload to the same path or delete the orphaned artifact metadata through the product API after owner approval.
- Never paste SAS URLs into tickets or logs; record only container, sanitized prefix, and error class.
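To make the last rule easier to follow, a SAS URL can be reduced to ticket-safe fields before anything is written down. The sketch below assumes a URL of the form scheme://host/container/prefix/file?query, where the signature travels in the query string; `sas_url_to_ticket_fields` is a hypothetical helper, and the exact URL layout returned by platform-service is not confirmed here.

```shell
# Hypothetical helper: reduce a SAS URL to container and prefix only.
# ASSUMPTION: URL layout is scheme://host/container/prefix/.../file?signed-query
sas_url_to_ticket_fields() {
  su_path=${1%%\?*}        # drop the query string, which carries the signature
  su_path=${su_path#*://}  # drop the scheme
  su_path=${su_path#*/}    # drop the host (storage account)
  su_container=${su_path%%/*}
  su_rest=${su_path#*/}
  case "$su_rest" in
    */*) su_prefix=${su_rest%/*} ;;  # drop the final file name, keep the prefix
    *)   su_prefix="" ;;             # blob sits directly in the container
  esac
  printf 'container=%s prefix=%s\n' "$su_container" "$su_prefix"
}
```

The output omits the storage host, the file name, and every query parameter, so it can be recorded next to the error class without exposing the token.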
Failed LLM Or Extraction Recovery
Symptoms:
- Prompt runs return errors.
- Intake jobs stay failed.
- Copilot transforms fail.
- Auto-summary or Palace extraction logs errors.
Diagnosis:
curl -sf "$EXTRACTION_URL/health"
curl -sf "$EXTRACTION_URL/api/extract/models"
curl -sf "$EXTRACTION_URL/api/extract/sidecar-health"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
Check backend env:
- LLM_PROVIDER
- LLM_DEFAULT_MODEL
- LLM_VISION_MODEL
- LLM_EMBEDDING_MODEL
- provider API credentials
- EXTRACTION_SERVICE_URL
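The environment check can be done without echoing any secret values. This is a minimal sketch with a hypothetical helper name, `missing_env_vars`; it inspects the shell it runs in, so it is only meaningful when executed inside the backend's own environment (for example a shell in the backend container), which is an assumption about how operators get access.

```shell
# Hypothetical helper: report which variables are unset or empty without printing values.
missing_env_vars() {
  me_missing=""
  for me_name in "$@"; do
    # Indirect lookup by name; names are operator-supplied identifiers, not untrusted input.
    eval "me_val=\${$me_name:-}"
    [ -n "$me_val" ] || me_missing="$me_missing $me_name"
  done
  me_val=""  # do not leave the last value lying around in the shell
  if [ -n "$me_missing" ]; then
    printf 'missing:%s\n' "$me_missing"
    return 1
  fi
  printf 'all set\n'
}
```

A typical call would be `missing_env_vars LLM_PROVIDER LLM_DEFAULT_MODEL LLM_VISION_MODEL LLM_EMBEDDING_MODEL EXTRACTION_SERVICE_URL`. Provider credential variable names differ per provider and are not listed in this runbook, so add them to the call as appropriate.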
Recovery:
- Disable LLM-heavy feature flags for broad failures.
- If only extraction-service is down, keep basic note CRUD/search available and surface extraction-down UI state.
- If provider rate limits are hit, lower traffic with flags and wait out/reset provider quotas.
- For failed intake jobs, users can resubmit the URL after dependency recovery; do not rewrite failed job status manually unless there is a documented replay tool.
- For prompt result corruption, preserve the generated note/artifact as evidence, then archive/delete only with user or product-owner approval.
Review Queue Recovery
Symptoms:
- Pending review queue is empty unexpectedly.
- Approve/reject fails or gets partial batch results.
- Agent action state appears stuck in draft or proposed.
Diagnosis:
curl -sf "$NOTELETT_API_URL/note-agent-actions/pending?limit=50" \
-H "Authorization: Bearer $TOKEN"
Check:
- user id and product id scope
- workspaceId query when patching a specific action
- encrypted field readiness for action summary/review fields
- batch-review response updated, not_found, and error counts
Recovery:
- Retry single review before retrying a large batch.
- If partial batch failure occurred, use returned ids to retry only failed/not-found records after confirming workspace scope.
- Preserve audit trail; do not directly mutate Cosmos unless product-owner approves an incident backfill.
- If cross-user access is suspected, disable sharing/collaboration flags and escalate.
MCP Action Recovery
Symptoms:
- Agent tools cannot list/create/update notes.
- MCP write tool created duplicate or stuck audit actions.
- Tool calls fail with auth/product scope errors.
Diagnosis:
curl -sf "$MCP_ORIGIN/health"
curl -sf "$MCP_URL/tools" -H "Authorization: Bearer $TOKEN"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
Check:
- NoteLett MCP tools are registered.
- Product backend health/readiness is green.
- Token role is sufficient for requested tool.
- Write tools used idempotency key, dry-run, and correlation id when available.
Recovery:
- Pause agent automation if duplicate writes or scope concerns appear.
- Use note-agent-actions records to identify applied/proposed actions.
- Reject or archive unwanted proposed actions through the product API.
- For already-applied note changes, use note version history or restore flow where available.
- Keep MCP disabled until a dry-run tool smoke succeeds.
Communication And Closeout
During incident:
- Post status every 15-30 minutes with impact, current hypothesis, mitigation, and next check.
- Name exact degraded workflows: web note CRUD, mobile capture, Smart Actions, intake, reviews, MCP, or sharing.
- Avoid exposing secrets, share tokens, note text, prompt text, full URLs, or raw LLM output in status updates.
Closeout requires:
- all affected health/readiness/smoke checks pass
- feature flags restored or explicitly left disabled with owner/date
- release or rollback commit recorded
- telemetry and diagnostics event names captured
- data migration/backfill, if any, documented with counts and rollback decision
- follow-up issue or roadmap item for every residual risk
Verification
For runbook-only changes:
git diff --check
rg -n "OPERATOR_RUNBOOK|Stuck Scheduler Recovery|Failed Blob Upload Recovery|Failed LLM Or Extraction Recovery|MCP Action Recovery" docs README.md