Saravana Achu Mac 57a7e10bc9 docs(ops): add operator runbook

2026-05-05 13:53:59 -07:00

12 KiB

Raw Blame History

NoteLett Operator Runbook

Date: May 5, 2026 Product ID: notelett Primary service: NoteLett backend 4016 Shared services: platform-service 4003, extraction-service 4005, mcp-server 4007

Scope

This runbook covers production incident triage, dependency outage behavior, stuck scheduler/webhook recovery, failed blob upload recovery, and failed LLM/extraction recovery. It assumes the operator has access to service logs, platform-service, Cosmos metrics, and a valid admin/owner token for diagnostics.

Use this alongside:

docs/PLATFORM_SMOKE_CHECKS.md
docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md
docs/COSMOS_DATA_OPERATIONS.md
docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md
docs/RELEASE_CHECKLIST.md

First Five Minutes

Freeze unrelated deploys and record incident start time, environment, release commit, and primary symptom.
Check health and dependency readiness:

curl -sf "$NOTELETT_URL/health"
curl -sf "$NOTELETT_API_URL/bootstrap"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"

Check shared services:

curl -sf "$PLATFORM_URL/health"
curl -sf "$EXTRACTION_URL/health"
curl -sf "$MCP_ORIGIN/health"

Check current release flags and kill switch in platform-service. Prefer disabling risky workflows before rolling back images.
Capture a short evidence bundle: readiness response, recent backend errors, platform-service errors, relevant telemetry event names, affected user/workspace ids, and any recent deploy or migration id.

Dependency Outage Behavior

Dependency	User-visible behavior	Immediate operator action	Recovery check
Cosmos/datastore	Most authenticated reads/writes fail or hang	Verify Cosmos account/container health, RU throttling, keys, and backend `DB_PROVIDER`; rollback only if code deploy caused query surge	Authenticated workspace/note create/read flow passes
platform-service	Auth, flags, telemetry, diagnostics, kill switch, and blob SAS may degrade	Keep backend alive if auth tokens still validate; disable risky features via cached/default flags where possible; restore platform-service first	`/api/diagnostics/readiness` reports platform-service healthy
extraction-service	Task extraction and note summarization fail with user-facing extraction-down state	Disable extraction-heavy UI if needed; verify sidecar health and extraction queue/cache endpoints	Minimal extract request in `docs/PLATFORM_SMOKE_CHECKS.md` passes
mcp-server	Agent tool registry/calls fail; product UI mostly continues	Keep web/mobile available; validate local tool registration count and MCP `/health`	`GET $MCP_URL/tools` shows NoteLett tools
Blob service	Upload/download attachments fail; existing note text remains available	Verify platform blob routes, storage account/container, SAS permissions, and token scope	SAS request plus small upload/delete smoke passes
LLM provider	Smart Actions, copilot, scheduled actions, URL extraction summaries, and Palace extraction degrade	Disable LLM-heavy flags; inspect provider credentials, model availability, rate limits, and retry logs	Mock or production provider prompt smoke passes

Feature Flag Triage

Use platform-service flags and NoteLett defaults from docs/SEED_BOOTSTRAP_STRATEGY.md:

Disable notelett_smart_actions_enabled for broad prompt failures.
Disable notelett_scheduled_actions_enabled for runaway or stuck scheduler jobs.
Disable notelett_webhooks_enabled for repeated webhook failures or external callback incidents.
Disable notelett_intake_enabled if URL intake is causing queue/backpressure or LLM cost spikes.
Disable notelett_collaborative_sharing_enabled for cross-user access concerns.
Disable notelett_auto_summarize_enabled, notelett_auto_embed_enabled, and notelett_auto_link_enabled for background AI cost or mutation concerns.

Record every flag change with timestamp, operator, reason, and expected rollback condition.

Stuck Scheduler Recovery

Symptoms:

Scheduled prompts are due but no result notes appear.
scheduled_action_fired drops to zero.
Backend logs show Failed to run scheduled prompt.
Users report weekly digest or scheduled actions missing.

Diagnosis:

curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" \
  -H "Authorization: Bearer $TOKEN"

Check:

notelett_scheduled_actions_enabled is true.
Scheduler loop is running in the backend process.
nextRunAt is not in the past for many enabled schedules.
Built-in prompt templates were seeded with pnpm run seed:bootstrap.
LLM provider and datastore readiness are healthy.

Recovery:

If a schedule is malformed, patch it disabled through PATCH /api/prompt-schedules/:id.
If many schedules are due and failing, disable notelett_scheduled_actions_enabled.
Restart the backend only after confirming no deploy/migration is currently running.
Re-enable schedules gradually after one manual smoke schedule succeeds.
Record affected schedule ids and whether missed runs need manual replay. Do not bulk-create catch-up notes without product-owner approval.

Webhook Recovery

There are two webhook concepts:

Prompt webhooks in note_prompt_webhooks, triggered through /api/prompt-webhooks/:id/trigger.
Domain event dispatch targets held by backend/src/lib/webhook-subscriber.ts for product event delivery.

Symptoms:

webhook_triggered events are absent or failing.
External integrations report duplicate or missing callbacks.
Backend logs show dispatch failures or repeated timeouts.

Diagnosis:

curl -sf "$NOTELETT_API_URL/prompt-webhooks" -H "Authorization: Bearer $TOKEN"
curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" -H "Authorization: Bearer $TOKEN"

Check:

notelett_webhooks_enabled is true only when webhook smoke is expected.
Target webhook is enabled and has correct workspaceId, templateId, triggerEvent, and tagFilter.
Built-in/custom template resolves.
External callback endpoint is reachable and not rejecting auth/signature.

Recovery:

Disable the failing webhook with PATCH /api/prompt-webhooks/:id.
If failures are broad or external abuse is suspected, disable notelett_webhooks_enabled.
Retry one manual trigger after the dependency is healthy.
For duplicate deliveries, compare external request ids/correlation ids before replay.
Document delivery window and any events intentionally not replayed.

Failed Blob Upload Recovery

Symptoms:

Web artifact upload fails.
Mobile image/attachment upload fails.
Download links return 403/404.

Diagnosis:

curl -sf "$PLATFORM_API_URL/blob/containers" -H "Authorization: Bearer $TOKEN"
curl -sf "$PLATFORM_API_URL/blob/sas" \
  -H "Authorization: Bearer $TOKEN" \
  -H "content-type: application/json" \
  -d '{"container":"attachments","blobName":"notelett/smoke/operator.txt","permissions":"rw","expiresInMinutes":10}'

Check:

platform-service is healthy and can reach the storage account.
The attachments container exists.
SAS permissions include the needed operation.
The client is using the shared @bytelyst/blob-client paths.
Blob paths do not include unsafe file names or cross-product prefixes.

Recovery:

If SAS issuance fails, restore platform-service/blob configuration first.
If upload fails after SAS succeeds, check storage account/container permissions and CORS.
Ask affected user to retry upload only after smoke succeeds.
If a metadata row exists without blob content, either retry upload to the same path or delete the orphaned artifact metadata through the product API after owner approval.
Never paste SAS URLs into tickets or logs; record only container, sanitized prefix, and error class.

Failed LLM Or Extraction Recovery

Symptoms:

Prompt runs return errors.
Intake jobs stay failed.
Copilot transforms fail.
Auto-summary or Palace extraction logs errors.

Diagnosis:

curl -sf "$EXTRACTION_URL/health"
curl -sf "$EXTRACTION_URL/api/extract/models"
curl -sf "$EXTRACTION_URL/api/extract/sidecar-health"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"

Check backend env:

LLM_PROVIDER
LLM_DEFAULT_MODEL
LLM_VISION_MODEL
LLM_EMBEDDING_MODEL
provider API credentials
EXTRACTION_SERVICE_URL

Recovery:

Disable LLM-heavy feature flags for broad failures.
If only extraction-service is down, keep basic note CRUD/search available and surface extraction-down UI state.
If provider rate limits are hit, lower traffic with flags and wait out/reset provider quotas.
For failed intake jobs, users can resubmit the URL after dependency recovery; do not rewrite failed job status manually unless there is a documented replay tool.
For prompt result corruption, preserve the generated note/artifact as evidence, then archive/delete only with user or product-owner approval.

Review Queue Recovery

Symptoms:

Pending review queue is empty unexpectedly.
Approve/reject fails or gets partial batch results.
Agent action state appears stuck in draft or proposed.

Diagnosis:

curl -sf "$NOTELETT_API_URL/note-agent-actions/pending?limit=50" \
  -H "Authorization: Bearer $TOKEN"

Check:

user id and product id scope
workspaceId query when patching a specific action
encrypted field readiness for action summary/review fields
batch-review response updated, not_found, and error counts

Recovery:

Retry single review before retrying a large batch.
If partial batch failure occurred, use returned ids to retry only failed/not-found records after confirming workspace scope.
Preserve audit trail; do not directly mutate Cosmos unless product-owner approves an incident backfill.
If cross-user access is suspected, disable sharing/collaboration flags and escalate.

MCP Action Recovery

Symptoms:

Agent tools cannot list/create/update notes.
MCP write tool created duplicate or stuck audit actions.
Tool calls fail with auth/product scope errors.

Diagnosis:

curl -sf "$MCP_ORIGIN/health"
curl -sf "$MCP_URL/tools" -H "Authorization: Bearer $TOKEN"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"

Check:

NoteLett MCP tools are registered.
Product backend health/readiness is green.
Token role is sufficient for requested tool.
Write tools used idempotency key, dry-run, and correlation id when available.

Recovery:

Pause agent automation if duplicate writes or scope concerns appear.
Use note-agent-actions records to identify applied/proposed actions.
Reject or archive unwanted proposed actions through the product API.
For already-applied note changes, use note version history or restore flow where available.
Keep MCP disabled until a dry-run tool smoke succeeds.

Communication And Closeout

During incident:

Post status every 15-30 minutes with impact, current hypothesis, mitigation, and next check.
Name exact degraded workflows: web note CRUD, mobile capture, Smart Actions, intake, reviews, MCP, or sharing.
Avoid exposing secrets, share tokens, note text, prompt text, full URLs, or raw LLM output in status updates.

Closeout requires:

all affected health/readiness/smoke checks pass
feature flags restored or explicitly left disabled with owner/date
release or rollback commit recorded
telemetry and diagnostics event names captured
data migration/backfill, if any, documented with counts and rollback decision
follow-up issue or roadmap item for every residual risk

Verification

For runbook-only changes:

git diff --check
rg -n "OPERATOR_RUNBOOK|Stuck Scheduler Recovery|Failed Blob Upload Recovery|Failed LLM Or Extraction Recovery|MCP Action Recovery" docs README.md

12 KiB Raw Blame History

NoteLett Operator Runbook

Scope

First Five Minutes

Dependency Outage Behavior

Feature Flag Triage

Stuck Scheduler Recovery

Webhook Recovery

Failed Blob Upload Recovery

Failed LLM Or Extraction Recovery

Review Queue Recovery

MCP Action Recovery

Communication And Closeout

Verification

12 KiB

Raw Blame History