docs(ops): add operator runbook
Parent: 95f252a520
Commit: 57a7e10bc9
@@ -139,3 +139,4 @@ Current baseline note: after common-platform workspace alignment, `pnpm install
- [`docs/SEED_BOOTSTRAP_STRATEGY.md`](docs/SEED_BOOTSTRAP_STRATEGY.md) — Built-in prompt, intake rule, onboarding workspace, and feature-flag bootstrap strategy
- [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md) — Encrypted-field, schema-change, and backfill migration plan
- [`docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`](docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md) — Event taxonomy and diagnostic breadcrumb contract
- [`docs/OPERATOR_RUNBOOK.md`](docs/OPERATOR_RUNBOOK.md) — Incident triage and recovery steps for dependencies, scheduler/webhooks, blob, LLM/extraction, reviews, and MCP
286
docs/OPERATOR_RUNBOOK.md
Normal file
@@ -0,0 +1,286 @@
# NoteLett Operator Runbook

Date: May 5, 2026
Product ID: `notelett`
Primary service: NoteLett backend `4016`
Shared services: platform-service `4003`, extraction-service `4005`, mcp-server `4007`

## Scope

This runbook covers production incident triage, dependency outage behavior, stuck scheduler/webhook recovery, failed blob upload recovery, failed LLM/extraction recovery, review queue recovery, and MCP action recovery. It assumes the operator has access to service logs, platform-service, Cosmos metrics, and a valid admin/owner token for diagnostics.

Use this alongside:

- `docs/PLATFORM_SMOKE_CHECKS.md`
- `docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`
- `docs/COSMOS_DATA_OPERATIONS.md`
- `docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`
- `docs/RELEASE_CHECKLIST.md`

## First Five Minutes

1. Freeze unrelated deploys and record incident start time, environment, release commit, and primary symptom.
2. Check health and dependency readiness:

```bash
curl -sf "$NOTELETT_URL/health"
curl -sf "$NOTELETT_API_URL/bootstrap"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
```

3. Check shared services:

```bash
curl -sf "$PLATFORM_URL/health"
curl -sf "$EXTRACTION_URL/health"
curl -sf "$MCP_ORIGIN/health"
```

4. Check current release flags and kill switch in platform-service. Prefer disabling risky workflows before rolling back images.
5. Capture a short evidence bundle: readiness response, recent backend errors, platform-service errors, relevant telemetry event names, affected user/workspace ids, and any recent deploy or migration id.
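
The health checks in steps 2 and 3 can be swept in one pass so that a single failing endpoint does not hide the others. The sketch below assumes the same environment variables as the commands above; `check_health` is a hypothetical helper name, not something the repository provides, and the 10-second timeout is an arbitrary choice.

```bash
# Hypothetical helper: probe each URL and print ok/FAIL per endpoint.
# Every URL is checked; the return status is non-zero if any probe failed.
check_health() {
  failed=0
  for url in "$@"; do
    if curl -sf --max-time 10 -o /dev/null "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

One possible call is `check_health "$NOTELETT_URL/health" "$PLATFORM_URL/health" "$EXTRACTION_URL/health" "$MCP_ORIGIN/health"`. The authenticated readiness call still needs its `Authorization` header and is best run separately.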

## Dependency Outage Behavior

| Dependency | User-visible behavior | Immediate operator action | Recovery check |
| --- | --- | --- | --- |
| Cosmos/datastore | Most authenticated reads/writes fail or hang | Verify Cosmos account/container health, RU throttling, keys, and backend `DB_PROVIDER`; rollback only if code deploy caused query surge | Authenticated workspace/note create/read flow passes |
| platform-service | Auth, flags, telemetry, diagnostics, kill switch, and blob SAS may degrade | Keep backend alive if auth tokens still validate; disable risky features via cached/default flags where possible; restore platform-service first | `/api/diagnostics/readiness` reports platform-service healthy |
| extraction-service | Task extraction and note summarization fail with user-facing extraction-down state | Disable extraction-heavy UI if needed; verify sidecar health and extraction queue/cache endpoints | Minimal extract request in `docs/PLATFORM_SMOKE_CHECKS.md` passes |
| mcp-server | Agent tool registry/calls fail; product UI mostly continues | Keep web/mobile available; validate local tool registration count and MCP `/health` | `GET $MCP_URL/tools` shows NoteLett tools |
| Blob service | Upload/download attachments fail; existing note text remains available | Verify platform blob routes, storage account/container, SAS permissions, and token scope | SAS request plus small upload/delete smoke passes |
| LLM provider | Smart Actions, copilot, scheduled actions, URL extraction summaries, and Palace extraction degrade | Disable LLM-heavy flags; inspect provider credentials, model availability, rate limits, and retry logs | Mock or production provider prompt smoke passes |

## Feature Flag Triage

Use platform-service flags and NoteLett defaults from `docs/SEED_BOOTSTRAP_STRATEGY.md`:

- Disable `notelett_smart_actions_enabled` for broad prompt failures.
- Disable `notelett_scheduled_actions_enabled` for runaway or stuck scheduler jobs.
- Disable `notelett_webhooks_enabled` for repeated webhook failures or external callback incidents.
- Disable `notelett_intake_enabled` if URL intake is causing queue/backpressure or LLM cost spikes.
- Disable `notelett_collaborative_sharing_enabled` for cross-user access concerns.
- Disable `notelett_auto_summarize_enabled`, `notelett_auto_embed_enabled`, and `notelett_auto_link_enabled` for background AI cost or mutation concerns.

Record every flag change with timestamp, operator, reason, and expected rollback condition.
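
One way to keep that record consistent is to append a line per change to a local log during the incident. This is only a sketch: `record_flag_change`, the `FLAG_LOG` path, and the `OPERATOR` variable are hypothetical names chosen for illustration, and the tab-separated layout is an assumption rather than an established format.

```bash
# Hypothetical helper: append one tab-separated record per flag change.
# Columns: UTC timestamp, operator, flag, new value, reason, rollback condition.
# FLAG_LOG and OPERATOR are assumed names; neither is defined by the repository.
record_flag_change() {
  flag=$1 new_value=$2 reason=$3 rollback_when=$4
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${OPERATOR:-$(id -un)}" \
    "$flag" "$new_value" "$reason" "$rollback_when" \
    >> "${FLAG_LOG:-./flag-changes.tsv}"
}
```

For example, `record_flag_change notelett_webhooks_enabled false "callback timeouts" "partner endpoint healthy"` adds one line that can later be pasted into the incident closeout.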

## Stuck Scheduler Recovery

Symptoms:

- Scheduled prompts are due but no result notes appear.
- `scheduled_action_fired` drops to zero.
- Backend logs show `Failed to run scheduled prompt`.
- Users report weekly digest or scheduled actions missing.

Diagnosis:

```bash
curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" \
  -H "Authorization: Bearer $TOKEN"
```

Check:

- `notelett_scheduled_actions_enabled` is true.
- Scheduler loop is running in the backend process.
- Only a few enabled schedules, if any, have a `nextRunAt` in the past.
- Built-in prompt templates were seeded with `pnpm run seed:bootstrap`.
- LLM provider and datastore readiness are healthy.

Recovery:

1. If a schedule is malformed, patch it disabled through `PATCH /api/prompt-schedules/:id`.
2. If many schedules are due and failing, disable `notelett_scheduled_actions_enabled`.
3. Restart the backend only after confirming no deploy/migration is currently running.
4. Re-enable schedules gradually after one manual smoke schedule succeeds.
5. Record affected schedule ids and whether missed runs need manual replay. Do not bulk-create catch-up notes without product-owner approval.
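
To judge how many schedules are overdue, the `nextRunAt` values from the check list above can be compared against the current UTC time. The sketch below is a hypothetical helper, not an existing tool. It assumes the timestamps have already been pulled out of the diagnostics response (whose exact shape is not documented here) as one UTC ISO-8601 value per line, all in the same format.

```bash
# Hypothetical helper: count timestamps on stdin that are earlier than "now".
# Assumes one UTC ISO-8601 value per line (e.g. 2026-05-05T09:00:00Z), so a
# plain string comparison in the C locale matches chronological order.
# An optional first argument overrides "now" for testing or replay.
count_overdue() {
  now=${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  LC_ALL=C awk -v now="$now" '$0 != "" && $0 < now { n++ } END { print n + 0 }'
}
```

A count that keeps growing between two runs a few minutes apart suggests the scheduler loop is not draining due schedules.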

## Webhook Recovery

There are two webhook concepts:

- Prompt webhooks in `note_prompt_webhooks`, triggered through `/api/prompt-webhooks/:id/trigger`.
- Domain event dispatch targets held by `backend/src/lib/webhook-subscriber.ts` for product event delivery.

Symptoms:

- `webhook_triggered` events are absent or failing.
- External integrations report duplicate or missing callbacks.
- Backend logs show dispatch failures or repeated timeouts.

Diagnosis:

```bash
curl -sf "$NOTELETT_API_URL/prompt-webhooks" -H "Authorization: Bearer $TOKEN"
curl -sf "$NOTELETT_API_URL/prompt-schedules/diagnostics" -H "Authorization: Bearer $TOKEN"
```

Check:

- `notelett_webhooks_enabled` is true only when webhook smoke is expected.
- Target webhook is enabled and has correct `workspaceId`, `templateId`, `triggerEvent`, and `tagFilter`.
- Built-in/custom template resolves.
- External callback endpoint is reachable and not rejecting auth/signature.

Recovery:

1. Disable the failing webhook with `PATCH /api/prompt-webhooks/:id`.
2. If failures are broad or external abuse is suspected, disable `notelett_webhooks_enabled`.
3. Retry one manual trigger after the dependency is healthy.
4. For duplicate deliveries, compare external request ids/correlation ids before replay.
5. Document delivery window and any events intentionally not replayed.
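
For the duplicate-delivery comparison in step 4, a quick way to spot repeated ids is to sort them and keep only the repeats. This is a minimal sketch that assumes the request or correlation ids have been exported to plain text, one id per line; `duplicate_ids` is a hypothetical helper name.

```bash
# Hypothetical helper: print each id that occurs more than once on stdin.
# Input: one request/correlation id per line, in any order.
duplicate_ids() {
  LC_ALL=C sort | uniq -d
}
```

An empty result means no id was delivered twice in the exported window, which is worth confirming before any replay.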

## Failed Blob Upload Recovery

Symptoms:

- Web artifact upload fails.
- Mobile image/attachment upload fails.
- Download links return 403/404.

Diagnosis:

```bash
curl -sf "$PLATFORM_API_URL/blob/containers" -H "Authorization: Bearer $TOKEN"
curl -sf "$PLATFORM_API_URL/blob/sas" \
  -H "Authorization: Bearer $TOKEN" \
  -H "content-type: application/json" \
  -d '{"container":"attachments","blobName":"notelett/smoke/operator.txt","permissions":"rw","expiresInMinutes":10}'
```

Check:

- platform-service is healthy and can reach the storage account.
- The `attachments` container exists.
- SAS permissions include the needed operation.
- The client is using the shared `@bytelyst/blob-client` paths.
- Blob paths do not include unsafe file names or cross-product prefixes.

Recovery:

1. If SAS issuance fails, restore platform-service/blob configuration first.
2. If upload fails after SAS succeeds, check storage account/container permissions and CORS.
3. Ask affected user to retry upload only after smoke succeeds.
4. If a metadata row exists without blob content, either retry upload to the same path or delete the orphaned artifact metadata through the product API after owner approval.
5. Never paste SAS URLs into tickets or logs; record only container, sanitized prefix, and error class.
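
As a safety net before copying an error line into a ticket, the query string that carries the SAS token can be stripped mechanically. A minimal sketch, assuming the URL appears in the text as a plain `http(s)://` URL; `redact_sas` is a hypothetical helper and `storage.example` in the usage line is a placeholder host.

```bash
# Hypothetical helper: replace the query string of any URL on stdin.
# The SAS token travels in the query string, so only scheme, host, and path
# survive. The path may still contain a blob name; trim it further if needed.
redact_sas() {
  sed -E 's#(https?://[^[:space:]?"]+)\?[^[:space:]"]*#\1?<redacted>#g'
}
```

For example, `echo "PUT https://storage.example/attachments/notelett/smoke/operator.txt?sig=abc failed 403" | redact_sas` keeps the path and status but drops everything after the `?`.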

## Failed LLM Or Extraction Recovery

Symptoms:

- Prompt runs return errors.
- Intake jobs stay failed.
- Copilot transforms fail.
- Auto-summary or Palace extraction logs errors.

Diagnosis:

```bash
curl -sf "$EXTRACTION_URL/health"
curl -sf "$EXTRACTION_URL/api/extract/models"
curl -sf "$EXTRACTION_URL/api/extract/sidecar-health"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
```

Check backend env:

- `LLM_PROVIDER`
- `LLM_DEFAULT_MODEL`
- `LLM_VISION_MODEL`
- `LLM_EMBEDDING_MODEL`
- provider API credentials
- `EXTRACTION_SERVICE_URL`

Recovery:

1. Disable LLM-heavy feature flags for broad failures.
2. If only extraction-service is down, keep basic note CRUD/search available and surface extraction-down UI state.
3. If provider rate limits are hit, lower traffic with flags, then wait out or reset the provider quotas.
4. For failed intake jobs, users can resubmit the URL after dependency recovery; do not rewrite failed job status manually unless there is a documented replay tool.
5. For prompt result corruption, preserve the generated note/artifact as evidence, then archive/delete only with user or product-owner approval.
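
The backend environment check listed above can be done without echoing any secret values. The sketch below is a hypothetical helper that only reports which names are unset or empty; it has to run inside the backend's own environment, and the credential variable names depend on the provider, so they are left for the operator to supply.

```bash
# Hypothetical helper: report which variables are unset or empty,
# without ever printing their values. Pass only trusted, literal names,
# because each name is expanded through eval.
check_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING $name"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible call is `check_env LLM_PROVIDER LLM_DEFAULT_MODEL LLM_VISION_MODEL LLM_EMBEDDING_MODEL EXTRACTION_SERVICE_URL`, with the provider credential names appended.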

## Review Queue Recovery

Symptoms:

- Pending review queue is empty unexpectedly.
- Approve/reject fails or returns partial batch results.
- Agent action state appears stuck in `draft` or `proposed`.

Diagnosis:

```bash
curl -sf "$NOTELETT_API_URL/note-agent-actions/pending?limit=50" \
  -H "Authorization: Bearer $TOKEN"
```

Check:

- user id and product id scope
- `workspaceId` query when patching a specific action
- encrypted field readiness for action summary/review fields
- batch-review response `updated`, `not_found`, and `error` counts

Recovery:

1. Retry single review before retrying a large batch.
2. If partial batch failure occurred, use returned ids to retry only failed/not-found records after confirming workspace scope.
3. Preserve audit trail; do not directly mutate Cosmos unless product-owner approves an incident backfill.
4. If cross-user access is suspected, disable sharing/collaboration flags and escalate.

## MCP Action Recovery

Symptoms:

- Agent tools cannot list/create/update notes.
- MCP write tool created duplicate or stuck audit actions.
- Tool calls fail with auth/product scope errors.

Diagnosis:

```bash
curl -sf "$MCP_ORIGIN/health"
curl -sf "$MCP_URL/tools" -H "Authorization: Bearer $TOKEN"
curl -sf "$NOTELETT_URL/api/diagnostics/readiness" -H "Authorization: Bearer $ADMIN_TOKEN"
```

Check:

- NoteLett MCP tools are registered.
- Product backend health/readiness is green.
- Token role is sufficient for requested tool.
- Write tools used idempotency key, dry-run, and correlation id when available.

Recovery:

1. Pause agent automation if duplicate writes or scope concerns appear.
2. Use `note-agent-actions` records to identify applied/proposed actions.
3. Reject or archive unwanted proposed actions through the product API.
4. For already-applied note changes, use note version history or restore flow where available.
5. Keep MCP disabled until a dry-run tool smoke succeeds.

## Communication And Closeout

During incident:

- Post status every 15-30 minutes with impact, current hypothesis, mitigation, and next check.
- Name exact degraded workflows: web note CRUD, mobile capture, Smart Actions, intake, reviews, MCP, or sharing.
- Avoid exposing secrets, share tokens, note text, prompt text, full URLs, or raw LLM output in status updates.

Closeout requires:

- all affected health/readiness/smoke checks pass
- feature flags restored or explicitly left disabled with owner/date
- release or rollback commit recorded
- telemetry and diagnostics event names captured
- data migration/backfill, if any, documented with counts and rollback decision
- follow-up issue or roadmap item for every residual risk

## Verification

For runbook-only changes:

```bash
git diff --check
rg -n "OPERATOR_RUNBOOK|Stuck Scheduler Recovery|Failed Blob Upload Recovery|Failed LLM Or Extraction Recovery|MCP Action Recovery" docs README.md
```
@@ -96,6 +96,7 @@ Do not place secrets in `NEXT_PUBLIC_*` or `EXPO_PUBLIC_*` variables.
## Pre-Deploy Checklist

- Confirm release commit is pushed and CI is green.
- Confirm `docs/OPERATOR_RUNBOOK.md` has the current incident owner, service URLs, and rollback target for this environment.
- Confirm `pnpm run audit:release-guards` passes.
- Confirm `pnpm run dependency:health` has no typecheck failures; review the non-blocking outdated report.
- Confirm backend, web, and mobile tests from `docs/PRODUCTION_READINESS_HANDOFF_ROADMAP.md` P10 have passed or have explicit release-owner signoff.