From 40b62bf3a6ab7202ce692899b3c102c6950bdac7 Mon Sep 17 00:00:00 2001
From: Saravana Achu Mac
Date: Tue, 5 May 2026 13:51:28 -0700
Subject: [PATCH] docs(ops): define telemetry taxonomy

---
 README.md                                  |   1 +
 docs/RELEASE_CHECKLIST.md                  |   1 +
 docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md | 145 +++++++++++++++++++++
 3 files changed, 147 insertions(+)
 create mode 100644 docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md

diff --git a/README.md b/README.md
index b4a404a..a4513e3 100644
--- a/README.md
+++ b/README.md
@@ -138,3 +138,4 @@ Current baseline note: after common-platform workspace alignment, `pnpm install
 - [`docs/COSMOS_DATA_OPERATIONS.md`](docs/COSMOS_DATA_OPERATIONS.md) — Cosmos containers, indexes, retention, and backup/restore approach
 - [`docs/SEED_BOOTSTRAP_STRATEGY.md`](docs/SEED_BOOTSTRAP_STRATEGY.md) — Built-in prompt, intake rule, onboarding workspace, and feature-flag bootstrap strategy
 - [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md) — Encrypted-field, schema-change, and backfill migration plan
+- [`docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`](docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md) — Event taxonomy and diagnostic breadcrumb contract
diff --git a/docs/RELEASE_CHECKLIST.md b/docs/RELEASE_CHECKLIST.md
index 31ed08e..d5608cf 100644
--- a/docs/RELEASE_CHECKLIST.md
+++ b/docs/RELEASE_CHECKLIST.md
@@ -101,6 +101,7 @@ Do not place secrets in `NEXT_PUBLIC_*` or `EXPO_PUBLIC_*` variables.
 - Confirm backend, web, and mobile tests from `docs/PRODUCTION_READINESS_HANDOFF_ROADMAP.md` P10 have passed or have explicit release-owner signoff.
 - Confirm Docker images build in CI.
 - Confirm common-platform services are deployed and reachable: platform-service, extraction-service, mcp-server, telemetry, diagnostics, flags, kill switch, blob.
+- Confirm telemetry events and diagnostic breadcrumbs follow `docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md`.
 - Confirm Cosmos database, containers, partition keys, backups, and retention policy are ready.
 - Confirm field encryption provider and key material are ready.
 - Confirm feature flags and kill switch defaults are safe for release.
diff --git a/docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md b/docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md
new file mode 100644
index 0000000..30bcdbc
--- /dev/null
+++ b/docs/TELEMETRY_AND_DIAGNOSTICS_TAXONOMY.md
@@ -0,0 +1,145 @@
+# NoteLett Telemetry And Diagnostics Taxonomy
+
+Date: May 5, 2026
+Product ID: `notelett`
+Common platform sources:
+
+- `../learning_ai/learning_ai_common_plat/docs/ecosystem/ECOSYSTEM_EVENT_TAXONOMY.md`
+- `../learning_ai/learning_ai_common_plat/docs/design/CLIENT_TELEMETRY_DESIGN.md`
+- `@bytelyst/backend-telemetry`
+- `@bytelyst/telemetry-client`
+- `@bytelyst/diagnostics-client`
+
+## Purpose
+
+This document defines the production event vocabulary and diagnostic breadcrumb contract for NoteLett. It is intentionally product-specific while staying aligned with common-platform telemetry and ecosystem event naming.
+
+Telemetry events answer "what happened?" Breadcrumbs answer "what led up to the failure report?"
+
+## Event Naming Rules
+
+- Use dot-separated lowercase names, for example `note.created` or `mcp.tool.called`.
+- Prefer stable facts over UI labels.
+- Keep `productId`, user/install identity, surface, channel, request id, correlation id, and release metadata in the event envelope or metadata.
+- Never include note body text, prompt text, extracted URL text, blob paths, share tokens, JWTs, API keys, email addresses, or raw LLM responses.
+- Use counts, booleans, durations, ids, content type, status, feature flag key, model/provider name, and error class instead of sensitive payloads.
+
+The backend currently buffers events through `backend/src/lib/telemetry.ts`; web/mobile use common-platform telemetry clients configured in `web/src/lib/telemetry.ts` and `mobile/src/lib/platform.ts`.
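The naming and payload rules above could be enforced with a small guard in front of whatever function emits events. The following is a minimal sketch, not the project's implementation: `isValidEventName`, `sanitizeMetadata`, and the key names in `FORBIDDEN_KEYS` are all hypothetical and would need to match the real field names. Several events listed in this taxonomy (for example `smart_action_run`) are not in dotted form, so a check like this would most plausibly apply to new names only.

```typescript
// Hypothetical helpers; none of these names come from the NoteLett codebase.

// Dot-separated lowercase segments; underscores are allowed inside a segment
// (e.g. `note.share_created`, `mcp.tool.called`).
const EVENT_NAME_PATTERN = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$/;

function isValidEventName(name: string): boolean {
  return EVENT_NAME_PATTERN.test(name);
}

// Assumed key names for the payloads the rules forbid.
const FORBIDDEN_KEYS = new Set<string>([
  "body", "noteBody", "promptText", "url", "blobPath",
  "shareToken", "jwt", "apiKey", "email", "llmResponse",
]);

type MetadataValue = string | number | boolean;

// Keeps only counts, booleans, and short strings under non-forbidden keys.
function sanitizeMetadata(
  input: Record<string, unknown>,
): Record<string, MetadataValue> {
  const output: Record<string, MetadataValue> = {};
  for (const [key, value] of Object.entries(input)) {
    if (FORBIDDEN_KEYS.has(key)) continue;
    if (typeof value === "number" || typeof value === "boolean") {
      output[key] = value;
    } else if (typeof value === "string" && value.length <= 128) {
      output[key] = value;
    }
    // Objects, arrays, and long strings are dropped rather than serialized.
  }
  return output;
}
```

The 128-character limit is an arbitrary illustrative bound. A key denylist only catches fields that are named predictably, so it complements code review rather than replacing it.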
+
+## Required Metadata
+
+Every new event should include the fields below when available:
+
+| Metadata | Source |
+| --- | --- |
+| `productId` | common client/backend config |
+| `surface` or `channel` | `backend`, `notes_web`, `notelett_mobile`, `mcp` |
+| `userId` or `anonymousInstallId` | auth/session layer |
+| `requestId` | `x-request-id` / generated outbound request id |
+| `correlationId` | MCP/tool/action/workflow id when available |
+| `workspaceId` | workspace-scoped product flows |
+| `noteId` | note-scoped product flows |
+| `durationMs` | LLM, extraction, prompt, intake, upload, and scheduler work |
+| `status` | lifecycle state such as `submitted`, `completed`, `failed`, `approved` |
+| `errorType` | sanitized class/code, not raw secret-bearing messages |
+
+## Backend Event Taxonomy
+
+| Domain | Event | Required metadata | Notes |
+| --- | --- | --- | --- |
+| Notes | `note.created` | `noteId`, `workspaceId`, `sourceType?` | Existing event; use for manual, import, intake, and voice-created notes. |
+| Notes | `note.updated` | `noteId`, `workspaceId`, `changedFields?` | Avoid logging field values. |
+| Notes | `note.archived` | `noteId`, `workspaceId` | Existing event. |
+| Notes | `note.restored` | `noteId`, `workspaceId` | Add when restore telemetry is implemented. |
+| Notes | `note.searched` | `workspaceId?`, `mode`, `resultCount?` | Existing event for lexical/hybrid search. |
+| Notes | `note.exported_text` | `noteId`, `format?` | Existing text export/share-safe event. |
+| Sharing | `note.share_created` | `noteId`, `workspaceId`, `expiresInHours?` | Never log share token. |
+| Sharing | `note.share_revoked` | `noteId`, `workspaceId` | Existing event. |
+| Collaboration | `note.shared_with_user` | `noteId`, `permission` | Existing event; do not log email. |
+| Collaboration | `note.collaborator_removed` | `noteId`, `removedUserId` | Existing event. |
+| Workspaces | `workspace.created` | `workspaceId` | Existing event. |
+| Workspaces | `workspace.onboarding_seeded` | `workspaceId`, `noteCount`, `agentActionCount` | Existing event; add counts when touched. |
+
+## Prompt And AI Event Taxonomy
+
+| Domain | Event | Required metadata | Notes |
+| --- | --- | --- | --- |
+| Prompt templates | `smart_action_template_created` | `category`, `inputType` | Existing event. |
+| Prompt runs | `smart_action_run` | `templateSlug`, `noteId?`, `workspaceId?`, `model?`, `durationMs?` | Existing event in runner. |
+| Prompt runs | `smart_action_result_saved` | `templateSlug`, `resultType`, `durationMs?` | Existing event. |
+| Prompt runs | `smart_action_error` | `templateSlug`, `errorType`, `durationMs?` | Existing event; sanitize error. |
+| Prompt schedules | `scheduled_action_fired` | `scheduleId`, `templateSlug` | Existing event. |
+| Prompt webhooks | `webhook_triggered` | `webhookId`, `triggerEvent` | Existing event; do not log webhook secret. |
+| Copilot | `note.copilot` | `noteId`, `action` | Existing route event. |
+| Copilot | `copilot_transform` | `action`, `durationMs` | Existing lower-level event. Prefer joining to `note.copilot` through request id. |
+| Suggestions | `duplicate_detected` | `noteId`, `similarityScore` | Existing event; avoid logging titles/body. |
+| Suggestions | `auto_summarize_triggered` | `noteId`, `wordCount` | Existing event. |
+| URL extraction | `url_extract_completed` | `domain`, `wordCount` | Existing event; log domain only, not full URL. |
+| Palace | `palace.memories_extracted` | `workspaceId?`, `noteId?`, `memoryCount` | Existing event; no raw memory content. |
+
+## Intake Event Taxonomy
+
+| Event | Required metadata | Notes |
+| --- | --- | --- |
+| `intake_submitted` | `contentType`, `templateSlug`, `domain`, `workspaceId?` | Existing event; domain only. |
+| `intake_job_completed` | `contentType`, `templateSlug`, `domain`, `durationMs?` | Existing event; add duration when touched. |
+| `intake_job_failed` | `contentType`, `errorType`, `stage?` | Existing event currently logs a raw error string; next code touch should normalize to `errorType`. |
+| `intake_rule_created` | `ruleId`, `contentType`, `priority` | Add when rule telemetry is implemented. |
+| `intake_rule_updated` | `ruleId`, `changedFields` | Add when rule telemetry is implemented. |
+
+## Reviews And MCP Event Taxonomy
+
+| Domain | Event | Required metadata | Notes |
+| --- | --- | --- | --- |
+| Reviews | `agent_action.created` | `actionId`, `workspaceId`, `noteId?`, `toolName`, `actionType`, `actorType` | Add for direct API-created actions. |
+| Reviews | `agent_action.approved` | `actionId`, `workspaceId`, `noteId?`, `reviewerId?` | Add for single or batch review. |
+| Reviews | `agent_action.rejected` | `actionId`, `workspaceId`, `noteId?`, `reviewerId?` | Add for single or batch review. |
+| Reviews | `agent_action.batch_reviewed` | `approvedCount`, `rejectedCount`, `total` | Add for batch endpoint. |
+| MCP | `mcp.tool.called` | `toolName`, `correlationId`, `dryRun`, `idempotencyKey?` | Add for read/write tools if telemetry volume is acceptable. |
+| MCP | `mcp.tool.applied` | `toolName`, `actionId?`, `workspaceId`, `noteId?` | Write tools should connect to agent action audit rows. |
+| MCP | `mcp.tool.failed` | `toolName`, `errorType`, `correlationId?` | Sanitize error details. |
+
+MCP events should align with common-platform agent runtime names where possible: `agent.run.started`, `agent.run.completed`, and `audit.action.logged` remain ecosystem-level names for cross-product replay. NoteLett can emit product-local `mcp.*` events for operational dashboards.
+
+## Mobile Capture Event Taxonomy
+
+Mobile telemetry uses `@bytelyst/telemetry-client` with channel `notelett_mobile`.
+
+| Event | Required metadata | Notes |
+| --- | --- | --- |
+| `mobile_app_initialized` | `appVersion`, `buildNumber`, `osFamily` | Existing event category `app_shell`. |
+| `mobile.capture.started` | `captureMode`, `workspaceId?` | Add when capture flow is instrumented. |
+| `mobile.capture.saved` | `captureMode`, `noteId`, `workspaceId`, `hasBlob` | Do not log raw captured text. |
+| `mobile.capture.failed` | `captureMode`, `errorType`, `offline` | Sanitize error details. |
+| `mobile.intake.shared_url_received` | `contentType?`, `domain?` | Domain only; do not log full URL. |
+| `mobile.offline_queue.flushed` | `queuedCount`, `successCount`, `failureCount` | Useful for offline reliability. |
+| `mobile.telemetry.flushed` | `reason`, `queuedCount?` | Pair with app background lifecycle. |
+
+## Diagnostic Breadcrumbs
+
+Use common-platform `@bytelyst/diagnostics-client` breadcrumbs for client-side failure reports. Breadcrumbs should be terse, bounded, and free of sensitive text.
+
+Recommended categories and messages:
+
+| Category | Message | Data |
+| --- | --- | --- |
+| `navigation` | `opened_dashboard`, `opened_note_detail`, `opened_capture`, `opened_reviews`, `opened_settings` | route id, note/workspace id where already visible |
+| `note` | `note_create_started`, `note_save_completed`, `note_archive_failed` | noteId, workspaceId, status, errorType |
+| `prompt` | `prompt_run_started`, `prompt_run_completed`, `prompt_run_failed` | templateSlug, noteId, workspaceId, durationMs, errorType |
+| `intake` | `intake_submitted`, `intake_polling_started`, `intake_failed` | contentType, domain, jobId, errorType |
+| `review` | `review_decision_started`, `review_decision_completed`, `review_decision_failed` | actionId, decision, errorType |
+| `capture` | `capture_mode_selected`, `capture_saved`, `capture_failed` | captureMode, workspaceId, noteId, errorType |
+| `offline` | `offline_queue_enqueued`, `offline_queue_flushed`, `offline_queue_failed` | operation, queuedCount, errorType |
+| `platform` | `feature_flags_unavailable`, `kill_switch_checked`, `telemetry_flush_failed` | dependency, status, errorType |
+| `mcp` | `mcp_settings_updated`, `mcp_connection_failed` | serverHost, errorType |
+
+Breadcrumb data must not include note bodies, prompt bodies, token values, blob paths, share tokens, full URLs, headers, or response bodies.
+
+## Adoption Checklist
+
+- New backend route handlers call `trackEvent()` for release-critical facts.
+- New web/mobile workflows add breadcrumbs around failure-prone actions.
+- Error events use `errorType` or a short code, not raw exception messages, unless the message is guaranteed sanitized.
+- Event names are documented here before they are emitted from production code.
+- High-volume events have sampling or an explicit volume review.
+- P10 final verification samples `/api/diagnostics/telemetry` in non-production and confirms platform telemetry ingest in production smoke.
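The checklist item about `errorType` (and the note that `intake_job_failed` should stop logging a raw error string) could be met with a small normalizer that maps a thrown value to a class name or short code. This is a sketch under assumptions: `toErrorType`, `failureMetadata`, and the list of known codes are hypothetical and not taken from `backend/src/lib/telemetry.ts`.

```typescript
// Hypothetical helpers; the names and the code list are illustrative assumptions.

// Node-style system error codes that are safe to report as-is.
const KNOWN_ERROR_CODES = new Set<string>(["ETIMEDOUT", "ECONNREFUSED", "ECONNRESET"]);

// Maps any thrown value to a short, non-sensitive identifier.
function toErrorType(error: unknown): string {
  if (error instanceof Error) {
    const code = (error as Error & { code?: unknown }).code;
    if (typeof code === "string" && KNOWN_ERROR_CODES.has(code)) {
      return code.toLowerCase();
    }
    // The class name is safe to log; the message may embed URLs or tokens.
    return error.name || "Error";
  }
  return "unknown_error";
}

// Builds metadata for a failure event without ever reading `error.message`.
function failureMetadata(
  stage: string,
  error: unknown,
  startedAtMs: number,
  nowMs: number = Date.now(),
): { stage: string; errorType: string; durationMs: number } {
  return {
    stage,
    errorType: toErrorType(error),
    durationMs: Math.max(0, nowMs - startedAtMs),
  };
}
```

A caller would pass the result as the metadata of an event such as `intake_job_failed`. Because the message is never read, a URL or token embedded in an exception cannot leak through this path.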