diff --git a/docs/runbooks/MEK_ROTATION.md b/docs/runbooks/MEK_ROTATION.md
new file mode 100644
index 0000000..63b3e05
--- /dev/null
+++ b/docs/runbooks/MEK_ROTATION.md
@@ -0,0 +1,135 @@
+# Runbook — MEK (Master Encryption Key) Rotation
+
+> **Owner:** Platform / Security
+> **Touches:** `@bytelyst/field-encrypt`, Azure Key Vault, NoteLett backend (port 4016)
+> **Risk:** Medium — operates on encrypted Cosmos documents. A failure mid-rewrap leaves DEKs wrapped under a mix of old and new MEK versions, which is recoverable but requires the old MEK to remain readable until rotation completes.
+
+## Background
+
+NoteLett uses envelope encryption from `@bytelyst/field-encrypt`:
+
+- **MEK** (Master Encryption Key) lives in Azure Key Vault (or `FIELD_ENCRYPT_KEY` env in non-prod).
+- **DEKs** (per-document Data Encryption Keys) are wrapped by the MEK and stored in the Cosmos `dek_store` container.
+- Encrypted fields on note documents carry a `dekId` plus a `mekVersion` so the backend can find the right DEK and the right MEK version.
+
+Rotation means: keep all encrypted document fields unchanged, generate a **new MEK version**, and **rewrap every DEK** so future reads use the new MEK. Old documents are never re-encrypted — only the DEK wrapping layer changes.
+
+## Pre-flight Checklist
+
+- [ ] Confirm `FIELD_ENCRYPT_ENABLED=true` in the target environment.
+- [ ] Confirm `FIELD_ENCRYPT_KEY_PROVIDER=akv` and `AZURE_KEYVAULT_URL` is set in production.
+- [ ] Confirm the operator running rotation has both `keys/wrapKey` and `keys/unwrapKey` permissions on the target AKV key.
+- [ ] Take a Cosmos DB snapshot or enable point-in-time restore on the `dek_store` container (and any container holding encrypted documents).
+- [ ] Confirm there are no in-flight migrations from `scripts/encrypt-migrate.ts` running against this environment.
+- [ ] Notify on-call channel; rotation should be scheduled in a low-traffic window because the rewrap loop invalidates DEK cache entries while it runs.
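The environment-variable items in the checklist above can be checked mechanically before a rotation starts. The following is a minimal sketch, not code from the repo: the `checkRotationPreflight` name and its return shape are hypothetical, and only the three variables named in the checklist are inspected.

```typescript
// Hypothetical pre-flight helper; not part of @bytelyst/field-encrypt or the backend.
type Env = Record<string, string | undefined>;

// Returns a list of human-readable problems; an empty list means the
// environment-variable items of the checklist are satisfied.
function checkRotationPreflight(env: Env): string[] {
  const problems: string[] = [];
  if (env.FIELD_ENCRYPT_ENABLED !== "true") {
    problems.push("FIELD_ENCRYPT_ENABLED must be 'true'");
  }
  if (env.FIELD_ENCRYPT_KEY_PROVIDER !== "akv") {
    problems.push("FIELD_ENCRYPT_KEY_PROVIDER must be 'akv'");
  }
  if (!env.AZURE_KEYVAULT_URL) {
    problems.push("AZURE_KEYVAULT_URL must be set");
  }
  return problems;
}

// Example with a literal object; a real script would pass process.env.
const preflightProblems = checkRotationPreflight({
  FIELD_ENCRYPT_ENABLED: "true",
  FIELD_ENCRYPT_KEY_PROVIDER: "env",
});
console.log(preflightProblems);
```

The remaining checklist items (AKV permissions, snapshots, in-flight migrations, on-call notice) still need a human or a separate tool; this sketch covers configuration only.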
+
+## Rotation Procedure
+
+### Step 1 — Create a new MEK version in AKV
+
+```bash
+# Azure CLI example: create a new version of the existing key.
+az keyvault key create \
+  --vault-name "" \
+  --name "notelett-mek" \
+  --kty RSA \
+  --size 4096
+```
+
+The Key Vault returns a new version identifier. Record it — you will need it to verify rotation in Step 4.
+
+### Step 2 — Roll the backend with both MEK versions readable
+
+Update the running deployment so that:
+
+- `AZURE_KEYVAULT_URL` continues to point at the same vault.
+- `FIELD_ENCRYPT_MEK_NAME` continues to point at the same key name (`notelett-mek`).
+- The new version is now the **latest** version on the AKV key; the previous version is still **enabled** so old wrapped DEKs can be unwrapped.
+
+The AKV-backed `KeyProvider` resolves `mekVersion` from the wrapped-DEK record, so older DEKs continue to unwrap correctly during the transition.
+
+### Step 3 — Run `rewrapAllDeks`
+
+Use the cross-product CLI from common-plat:
+
+```bash
+cd ../learning_ai_common_plat
+pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
+  --product notelett \
+  --mode rewrap-deks \
+  --old-mek-version \
+  --new-mek-version
+```
+
+This iterates every wrapped DEK in the `dek_store` container, unwraps with the old MEK version, rewraps with the new MEK version, and writes the updated wrapped DEK back. The `DekCache` is invalidated entry-by-entry so live traffic immediately picks up the new wrap.
+
+Progress is logged. Re-run is **idempotent**: DEKs whose `mekVersion` already matches the new target are skipped.
+
+### Step 4 — Verify
+
+```bash
+# 1. Confirm health and dependency readiness against the rotated environment.
+curl https:///health
+curl https:///api/diagnostics/readiness
+
+# 2. Confirm at least one read of an encrypted note succeeds after rotation.
+curl -H "Authorization: Bearer " https:///api/notes/?workspaceId=
+
+# 3. Spot-check `dek_store`: every wrapped DEK should report the new mekVersion.
+# Sample query in Cosmos Data Explorer:
+#   SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != ""
+# Expected: 0
+```
+
+### Step 5 — Disable the old MEK version
+
+After verification and a soak period (recommended: 24 hours of normal traffic), disable the old key version in AKV so it can no longer be used to unwrap.
+
+```bash
+az keyvault key set-attributes \
+  --vault-name "" \
+  --name "notelett-mek" \
+  --version "" \
+  --enabled false
+```
+
+Do **not** delete the old version — keep it disabled for at least 30 days so a backup restore that contains old wrapped DEKs can still be recovered.
+
+## Rollback
+
+If rewrap fails partway:
+
+1. Stop the rewrap job.
+2. Leave the old MEK version **enabled** in AKV.
+3. The backend continues to serve traffic — DEKs that were already rewrapped resolve via the new version, DEKs that were not yet rewrapped resolve via the old version.
+4. Investigate the failure (typically AKV throttling or transient Cosmos errors), then re-run the same command. It will skip DEKs already rewrapped.
+
+If a document fails to decrypt after rotation:
+
+1. Inspect the encrypted field — it carries the `dekId` it expects.
+2. Inspect the `dek_store` row for that `dekId` — confirm `mekVersion` is the new version.
+3. If the row is missing, restore the most recent Cosmos snapshot of `dek_store`.
+4. **Never** rotate the field's `dekId` manually; the field-encrypt library owns that lifecycle.
+
+## Recovery From Lost MEK
+
+If the AKV key (all versions) is unrecoverable:
+
+- All envelope-encrypted documents are unrecoverable.
+- Restore from the most recent Cosmos snapshot taken **before** the loss.
+- Re-key from that snapshot by:
+  1. Provisioning a new AKV key.
+  2. Running `scripts/encrypt-migrate.ts` in `re-encrypt` mode (not `rewrap-deks`) so a fresh DEK is generated for every document.
+
+## Open Items
+
+- **Automation:** rotation is manual today. A scheduled monthly rotation via a controlled GitHub Action with AKV federated credentials is tracked in the production-hardening backlog.
+- **Audit trail:** the rewrap CLI writes structured logs but does not yet emit a domain event to `actiontrail`. Tracked separately.
+
+## References
+
+- Package helper: `../learning_ai_common_plat/packages/field-encrypt/src/envelope.ts` — `rewrapAllDeks`
+- Cross-product CLI: `../learning_ai_common_plat/scripts/encrypt-migrate.ts`
+- Backend config: [`backend/src/lib/config.ts`](../../backend/src/lib/config.ts) — `FIELD_ENCRYPT_*` env vars
+- Backend bootstrap: [`backend/src/lib/field-encrypt.ts`](../../backend/src/lib/field-encrypt.ts) — `initEncryption()`
+- Related: [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](../DATA_MIGRATION_AND_BACKFILL_PLAN.md) — initial backfill procedure
diff --git a/docs/runbooks/SECRET_MANAGEMENT.md b/docs/runbooks/SECRET_MANAGEMENT.md
new file mode 100644
index 0000000..31fddae
--- /dev/null
+++ b/docs/runbooks/SECRET_MANAGEMENT.md
@@ -0,0 +1,136 @@
+# Runbook — Secret Management for NoteLett
+
+> **Owner:** Platform / Security
+> **Touches:** backend container (port 4016), web container (port 3000), Azure Key Vault, deployment platform
+> **Audience:** anyone deploying NoteLett to a non-development environment
+
+## Principles
+
+- Never commit a secret to git, never bake one into a Docker image.
+- Every secret has exactly one source of truth — Azure Key Vault (AKV) in production.
+- The container reads secrets at process start, never from disk on the runtime host.
+- Rotation is non-disruptive: rolling a deployment after rotating the secret is enough.
+
+## Secret Inventory
+
+| Variable | Required when | Source of truth (prod) | Source (dev) |
+|---|---|---|---|
+| `JWT_SECRET` | always (validated ≥ 32 chars in prod) | AKV secret `notelett-jwt-secret` | `.env` (dev default rejected in prod) |
+| `COSMOS_ENDPOINT` | `DB_PROVIDER=cosmos` | AKV secret `bytelyst-cosmos-endpoint` | `.env` |
+| `COSMOS_KEY` | `DB_PROVIDER=cosmos` | AKV secret `bytelyst-cosmos-key` | `.env` |
+| `AZURE_KEYVAULT_URL` | `FIELD_ENCRYPT_KEY_PROVIDER=akv` | Static config (URL, not a secret) | `.env` |
+| `FIELD_ENCRYPT_KEY` | `FIELD_ENCRYPT_KEY_PROVIDER=env` (non-prod only) | n/a — prod uses AKV | `.env` |
+| `OPENAI_API_KEY` | `LLM_PROVIDER=openai` | AKV secret `notelett-openai-api-key` | `.env` |
+| `OPENAI_BASE_URL` | optional override | Static config (URL, not a secret) | `.env` |
+| `AZURE_OPENAI_API_KEY` | `LLM_PROVIDER=azure` | AKV secret `notelett-azure-openai-key` | `.env` |
+| `AZURE_OPENAI_ENDPOINT` | `LLM_PROVIDER=azure` | Static config (URL, not a secret) | `.env` |
+| `GITEA_NPM_TOKEN` | Docker build only (when not using `docker-prep.sh` tarballs) | CI secret | `~/.npmrc` |
+
+`backend/src/lib/config.ts` enforces production assertions for the four hardest constraints: `JWT_SECRET` must not be the dev default and must be ≥ 32 chars, `DB_PROVIDER` must be `cosmos`, Cosmos endpoint/key/database must be set, and field encryption must be enabled with `akv` or `env` provider (never `memory`).
+
+## Production Pattern — Azure Key Vault
+
+Two supported flows depending on the deployment target:
+
+### Flow A — Workload Identity (preferred)
+
+1. The backend container runs under a Managed Identity (Azure Container Apps, AKS, or App Service).
+2. The Managed Identity has `secrets/get` and `keys/{wrapKey, unwrapKey}` permissions on the NoteLett key vault.
+3. At process start, an init step resolves secrets from AKV and exports them as env vars in the process scope only:
+
+   ```bash
+   # entrypoint snippet (illustrative); the quoted heredoc stops the shell from expanding the JS
+   eval "$(node --input-type=module <<'NODE'
+   import { DefaultAzureCredential } from '@azure/identity';
+   import { SecretClient } from '@azure/keyvault-secrets';
+   const c = new SecretClient(process.env.AZURE_KEYVAULT_URL, new DefaultAzureCredential());
+   const map = { 'notelett-jwt-secret': 'JWT_SECRET', 'bytelyst-cosmos-key': 'COSMOS_KEY', 'notelett-openai-api-key': 'OPENAI_API_KEY' }; // AKV secret name -> env var (see Secret Inventory)
+   for (const [name, envVar] of Object.entries(map)) {
+     const v = ((await c.getSecret(name)).value ?? '').replace(/'/g, "'\\''"); // escape single quotes for the shell
+     process.stdout.write(`export ${envVar}='${v}'\n`);
+   }
+   NODE
+   )"
+   exec node dist/server.js
+   ```
+
+   In `@bytelyst/config` this is encapsulated by `resolveKeyVaultSecrets(...)` (see common-plat). Use that helper instead of writing inline glue.
+
+4. Secrets never touch the container filesystem and never appear in logs (they live in process env only).
+
+### Flow B — Kubernetes Secret synced from AKV
+
+1. Use the Secrets Store CSI driver (`secrets-store.csi.k8s.io`) with its Azure Key Vault provider to sync AKV secrets into a Kubernetes Secret.
+2. Reference the K8s Secret in the Deployment via `envFrom` so values land in the container env.
+3. Rotate by recreating the Pod after the secret syncs.
+
+## Deployment Pattern — `docker-compose.yml`
+
+The committed `docker-compose.yml` reads from the host shell env (`${OPENAI_API_KEY:-}` etc.) and from a local `.env`. For production-like single-host deploys:
+
+1. Place secrets in a file owned by the deployer with `chmod 600`, never in git.
+2. Source it before `docker compose up`:
+
+   ```bash
+   set -a
+   source /etc/notelett/secrets.env
+   set +a
+   docker compose up -d
+   ```
+
+3. Avoid `--env-file` on the `docker compose` command line — it persists the path in process listings and is harder to rotate.
+4. After deploy, scrub `/etc/notelett/secrets.env` from any shell history (`history -d`) and confirm `docker compose config` does not leak the secret values to logs.
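The `chmod 600` expectation in step 1 of the list above can be verified before the secrets file is sourced. A minimal sketch, assuming the caller obtains the numeric mode from Node's `fs.statSync(path).mode`; the `isOwnerOnly` helper is hypothetical and not part of the repo.

```typescript
// Hypothetical helper (not in the NoteLett repo): returns true when a POSIX
// file mode grants no permissions to group or others, as `chmod 600` does.
function isOwnerOnly(mode: number): boolean {
  const GROUP_AND_OTHER_BITS = 0o077; // rwx bits for group and others
  return (mode & GROUP_AND_OTHER_BITS) === 0;
}

// In a deploy script the mode would come from
// fs.statSync("/etc/notelett/secrets.env").mode; literals are used here instead.
const modeChecks = [0o100600, 0o100644].map(isOwnerOnly);
console.log(modeChecks); // [ true, false ]
```

Masking with `0o077` ignores the file-type bits and the owner bits, so the check passes for `600` and `400` alike and fails as soon as group or others can read the file.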
+
+## Rotation
+
+Rotation pattern for any AKV-backed secret:
+
+1. Update the AKV secret with a new version (`az keyvault secret set ...`).
+2. Roll the backend deployment (rolling restart picks up the new value at process start).
+3. For `JWT_SECRET`: rotation invalidates all outstanding access tokens. Plan for forced re-auth or implement dual-secret support before rotating in production.
+4. For `OPENAI_API_KEY` / `AZURE_OPENAI_API_KEY`: rotation is hot — in-flight LLM calls complete with the old key; new calls use the new key after restart.
+5. For `COSMOS_KEY`: prefer rotating the **secondary** key first, swap the deployment to use it, then rotate the primary.
+
+MEK rotation has its own runbook: [`MEK_ROTATION.md`](./MEK_ROTATION.md).
+
+## Verification
+
+After any rotation or initial deploy:
+
+```bash
+# 1. Service health.
+curl https:///health
+
+# 2. Dependency readiness (datastore + encryption + platform/extraction/MCP if configured).
+curl https:///api/diagnostics/readiness
+
+# 3. Authenticated note read (proves JWT_SECRET and Cosmos creds are wired).
+curl -H "Authorization: Bearer " https:///api/notes?workspaceId=
+
+# 4. LLM smoke (proves OPENAI_API_KEY or AZURE_OPENAI_API_KEY are wired, if LLM_PROVIDER != mock).
+curl -X POST -H "Authorization: Bearer " -H "Content-Type: application/json" \
+  -d '{"workspaceId":"","noteId":"","transform":"shorten"}' \
+  https:///api/notes/copilot/transform
+```
+
+If any returns 5xx, check the structured log line for a missing-secret error before re-rotating.
+
+## Red Flags
+
+- A secret value appearing in `req.log` or `app.log` output. **Stop, rotate, and audit.**
+- A secret committed to git. Use `git filter-repo` to scrub, force-push (coordinate with the team), and rotate the secret immediately.
+- Two pods seeing different secret values. Indicates a partial K8s rollout — finish the rollout before traffic is sent to the new version.
+- `FIELD_ENCRYPT_KEY_PROVIDER=memory` in production. The backend will refuse to start, but if it slips through (e.g. with `NODE_ENV` set to something other than `production`), all encrypted documents are unrecoverable on restart.
+
+## Open Items
+
+- **Centralized rotation calendar.** Tracked in production-hardening backlog: schedule per-secret cadence (90 days for `OPENAI_API_KEY`, 365 days for `JWT_SECRET`, etc.).
+- **Audit log integration.** Emit a `secret.rotated` event to `actiontrail` after each rotation. Currently rotation is logged only in AKV's own audit feed.
+- **Dual-JWT support.** Today `JWT_SECRET` rotation invalidates outstanding tokens; planned: support `JWT_SECRET_NEXT` for graceful transitions.
+
+## References
+
+- Config validation: [`backend/src/lib/config.ts`](../../backend/src/lib/config.ts)
+- AKV-backed encryption provider: `../learning_ai_common_plat/packages/field-encrypt/src/key-provider-akv.ts`
+- Shared secret resolver: `../learning_ai_common_plat/packages/config/src/akv.ts`
+- Related: [`MEK_ROTATION.md`](./MEK_ROTATION.md)