docs(runbooks): add MEK rotation and secret-management runbooks

Sprint B — closes audit items B4 and B5.

- docs/runbooks/MEK_ROTATION.md: step-by-step procedure for rotating
  the field-encrypt master key in Azure Key Vault, including pre-flight
  checks, rewrapAllDeks usage, verification queries, rollback, and lost-MEK
  recovery. Replaces the previous gap where MEK rotation had no
  documented operator path.
- docs/runbooks/SECRET_MANAGEMENT.md: inventory of every secret consumed
  by NoteLett with its production source (AKV), two production-grade
  patterns (workload identity vs K8s CSI), the compose-host pattern,
  rotation flow per secret type, verification commands, and red-flag
  triage.

Both docs cross-link each other and call out concrete open items
(automation, dual-JWT support, audit-log emission) for later sprints
rather than overstating current capabilities.
saravanakumardb1 2026-05-22 23:23:38 -07:00
parent 1258d49488
commit bcad7d330a
2 changed files with 271 additions and 0 deletions


@ -0,0 +1,135 @@
# Runbook — MEK (Master Encryption Key) Rotation
> **Owner:** Platform / Security
> **Touches:** `@bytelyst/field-encrypt`, Azure Key Vault, NoteLett backend (port 4016)
> **Risk:** Medium — operates on encrypted Cosmos documents. A failure mid-rewrap leaves DEKs wrapped under a mix of old and new MEK versions, which is recoverable but requires the old MEK to remain readable until rotation completes.
## Background
NoteLett uses envelope encryption from `@bytelyst/field-encrypt`:
- **MEK** (Master Encryption Key) lives in Azure Key Vault (or `FIELD_ENCRYPT_KEY` env in non-prod).
- **DEKs** (per-document Data Encryption Keys) are wrapped by the MEK and stored in the Cosmos `dek_store` container.
- Encrypted fields on note documents carry a `dekId` plus a `mekVersion` so the backend can find the right DEK and the right MEK version.
Rotation means: keep all encrypted document fields unchanged, generate a **new MEK version**, and **rewrap every DEK** so future reads use the new MEK. Old documents are never re-encrypted — only the DEK wrapping layer changes.
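The sketch below illustrates why rotation leaves document ciphertext untouched: only the wrapped DEK changes. It is a minimal model, not the `@bytelyst/field-encrypt` implementation. It assumes AES-256-GCM for both layers and an in-memory map of MEK versions, whereas production wraps through Azure Key Vault; the `WrappedDek` shape and all function names are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical record shape; the real dek_store schema may differ.
interface WrappedDek {
  dekId: string;
  mekVersion: string;
  iv: Buffer;
  tag: Buffer;
  wrapped: Buffer;
}

// AES-256-GCM encryption of a payload; this sketch uses it for both layers.
function seal(key: Buffer, plaintext: Buffer) {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function open(key: Buffer, iv: Buffer, tag: Buffer, data: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(data), decipher.final()]);
}

function wrapDek(dekId: string, dek: Buffer, mek: Buffer, mekVersion: string): WrappedDek {
  const { iv, tag, data } = seal(mek, dek);
  return { dekId, mekVersion, iv, tag, wrapped: data };
}

// The record names the MEK version it was wrapped under, so lookup is per record.
function unwrapDek(record: WrappedDek, meks: Map<string, Buffer>): Buffer {
  const mek = meks.get(record.mekVersion);
  if (!mek) throw new Error(`MEK version ${record.mekVersion} is not available`);
  return open(mek, record.iv, record.tag, record.wrapped);
}

// Rotation touches only the wrapping layer: unwrap with the old MEK, wrap with the new one.
function rewrapDek(record: WrappedDek, meks: Map<string, Buffer>, newVersion: string): WrappedDek {
  const newMek = meks.get(newVersion);
  if (!newMek) throw new Error(`MEK version ${newVersion} is not available`);
  return wrapDek(record.dekId, unwrapDek(record, meks), newMek, newVersion);
}
```

A field sealed with the DEK before `rewrapDek` can still be opened afterwards with the DEK recovered from the new record, even once the old MEK version is removed from the map.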
## Pre-flight Checklist
- [ ] Confirm `FIELD_ENCRYPT_ENABLED=true` in the target environment.
- [ ] Confirm `FIELD_ENCRYPT_KEY_PROVIDER=akv` and `AZURE_KEYVAULT_URL` is set in production.
- [ ] Confirm the operator running rotation has both `keys/wrapKey` and `keys/unwrapKey` permissions on the target AKV key.
- [ ] Take a Cosmos DB snapshot or enable point-in-time restore on the `dek_store` container (and any container holding encrypted documents).
- [ ] Confirm there are no in-flight migrations from `scripts/encrypt-migrate.ts` running against this environment.
- [ ] Notify the on-call channel. Schedule rotation in a low-traffic window: the rewrap loop invalidates `DekCache` entries as it runs, so reads that hit a rewrapped DEK must unwrap it again instead of using the cache.
## Rotation Procedure
### Step 1 — Create a new MEK version in AKV
```bash
# Azure CLI example: create a new version of the existing key.
az keyvault key create \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --kty RSA \
  --size 4096
```
Key Vault returns a new version identifier. Record it, along with the previous version identifier: both are needed for the rewrap in Step 3, and the new one for verification in Step 4.
### Step 2 — Roll the backend with both MEK versions readable
Update the running deployment so that:
- `AZURE_KEYVAULT_URL` continues to point at the same vault.
- `FIELD_ENCRYPT_MEK_NAME` continues to point at the same key name (`notelett-mek`).
- The new version is now the **latest** version on the AKV key; the previous version is still **enabled** so old wrapped DEKs can be unwrapped.
The AKV-backed `KeyProvider` resolves `mekVersion` from the wrapped-DEK record, so older DEKs continue to unwrap correctly during the transition.
### Step 3 — Run `rewrapAllDeks`
Use the cross-product CLI from common-plat:
```bash
cd ../learning_ai_common_plat
pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
  --product notelett \
  --mode rewrap-deks \
  --old-mek-version <OLD_VERSION_ID> \
  --new-mek-version <NEW_VERSION_ID>
```
This iterates every wrapped DEK in the `dek_store` container, unwraps with the old MEK version, rewraps with the new MEK version, and writes the updated wrapped DEK back. The `DekCache` is invalidated entry-by-entry so live traffic immediately picks up the new wrap.
Progress is logged. Re-running the command is **idempotent**: DEKs whose `mekVersion` already matches the new target are skipped.
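The skip-if-already-rewrapped behaviour can be pictured with the sketch below. It is not the code of `rewrapAllDeks`: the record shape, the `Rewrap` callback, and the in-memory `Map` standing in for `dek_store` are all assumptions, and the callback is synchronous here only to keep the example short (a real unwrap/wrap call against AKV would be asynchronous).

```typescript
// Hypothetical shapes; the real helper may be structured differently.
interface DekRecord {
  dekId: string;
  mekVersion: string;
  wrapped: string;
}

interface RewrapResult {
  rewrapped: number;
  skipped: number;
}

// Stand-in for "unwrap with fromVersion, wrap with toVersion".
type Rewrap = (wrapped: string, fromVersion: string, toVersion: string) => string;

function rewrapAll(store: Map<string, DekRecord>, newVersion: string, rewrap: Rewrap): RewrapResult {
  const result: RewrapResult = { rewrapped: 0, skipped: 0 };
  // Snapshot the records first so writes to the map do not affect the iteration.
  for (const record of Array.from(store.values())) {
    if (record.mekVersion === newVersion) {
      result.skipped += 1; // already on the target version: nothing to do
      continue;
    }
    const wrapped = rewrap(record.wrapped, record.mekVersion, newVersion);
    // Write back per record so an interrupted run keeps the progress made so far.
    store.set(record.dekId, { ...record, wrapped, mekVersion: newVersion });
    result.rewrapped += 1;
  }
  return result;
}
```

Running `rewrapAll` a second time against the same store reports every record as skipped, which is the property that makes a re-run after a partial failure safe.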
### Step 4 — Verify
```bash
# 1. Confirm health and dependency readiness against the rotated environment.
curl "https://<backend-host>/health"
curl "https://<backend-host>/api/diagnostics/readiness"
# 2. Confirm at least one read of an encrypted note succeeds after rotation.
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes/<noteId>?workspaceId=<workspaceId>"
# 3. Spot-check `dek_store`: every wrapped DEK should report the new mekVersion.
# Sample query in Cosmos Data Explorer:
# SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != "<NEW_VERSION_ID>"
# Expected: 0
```
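If the `dek_store` records are exported for inspection, the same check can be run offline. The sketch below is an assumption-laden illustration: the `DekVersionInfo` shape and both function names are hypothetical. It also groups records that have no `mekVersion` at all, since a `!=` filter may not count a record where the property is absent.

```typescript
// Hypothetical record shape, reduced to the one field the check needs.
interface DekVersionInfo {
  dekId: string;
  mekVersion?: string;
}

// Count DEK records per MEK version; records without a version go under "(missing)".
function countByMekVersion(records: DekVersionInfo[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    const key = r.mekVersion ?? "(missing)";
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Rotation is complete only when every record reports the new version.
function isRotationComplete(records: DekVersionInfo[], newVersion: string): boolean {
  return records.every((r) => r.mekVersion === newVersion);
}
```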
### Step 5 — Disable the old MEK version
After verification and a soak period (recommended: 24 hours of normal traffic), disable the old key version in AKV so it can no longer be used to unwrap.
```bash
az keyvault key set-attributes \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --version "<OLD_VERSION_ID>" \
  --enabled false
```
Do **not** delete the old version. Keep it disabled for at least 30 days so that it can be re-enabled if a backup restore brings back DEKs that are still wrapped under it.
## Rollback
If rewrap fails partway:
1. Stop the rewrap job.
2. Leave the old MEK version **enabled** in AKV.
3. The backend continues to serve traffic — DEKs that were already rewrapped resolve via the new version, DEKs that were not yet rewrapped resolve via the old version.
4. Investigate the failure (typically AKV throttling or transient Cosmos errors), then re-run the same command. It will skip DEKs already rewrapped.
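For transient failures such as throttling, a caller can also retry an individual operation with capped exponential backoff before giving up. The sketch below is a generic pattern, not something the rewrap CLI is known to do; the function names and delay values are illustrative assumptions.

```typescript
// Delay before each retry: base * 2^attempt, capped. Values are illustrative.
function backoffDelays(retries: number, baseMs: number, capMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `op`, waiting for the next delay after each failure.
// Rethrows the last error once the delays are used up.
async function withRetry<T>(op: () => Promise<T>, delays: number[]): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= delays.length) throw err;
      await sleep(delays[attempt]);
    }
  }
}
```

Because the rewrap itself is idempotent, retrying a single record and re-running the whole command are both safe; the retry only reduces how often an operator has to intervene.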
If a document fails to decrypt after rotation:
1. Inspect the encrypted field — it carries the `dekId` it expects.
2. Inspect the `dek_store` row for that `dekId` — confirm `mekVersion` is the new version.
3. If the row is missing, restore the most recent Cosmos snapshot of `dek_store`.
4. **Never** rotate the field's `dekId` manually; the field-encrypt library owns that lifecycle.
## Recovery From Lost MEK
If the AKV key (all versions) is unrecoverable:
- All envelope-encrypted documents are unrecoverable.
- Restore from the most recent Cosmos snapshot taken **before** the loss.
- Re-key from that snapshot by:
  1. Provisioning a new AKV key.
  2. Running `scripts/encrypt-migrate.ts` in `re-encrypt` mode (not `rewrap-deks`) so a fresh DEK is generated for every document.
## Open Items
- **Automation:** rotation is manual today. A scheduled monthly rotation via a controlled GitHub Action with AKV federated credentials is tracked in the production-hardening backlog.
- **Audit trail:** the rewrap CLI writes structured logs but does not yet emit a domain event to `actiontrail`. Tracked separately.
## References
- Package helper: `../learning_ai_common_plat/packages/field-encrypt/src/envelope.ts` (`rewrapAllDeks`)
- Cross-product CLI: `../learning_ai_common_plat/scripts/encrypt-migrate.ts`
- Backend config: [`backend/src/lib/config.ts`](../../backend/src/lib/config.ts) — `FIELD_ENCRYPT_*` env vars
- Backend bootstrap: [`backend/src/lib/field-encrypt.ts`](../../backend/src/lib/field-encrypt.ts) — `initEncryption()`
- Related: [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](../DATA_MIGRATION_AND_BACKFILL_PLAN.md) — initial backfill procedure


@ -0,0 +1,136 @@
# Runbook — Secret Management for NoteLett
> **Owner:** Platform / Security
> **Touches:** backend container (port 4016), web container (port 3000), Azure Key Vault, deployment platform
> **Audience:** anyone deploying NoteLett to a non-development environment
## Principles
- Never commit a secret to git, never bake one into a Docker image.
- Every secret has exactly one source of truth — Azure Key Vault (AKV) in production.
- The container reads secrets at process start, never from disk on the runtime host.
- Rotation is non-disruptive: rolling a deployment after rotating the secret is enough.
## Secret Inventory
| Variable | Required when | Source of truth (prod) | Source (dev) |
|---|---|---|---|
| `JWT_SECRET` | always (validated ≥ 32 chars in prod) | AKV secret `notelett-jwt-secret` | `.env` (dev default rejected in prod) |
| `COSMOS_ENDPOINT` | `DB_PROVIDER=cosmos` | AKV secret `bytelyst-cosmos-endpoint` | `.env` |
| `COSMOS_KEY` | `DB_PROVIDER=cosmos` | AKV secret `bytelyst-cosmos-key` | `.env` |
| `AZURE_KEYVAULT_URL` | `FIELD_ENCRYPT_KEY_PROVIDER=akv` | Static config (URL, not a secret) | `.env` |
| `FIELD_ENCRYPT_KEY` | `FIELD_ENCRYPT_KEY_PROVIDER=env` (non-prod only) | n/a — prod uses AKV | `.env` |
| `OPENAI_API_KEY` | `LLM_PROVIDER=openai` | AKV secret `notelett-openai-api-key` | `.env` |
| `OPENAI_BASE_URL` | optional override | Static config (URL, not a secret) | `.env` |
| `AZURE_OPENAI_API_KEY` | `LLM_PROVIDER=azure` | AKV secret `notelett-azure-openai-key` | `.env` |
| `AZURE_OPENAI_ENDPOINT` | `LLM_PROVIDER=azure` | Static config (URL, not a secret) | `.env` |
| `GITEA_NPM_TOKEN` | Docker build only (when not using `docker-prep.sh` tarballs) | CI secret | `~/.npmrc` |
`backend/src/lib/config.ts` enforces production assertions for the four hardest constraints: `JWT_SECRET` must not be the dev default and must be ≥ 32 chars, `DB_PROVIDER` must be `cosmos`, Cosmos endpoint/key/database must be set, and field encryption must be enabled with `akv` or `env` provider (never `memory`).
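The four assertions can be pictured as a single function that returns every violation at once. This is a sketch of the idea, not the contents of `config.ts`: the dev default value, the `COSMOS_DATABASE` variable name, and the function name are assumptions made for illustration.

```typescript
// Hypothetical dev default; the real value lives in backend/src/lib/config.ts.
const DEV_JWT_DEFAULT = "dev-secret-change-me";

type Env = Record<string, string | undefined>;

// Returns one message per violated production constraint; an empty array means the config passes.
function productionConfigErrors(env: Env): string[] {
  const errors: string[] = [];

  const jwt = env.JWT_SECRET ?? "";
  if (jwt === DEV_JWT_DEFAULT || jwt.length < 32) {
    errors.push("JWT_SECRET must be a non-default value of at least 32 characters");
  }

  if (env.DB_PROVIDER !== "cosmos") {
    errors.push("DB_PROVIDER must be cosmos");
  }

  // COSMOS_DATABASE is an assumed name for the database setting.
  for (const name of ["COSMOS_ENDPOINT", "COSMOS_KEY", "COSMOS_DATABASE"]) {
    if (!env[name]) errors.push(`${name} must be set`);
  }

  const provider = env.FIELD_ENCRYPT_KEY_PROVIDER;
  if (env.FIELD_ENCRYPT_ENABLED !== "true" || (provider !== "akv" && provider !== "env")) {
    errors.push("field encryption must be enabled with the akv or env provider");
  }

  return errors;
}
```

Collecting all violations before failing lets an operator fix a misconfigured environment in one pass instead of one restart per missing variable.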
## Production Pattern — Azure Key Vault
Two supported flows depending on the deployment target:
### Flow A — Workload Identity (preferred)
1. The backend container runs under a Managed Identity (Azure Container Apps, AKS, or App Service).
2. The Managed Identity has `secrets/get` and `keys/{wrapKey, unwrapKey}` permissions on the NoteLett key vault.
3. At process start, an init step resolves secrets from AKV and exports them as env vars in the process scope only:
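```bash
# entrypoint snippet (illustrative)
# The script is single-quoted so bash does not expand the JS template literal;
# "\x27" is a JS escape for a single quote that keeps the shell quoting intact.
eval "$(node --input-type=module -e '
import { DefaultAzureCredential } from "@azure/identity";
import { SecretClient } from "@azure/keyvault-secrets";
// Map each env var the backend reads to its AKV secret name.
const secrets = { JWT_SECRET: "notelett-jwt-secret", COSMOS_KEY: "bytelyst-cosmos-key", OPENAI_API_KEY: "notelett-openai-api-key" };
const c = new SecretClient(process.env.AZURE_KEYVAULT_URL, new DefaultAzureCredential());
for (const [envName, secretName] of Object.entries(secrets)) {
  const v = (await c.getSecret(secretName)).value ?? "";
  // Escape embedded single quotes so eval sees one literal string.
  process.stdout.write(`export ${envName}=\x27${v.split("\x27").join("\x27\\\x27\x27")}\x27\n`);
}
')"
exec node dist/server.js
```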
In `@bytelyst/config` this is encapsulated by `resolveKeyVaultSecrets(...)` (see common-plat). Use that helper instead of writing inline glue.
4. Secrets never touch the container filesystem and never appear in logs (they live in process env only).
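Two details are easy to get wrong in any hand-written glue for step 3: the env var name and the shell quoting of the value. The sketch below shows both under stated assumptions; it is not the `resolveKeyVaultSecrets(...)` helper, and the function names are hypothetical. An explicit map avoids relying on a naming transform: uppercasing `notelett-jwt-secret` would give `NOTELETT_JWT_SECRET`, not the `JWT_SECRET` the backend reads.

```typescript
// Env var name -> AKV secret name, taken from the inventory table.
const SECRET_MAP: Record<string, string> = {
  JWT_SECRET: "notelett-jwt-secret",
  COSMOS_KEY: "bytelyst-cosmos-key",
  OPENAI_API_KEY: "notelett-openai-api-key",
};

// Wrap a value in single quotes for POSIX shells, escaping embedded single quotes as '\''.
function shellQuote(value: string): string {
  return `'${value.split("'").join(`'\\''`)}'`;
}

function exportLine(envName: string, value: string): string {
  if (!/^[A-Z_][A-Z0-9_]*$/.test(envName)) {
    throw new Error(`invalid env var name: ${envName}`);
  }
  return `export ${envName}=${shellQuote(value)}`;
}

// Render export lines for secrets that were already fetched, keyed by AKV secret name.
function renderExports(fetched: Record<string, string | undefined>): string {
  return Object.keys(SECRET_MAP)
    .map((envName) => {
      const value = fetched[SECRET_MAP[envName]];
      if (value === undefined) throw new Error(`secret ${SECRET_MAP[envName]} was not fetched`);
      return exportLine(envName, value);
    })
    .join("\n");
}
```

Failing on a missing secret, rather than exporting an empty string, keeps the process from starting with a silently blank credential.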
### Flow B — Kubernetes Secret synced from AKV
1. Use the Secrets Store CSI Driver (`secrets-store.csi.k8s.io`) with the Azure Key Vault provider to sync AKV secrets into a Kubernetes Secret.
2. Reference the K8s Secret in the Deployment via `envFrom` so values land in the container env.
3. Rotate by recreating the Pod after the secret syncs.
## Deployment Pattern — `docker-compose.yml`
The committed `docker-compose.yml` reads from the host shell env (`${OPENAI_API_KEY:-}` etc.) and from a local `.env`. For production-like single-host deploys:
1. Place secrets in a file owned by the deployer with `chmod 600`, never in git.
2. Source it before `docker compose up`:
```bash
set -a
source /etc/notelett/secrets.env
set +a
docker compose up -d
```
3. Avoid `--env-file` on the `docker compose` command line — it persists the path in process listings and is harder to rotate.
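4. After deploy, scrub any shell-history entries that reference `/etc/notelett/secrets.env` or contain secret values (`history -d <offset>`), and do not capture the output of `docker compose config` in logs: it prints the configuration with variable values already interpolated.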
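The `chmod 600` requirement from step 1 can be checked mechanically before sourcing the file. A minimal sketch, assuming a POSIX host; the function name is hypothetical.

```typescript
// True when the mode bits grant nothing to group or others (for example 0o600 or 0o400).
function isOwnerOnly(mode: number): boolean {
  return (mode & 0o077) === 0;
}
```

On a real host the mode would come from `statSync("/etc/notelett/secrets.env").mode` in `node:fs`; the file-type bits in that value do not affect the check because only the low permission bits are masked.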
## Rotation
Rotation pattern for any AKV-backed secret:
1. Update the AKV secret with a new version (`az keyvault secret set ...`).
2. Roll the backend deployment (rolling restart picks up the new value at process start).
3. For `JWT_SECRET`: rotation invalidates all outstanding access tokens. Plan for forced re-auth or implement dual-secret support before rotating in production.
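4. For `OPENAI_API_KEY` / `AZURE_OPENAI_API_KEY`: rotation is non-disruptive. In-flight LLM calls complete with the old key, and new calls use the new key once the backend restarts, provided the old key stays valid at the provider until the rollout finishes.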
5. For `COSMOS_KEY`: prefer rotating the **secondary** key first, swap the deployment to use it, then rotate the primary.
MEK rotation has its own runbook: [`MEK_ROTATION.md`](./MEK_ROTATION.md).
## Verification
After any rotation or initial deploy:
```bash
# 1. Service health.
curl "https://<backend-host>/health"
# 2. Dependency readiness (datastore + encryption + platform/extraction/MCP if configured).
curl "https://<backend-host>/api/diagnostics/readiness"
# 3. Authenticated note read (proves JWT_SECRET and Cosmos creds are wired).
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes?workspaceId=<ws>"
# 4. LLM smoke (proves OPENAI_API_KEY or AZURE_OPENAI_API_KEY is wired, if LLM_PROVIDER != mock).
curl -X POST -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
  -d '{"workspaceId":"<ws>","noteId":"<noteId>","transform":"shorten"}' \
  "https://<backend-host>/api/notes/copilot/transform"
```
If any request returns a 5xx, check the structured logs for a missing-secret error before rotating again.
## Red Flags
- A secret value appearing in `req.log` or `app.log` output. **Stop, rotate, and audit.**
- A secret committed to git. Use `git filter-repo` to scrub, force-push (coordinate with the team), and rotate the secret immediately.
- Two pods seeing different secret values. Indicates a partial K8s rollout — finish the rollout before traffic is sent to the new version.
- `FIELD_ENCRYPT_KEY_PROVIDER=memory` in production. The backend will refuse to start, but if it slips through (e.g. with `NODE_ENV` set to something other than `production`), every document encrypted while it was running becomes unrecoverable on restart.
## Open Items
- **Centralized rotation calendar.** Tracked in production-hardening backlog: schedule per-secret cadence (90 days for `OPENAI_API_KEY`, 365 days for `JWT_SECRET`, etc.).
- **Audit log integration.** Emit a `secret.rotated` event to `actiontrail` after each rotation. Currently rotation is logged only in AKV's own audit feed.
- **Dual-JWT support.** Today `JWT_SECRET` rotation invalidates outstanding tokens; planned: support `JWT_SECRET_NEXT` for graceful transitions.
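One way the planned dual-secret support could look is a verifier that accepts a token signed with either secret during the transition. The sketch below is only an illustration of that idea: the design is not settled in this runbook, the function names are hypothetical, it assumes HS256 tokens, and it checks the signature only (expiry and claim validation are out of scope).

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// HS256 signature over "<header>.<payload>", base64url-encoded.
function sign(signingInput: string, secret: string): string {
  return createHmac("sha256", secret).update(signingInput).digest("base64url");
}

function signatureMatches(token: string, secret: string): boolean {
  const parts = token.split(".");
  if (parts.length !== 3) return false;
  const expected = Buffer.from(sign(`${parts[0]}.${parts[1]}`, secret));
  const actual = Buffer.from(parts[2]);
  // timingSafeEqual requires equal lengths, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

// Accept a token signed with either the current secret or, if configured, the next one.
function verifyWithRotation(token: string, current: string, next?: string): boolean {
  return signatureMatches(token, current) || (next !== undefined && signatureMatches(token, next));
}
```

With both secrets configured on every instance before the signing secret is switched, tokens issued on either side of the switch keep verifying until the older secret is dropped.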
## References
- Config validation: [`backend/src/lib/config.ts`](../../backend/src/lib/config.ts)
- AKV-backed encryption provider: `../learning_ai_common_plat/packages/field-encrypt/src/key-provider-akv.ts`
- Shared secret resolver: `../learning_ai_common_plat/packages/config/src/akv.ts`
- Related: [`MEK_ROTATION.md`](./MEK_ROTATION.md)