Sprint B — closes audit items B4 and B5.

- docs/runbooks/MEK_ROTATION.md: step-by-step procedure for rotating the field-encrypt master key in Azure Key Vault, including pre-flight checks, rewrapAllDeks usage, verification queries, rollback, and lost-MEK recovery. Replaces the previous gap where MEK rotation had no documented operator path.
- docs/runbooks/SECRET_MANAGEMENT.md: inventory of every secret consumed by NoteLett with its production source (AKV), two production-grade patterns (workload identity vs K8s CSI), the compose-host pattern, rotation flow per secret type, verification commands, and red-flag triage.

Both docs cross-link each other and call out concrete open items (automation, dual-JWT support, audit-log emission) for later sprints rather than overstating current capabilities.

# Runbook — MEK (Master Encryption Key) Rotation

> **Owner:** Platform / Security
> **Touches:** `@bytelyst/field-encrypt`, Azure Key Vault, NoteLett backend (port 4016)
> **Risk:** Medium — operates on encrypted Cosmos documents. A failure mid-rewrap leaves DEKs wrapped under a mix of old and new MEK versions, which is recoverable but requires the old MEK to remain readable until rotation completes.

## Background

NoteLett uses envelope encryption from `@bytelyst/field-encrypt`:

- **MEK** (Master Encryption Key) lives in Azure Key Vault (or `FIELD_ENCRYPT_KEY` env in non-prod).
- **DEKs** (per-document Data Encryption Keys) are wrapped by the MEK and stored in the Cosmos `dek_store` container.
- Encrypted fields on note documents carry a `dekId` plus a `mekVersion` so the backend can find the right DEK and the right MEK version.

Rotation means: keep all encrypted document fields unchanged, generate a **new MEK version**, and **rewrap every DEK** so future reads use the new MEK. Old documents are never re-encrypted — only the DEK wrapping layer changes.

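
As a rough illustration of why rotation never touches document ciphertext, the sketch below models the two layers with Node's built-in `crypto` module. It is not the `@bytelyst/field-encrypt` implementation: the `Sealed` and `WrappedDek` shapes, the helper names, and the use of local AES-256-GCM keys in place of Key Vault key versions are all assumptions made so the example can run offline. The field's own `mekVersion` is omitted for brevity, and the version is read from the DEK record.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical shapes for illustration; the real library types may differ.
interface Sealed { iv: Buffer; tag: Buffer; data: Buffer }
interface WrappedDek { dekId: string; mekVersion: string; wrapped: Sealed }

// AES-256-GCM helpers. In production the MEK never leaves Key Vault; local
// symmetric keys stand in for it here (an assumption of this sketch).
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]);
}

// Two local MEK versions stand in for Key Vault key versions.
const mekVersions = new Map<string, Buffer>([
  ["v1", randomBytes(32)],
  ["v2", randomBytes(32)],
]);

function mekFor(version: string): Buffer {
  const key = mekVersions.get(version);
  if (!key) throw new Error(`unknown MEK version ${version}`);
  return key;
}

// Encrypt one field: the DEK protects the data, the MEK protects the DEK.
const dek = randomBytes(32);
const noteField = { dekId: "dek-1", ciphertext: seal(dek, Buffer.from("hello note", "utf8")) };
let dekRecord: WrappedDek = { dekId: "dek-1", mekVersion: "v1", wrapped: seal(mekFor("v1"), dek) };
const ciphertextBefore = noteField.ciphertext.data.toString("hex");

// Rotation: unwrap with the old MEK version, rewrap with the new one.
const unwrappedDek = unseal(mekFor(dekRecord.mekVersion), dekRecord.wrapped);
dekRecord = { dekId: dekRecord.dekId, mekVersion: "v2", wrapped: seal(mekFor("v2"), unwrappedDek) };

// Read path after rotation: the untouched field decrypts via the rewrapped DEK.
const dekAfter = unseal(mekFor(dekRecord.mekVersion), dekRecord.wrapped);
const recoveredText = unseal(dekAfter, noteField.ciphertext).toString("utf8");
```

Only `dekRecord` changes during the rotation step; `noteField` is byte-for-byte identical before and after.
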
## Pre-flight Checklist

- [ ] Confirm `FIELD_ENCRYPT_ENABLED=true` in the target environment.
- [ ] Confirm `FIELD_ENCRYPT_KEY_PROVIDER=akv` and `AZURE_KEYVAULT_URL` is set in production.
- [ ] Confirm the operator running rotation has both `keys/wrapKey` and `keys/unwrapKey` permissions on the target AKV key.
- [ ] Take a Cosmos DB snapshot or enable point-in-time restore on the `dek_store` container (and any container holding encrypted documents).
- [ ] Confirm there are no in-flight migrations from `scripts/encrypt-migrate.ts` running against this environment.
- [ ] Notify the on-call channel. Schedule rotation in a low-traffic window, because the rewrap loop invalidates DEK cache entries as it runs.

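
The first two checklist items are simple enough to script. The helper below is a hypothetical sketch, not part of the NoteLett codebase: only the environment variable names come from this runbook, and the function name and messages are invented for illustration. It does not cover permissions, backups, or in-flight migrations, which still need a manual check.

```typescript
// Hypothetical pre-flight helper; only the env var names come from this runbook.
type Env = Record<string, string | undefined>;

function preflightProblems(env: Env): string[] {
  const problems: string[] = [];
  if (env.FIELD_ENCRYPT_ENABLED !== "true") {
    problems.push("FIELD_ENCRYPT_ENABLED must be 'true'");
  }
  if (env.FIELD_ENCRYPT_KEY_PROVIDER !== "akv") {
    problems.push("FIELD_ENCRYPT_KEY_PROVIDER must be 'akv' in production");
  }
  // Only checks the general shape of the URL, not that the vault exists.
  const url = env.AZURE_KEYVAULT_URL ?? "";
  if (!/^https:\/\/[^/]+/.test(url)) {
    problems.push("AZURE_KEYVAULT_URL must be an https URL");
  }
  return problems;
}
```

In a Node process this could be called as `preflightProblems(process.env)`; an empty array means the configuration items passed.
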
## Rotation Procedure

### Step 1 — Create a new MEK version in AKV

```bash
# Azure CLI example: create a new version of the existing key.
az keyvault key create \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --kty RSA \
  --size 4096
```

The Key Vault returns a new version identifier. Record it together with the current (old) version identifier: Step 3 takes both as arguments, Step 4 verifies against the new one, and Step 5 disables the old one.

### Step 2 — Roll the backend with both MEK versions readable

Update the running deployment so that:

- `AZURE_KEYVAULT_URL` continues to point at the same vault.
- `FIELD_ENCRYPT_MEK_NAME` continues to point at the same key name (`notelett-mek`).
- The new version is now the **latest** version on the AKV key; the previous version is still **enabled** so old wrapped DEKs can be unwrapped.

The AKV-backed `KeyProvider` resolves `mekVersion` from the wrapped-DEK record, so older DEKs continue to unwrap correctly during the transition.

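
The version-selection rule described in this step can be modelled in a few lines. This is a hypothetical in-memory model, not the real `KeyProvider` API: the `KeyVersion` shape and both function names are assumptions, and choosing the newest enabled version for new wraps is a simplification made by the sketch.

```typescript
// Hypothetical in-memory model of a versioned key; not the real KeyProvider API.
interface KeyVersion { id: string; enabled: boolean; createdAt: number }

// New wraps use the newest enabled version (a simplification made by this sketch).
function versionForWrap(versions: KeyVersion[]): KeyVersion {
  const enabled = versions.filter((v) => v.enabled);
  if (enabled.length === 0) throw new Error("no enabled key version");
  return enabled.reduce((a, b) => (b.createdAt > a.createdAt ? b : a));
}

// Unwraps use the version recorded on the wrapped DEK, which must still be enabled.
function versionForUnwrap(versions: KeyVersion[], mekVersion: string): KeyVersion {
  const found = versions.find((v) => v.id === mekVersion);
  if (!found) throw new Error(`unknown MEK version ${mekVersion}`);
  if (!found.enabled) throw new Error(`MEK version ${mekVersion} is disabled`);
  return found;
}

// During the transition both versions are enabled.
const keyVersions: KeyVersion[] = [
  { id: "old", enabled: true, createdAt: 1 },
  { id: "new", enabled: true, createdAt: 2 },
];
```

With both versions enabled, `versionForWrap(keyVersions)` picks `new` while `versionForUnwrap(keyVersions, "old")` still succeeds. Disabling `old` before every DEK is rewrapped would make the second call throw, which is why the old version stays enabled until Step 5.
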
### Step 3 — Run `rewrapAllDeks`

Use the cross-product CLI from common-plat:

```bash
cd ../learning_ai_common_plat
pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
  --product notelett \
  --mode rewrap-deks \
  --old-mek-version <OLD_VERSION_ID> \
  --new-mek-version <NEW_VERSION_ID>
```

This iterates every wrapped DEK in the `dek_store` container, unwraps with the old MEK version, rewraps with the new MEK version, and writes the updated wrapped DEK back. The `DekCache` is invalidated entry-by-entry so live traffic immediately picks up the new wrap.

Progress is logged. Re-run is **idempotent**: DEKs whose `mekVersion` already matches the new target are skipped.

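
The skip-if-current behaviour is what makes a re-run safe. The loop below is a minimal sketch of that idea, not the actual `rewrapAllDeks` source: `DekRecord`, `RewrapDeps`, and `rewrapAll` are hypothetical names, the store is a plain array rather than Cosmos, and the toy wrap function prefixes the version string instead of doing real cryptography.

```typescript
// Hypothetical shapes; the real rewrapAllDeks in @bytelyst/field-encrypt may differ.
interface DekRecord { dekId: string; mekVersion: string; wrapped: string }
interface RewrapDeps {
  unwrap(wrapped: string, mekVersion: string): string;
  wrap(dek: string, mekVersion: string): string;
  invalidate(dekId: string): void;
}

function rewrapAll(records: DekRecord[], newVersion: string, deps: RewrapDeps) {
  let rewrapped = 0;
  let skipped = 0;
  for (const record of records) {
    // Idempotency: anything already on the target version is left alone.
    if (record.mekVersion === newVersion) {
      skipped++;
      continue;
    }
    const dek = deps.unwrap(record.wrapped, record.mekVersion);
    record.wrapped = deps.wrap(dek, newVersion);
    record.mekVersion = newVersion;
    // Drop the cached entry so the next read loads the new wrap.
    deps.invalidate(record.dekId);
    rewrapped++;
  }
  return { rewrapped, skipped };
}

// Toy dependencies: the "wrap" is just a version prefix, not encryption.
const invalidatedIds: string[] = [];
const toyDeps: RewrapDeps = {
  unwrap: (wrapped, v) => {
    if (!wrapped.startsWith(`${v}:`)) throw new Error("wrong MEK version");
    return wrapped.slice(v.length + 1);
  },
  wrap: (dek, v) => `${v}:${dek}`,
  invalidate: (dekId) => { invalidatedIds.push(dekId); },
};

const dekStore: DekRecord[] = [
  { dekId: "a", mekVersion: "v1", wrapped: "v1:dek-a" },
  { dekId: "b", mekVersion: "v2", wrapped: "v2:dek-b" },
];
const firstRun = rewrapAll(dekStore, "v2", toyDeps);
const secondRun = rewrapAll(dekStore, "v2", toyDeps);
```

The first run rewraps only record `a`; the second run finds nothing left to do, which is the property the runbook relies on when it says to re-run the same command after a failure.
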
### Step 4 — Verify

```bash
# 1. Confirm health and dependency readiness against the rotated environment.
curl "https://<backend-host>/health"
curl "https://<backend-host>/api/diagnostics/readiness"

# 2. Confirm at least one read of an encrypted note succeeds after rotation.
#    The URL is quoted so the shell does not interpret the "?" in the query string.
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes/<noteId>?workspaceId=<workspaceId>"

# 3. Spot-check `dek_store`: every wrapped DEK should report the new mekVersion.
# Sample query in Cosmos Data Explorer:
#   SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != "<NEW_VERSION_ID>"
# Expected: 0
```

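
If the spot-check in item 3 is scripted rather than run by hand, passing the version as a query parameter avoids pasting it into the query text. The sketch below builds such a query and adds a local per-version tally for triage. Both helper names are hypothetical, and the `{ query, parameters }` shape is assumed to match what the Cosmos SDK accepts for parameterized queries, which should be verified against the SDK version in use.

```typescript
// Hypothetical helpers; the query shape is an assumption to check against the SDK.
interface QuerySpec { query: string; parameters: { name: string; value: string }[] }

// Counts DEK records that are NOT on the new MEK version. Expected result: 0.
function staleDekCountQuery(newVersion: string): QuerySpec {
  return {
    query: "SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != @newVersion",
    parameters: [{ name: "@newVersion", value: newVersion }],
  };
}

// Local tally over already-fetched records, useful for seeing which versions remain.
function countByVersion(records: { mekVersion: string }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    counts[r.mekVersion] = (counts[r.mekVersion] ?? 0) + 1;
  }
  return counts;
}
```

A tally with more than one key after Step 3 indicates that some DEKs are still on an older version and the rewrap should be re-run before moving to Step 5.
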
### Step 5 — Disable the old MEK version

After verification and a soak period (recommended: 24 hours of normal traffic), disable the old key version in AKV so it can no longer be used to unwrap.

```bash
az keyvault key set-attributes \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --version "<OLD_VERSION_ID>" \
  --enabled false
```

Do **not** delete the old version. Keep it disabled for at least 30 days so that it can be re-enabled if a backup restore brings back DEKs that are still wrapped under it.

## Rollback

If rewrap fails partway:

1. Stop the rewrap job.
2. Leave the old MEK version **enabled** in AKV.
3. The backend continues to serve traffic — DEKs that were already rewrapped resolve via the new version, DEKs that were not yet rewrapped resolve via the old version.
4. Investigate the failure (typically AKV throttling or transient Cosmos errors), then re-run the same command. It will skip DEKs already rewrapped.

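
Throttling and transient errors are usually handled with bounded exponential backoff. Whether the rewrap CLI already retries internally is not stated in this runbook, so the following is only a sketch of the general pattern with hypothetical helper names and default values, not a description of the tool's behaviour.

```typescript
// Hypothetical helper: exponential backoff schedule with a cap and no jitter.
function backoffDelaysMs(attempts: number, baseMs = 500, capMs = 30_000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(capMs, baseMs * 2 ** i));
  }
  return delays;
}

// Retry an async operation, sleeping between failed attempts.
// Makes delays.length + 1 attempts in total, then rethrows the last error.
async function withRetry<T>(op: () => Promise<T>, delays: number[]): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt === delays.length) break;
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
  throw lastError;
}
```

Because the rewrap itself is idempotent, retrying a single DEK (or the whole command) after a throttled call does not risk double-wrapping.
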
If a document fails to decrypt after rotation:

1. Inspect the encrypted field — it carries the `dekId` it expects.
2. Inspect the `dek_store` row for that `dekId` — confirm `mekVersion` is the new version.
3. If the row is missing, restore the most recent Cosmos snapshot of `dek_store`.
4. **Never** rotate the field's `dekId` manually; the field-encrypt library owns that lifecycle.

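
The decision path in the checklist above can be written down as a small function. This is a hypothetical triage helper: the type names and outcome labels are invented here, and the suggested actions for a stale `mekVersion` and for the fall-through case are assumptions that go slightly beyond what the checklist states.

```typescript
// Hypothetical triage helper that mirrors the decrypt-failure checklist.
interface FieldRef { dekId: string }
interface DekRow { dekId: string; mekVersion: string }
type Triage = "restore-dek-store-snapshot" | "rerun-rewrap" | "check-akv-and-backend";

function triageDecryptFailure(field: FieldRef, rows: DekRow[], newVersion: string): Triage {
  const row = rows.find((r) => r.dekId === field.dekId);
  // No row for the dekId: the wrapped DEK is gone, so restore dek_store.
  if (!row) return "restore-dek-store-snapshot";
  // Row exists but is still on an older version: the rewrap did not reach it (assumed action).
  if (row.mekVersion !== newVersion) return "rerun-rewrap";
  // Row looks correct: look at key state in AKV and the backend instead (assumed action).
  return "check-akv-and-backend";
}
```

None of the outcomes involves editing the field's `dekId`, in line with the last item of the checklist.
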
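## Recovery From Lost MEK

If the AKV key (all versions) is unrecoverable:

- Every envelope-encrypted field whose DEK was wrapped under that key is unreadable in place, because the DEK can no longer be unwrapped.
- Restore from the most recent Cosmos snapshot taken **before** the loss. A snapshot helps only if its contents can still be read, for example because the affected fields were not yet encrypted in it or because a MEK that wrapped its DEKs is still available.
- Re-key from that snapshot by:
  1. Provisioning a new AKV key.
  2. Running `scripts/encrypt-migrate.ts` in `re-encrypt` mode (not `rewrap-deks`) so a fresh DEK is generated for every document.
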
## Open Items

- **Automation:** rotation is manual today. A scheduled monthly rotation via a controlled GitHub Action with AKV federated credentials is tracked in the production-hardening backlog.
- **Audit trail:** the rewrap CLI writes structured logs but does not yet emit a domain event to `actiontrail`. Tracked separately.

## References

- Package helper: `../learning_ai_common_plat/packages/field-encrypt/src/envelope.ts` — `rewrapAllDeks`
- Cross-product CLI: `../learning_ai_common_plat/scripts/encrypt-migrate.ts`
- Backend config: [`backend/src/lib/config.ts`](../../backend/src/lib/config.ts) — `FIELD_ENCRYPT_*` env vars
- Backend bootstrap: [`backend/src/lib/field-encrypt.ts`](../../backend/src/lib/field-encrypt.ts) — `initEncryption()`
- Related: [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](../DATA_MIGRATION_AND_BACKFILL_PLAN.md) — initial backfill procedure