learning_ai_notes/docs/runbooks/MEK_ROTATION.md
saravanakumardb1 bcad7d330a docs(runbooks): add MEK rotation and secret-management runbooks
Sprint B — closes audit items B4 and B5.

- docs/runbooks/MEK_ROTATION.md: step-by-step procedure for rotating
  the field-encrypt master key in Azure Key Vault, including pre-flight
  checks, rewrapAllDeks usage, verification queries, rollback, and lost-MEK
  recovery. Replaces the previous gap where MEK rotation had no
  documented operator path.
- docs/runbooks/SECRET_MANAGEMENT.md: inventory of every secret consumed
  by NoteLett with its production source (AKV), two production-grade
  patterns (workload identity vs K8s CSI), the compose-host pattern,
  rotation flow per secret type, verification commands, and red-flag
  triage.

Both docs cross-link each other and call out concrete open items
(automation, dual-JWT support, audit-log emission) for later sprints
rather than overstating current capabilities.
2026-05-22 23:23:38 -07:00


Runbook — MEK (Master Encryption Key) Rotation

Owner: Platform / Security

Touches: @bytelyst/field-encrypt, Azure Key Vault, NoteLett backend (port 4016)

Risk: Medium. Rotation operates on encrypted Cosmos documents. A failure mid-rewrap leaves DEKs wrapped under a mix of old and new MEK versions, which is recoverable but requires the old MEK to remain readable until rotation completes.

Background

NoteLett uses envelope encryption from @bytelyst/field-encrypt:

  • MEK (Master Encryption Key) lives in Azure Key Vault (or FIELD_ENCRYPT_KEY env in non-prod).
  • DEKs (per-document Data Encryption Keys) are wrapped by the MEK and stored in the Cosmos dek_store container.
  • Encrypted fields on note documents carry a dekId plus a mekVersion so the backend can find the right DEK and the right MEK version.

Rotation means: keep all encrypted document fields unchanged, generate a new MEK version, and rewrap every DEK so future reads use the new MEK. Old documents are never re-encrypted — only the DEK wrapping layer changes.
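To make the "only the wrapping layer changes" point concrete, here is a minimal local sketch of envelope encryption and a rewrap using the openssl CLI. It is an illustration under assumptions, not how @bytelyst/field-encrypt or Azure Key Vault work internally: two locally generated RSA key pairs stand in for two MEK versions, AES-256-CBC stands in for whatever field cipher the library actually uses, and all file names are made up for the example.

```bash
#!/usr/bin/env bash
# Local illustration only: RSA key files stand in for AKV key versions.
set -euo pipefail
work="$(mktemp -d)"

# Two RSA key pairs play the role of MEK version 1 and MEK version 2.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out "$work/mek_v1.pem" 2>/dev/null
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out "$work/mek_v2.pem" 2>/dev/null

# A random 256-bit DEK and IV (hex), then one "field" encrypted with the DEK.
dek_hex="$(openssl rand -hex 32)"
iv_hex="$(openssl rand -hex 16)"
printf 'note body' | openssl enc -aes-256-cbc -K "$dek_hex" -iv "$iv_hex" -out "$work/field.enc"

# Wrap the DEK under MEK v1 (RSA-OAEP).
printf '%s' "$dek_hex" | openssl pkeyutl -encrypt -inkey "$work/mek_v1.pem" \
  -pkeyopt rsa_padding_mode:oaep -out "$work/dek.v1.wrapped"

# Rewrap: unwrap with v1, wrap again with v2. field.enc is never touched.
openssl pkeyutl -decrypt -inkey "$work/mek_v1.pem" -pkeyopt rsa_padding_mode:oaep \
  -in "$work/dek.v1.wrapped" \
  | openssl pkeyutl -encrypt -inkey "$work/mek_v2.pem" -pkeyopt rsa_padding_mode:oaep \
      -out "$work/dek.v2.wrapped"

# Read path after rotation: unwrap with v2 only, then decrypt the unchanged field.
unwrapped_hex="$(openssl pkeyutl -decrypt -inkey "$work/mek_v2.pem" \
  -pkeyopt rsa_padding_mode:oaep -in "$work/dek.v2.wrapped")"
plaintext="$(openssl enc -d -aes-256-cbc -K "$unwrapped_hex" -iv "$iv_hex" -in "$work/field.enc")"
echo "$plaintext"

rm -rf "$work"
```

The encrypted field is written once and read back unchanged after the rewrap, which is why rotation cost scales with the number of DEKs rather than the number of documents.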

Pre-flight Checklist

  • Confirm FIELD_ENCRYPT_ENABLED=true in the target environment.
  • Confirm FIELD_ENCRYPT_KEY_PROVIDER=akv and AZURE_KEYVAULT_URL is set in production.
  • Confirm the operator running rotation has both keys/wrapKey and keys/unwrapKey permissions on the target AKV key.
  • Take a Cosmos DB snapshot, or confirm that point-in-time restore (continuous backup) is enabled, so that the dek_store container and any container holding encrypted documents can be restored.
  • Confirm there are no in-flight migrations from scripts/encrypt-migrate.ts running against this environment.
  • Notify the on-call channel. Schedule rotation in a low-traffic window, because the rewrap loop invalidates DEK cache entries as it runs and reads of those DEKs must unwrap again.
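The environment-variable items of this checklist can be checked mechanically. The following is a small sketch, not an existing script in the repository: the function name preflight_env_check is hypothetical, and it covers only the three variables named above. AKV permissions, backups, and in-flight migrations still need a manual check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks only the environment-variable items of the checklist.
preflight_env_check() {
  local failures=0

  if [ "${FIELD_ENCRYPT_ENABLED:-}" != "true" ]; then
    echo "FAIL: FIELD_ENCRYPT_ENABLED must be 'true'" >&2
    failures=$((failures + 1))
  fi
  if [ "${FIELD_ENCRYPT_KEY_PROVIDER:-}" != "akv" ]; then
    echo "FAIL: FIELD_ENCRYPT_KEY_PROVIDER must be 'akv'" >&2
    failures=$((failures + 1))
  fi
  if [ -z "${AZURE_KEYVAULT_URL:-}" ]; then
    echo "FAIL: AZURE_KEYVAULT_URL is not set" >&2
    failures=$((failures + 1))
  fi

  if [ "$failures" -eq 0 ]; then
    echo "pre-flight env checks passed"
    return 0
  fi
  return 1
}
```

Run it in a shell that has the same environment as the target deployment, and stop if it returns non-zero.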

Rotation Procedure

Step 1 — Create a new MEK version in AKV

# Azure CLI example: create a new version of the existing key.
az keyvault key create \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --kty RSA \
  --size 4096

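Key Vault returns a new version identifier. Record it, along with the identifier of the version that was current before this step: you will need both for the rewrap command in Step 3, and the new one to verify rotation in Step 4.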
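The version identifier is the last path segment of the key identifier (kid) that Key Vault reports. A minimal sketch for extracting it follows. The helper name mek_version_from_kid is hypothetical, and the az call in the comment assumes an authenticated Azure CLI session.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the version segment of a Key Vault key identifier,
# e.g. https://<vault>.vault.azure.net/keys/notelett-mek/<version>
mek_version_from_kid() {
  local kid="${1%/}"          # drop a trailing slash, if any
  printf '%s\n' "${kid##*/}"  # keep everything after the last slash
}

# Example usage (not run here; requires Azure CLI access to the vault):
#   kid="$(az keyvault key show --vault-name "<vault-name>" --name "notelett-mek" \
#            --query key.kid -o tsv)"
#   NEW_VERSION_ID="$(mek_version_from_kid "$kid")"
```

Running the same lookup before Step 1 gives the old version identifier.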

Step 2 — Roll the backend with both MEK versions readable

Update the running deployment so that:

  • AZURE_KEYVAULT_URL continues to point at the same vault.
  • FIELD_ENCRYPT_MEK_NAME continues to point at the same key name (notelett-mek).
  • The new version is now the latest version on the AKV key; the previous version is still enabled so old wrapped DEKs can be unwrapped.

The AKV-backed KeyProvider resolves mekVersion from the wrapped-DEK record, so older DEKs continue to unwrap correctly during the transition.
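Before starting the rewrap, it helps to confirm that both versions really are enabled. The sketch below parses the output of az keyvault key list-versions. The function name versions_enabled is hypothetical, and the parser accepts either capitalization of the boolean because the tab-separated output format is not something this runbook specifies.

```bash
#!/usr/bin/env bash
# Reads "kid<TAB>enabled" lines on stdin. Succeeds only if every version ID
# passed as an argument is present in the listing and enabled.
versions_enabled() {
  local listing kid enabled version found
  listing="$(cat)"
  for version in "$@"; do
    found=no
    while IFS=$'\t' read -r kid enabled; do
      if [ "${kid##*/}" = "$version" ]; then
        case "$enabled" in
          [Tt]rue) found=yes ;;
        esac
      fi
    done <<< "$listing"
    if [ "$found" != "yes" ]; then
      echo "version $version is missing or disabled" >&2
      return 1
    fi
  done
  return 0
}

# Example usage (not run here; requires Azure CLI access to the vault):
#   az keyvault key list-versions --vault-name "<vault-name>" --name "notelett-mek" \
#     --query "[].[kid, attributes.enabled]" -o tsv \
#     | versions_enabled "<OLD_VERSION_ID>" "<NEW_VERSION_ID>"
```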

Step 3 — Run rewrapAllDeks

Use the cross-product CLI from common-plat:

cd ../learning_ai_common_plat
pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
  --product notelett \
  --mode rewrap-deks \
  --old-mek-version <OLD_VERSION_ID> \
  --new-mek-version <NEW_VERSION_ID>

This iterates every wrapped DEK in the dek_store container, unwraps with the old MEK version, rewraps with the new MEK version, and writes the updated wrapped DEK back. The DekCache is invalidated entry-by-entry so live traffic immediately picks up the new wrap.

Progress is logged. Re-run is idempotent: DEKs whose mekVersion already matches the new target are skipped.
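The skip rule can be illustrated with a small filter. This is a sketch under a stated assumption: it operates on a hypothetical tab-separated export of dek_store with one "dekId<TAB>mekVersion" line per record, which is not a format the CLI is documented to produce. It prints the DEKs a re-run would still have to process.

```bash
#!/usr/bin/env bash
# pending_deks <target_mek_version>
# Reads "dekId<TAB>mekVersion" lines on stdin and prints the dekId of every
# record whose mekVersion differs from the target. An empty result means a
# re-run would skip everything.
pending_deks() {
  local target="$1"
  # Appending "" forces a string comparison, so all-digit version IDs are
  # never compared numerically.
  awk -F '\t' -v target="$target" '($2 "") != (target "") { print $1 }'
}
```

For example, feeding it three records of which one already carries the target version prints the other two dekId values.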

Step 4 — Verify

# 1. Confirm health and dependency readiness against the rotated environment.
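# -f makes curl exit non-zero on HTTP 4xx/5xx, so a failing endpoint is not mistaken for success.
curl -fsS "https://<backend-host>/health"
curl -fsS "https://<backend-host>/api/diagnostics/readiness"

# 2. Confirm at least one read of an encrypted note succeeds after rotation.
# The URL is quoted so the shell does not interpret "?" or the placeholder brackets.
curl -fsS -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes/<noteId>?workspaceId=<workspaceId>"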

# 3. Spot-check `dek_store`: every wrapped DEK should report the new mekVersion.
# Sample query in Cosmos Data Explorer:
#   SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != "<NEW_VERSION_ID>"
# Expected: 0
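If the endpoint checks are scripted, the HTTP status should be examined explicitly rather than relying on curl returning at all. A minimal sketch follows. The function name endpoint_ok and the 10-second timeout are illustrative choices, not part of the existing tooling.

```bash
#!/usr/bin/env bash
# endpoint_ok <url>
# Returns 0 only when the endpoint answers with an HTTP 2xx status.
endpoint_ok() {
  local url="$1" status
  status="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || return 1
  case "$status" in
    2??) return 0 ;;
    *)   echo "unexpected HTTP $status from $url" >&2; return 1 ;;
  esac
}

# Example usage (not run here):
#   endpoint_ok "https://<backend-host>/health" \
#     && endpoint_ok "https://<backend-host>/api/diagnostics/readiness"
```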

Step 5 — Disable the old MEK version

After verification and a soak period (recommended: 24 hours of normal traffic), disable the old key version in AKV so it can no longer be used to unwrap.

az keyvault key set-attributes \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --version "<OLD_VERSION_ID>" \
  --enabled false

Do not delete the old version. Keep it disabled for at least 30 days: a disabled version can be re-enabled, so a backup restore that contains DEKs still wrapped under it can be recovered.
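The two waiting periods are easy to get wrong by hand, so a small guard can be useful before running the disable command. The sketch below only does the arithmetic. The function names are hypothetical, and the timestamps are Unix epoch seconds that the operator records themselves.

```bash
#!/usr/bin/env bash
SOAK_SECONDS=$((24 * 60 * 60))            # 24-hour soak before disabling
RETENTION_SECONDS=$((30 * 24 * 60 * 60))  # keep the disabled version at least 30 days

# soak_elapsed <rewrap_finished_epoch> [now_epoch]
# Succeeds once at least 24 hours have passed since the rewrap finished.
soak_elapsed() {
  local finished="$1" now="${2:-$(date +%s)}"
  [ $((now - finished)) -ge "$SOAK_SECONDS" ]
}

# earliest_delete_epoch <disabled_at_epoch>
# Prints the earliest time at which deleting the old version could be considered.
earliest_delete_epoch() {
  echo $(( $1 + RETENTION_SECONDS ))
}

# Example usage (not run here):
#   soak_elapsed "$REWRAP_FINISHED_EPOCH" || { echo "soak period not over" >&2; exit 1; }
```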

Rollback

If rewrap fails partway:

  1. Stop the rewrap job.
  2. Leave the old MEK version enabled in AKV.
  3. The backend continues to serve traffic — DEKs that were already rewrapped resolve via the new version, DEKs that were not yet rewrapped resolve via the old version.
  4. Investigate the failure (typically AKV throttling or transient Cosmos errors), then re-run the same command. It will skip DEKs already rewrapped.
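Because the re-run is idempotent, transient failures such as throttling can be handled by wrapping the rewrap command in a retry loop. The following is a generic sketch with exponential backoff. The function name retry_with_backoff and the attempt and delay values are illustrative, not part of the existing CLI.

```bash
#!/usr/bin/env bash
# retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
# Runs the command until it succeeds, doubling the delay after each failure.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example usage (not run here), wrapping the Step 3 command:
#   retry_with_backoff 5 30 pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
#     --product notelett --mode rewrap-deks \
#     --old-mek-version "<OLD_VERSION_ID>" --new-mek-version "<NEW_VERSION_ID>"
```

Retrying blindly is only appropriate for transient errors. If the same DEK fails on every attempt, stop and investigate before re-running.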

If a document fails to decrypt after rotation:

  1. Inspect the encrypted field — it carries the dekId it expects.
  2. Inspect the dek_store row for that dekId — confirm mekVersion is the new version.
  3. If the row is missing, restore the most recent Cosmos snapshot of dek_store.
  4. Never rotate the field's dekId manually; the field-encrypt library owns that lifecycle.

Recovery From Lost MEK

If the AKV key (all versions) is unrecoverable:

  • All envelope-encrypted documents are unrecoverable.
  • Restore from the most recent Cosmos snapshot taken before the loss.
  • Re-key from that snapshot by:
    1. Provisioning a new AKV key.
    2. Running scripts/encrypt-migrate.ts in re-encrypt mode (not rewrap-deks) so a fresh DEK is generated for every document.

Open Items

  • Automation: rotation is manual today. A scheduled monthly rotation via a controlled GitHub Action with AKV federated credentials is tracked in the production-hardening backlog.
  • Audit trail: the rewrap CLI writes structured logs but does not yet emit a domain event to actiontrail. Tracked separately.

References