Sprint B: closes audit items B4 and B5.
- docs/runbooks/MEK_ROTATION.md: step-by-step procedure for rotating the field-encrypt master key in Azure Key Vault, including pre-flight checks, rewrapAllDeks usage, verification queries, rollback, and lost-MEK recovery. Replaces the previous gap where MEK rotation had no documented operator path.
- docs/runbooks/SECRET_MANAGEMENT.md: inventory of every secret consumed by NoteLett with its production source (AKV), two production-grade patterns (workload identity vs K8s CSI), the compose-host pattern, rotation flow per secret type, verification commands, and red-flag triage.
Both docs cross-link each other and call out concrete open items (automation, dual-JWT support, audit-log emission) for later sprints rather than overstating current capabilities.
Runbook — MEK (Master Encryption Key) Rotation
Owner: Platform / Security
Touches: @bytelyst/field-encrypt, Azure Key Vault, NoteLett backend (port 4016)
Risk: Medium. Rotation operates on encrypted Cosmos documents. A failure mid-rewrap leaves DEKs wrapped under a mix of old and new MEK versions, which is recoverable but requires the old MEK to remain readable until rotation completes.
Background
NoteLett uses envelope encryption from @bytelyst/field-encrypt:
- MEK (Master Encryption Key) lives in Azure Key Vault (or FIELD_ENCRYPT_KEY env in non-prod).
- DEKs (per-document Data Encryption Keys) are wrapped by the MEK and stored in the Cosmos dek_store container.
- Encrypted fields on note documents carry a dekId plus a mekVersion so the backend can find the right DEK and the right MEK version.
Rotation means: keep all encrypted document fields unchanged, generate a new MEK version, and rewrap every DEK so future reads use the new MEK. Old documents are never re-encrypted — only the DEK wrapping layer changes.
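The relationship between field ciphertext, DEKs, and MEK versions can be sketched with Node's built-in crypto module. This is an illustration only, not the @bytelyst/field-encrypt implementation: the type names, function names, the RSA-2048 local key pair standing in for the AKV key, and the AES-256-GCM field cipher are all assumptions made for the example.

```typescript
import {
  createCipheriv,
  createDecipheriv,
  generateKeyPairSync,
  privateDecrypt,
  publicEncrypt,
  randomBytes,
  KeyObject,
} from "node:crypto";

// Hypothetical shapes for illustration; not the library's real types.
interface Mek { version: string; publicKey: KeyObject; privateKey: KeyObject }
interface WrappedDek { dekId: string; mekVersion: string; wrapped: Buffer }
interface EncryptedField { dekId: string; iv: Buffer; tag: Buffer; data: Buffer }

// Local RSA key pair standing in for one AKV key version (assumption).
function newMek(version: string): Mek {
  const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });
  return { version, publicKey, privateKey };
}

// A DEK is a random 256-bit symmetric key.
function generateDek(): Buffer {
  return randomBytes(32);
}

// Encrypt a field value with the DEK (AES-256-GCM).
function encryptField(plaintext: string, dekId: string, dek: Buffer): EncryptedField {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", dek, iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { dekId, iv, tag: cipher.getAuthTag(), data };
}

function decryptField(field: EncryptedField, dek: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", dek, field.iv);
  decipher.setAuthTag(field.tag);
  return Buffer.concat([decipher.update(field.data), decipher.final()]).toString("utf8");
}

// Wrap the DEK under a MEK version (Node defaults to OAEP padding here).
function wrapDek(dekId: string, dek: Buffer, mek: Mek): WrappedDek {
  return { dekId, mekVersion: mek.version, wrapped: publicEncrypt(mek.publicKey, dek) };
}

function unwrapDek(record: WrappedDek, mek: Mek): Buffer {
  if (record.mekVersion !== mek.version) {
    throw new Error(`DEK ${record.dekId} is wrapped under ${record.mekVersion}, not ${mek.version}`);
  }
  return privateDecrypt(mek.privateKey, record.wrapped);
}

// Rotation touches only the wrapping layer: the DEK bytes and the field
// ciphertext stay exactly as they were.
function rewrapDek(record: WrappedDek, oldMek: Mek, newMek: Mek): WrappedDek {
  return wrapDek(record.dekId, unwrapDek(record, oldMek), newMek);
}
```

In this sketch, a field encrypted before rotation still decrypts afterwards: `decryptField(field, unwrapDek(rewrapDek(w, oldMek, newMek), newMek))` returns the original plaintext, and `field` itself is never rewritten.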
Pre-flight Checklist
- Confirm FIELD_ENCRYPT_ENABLED=true in the target environment.
- Confirm FIELD_ENCRYPT_KEY_PROVIDER=akv and AZURE_KEYVAULT_URL is set in production.
- Confirm the operator running rotation has both keys/wrapKey and keys/unwrapKey permissions on the target AKV key.
- Take a Cosmos DB snapshot or enable point-in-time restore on the dek_store container (and any container holding encrypted documents).
- Confirm there are no in-flight migrations from scripts/encrypt-migrate.ts running against this environment.
- Notify the on-call channel; schedule rotation in a low-traffic window because the rewrap loop invalidates DEK cache entries while it runs.
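The environment-variable items in the checklist can be checked mechanically before rotation starts. The helper below is a minimal sketch, not part of the NoteLett backend: the function name and message strings are hypothetical, and it covers only the three variables named above. AKV permissions, snapshots, and in-flight migrations still need manual confirmation.

```typescript
// Hypothetical helper: returns a list of problems, empty when the
// environment-variable part of the pre-flight checklist passes.
type Env = Record<string, string | undefined>;

function preflightEnvIssues(env: Env): string[] {
  const issues: string[] = [];
  if (env["FIELD_ENCRYPT_ENABLED"] !== "true") {
    issues.push("FIELD_ENCRYPT_ENABLED must be 'true'");
  }
  if (env["FIELD_ENCRYPT_KEY_PROVIDER"] !== "akv") {
    issues.push("FIELD_ENCRYPT_KEY_PROVIDER must be 'akv'");
  }
  if (!env["AZURE_KEYVAULT_URL"]) {
    issues.push("AZURE_KEYVAULT_URL must be set");
  }
  return issues;
}
```

Under these assumptions, calling `preflightEnvIssues(process.env)` from a Node script in the target environment returns an empty array when all three settings are in place.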
Rotation Procedure
Step 1 — Create a new MEK version in AKV
# Azure CLI example: create a new version of the existing key.
az keyvault key create \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --kty RSA \
  --size 4096
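Key Vault returns a new version identifier. Record it: you will need it as <NEW_VERSION_ID> in Step 3 and to verify rotation in Step 4. Also note the identifier of the previous version, which Steps 3 and 5 refer to as <OLD_VERSION_ID>.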
Step 2 — Roll the backend with both MEK versions readable
Update the running deployment so that:
- AZURE_KEYVAULT_URL continues to point at the same vault.
- FIELD_ENCRYPT_MEK_NAME continues to point at the same key name (notelett-mek).
- The new version is now the latest version on the AKV key; the previous version is still enabled so old wrapped DEKs can be unwrapped.
The AKV-backed KeyProvider resolves mekVersion from the wrapped-DEK record, so older DEKs continue to unwrap correctly during the transition.
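The per-record version lookup described above can be illustrated with a small model. This is a sketch under assumptions, not the KeyProvider's actual code: the types and the `resolveMekVersion` name are hypothetical, and the `enabled` flag stands in for the AKV key version's enabled attribute.

```typescript
// Hypothetical model of the key versions visible on the AKV key.
interface MekVersionInfo { version: string; enabled: boolean }
interface WrappedDekRecord { dekId: string; mekVersion: string }

// Pick the MEK version named on the wrapped-DEK record, not "latest".
// Unwrap fails if that version is unknown or has been disabled.
function resolveMekVersion(
  record: WrappedDekRecord,
  versions: MekVersionInfo[],
): MekVersionInfo {
  const match = versions.find((v) => v.version === record.mekVersion);
  if (!match) {
    throw new Error(`MEK version ${record.mekVersion} not found for DEK ${record.dekId}`);
  }
  if (!match.enabled) {
    throw new Error(`MEK version ${record.mekVersion} is disabled; DEK ${record.dekId} cannot be unwrapped`);
  }
  return match;
}
```

The model shows why the previous version must stay enabled until every DEK has been rewrapped: a record still pointing at the old version resolves only while that version is enabled.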
Step 3 — Run rewrapAllDeks
Use the cross-product CLI from common-plat:
cd ../learning_ai_common_plat
pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
  --product notelett \
  --mode rewrap-deks \
  --old-mek-version <OLD_VERSION_ID> \
  --new-mek-version <NEW_VERSION_ID>
This iterates every wrapped DEK in the dek_store container, unwraps with the old MEK version, rewraps with the new MEK version, and writes the updated wrapped DEK back. The DekCache is invalidated entry-by-entry so live traffic immediately picks up the new wrap.
Progress is logged. Re-running the command is idempotent: DEKs whose mekVersion already matches the new target are skipped.
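The skip-if-already-rewrapped behaviour is what makes a re-run safe. The loop below is a simplified in-memory sketch of that idea, not the real rewrapAllDeks: the `Map` stands in for the dek_store container, the `rewrap` callback stands in for the AKV unwrap and wrap calls, and all names are hypothetical.

```typescript
// Hypothetical record shape; the real dek_store schema may differ.
interface DekRecord { dekId: string; mekVersion: string; wrapped: string }
interface RewrapSummary { rewrapped: number; skipped: number }

// Stand-in for "unwrap with the record's MEK version, wrap with the target".
type RewrapFn = (record: DekRecord, targetVersion: string) => string;

function rewrapAllSketch(
  store: Map<string, DekRecord>,
  targetVersion: string,
  rewrap: RewrapFn,
): RewrapSummary {
  const summary: RewrapSummary = { rewrapped: 0, skipped: 0 };
  // Snapshot the records first so writes do not affect the iteration.
  for (const record of Array.from(store.values())) {
    if (record.mekVersion === targetVersion) {
      summary.skipped += 1; // already on the target version: nothing to do
      continue;
    }
    store.set(record.dekId, {
      dekId: record.dekId,
      mekVersion: targetVersion,
      wrapped: rewrap(record, targetVersion),
    });
    summary.rewrapped += 1;
  }
  return summary;
}
```

Running the sketch twice against the same store rewraps every stale record on the first pass and skips all of them on the second, which mirrors the behaviour to expect when re-running after a partial failure.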
Step 4 — Verify
# 1. Confirm health and dependency readiness against the rotated environment.
curl https://<backend-host>/health
curl https://<backend-host>/api/diagnostics/readiness
# 2. Confirm at least one read of an encrypted note succeeds after rotation.
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes/<noteId>?workspaceId=<workspaceId>"
# 3. Spot-check `dek_store`: every wrapped DEK should report the new mekVersion.
# Sample query in Cosmos Data Explorer:
# SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != "<NEW_VERSION_ID>"
# Expected: 0
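When the count in check 3 is not zero, a breakdown by version shows how far the rewrap got. The helpers below are a sketch that assumes the dek_store rows have already been exported into memory; the row shape and function names are hypothetical.

```typescript
// Hypothetical row shape for exported dek_store documents.
interface DekRow { dekId: string; mekVersion: string }

// How many wrapped DEKs sit on each MEK version.
function countByMekVersion(rows: DekRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    counts.set(row.mekVersion, (counts.get(row.mekVersion) ?? 0) + 1);
  }
  return counts;
}

// IDs of DEKs still wrapped under anything other than the new version.
function staleDekIds(rows: DekRow[], newVersion: string): string[] {
  return rows.filter((row) => row.mekVersion !== newVersion).map((row) => row.dekId);
}
```

An empty result from `staleDekIds(rows, "<NEW_VERSION_ID>")` corresponds to the expected count of 0 from the Cosmos query.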
Step 5 — Disable the old MEK version
After verification and a soak period (recommended: 24 hours of normal traffic), disable the old key version in AKV so it can no longer be used to unwrap.
az keyvault key set-attributes \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --version "<OLD_VERSION_ID>" \
  --enabled false
Do not delete the old version — keep it disabled for at least 30 days so a backup restore that contains old wrapped DEKs can still be recovered.
Rollback
If rewrap fails partway:
- Stop the rewrap job.
- Leave the old MEK version enabled in AKV.
- The backend continues to serve traffic — DEKs that were already rewrapped resolve via the new version, DEKs that were not yet rewrapped resolve via the old version.
- Investigate the failure (typically AKV throttling or transient Cosmos errors), then re-run the same command. It will skip DEKs already rewrapped.
If a document fails to decrypt after rotation:
- Inspect the encrypted field; it carries the dekId it expects.
- Inspect the dek_store row for that dekId and confirm mekVersion is the new version.
- If the row is missing, restore the most recent Cosmos snapshot of dek_store.
- Never rotate the field's dekId manually; the field-encrypt library owns that lifecycle.
Recovery From Lost MEK
If the AKV key (all versions) is unrecoverable:
- All envelope-encrypted documents are unrecoverable.
- Restore from the most recent Cosmos snapshot taken before the loss.
- Re-key from that snapshot by:
  - Provisioning a new AKV key.
  - Running scripts/encrypt-migrate.ts in re-encrypt mode (not rewrap-deks) so a fresh DEK is generated for every document.
Open Items
- Automation: rotation is manual today. A scheduled monthly rotation via a controlled GitHub Action with AKV federated credentials is tracked in the production-hardening backlog.
- Audit trail: the rewrap CLI writes structured logs but does not yet emit a domain event to actiontrail. Tracked separately.
References
- Package helper: ../learning_ai_common_plat/packages/field-encrypt/src/envelope.ts (rewrapAllDeks)
- Cross-product CLI: ../learning_ai_common_plat/scripts/encrypt-migrate.ts
- Backend config: backend/src/lib/config.ts (FIELD_ENCRYPT_* env vars)
- Backend bootstrap: backend/src/lib/field-encrypt.ts (initEncryption())
- Related: docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md (initial backfill procedure)