# Runbook — MEK (Master Encryption Key) Rotation

> **Owner:** Platform / Security
> **Touches:** `@bytelyst/field-encrypt`, Azure Key Vault, NoteLett backend (port 4016)
> **Risk:** Medium — operates on encrypted Cosmos documents. A failure mid-rewrap leaves DEKs wrapped under a mix of old and new MEK versions, which is recoverable but requires the old MEK to remain readable until rotation completes.

## Background

NoteLett uses envelope encryption from `@bytelyst/field-encrypt`:

- **MEK** (Master Encryption Key) lives in Azure Key Vault (or `FIELD_ENCRYPT_KEY` env in non-prod).
- **DEKs** (per-document Data Encryption Keys) are wrapped by the MEK and stored in the Cosmos `dek_store` container.
- Encrypted fields on note documents carry a `dekId` plus a `mekVersion` so the backend can find the right DEK and the right MEK version.

Rotation means: keep all encrypted document fields unchanged, generate a **new MEK version**, and **rewrap every DEK** so future reads use the new MEK. Old documents are never re-encrypted — only the DEK wrapping layer changes.

## Pre-flight Checklist

- [ ] Confirm `FIELD_ENCRYPT_ENABLED=true` in the target environment.
- [ ] Confirm `FIELD_ENCRYPT_KEY_PROVIDER=akv` and `AZURE_KEYVAULT_URL` is set in production.
- [ ] Confirm the operator running rotation has both `keys/wrapKey` and `keys/unwrapKey` permissions on the target AKV key.
- [ ] Take a Cosmos DB snapshot or enable point-in-time restore on the `dek_store` container (and any container holding encrypted documents).
- [ ] Confirm there are no in-flight migrations from `scripts/encrypt-migrate.ts` running against this environment.
- [ ] Notify the on-call channel; schedule rotation in a low-traffic window because the rewrap loop invalidates DEK cache entries while it runs.

## Rotation Procedure

### Step 1 — Create a new MEK version in AKV

```bash
# Azure CLI example: create a new version of the existing key.
# Creating a key under an existing name adds a new version to that key.
az keyvault key create \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --kty RSA \
  --size 4096
```

Key Vault returns a new version identifier. Record it along with the previous version's identifier: Step 3 takes both, Step 4 verifies against the new one, and Step 5 disables the old one.

### Step 2 — Roll the backend with both MEK versions readable

Update the running deployment so that:

- `AZURE_KEYVAULT_URL` continues to point at the same vault.
- `FIELD_ENCRYPT_MEK_NAME` continues to point at the same key name (`notelett-mek`).
- The new version is now the **latest** version on the AKV key; the previous version is still **enabled** so old wrapped DEKs can be unwrapped.

The AKV-backed `KeyProvider` resolves `mekVersion` from the wrapped-DEK record, so older DEKs continue to unwrap correctly during the transition.

### Step 3 — Run `rewrapAllDeks`

Use the cross-product CLI from common-plat:

```bash
cd ../learning_ai_common_plat
pnpm --filter @bytelyst/scripts run encrypt-migrate -- \
  --product notelett \
  --mode rewrap-deks \
  --old-mek-version "<old-mek-version>" \
  --new-mek-version "<new-mek-version>"
```

This iterates over every wrapped DEK in the `dek_store` container, unwraps it with the old MEK version, rewraps it with the new MEK version, and writes the updated wrapped DEK back. The `DekCache` is invalidated entry by entry so live traffic immediately picks up the new wrap.

Progress is logged. Re-running is **idempotent**: DEKs whose `mekVersion` already matches the new target are skipped.

### Step 4 — Verify

```bash
# 1. Confirm health and dependency readiness against the rotated environment.
curl "https://<backend-host>/health"
curl "https://<backend-host>/api/diagnostics/readiness"

# 2. Confirm at least one read of an encrypted note succeeds after rotation.
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes/<note-id>?workspaceId=<workspace-id>"

# 3. Spot-check `dek_store`: every wrapped DEK should report the new mekVersion.
# Sample query in Cosmos Data Explorer:
#   SELECT VALUE COUNT(1) FROM c WHERE c.mekVersion != "<new-mek-version>"
# Expected: 0
```

### Step 5 — Disable the old MEK version

After verification and a soak period (recommended: 24 hours of normal traffic), disable the old key version in AKV so it can no longer be used to unwrap.

```bash
az keyvault key set-attributes \
  --vault-name "<vault-name>" \
  --name "notelett-mek" \
  --version "<old-mek-version>" \
  --enabled false
```

Do **not** delete the old version. Keep it disabled for at least 30 days so that it can be re-enabled if a backup restore brings back DEKs still wrapped under it.

## Rollback

If rewrap fails partway:

1. Stop the rewrap job.
2. Leave the old MEK version **enabled** in AKV.
3. The backend continues to serve traffic: DEKs that were already rewrapped resolve via the new version, and DEKs that were not yet rewrapped resolve via the old version.
4. Investigate the failure (typically AKV throttling or transient Cosmos errors), then re-run the same command. It will skip DEKs already rewrapped.

If a document fails to decrypt after rotation:

1. Inspect the encrypted field — it carries the `dekId` it expects.
2. Inspect the `dek_store` row for that `dekId` — confirm `mekVersion` is the new version.
3. If the row is missing, restore the most recent Cosmos snapshot of `dek_store`.
4. **Never** rotate the field's `dekId` manually; the field-encrypt library owns that lifecycle.

## Recovery From Lost MEK

If the AKV key (all versions) is unrecoverable:

- All envelope-encrypted documents are unrecoverable.
- Restore from the most recent Cosmos snapshot taken **before** the loss.
- Re-key from that snapshot by:
  1. Provisioning a new AKV key.
  2. Running `scripts/encrypt-migrate.ts` in `re-encrypt` mode (not `rewrap-deks`) so a fresh DEK is generated for every document.

## Open Items

- **Automation:** rotation is manual today. A scheduled monthly rotation via a controlled GitHub Action with AKV federated credentials is tracked in the production-hardening backlog.
- **Audit trail:** the rewrap CLI writes structured logs but does not yet emit a domain event to `actiontrail`. Tracked separately.

## References

- Package helper: `../learning_ai_common_plat/packages/field-encrypt/src/envelope.ts` — `rewrapAllDeks`
- Cross-product CLI: `../learning_ai_common_plat/scripts/encrypt-migrate.ts`
- Backend config: [`backend/src/lib/config.ts`](../backend/src/lib/config.ts) — `FIELD_ENCRYPT_*` env vars
- Backend bootstrap: [`backend/src/lib/field-encrypt.ts`](../backend/src/lib/field-encrypt.ts) — `initEncryption()`
- Related: [`docs/DATA_MIGRATION_AND_BACKFILL_PLAN.md`](../DATA_MIGRATION_AND_BACKFILL_PLAN.md) — initial backfill procedure
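
## Appendix: Rewrap Loop Sketch

The skip-if-already-rewrapped behaviour from Step 3, and the "re-run the same command" recovery from the Rollback section, can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the real `rewrapAllDeks` implementation: `WrappedDek`, `KeyProvider`, `RewrapSummary`, `rewrapAll`, and both toy providers are hypothetical names introduced here, and the code runs synchronously whereas real Key Vault and Cosmos calls are asynchronous.

```typescript
// Hypothetical sketch: these types and names are illustrative assumptions,
// not the actual @bytelyst/field-encrypt API.
interface WrappedDek {
  dekId: string;
  mekVersion: string; // MEK version the DEK is currently wrapped under
  wrapped: string;    // wrapped key material (a plain string in this toy model)
}

interface KeyProvider {
  wrap(plainDek: string, mekVersion: string): string;
  unwrap(wrapped: string, mekVersion: string): string;
}

interface RewrapSummary {
  rewrapped: number;
  skipped: number;
  failed: string[]; // dekIds left on the old version for the next run
}

// Toy stand-in for Key Vault wrapKey/unwrapKey: "wrapping" only prefixes
// the MEK version. It performs no real cryptography.
const toyProvider: KeyProvider = {
  wrap: (plainDek, mekVersion) => `${mekVersion}:${plainDek}`,
  unwrap: (wrapped, mekVersion) => {
    const prefix = `${mekVersion}:`;
    if (wrapped.slice(0, prefix.length) !== prefix) {
      throw new Error(`DEK is not wrapped under MEK version ${mekVersion}`);
    }
    return wrapped.slice(prefix.length);
  },
};

function rewrapAll(
  store: WrappedDek[],
  provider: KeyProvider,
  newVersion: string,
  invalidateCache: (dekId: string) => void,
): RewrapSummary {
  const summary: RewrapSummary = { rewrapped: 0, skipped: 0, failed: [] };
  for (const record of store) {
    // Idempotency: a DEK already on the target version is left untouched.
    if (record.mekVersion === newVersion) {
      summary.skipped += 1;
      continue;
    }
    try {
      const plainDek = provider.unwrap(record.wrapped, record.mekVersion);
      const rewrapped = provider.wrap(plainDek, newVersion);
      // Update the wrap and its version together so the record never claims
      // a version that does not match its wrapped material.
      record.wrapped = rewrapped;
      record.mekVersion = newVersion;
      invalidateCache(record.dekId);
      summary.rewrapped += 1;
    } catch {
      // Leave the record on the old version; a later run picks it up.
      summary.failed.push(record.dekId);
    }
  }
  return summary;
}

// Usage: one simulated transient failure, then a clean re-run.
let failuresLeft = 1;
const flakyProvider: KeyProvider = {
  wrap: toyProvider.wrap,
  unwrap: (wrapped, mekVersion) => {
    if (failuresLeft > 0) {
      failuresLeft -= 1;
      throw new Error("simulated throttling");
    }
    return toyProvider.unwrap(wrapped, mekVersion);
  },
};

const store: WrappedDek[] = [
  { dekId: "dek-1", mekVersion: "v1", wrapped: toyProvider.wrap("key-a", "v1") },
  { dekId: "dek-2", mekVersion: "v1", wrapped: toyProvider.wrap("key-b", "v1") },
  { dekId: "dek-3", mekVersion: "v2", wrapped: toyProvider.wrap("key-c", "v2") },
];
const invalidated: string[] = [];
const onInvalidate = (dekId: string): void => {
  invalidated.push(dekId);
};

const first = rewrapAll(store, flakyProvider, "v2", onInvalidate);
// first:  { rewrapped: 1, skipped: 1, failed: ["dek-1"] }
const second = rewrapAll(store, flakyProvider, "v2", onInvalidate);
// second: { rewrapped: 1, skipped: 2, failed: [] }
console.log(first, second);
```

In this model a failed record simply stays on its old version, which is why the old MEK version has to remain enabled until a run reports no failures and the Step 4 count query returns 0.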