learning_ai_notes/docs/runbooks/SECRET_MANAGEMENT.md
saravanakumardb1 bcad7d330a docs(runbooks): add MEK rotation and secret-management runbooks
Sprint B — closes audit items B4 and B5.

- docs/runbooks/MEK_ROTATION.md: step-by-step procedure for rotating
  the field-encrypt master key in Azure Key Vault, including pre-flight
  checks, rewrapAllDeks usage, verification queries, rollback, and lost-MEK
  recovery. Replaces the previous gap where MEK rotation had no
  documented operator path.
- docs/runbooks/SECRET_MANAGEMENT.md: inventory of every secret consumed
  by NoteLett with its production source (AKV), two production-grade
  patterns (workload identity vs K8s CSI), the compose-host pattern,
  rotation flow per secret type, verification commands, and red-flag
  triage.

Both docs cross-link each other and call out concrete open items
(automation, dual-JWT support, audit-log emission) for later sprints
rather than overstating current capabilities.
2026-05-22 23:23:38 -07:00

# Runbook — Secret Management for NoteLett
> **Owner:** Platform / Security
> **Touches:** backend container (port 4016), web container (port 3000), Azure Key Vault, deployment platform
> **Audience:** anyone deploying NoteLett to a non-development environment
## Principles
- Never commit a secret to git, never bake one into a Docker image.
- Every secret has exactly one source of truth — Azure Key Vault (AKV) in production.
- The container reads secrets at process start, never from disk on the runtime host.
- Rotation needs only a rolling restart after the secret is updated. `JWT_SECRET` is the exception: rotating it invalidates outstanding tokens (see Rotation).
## Secret Inventory
| Variable | Required when | Source of truth (prod) | Source (dev) |
|---|---|---|---|
| `JWT_SECRET` | always (validated ≥ 32 chars in prod) | AKV secret `notelett-jwt-secret` | `.env` (dev default rejected in prod) |
| `COSMOS_ENDPOINT` | `DB_PROVIDER=cosmos` | AKV secret `bytelyst-cosmos-endpoint` | `.env` |
| `COSMOS_KEY` | `DB_PROVIDER=cosmos` | AKV secret `bytelyst-cosmos-key` | `.env` |
| `AZURE_KEYVAULT_URL` | `FIELD_ENCRYPT_KEY_PROVIDER=akv` | Static config (URL, not a secret) | `.env` |
| `FIELD_ENCRYPT_KEY` | `FIELD_ENCRYPT_KEY_PROVIDER=env` (non-prod only) | n/a — prod uses AKV | `.env` |
| `OPENAI_API_KEY` | `LLM_PROVIDER=openai` | AKV secret `notelett-openai-api-key` | `.env` |
| `OPENAI_BASE_URL` | optional override | Static config (URL, not a secret) | `.env` |
| `AZURE_OPENAI_API_KEY` | `LLM_PROVIDER=azure` | AKV secret `notelett-azure-openai-key` | `.env` |
| `AZURE_OPENAI_ENDPOINT` | `LLM_PROVIDER=azure` | Static config (URL, not a secret) | `.env` |
| `GITEA_NPM_TOKEN` | Docker build only (when not using `docker-prep.sh` tarballs) | CI secret | `~/.npmrc` |
`backend/src/lib/config.ts` enforces production assertions for the four hardest constraints: `JWT_SECRET` must not be the dev default and must be ≥ 32 chars, `DB_PROVIDER` must be `cosmos`, Cosmos endpoint/key/database must be set, and field encryption must be enabled with `akv` or `env` provider (never `memory`).
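The four assertions above can be pictured roughly as follows. This is a minimal sketch, not the contents of `config.ts`: the function name, the `DEV_DEFAULT_JWT_SECRET` placeholder, and the `COSMOS_DATABASE` variable name are assumptions made for illustration.

```typescript
// Sketch only: names below are hypothetical, not copied from backend/src/lib/config.ts.
type Env = Record<string, string | undefined>;

// Placeholder for whatever dev default the real config ships with (assumption).
const DEV_DEFAULT_JWT_SECRET = 'dev-only-secret';

// Returns one message per violated production constraint; an empty array means OK.
function productionConfigErrors(env: Env): string[] {
  const errors: string[] = [];
  const jwt = env.JWT_SECRET ?? '';
  if (jwt === DEV_DEFAULT_JWT_SECRET || jwt.length < 32) {
    errors.push('JWT_SECRET must not be the dev default and must be at least 32 chars');
  }
  if (env.DB_PROVIDER !== 'cosmos') {
    errors.push('DB_PROVIDER must be cosmos');
  }
  // COSMOS_DATABASE is an assumed variable name; the inventory lists only endpoint and key.
  for (const name of ['COSMOS_ENDPOINT', 'COSMOS_KEY', 'COSMOS_DATABASE']) {
    if (!env[name]) errors.push(`${name} must be set`);
  }
  const provider = env.FIELD_ENCRYPT_KEY_PROVIDER;
  if (provider !== 'akv' && provider !== 'env') {
    errors.push('FIELD_ENCRYPT_KEY_PROVIDER must be akv or env');
  }
  return errors;
}
```

Collecting every violation before failing, rather than throwing on the first one, lets an operator fix a misconfigured environment in a single pass.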
## Production Pattern — Azure Key Vault
Two supported flows depending on the deployment target:
### Flow A — Workload Identity (preferred)
1. The backend container runs under a Managed Identity (Azure Container Apps, AKS, or App Service).
2. The Managed Identity has `secrets/get` and `keys/{wrapKey, unwrapKey}` permissions on the NoteLett key vault.
3. At process start, an init step resolves secrets from AKV and exports them as env vars in the process scope only:
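```bash
# entrypoint snippet (illustrative)
# The quoted heredoc delimiter ('EOF') keeps bash from expanding the JS template literal.
eval "$(node --input-type=module <<'EOF'
import { DefaultAzureCredential } from '@azure/identity';
import { SecretClient } from '@azure/keyvault-secrets';

// AKV secret name -> env var the backend reads (see Secret Inventory).
const secrets = {
  'notelett-jwt-secret': 'JWT_SECRET',
  'bytelyst-cosmos-key': 'COSMOS_KEY',
  'notelett-openai-api-key': 'OPENAI_API_KEY',
};
const client = new SecretClient(process.env.AZURE_KEYVAULT_URL, new DefaultAzureCredential());
for (const [name, envVar] of Object.entries(secrets)) {
  const value = (await client.getSecret(name)).value ?? '';
  // Escape single quotes so the value survives the shell's eval intact.
  process.stdout.write(`export ${envVar}='${value.replace(/'/g, "'\\''")}'\n`);
}
EOF
)"
exec node dist/server.js
```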
In `@bytelyst/config` this is encapsulated by `resolveKeyVaultSecrets(...)` (see common-plat). Use that helper instead of writing inline glue.
4. Secrets never touch the container filesystem. They live only in the process environment, so they stay out of logs as long as nothing dumps the environment.
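For an in-process alternative to shell glue, the resolution step might look like the sketch below. It is not the real `resolveKeyVaultSecrets(...)` helper, whose signature is not shown here: `resolveSecrets`, `SecretFetcher`, and `SECRET_TO_ENV` are hypothetical names, and the fetch function is injected so the logic can be exercised without Azure credentials.

```typescript
// Hypothetical helper; this is not the real resolveKeyVaultSecrets from @bytelyst/config.
type SecretFetcher = (secretName: string) => Promise<string | undefined>;

// AKV secret name -> env var name, taken from the Secret Inventory table.
const SECRET_TO_ENV: ReadonlyArray<[string, string]> = [
  ['notelett-jwt-secret', 'JWT_SECRET'],
  ['bytelyst-cosmos-key', 'COSMOS_KEY'],
  ['notelett-openai-api-key', 'OPENAI_API_KEY'],
];

async function resolveSecrets(fetchSecret: SecretFetcher): Promise<Record<string, string>> {
  const resolved: Record<string, string> = {};
  for (const [secretName, envVar] of SECRET_TO_ENV) {
    const value = await fetchSecret(secretName);
    // Fail fast: a missing or empty secret should stop startup, not surface later as a 5xx.
    if (!value) throw new Error(`AKV secret ${secretName} is missing or empty`);
    resolved[envVar] = value;
  }
  return resolved;
}

// Wiring with the Azure SDK would look roughly like:
//   const client = new SecretClient(process.env.AZURE_KEYVAULT_URL!, new DefaultAzureCredential());
//   Object.assign(process.env, await resolveSecrets(async (n) => (await client.getSecret(n)).value));
```

An explicit name mapping is used because the AKV secret names (`bytelyst-cosmos-key`) do not match the variable names the backend reads (`COSMOS_KEY`).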
### Flow B — Kubernetes Secret synced from AKV
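1. Use the Secrets Store CSI Driver (`secrets-store.csi.k8s.io`) with its Azure Key Vault provider to mount AKV secrets and sync them into a Kubernetes Secret (`secretObjects` in the `SecretProviderClass`).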
2. Reference the K8s Secret in the Deployment via `envFrom` so values land in the container env.
3. Rotate by recreating the Pod after the secret syncs.
## Deployment Pattern — `docker-compose.yml`
The committed `docker-compose.yml` reads from the host shell env (`${OPENAI_API_KEY:-}` etc.) and from a local `.env`. For production-like single-host deploys:
1. Place secrets in a file owned by the deployer with `chmod 600`, never in git.
2. Source it before `docker compose up`:
```bash
set -a
source /etc/notelett/secrets.env
set +a
docker compose up -d
```
3. Avoid `--env-file` on the `docker compose` command line — it persists the path in process listings and is harder to rotate.
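4. After deploy, scrub any reference to `/etc/notelett/secrets.env` from shell history (`history -d <offset>`). Note that `docker compose config` prints the fully interpolated configuration, secret values included, so do not run it where its output is captured in logs.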
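The `chmod 600` requirement in step 1 can be checked in a pre-flight script. The helper below is a small sketch with a hypothetical name (`isOwnerOnly`); it takes the numeric mode so that it stays independent of how the mode is obtained (for example `fs.statSync(path).mode` in Node).

```typescript
// Returns true when only the file owner has any access (e.g. mode 0600 or 0400).
// Obtain `mode` from fs.statSync(path).mode in a real pre-flight script.
function isOwnerOnly(mode: number): boolean {
  // 0o077 masks the group and other permission bits; all of them must be clear.
  return (mode & 0o077) === 0;
}
```

A deploy script could refuse to source the secrets file when this returns `false`.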
## Rotation
Rotation pattern for any AKV-backed secret:
1. Update the AKV secret with a new version (`az keyvault secret set ...`).
2. Roll the backend deployment (rolling restart picks up the new value at process start).
3. For `JWT_SECRET`: rotation invalidates all outstanding access tokens. Plan for forced re-auth or implement dual-secret support before rotating in production.
4. For `OPENAI_API_KEY` / `AZURE_OPENAI_API_KEY`: rotation needs no coordinated cutover. In-flight LLM calls complete with the old key and new calls use the new key after the restart, so revoke the old key only once the rollout has finished.
5. For `COSMOS_KEY`: prefer rotating the **secondary** key first, swap the deployment to use it, then rotate the primary.
MEK rotation has its own runbook: [`MEK_ROTATION.md`](./MEK_ROTATION.md).
## Verification
After any rotation or initial deploy:
```bash
# 1. Service health.
curl https://<backend-host>/health
# 2. Dependency readiness (datastore + encryption + platform/extraction/MCP if configured).
curl https://<backend-host>/api/diagnostics/readiness
# 3. Authenticated note read (proves JWT_SECRET and Cosmos creds are wired).
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes?workspaceId=<ws>"
# 4. LLM smoke (proves OPENAI_API_KEY or AZURE_OPENAI_API_KEY is wired, if LLM_PROVIDER != mock).
curl -X POST -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
-d '{"workspaceId":"<ws>","noteId":"<noteId>","transform":"shorten"}' \
https://<backend-host>/api/notes/copilot/transform
```
If any call returns a 5xx (add `-i` to the `curl` commands to see the status line), check the backend's structured logs for a missing-secret error before re-rotating.
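The two unauthenticated checks can also be scripted so that a deploy pipeline fails on a 5xx. The following is a sketch under assumptions: `findServerErrors`, `Fetcher`, and `MinimalResponse` are hypothetical names, and the fetch function is passed in so the logic does not depend on a particular runtime.

```typescript
// Minimal response/fetch shapes so the sketch does not depend on a runtime's typings.
type MinimalResponse = { status: number };
type Fetcher = (url: string) => Promise<MinimalResponse>;

// The unauthenticated endpoints from the verification commands.
const SMOKE_PATHS = ['/health', '/api/diagnostics/readiness'];

// Returns one message per endpoint that answered with a 5xx status.
async function findServerErrors(baseUrl: string, fetcher: Fetcher): Promise<string[]> {
  const failures: string[] = [];
  for (const path of SMOKE_PATHS) {
    const res = await fetcher(baseUrl + path);
    if (res.status >= 500) failures.push(`${path} returned ${res.status}`);
  }
  return failures;
}
```

On a runtime with a global `fetch`, calling `findServerErrors('https://<backend-host>', fetch)` and exiting non-zero when the result is non-empty would be one way to wire it in.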
## Red Flags
- A secret value appearing in `req.log` or `app.log` output. **Stop, rotate, and audit.**
- A secret committed to git. Use `git filter-repo` to scrub, force-push (coordinate with the team), and rotate the secret immediately.
- Two pods seeing different secret values. Indicates a partial K8s rollout — finish the rollout before traffic is sent to the new version.
- `FIELD_ENCRYPT_KEY_PROVIDER=memory` in production. The backend will refuse to start, but if it slips through (e.g. with `NODE_ENV` set to something other than `production`), all encrypted documents are unrecoverable on restart.
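The first red flag can be detected mechanically by scanning log lines for known secret values. The helper below is an illustrative sketch with a hypothetical name (`leakedSecretNames`); the 8-character minimum is an arbitrary assumption to avoid matching ordinary words.

```typescript
// Returns the names of secrets whose values appear verbatim in a log line.
function leakedSecretNames(
  logLine: string,
  secrets: Record<string, string | undefined>,
): string[] {
  const leaked: string[] = [];
  for (const name of Object.keys(secrets)) {
    const value = secrets[name];
    // Skip unset or very short values to avoid matching ordinary words.
    if (value && value.length >= 8 && logLine.indexOf(value) !== -1) {
      leaked.push(name);
    }
  }
  return leaked;
}
```

It reports names only, so the scanner's own output never repeats the leaked value.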
## Open Items
- **Centralized rotation calendar.** Tracked in production-hardening backlog: schedule per-secret cadence (90 days for `OPENAI_API_KEY`, 365 days for `JWT_SECRET`, etc.).
- **Audit log integration.** Emit a `secret.rotated` event to `actiontrail` after each rotation. Currently rotation is logged only in AKV's own audit feed.
- **Dual-JWT support.** Today `JWT_SECRET` rotation invalidates outstanding tokens; planned: support `JWT_SECRET_NEXT` for graceful transitions.
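To make the dual-JWT item concrete, here is a sketch of what accepting two secrets during a transition could look like. The backend does not implement this today, and the semantics of `JWT_SECRET_NEXT` are an assumption; the function names are hypothetical. The sketch checks only the HS256 signature, so a real verifier would still need to validate `alg`, `exp`, and the other claims, ideally through a maintained JWT library.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Computes the HS256 signature (base64url) over "<header>.<payload>".
function hs256(signingInput: string, secret: string): string {
  return createHmac('sha256', secret).update(signingInput).digest('base64url');
}

// Returns the index of the first secret whose signature matches, or -1 if none does.
// Signature check only: claims such as `exp` are not validated here.
function matchingSecretIndex(token: string, secrets: string[]): number {
  const parts = token.split('.');
  if (parts.length !== 3) return -1;
  const [header, payload, signature] = parts;
  const given = new TextEncoder().encode(signature);
  for (let i = 0; i < secrets.length; i++) {
    const expected = new TextEncoder().encode(hs256(`${header}.${payload}`, secrets[i]));
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    if (expected.length === given.length && timingSafeEqual(expected, given)) return i;
  }
  return -1;
}
```

Passing `[JWT_SECRET, JWT_SECRET_NEXT]` would let tokens signed with either secret verify while a rollout is in progress; the returned index shows which secret matched, which helps decide when the old one can be retired.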
## References
- Config validation: [`backend/src/lib/config.ts`](../../backend/src/lib/config.ts)
- AKV-backed encryption provider: `../learning_ai_common_plat/packages/field-encrypt/src/key-provider-akv.ts`
- Shared secret resolver: `../learning_ai_common_plat/packages/config/src/akv.ts`
- Related: [`MEK_ROTATION.md`](./MEK_ROTATION.md)