learning_ai_notes/docs/runbooks/SECRET_MANAGEMENT.md
saravanakumardb1 bcad7d330a docs(runbooks): add MEK rotation and secret-management runbooks
Sprint B — closes audit items B4 and B5.

- docs/runbooks/MEK_ROTATION.md: step-by-step procedure for rotating
  the field-encrypt master key in Azure Key Vault, including pre-flight
  checks, rewrapAllDeks usage, verification queries, rollback, and lost-MEK
  recovery. Replaces the previous gap where MEK rotation had no
  documented operator path.
- docs/runbooks/SECRET_MANAGEMENT.md: inventory of every secret consumed
  by NoteLett with its production source (AKV), two production-grade
  patterns (workload identity vs K8s CSI), the compose-host pattern,
  rotation flow per secret type, verification commands, and red-flag
  triage.

Both docs cross-link each other and call out concrete open items
(automation, dual-JWT support, audit-log emission) for later sprints
rather than overstating current capabilities.
2026-05-22 23:23:38 -07:00

Runbook — Secret Management for NoteLett

Owner: Platform / Security
Touches: backend container (port 4016), web container (port 3000), Azure Key Vault, deployment platform
Audience: anyone deploying NoteLett to a non-development environment

Principles

  • Never commit a secret to git, never bake one into a Docker image.
  • Every secret has exactly one source of truth — Azure Key Vault (AKV) in production.
  • The container reads secrets at process start, never from disk on the runtime host.
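  • Rotation is non-disruptive for most secrets: rolling the deployment after rotating the secret is enough. JWT_SECRET is the exception, because rotating it invalidates outstanding tokens (see Rotation).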

Secret Inventory

| Variable | Required when | Source of truth (prod) | Source (dev) |
|---|---|---|---|
| JWT_SECRET | always (validated ≥ 32 chars in prod) | AKV secret notelett-jwt-secret | .env (dev default rejected in prod) |
| COSMOS_ENDPOINT | DB_PROVIDER=cosmos | AKV secret bytelyst-cosmos-endpoint | .env |
| COSMOS_KEY | DB_PROVIDER=cosmos | AKV secret bytelyst-cosmos-key | .env |
| AZURE_KEYVAULT_URL | FIELD_ENCRYPT_KEY_PROVIDER=akv | Static config (URL, not a secret) | .env |
| FIELD_ENCRYPT_KEY | FIELD_ENCRYPT_KEY_PROVIDER=env (non-prod only) | n/a (prod uses AKV) | .env |
| OPENAI_API_KEY | LLM_PROVIDER=openai | AKV secret notelett-openai-api-key | .env |
| OPENAI_BASE_URL | optional override | Static config (URL, not a secret) | .env |
| AZURE_OPENAI_API_KEY | LLM_PROVIDER=azure | AKV secret notelett-azure-openai-key | .env |
| AZURE_OPENAI_ENDPOINT | LLM_PROVIDER=azure | Static config (URL, not a secret) | .env |
| GITEA_NPM_TOKEN | Docker build only (when not using docker-prep.sh tarballs) | CI secret | ~/.npmrc |

backend/src/lib/config.ts enforces production assertions for the four hardest constraints: JWT_SECRET must not be the dev default and must be ≥ 32 chars, DB_PROVIDER must be cosmos, Cosmos endpoint/key/database must be set, and field encryption must be enabled with the akv or env provider (never memory). The check accepts env, but the inventory above treats FIELD_ENCRYPT_KEY as non-prod only, so production deployments should use akv.
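The four assertions can be pictured as a single function that collects every violation before failing. This is a minimal sketch, not the contents of backend/src/lib/config.ts: the function name, the dev-default placeholder, and the COSMOS_DATABASE variable name are assumptions made for illustration.

```typescript
// Hypothetical sketch; NOT the real backend/src/lib/config.ts.
type Env = Record<string, string | undefined>;

// Assumed placeholder: the real dev default is defined in the repo, not here.
const DEV_DEFAULT_JWT_SECRET = "dev-secret-change-me";

// Returns one message per violated production constraint (empty array = OK).
function productionConfigErrors(env: Env): string[] {
  const errors: string[] = [];

  const jwt = env.JWT_SECRET ?? "";
  if (jwt === DEV_DEFAULT_JWT_SECRET) errors.push("JWT_SECRET must not be the dev default");
  if (jwt.length < 32) errors.push("JWT_SECRET must be at least 32 characters");

  if (env.DB_PROVIDER !== "cosmos") errors.push("DB_PROVIDER must be cosmos");

  // COSMOS_DATABASE is an assumed name; the runbook only says "database".
  for (const name of ["COSMOS_ENDPOINT", "COSMOS_KEY", "COSMOS_DATABASE"]) {
    if (!env[name]) errors.push(`${name} must be set`);
  }

  const provider = env.FIELD_ENCRYPT_KEY_PROVIDER;
  if (provider !== "akv" && provider !== "env") {
    errors.push("FIELD_ENCRYPT_KEY_PROVIDER must be akv or env (never memory)");
  }
  return errors;
}
```

Collecting all violations rather than throwing on the first one lets an operator fix a misconfigured deployment in a single pass. The messages name variables only and never include their values.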

Production Pattern — Azure Key Vault

Two supported flows depending on the deployment target:

Flow A — Workload Identity (preferred)

  1. The backend container runs under a Managed Identity (Azure Container Apps, AKS, or App Service).

  2. The Managed Identity has secrets/get and keys/{wrapKey, unwrapKey} permissions on the NoteLett key vault.

  3. At process start, an init step resolves secrets from AKV and exports them as env vars in the process scope only:

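    # entrypoint snippet (illustrative)
    # The JS is single-quoted so the shell does not expand its backticks or ${...}.
    # "\x27" is a single quote; values are escaped so they cannot break out of eval.
    eval "$(node --input-type=module -e '
      import { DefaultAzureCredential } from "@azure/identity";
      import { SecretClient } from "@azure/keyvault-secrets";
      // AKV secret name -> env var the backend reads (see Secret Inventory).
      const secrets = {
        "notelett-jwt-secret": "JWT_SECRET",
        "bytelyst-cosmos-endpoint": "COSMOS_ENDPOINT",
        "bytelyst-cosmos-key": "COSMOS_KEY",
        "notelett-openai-api-key": "OPENAI_API_KEY",
      };
      const c = new SecretClient(process.env.AZURE_KEYVAULT_URL, new DefaultAzureCredential());
      for (const [name, envVar] of Object.entries(secrets)) {
        const v = (await c.getSecret(name)).value ?? "";
        const quoted = v.replaceAll("\x27", "\x27\\\x27\x27");
        process.stdout.write(`export ${envVar}=\x27${quoted}\x27\n`);
      }
    ')"
    exec node dist/server.js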
    

    In @bytelyst/config this is encapsulated by resolveKeyVaultSecrets(...) (see common-plat). Use that helper instead of writing inline glue.

  4. Secrets never touch the container filesystem; they live in process env only. They stay out of logs as long as the entrypoint does not echo them, so do not enable set -x around the eval.
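The Flow A steps can also be done in-process, populating the environment before the server module loads instead of eval'ing shell exports. The sketch below assumes a hypothetical SecretFetcher callback, which in a real deployment would wrap SecretClient.getSecret from @azure/keyvault-secrets. It is not the signature of resolveKeyVaultSecrets, which this runbook does not show, and it treats every mapped secret as required, which is a simplification.

```typescript
// Hypothetical sketch; this is NOT the signature of resolveKeyVaultSecrets.
type SecretFetcher = (akvName: string) => Promise<string | undefined>;

// AKV secret name -> env var, taken from the Secret Inventory.
const SECRET_MAP: Record<string, string> = {
  "notelett-jwt-secret": "JWT_SECRET",
  "bytelyst-cosmos-endpoint": "COSMOS_ENDPOINT",
  "bytelyst-cosmos-key": "COSMOS_KEY",
  "notelett-openai-api-key": "OPENAI_API_KEY",
};

// Resolves each mapped secret and writes it into `target` (e.g. process.env).
// Returns the env var NAMES that were set, which are safe to log.
async function resolveSecretsIntoEnv(
  fetchSecret: SecretFetcher,
  target: Record<string, string | undefined>,
): Promise<string[]> {
  const resolved: string[] = [];
  for (const akvName of Object.keys(SECRET_MAP)) {
    const envVar = SECRET_MAP[akvName];
    const value = await fetchSecret(akvName);
    // Fail fast, naming only the secret and never its value.
    if (!value) throw new Error(`AKV secret ${akvName} is missing or empty`);
    target[envVar] = value;
    resolved.push(envVar);
  }
  return resolved;
}
```

Injecting the fetcher keeps the mapping testable without a vault, and the explicit name map avoids deriving env var names from AKV secret names, which do not match (notelett-jwt-secret versus JWT_SECRET).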

Flow B — Kubernetes Secret synced from AKV

  1. Use the Secrets Store CSI Driver (secrets-store.csi.k8s.io) with its Azure Key Vault provider to sync AKV secrets into a Kubernetes Secret.
  2. Reference the K8s Secret in the Deployment via envFrom so values land in the container env.
  3. Rotate by recreating the Pod after the secret syncs.

Deployment Pattern — docker-compose.yml

The committed docker-compose.yml reads from the host shell env (${OPENAI_API_KEY:-} etc.) and from a local .env. For production-like single-host deploys:

  1. Place secrets in a file owned by the deployer with chmod 600, never in git.

  2. Source it before docker compose up:

    set -a
    source /etc/notelett/secrets.env
    set +a
    docker compose up -d
    
  3. Avoid --env-file on the docker compose command line — it persists the path in process listings and is harder to rotate.

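  4. After deploy, remove any shell history entries that reference /etc/notelett/secrets.env or contain secret values (history -d <offset>). Be careful with docker compose config: it prints the fully interpolated configuration, secret values included, so do not run it where output is captured in CI or shared logs.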

Rotation

Rotation pattern for any AKV-backed secret:

  1. Update the AKV secret with a new version (az keyvault secret set ...).
  2. Roll the backend deployment (rolling restart picks up the new value at process start).
  3. For JWT_SECRET: rotation invalidates all outstanding access tokens. Plan for forced re-auth or implement dual-secret support before rotating in production.
  4. For OPENAI_API_KEY / AZURE_OPENAI_API_KEY: rotation needs only the rolling restart. In-flight LLM calls complete with the old key, and new calls use the new key once the restarted process is serving. Revoke the old key at the provider only after the rollout has finished.
  5. For COSMOS_KEY: regenerate the secondary key first, switch the deployment to it, then regenerate the primary.

MEK rotation has its own runbook: MEK_ROTATION.md.

Verification

After any rotation or initial deploy:

# 1. Service health.
curl https://<backend-host>/health

# 2. Dependency readiness (datastore + encryption + platform/extraction/MCP if configured).
curl https://<backend-host>/api/diagnostics/readiness

# 3. Authenticated note read (proves JWT_SECRET and Cosmos creds are wired).
curl -H "Authorization: Bearer <token>" "https://<backend-host>/api/notes?workspaceId=<ws>"

# 4. LLM smoke (proves OPENAI_API_KEY or AZURE_OPENAI_API_KEY are wired, if LLM_PROVIDER != mock).
curl -X POST -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
  -d '{"workspaceId":"<ws>","noteId":"<noteId>","transform":"shorten"}' \
  https://<backend-host>/api/notes/copilot/transform

If any returns 5xx (add -i to the curl command to see the status line), check the structured log line for a missing-secret error before re-rotating.
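The GET checks above can be scripted so that a 5xx is flagged for log review rather than triggering another rotation. This is a sketch under stated assumptions: HttpGet is a hypothetical callback that returns the HTTP status code (in practice it would wrap fetch), the helper names are invented here, and the LLM POST check is left out for brevity.

```typescript
// Hypothetical callback: performs a GET and resolves to the HTTP status code.
type HttpGet = (url: string, headers: Record<string, string>) => Promise<number>;

interface CheckResult {
  name: string;
  status: number;
}

// Paths are the ones listed in the verification commands above.
function checksFor(baseUrl: string, workspaceId: string) {
  return [
    { name: "health", url: `${baseUrl}/health`, auth: false },
    { name: "readiness", url: `${baseUrl}/api/diagnostics/readiness`, auth: false },
    { name: "notes", url: `${baseUrl}/api/notes?workspaceId=${encodeURIComponent(workspaceId)}`, auth: true },
  ];
}

// Runs the checks in order and records the status of each.
async function runChecks(get: HttpGet, baseUrl: string, workspaceId: string, token: string): Promise<CheckResult[]> {
  const results: CheckResult[] = [];
  for (const check of checksFor(baseUrl, workspaceId)) {
    const headers: Record<string, string> = {};
    if (check.auth) headers["Authorization"] = `Bearer ${token}`;
    results.push({ name: check.name, status: await get(check.url, headers) });
  }
  return results;
}

// A 5xx means "read the logs for a missing-secret error", not "rotate again".
function needsLogReview(results: CheckResult[]): string[] {
  return results.filter((r) => r.status >= 500 && r.status <= 599).map((r) => r.name);
}
```

A 401 or 403 on the notes check is deliberately not flagged by needsLogReview: it points at the token or JWT_SECRET wiring rather than at a missing secret on the server side.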

Red Flags

  • A secret value appearing in req.log or app.log output. Stop, rotate, and audit.
  • A secret committed to git. Use git filter-repo to scrub, force-push (coordinate with the team), and rotate the secret immediately.
  • Two pods seeing different secret values. Indicates a partial K8s rollout — finish the rollout before traffic is sent to the new version.
  • FIELD_ENCRYPT_KEY_PROVIDER=memory in production. The backend will refuse to start, but if it slips through (e.g. with NODE_ENV set to something other than production), all encrypted documents are unrecoverable on restart.
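The first red flag can be checked mechanically by scanning log lines for the current values of the inventoried secrets. The following is a rough sketch, not an existing NoteLett tool: the function name and the 8-character minimum are assumptions, and a real scanner would also need to handle encoded or truncated values.

```typescript
// Variables from the Secret Inventory whose values must never be logged.
const SECRET_VARS = [
  "JWT_SECRET",
  "COSMOS_KEY",
  "FIELD_ENCRYPT_KEY",
  "OPENAI_API_KEY",
  "AZURE_OPENAI_API_KEY",
];

// Returns the NAMES (never the values) of secrets found verbatim in a log line.
function leakedSecretNames(logLine: string, env: Record<string, string | undefined>): string[] {
  return SECRET_VARS.filter((name) => {
    const value = env[name];
    // Assumed threshold: skip unset or very short values to avoid
    // flagging incidental substrings.
    return value !== undefined && value.length >= 8 && logLine.indexOf(value) !== -1;
  });
}
```

Any non-empty result is a stop, rotate, and audit event as described above. Reporting names only keeps the scanner's own output from becoming a second leak.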

Open Items

  • Centralized rotation calendar. Tracked in production-hardening backlog: schedule per-secret cadence (90 days for OPENAI_API_KEY, 365 days for JWT_SECRET, etc.).
  • Audit log integration. Emit a secret.rotated event to actiontrail after each rotation. Currently rotation is logged only in AKV's own audit feed.
  • Dual-JWT support. Today JWT_SECRET rotation invalidates outstanding tokens; planned: support JWT_SECRET_NEXT for graceful transitions.
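To make the dual-JWT item concrete, here is one way verification against two secrets could look. This is a sketch of planned behaviour only: JWT_SECRET_NEXT is not supported today, HS256 is assumed as the signing algorithm, and the function names are hypothetical. It checks the signature only; a real implementation should use a maintained JWT library and also validate the header and expiry claims.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// HMAC-SHA256 over "header.payload", as used by HS256 (assumed algorithm).
function hs256Signature(signingInput: string, secret: string): Buffer {
  return createHmac("sha256", secret).update(signingInput).digest();
}

// Returns which secret verified the token's signature, or null if neither did.
// `next` stands in for the planned JWT_SECRET_NEXT.
function verifyWithRotation(token: string, current: string, next?: string): "current" | "next" | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const signingInput = `${parts[0]}.${parts[1]}`;
  const presented = Buffer.from(parts[2], "base64url");

  const candidates: Array<["current" | "next", string | undefined]> = [
    ["current", current],
    ["next", next],
  ];
  for (const [label, secret] of candidates) {
    if (!secret) continue;
    const expected = hs256Signature(signingInput, secret);
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    if (expected.length === presented.length && timingSafeEqual(expected, presented)) return label;
  }
  return null;
}

// Helper for experiments: builds an HS256-signed token from a payload.
function signHs256(payload: object, secret: string): string {
  const enc = (o: object) => Buffer.from(JSON.stringify(o)).toString("base64url");
  const signingInput = `${enc({ alg: "HS256", typ: "JWT" })}.${enc(payload)}`;
  return `${signingInput}.${hs256Signature(signingInput, secret).toString("base64url")}`;
}
```

With both secrets configured during a transition window, tokens signed with either one keep verifying, so a rolling restart no longer forces every user to re-authenticate at once.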

References

  • Config validation: backend/src/lib/config.ts
  • AKV-backed encryption provider: ../learning_ai_common_plat/packages/field-encrypt/src/key-provider-akv.ts
  • Shared secret resolver: ../learning_ai_common_plat/packages/config/src/akv.ts
  • Related: MEK_ROTATION.md