bytelyst-devops-tools/docs/SESSION_CHECKPOINT_2026-05-30.md
Hermes VM 2056883198 checkpoint(dashboard): session 2026-05-30 — CORS env knob + state handoff
Captures the in-progress state of the long-running v2 dashboard session
so the next session (post `--permission-mode dangerous` relaunch) can
pick up without losing context. The full handoff narrative lives in
`docs/SESSION_CHECKPOINT_2026-05-30.md` — read it first.

Code change:
  - `backend/src/server.ts` CORS allow-list is now env-driven via
    `EXTRA_CORS_ORIGINS` (comma-separated). Added because the user's
    browser reaches the deployed dashboard via a Tailscale-served
    hostname (`srv1491630.tailf85608.ts.net`), and the static built-in
    list only knew `localhost` + `devops.bytelyst.com`.
    Honours `*` as a wildcard for trusted-network deployments. Adds
    `Vary: Origin` so caches behave.
  - `backend/package-lock.json` regenerated to match `package.json`
    (was missing the Phase 5 ESLint deps added earlier this session).
    Note: the Dockerfile build is STILL broken with `tsc: not found`
    despite typescript being in devDeps — this is a separate
    dual-lockfile issue documented in the checkpoint. Untangle on
    resume.

Live infra carry-forward summarised in the checkpoint doc:
  - Real Azure Cosmos DB (`cosmos-mywisprai` / new `bytelyst` db)
    replaces the crash-looping local emulator.
  - `learning_ai_common_plat/docker-compose.yml` has uncommitted
    changes mirroring this; that repo is 15 commits behind origin/main
    and needs a rebase+commit pass separately.
  - Hot-patched the running `devops-backend` container's `dist/server.js`
    to allow the Tailscale origin (ephemeral; lost on next image build,
    superseded by the code change above once rebuild works).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-30 09:55:50 +00:00

Session Checkpoint — 2026-05-30

Handoff snapshot for the next session. Read this top-to-bottom before touching anything — there's live infra state outside this repo that's material to the work in progress.

TL;DR

Roadmap items shipped this session: all of Phases 1, 2, 3, 5, 6, and 7 of the v2 dashboard roadmap, plus 4 of the 5 items on the Phase 5 P2 mitigation roadmap. Phases 4 and 8 are documented as delegation briefs (VM ops, not code).

However, the live deployed dashboard still runs the pre-session image. The rebuild hit a pre-existing dual-lockfile issue (drift between pnpm-lock.yaml and backend/package-lock.json). Fix that first on resume so the rest of this session's work actually ships.

There's also a CORS hot-patch applied directly to the running devops-backend container to unblock the user's browser tour. That patch evaporates on the next image build/recreate.

What's live right now (running infra)

| Resource | State | Notes |
| --- | --- | --- |
| Tailscale serve | UP | https://srv1491630.tailf85608.ts.net/ → localhost:3049 |
| devops-backend container | Up + healthy | Pre-session image (built ~2026-05-29) + a hot-patch in dist/server.js adding https://srv1491630.tailf85608.ts.net to CORS allow-list |
| devops-web container | Up | Pre-session image |
| learning_ai_common_plat-platform-service-1 | Up + healthy | Restarted with new env pointing at real Cosmos DB |
| learning_ai_common_plat-cosmos-emulator-1 | Stopped | Was crash-looping; replaced with real Cosmos |
| Real Cosmos DB account cosmos-mywisprai | Live | New bytelyst database created in rg-mywisprai (West US 2) |

To check on resume:

# -a also lists stopped containers, so the stopped cosmos-emulator shows up
docker ps -a --filter name=devops --filter name=platform-service --filter name=cosmos
tailscale serve status

Credentials (this session's mint, change on first login)

  • Dashboard URL: https://srv1491630.tailf85608.ts.net/login
  • Email: admin@bytelyst.local
  • Password: cat /tmp/devin-mint-pw.txt (random base64, 20 chars; rotate immediately)
  • Product ID: bytelyst-devops
  • User ID (in Cosmos): usr_7fb3552c-3d8f-4fed-83e5-8461b018c345

Backup minted JWT (24h, dashboard-backend JWT_SECRET, never used in the end because the real auth flow took precedence): /tmp/devin-mint-jwt.txt.

Both files are in /tmp — survive shell exit, lost on reboot.

Cross-repo state

learning_ai_devops_tools (this repo)

Branch main. Pushed commits this session — 18 in total, most recently:

| SHA | Phase | Title |
| --- | --- | --- |
| eaaa545 | 6 + P2 close | trend cards, theme toggle, drop-root scaffold, Agents inventory, Phase 0 reconfirm |
| 74a8ee0 | 5 P2 | allow-list shell wrapper + projectPath validation + audit-log shell-outs |
| a8cf61a | 8 | Telegram convention + delegation brief |
| 14c7a8f | 6 | severity alerts + per-instance actions + URL-param deep links |
| efdf41f | 4 + 7 | Phase 4 brief + /hermes/ops requireAdmin |
| 62c0cd6 | 3.2 | Products pane on real service registry |
| ad16b13 | 3.1 | hermes-telemetry contract + endpoint + 6 tests |
| 13e5e1c | 5 P2 | Playwright E2E wired into Gitea CI |
| 1e64d75 | 5 P2 | structured pino logging + redaction |
| c6ec1a0 | 5 P1 | privilege surface doc + /code-quality/check auth fix |
| 824f315 | 5 P1 | doc drift + dedupe deployment docs |
| 3fc471e | 5 P1 | SSE TODO removed (dead fastify-sse-v2) |
| 8ba2dbd | 5 P1 | 35 auth/csrf/health/orchestrator tests + coverage gate |
| ecd1f20 | 2 | instance dimension across Mission Control |
| 1e64d75, c6ec1a0, 824f315, 3fc471e, 8ba2dbd, cf5428a | earlier in session | (see roadmap notes for full list) |

Uncommitted (will be in the same commit as this checkpoint):

  • dashboard/backend/src/server.ts — CORS now env-driven via EXTRA_CORS_ORIGINS. Source-correct, typechecks. Not in the running image because the image rebuild is currently broken (see below).
  • dashboard/backend/package-lock.json — regenerated to match package.json. Was the source of the rebuild error.

learning_ai_common_plat (sibling repo)

Branch main, 15 commits behind origin/main. The changes below are uncommitted and not pushed.

Working tree changes:

  • docker-compose.yml — Cosmos emulator service replaced/disabled, all consuming services point at real Cosmos via .env. Long inline comment explains why.
  • .env: gitignored, contains live Cosmos credentials. Do not commit.
  • .env.bak-pre-real-cosmos: backup of the env file from before I changed it, also gitignored. Delete it once you're sure the real-Cosmos setup is staying.

Suggested next action there: rebase + commit the docker-compose.yml diff once you've verified other dashboards (mindlyst, lysnrai, etc.) still work without the emulator. They reference cosmos-emulator:8081 in compose env vars and will need similar repointing.

Real Cosmos DB layout

  • Account: cosmos-mywisprai (West US 2, resource group rg-mywisprai)
  • Existing databases: mindlyst, lysnrai, mywisprai, invttrdg
  • New database added today: bytelyst (for platform-service)
  • Collections in bytelyst: created automatically by platform-service's COSMOS_AUTO_INIT=true on startup
  • Auto-seeded so far: the bytelyst-devops product and the admin user above. The other 12 products (lysnrai, mindlyst, etc.) need re-seeding if their dashboards/services are expected to work.

What broke that needs fixing on resume

1. Backend Dockerfile build (BLOCKING the redeploy)

RUN npm ci --ignore-scripts    # OK
RUN npm run build              # fails: sh: tsc: not found

typescript is in devDependencies of package.json and present in package-lock.json, but npm ci isn't actually installing it in the Alpine builder stage. Cause unknown — could be:

  • A NODE_ENV=production leaking into the build context
  • An .npmrc excluding devDeps
  • The Alpine Node 20 image's npm behaving differently

Investigation paths:

  1. docker run --rm -it node:20-alpine sh and reproduce npm ci from the lockfile manually
  2. Check whether BYTELYST_PACKAGE_SOURCE=vendor (compose default) is triggering a .pnpmfile.cjs hook that drops devDeps
  3. Just switch the Dockerfile to pnpm to align with the workspace
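
The first hypothesis above (an environment setting that makes npm skip devDependencies) can be checked mechanically. The sketch below is only an illustration: `BuildEnv` and `devDepOmitReasons` are hypothetical names, not code from this repo. It relies on two documented npm behaviours: `omit` defaults to `dev` when `NODE_ENV` is `production`, and `npm_config_*` environment variables are read case-insensitively. It does not inspect `.npmrc` files or pnpm hooks, so an empty result rules out only the environment-variable cause.

```typescript
// Sketch only: mirrors npm's documented defaults, not this project's build scripts.
// `BuildEnv` and `devDepOmitReasons` are hypothetical names used for illustration.
type BuildEnv = { [key: string]: string | undefined };

// Returns the environment settings that would make `npm ci` skip devDependencies.
function devDepOmitReasons(env: BuildEnv): string[] {
  const reasons: string[] = [];

  // npm defaults its `omit` config to "dev" when NODE_ENV is "production".
  if (env["NODE_ENV"] === "production") {
    reasons.push("NODE_ENV=production");
  }

  // npm treats npm_config_* environment variables as config, case-insensitively.
  for (const key of Object.keys(env)) {
    if (key.toLowerCase() === "npm_config_omit" && env[key] === "dev") {
      reasons.push(key + "=dev");
    }
  }

  return reasons;
}
```

Inside the Alpine builder stage this could be fed `process.env`. An empty array would point away from the `NODE_ENV` theory and toward the `.npmrc` or `.pnpmfile.cjs` paths listed above.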

2. Web Dockerfile likely has the same dual-lockfile drift

Haven't verified — but dashboard/web/package-lock.json exists alongside dashboard/pnpm-lock.yaml. Expect the same npm ci failure when web is rebuilt. Worth checking in the same pass.

3. CORS allow-list is hot-patched, not built in

The running devops-backend container has a sed-applied edit to dist/server.js to allow the Tailscale origin. Lost on next image build. The source fix is committed (this commit) but won't take effect until the rebuild works. Workaround: keep hot-patching until rebuild is unblocked, OR set EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net via env at runtime (the new code reads it).
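
For orientation, here is a minimal sketch of how an env-driven allow-list of this kind could behave. It is not the code in server.ts: `BUILT_IN_ORIGINS`, `parseExtraOrigins`, and `decideCors` are hypothetical names, and the localhost port is an assumption. Only the `EXTRA_CORS_ORIGINS` variable, the comma-separated format, the `*` wildcard, and the `Vary: Origin` header come from the notes above.

```typescript
// Hypothetical sketch; the real logic lives in dashboard/backend/src/server.ts.
const BUILT_IN_ORIGINS: readonly string[] = [
  "http://localhost:3049", // assumed port, for illustration only
  "https://devops.bytelyst.com",
];

// Split the comma-separated EXTRA_CORS_ORIGINS value, dropping blanks.
function parseExtraOrigins(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

interface CorsDecision {
  allowed: boolean;
  headers: Record<string, string>;
}

// Decide whether a request origin is allowed and which headers to send.
function decideCors(
  requestOrigin: string | undefined,
  extraRaw: string | undefined,
): CorsDecision {
  const extras = parseExtraOrigins(extraRaw);
  const wildcard = extras.indexOf("*") !== -1;
  const allowList = BUILT_IN_ORIGINS.concat(extras.filter((o) => o !== "*"));

  // Vary: Origin is always set so caches key responses by requesting origin.
  const headers: Record<string, string> = { Vary: "Origin" };

  if (
    requestOrigin !== undefined &&
    (wildcard || allowList.indexOf(requestOrigin) !== -1)
  ) {
    // Echo the origin back rather than sending a literal "*" (a design
    // assumption here; a literal "*" is rejected for credentialed requests).
    headers["Access-Control-Allow-Origin"] = requestOrigin;
    return { allowed: true, headers };
  }
  return { allowed: false, headers };
}
```

With `EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net`, a request from the Tailscale hostname would be allowed under this sketch, while an unlisted origin would get only the `Vary` header.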

4. The deployed dashboard is the OLD code

The user's "tour" of the dashboard right now shows none of this session's work. After the rebuild is unblocked:

cd /opt/bytelyst/learning_ai_devops_tools/dashboard
EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net docker compose up -d --build --force-recreate backend web

…or via build args / env file. New env var: EXTRA_CORS_ORIGINS.

Open delegation work (not blockers for code)

  • docs/prompts/phase4-bheem-uma-parity.md — VM ops: Uma backup repo + watchdog + restore drill. Requires sudo + Uma GitHub PAT + Uma Telegram bot. Closes 4 of 5 Bheem-only warnings in the ops panel.
  • docs/prompts/phase8-telegram-loop.md — VM ops + bot tokens. Gated on Phase 4. Closes the dashboard-warning → Telegram delivery loop.

Carry-forward from Phase 5 P2 mitigation roadmap

In dashboard/DEPLOYMENT.md "Mitigation roadmap":

  • Allow-list wrapper around shell-outs
  • Validate /code-quality/check's projectPath
  • Audit-log every privileged shell-out
  • Non-root backend container (scaffolded, default-off pending host file permissions)
  • P3 still open: replace raw docker.sock with verb-restricted daemon. Worth a design doc before code.

How to verify on resume

# 1. Confirm the dashboard URL still works
curl -fsS -o /dev/null -w "tailscale dashboard: %{http_code}\n" \
  https://srv1491630.tailf85608.ts.net/login

# 2. Confirm platform-service is healthy on real Cosmos
docker exec learning_ai_common_plat-platform-service-1 \
  node -e 'fetch("http://localhost:4003/health").then(r=>r.text()).then(console.log)'

# 3. Confirm the admin user still exists and login works
PW=$(cat /tmp/devin-mint-pw.txt)
docker exec -e PW="$PW" learning_ai_common_plat-platform-service-1 sh -c '
  node -e "
    fetch(\"http://localhost:4003/api/auth/login\",{
      method:\"POST\",
      headers:{\"content-type\":\"application/json\"},
      body:JSON.stringify({email:\"admin@bytelyst.local\",password:process.env.PW,productId:\"bytelyst-devops\"})
    }).then(async r => { console.log(\"login:\", r.status); })
  "'

# 4. Confirm CORS hot-patch is still in place
docker exec devops-backend grep tailf85608 dist/server.js
# Expect: 'https://devops.bytelyst.com', 'https://srv1491630.tailf85608.ts.net',

What I'd do first on the next session

  1. Fix the backend Dockerfile rebuild. Probably switch to pnpm or debug the npm ci devDep issue. Once that works:
  2. Rebuild + redeploy backend + web with EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net. This brings all the Phase 1-7 work live and replaces the hot-patch.
  3. Verify the user can use the dashboard end-to-end with the new UI.
  4. Delete /tmp/devin-mint-jwt.txt (no longer needed once auth works).
  5. Help the user rotate admin@bytelyst.local's password via the new UI.
  6. Then return to whatever was next — re-seed other products, work on Phase 5 P3 (docker daemon proxy), or let the user drive.

— end checkpoint —