# Session Checkpoint: 2026-05-30

> Handoff snapshot for the next session. Read this top-to-bottom before
> touching anything: there's live infra state outside this repo that's
> material to the work in progress.

## TL;DR

Roadmap items shipped this session: all of Phases 1, 2, 3, 5, 6, and 7 of the v2 dashboard roadmap, plus 4 of 5 items on the Phase 5 P2 mitigation roadmap. Phase 4 + Phase 8 are documented as delegation briefs (VM ops, not code).

But: **the live deployed dashboard is still the pre-this-session image**. Building it ran into a pre-existing dual-lockfile issue (pnpm-lock vs backend/package-lock.json drift). That's the first thing to fix on resume so the rest of this session's work actually ships.

There's also a **CORS hot-patch applied directly to the running `devops-backend` container** to unblock the user's browser tour. That patch evaporates on the next image build/recreate.

## What's live right now (running infra)

| Resource | State | Notes |
|---|---|---|
| Tailscale serve | UP | `https://srv1491630.tailf85608.ts.net/` → `localhost:3049` |
| `devops-backend` container | Up + healthy | Pre-session image (built ~2026-05-29) + a hot-patch in `dist/server.js` adding `https://srv1491630.tailf85608.ts.net` to CORS allow-list |
| `devops-web` container | Up | Pre-session image |
| `learning_ai_common_plat-platform-service-1` | Up + healthy | Restarted with new env pointing at real Cosmos DB |
| `learning_ai_common_plat-cosmos-emulator-1` | **Stopped** | Was crash-looping; replaced with real Cosmos |
| Real Cosmos DB account `cosmos-mywisprai` | Live | New `bytelyst` database created in `rg-mywisprai` (West US 2) |

To check on resume:

```bash
docker ps --filter name=devops --filter name=platform-service --filter name=cosmos
tailscale serve status
```

## Credentials (this session's mint, change on first login)

- **Dashboard URL**: (see the Tailscale serve row in the table above)
- **Email**: `admin@bytelyst.local`
- **Password**: `cat /tmp/devin-mint-pw.txt` (random base64, 20 chars; rotate immediately)
- **Product ID**: `bytelyst-devops`
- **User ID** (in Cosmos): `usr_7fb3552c-3d8f-4fed-83e5-8461b018c345`

Backup minted JWT (24h, dashboard-backend JWT_SECRET, never used in the end because the real auth flow took precedence): `/tmp/devin-mint-jwt.txt`.

Both files are in `/tmp`: they survive shell exit but are lost on reboot.

## Cross-repo state

### `learning_ai_devops_tools` (this repo)

Branch `main`. Pushed commits this session: 18 in total, most recently:

| SHA | Phase | Title |
|---|---|---|
| `eaaa545` | 6 + P2 close | trend cards, theme toggle, drop-root scaffold, Agents inventory, Phase 0 reconfirm |
| `74a8ee0` | 5 P2 | allow-list shell wrapper + projectPath validation + audit-log shell-outs |
| `a8cf61a` | 8 | Telegram convention + delegation brief |
| `14c7a8f` | 6 | severity alerts + per-instance actions + URL-param deep links |
| `efdf41f` | 4 + 7 | Phase 4 brief + `/hermes/ops` requireAdmin |
| `62c0cd6` | 3.2 | Products pane on real service registry |
| `ad16b13` | 3.1 | hermes-telemetry contract + endpoint + 6 tests |
| `13e5e1c` | 5 P2 | Playwright E2E wired into Gitea CI |
| `1e64d75` | 5 P2 | structured pino logging + redaction |
| `c6ec1a0` | 5 P1 | privilege surface doc + `/code-quality/check` auth fix |
| `824f315` | 5 P1 | doc drift + dedupe deployment docs |
| `3fc471e` | 5 P1 | SSE TODO removed (dead `fastify-sse-v2`) |
| `8ba2dbd` | 5 P1 | 35 auth/csrf/health/orchestrator tests + coverage gate |
| `ecd1f20` | 2 | instance dimension across Mission Control |
| `1e64d75`, `c6ec1a0`, `824f315`, `3fc471e`, `8ba2dbd`, `cf5428a` | earlier in session | (see roadmap notes for full list) |

Uncommitted (will be in the same commit as this checkpoint):

- `dashboard/backend/src/server.ts`: CORS is now env-driven via `EXTRA_CORS_ORIGINS`. Source-correct, typechecks. **Not in the running image** because the image rebuild is currently broken (see below).
- `dashboard/backend/package-lock.json`: regenerated to match `package.json`. Was the source of the rebuild error.
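
For orientation, the env-driven CORS change described above could look roughly like the sketch below. This is not the actual `server.ts` code: the helper names (`parseExtraCorsOrigins`, `buildAllowList`, `isOriginAllowed`), the `DEFAULT_ORIGINS` constant, and the comma-separated format of `EXTRA_CORS_ORIGINS` are all assumptions made for illustration.

```typescript
// Sketch only: names and the comma-separated format are assumptions,
// not taken from the actual dashboard/backend/src/server.ts.

// Assumed built-in default; this origin appears in the running allow-list.
const DEFAULT_ORIGINS: string[] = ["https://devops.bytelyst.com"];

// Matches a bare http(s) origin: scheme + host (+ optional port), no path.
const ORIGIN_PATTERN = /^https?:\/\/[^\/\s]+$/;

function parseExtraCorsOrigins(raw: string | undefined): string[] {
  if (!raw) return [];
  const origins: string[] = [];
  for (const part of raw.split(",")) {
    // Trim whitespace and drop trailing slashes so "https://host/" matches
    // the browser's Origin header, which never carries a path.
    const candidate = part.trim().replace(/\/+$/, "");
    if (candidate !== "" && ORIGIN_PATTERN.test(candidate)) {
      origins.push(candidate);
    }
  }
  return origins;
}

function buildAllowList(raw: string | undefined): string[] {
  const merged = DEFAULT_ORIGINS.concat(parseExtraCorsOrigins(raw));
  // Deduplicate while preserving order.
  return merged.filter((origin, index) => merged.indexOf(origin) === index);
}

function isOriginAllowed(origin: string | undefined, allowList: string[]): boolean {
  // No Origin header means the request is not cross-origin (curl, same-origin).
  if (origin === undefined) return true;
  return allowList.indexOf(origin) !== -1;
}
```

In a real server the raw value would presumably come from `process.env.EXTRA_CORS_ORIGINS`, and the resulting list would be handed to whatever CORS plugin the backend registers. Rejecting malformed entries rather than passing them through keeps a typo in the env file from silently widening the allow-list.
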
### `learning_ai_common_plat` (sibling repo)

Branch `main`, **15 commits behind origin/main**. **Uncommitted, not pushed.**

Working tree changes:

- `docker-compose.yml`: Cosmos emulator service replaced/disabled, all consuming services point at real Cosmos via `.env`. Long inline comment explains why.
- `.env`: **gitignored, contains live Cosmos credentials**. Do not commit.
- `.env.bak-pre-real-cosmos`: backup of the env file before I changed it, same gitignore. Delete once you're sure the real-Cosmos setup is staying.

Suggested next action there: rebase + commit the docker-compose.yml diff once you've verified other dashboards (`mindlyst`, `lysnrai`, etc.) still work without the emulator. They reference `cosmos-emulator:8081` in compose env vars and will need similar repointing.

## Real Cosmos DB layout

- Account: `cosmos-mywisprai` (West US 2, resource group `rg-mywisprai`)
- Existing databases: `mindlyst`, `lysnrai`, `mywisprai`, `invttrdg`
- **New database added today**: `bytelyst` (for platform-service)
- Collections in `bytelyst`: created automatically by platform-service's `COSMOS_AUTO_INIT=true` on startup
- Auto-seeded so far: `bytelyst-devops` product + the admin user above. **All other 12 products (`lysnrai`, `mindlyst`, etc.) need re-seeding** if their respective dashboards/services are expected to work.

## What broke that needs fixing on resume

### 1. Backend Dockerfile build (BLOCKING the redeploy)

```
RUN npm ci --ignore-scripts   # OK
RUN npm run build             # fails: sh: tsc: not found
```

`typescript` is in devDependencies of `package.json` and present in `package-lock.json`, but `npm ci` isn't actually installing it in the Alpine builder stage. Cause unknown; could be:

- An `NODE_ENV=production` leaking into the build context
- `.npmrc` somehow excluding devDeps
- The Alpine Node 20 image's npm behaving differently

Investigation paths:

1. `docker run --rm -it node:20-alpine sh` and reproduce `npm ci` from the lockfile manually
2. Check whether `BYTELYST_PACKAGE_SOURCE=vendor` (compose default) is triggering a `.pnpmfile.cjs` hook that drops devDeps
3. Just switch the Dockerfile to pnpm to align with the workspace

### 2. Web Dockerfile likely has the same dual-lockfile drift

Haven't verified, but `dashboard/web/package-lock.json` exists alongside `dashboard/pnpm-lock.yaml`. Expect the same `npm ci` failure when web is rebuilt. Worth checking in the same pass.

### 3. CORS allow-list is hot-patched, not built in

The running `devops-backend` container has a `sed`-applied edit to `dist/server.js` to allow the Tailscale origin. **Lost on next image build.** The source fix is committed (this commit) but won't take effect until the rebuild works.

Workaround: keep hot-patching until rebuild is unblocked, OR set `EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net` via env at runtime (the new code reads it).

### 4. The deployed dashboard is the OLD code

The user's "tour" of the dashboard right now shows none of this session's work. After the rebuild is unblocked:

```bash
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net \
  docker compose up -d --build --force-recreate backend web
```

…or via build args / env file. New env var: `EXTRA_CORS_ORIGINS`.

## Open delegation work (not blockers for code)

- `docs/prompts/phase4-bheem-uma-parity.md`: VM ops: Uma backup repo + watchdog + restore drill. Requires sudo + Uma GitHub PAT + Uma Telegram bot. Closes 4 of 5 Bheem-only warnings in the ops panel.
- `docs/prompts/phase8-telegram-loop.md`: VM ops + bot tokens. Gated on Phase 4. Closes the dashboard-warning → Telegram delivery loop.
## Carry-forward from Phase 5 P2 mitigation roadmap

In `dashboard/DEPLOYMENT.md` "Mitigation roadmap":

- ✅ Allow-list wrapper around shell-outs
- ✅ Validate `/code-quality/check`'s `projectPath`
- ✅ Audit-log every privileged shell-out
- ✅ Non-root backend container (scaffolded, default-off pending host file permissions)
- ❌ **P3 still open**: replace raw `docker.sock` with verb-restricted daemon. Worth a design doc before code.

## How to verify on resume

```bash
# 1. Confirm the dashboard URL still works
curl -fsS -o /dev/null -w "tailscale dashboard: %{http_code}\n" \
  https://srv1491630.tailf85608.ts.net/login

# 2. Confirm platform-service is healthy on real Cosmos
docker exec learning_ai_common_plat-platform-service-1 \
  node -e 'fetch("http://localhost:4003/health").then(r=>r.text()).then(console.log)'

# 3. Confirm the admin user still exists and login works
PW=$(cat /tmp/devin-mint-pw.txt)
docker exec -e PW="$PW" learning_ai_common_plat-platform-service-1 sh -c '
  node -e "
    fetch(\"http://localhost:4003/api/auth/login\",{
      method:\"POST\",
      headers:{\"content-type\":\"application/json\"},
      body:JSON.stringify({email:\"admin@bytelyst.local\",password:process.env.PW,productId:\"bytelyst-devops\"})
    }).then(async r => { console.log(\"login:\", r.status); })
  "'

# 4. Confirm CORS hot-patch is still in place
docker exec devops-backend grep tailf85608 dist/server.js
# Expect: 'https://devops.bytelyst.com', 'https://srv1491630.tailf85608.ts.net',
```

## What I'd do first on the next session

1. **Fix the backend Dockerfile rebuild.** Probably switch to pnpm or debug the npm ci devDep issue. Once that works:
2. Rebuild + redeploy backend + web with `EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net`. This brings all the Phase 1-7 work live and replaces the hot-patch.
3. Verify the user can use the dashboard end-to-end with the new UI.
4. Delete `/tmp/devin-mint-jwt.txt` (no longer needed once auth works).
5. Help the user rotate `admin@bytelyst.local`'s password via the new UI.
6. Then return to whatever was next: re-seed other products, work on Phase 5 P3 (docker daemon proxy), or let the user drive.

*end checkpoint*