Captures the in-progress state of the long-running v2 dashboard session
so the next session (post `--permission-mode dangerous` relaunch) can
pick up without losing context. The full handoff narrative lives in
`docs/SESSION_CHECKPOINT_2026-05-30.md`; read it first.

Code change:
- `backend/src/server.ts` CORS allow-list is now env-driven via
  `EXTRA_CORS_ORIGINS` (comma-separated). Added because the user's
  browser reaches the deployed dashboard via a Tailscale-served
  hostname (`srv1491630.tailf85608.ts.net`), and the static built-in
  list only knew `localhost` + `devops.bytelyst.com`. Honours `*` as a
  wildcard for trusted-network deployments. Adds `Vary: Origin` so
  caches behave.
- `backend/package-lock.json` regenerated to match `package.json`
  (was missing the Phase 5 ESLint deps added earlier this session).
  Note: the Dockerfile build is STILL broken with `tsc: not found`
  despite typescript being in devDeps; this is a separate
  dual-lockfile issue documented in the checkpoint. Untangle on
  resume.

Live infra carry-forward summarised in the checkpoint doc:
- Real Azure Cosmos DB (`cosmos-mywisprai` / new `bytelyst` db)
  replaces the crash-looping local emulator.
- `learning_ai_common_plat/docker-compose.yml` has uncommitted
  changes mirroring this; that repo is 15 commits behind origin/main
  and needs a rebase+commit pass separately.
- Hot-patched the running `devops-backend` container's `dist/server.js`
  to allow the Tailscale origin (ephemeral; lost on next image build,
  superseded by the code change above once rebuild works).

Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

# Session Checkpoint — 2026-05-30

> Handoff snapshot for the next session. Read this top-to-bottom before
> touching anything; there's live infra state outside this repo that's
> material to the work in progress.

## TL;DR

Roadmap items shipped this session: all of Phases 1, 2, 3, 5, 6 and 7 of the
v2 dashboard roadmap, plus 4 of 5 items of the Phase 5 P2 mitigation roadmap.
Phase 4 + Phase 8 are documented as delegation briefs (VM ops, not code).

But: **the live deployed dashboard is still the pre-session image**.
Building a new image ran into a pre-existing dual-lockfile issue (pnpm-lock vs
backend/package-lock.json drift). That's the first thing to fix on
resume so the rest of this session's work actually ships.

There's also a **CORS hot-patch applied directly to the running
`devops-backend` container** to unblock the user's browser tour. That
patch evaporates on the next image build/recreate.

## What's live right now (running infra)

| Resource | State | Notes |
|---|---|---|
| Tailscale serve | UP | `https://srv1491630.tailf85608.ts.net/` → `localhost:3049` |
| `devops-backend` container | Up + healthy | Pre-session image (built ~2026-05-29) + a hot-patch in `dist/server.js` adding `https://srv1491630.tailf85608.ts.net` to CORS allow-list |
| `devops-web` container | Up | Pre-session image |
| `learning_ai_common_plat-platform-service-1` | Up + healthy | Restarted with new env pointing at real Cosmos DB |
| `learning_ai_common_plat-cosmos-emulator-1` | **Stopped** | Was crash-looping; replaced with real Cosmos |
| Real Cosmos DB account `cosmos-mywisprai` | Live | New `bytelyst` database created in `rg-mywisprai` (West US 2) |

To check on resume:

```bash
# -a so the stopped cosmos-emulator container is listed too
docker ps -a --filter name=devops --filter name=platform-service --filter name=cosmos
tailscale serve status
```

## Credentials (this session's mint, change on first login)

- **Dashboard URL**: <https://srv1491630.tailf85608.ts.net/login>
- **Email**: `admin@bytelyst.local`
- **Password**: run `cat /tmp/devin-mint-pw.txt` (random base64, 20 chars; rotate immediately)
- **Product ID**: `bytelyst-devops`
- **User ID** (in Cosmos): `usr_7fb3552c-3d8f-4fed-83e5-8461b018c345`

Backup minted JWT (24h, dashboard-backend `JWT_SECRET`, never used in the
end because the real auth flow took precedence): `/tmp/devin-mint-jwt.txt`.

Both files are in `/tmp`: they survive shell exit but are lost on reboot.

## Cross-repo state

### `learning_ai_devops_tools` (this repo)

Branch `main`. Pushed commits this session (18 in total), most recently:

| SHA | Phase | Title |
|---|---|---|
| `eaaa545` | 6 + P2 close | trend cards, theme toggle, drop-root scaffold, Agents inventory, Phase 0 reconfirm |
| `74a8ee0` | 5 P2 | allow-list shell wrapper + projectPath validation + audit-log shell-outs |
| `a8cf61a` | 8 | Telegram convention + delegation brief |
| `14c7a8f` | 6 | severity alerts + per-instance actions + URL-param deep links |
| `efdf41f` | 4 + 7 | Phase 4 brief + `/hermes/ops` requireAdmin |
| `62c0cd6` | 3.2 | Products pane on real service registry |
| `ad16b13` | 3.1 | hermes-telemetry contract + endpoint + 6 tests |
| `13e5e1c` | 5 P2 | Playwright E2E wired into Gitea CI |
| `1e64d75` | 5 P2 | structured pino logging + redaction |
| `c6ec1a0` | 5 P1 | privilege surface doc + `/code-quality/check` auth fix |
| `824f315` | 5 P1 | doc drift + dedupe deployment docs |
| `3fc471e` | 5 P1 | SSE TODO removed (dead `fastify-sse-v2`) |
| `8ba2dbd` | 5 P1 | 35 auth/csrf/health/orchestrator tests + coverage gate |
| `ecd1f20` | 2 | instance dimension across Mission Control |
| `1e64d75`, `c6ec1a0`, `824f315`, `3fc471e`, `8ba2dbd`, `cf5428a` | earlier in session | (see roadmap notes for full list) |

Uncommitted (will be in the same commit as this checkpoint):
- `dashboard/backend/src/server.ts`: CORS now env-driven via
  `EXTRA_CORS_ORIGINS`. Source-correct, typechecks. **Not in the running
  image** because the image rebuild is currently broken (see below).
- `dashboard/backend/package-lock.json`: regenerated to match
  `package.json`. Was the source of an earlier rebuild error; the
  `tsc: not found` failure described below is separate and still open.

### `learning_ai_common_plat` (sibling repo)

Branch `main`, **15 commits behind origin/main**. **Uncommitted, not pushed.**

Working tree changes:
- `docker-compose.yml`: Cosmos emulator service replaced/disabled, all
  consuming services point at real Cosmos via `.env`. Long inline
  comment explains why.
- `.env`: **gitignored, contains live Cosmos credentials**. Do not commit.
- `.env.bak-pre-real-cosmos`: backup of the env file before I changed it,
  also gitignored. Delete once you're sure the real-Cosmos setup is staying.

Suggested next action there: rebase + commit the docker-compose.yml diff
once you've verified other dashboards (`mindlyst`, `lysnrai`, etc.) still
work without the emulator. They reference `cosmos-emulator:8081` in
compose env vars and will need similar repointing.

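One way to find the compose files that still need repointing is a small search helper. This is a minimal sketch: the function name `find_emulator_refs` is made up for illustration, and it assumes the compose files are named `docker-compose*.yml` or `docker-compose*.yaml`; adjust the patterns if the other dashboards use different file names.

```bash
# List compose files under a checkout root that still mention the emulator host.
# Prints "file:line:match" for each hit and nothing when there are none.
# Hypothetical helper; the file-name patterns are an assumption.
find_emulator_refs() {
  root="${1:-.}"
  find "$root" -name node_modules -prune -o -type f \
    \( -name 'docker-compose*.yml' -o -name 'docker-compose*.yaml' \) \
    -exec grep -HnF 'cosmos-emulator:8081' {} +
}
```

Run it from the directory that holds the sibling checkouts, for example `find_emulator_refs /opt/bytelyst`, and repoint each hit before committing the compose change.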
## Real Cosmos DB layout

- Account: `cosmos-mywisprai` (West US 2, resource group `rg-mywisprai`)
- Existing databases: `mindlyst`, `lysnrai`, `mywisprai`, `invttrdg`
- **New database added today**: `bytelyst` (for platform-service)
- Collections in `bytelyst`: created automatically by platform-service's
  `COSMOS_AUTO_INIT=true` on startup
- Auto-seeded so far: `bytelyst-devops` product + the admin user above.
  **All 12 other products (`lysnrai`, `mindlyst`, etc.) need re-seeding**
  if their respective dashboards/services are expected to work.

## What broke that needs fixing on resume

### 1. Backend Dockerfile build (BLOCKING the redeploy)

```
RUN npm ci --ignore-scripts     # OK
RUN npm run build               # fails: sh: tsc: not found
```

`typescript` is in devDependencies of `package.json` and present in
`package-lock.json`, but `npm ci` isn't actually installing it in the
Alpine builder stage. Cause unknown; it could be:
- A `NODE_ENV=production` leaking into the build context
- `.npmrc` somehow excluding devDeps
- The Alpine Node 20 image's npm behaving differently

Investigation paths:
1. `docker run --rm -it node:20-alpine sh` and reproduce `npm ci` from
   the lockfile manually
2. Check whether `BYTELYST_PACKAGE_SOURCE=vendor` (compose default) is
   triggering a `.pnpmfile.cjs` hook that drops devDeps
3. Just switch the Dockerfile to pnpm to align with the workspace

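The first suspected cause can be checked quickly, because npm defaults its `omit` setting to `dev` whenever `NODE_ENV` is `production`. The sketch below only inspects environment variables, so it can be pasted into the builder stage or the `node:20-alpine` shell from path 1. The function name `devdeps_omitted` is invented for illustration, and it does not read `.npmrc`; `npm config get omit` inside the container covers that case.

```bash
# Report whether the current environment would make `npm ci` skip devDependencies.
# Hypothetical helper: checks env vars only, not .npmrc files.
devdeps_omitted() {
  # npm treats npm_config_* environment variables case-insensitively.
  omit="${npm_config_omit:-${NPM_CONFIG_OMIT:-}}"
  case "$omit" in
    *dev*) echo "yes: omit=$omit"; return 0 ;;
  esac
  if [ "${NODE_ENV:-}" = "production" ]; then
    echo "yes: NODE_ENV=production"
    return 0
  fi
  echo "no"
}
```

If it prints `yes`, one option is to run `npm ci --include=dev --ignore-scripts` in the builder stage so the install no longer depends on the ambient environment. If it prints `no`, the cause lies elsewhere and the remaining investigation paths still apply.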
### 2. Web Dockerfile likely has the same dual-lockfile drift

Not verified yet, but `dashboard/web/package-lock.json` exists alongside
`dashboard/pnpm-lock.yaml`. Expect the same `npm ci` failure when web is
rebuilt. Worth checking in the same pass.

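To see every place where the two lockfile families coexist, a short listing helper is enough. This is a sketch with an invented function name (`list_lockfiles`); it simply prints each `package-lock.json` and `pnpm-lock.yaml` under a root, skipping `node_modules`.

```bash
# Print every npm and pnpm lockfile under a root, skipping node_modules.
# Hypothetical helper for spotting dual-lockfile drift.
list_lockfiles() {
  root="${1:-.}"
  find "$root" -name node_modules -prune -o \
    \( -name package-lock.json -o -name pnpm-lock.yaml \) -type f -print | sort
}
```

For example, `list_lockfiles /opt/bytelyst/learning_ai_devops_tools/dashboard` should show both the workspace `pnpm-lock.yaml` and any per-package `package-lock.json` files that the Dockerfiles rely on.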
### 3. CORS allow-list is hot-patched, not built in

The running `devops-backend` container has a `sed`-applied edit to
`dist/server.js` to allow the Tailscale origin. **Lost on next image
build.** The source fix is committed (this commit) but won't take effect
until the rebuild works. Workaround: keep re-applying the hot-patch until
the rebuild is unblocked. After that, set
`EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net` via env at
runtime (the new code reads it; the currently running image does not).

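Before redeploying, the value intended for `EXTRA_CORS_ORIGINS` can be sanity-checked with a shell stand-in for the semantics described in this checkpoint (comma-separated entries, `*` as a wildcard). This is a sketch, not the code in `server.ts`: the function name `origin_allowed` is hypothetical, exact string comparison and whitespace trimming are assumptions, and the built-in `localhost` / `devops.bytelyst.com` entries are not modelled.

```bash
# Return 0 if ORIGIN would match the comma-separated EXTRA list, 1 otherwise.
# Hypothetical stand-in for the allow-list check; runs in a subshell so the
# IFS and globbing changes do not leak into the caller.
origin_allowed() (
  origin="$1"
  IFS=','
  set -f   # keep a literal '*' entry from being glob-expanded
  for entry in ${2:-}; do
    entry="$(printf '%s' "$entry" | tr -d '[:space:]')"
    if [ "$entry" = "*" ] || [ "$entry" = "$origin" ]; then
      exit 0
    fi
  done
  exit 1
)
```

Usage: `origin_allowed "https://srv1491630.tailf85608.ts.net" "$EXTRA_CORS_ORIGINS" && echo allowed`. An empty or unset value matches nothing, which mirrors the situation before the env var is set.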
### 4. The deployed dashboard is the OLD code

The user's "tour" of the dashboard right now shows none of this session's
work. After the rebuild is unblocked:

```bash
cd /opt/bytelyst/learning_ai_devops_tools/dashboard
EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net docker compose up -d --build --force-recreate backend web
```

…or via build args / env file. New env var: `EXTRA_CORS_ORIGINS`.

## Open delegation work (not blockers for code)

- `docs/prompts/phase4-bheem-uma-parity.md`: VM ops (Uma backup repo +
  watchdog + restore drill). Requires sudo + Uma GitHub PAT + Uma Telegram
  bot. Closes 4 of 5 Bheem-only warnings in the ops panel.
- `docs/prompts/phase8-telegram-loop.md`: VM ops + bot tokens. Gated on
  Phase 4. Closes the dashboard-warning → Telegram delivery loop.

## Carry-forward from Phase 5 P2 mitigation roadmap

In `dashboard/DEPLOYMENT.md` "Mitigation roadmap":
- ✅ Allow-list wrapper around shell-outs
- ✅ Validate `/code-quality/check`'s `projectPath`
- ✅ Audit-log every privileged shell-out
- ✅ Non-root backend container (scaffolded, default-off pending host file
  permissions)
- ❌ **P3 still open**: replace raw `docker.sock` with verb-restricted
  daemon. Worth a design doc before code.

## How to verify on resume

```bash
# 1. Confirm the dashboard URL still works
curl -fsS -o /dev/null -w "tailscale dashboard: %{http_code}\n" \
  https://srv1491630.tailf85608.ts.net/login

# 2. Confirm platform-service is healthy on real Cosmos
docker exec learning_ai_common_plat-platform-service-1 \
  node -e 'fetch("http://localhost:4003/health").then(r=>r.text()).then(console.log)'

# 3. Confirm the admin user still exists and login works
PW=$(cat /tmp/devin-mint-pw.txt)
docker exec -e PW="$PW" learning_ai_common_plat-platform-service-1 sh -c '
  node -e "
    fetch(\"http://localhost:4003/api/auth/login\",{
      method:\"POST\",
      headers:{\"content-type\":\"application/json\"},
      body:JSON.stringify({email:\"admin@bytelyst.local\",password:process.env.PW,productId:\"bytelyst-devops\"})
    }).then(async r => { console.log(\"login:\", r.status); })
  "'

# 4. Confirm CORS hot-patch is still in place
docker exec devops-backend grep tailf85608 dist/server.js
# Expect: 'https://devops.bytelyst.com', 'https://srv1491630.tailf85608.ts.net',
```

## What I'd do first on the next session

1. **Fix the backend Dockerfile rebuild.** Probably switch to pnpm or
   debug the npm ci devDep issue. Once that works:
2. Rebuild + redeploy backend + web with
   `EXTRA_CORS_ORIGINS=https://srv1491630.tailf85608.ts.net`. This brings
   all the Phase 1-7 work live and replaces the hot-patch.
3. Verify the user can use the dashboard end-to-end with the new UI.
4. Delete `/tmp/devin-mint-jwt.txt` (no longer needed once auth works).
5. Help the user rotate `admin@bytelyst.local`'s password via the new UI.
6. Then return to whatever was next: re-seed other products, work on
   Phase 5 P3 (docker daemon proxy), or let the user drive.

— end checkpoint —