learning_ai_common_plat/docs/WORKSPACE_ANTI_PATTERNS.md

298 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Workspace Anti-Pattern Audit
> **Date:** 2026-02-28
> **Scope:** All 5 workspace repos (`learning_ai_common_plat`, `learning_voice_ai_agent`, `learning_multimodal_memory_agents`, `learning_ai_clock`, `learning_ai_fastgap`)
> **Method:** Automated grep/scan across all repos + manual review
---
## Summary
| Severity | Count | Category |
| ------------------------------- | ------ | -------------------------------------------------------- |
| **P0 — Security / Data Loss** | 4 | Auth gaps, secrets exposure, CORS wildcard |
| **P1 — Reliability / Crashes** | 6 | Missing error handling, no retries, no graceful shutdown |
| **P2 — Maintainability / Debt** | 8 | Code duplication, version mismatches, package divergence |
| **P3 — Operational / DX** | 6 | CI gaps, env sprawl, missing observability |
| **Total** | **24** | |
---
## P0 — Security / Data Loss
### 1. Admin API routes missing auth guards — CRITICAL
**28 of 53** admin-web API routes have **no auth check** at all. This includes sensitive endpoints:
| Route | Risk |
| -------------------------------------- | -------------------------------------------- |
| `/api/ops/secrets` (GET/POST) | **Lists and writes Azure Key Vault secrets** |
| `/api/ops/secrets/[name]` (GET/DELETE) | **Reads and deletes individual secrets** |
| `/api/telemetry/*` (7 routes) | Queries/mutates telemetry data |
| `/api/themes/*` (4 routes) | Modifies platform themes |
| `/api/tokens/*` (2 routes) | API token management |
| `/api/stripe/config` | Stripe configuration |
The secrets routes are the most critical — they interact directly with Azure Key Vault with **zero authentication**. Anyone who can reach the admin dashboard can read/write/delete all production secrets.
**Fix:** Add Next.js edge middleware (`middleware.ts`) that validates JWT on all `/api/*` routes except `/api/auth/login` and `/api/auth/forgot-password`. This is a single file, ~30 lines, that protects all routes uniformly.
### 2. User-dashboard API routes missing auth — HIGH
**31 of 36** user-dashboard API routes lack explicit auth checks. Includes:
- `/api/payments`, `/api/subscription` — billing operations
- `/api/sessions/*` — user session data
- `/api/stripe/portal`, `/api/stripe/config`
- `/api/transcripts` — user transcript data
- `/api/dashboard` — dashboard aggregations
**Fix:** Same middleware.ts pattern. Protect all `/api/*` except `/api/auth/*`.
### 3. CORS defaults to wildcard (`origin: true`) — MEDIUM
In `@bytelyst/fastify-core`, when `CORS_ORIGIN` env var is not set, CORS defaults to `origin: true` (allow all origins). In production, if someone forgets to set this variable, any website can make authenticated requests to platform-service.
```typescript
// packages/fastify-core/src/create-app.ts:34
const origin = corsOrigin ? corsOrigin.split(',').map(o => o.trim()) : true;
```
**Fix:** Default to `false` (deny all) when `CORS_ORIGIN` is unset. Require explicit opt-in.
### 4. No CSP / security headers on MindLyst-web and ChronoMind-web — MEDIUM
Admin-web, tracker-web, and user-dashboard all have security headers in `next.config.ts`. MindLyst-web and ChronoMind-web have **zero** security headers configured — no CSP, no X-Frame-Options, no HSTS.
**Fix:** Copy the security headers block from admin-web's `next.config.ts` to both apps.
---
## P1 — Reliability / Crashes
### 5. 21 API routes with no try/catch — crash on any DB/network error — HIGH
| Dashboard | Total Routes | Without try/catch | % Unprotected |
| -------------- | ------------ | ----------------- | ------------- |
| user-dashboard | 36 | 12 | 33% |
| mindlyst-web | 33 | 6 | 18% |
| admin-web | 53 | 3 | 6% |
Unprotected routes include payments, subscriptions, sessions, transcripts, SSO callbacks. Any Cosmos timeout or network blip returns an unhandled 500 with a stack trace (information leak + poor UX).
**Fix:** Wrap each handler body in try/catch, or create a shared `withErrorHandler()` HOF that all API routes use.
### 6. No retry / timeout / circuit breaker in `@bytelyst/api-client` — HIGH
The shared `createApiClient()` has **no timeout**, **no retry logic**, and **no circuit breaker**. Every consumer (6 dashboards + mobile apps) inherits this:
- A single Cosmos slowdown cascades to all dashboards
- Network blips cause immediate failures with no recovery
- No AbortController timeout — requests can hang indefinitely
**Fix:** Add `timeout` option (default 10s via AbortController), `retries` option (default 2 for GET, 0 for mutations), and exponential backoff.
### 7. No graceful shutdown in Fastify services — MEDIUM
`startService()` in `@bytelyst/fastify-core` calls `process.exit(1)` on startup failure but has **no SIGTERM/SIGINT handler**. In Docker/K8s, this means:
- In-flight requests are dropped on deploy
- Database connections not cleaned up
- Potential data corruption on writes
**Fix:** Add to `startService()`:
```typescript
for (const signal of ['SIGTERM', 'SIGINT']) {
process.on(signal, async () => {
app.log.info(`Received ${signal}, shutting down gracefully`);
await app.close();
process.exit(0);
});
}
```
### 8. No `error.tsx` in any Next.js app — MEDIUM
**Zero** of the 5 Next.js apps have an `error.tsx` file. When a React component throws during render, users see a blank white page (or the browser's default error). This is the #1 source of "the app is broken" reports.
**Fix:** Add `error.tsx` to each app's root `app/` directory — ~20 lines showing a "Something went wrong" UI with a retry button.
### 9. No `not-found.tsx` in 4 of 5 Next.js apps — LOW
Only MindLyst has a custom 404. The other 4 apps show Next.js's default 404 page.
**Fix:** Add `not-found.tsx` to each app.
### 10. Missing `loading.tsx` in 4 of 5 dashboards — LOW
Only admin-web has a `loading.tsx`. Other dashboards show no loading indicator during route transitions.
**Fix:** Add a skeleton loader `loading.tsx` to each app's layout group.
---
## P2 — Maintainability / Code Duplication
### 11. MindLyst-web uses raw `@azure/cosmos` v3 instead of `@bytelyst/cosmos` — HIGH
MindLyst-web has its own 86-line `cosmos.ts` with a hand-rolled Cosmos client using **v3.17.3** of the SDK. Every other dashboard uses `@bytelyst/cosmos` (which uses v4.x). This means:
- **Different API surface** (v3 vs v4 have breaking changes)
- **No container registry** (MindLyst manages containers ad-hoc)
- **Hardcoded `PRODUCT_ID = "mindlyst"`** instead of using `@bytelyst/config`
- Bug fixes to the shared package don't reach MindLyst
**Fix:** Migrate MindLyst-web to `@bytelyst/cosmos` + `@bytelyst/config`. Replace the 86-line file with ~40 lines matching user-dashboard pattern.
### 12. MindLyst billing-client uses raw fetch instead of `@bytelyst/api-client` — MEDIUM
MindLyst's `billing-client.ts` has its own `billingFetch()` wrapper with hardcoded headers, token management, and error handling. User-dashboard's `billing-client.ts` correctly uses `createApiClient()`.
**Fix:** Rewrite MindLyst's billing-client to use `@bytelyst/api-client` like every other dashboard.
### 13. Duplicate `feature-flags.ts` across repos — MEDIUM
`feature-flags.ts` is nearly identical in user-dashboard and MindLyst-web (only differs by `PRODUCT_ID` fallback). Both have their own raw `fetch()` calls.
**Fix:** Either add a `createFeatureFlagClient()` to `@bytelyst/api-client` or create a thin `@bytelyst/feature-flags` package.
### 14. 5 copies of `product-config.ts` with identical boilerplate — LOW
Every service and dashboard has its own `product-config.ts` that wraps `@bytelyst/config`. The files are 5-10 lines of identical code.
**Fix:** Consider making `@bytelyst/config` export a ready-to-use `PRODUCT_ID` constant (lazy-loaded) to eliminate the wrapper files.
### 15. 4 copies of `docker-prep.sh` across repos — LOW
Each consumer repo has its own `docker-prep.sh` script (22-45 lines each) for packing `@bytelyst/*` tarballs. They diverge in package lists and paths.
**Fix:** Move the canonical script to `learning_ai_common_plat/scripts/docker-prep.sh` and have consumer repos call it, or use a shared Makefile target.
### 16. Duplicate `error-boundary.tsx` in admin + user dashboards — LOW
Nearly identical class component (differs by 3 whitespace lines). Should be in a shared UI package.
**Fix:** Move to `@bytelyst/react-auth` (or create `@bytelyst/react-ui`) and re-export.
### 17. Zod v4 vs v3 conflict — ChronoMind uses Zod 4 — MEDIUM
ChronoMind-web uses `zod: "^4.3.6"` while **every other package and service** uses Zod 3.x. The `@bytelyst/*` packages (config, events) all depend on Zod 3. This means:
- ChronoMind cannot use `@bytelyst/config` or any Zod-dependent shared package without runtime conflicts
- Schema types are incompatible between Zod 3 and Zod 4
**Fix:** Either downgrade ChronoMind to Zod 3 to match ecosystem, or upgrade the entire ecosystem to Zod 4 (breaking change for all services).
### 18. TypeScript version skew — MINOR
| Range | Repos |
| --------------------------- | ------------------------------------------------------ |
| `^5` (loose) | admin-web, tracker-web, user-dashboard, chronomind-web |
| `^5.7.0` `^5.7.3` | common-plat root, services |
| `~5.9.2` `5.9.3` (pinned) | NomGap, MindLyst-web |
MindLyst pins `5.9.3` (exact) while NomGap uses `~5.9.2`. The common-plat root uses `^5.7.0`. This can cause type-checking discrepancies.
**Fix:** Standardize all repos to `^5.9.0` in a coordinated PR.
---
## P3 — Operational / DX
### 19. 10 disabled CI workflows — no automated quality gate — HIGH
| Repo | Disabled Workflows |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| learning_voice_ai_agent | 7 (ci.yml, ci-python-backend, ci-admin-dashboard, ci-user-dashboard, ci-tracker-dashboard, churn-alert, release) |
| learning_ai_common_plat | 2 (ci.yml, trigger-consumers) |
| learning_multimodal_memory_agents | 1 (ci.yml) |
Only ChronoMind and NomGap have active CI. The **three largest repos** have zero automated CI on push/PR. Regressions go undetected until manual testing.
**Fix:** Re-enable CI workflows. Even a minimal `typecheck + test` workflow on PR catches most regressions.
### 20. Zero `x-request-id` propagation in dashboard API routes — MEDIUM
**All 122 dashboard API routes** (53 admin + 36 user + 33 mindlyst) lack `x-request-id` propagation. When a dashboard API route calls platform-service, there's no way to correlate the request across services in logs.
**Fix:** Add a shared middleware or utility that auto-generates and forwards `x-request-id` from incoming request to all outgoing `fetch()` / `createApiClient()` calls.
### 21. 80+ unique env vars with no central registry — MEDIUM
Across all services and dashboards, there are **80+ unique `process.env.*` references**. There's no single document listing which vars each app needs, their valid values, and which are required vs optional.
**Fix:** Create an `ENV_REGISTRY.md` in common-plat docs, auto-generated by scanning all repos. Each entry: var name, required/optional, which apps use it, description.
### 22. No `middleware.ts` in any Next.js app — MEDIUM
None of the 5 Next.js apps have a `middleware.ts` file. This means:
- No edge-level auth protection (each API route must check auth individually — and most don't)
- No redirect logic for unauthenticated users
- No request logging at the edge
**Fix:** Add `middleware.ts` to admin-web, user-dashboard, and mindlyst-web. Tracker-web may not need it if it's mostly public.
### 23. No `instrumentation.ts` in ChronoMind-web — LOW
All other Next.js apps have `instrumentation.ts` for AKV secret resolution at startup. ChronoMind-web is missing it — secrets won't resolve from Key Vault.
**Fix:** Add `instrumentation.ts` following the pattern from user-dashboard-web.
### 24. Package manager split: pnpm (common-plat) vs npm (all consumers) — INFO
Common-plat uses pnpm workspace. All 4 consumer repos use npm with `package-lock.json`. This isn't a bug but creates friction:
- Contributors must know which tool to use where
- `file:` refs from npm to pnpm workspace packages require `pnpm build` first
- Lock file formats differ
**Recommendation:** Document this clearly. Long-term, consider migrating consumers to pnpm or publishing `@bytelyst/*` to a private registry.
---
## Priority Action Plan
### Sprint 1 — Security (1-2 days)
1. Add `middleware.ts` to admin-web (blocks unauthenticated access to secrets, telemetry, themes, tokens)
2. Add `middleware.ts` to user-dashboard (blocks unauthenticated access to payments, sessions, transcripts)
3. Fix CORS default to deny-all when `CORS_ORIGIN` is unset
4. Add security headers to MindLyst-web and ChronoMind-web `next.config.ts`
### Sprint 2 — Reliability (2-3 days)
5. Add `error.tsx` + `not-found.tsx` to all 5 Next.js apps
6. Add try/catch to all 21 unprotected API routes (or create shared error handler HOF)
7. Add timeout + retry to `@bytelyst/api-client`
8. Add graceful shutdown to `@bytelyst/fastify-core`
### Sprint 3 — Deduplication (2-3 days)
9. Migrate MindLyst-web from raw `@azure/cosmos` v3 → `@bytelyst/cosmos` v4
10. Migrate MindLyst billing-client to `@bytelyst/api-client`
11. Consolidate `feature-flags.ts` into shared package
12. Resolve Zod v3/v4 conflict (ChronoMind)
### Sprint 4 — Ops & DX (1-2 days)
13. Re-enable CI on the 3 largest repos (even minimal typecheck + test)
14. Add `x-request-id` propagation to dashboard API layer
15. Standardize TypeScript version across all repos
16. Create `ENV_REGISTRY.md` with all env vars documented
---
## Items Confirmed Correct (Not Anti-Patterns)
| # | Item | Status |
| --- | ---------------------------------------------------------------------------------------- | ----------------------------------------------- |
| A | Tracker-web has no direct Cosmos access — uses `@bytelyst/api-client` → platform-service | Correct by design |
| B | MindLyst native (KMP) has no Azure wiring — all Azure goes through web API routes | Correct by design |
| C | ChronoMind/NomGap have no direct Azure SDK usage — REST API only | Correct by design |
| D | `console.log` in `@bytelyst/logger` | Intentional (it IS the logger) |
| E | `console.log` in `design-tokens/generate.ts` | Build script, not production code |
| F | `print()` in Python `cli_output.py` | Intentional CLI output (has `noqa` comment) |
| G | `as any` in `api-client/client.ts:44` | Single occurrence, casting Headers — acceptable |