docs(local-llm): comprehensive roadmap review — 5 bugs, 6 phases, 31 items

Systematic code review of DASHBOARD_ROADMAP.md against actual codebase:

Bugs found (BN1-BN5):
- BN1: Compare buttons show unloaded models (can't generate)
- BN2: No AbortController on compare stream (leaks on close)
- BN3: Chat messages lost on modal close (no persistence)
- BN4: Logs panel has no refresh button
- BN5: Delete dialog missing reclaim size (partial impl exists)

Expanded from 4 phases to 6 + backlog (15 → 31 items):
- Phase 1: Pre-load intelligence + bug fixes (N1-N3, BN1-BN2, BN5)
- Phase 2: Rich metadata + persistence (N4-N5, BN3-BN4)
- Phase 3: Model intelligence badges + sort (N6-N10)
- Phase 4: Runtime metrics + UX polish (N11-N14)
- Phase 5 (NEW): Response quality — markdown, syntax highlight,
  vision upload, think-block collapse, model library link
- Phase 6 (NEW): Data persistence — export/import, inference log,
  factory reset
- Phase 7: Expanded backlog (F17-F38, +6 new ideas)

Improvements:
- Added checkboxes for all tasks and acceptance criteria
- Quant-aware RAM estimate multipliers (Q4/Q5/Q8/F16)
- Broader vision model regex (bakllava, moondream, llama-vision)
- DeepSeek R1 distill variant detection for think badge
- Conservative memory availability formula (free + cached*0.5)
- localStorage key registry with llm- prefix standardization
- Dependency graph between phases
- ~6 hrs total estimated effort
saravanakumardb1 2026-02-19 23:02:25 -08:00
parent cd6e561f1b
commit ae231d5aac


@@ -15,30 +15,76 @@ Transform the dashboard from a model management tool into a **model intelligence
---
## Bugs Found During Review
Issues discovered by cross-referencing the roadmap against the actual codebase (`page.tsx` ~1,885 lines):
- [ ] **BN1. Compare buttons show unloaded models** `page.tsx:1809`
`ollama.models.filter(m => m.name !== promptModel)` shows ALL installed models for comparison, but unloaded models can't generate responses. Should filter to `ollama.running` only, or show a "load first" indicator.
- [ ] **BN2. No AbortController on compare stream** `page.tsx:236-274`
`handleCompare` fetches `/api/ollama/stream` but doesn't use an abort controller. Closing the prompt modal during comparison doesn't cancel the stream — it continues in the background wasting resources.
- [ ] **BN3. Chat messages lost on modal close** `page.tsx:1524-1530`
Closing the prompt modal clears `promptResponse` and `promptText` but does not persist `chatMessages`. Re-opening the modal starts a fresh conversation. Multi-turn history is discarded.
- [ ] **BN4. Logs panel has no refresh** `page.tsx:1476-1517`
The Ollama logs panel fetches once on open (`fetchLogs` on toggle). There's no refresh button — the only way to see new logs is to close and re-open the panel.
- [ ] **BN5. Delete confirmation doesn't show reclaim size** `page.tsx:1121-1153`
Delete confirmation exists (two-step flow via `deleteConfirm` state) but only shows "Delete this model?" without the disk reclaim amount. N14 in the roadmap was marked as new, but the dialog already exists — it just needs `formatBytes(model.size)` added.
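
For BN2, the cancel-on-close behavior could look roughly like the sketch below. It is illustrative only: `CompareStreamHandle` and `isAbortError` are hypothetical names, and in `page.tsx` the controller would more likely live in a React ref (the roadmap calls it `compareAbortRef`). The signal returned by `begin()` would be passed as the `signal` option of the `fetch` call to `/api/ollama/stream`, and `cancel()` would be called from the modal's close handler.

```ts
// Hypothetical helper: owns at most one in-flight compare request.
class CompareStreamHandle {
  private controller: AbortController | null = null;

  // Start a new compare request, cancelling any previous one still running.
  begin(): AbortSignal {
    this.controller?.abort();
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Intended to be called when the prompt modal closes.
  cancel(): void {
    this.controller?.abort();
    this.controller = null;
  }
}

// An aborted fetch rejects with an error named "AbortError";
// the stream reader can treat that as a normal exit instead of a failure.
function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  );
}
```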
---
## Phase 1 — Pre-Load Intelligence _(Sprint 8)_
**Goal:** Give users the information they need **before** clicking "Load" on a model.
**Estimated effort:** ~45 minutes
**Estimated effort:** ~60 minutes
| # | ID | Task | Status | Priority | Notes |
| --- | --- | -------------------------- | ------ | -------- | ----------------------------------------------------------- |
| 1 | N1 | Estimated RAM per model | [ ] | High | Q4_K_M ≈ 1.2× disk size. Show `~22 GB RAM` on every card. |
| 2 | N2 | "Will it fit?" indicator | [ ] | High | 🟢 Fits / 🟡 Tight / 🔴 Won't fit based on free+cached RAM. |
| 3 | N3 | Aggregate loaded model RAM | [ ] | High | Sum VRAM at top of panel: "2 loaded · 28.5 GB VRAM". |
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ---------------------------------- | ------ | -------- | ------------------------------------------------------------ |
| 1 | N1 | Estimated RAM per model | - [ ] | High | Show `~22 GB RAM` on every model card, not just running ones |
| 2 | N2 | "Will it fit?" indicator | - [ ] | High | 🟢 Fits / 🟡 Tight / 🔴 Won't fit, on Load button |
| 3 | N3 | Aggregate loaded model RAM | - [ ] | High | Sum VRAM at top of panel: "2 loaded · 28.5 GB VRAM" |
| 4 | BN1 | Fix compare to show loaded only | - [ ] | High | Filter compare buttons to `ollama.running` models |
| 5 | BN2 | Add AbortController to compare | - [ ] | High | Cancel compare stream on modal close |
| 6 | BN5 | Show reclaim size in delete dialog | - [ ] | Medium | Add `formatBytes(model.size)` to existing confirmation |
**Implementation details:**
- **N1:** Add `estimateRam(diskSize: number)` to `lib/format.ts`. Returns `diskSize * 1.2`. Display below existing size/params/quant line on each model card.
- **N2:** Compare `estimateRam(model.size)` against `system.memory.free + system.memory.cached`. Pass `system` into the model list rendering. Add colored dot or badge next to Load button.
- **N3:** Compute `ollama.running.reduce((sum, r) => sum + r.size_vram, 0)` and display in the models panel header next to "X active".
- **N1:** Add `estimateRam(diskSize: number, quant?: string)` to `lib/format.ts`. Use quantization-aware multipliers:
- `Q4_K_M` / `Q4_K_S` / `Q4_0`: 1.2×
- `Q5_K_M` / `Q5_K_S`: 1.25×
- `Q8_0`: 1.1×
- `F16` / `F32`: 1.05×
- Default (unknown): 1.2×
Display below existing size/params/quant line on each model card.
**Note:** Apple Silicon uses unified memory — GPU and CPU share the same pool. Add a tooltip explaining this.
- **N2:** Compare `estimateRam(model.size, model.details?.quantization_level)` against effective available memory. Cached memory is reclaimable by the OS but not guaranteed to be available, so use the conservative estimate `system.memory.free + system.memory.cached * 0.5` instead of `free + cached`.
- 🟢 Green: estimated < 70% of available → "Fits comfortably"
- 🟡 Yellow: estimated is 70-100% of available → "Tight — may swap"
- 🔴 Red: estimated > available → "Won't fit — will swap heavily"
Add as colored dot next to Load button with tooltip.
- **N3:** Compute `ollama.running.reduce((sum, r) => sum + r.size_vram, 0)` and display in the models panel header: "X active · Y GB VRAM".
- **BN1:** Change `ollama.models.filter(m => m.name !== promptModel)` to `ollama.running.filter(r => r.name !== promptModel)` for the compare buttons. If no other models are loaded, show "Load another model to compare".
- **BN2:** Create a `compareAbortRef` similar to `abortRef`. Call `compareAbortRef.current?.abort()` when modal closes.
- **BN5:** Change "Delete this model?" to `"Delete ${model.name}? Reclaim ${formatBytes(model.size)}"`.
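
The N1 multipliers and N2 thresholds above could be sketched as follows. This is a minimal sketch, not the final `lib/format.ts` code: `effectiveAvailable` and `fitStatus` are hypothetical helper names, all sizes are assumed to be in bytes, and treating exactly 70% as "Tight" is an assumption about the boundary.

```ts
// Multipliers from the N1 list; unknown quantizations use the default.
const QUANT_MULTIPLIERS = new Map<string, number>([
  ["Q4_K_M", 1.2], ["Q4_K_S", 1.2], ["Q4_0", 1.2],
  ["Q5_K_M", 1.25], ["Q5_K_S", 1.25],
  ["Q8_0", 1.1],
  ["F16", 1.05], ["F32", 1.05],
]);
const DEFAULT_MULTIPLIER = 1.2;

// Estimated RAM needed to load a model, from its on-disk size.
function estimateRam(diskSize: number, quant?: string): number {
  const multiplier =
    QUANT_MULTIPLIERS.get(quant?.toUpperCase() ?? "") ?? DEFAULT_MULTIPLIER;
  return diskSize * multiplier;
}

// Conservative availability: count only half of the cached memory.
function effectiveAvailable(free: number, cached: number): number {
  return free + cached * 0.5;
}

type FitStatus = "fits" | "tight" | "wont-fit";

// Green below 70% of available, yellow from 70% to 100%, red above 100%.
function fitStatus(estimated: number, available: number): FitStatus {
  if (estimated > available) return "wont-fit";
  if (estimated >= available * 0.7) return "tight";
  return "fits";
}
```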
**Acceptance criteria:**
- Every model card shows estimated RAM requirement
- Load button has a color-coded fit indicator
- Panel header shows total VRAM of loaded models
- TypeScript compiles cleanly
- [ ] Every model card shows estimated RAM requirement with quant-aware multiplier
- [ ] Load button has a color-coded fit indicator with tooltip
- [ ] Panel header shows total VRAM of loaded models
- [ ] Compare buttons only show loaded models
- [ ] Compare stream is abortable
- [ ] Delete dialog shows disk reclaim amount
- [ ] TypeScript compiles cleanly
---
@@ -46,23 +92,39 @@ Transform the dashboard from a model management tool into a **model intelligence
**Goal:** Surface critical model metadata that's currently hidden behind the Ollama API.
**Estimated effort:** ~60 minutes
**Estimated effort:** ~75 minutes
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ------------------- | ------ | -------- | ---------------------------------------------------------------------------- |
| 4 | N4 | RAM budget bar | [ ] | Medium | Stacked horizontal bar: OS+Apps / Loaded models (by name) / Free. |
| 5 | N5 | Context window size | [ ] | High | Fetch `context_length` from `/api/show` model_info. Show `128k ctx` on card. |
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ------------------------- | ------ | -------- | -------------------------------------------------- |
| 7 | N4 | RAM budget bar | - [ ] | Medium | Stacked bar: OS+Apps / Models (by name) / Free |
| 8 | N5 | Context window size | - [ ] | High | Fetch `context_length` from `/api/show` model_info |
| 9 | BN3 | Persist chat messages | - [ ] | Medium | Save to localStorage, restore on modal re-open |
| 10 | BN4 | Logs panel refresh button | - [ ] | Low | Add refresh icon next to "Show/Hide Ollama Logs" |
**Implementation details:**
- **N4:** New `RamBudgetBar` component. Inputs: `totalRam`, `appMemory`, `runningModels[]` (each with name + size_vram), `freeRam`. Renders as a CSS flex bar with labeled segments. Place above the models list.
- **N5:** Extend the `OllamaModel` interface to include optional `context_length?: number`. On expand (or eagerly for all models), call `/api/ollama` POST with `action: 'show'` and extract `model_info.*.context_length`. Cache in component state. Show as badge on card.
- **N4:** New `RamBudgetBar` component in `components/`. Inputs: `totalRam`, `appMemory`, `runningModels[]` (each with name + size_vram), `freeRam`. Renders as a CSS flex bar with labeled segments using inline widths.
- Segment colors: OS/Apps = `--text-tertiary`, each model = unique from palette, Free = `--surface-muted`
- Label each model segment with name + size (if wide enough, else tooltip)
- Place above models list in the left column
- **Note:** On Apple Silicon, unified memory means all models compete with OS + apps. The bar should show this clearly.
- **N5:** Extend the `show` action handler in `/api/ollama/route.ts` — it already returns the full response. The client needs to extract `model_info.*.context_length` from the response. Currently `modelfileData[model.name]` only stores the modelfile text. Add a separate `modelMetadata` state to cache the full show response. Display context window as badge on card: `128k ctx`, `32k ctx`, `4k ctx`.
- Fetch on expand (lazy) to avoid N+1 calls on page load
- Cache in `modelMetadata: Record<string, { contextLength?: number }>`
- **BN3:** Save `chatMessages` to `localStorage` keyed by `llm-chat-${modelName}`. Restore on prompt modal open if same model. Add a "Clear chat" button next to the chat mode toggle. Cap stored messages at 50.
- **BN4:** Add a `RefreshCw` icon button next to the "Show/Hide Ollama Logs" toggle. Clicking calls `fetchLogs()`.
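
For N5, extracting the context window from a cached show response might look like the sketch below. It assumes the `model_info` object uses architecture-prefixed keys such as `llama.context_length`, as the `model_info.*.context_length` pattern above implies. `extractContextLength` and `formatContext` are hypothetical names, and rounding to multiples of 1024 for the badge text is an assumption.

```ts
// Only the part of the show response this sketch needs.
type ShowResponse = { model_info?: Record<string, unknown> };

// Find the first "<arch>.context_length" entry, whatever the architecture prefix is.
function extractContextLength(show: ShowResponse): number | undefined {
  for (const [key, value] of Object.entries(show.model_info ?? {})) {
    if (key.endsWith(".context_length") && typeof value === "number") {
      return value;
    }
  }
  return undefined;
}

// Badge text such as "128k ctx"; small windows are shown unabbreviated.
function formatContext(tokens: number): string {
  return tokens >= 1024 ? `${Math.round(tokens / 1024)}k ctx` : `${tokens} ctx`;
}
```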
**Acceptance criteria:**
- RAM budget bar visually represents memory allocation
- Each model shows context window length (once fetched)
- Bar updates when models are loaded/unloaded
- [ ] RAM budget bar renders with labeled model segments
- [ ] Bar updates live when models are loaded/unloaded
- [ ] Context window shown for expanded models (fetched lazily)
- [ ] Chat history persists across modal close/re-open
- [ ] Logs panel has a refresh button
- [ ] TypeScript compiles cleanly
---
@@ -70,88 +132,215 @@ Transform the dashboard from a model management tool into a **model intelligence
**Goal:** Auto-detect model capabilities and surface warnings so users don't hit surprises.
**Estimated effort:** ~45 minutes
**Estimated effort:** ~60 minutes
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ----------------------- | ------ | -------- | -------------------------------------------------------------------- |
| 6 | N6 | `<think>` warning badge | [ ] | High | DeepSeek R1 models emit reasoning traces — warn about JSON stripping |
| 7 | N7 | Vision model indicator | [ ] | Medium | Multimodal models (llava, qwen2.5vl) need image input |
| 8 | N8 | Architecture badge | [ ] | Low | Show model arch as pill on card (currently in expanded only) |
| 9 | N9 | Sort/order models | [ ] | Medium | Dropdown: name, size, parameters, running, modified |
| 10 | N10 | Ollama version display | [ ] | Low | Show Ollama server version in status card |
| 11 | N6 | `<think>` warning badge | - [ ] | High | DeepSeek R1 models emit reasoning traces — warn about JSON stripping |
| 12 | N7 | Vision model indicator | - [ ] | Medium | Multimodal models need image input |
| 13 | N8 | Architecture badge | - [ ] | Low | Show model arch as pill on card (currently in expanded only) |
| 14 | N9 | Sort/order models | - [ ] | Medium | Dropdown: name, size, parameters, running, modified |
| 15 | N10 | Ollama version display | - [ ] | Low | Show Ollama server version in status card |
**Implementation details:**
- **N6:** Pattern match model name: `/deepseek-r1/i` or family containing `deepseek`. Show amber ⚠️ badge with tooltip: "Emits `<think>` traces before JSON output".
- **N7:** Pattern match: `/llava|qwen.*vl|minicpm-v/i`. Show 👁 badge with tooltip: "Vision model — supports image input".
- **N8:** Move `model.details.family` from expanded-only to always-visible as a subtle pill badge.
- **N9:** Add `modelSort` state (`'name' | 'size' | 'params' | 'running' | 'modified'`). Sort the filtered model list before `.map()`. Add dropdown above the model list (next to search bar).
- **N10:** New API call to Ollama `/api/version` in the GET handler. Return in OllamaData. Display in the Ollama stats card.
- **N6:** Create `getModelBadges(name: string, family?: string)` in `lib/format.ts`. Pattern match for `<think>` emitters:
- `/deepseek-r1/i` — covers `deepseek-r1:7b`, `deepseek-r1:32b`, etc.
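- `/deepseek-r1-distill/i`: distilled variants also emit `<think>` (any name containing `deepseek-r1-distill` is already matched by the first pattern, so this entry documents intent and needs no separate check)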
Show amber ⚠️ badge: "Emits `<think>` traces — strip before JSON.parse".
In the prompt modal, if the active model is a `<think>` model, show a dismissable tip: "This model emits reasoning traces. Use the response after the closing `</think>` tag."
- **N7:** Broader pattern match for vision/multimodal models:
```
/llava|bakllava|moondream|qwen.*vl|minicpm-v|llama.*vision/i
```
Show 👁 badge: "Vision model — supports image input".
**Future:** Add image upload to prompt modal when a vision model is selected (tracked as F24).
- **N8:** Move `model.details.family` from expanded-only to the subtitle line (next to size/params/quant). Style as a subtle pill: `font-mono text-[10px] px-1.5 rounded bg-surface-muted`.
- **N9:** Add `modelSort` state (`'name' | 'size' | 'params' | 'running' | 'modified'`). Apply sort to the filtered model list before `.map()`:
- `name`: alphabetical
- `size`: disk size descending
- `params`: parse `parameter_size` string ("32B" → 32) descending
- `running`: loaded models first, then alphabetical
- `modified`: most recently modified first
Add compact dropdown next to the search bar. Persist sort choice in localStorage (`llm-model-sort`).
- **N10:** Call Ollama `/api/version` in the GET handler (alongside `/api/tags` and `/api/ps`). Add `version?: string` to `OllamaData` interface. Display in the Ollama stats card: "v0.16.2" next to the Online badge.
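
A possible shape for the N6/N7 badge helper, using the two patterns listed above. This is a sketch under assumptions: the `ModelBadge` type and the choice to test the name and the family separately are illustrative, not taken from the codebase.

```ts
type ModelBadge = "think" | "vision";

// One pattern is enough for N6: names such as "deepseek-r1-distill-..." contain "deepseek-r1".
const THINK_PATTERN = /deepseek-r1/i;
const VISION_PATTERN = /llava|bakllava|moondream|qwen.*vl|minicpm-v|llama.*vision/i;

// Returns the badges to render for a model card, in display order.
function getModelBadges(name: string, family?: string): ModelBadge[] {
  const candidates = [name, family ?? ""];
  const badges: ModelBadge[] = [];
  if (candidates.some((text) => THINK_PATTERN.test(text))) badges.push("think");
  if (candidates.some((text) => VISION_PATTERN.test(text))) badges.push("vision");
  return badges;
}
```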
**Acceptance criteria:**
- DeepSeek R1 models show `<think>` warning badge
- Vision models show eye indicator
- Models can be sorted by any of the 5 criteria
- Ollama version shown in status card
- [ ] DeepSeek R1 (and distilled) models show `<think>` warning badge
- [ ] Vision models show eye indicator
- [ ] Family/architecture shown as pill on all model cards
- [ ] Models sortable by 5 criteria, sort persisted in localStorage
- [ ] Ollama version displayed in stats card
- [ ] TypeScript compiles cleanly
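
The five N9 sort orders could be implemented along these lines. The `SortableModel` shape is a simplified stand-in for the real `OllamaModel` interface, `parseParams` is a hypothetical helper, and converting an `M` suffix to fractional billions is an assumption that goes beyond the "32B" → 32 rule stated in the roadmap.

```ts
type ModelSort = "name" | "size" | "params" | "running" | "modified";

// Simplified stand-in for the dashboard's model type.
interface SortableModel {
  name: string;
  size: number; // bytes on disk
  parameterSize?: string; // e.g. "32B", "7.6B", "23M"
  modifiedAt: string; // ISO timestamp
}

// "32B" -> 32, "7.6B" -> 7.6, "23M" -> 0.023; unparseable values sort last.
function parseParams(value?: string): number {
  const match = /^([\d.]+)\s*([MB])?/i.exec(value ?? "");
  if (!match) return 0;
  const amount = parseFloat(match[1] ?? "");
  if (Number.isNaN(amount)) return 0;
  return match[2]?.toUpperCase() === "M" ? amount / 1000 : amount;
}

// Returns a sorted copy; the input array is left untouched.
function sortModels(
  models: SortableModel[],
  sort: ModelSort,
  running: ReadonlySet<string>,
): SortableModel[] {
  type Cmp = (a: SortableModel, b: SortableModel) => number;
  const byName: Cmp = (a, b) => a.name.localeCompare(b.name);
  const comparators: Record<ModelSort, Cmp> = {
    name: byName,
    size: (a, b) => b.size - a.size,
    params: (a, b) => parseParams(b.parameterSize) - parseParams(a.parameterSize),
    // Loaded models first, then alphabetical within each group.
    running: (a, b) =>
      Number(running.has(b.name)) - Number(running.has(a.name)) || byName(a, b),
    modified: (a, b) => Date.parse(b.modifiedAt) - Date.parse(a.modifiedAt),
  };
  return [...models].sort(comparators[sort]);
}
```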
---
## Phase 4 — Runtime Metrics & Polish _(Sprint 11)_
## Phase 4 — Runtime Metrics & UX Polish _(Sprint 11)_
**Goal:** Improve the experience for users who are actively using models.
**Estimated effort:** ~30 minutes
**Estimated effort:** ~45 minutes
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ----------------------------- | ------ | -------- | -------------------------------------------------------- |
| 11 | N11 | Last known tok/s per model | [ ] | Medium | Persist after prompt, show on card |
| 12 | N12 | Auto-unload countdown | [ ] | Medium | Live countdown instead of static expiry time |
| 13 | N13 | Session stats per model | [ ] | Low | Prompts sent + tokens generated in this session |
| 14 | N14 | Delete confirmation + reclaim | [ ] | High | "Delete X? Reclaim Y GB" dialog before deleting |
| 15 | N15 | Simultaneous load suggestions | [ ] | Low | Suggest which models fit together based on available RAM |
| 16 | N11 | Last known tok/s per model | - [ ] | Medium | Persist after prompt, show on card |
| 17 | N12 | Auto-unload countdown | - [ ] | Medium | Live countdown instead of static expiry time |
| 18 | N13 | Session stats per model | - [ ] | Low | Prompts sent + tokens generated in this session |
| 19 | N14 | Simultaneous load suggestions | - [ ] | Low | Suggest which models fit together based on available RAM |
**Implementation details:**
- **N11:** After prompt/chat completes, save `{ model: string, tokPerSec: number, timestamp: number }` to localStorage key `modelBenchmarks`. Show on card as faded text: `~45 tok/s`.
- **N12:** Replace `Expires: {time}` with a `useEffect` interval that computes `expires_at - Date.now()` and formats as `Xm Ys`. Update every second.
- **N13:** Track in component state: `Map<string, { prompts: number, tokens: number }>`. Increment on each prompt/chat completion. Display in expanded details.
- **N14:** Add `confirmDelete` state. When delete is clicked, show inline confirmation with model name and `formatBytes(model.size)` reclaim amount. Second click executes.
- **N15:** After computing estimated RAM for all unloaded models, filter those that fit in remaining free memory. Show as suggestions in the models panel footer.
- **N11:** After prompt/chat completes with `streamMetrics`, save to localStorage key `llm-model-benchmarks`:
```ts
Record<string, { tokPerSec: number; totalTokens: number; timestamp: number }>;
```
Show on card as faded text: `~45 tok/s` (most recent value). Use `llm-` prefix for consistency with other localStorage keys.
Restore on page load. Show "N/A" for models never benchmarked.
- **N12:** Replace static `Expires: {time}` (line 1034-1038) with a live countdown. Add a `useEffect` with 1-second interval that computes `expires_at - Date.now()` and formats:
- `> 5m` → "Unloads in Xm"
- `1-5m` → "Unloads in Xm Ys" (yellow)
- `< 1m` → "Unloading soon" (red pulse)
Clear interval on unmount or when model is unloaded.
- **N13:** Track in component state: `Map<string, { prompts: number, tokens: number }>`. Increment on each prompt/chat completion. Display in expanded details: "Session: 5 prompts · 2,340 tokens". Resets on page refresh (intentional — session-scoped).
- **N14 (was N15):** After computing estimated RAM (from N1) for all unloaded models, filter those that fit in remaining free memory (accounting for already-loaded models). Show as a suggestion strip below the models list: "Can also load: llama3.1:8b (~6 GB), qwen2.5:7b (~6 GB)".
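
The three N12 countdown bands could be computed by a pure function that the 1-second `useEffect` interval calls on each tick. This is a sketch: `formatUnloadCountdown` and the `tone` field are hypothetical, timestamps are assumed to be in milliseconds, and exactly 5 minutes is assumed to fall in the yellow band.

```ts
type Countdown = { label: string; tone: "normal" | "warning" | "critical" };

// expiresAtMs would come from Date.parse(expires_at); nowMs from Date.now().
function formatUnloadCountdown(expiresAtMs: number, nowMs: number): Countdown {
  const totalSeconds = Math.floor(Math.max(0, expiresAtMs - nowMs) / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;

  if (totalSeconds < 60) {
    return { label: "Unloading soon", tone: "critical" }; // red pulse
  }
  if (totalSeconds <= 5 * 60) {
    return { label: `Unloads in ${minutes}m ${seconds}s`, tone: "warning" }; // yellow
  }
  return { label: `Unloads in ${minutes}m`, tone: "normal" };
}
```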
**Acceptance criteria:**
- Previously benchmarked models show tok/s on card
- Running models show live countdown to unload
- Delete requires explicit confirmation showing disk reclaim
- TypeScript compiles cleanly after all phases
- [ ] Previously benchmarked models show tok/s on card
- [ ] Running models show live countdown to auto-unload
- [ ] Session stats shown in expanded model details
- [ ] Suggestions shown for co-loadable models
- [ ] TypeScript compiles cleanly
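
The N14 suggestion strip could be derived as sketched below. Assumptions: each candidate already carries its N1 estimate in bytes, `availableBytes` already accounts for loaded models, each model is checked individually rather than as a combination, and `approxGb` is a hypothetical stand-in for the dashboard's own byte formatter.

```ts
interface Candidate {
  name: string;
  estimatedRam: number; // bytes, from the N1 estimate
}

const GIB = 1024 ** 3;

// Unloaded models that would individually fit in the remaining memory.
function suggestCoLoadable(unloaded: Candidate[], availableBytes: number): Candidate[] {
  return unloaded.filter((model) => model.estimatedRam <= availableBytes);
}

// Rough size label, e.g. "~6 GB".
function approxGb(bytes: number): string {
  return `~${Math.round(bytes / GIB)} GB`;
}

// Text for the strip below the models list; empty when nothing fits.
function suggestionStrip(candidates: Candidate[]): string {
  if (candidates.length === 0) return "";
  const parts = candidates.map((m) => `${m.name} (${approxGb(m.estimatedRam)})`);
  return `Can also load: ${parts.join(", ")}`;
}
```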
---
## Phase 5 — Future Considerations _(Backlog)_
## Phase 5 — Response Quality & Interaction _(Sprint 12)_
Not planned for immediate implementation. Revisit after Phases 1-4.
**Goal:** Improve the prompt/response experience with better rendering and model-specific interactions.
| ID | Feature | Complexity | Notes |
| --- | ------------------------------ | ---------- | ----------------------------------------------------------------- |
| F17 | WebSocket real-time updates | High | Replace 15s polling with push-based updates from Ollama |
| F18 | GPU/Metal utilization chart | Medium | macOS `powermetrics` or IOKit for GPU load percentage |
| F19 | Model download queue | Medium | Queue multiple pulls, show progress for each |
| F20 | Inference history log | Medium | Persist all prompts/responses to localStorage or file |
| F21 | Custom Modelfile editor | Medium | Edit and push custom Modelfiles to Ollama |
| F22 | Benchmark suite | High | Run standard prompts across all models, generate comparison table |
| F23 | Component decomposition (CQ1b) | High | Break 1,885-line page.tsx into feature-based modules |
**Estimated effort:** ~90 minutes
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ------------------------------- | ------ | -------- | ----------------------------------------------------------- |
| 20 | F24 | Image upload for vision models | - [ ] | High | Add image input when vision model is active |
| 21 | F25 | Markdown rendering in responses | - [ ] | High | Render markdown (headers, lists, bold, code) instead of raw |
| 22 | F26 | Code syntax highlighting | - [ ] | Medium | Highlight code blocks in responses with language detection |
| 23 | F27 | `<think>` block auto-collapse | - [ ] | Medium | Detect and collapse reasoning traces, show "Show reasoning" |
| 24 | F28 | Ollama model library link | - [ ] | Low | Link to ollama.com/library in the pull input placeholder |
**Implementation details:**
- **F24:** When a vision model is selected (detected by N7 badge logic), show an image upload area in the prompt modal. Convert to base64 and send in the Ollama API `images` field. Support drag-and-drop and file picker. Show thumbnail preview.
- **F25:** Use `react-markdown` (lightweight) to render model responses. Install as dependency. Wrap response `<pre>` blocks with `<ReactMarkdown>`. Preserves streaming (markdown renders incrementally).
- **F26:** Use `react-syntax-highlighter` with a dark theme (e.g., `oneDark`) for fenced code blocks inside markdown. Lazy-load to avoid bundle bloat.
- **F27:** For `<think>` models (detected by N6 logic), parse the response to find `<think>...</think>` blocks. Wrap in a collapsible `<details>` element: `<summary>Show reasoning (X tokens)</summary>`. Display the actual answer below the collapsed reasoning.
- **F28:** Keep the pull input placeholder (`"Pull a model... e.g. deepseek-r1:32b"`) as plain text, since a placeholder cannot contain a link, and add a small "Browse models" link below the input that opens `https://ollama.com/library` in a new tab.
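
For F27, separating the reasoning trace from the answer might be done with a small parser like this one. It is a sketch, not the planned implementation: `splitThinkBlock` is a hypothetical name, it handles only the first `<think>` block, and the `reasoningOpen` flag (for a stream whose closing tag has not arrived yet) is an added assumption.

```ts
interface ParsedResponse {
  reasoning: string | null; // null when the response has no <think> block
  answer: string;
  reasoningOpen: boolean; // true while the closing tag has not streamed in yet
}

const OPEN_TAG = "<think>";
const CLOSE_TAG = "</think>";

function splitThinkBlock(text: string): ParsedResponse {
  const open = text.indexOf(OPEN_TAG);
  if (open === -1) {
    return { reasoning: null, answer: text, reasoningOpen: false };
  }
  const bodyStart = open + OPEN_TAG.length;
  const close = text.indexOf(CLOSE_TAG, bodyStart);
  if (close === -1) {
    // Mid-stream: everything after the opening tag is still reasoning.
    return {
      reasoning: text.slice(bodyStart).trim(),
      answer: text.slice(0, open).trim(),
      reasoningOpen: true,
    };
  }
  return {
    reasoning: text.slice(bodyStart, close).trim(),
    answer: (text.slice(0, open) + text.slice(close + CLOSE_TAG.length)).trim(),
    reasoningOpen: false,
  };
}
```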
**Acceptance criteria:**
- [ ] Vision models show image upload area in prompt modal
- [ ] Responses render markdown (headers, lists, code blocks, bold/italic)
- [ ] Code blocks have syntax highlighting with language labels
- [ ] `<think>` traces are auto-collapsed with "Show reasoning" toggle
- [ ] Model library link available near pull input
---
## Phase 6 — Data Persistence & Export _(Sprint 13)_
**Goal:** Protect user data and enable portability.
**Estimated effort:** ~45 minutes
| # | ID | Task | Status | Priority | Notes |
| --- | --- | ------------------------------ | ------ | -------- | --------------------------------------------------------- |
| 25 | F29 | Export/import settings | - [ ] | Medium | Backup all localStorage data as JSON file |
| 26 | F30 | Inference history log | - [ ] | Medium | Persist prompt/response pairs to localStorage with search |
| 27 | F31 | Clear all data / factory reset | - [ ] | Low | Single button to clear all localStorage keys + state |
**Implementation details:**
- **F29:** Add a gear icon in the header that opens a settings popover. Include "Export settings" (downloads `llm-dashboard-settings.json` containing all `llm-*` localStorage keys) and "Import settings" (file upload, validates shape, merges). Keys to export: `llm-theme`, `llm-model-tags`, `llm-auto-load-model`, `llm-prompt-history`, `llm-model-benchmarks`, `llm-model-sort`, `llm-chat-*`.
- **F30:** After each prompt/chat completion, save `{ model, prompt, response, metrics, timestamp }` to `llm-inference-log` (capped at 100 entries, FIFO). Add a small "History" panel accessible from the header with search and replay.
- **F31:** Add "Reset dashboard" button in settings. Clears all `llm-*` localStorage keys. Confirm with "This will clear all tags, history, and preferences".
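
The F29 to F31 storage operations could share a few helpers, sketched here against a minimal store interface so they can run outside the browser (`window.localStorage` satisfies the same shape). All function names are hypothetical, and import validation is left out.

```ts
// Minimal subset of the Web Storage API used by these helpers.
interface KeyValueStore {
  readonly length: number;
  key(index: number): string | null;
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const PREFIX = "llm-";
const LOG_KEY = "llm-inference-log";
const LOG_CAP = 100;

// Snapshot the key names first: removing items while indexing would shift positions.
function dashboardKeys(store: KeyValueStore): string[] {
  const keys: string[] = [];
  for (let i = 0; i < store.length; i++) {
    const key = store.key(i);
    if (key !== null && key.startsWith(PREFIX)) keys.push(key);
  }
  return keys;
}

// F29: payload for the exported JSON file.
function exportSettings(store: KeyValueStore): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of dashboardKeys(store)) {
    const value = store.getItem(key);
    if (value !== null) out[key] = value;
  }
  return out;
}

// F31: remove every llm-* key; returns how many were removed.
function resetDashboard(store: KeyValueStore): number {
  const keys = dashboardKeys(store);
  keys.forEach((key) => store.removeItem(key));
  return keys.length;
}

interface InferenceEntry {
  model: string;
  prompt: string;
  response: string;
  metrics?: unknown;
  timestamp: number;
}

// F30: append an entry and keep only the newest LOG_CAP entries (FIFO).
function appendInferenceLog(store: KeyValueStore, entry: InferenceEntry): void {
  let log: InferenceEntry[] = [];
  try {
    const parsed: unknown = JSON.parse(store.getItem(LOG_KEY) ?? "[]");
    if (Array.isArray(parsed)) log = parsed as InferenceEntry[];
  } catch {
    // Corrupt log data: start over with an empty log.
  }
  log.push(entry);
  store.setItem(LOG_KEY, JSON.stringify(log.slice(-LOG_CAP)));
}
```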
**Acceptance criteria:**
- [ ] Settings can be exported as JSON and re-imported
- [ ] Inference history persists across page refreshes
- [ ] Factory reset clears all dashboard state
- [ ] TypeScript compiles cleanly
---
## Phase 7 — Future Considerations _(Backlog)_
Not planned for immediate implementation. Revisit after Phases 1-6.
| ID | Feature | Complexity | Notes |
| --- | ------------------------------ | ---------- | -------------------------------------------------------------------------- |
| F17 | WebSocket real-time updates | High | Replace 15s polling with push-based updates from Ollama |
| F18 | GPU/Metal utilization chart | Medium | macOS `powermetrics` or IOKit for GPU load percentage |
| F19 | Model download queue | Medium | Queue multiple pulls, show progress for each |
| F21 | Custom Modelfile editor | Medium | Edit and push custom Modelfiles to Ollama |
| F22 | Benchmark suite | High | Run standard prompts across all models, generate comparison table |
| F23 | Component decomposition (CQ1b) | High | Break 1,885-line page.tsx into feature-based modules |
| F32 | Accessibility / ARIA labels | Medium | Keyboard nav beyond shortcuts, screen reader support, focus management |
| F33 | API latency overlay | Low | Show request duration for each panel refresh (network timing) |
| F34 | Model disk cleanup wizard | Medium | Identify unused models (never loaded, old modified_at), suggest deletion |
| F35 | Drag-and-drop model ordering | Low | Manual model order with drag handles, persist in localStorage |
| F36 | Prompt templates library | Medium | Pre-built prompts for common tasks (code review, explain, translate, etc.) |
| F37 | Response diff viewer | Medium | Side-by-side diff of two model responses highlighting differences |
| F38 | Multi-model chat | High | Chat with multiple models simultaneously, responses interleaved |
---
## Summary
| Phase | Sprint | Items | Focus | Effort | Depends On |
| --------- | ------ | ------------ | ------------------------ | ---------- | ---------- |
| 1 | 8 | N1, N2, N3 | Pre-load RAM estimates | ~45 min | — |
| 2 | 9 | N4, N5 | RAM bar + context window | ~60 min | Phase 1 |
| 3 | 10 | N6-N10 | Badges + sort + version | ~45 min | — |
| 4 | 11 | N11-N15 | Runtime metrics + UX | ~30 min | — |
| **Total** | | **15 items** | | **~3 hrs** | |
| Phase | Sprint | Items | Focus | Effort | Depends On |
| --------- | ------ | ------------------- | ------------------------------ | ---------- | ---------- |
| 1 | 8 | N1-N3, BN1-BN2, BN5 | Pre-load intelligence + fixes | ~60 min | — |
| 2 | 9 | N4-N5, BN3-BN4 | Rich metadata + persistence | ~75 min | Phase 1 |
| 3 | 10 | N6-N10 | Intelligence badges + sort | ~60 min | — |
| 4 | 11 | N11-N14 | Runtime metrics + UX | ~45 min | Phase 1 |
| 5 | 12 | F24-F28 | Response quality + interaction | ~90 min | Phase 3 |
| 6 | 13 | F29-F31 | Data persistence + export | ~45 min | — |
| **Total** | | **31 items** | | **~6 hrs** | |
Phases 1 and 3 can run in parallel. Phase 2 depends on Phase 1 (needs RAM estimates for the budget bar). Phase 4 is independent.
**Dependency graph:**
- Phase 1 → Phase 2 (N4 budget bar needs N1 RAM estimates)
- Phase 1 → Phase 4 (N14 co-load suggestions need N1 RAM estimates)
- Phase 3 → Phase 5 (F24 vision upload needs N7 detection, F27 `<think>` collapse needs N6)
- Phases 1, 3, 6 can run in parallel
**localStorage key registry** (standardized `llm-` prefix):
| Key | Type | Phase | Purpose |
| ---------------------- | ------ | ----- | ------------------------- |
| `llm-theme` | string | Done | Dark/light theme |
| `llm-model-tags` | JSON | Done | User tags per model |
| `llm-auto-load-model` | string | Done | Preferred auto-load model |
| `llm-prompt-history` | JSON | Done | Last 20 prompts |
| `llm-model-benchmarks` | JSON | P4 | Tok/s per model |
| `llm-model-sort` | string | P3 | Sort preference |
| `llm-chat-{model}` | JSON | P2 | Chat messages per model |
| `llm-inference-log` | JSON | P6 | Full inference history |
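
As an illustration of how the `llm-chat-{model}` entry might be read and written for BN3, here is a minimal sketch with the 50-message cap. `ChatStore`, `saveChat`, `loadChat`, and `clearChat` are hypothetical names, and the `ChatMessage` shape is assumed rather than taken from `page.tsx`.

```ts
// The three Web Storage methods this sketch relies on.
interface ChatStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

const MAX_STORED_MESSAGES = 50;
const chatKey = (modelName: string): string => `llm-chat-${modelName}`;

// Persist only the most recent messages so one long chat cannot fill storage.
function saveChat(store: ChatStore, modelName: string, messages: ChatMessage[]): void {
  const recent = messages.slice(-MAX_STORED_MESSAGES);
  store.setItem(chatKey(modelName), JSON.stringify(recent));
}

// Restore on modal open; missing or corrupt data yields an empty chat.
function loadChat(store: ChatStore, modelName: string): ChatMessage[] {
  try {
    const parsed: unknown = JSON.parse(store.getItem(chatKey(modelName)) ?? "[]");
    return Array.isArray(parsed) ? (parsed as ChatMessage[]) : [];
  } catch {
    return [];
  }
}

// Backs the "Clear chat" button.
function clearChat(store: ChatStore, modelName: string): void {
  store.removeItem(chatKey(modelName));
}
```

In the browser these would be called with `window.localStorage` as the store: `saveChat` whenever `chatMessages` changes and `loadChat` when the prompt modal opens for a model.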