docs(local-llm): comprehensive roadmap review — 5 bugs, 6 phases, 31 items

Systematic code review of DASHBOARD_ROADMAP.md against actual codebase:

Bugs found (BN1-BN5):
- BN1: Compare buttons show unloaded models (can't generate)
- BN2: No AbortController on compare stream (leaks on close)
- BN3: Chat messages lost on modal close (no persistence)
- BN4: Logs panel has no refresh button
- BN5: Delete dialog missing reclaim size (partial impl exists)

Expanded from 4 phases to 6 + backlog (15 → 31 items):
- Phase 1: Pre-load intelligence + bug fixes (N1-N3, BN1-BN2, BN5)
- Phase 2: Rich metadata + persistence (N4-N5, BN3-BN4)
- Phase 3: Model intelligence badges + sort (N6-N10)
- Phase 4: Runtime metrics + UX polish (N11-N14)
- Phase 5 (NEW): Response quality (markdown, syntax highlight, vision upload, think-block collapse, model library link)
- Phase 6 (NEW): Data persistence (export/import, inference log, factory reset)
- Phase 7: Expanded backlog (F17-F38, +6 new ideas)

Improvements:
- Added checkboxes for all tasks and acceptance criteria
- Quant-aware RAM estimate multipliers (Q4/Q5/Q8/F16)
- Broader vision model regex (bakllava, moondream, llama-vision)
- DeepSeek R1 distill variant detection for think badge
- Conservative memory availability formula (free + cached*0.5)
- localStorage key registry with llm- prefix standardization
- Dependency graph between phases
- ~6 hrs total estimated effort
2026-02-19 23:02:25 -08:00 · 2026-02-19 23:02:25 -08:00 · ae231d5aac
commit ae231d5aac
parent cd6e561f1b
1 changed files with 262 additions and 73 deletions
--- a/__LOCAL_LLMs/dashboard/docs/DASHBOARD_ROADMAP.md
+++ b/__LOCAL_LLMs/dashboard/docs/DASHBOARD_ROADMAP.md
@@ -15,30 +15,76 @@ Transform the dashboard from a model management tool into a **model intelligence

 ---

+## Bugs Found During Review
+
+Issues discovered by cross-referencing the roadmap against the actual codebase (`page.tsx` ~1,885 lines):
+
+- [ ] **BN1. Compare buttons show unloaded models** — `page.tsx:1809`
+      `ollama.models.filter(m => m.name !== promptModel)` shows ALL installed models for comparison, but unloaded models can't generate responses. Should filter to `ollama.running` only, or show a "load first" indicator.
+
+- [ ] **BN2. No AbortController on compare stream** — `page.tsx:236-274`
+      `handleCompare` fetches `/api/ollama/stream` but doesn't use an abort controller. Closing the prompt modal during a comparison doesn't cancel the stream; it keeps running in the background and wastes resources.
+
+- [ ] **BN3. Chat messages lost on modal close** — `page.tsx:1524-1530`
+      Closing the prompt modal clears `promptResponse` and `promptText` but does not persist `chatMessages`. Re-opening the modal starts a fresh conversation. Multi-turn history is discarded.
+
+- [ ] **BN4. Logs panel has no refresh** — `page.tsx:1476-1517`
+      The Ollama logs panel fetches once on open (`fetchLogs` on toggle). There's no refresh button — the only way to see new logs is to close and re-open the panel.
+
+- [ ] **BN5. Delete confirmation doesn't show reclaim size** — `page.tsx:1121-1153`
+      Delete confirmation exists (two-step flow via `deleteConfirm` state) but only shows "Delete this model?" without the disk reclaim amount. N14 in the roadmap was marked as new, but the dialog already exists — it just needs `formatBytes(model.size)` added.
+
+---
+
 ## Phase 1 — Pre-Load Intelligence _(Sprint 8)_

 **Goal:** Give users the information they need **before** clicking "Load" on a model.

-**Estimated effort:** ~45 minutes
+**Estimated effort:** ~60 minutes

-| #   | ID  | Task                       | Status | Priority | Notes                                                       |
-| --- | --- | -------------------------- | ------ | -------- | ----------------------------------------------------------- |
-| 1   | N1  | Estimated RAM per model    | [ ]    | High     | Q4_K_M ≈ 1.2× disk size. Show `~22 GB RAM` on every card.   |
-| 2   | N2  | "Will it fit?" indicator   | [ ]    | High     | 🟢 Fits / 🟡 Tight / 🔴 Won't fit based on free+cached RAM. |
-| 3   | N3  | Aggregate loaded model RAM | [ ]    | High     | Sum VRAM at top of panel: "2 loaded · 28.5 GB VRAM".        |
+| #   | ID  | Task                               | Status | Priority | Notes                                                        |
+| --- | --- | ---------------------------------- | ------ | -------- | ------------------------------------------------------------ |
+| 1   | N1  | Estimated RAM per model            | - [ ]  | High     | Show `~22 GB RAM` on every model card, not just running ones |
+| 2   | N2  | "Will it fit?" indicator           | - [ ]  | High     | 🟢 Fits / 🟡 Tight / 🔴 Won't fit, on Load button            |
+| 3   | N3  | Aggregate loaded model RAM         | - [ ]  | High     | Sum VRAM at top of panel: "2 loaded · 28.5 GB VRAM"          |
+| 4   | BN1 | Fix compare to show loaded only    | - [ ]  | High     | Filter compare buttons to `ollama.running` models            |
+| 5   | BN2 | Add AbortController to compare     | - [ ]  | High     | Cancel compare stream on modal close                         |
+| 6   | BN5 | Show reclaim size in delete dialog | - [ ]  | Medium   | Add `formatBytes(model.size)` to existing confirmation       |

 **Implementation details:**

- **N1:** Add `estimateRam(diskSize: number)` to `lib/format.ts`. Returns `diskSize * 1.2`. Display below existing size/params/quant line on each model card.
- **N2:** Compare `estimateRam(model.size)` against `system.memory.free + system.memory.cached`. Pass `system` into the model list rendering. Add colored dot or badge next to Load button.
- **N3:** Compute `ollama.running.reduce((sum, r) => sum + r.size_vram, 0)` and display in the models panel header next to "X active".
+- **N1:** Add `estimateRam(diskSize: number, quant?: string)` to `lib/format.ts`. Use quantization-aware multipliers:
+  - `Q4_K_M` / `Q4_K_S` / `Q4_0`: 1.2×
+  - `Q5_K_M` / `Q5_K_S`: 1.25×
+  - `Q8_0`: 1.1×
+  - `F16` / `F32`: 1.05×
+  - Default (unknown): 1.2×
+    Display below existing size/params/quant line on each model card.
+    **Note:** Apple Silicon uses unified memory — GPU and CPU share the same pool. Add a tooltip explaining this.
+
+- **N2:** Compare `estimateRam(model.size, model.details?.quantization_level)` against effective available memory. Cached memory is reclaimable by the OS but not guaranteed to be available, so instead of `system.memory.free + system.memory.cached` use a conservative estimate: `free + (cached * 0.5)`.
+  - 🟢 Green: estimated < 70% of available → "Fits comfortably"
+  - 🟡 Yellow: estimated is 70–100% of available → "Tight — may swap"
+  - 🔴 Red: estimated > available → "Won't fit — will swap heavily"
+    Add as colored dot next to Load button with tooltip.
+
+- **N3:** Compute `ollama.running.reduce((sum, r) => sum + r.size_vram, 0)` and display in the models panel header: "X active · Y GB VRAM".
+
+- **BN1:** Change `ollama.models.filter(m => m.name !== promptModel)` to `ollama.running.filter(r => r.name !== promptModel)` for the compare buttons. If no other models are loaded, show "Load another model to compare".
+
+- **BN2:** Create a `compareAbortRef` similar to `abortRef` and pass its `signal` to the `/api/ollama/stream` fetch in `handleCompare`. Call `compareAbortRef.current?.abort()` when the modal closes.
+
+- **BN5:** Change "Delete this model?" to `"Delete ${model.name}? Reclaim ${formatBytes(model.size)}"`.
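
The N1 and N2 rules above can be sketched as three small pure functions. This is a minimal illustration, not the dashboard's actual code: `effectiveAvailable`, `fitStatus`, and the `FitStatus` type are hypothetical names, and treating exactly 70% as "tight" is an assumption about the boundary.

```ts
// Hypothetical sketch of the N1/N2 logic; only estimateRam is named in the roadmap.
type FitStatus = "fits" | "tight" | "wont-fit";

// Multipliers copied from the N1 list; patterns match the whole quantization label.
const QUANT_MULTIPLIERS: Array<[RegExp, number]> = [
  [/^Q4_(K_M|K_S|0)$/i, 1.2],
  [/^Q5_K_(M|S)$/i, 1.25],
  [/^Q8_0$/i, 1.1],
  [/^F(16|32)$/i, 1.05],
];
const DEFAULT_MULTIPLIER = 1.2; // used when the quantization is unknown

function estimateRam(diskSize: number, quant?: string): number {
  const entry = quant
    ? QUANT_MULTIPLIERS.find(([pattern]) => pattern.test(quant))
    : undefined;
  return diskSize * (entry ? entry[1] : DEFAULT_MULTIPLIER);
}

// Conservative availability: count only half of the cached memory.
function effectiveAvailable(free: number, cached: number): number {
  return free + cached * 0.5;
}

function fitStatus(estimated: number, available: number): FitStatus {
  if (estimated > available) return "wont-fit"; // red
  if (estimated >= available * 0.7) return "tight"; // yellow (boundary is an assumption)
  return "fits"; // green
}
```

A caller would combine them as `fitStatus(estimateRam(model.size, quant), effectiveAvailable(free, cached))`, with all sizes in the same unit (bytes).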

 **Acceptance criteria:**

- Every model card shows estimated RAM requirement
- Load button has a color-coded fit indicator
- Panel header shows total VRAM of loaded models
- TypeScript compiles cleanly
+- [ ] Every model card shows estimated RAM requirement with quant-aware multiplier
+- [ ] Load button has a color-coded fit indicator with tooltip
+- [ ] Panel header shows total VRAM of loaded models
+- [ ] Compare buttons only show loaded models
+- [ ] Compare stream is abortable
+- [ ] Delete dialog shows disk reclaim amount
+- [ ] TypeScript compiles cleanly
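
For the "compare stream is abortable" criterion, one possible shape is a tiny holder that owns at most one in-flight request. The roadmap itself proposes a React ref (`compareAbortRef`); the `CompareStream` class below is a hypothetical stand-in used only to show the abort lifecycle outside of React.

```ts
// Hypothetical helper: tracks a single in-flight compare request.
class CompareStream {
  private controller: AbortController | null = null;

  // Abort any previous request and return a fresh signal for the next fetch().
  begin(): AbortSignal {
    this.cancel();
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Intended to be called from the modal's close handler.
  cancel(): void {
    this.controller?.abort();
    this.controller = null;
  }
}
```

In `handleCompare`, the signal would be passed along with the existing request options, for example `fetch("/api/ollama/stream", { ...options, signal: stream.begin() })`, where `options` stands for whatever the handler already sends.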

 ---

@@ -46,23 +92,39 @@ Transform the dashboard from a model management tool into a **model intelligence

 **Goal:** Surface critical model metadata that's currently hidden behind the Ollama API.

-**Estimated effort:** ~60 minutes
+**Estimated effort:** ~75 minutes

-| #   | ID  | Task                | Status | Priority | Notes                                                                        |
-| --- | --- | ------------------- | ------ | -------- | ---------------------------------------------------------------------------- |
-| 4   | N4  | RAM budget bar      | [ ]    | Medium   | Stacked horizontal bar: OS+Apps / Loaded models (by name) / Free.            |
-| 5   | N5  | Context window size | [ ]    | High     | Fetch `context_length` from `/api/show` model_info. Show `128k ctx` on card. |
+| #   | ID  | Task                      | Status | Priority | Notes                                              |
+| --- | --- | ------------------------- | ------ | -------- | -------------------------------------------------- |
+| 7   | N4  | RAM budget bar            | - [ ]  | Medium   | Stacked bar: OS+Apps / Models (by name) / Free     |
+| 8   | N5  | Context window size       | - [ ]  | High     | Fetch `context_length` from `/api/show` model_info |
+| 9   | BN3 | Persist chat messages     | - [ ]  | Medium   | Save to localStorage, restore on modal re-open     |
+| 10  | BN4 | Logs panel refresh button | - [ ]  | Low      | Add refresh icon next to "Show/Hide Ollama Logs"   |

 **Implementation details:**

- **N4:** New `RamBudgetBar` component. Inputs: `totalRam`, `appMemory`, `runningModels[]` (each with name + size_vram), `freeRam`. Renders as a CSS flex bar with labeled segments. Place above the models list.
- **N5:** Extend the `OllamaModel` interface to include optional `context_length?: number`. On expand (or eagerly for all models), call `/api/ollama` POST with `action: 'show'` and extract `model_info.*.context_length`. Cache in component state. Show as badge on card.
+- **N4:** New `RamBudgetBar` component in `components/`. Inputs: `totalRam`, `appMemory`, `runningModels[]` (each with name + size_vram), `freeRam`. Renders as a CSS flex bar with labeled segments using inline widths.
+  - Segment colors: OS/Apps = `--text-tertiary`, each model = unique from palette, Free = `--surface-muted`
+  - Label each model segment with name + size (if wide enough, else tooltip)
+  - Place above models list in the left column
+  - **Note:** On Apple Silicon, unified memory means all models compete with OS + apps. The bar should show this clearly.
+
+- **N5:** The `show` action handler in `/api/ollama/route.ts` already returns the full response, so the work is on the client: extract `model_info.*.context_length` from it. Currently `modelfileData[model.name]` only stores the modelfile text. Add a separate `modelMetadata` state to cache the extracted values. Display the context window as a badge on the card: `128k ctx`, `32k ctx`, `4k ctx`.
+  - Fetch on expand (lazy) to avoid N+1 calls on page load
+  - Cache in `modelMetadata: Record<string, { contextLength?: number }>`
+
+- **BN3:** Save `chatMessages` to `localStorage` keyed by `llm-chat-${modelName}`. Restore on prompt modal open if same model. Add a "Clear chat" button next to the chat mode toggle. Cap stored messages at 50.
+
+- **BN4:** Add a `RefreshCw` icon button next to the "Show/Hide Ollama Logs" toggle. Clicking calls `fetchLogs()`.
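
The N5 extraction can be sketched as follows. The `ShowResponse` type is an assumed, minimal view of the show response (only the `model_info` map), and `extractContextLength` and `formatContext` are hypothetical helper names; the key suffix `.context_length` follows the `model_info.*.context_length` pattern named above.

```ts
// Assumed minimal shape of the show response; only model_info is used here.
type ShowResponse = { model_info?: Record<string, unknown> };

// Find the first "<arch>.context_length" entry, whatever the architecture prefix is.
function extractContextLength(show: ShowResponse): number | undefined {
  for (const [key, value] of Object.entries(show.model_info ?? {})) {
    if (key.endsWith(".context_length") && typeof value === "number") {
      return value;
    }
  }
  return undefined;
}

// Format a token count as a compact badge label, e.g. 131072 -> "128k ctx".
function formatContext(tokens: number): string {
  return tokens >= 1024 ? `${Math.round(tokens / 1024)}k ctx` : `${tokens} ctx`;
}
```

The result would be stored under `modelMetadata[model.name].contextLength` when a card is expanded, so each model is fetched at most once.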

 **Acceptance criteria:**

- RAM budget bar visually represents memory allocation
- Each model shows context window length (once fetched)
- Bar updates when models are loaded/unloaded
+- [ ] RAM budget bar renders with labeled model segments
+- [ ] Bar updates live when models are loaded/unloaded
+- [ ] Context window shown for expanded models (fetched lazily)
+- [ ] Chat history persists across modal close/re-open
+- [ ] Logs panel has a refresh button
+- [ ] TypeScript compiles cleanly
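
Chat persistence (BN3) might look like the sketch below. The `ChatMessage` shape and the `KeyValueStore` interface are assumptions made so the example runs outside a browser; in the dashboard the store would be `window.localStorage`. Keeping the most recent 50 messages is one reading of "cap stored messages at 50".

```ts
// Minimal subset of the Web Storage API, so the sketch can be exercised without a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Assumed message shape; the real chatMessages type may differ.
type ChatMessage = { role: "user" | "assistant"; content: string };

const MAX_STORED_MESSAGES = 50;
const chatKey = (modelName: string): string => `llm-chat-${modelName}`;

// Persist only the most recent messages for this model.
function saveChat(store: KeyValueStore, modelName: string, messages: ChatMessage[]): void {
  const recent = messages.slice(-MAX_STORED_MESSAGES);
  store.setItem(chatKey(modelName), JSON.stringify(recent));
}

// Restore messages, falling back to an empty chat on missing or corrupt data.
function loadChat(store: KeyValueStore, modelName: string): ChatMessage[] {
  const raw = store.getItem(chatKey(modelName));
  if (!raw) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? (parsed as ChatMessage[]) : [];
  } catch {
    return [];
  }
}
```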

 ---

@@ -70,88 +132,215 @@ Transform the dashboard from a model management tool into a **model intelligence

 **Goal:** Auto-detect model capabilities and surface warnings so users don't hit surprises.

-**Estimated effort:** ~45 minutes
+**Estimated effort:** ~60 minutes

 | #   | ID  | Task                    | Status | Priority | Notes                                                                |
 | --- | --- | ----------------------- | ------ | -------- | -------------------------------------------------------------------- |
-| 6   | N6  | `<think>` warning badge | [ ]    | High     | DeepSeek R1 models emit reasoning traces — warn about JSON stripping |
-| 7   | N7  | Vision model indicator  | [ ]    | Medium   | Multimodal models (llava, qwen2.5vl) need image input                |
-| 8   | N8  | Architecture badge      | [ ]    | Low      | Show model arch as pill on card (currently in expanded only)         |
-| 9   | N9  | Sort/order models       | [ ]    | Medium   | Dropdown: name, size, parameters, running, modified                  |
-| 10  | N10 | Ollama version display  | [ ]    | Low      | Show Ollama server version in status card                            |
+| 11  | N6  | `<think>` warning badge | - [ ]  | High     | DeepSeek R1 models emit reasoning traces — warn about JSON stripping |
+| 12  | N7  | Vision model indicator  | - [ ]  | Medium   | Multimodal models need image input                                   |
+| 13  | N8  | Architecture badge      | - [ ]  | Low      | Show model arch as pill on card (currently in expanded only)         |
+| 14  | N9  | Sort/order models       | - [ ]  | Medium   | Dropdown: name, size, parameters, running, modified                  |
+| 15  | N10 | Ollama version display  | - [ ]  | Low      | Show Ollama server version in status card                            |

 **Implementation details:**

- **N6:** Pattern match model name: `/deepseek-r1/i` or family containing `deepseek`. Show amber ⚠️ badge with tooltip: "Emits `<think>` traces before JSON output".
- **N7:** Pattern match: `/llava|qwen.*vl|minicpm-v/i`. Show 👁 badge with tooltip: "Vision model — supports image input".
- **N8:** Move `model.details.family` from expanded-only to always-visible as a subtle pill badge.
- **N9:** Add `modelSort` state (`'name' | 'size' | 'params' | 'running' | 'modified'`). Sort the filtered model list before `.map()`. Add dropdown above the model list (next to search bar).
- **N10:** New API call to Ollama `/api/version` in the GET handler. Return in OllamaData. Display in the Ollama stats card.
+- **N6:** Create `getModelBadges(name: string, family?: string)` in `lib/format.ts`. Pattern match for `<think>` emitters:
+  - `/deepseek-r1/i`: covers `deepseek-r1:7b`, `deepseek-r1:32b`, etc.
+  - `/deepseek-r1-distill/i`: distilled variants also emit `<think>`; the first pattern already matches these names, so this one is optional
+    Show amber ⚠️ badge: "Emits `<think>` traces — strip before JSON.parse".
+    In the prompt modal, if the active model is a `<think>` model, show a dismissable tip: "This model emits reasoning traces. Use the response after the closing `</think>` tag."
+
+- **N7:** Broader pattern match for vision/multimodal models:
+
+  ```
+  /llava|bakllava|moondream|qwen.*vl|minicpm-v|llama.*vision/i
+  ```
+
+  Show 👁 badge: "Vision model — supports image input".
+  **Future:** Add image upload to prompt modal when a vision model is selected (tracked as F24).
+
+- **N8:** Move `model.details.family` from expanded-only to the subtitle line (next to size/params/quant). Style as a subtle pill: `font-mono text-[10px] px-1.5 rounded bg-surface-muted`.
+
+- **N9:** Add `modelSort` state (`'name' | 'size' | 'params' | 'running' | 'modified'`). Apply sort to the filtered model list before `.map()`:
+  - `name`: alphabetical
+  - `size`: disk size descending
+  - `params`: parse `parameter_size` string ("32B" → 32) descending
+  - `running`: loaded models first, then alphabetical
+  - `modified`: most recently modified first
+    Add compact dropdown next to the search bar. Persist sort choice in localStorage (`llm-model-sort`).
+
+- **N10:** Call Ollama `/api/version` in the GET handler (alongside `/api/tags` and `/api/ps`). Add `version?: string` to `OllamaData` interface. Display in the Ollama stats card: "v0.16.2" next to the Online badge.
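
A minimal sketch of `getModelBadges` using the N6 and N7 patterns above. The `ModelBadge` type is hypothetical, and testing the optional `family` against the same patterns is an assumption; the roadmap only fixes the function signature.

```ts
// Hypothetical badge identifiers; the UI maps them to the warning and eye icons.
type ModelBadge = "think" | "vision";

// Patterns taken from N6 and N7. /deepseek-r1/ also matches distilled variants.
const THINK_PATTERN = /deepseek-r1/i;
const VISION_PATTERN = /llava|bakllava|moondream|qwen.*vl|minicpm-v|llama.*vision/i;

function getModelBadges(modelName: string, family?: string): ModelBadge[] {
  // Assumption: a match on either the model name or the family counts.
  const candidates = [modelName, family ?? ""];
  const badges: ModelBadge[] = [];
  if (candidates.some((text) => THINK_PATTERN.test(text))) badges.push("think");
  if (candidates.some((text) => VISION_PATTERN.test(text))) badges.push("vision");
  return badges;
}
```

Because both regexes are created without the `g` flag, calling `test` repeatedly is stateless and safe to reuse across cards.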

 **Acceptance criteria:**

- DeepSeek R1 models show `<think>` warning badge
- Vision models show eye indicator
- Models can be sorted by any of the 5 criteria
- Ollama version shown in status card
+- [ ] DeepSeek R1 (and distilled) models show `<think>` warning badge
+- [ ] Vision models show eye indicator
+- [ ] Family/architecture shown as pill on all model cards
+- [ ] Models sortable by 5 criteria, sort persisted in localStorage
+- [ ] Ollama version displayed in stats card
+- [ ] TypeScript compiles cleanly
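
The five sort criteria from N9 could be expressed as a comparator table. `ModelRow` is an assumed subset of the model type, `parseParamSize` and `sortModels` are hypothetical names, and converting an "M" suffix to billions is an assumption that goes beyond the "32B" → 32 example.

```ts
type SortKey = "name" | "size" | "params" | "running" | "modified";

// Assumed subset of the model fields needed for sorting.
type ModelRow = {
  name: string;
  size: number;
  modified_at: string;
  details?: { parameter_size?: string };
};

// "32B" -> 32, "7.6B" -> 7.6; "567M" -> 0.567 (M handling is an assumption).
function parseParamSize(text?: string): number {
  const match = /^(\d+(?:\.\d+)?)\s*([BM])?/i.exec(text ?? "");
  if (!match) return 0;
  const value = parseFloat(match[1]);
  return match[2]?.toUpperCase() === "M" ? value / 1000 : value;
}

// Returns a sorted copy; `running` holds the names of currently loaded models.
function sortModels(models: ModelRow[], key: SortKey, running: Set<string>): ModelRow[] {
  const byName = (a: ModelRow, b: ModelRow): number => a.name.localeCompare(b.name);
  const comparators: Record<SortKey, (a: ModelRow, b: ModelRow) => number> = {
    name: byName,
    size: (a, b) => b.size - a.size,
    params: (a, b) =>
      parseParamSize(b.details?.parameter_size) - parseParamSize(a.details?.parameter_size),
    running: (a, b) =>
      Number(running.has(b.name)) - Number(running.has(a.name)) || byName(a, b),
    modified: (a, b) => Date.parse(b.modified_at) - Date.parse(a.modified_at),
  };
  return [...models].sort(comparators[key]);
}
```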

 ---

-## Phase 4 — Runtime Metrics & Polish _(Sprint 11)_
+## Phase 4 — Runtime Metrics & UX Polish _(Sprint 11)_

 **Goal:** Improve the experience for users who are actively using models.

-**Estimated effort:** ~30 minutes
+**Estimated effort:** ~45 minutes

 | #   | ID  | Task                          | Status | Priority | Notes                                                    |
 | --- | --- | ----------------------------- | ------ | -------- | -------------------------------------------------------- |
-| 11  | N11 | Last known tok/s per model    | [ ]    | Medium   | Persist after prompt, show on card                       |
-| 12  | N12 | Auto-unload countdown         | [ ]    | Medium   | Live countdown instead of static expiry time             |
-| 13  | N13 | Session stats per model       | [ ]    | Low      | Prompts sent + tokens generated in this session          |
-| 14  | N14 | Delete confirmation + reclaim | [ ]    | High     | "Delete X? Reclaim Y GB" dialog before deleting          |
-| 15  | N15 | Simultaneous load suggestions | [ ]    | Low      | Suggest which models fit together based on available RAM |
+| 16  | N11 | Last known tok/s per model    | - [ ]  | Medium   | Persist after prompt, show on card                       |
+| 17  | N12 | Auto-unload countdown         | - [ ]  | Medium   | Live countdown instead of static expiry time             |
+| 18  | N13 | Session stats per model       | - [ ]  | Low      | Prompts sent + tokens generated in this session          |
+| 19  | N14 | Simultaneous load suggestions | - [ ]  | Low      | Suggest which models fit together based on available RAM |

 **Implementation details:**

- **N11:** After prompt/chat completes, save `{ model: string, tokPerSec: number, timestamp: number }` to localStorage key `modelBenchmarks`. Show on card as faded text: `~45 tok/s`.
- **N12:** Replace `Expires: {time}` with a `useEffect` interval that computes `expires_at - Date.now()` and formats as `Xm Ys`. Update every second.
- **N13:** Track in component state: `Map<string, { prompts: number, tokens: number }>`. Increment on each prompt/chat completion. Display in expanded details.
- **N14:** Add `confirmDelete` state. When delete is clicked, show inline confirmation with model name and `formatBytes(model.size)` reclaim amount. Second click executes.
- **N15:** After computing estimated RAM for all unloaded models, filter those that fit in remaining free memory. Show as suggestions in the models panel footer.
+- **N11:** After prompt/chat completes with `streamMetrics`, save to localStorage key `llm-model-benchmarks`:
+
+  ```ts
+  Record<string, { tokPerSec: number; totalTokens: number; timestamp: number }>;
+  ```
+
+  Show on card as faded text: `~45 tok/s` (most recent value). Use `llm-` prefix for consistency with other localStorage keys.
+  Restore on page load. Show "N/A" for models never benchmarked.
+
+- **N12:** Replace static `Expires: {time}` (`page.tsx:1034-1038`) with a live countdown. Add a `useEffect` with a 1-second interval that computes `expires_at - Date.now()` and formats:
+  - `> 5m` → "Unloads in Xm"
+  - `1–5m` → "Unloads in Xm Ys" (yellow)
+  - `< 1m` → "Unloading soon" (red pulse)
+    Clear interval on unmount or when model is unloaded.
+
+- **N13:** Track in component state: `Map<string, { prompts: number, tokens: number }>`. Increment on each prompt/chat completion. Display in expanded details: "Session: 5 prompts · 2,340 tokens". Resets on page refresh (intentional — session-scoped).
+
+- **N14 (was N15):** After computing estimated RAM (from N1) for all unloaded models, filter those that fit in remaining free memory (accounting for already-loaded models). Show as a suggestion strip below the models list: "Can also load: llama3.1:8b (~6 GB), qwen2.5:7b (~6 GB)".
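
The N12 countdown tiers can be kept in a pure function that the 1-second interval calls. `formatCountdown` and the `Countdown` type are hypothetical; placing exactly 5 minutes in the yellow tier is an assumption about the boundary.

```ts
// Hypothetical result type: the label plus a tone the UI can map to a color.
type Countdown = { label: string; tone: "normal" | "warning" | "critical" };

// Both arguments are epoch milliseconds, e.g. Date.parse(expires_at) and Date.now().
function formatCountdown(expiresAtMs: number, nowMs: number): Countdown {
  const remainingSec = Math.max(0, Math.floor((expiresAtMs - nowMs) / 1000));
  const minutes = Math.floor(remainingSec / 60);
  const seconds = remainingSec % 60;

  if (remainingSec < 60) {
    return { label: "Unloading soon", tone: "critical" }; // red pulse
  }
  if (remainingSec <= 300) {
    return { label: `Unloads in ${minutes}m ${seconds}s`, tone: "warning" }; // yellow
  }
  return { label: `Unloads in ${minutes}m`, tone: "normal" };
}
```

Keeping the formatting separate from the `useEffect` means the interval only has to call `setState(formatCountdown(expiry, Date.now()))` and clear itself on unmount.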

 **Acceptance criteria:**

- Previously benchmarked models show tok/s on card
- Running models show live countdown to unload
- Delete requires explicit confirmation showing disk reclaim
- TypeScript compiles cleanly after all phases
+- [ ] Previously benchmarked models show tok/s on card
+- [ ] Running models show live countdown to auto-unload
+- [ ] Session stats shown in expanded model details
+- [ ] Suggestions shown for co-loadable models
+- [ ] TypeScript compiles cleanly

 ---

-## Phase 5 — Future Considerations _(Backlog)_
+## Phase 5 — Response Quality & Interaction _(Sprint 12)_

-Not planned for immediate implementation. Revisit after Phases 1–4.
+**Goal:** Improve the prompt/response experience with better rendering and model-specific interactions.

-| ID  | Feature                        | Complexity | Notes                                                             |
-| --- | ------------------------------ | ---------- | ----------------------------------------------------------------- |
-| F17 | WebSocket real-time updates    | High       | Replace 15s polling with push-based updates from Ollama           |
-| F18 | GPU/Metal utilization chart    | Medium     | macOS `powermetrics` or IOKit for GPU load percentage             |
-| F19 | Model download queue           | Medium     | Queue multiple pulls, show progress for each                      |
-| F20 | Inference history log          | Medium     | Persist all prompts/responses to localStorage or file             |
-| F21 | Custom Modelfile editor        | Medium     | Edit and push custom Modelfiles to Ollama                         |
-| F22 | Benchmark suite                | High       | Run standard prompts across all models, generate comparison table |
-| F23 | Component decomposition (CQ1b) | High       | Break 1,885-line page.tsx into feature-based modules              |
+**Estimated effort:** ~90 minutes
+
+| #   | ID  | Task                            | Status | Priority | Notes                                                       |
+| --- | --- | ------------------------------- | ------ | -------- | ----------------------------------------------------------- |
+| 20  | F24 | Image upload for vision models  | - [ ]  | High     | Add image input when vision model is active                 |
+| 21  | F25 | Markdown rendering in responses | - [ ]  | High     | Render markdown (headers, lists, bold, code) instead of raw |
+| 22  | F26 | Code syntax highlighting        | - [ ]  | Medium   | Highlight code blocks in responses with language detection  |
+| 23  | F27 | `<think>` block auto-collapse   | - [ ]  | Medium   | Detect and collapse reasoning traces, show "Show reasoning" |
+| 24  | F28 | Ollama model library link       | - [ ]  | Low      | Link to ollama.com/library in the pull input placeholder    |
+
+**Implementation details:**
+
+- **F24:** When a vision model is selected (detected by N7 badge logic), show an image upload area in the prompt modal. Convert to base64 and send in the Ollama API `images` field. Support drag-and-drop and file picker. Show thumbnail preview.
+
+- **F25:** Use `react-markdown` (lightweight) to render model responses. Install as dependency. Replace the response `<pre>` blocks with `<ReactMarkdown>`, passing the response text as its children. Streaming is preserved because the markdown re-renders as the text grows.
+
+- **F26:** Use `react-syntax-highlighter` with a dark theme (e.g., `oneDark`) for fenced code blocks inside markdown. Lazy-load to avoid bundle bloat.
+
+- **F27:** For `<think>` models (detected by N6 logic), parse the response to find `<think>...</think>` blocks. Wrap in a collapsible `<details>` element: `<summary>Show reasoning (X tokens)</summary>`. Display the actual answer below the collapsed reasoning.
+
+- **F28:** A placeholder cannot contain a link, so keep the pull input placeholder (`"Pull a model... e.g. deepseek-r1:32b"`) and add a small "Browse models" link below the input that opens `https://ollama.com/library` in a new tab.
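
The `<think>` parsing step in F27 can be sketched as a pure function that separates reasoning from the answer. `splitThinkBlock` and `ParsedResponse` are hypothetical names; handling an unclosed `<think>` tag is an assumption added so partially streamed responses do not leak raw reasoning into the answer area.

```ts
type ParsedResponse = { reasoning: string | null; answer: string };

const THINK_OPEN = "<think>";

function splitThinkBlock(response: string): ParsedResponse {
  const match = /<think>([\s\S]*?)<\/think>/.exec(response);
  if (match) {
    // Remove the whole block and keep everything around it as the answer.
    const before = response.slice(0, match.index);
    const after = response.slice(match.index + match[0].length);
    return { reasoning: match[1].trim(), answer: (before + after).trim() };
  }

  // Assumption: while streaming, the closing tag may not have arrived yet.
  const open = response.indexOf(THINK_OPEN);
  if (open === -1) return { reasoning: null, answer: response };
  return {
    reasoning: response.slice(open + THINK_OPEN.length).trim(),
    answer: response.slice(0, open).trim(),
  };
}
```

The `reasoning` string would go inside the collapsible `<details>` element and `answer` below it; only the first `<think>` block is extracted in this sketch.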
+
+**Acceptance criteria:**
+
+- [ ] Vision models show image upload area in prompt modal
+- [ ] Responses render markdown (headers, lists, code blocks, bold/italic)
+- [ ] Code blocks have syntax highlighting with language labels
+- [ ] `<think>` traces are auto-collapsed with "Show reasoning" toggle
+- [ ] Model library link available near pull input
+
+---
+
+## Phase 6 — Data Persistence & Export _(Sprint 13)_
+
+**Goal:** Protect user data and enable portability.
+
+**Estimated effort:** ~45 minutes
+
+| #   | ID  | Task                           | Status | Priority | Notes                                                     |
+| --- | --- | ------------------------------ | ------ | -------- | --------------------------------------------------------- |
+| 25  | F29 | Export/import settings         | - [ ]  | Medium   | Backup all localStorage data as JSON file                 |
+| 26  | F30 | Inference history log          | - [ ]  | Medium   | Persist prompt/response pairs to localStorage with search |
+| 27  | F31 | Clear all data / factory reset | - [ ]  | Low      | Single button to clear all localStorage keys + state      |
+
+**Implementation details:**
+
+- **F29:** Add a gear icon in the header that opens a settings popover. Include "Export settings" (downloads `llm-dashboard-settings.json` containing all `llm-*` localStorage keys) and "Import settings" (file upload, validates shape, merges). Keys to export: `llm-theme`, `llm-model-tags`, `llm-auto-load-model`, `llm-prompt-history`, `llm-model-benchmarks`, `llm-model-sort`, `llm-chat-*`, `llm-inference-log`.
+
+- **F30:** After each prompt/chat completion, save `{ model, prompt, response, metrics, timestamp }` to `llm-inference-log` (capped at 100 entries, FIFO). Add a small "History" panel accessible from the header with search and replay.
+
+- **F31:** Add "Reset dashboard" button in settings. Clears all `llm-*` localStorage keys. Confirm with "This will clear all tags, history, and preferences".
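
Two pieces of Phase 6 reduce to small helpers: the FIFO cap from F30 and the `llm-` prefix scan shared by F29 and F31. Both function names and the `ReadableStore` interface are hypothetical; the interface mirrors the `length`, `key`, and `getItem` members of the Web Storage API so that `localStorage` could be passed in directly.

```ts
// Subset of the Web Storage API needed to enumerate keys.
interface ReadableStore {
  readonly length: number;
  key(index: number): string | null;
  getItem(key: string): string | null;
}

const INFERENCE_LOG_CAP = 100;

// FIFO cap: append the new entry, then drop the oldest entries beyond `cap`.
// `cap` is expected to be a positive integer.
function appendCapped<T>(log: T[], entry: T, cap: number = INFERENCE_LOG_CAP): T[] {
  return [...log, entry].slice(-cap);
}

// Collect every key with the given prefix, e.g. as the payload for the export file.
function collectSettings(store: ReadableStore, prefix: string = "llm-"): Record<string, string> {
  const settings: Record<string, string> = {};
  for (let i = 0; i < store.length; i++) {
    const key = store.key(i);
    if (key === null || !key.startsWith(prefix)) continue;
    const value = store.getItem(key);
    if (value !== null) settings[key] = value;
  }
  return settings;
}
```

The factory reset in F31 could reuse the same scan: compute `Object.keys(collectSettings(localStorage))` first, then remove each key, which avoids mutating the store while iterating by index.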
+
+**Acceptance criteria:**
+
+- [ ] Settings can be exported as JSON and re-imported
+- [ ] Inference history persists across page refreshes
+- [ ] Factory reset clears all dashboard state
+- [ ] TypeScript compiles cleanly
+
+---
+
+## Phase 7 — Future Considerations _(Backlog)_
+
+Not planned for immediate implementation. Revisit after Phases 1–6.
+
+| ID  | Feature                        | Complexity | Notes                                                                      |
+| --- | ------------------------------ | ---------- | -------------------------------------------------------------------------- |
+| F17 | WebSocket real-time updates    | High       | Replace 15s polling with push-based updates from Ollama                    |
+| F18 | GPU/Metal utilization chart    | Medium     | macOS `powermetrics` or IOKit for GPU load percentage                      |
+| F19 | Model download queue           | Medium     | Queue multiple pulls, show progress for each                               |
+| F21 | Custom Modelfile editor        | Medium     | Edit and push custom Modelfiles to Ollama                                  |
+| F22 | Benchmark suite                | High       | Run standard prompts across all models, generate comparison table          |
+| F23 | Component decomposition (CQ1b) | High       | Break 1,885-line page.tsx into feature-based modules                       |
+| F32 | Accessibility / ARIA labels    | Medium     | Keyboard nav beyond shortcuts, screen reader support, focus management     |
+| F33 | API latency overlay            | Low        | Show request duration for each panel refresh (network timing)              |
+| F34 | Model disk cleanup wizard      | Medium     | Identify unused models (never loaded, old modified_at), suggest deletion   |
+| F35 | Drag-and-drop model ordering   | Low        | Manual model order with drag handles, persist in localStorage              |
+| F36 | Prompt templates library       | Medium     | Pre-built prompts for common tasks (code review, explain, translate, etc.) |
+| F37 | Response diff viewer           | Medium     | Side-by-side diff of two model responses highlighting differences          |
+| F38 | Multi-model chat               | High       | Chat with multiple models simultaneously, responses interleaved            |

 ---

 ## Summary

-| Phase     | Sprint | Items        | Focus                    | Effort     | Depends On |
-| --------- | ------ | ------------ | ------------------------ | ---------- | ---------- |
-| 1         | 8      | N1, N2, N3   | Pre-load RAM estimates   | ~45 min    | —          |
-| 2         | 9      | N4, N5       | RAM bar + context window | ~60 min    | Phase 1    |
-| 3         | 10     | N6–N10       | Badges + sort + version  | ~45 min    | —          |
-| 4         | 11     | N11–N15      | Runtime metrics + UX     | ~30 min    | —          |
-| **Total** |        | **15 items** |                          | **~3 hrs** |            |
+| Phase     | Sprint | Items               | Focus                          | Effort     | Depends On |
+| --------- | ------ | ------------------- | ------------------------------ | ---------- | ---------- |
+| 1         | 8      | N1–N3, BN1–BN2, BN5 | Pre-load intelligence + fixes  | ~60 min    | —          |
+| 2         | 9      | N4–N5, BN3–BN4      | Rich metadata + persistence    | ~75 min    | Phase 1    |
+| 3         | 10     | N6–N10              | Intelligence badges + sort     | ~60 min    | —          |
+| 4         | 11     | N11–N14             | Runtime metrics + UX           | ~45 min    | Phase 1    |
+| 5         | 12     | F24–F28             | Response quality + interaction | ~90 min    | Phase 3    |
+| 6         | 13     | F29–F31             | Data persistence + export      | ~45 min    | —          |
+| **Total** |        | **31 items**        |                                | **~6 hrs** |            |

-Phases 1 and 3 can run in parallel. Phase 2 depends on Phase 1 (needs RAM estimates for the budget bar). Phase 4 is independent.
+**Dependency graph:**
+
+- Phase 1 → Phase 2 (N4 budget bar needs N1 RAM estimates)
+- Phase 1 → Phase 4 (N14 co-load suggestions need N1 RAM estimates)
+- Phase 3 → Phase 5 (F24 vision upload needs N7 detection, F27 `<think>` collapse needs N6)
+- Phases 1, 3, 6 can run in parallel
+
+**localStorage key registry** (standardized `llm-` prefix):
+
+| Key                    | Type   | Phase | Purpose                   |
+| ---------------------- | ------ | ----- | ------------------------- |
+| `llm-theme`            | string | Done  | Dark/light theme          |
+| `llm-model-tags`       | JSON   | Done  | User tags per model       |
+| `llm-auto-load-model`  | string | Done  | Preferred auto-load model |
+| `llm-prompt-history`   | JSON   | Done  | Last 20 prompts           |
+| `llm-model-benchmarks` | JSON   | P4    | Tok/s per model           |
+| `llm-model-sort`       | string | P3    | Sort preference           |
+| `llm-chat-{model}`     | JSON   | P2    | Chat messages per model   |
+| `llm-inference-log`    | JSON   | P6    | Full inference history    |