# Kimi “2.5” local deployment on macOS (what’s реально possible)

## Executive summary

- **Running the real “Kimi 2.5 / Kimi K2-class” model fully locally on a Mac is not practical**: the official open-weight Kimi K2 deployment guidance targets **multi-node NVIDIA GPU clusters** (vLLM/SGLang/TensorRT-LLM) and assumes CUDA.
- **What _is_ practical on a Mac:** use **Kimi Code CLI** (official) or the **Moonshot API** (official). That’s not “local inference” (weights on your laptop), but it is “local usage” (client runs on your laptop).
- **VPN / proxy:** yes, clients can usually work through VPN/proxy, but you must be able to reach the provider’s API endpoints. If your network blocks access, you’ll need allowlisting or a different route.

## What GitHub shows (official sources)

### 1) Kimi K2 (open-weight model series)

- Official repo: https://github.com/MoonshotAI/Kimi-K2
- The repo includes a deployment guide at `docs/deploy_guidance.md`.
- Key reality check from the guide: **the smallest FP8 128k-seqlen deployment is described as ~16 GPUs (H200/H20-class)** for mainstream setups (vLLM/SGLang). This is fundamentally not macOS-laptop friendly.

### 2) Kimi Code CLI (best option for macOS)

- Official repo: https://github.com/MoonshotAI/kimi-cli
- Docs: https://moonshotai.github.io/kimi-cli/en/guides/getting-started.html
- Kimi Code CLI is an **agent client** (terminal tool) that talks to a remote provider.
- First-run authentication options:
  - `/login` (browser login; auto-configures models)
  - `/setup` (API key flow)

## Can we locally deploy “Kimi 2.5” on this Mac?

### Practically: no (for K2 / K2-class)

- **macOS has no CUDA**, so GPU-first inference stacks referenced for K2 deployment (vLLM / TensorRT-LLM) are not an option.
- Even if you tried CPU inference, the K2-class model scale (MoE, 1T total params / 32B active) is far beyond what is reasonable on a laptop.

### What’s feasible instead

1. **Use Kimi Code CLI on macOS** (recommended)
2. **Use Moonshot/Kimi APIs from your own scripts** (Python/Node/etc) once your network allows access
3. If you truly need **local weights on Mac**, use a Mac-friendly model/runtime (e.g., MLX/Ollama) — but that would be **a different model**, not Kimi 2.5/K2.

## Options: smaller, Mac-runnable open models (local inference)

If your goal is “no network calls at runtime”, pick a **local runtime** + a **small-enough model + quantization**.

## If you can only access GitHub

If your enterprise network only allows `github.com`, that severely limits how you obtain model weights because most model hosting is **not** on GitHub.

What still works:

- **GitHub Releases assets**: some projects publish quantized model files (often `.gguf`) as release assets.
- **Git LFS inside a repo**: occasionally weights are stored in-repo via LFS (still uncommon for large models).

What usually won’t work:

- Downloading weights from external model registries / hosting sites (blocked by policy).

Reality check: GitHub has practical size limits, so **very large models are rarely hosted there**. Your best bet is to use **small models (7B–14B) in quantized form**.

### How to find downloadable models on GitHub

- Use [oss_llm/find_models_on_github.py](oss_llm/find_models_on_github.py) to search GitHub repos and list any release assets that look like model files.
- Prefer assets ending in `.gguf` if you plan to run with `llama.cpp`.

Example:

```sh
python3 oss_llm/find_models_on_github.py --query "gguf qwen 2.5" --limit 20
```

### Recommended enterprise pattern

If you need a specific model but can’t download it directly:

1. Ask security for an **approved internal mirror** / artifact store.
2. Mirror the model files there.
3. Point your local runtime (Ollama/llama.cpp/MLX) at the internal location.

### “No security review” learning path (best effort): train a tiny model yourself

If you want to avoid downloading any third-party model weights at all, the cleanest option is to:

- clone training code from GitHub
- train a small model on a local/public-domain text file

This won’t produce a state-of-the-art assistant, but it’s excellent for learning tokenization, training loops, sampling, and basic eval.

One popular repo for this is **karpathy/nanoGPT**.

High-level steps:

```sh
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# create a python env (choose your preferred method)
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# run the built-in Shakespeare example (it downloads a small text file)
python data/shakespeare_char/prepare.py
python train.py data/shakespeare_char --device=cpu --compile=False

# sample
python sample.py --out_dir=out-shakespeare-char --device=cpu --compile=False
```

If even small downloads are restricted, replace the dataset step with your own local text file and adjust the dataset script accordingly.

### nanoGPT demo (this repo)

This workspace includes a reproducible nanoGPT workflow (CPU and Apple Silicon MPS) plus sampling demo commands:

- See [oss_llm/testNanoGPT/README.md](oss_llm/testNanoGPT/README.md)

### If you do download GGUF weights from GitHub

It can work (some repos commit a `.gguf` directly), but treat it like any third-party binary:

- Prefer repos with **clear licensing** (repo license + explicit model license/provenance)
- Prefer “original publisher” repos over re-uploads
- Keep models small (e.g., ~100M–3B) for macOS learning

### Recommended runtimes (macOS)

- **Ollama**: easiest “download + chat + local HTTP API” experience.
- **LM Studio**: easy GUI; also exposes a local API server.
- **llama.cpp**: most portable; great for CPU/Metal, quantized GGUF models.
- **MLX** (Apple): best when you want Python-native workflows on Apple Silicon.

### Model size guidance (rule of thumb)

Quantized models are what make laptops viable.

- **8–10B @ 4-bit**: typically comfortable on 16GB unified memory.
- **14B @ 4-bit**: better with 24–32GB unified memory.
- **30B+**: usually needs 64GB+ and will still be slow.

### Good “starter” model families (pick one)

These are widely supported by the runtimes above and have strong general utility:

- **Llama 3.x (8B class)**: strong general chat + coding for the size.
- **Qwen 2.5 (7B/14B class)**: strong multilingual + coding.
- **Mistral 7B class**: fast and solid baseline.
- **Gemma 2 (9B class)**: good general-purpose quality.
- **Phi-3.x (mini/small class)**: very fast and lightweight.

### Suggested picks by Mac memory

- **16GB unified memory**: start with an 8–9B model at 4-bit.
- **32GB unified memory**: 14B at 4-bit is a good sweet spot.
- **64GB unified memory**: 27–34B at 4-bit becomes feasible (still slower).

### Practical setup: Ollama (quickest)

1. Install runtime (choose the install method your enterprise allows; Homebrew is common):

```sh
brew install ollama
```

2. Start the local service:

```sh
ollama serve
```

3. Pull/run a model (example placeholder):

```sh
ollama run <model-name>
```

4. Use the local API (optional):

```sh
curl http://127.0.0.1:11434/api/tags
```

### Practical setup: llama.cpp (most controllable)

If you can obtain a quantized GGUF model file via an approved internal mirror:

```sh
brew install llama.cpp
llama-cli -m /path/to/model.gguf -p "Hello" -n 256
```

### Practical setup: MLX (Python-centric on Apple Silicon)

If your environment allows Python packages and you have an MLX-converted model available internally:

```sh
python3 -m venv .venv && source .venv/bin/activate
pip install mlx mlx-lm
python -m mlx_lm.generate --model /path/to/mlx-model --prompt "Hello" --max-tokens 256
```

### Enterprise note (important)

Because model hosting sites may be blocked in your network category, the usual pattern in enterprise is:

1. Security-approved model list
2. Internal artifact store / mirror for model files
3. Local runtime (Ollama/llama.cpp/MLX) pointing to those internally hosted artifacts

## Steps (macOS): run Kimi Code CLI locally (client-side)

Source: Kimi CLI docs.

1. Install

```sh
# Install via uv (Python package manager)
uv tool install --python 3.13 kimi-cli
```

2. Verify

```sh
kimi --version
```

3. Start in a project directory

```sh
cd /Users/sd9235/code/mygh/learning_ai_2nd_brain
kimi
```

4. Authenticate

- Preferred:
  - Run `/login` inside the CLI and complete the browser auth.
- Alternative:
  - Run `/setup` and choose an API platform + API key + model.

5. If models don’t show up

- Kimi CLI FAQ: verify network access to your configured provider’s API endpoints.

## Steps (NOT macOS): deploy Kimi K2 weights (server-side)

If you have access to Linux + NVIDIA GPUs, use the official K2 deployment guide:

- vLLM (requires CUDA; the guide notes vLLM v0.10.0rc1+)
- SGLang
- TensorRT-LLM

This is the realistic path if you truly need “local” (self-hosted) K2/K2-class inference: **run the model on a GPU box/cluster** and call it from your Mac.

## VPN / proxy: are we able to access through it?

### What you need

You generally need outbound access (through your VPN/proxy) to at least:

- your chosen provider’s API host (varies by provider)
- and, if downloading open weights, the model hosting site you plan to use.

### Quick connectivity checks

```sh
# DNS + HTTPS reachability
curl -I https://<YOUR_PROVIDER_API_HOST>

# If you plan to download weights later
curl -I https://<YOUR_MODEL_HOSTING_SITE>
```

### Configure proxy in a shell (typical)

```sh
export HTTP_PROXY="http://127.0.0.1:7890"
export HTTPS_PROXY="http://127.0.0.1:7890"
export ALL_PROXY="socks5://127.0.0.1:7890"
export NO_PROXY="localhost,127.0.0.1"
```

### Configure proxy for Git

```sh
git config --global http.proxy "$HTTPS_PROXY"
git config --global https.proxy "$HTTPS_PROXY"
```

### Configure proxy for Python/pip

```sh
pip config set global.proxy "$HTTPS_PROXY"
```

### Notes about your current network

From within this VS Code environment, requests to some provider/model-hosting sites were redirected to a corporate/web-filter “blockpage” URL. If you see that on your Mac too, you’ll need one of:

- VPN that routes around the filter
- proxy that’s allowed
- allowlist/exception for those domains

## Recommendation

- If your goal is to _use_ Kimi on this Mac: **install Kimi Code CLI** and make sure your VPN/proxy allows access to your configured provider.
- If your goal is “true local inference”: **host Kimi K2 on a CUDA GPU server** (or use a smaller Mac-native model instead).

## Your Mac (detected)

- macOS: 15.7.3 (24G419)
- CPU arch: arm64
- Machine: MacBook Pro (Mac16,7)
- Chip: Apple M4 Pro (14 cores)
- Memory: 48 GB
- Python: 3.13.10

## Will nanoGPT work on this laptop?

Yes for learning, with a couple of caveats.

What will work well:

- **Small CPU/MPS runs** (toy datasets like Shakespeare, short experiments, sampling).
- With 48GB RAM and Apple Silicon, you have plenty of headroom for nanoGPT-style demos.

What might block you:

- Installing dependencies (notably PyTorch) typically requires access to package indexes that may be blocked in your network.
  - In this workspace, PyTorch was successfully installed into the venv and `mps_available` is `True`.
  - If you can’t reach package indexes in a different environment, use an internal Python package mirror, or install from a pre-approved wheelhouse.

Practical suggestion:

- Start with CPU (`--device=cpu`) to keep it simple.
- If your PyTorch build supports Apple Metal (MPS), you can later try `--device=mps` for speed.

Quick verification (after installing torch):

```sh
python3 -c "import torch; print(torch.__version__); print('mps', torch.backends.mps.is_available())"
```

## nanoGPT: validated end-to-end in this workspace

This is a minimal, fast run that was verified on this machine.

### 1) Install deps (workspace venv)

```sh
# from the workspace root
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python -m pip install torch numpy transformers datasets tiktoken wandb tqdm
```

### 2) Clone nanoGPT

```sh
git clone https://github.com/karpathy/nanoGPT.git oss_llm/nanoGPT
```

### 3) Prepare dataset

```sh
cd oss_llm/nanoGPT
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python data/shakespeare_char/prepare.py
```

### 4) Short CPU training (writes `out-shakespeare-char/ckpt.pt`)

```sh
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python train.py config/train_shakespeare_char.py \
  --device=cpu --compile=False \
  --eval_interval=10 --eval_iters=10 --log_interval=10 \
  --block_size=64 --batch_size=12 \
  --n_layer=4 --n_head=4 --n_embd=128 \
  --max_iters=60 --lr_decay_iters=60 --dropout=0.0 \
  --always_save_checkpoint=True
```

### 5) Sample

```sh
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python sample.py \
  --out_dir=out-shakespeare-char --device=cpu --max_new_tokens=200
```

Tip: for speed on Apple Silicon, try `--device=mps` once you’re comfortable.