# Kimi “2.5” local deployment on macOS (what’s actually possible)

## Executive summary

- **Running the real “Kimi 2.5 / Kimi K2-class” model fully locally on a Mac is not practical**: the official open-weight Kimi K2 deployment guidance targets **multi-node NVIDIA GPU clusters** (vLLM/SGLang/TensorRT-LLM) and assumes CUDA.
- **What _is_ practical on a Mac:** use **Kimi Code CLI** (official) or the **Moonshot API** (official). That’s not “local inference” (weights on your laptop), but it is “local usage” (client runs on your laptop).
- **VPN / proxy:** yes, clients can usually work through VPN/proxy, but you must be able to reach the provider’s API endpoints. If your network blocks access, you’ll need allowlisting or a different route.

## What GitHub shows (official sources)

### 1) Kimi K2 (open-weight model series)

- Official repo: https://github.com/MoonshotAI/Kimi-K2
- The repo includes a deployment guide at `docs/deploy_guidance.md`.
- Key reality check from the guide: **the smallest FP8 128k-seqlen deployment is described as ~16 GPUs (H200/H20-class)** for mainstream setups (vLLM/SGLang). This is fundamentally not macOS-laptop friendly.

### 2) Kimi Code CLI (best option for macOS)

- Official repo: https://github.com/MoonshotAI/kimi-cli
- Docs: https://moonshotai.github.io/kimi-cli/en/guides/getting-started.html
- Kimi Code CLI is an **agent client** (terminal tool) that talks to a remote provider.
- First-run authentication options:
  - `/login` (browser login; auto-configures models)
  - `/setup` (API key flow)

## Can we locally deploy “Kimi 2.5” on this Mac?

### Practically: no (for K2 / K2-class)

- **macOS has no CUDA**, so GPU-first inference stacks referenced for K2 deployment (vLLM / TensorRT-LLM) are not an option.
- Even if you tried CPU inference, the K2-class model scale (MoE, 1T total params / 32B active) is far beyond what is reasonable on a laptop: at 1T parameters, the weights alone would be roughly 500 GB even at 4-bit quantization.

### What’s feasible instead

1. **Use Kimi Code CLI on macOS** (recommended)
2. **Use Moonshot/Kimi APIs from your own scripts** (Python/Node/etc) once your network allows access
3. If you truly need **local weights on Mac**, use a Mac-friendly model/runtime (e.g., MLX/Ollama), but that would be **a different model**, not Kimi 2.5/K2.

## Options: smaller, Mac-runnable open models (local inference)

If your goal is “no network calls at runtime”, pick a **local runtime** + a **small-enough model + quantization**.

## If you can only access GitHub

If your enterprise network only allows `github.com`, that severely limits how you obtain model weights because most model hosting is **not** on GitHub.

What still works:

- **GitHub Releases assets**: some projects publish quantized model files (often `.gguf`) as release assets.
- **Git LFS inside a repo**: occasionally weights are stored in-repo via LFS (still uncommon for large models).

What usually won’t work:

- Downloading weights from external model registries / hosting sites (blocked by policy).

Reality check: GitHub has practical size limits, so **very large models are rarely hosted there**. Your best bet is to use **small models (7B–14B) in quantized form**.

### How to find downloadable models on GitHub

- Use [oss_llm/find_models_on_github.py](oss_llm/find_models_on_github.py) to search GitHub repos and list any release assets that look like model files.
- Prefer assets ending in `.gguf` if you plan to run with `llama.cpp`.

Example:

```sh
python3 oss_llm/find_models_on_github.py --query "gguf qwen 2.5" --limit 20
```

### Recommended enterprise pattern

If you need a specific model but can’t download it directly:

1. Ask security for an **approved internal mirror** / artifact store.
2. Mirror the model files there.
3. Point your local runtime (Ollama/llama.cpp/MLX) at the internal location.
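When model files are mirrored into an internal store, it can help to record a checksum at the source and verify it after the copy. The sketch below is one way to do that with the Python standard library; it is not part of any tool mentioned here, and the function names `sha256_of_file` and `matches_expected` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so multi-gigabyte model files
        # never have to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a digest recorded at the source."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A typical call would be `matches_expected(Path("/path/to/model.gguf"), recorded_digest)`, where `recorded_digest` is whatever value was noted when the file was first mirrored.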
### “No security review” learning path (best effort): train a tiny model yourself

If you want to avoid downloading any third-party model weights at all, the cleanest option is to:

- clone training code from GitHub
- train a small model on a local/public-domain text file

This won’t produce a state-of-the-art assistant, but it’s excellent for learning tokenization, training loops, sampling, and basic eval.

One popular repo for this is **karpathy/nanoGPT**. High-level steps:

```sh
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT

# create a python env (choose your preferred method)
python3 -m venv .venv && source .venv/bin/activate
# install dependencies (nanoGPT has no requirements.txt; this is the list from its README)
pip install torch numpy transformers datasets tiktoken wandb tqdm

# run the built-in Shakespeare example (it downloads a small text file)
python data/shakespeare_char/prepare.py
# train.py takes a config file, not the data directory
python train.py config/train_shakespeare_char.py --device=cpu --compile=False

# sample
python sample.py --out_dir=out-shakespeare-char --device=cpu --compile=False
```

If even small downloads are restricted, replace the dataset step with your own local text file and adjust the dataset script accordingly.

### nanoGPT demo (this repo)

This workspace includes a reproducible nanoGPT workflow (CPU and Apple Silicon MPS) plus sampling demo commands:

- See [oss_llm/testNanoGPT/README.md](oss_llm/testNanoGPT/README.md)

### If you do download GGUF weights from GitHub

It can work (some repos commit a `.gguf` directly), but treat it like any third-party binary:

- Prefer repos with **clear licensing** (repo license + explicit model license/provenance)
- Prefer “original publisher” repos over re-uploads
- Keep models small (e.g., ~100M–3B) for macOS learning

### Recommended runtimes (macOS)

- **Ollama**: easiest “download + chat + local HTTP API” experience.
- **LM Studio**: easy GUI; also exposes a local API server.
- **llama.cpp**: most portable; great for CPU/Metal, quantized GGUF models.
- **MLX** (Apple): best when you want Python-native workflows on Apple Silicon.
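To make the “local HTTP API” point concrete, here is a minimal sketch that asks a running Ollama server which models are installed. It assumes Ollama's default address `127.0.0.1:11434` and its `/api/tags` endpoint, and it assumes the response body has the shape `{"models": [{"name": ...}, ...]}`; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_TAGS_URL = "http://127.0.0.1:11434/api/tags"


def model_names(payload: dict) -> list[str]:
    """Extract model names from a /api/tags style response body."""
    # Assumed shape: {"models": [{"name": "..."}, ...]}; entries without
    # a "name" key are skipped rather than raising.
    return [entry["name"] for entry in payload.get("models", []) if "name" in entry]


def list_local_models(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> list[str]:
    """Ask a running local server which models are installed."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return model_names(json.load(response))
```

`list_local_models()` only works while the server is running. If proxy environment variables are set, keep `localhost` and `127.0.0.1` in `NO_PROXY` so the request is not sent to the proxy.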
### Model size guidance (rule of thumb)

Quantized models are what make laptops viable.

- **8–10B @ 4-bit**: typically comfortable on 16GB unified memory.
- **14B @ 4-bit**: better with 24–32GB unified memory.
- **30B+**: usually needs 64GB+ and will still be slow.

### Good “starter” model families (pick one)

These are widely supported by the runtimes above and have strong general utility:

- **Llama 3.x (8B class)**: strong general chat + coding for the size.
- **Qwen 2.5 (7B/14B class)**: strong multilingual + coding.
- **Mistral 7B class**: fast and solid baseline.
- **Gemma 2 (9B class)**: good general-purpose quality.
- **Phi-3.x (mini/small class)**: very fast and lightweight.

### Suggested picks by Mac memory

- **16GB unified memory**: start with an 8–9B model at 4-bit.
- **32GB unified memory**: 14B at 4-bit is a good sweet spot.
- **64GB unified memory**: 27–34B at 4-bit becomes feasible (still slower).

### Practical setup: Ollama (quickest)

1. Install runtime (choose the install method your enterprise allows; Homebrew is common):

   ```sh
   brew install ollama
   ```

2. Start the local service:

   ```sh
   ollama serve
   ```

3. Pull/run a model (example placeholder):

   ```sh
   ollama run <model>
   ```

4. Use the local API (optional):

   ```sh
   curl http://127.0.0.1:11434/api/tags
   ```

### Practical setup: llama.cpp (most controllable)

If you can obtain a quantized GGUF model file via an approved internal mirror:

```sh
brew install llama.cpp
llama-cli -m /path/to/model.gguf -p "Hello" -n 256
```

### Practical setup: MLX (Python-centric on Apple Silicon)

If your environment allows Python packages and you have an MLX-converted model available internally:

```sh
python3 -m venv .venv && source .venv/bin/activate
pip install mlx mlx-lm
python -m mlx_lm.generate --model /path/to/mlx-model --prompt "Hello" --max-tokens 256
```

### Enterprise note (important)

Because model hosting sites may be blocked in your network category, the usual pattern in enterprise is:

1. Security-approved model list
2. Internal artifact store / mirror for model files
3. Local runtime (Ollama/llama.cpp/MLX) pointing to those internally hosted artifacts

## Steps (macOS): run Kimi Code CLI locally (client-side)

Source: Kimi CLI docs.

1. Install

   ```sh
   # Install via uv (Python package manager)
   uv tool install --python 3.13 kimi-cli
   ```

2. Verify

   ```sh
   kimi --version
   ```

3. Start in a project directory

   ```sh
   cd /Users/sd9235/code/mygh/learning_ai_2nd_brain
   kimi
   ```

4. Authenticate

   - Preferred: run `/login` inside the CLI and complete the browser auth.
   - Alternative: run `/setup` and choose an API platform + API key + model.

5. If models don’t show up

   - Kimi CLI FAQ: verify network access to your configured provider’s API endpoints.

## Steps (NOT macOS): deploy Kimi K2 weights (server-side)

If you have access to Linux + NVIDIA GPUs, use the official K2 deployment guide:

- vLLM (requires CUDA; the guide notes vLLM v0.10.0rc1+)
- SGLang
- TensorRT-LLM

This is the realistic path if you truly need “local” (self-hosted) K2/K2-class inference: **run the model on a GPU box/cluster** and call it from your Mac.

## VPN / proxy: are we able to access through it?

### What you need

You generally need outbound access (through your VPN/proxy) to at least:

- your chosen provider’s API host (varies by provider)
- and, if downloading open weights, the model hosting site you plan to use.
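Outbound access can also be checked from Python, which makes it easy to notice when a request lands on a different host than the one requested (a common sign of a web filter). This is a rough sketch using only the standard library: the function names and result labels are hypothetical, it issues a GET rather than a HEAD request, and a harmless redirect such as `example.com` to `www.example.com` would also be flagged.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def classify(requested_url: str, final_url: str, status: int) -> str:
    """Label a response as reachable, redirected elsewhere, or an HTTP error."""
    if urlsplit(requested_url).hostname != urlsplit(final_url).hostname:
        # Ending up on another host often means something intercepted the request.
        return "redirected-to-other-host"
    return "reachable" if status < 400 else "http-error"


def check(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL (following redirects) and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify(url, response.url, response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return classify(url, err.url, err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, TLS failure, or timeout.
        return "unreachable"
```

Call `check(...)` with the full `https://` URL of your provider's API host, and again with the model hosting site if you plan to download weights. `urllib` honors the `HTTP_PROXY` / `HTTPS_PROXY` environment variables, so the result reflects the route your shell would use.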
### Quick connectivity checks

```sh
# DNS + HTTPS reachability
curl -I https://<provider-api-host>

# If you plan to download weights later
curl -I https://<model-hosting-host>
```

### Configure proxy in a shell (typical)

```sh
# example values; replace with your proxy's host and port
export HTTP_PROXY="http://127.0.0.1:7890"
export HTTPS_PROXY="http://127.0.0.1:7890"
export ALL_PROXY="socks5://127.0.0.1:7890"
export NO_PROXY="localhost,127.0.0.1"
```

### Configure proxy for Git

```sh
# git uses http.proxy for both http:// and https:// remotes; there is no https.proxy setting
git config --global http.proxy "$HTTPS_PROXY"
```

### Configure proxy for Python/pip

```sh
pip config set global.proxy "$HTTPS_PROXY"
```

### Notes about your current network

From within this VS Code environment, requests to some provider/model-hosting sites were redirected to a corporate/web-filter “blockpage” URL.

If you see that on your Mac too, you’ll need one of:

- VPN that routes around the filter
- proxy that’s allowed
- allowlist/exception for those domains

## Recommendation

- If your goal is to _use_ Kimi on this Mac: **install Kimi Code CLI** and make sure your VPN/proxy allows access to your configured provider.
- If your goal is “true local inference”: **host Kimi K2 on a CUDA GPU server** (or use a smaller Mac-native model instead).

## Your Mac (detected)

- macOS: 15.7.3 (24G419)
- CPU arch: arm64
- Machine: MacBook Pro (Mac16,7)
- Chip: Apple M4 Pro (14 cores)
- Memory: 48 GB
- Python: 3.13.10

## Will nanoGPT work on this laptop?

Yes for learning, with a couple of caveats.

What will work well:

- **Small CPU/MPS runs** (toy datasets like Shakespeare, short experiments, sampling).
- With 48GB RAM and Apple Silicon, you have plenty of headroom for nanoGPT-style demos.

What might block you:

- Installing dependencies (notably PyTorch) typically requires access to package indexes that may be blocked in your network.
- In this workspace, PyTorch was successfully installed into the venv and `mps_available` is `True`.
- If you can’t reach package indexes in a different environment, use an internal Python package mirror, or install from a pre-approved wheelhouse.

Practical suggestion:

- Start with CPU (`--device=cpu`) to keep it simple.
- If your PyTorch build supports Apple Metal (MPS), you can later try `--device=mps` for speed.

Quick verification (after installing torch):

```sh
python3 -c "import torch; print(torch.__version__); print('mps', torch.backends.mps.is_available())"
```

## nanoGPT: validated end-to-end in this workspace

This is a minimal, fast run that was verified on this machine.

### 1) Install deps (workspace venv)

```sh
# from the workspace root
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python -m pip install torch numpy transformers datasets tiktoken wandb tqdm
```

### 2) Clone nanoGPT

```sh
git clone https://github.com/karpathy/nanoGPT.git oss_llm/nanoGPT
```

### 3) Prepare dataset

```sh
cd oss_llm/nanoGPT
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python data/shakespeare_char/prepare.py
```

### 4) Short CPU training (writes `out-shakespeare-char/ckpt.pt`)

```sh
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python train.py config/train_shakespeare_char.py \
  --device=cpu --compile=False \
  --eval_interval=10 --eval_iters=10 --log_interval=10 \
  --block_size=64 --batch_size=12 \
  --n_layer=4 --n_head=4 --n_embd=128 \
  --max_iters=60 --lr_decay_iters=60 --dropout=0.0 \
  --always_save_checkpoint=True
```

### 5) Sample

```sh
/Users/sd9235/code/mygh/learning_ai_2nd_brain/.venv/bin/python sample.py \
  --out_dir=out-shakespeare-char --device=cpu --max_new_tokens=200
```

Tip: for speed on Apple Silicon, try `--device=mps` once you’re comfortable.
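If you would rather not hard-code the device, a small helper can choose the `--device` value for you. This is a minimal sketch, not part of nanoGPT: the function name `pick_device` is hypothetical, and it relies only on the `torch.backends.mps.is_available()` check used in the verification command earlier.

```python
def pick_device() -> str:
    """Return "mps" when PyTorch reports Metal support, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    # Older PyTorch builds may not have the mps backend module at all.
    mps = getattr(torch.backends, "mps", None)
    return "mps" if mps is not None and mps.is_available() else "cpu"


if __name__ == "__main__":
    # Prints a flag that can be pasted into the train.py / sample.py commands.
    print(f"--device={pick_device()}")
```

Falling back to `cpu` keeps the commands usable on machines where PyTorch is missing or was built without MPS support.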